ByteDance-Seed/Chain-of-Action
Python
Description: Official implementation of Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation. Accepted in NeurIPS 2025.
Language: Python
License: Apache-2.0
Stars: 104
Forks: 8
Open issues: 2
Created: 2025-07-03T06:19:29Z
Pushed: 2025-12-13T06:38:12Z
Default branch: main
Fork: no
Archived: no
README:
Quick start
Set up environment
conda create -n coa python=3.9 -y
conda activate coa
bash scripts/init.sh
source ~/.bashrc
The init script installs the dependencies and the RLBench environment; see [init.sh](scripts/init.sh) for details.
One-click Evaluation
The script will automatically download the required pretrained snapshot and the necessary evaluation dataset for the specified task.
bash scripts/eval.sh task=push_button
Experimental Results
Evaluation over 60 RLBench tasks
Why do we use 60 tasks for the main evaluation? Although the 18 RLBench tasks have been widely adopted as a benchmark since their introduction in “Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation”, they are primarily used to evaluate 3D-based hierarchical policies that depend heavily on high-precision 3D inputs and motion planners. Many of these tasks are extremely challenging for RGB-only visuomotor policies, often leading to uniformly low success rates and therefore limited discriminative power.
Evaluation over 18 RLBench tasks
To enable convenient comparison with 3D-based hierarchical methods such as RVT-2, we also report results on the RLBench-18 benchmark. Please see the appendix for more details.
> Somehow, RLBench (the most popular 3D-policy benchmark) has gained significant traction in VLA benchmarking, yet VLAs remain far from matching 3D SOTA methods such as 3DDA. For various reasons, most VLA policies tend to avoid comparing against all relevant 3D baselines.
>
> Source: A Practitioner’s Guide to VLA Evaluation
As you can see, there is still substantial room for RGB-only visuomotor policies to close the performance gap.
Train & Eval
Download RLBench datasets
Run the following command to download all data.
python scripts/download_dataset.py
Detailed usage
python scripts/download_dataset.py --task reach_target --train-episodes 100 --eval-episodes 25
- --task: Specify the task name to download (e.g., reach_target, stack_wine). Only one task will be downloaded.
- --train-episodes: Number of training episodes per task (default: 100, total 100).
- --eval-episodes: Number of evaluation episodes per task (default: 25, total 50).
- To download the recommended 10-task subset, add the --subset flag. To download all tasks, do not specify --task or --subset.
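When several individual tasks are needed, the per-task invocation can be scripted. The following is a minimal sketch, not part of the repository: `build_download_cmd` is a hypothetical helper, the flag names are the ones listed above, and treating --task and --subset as mutually exclusive is an assumption the README does not state.

```python
import shlex


def build_download_cmd(task=None, train_episodes=100, eval_episodes=25, subset=False):
    """Build the argv list for scripts/download_dataset.py (hypothetical helper)."""
    # Assumption: --task and --subset are not meant to be combined.
    if task and subset:
        raise ValueError("pass either task or subset, not both")
    cmd = ["python", "scripts/download_dataset.py"]
    if task:
        cmd += ["--task", task]
    if subset:
        cmd.append("--subset")
    cmd += ["--train-episodes", str(train_episodes)]
    cmd += ["--eval-episodes", str(eval_episodes)]
    return cmd


if __name__ == "__main__":
    for name in ["reach_target", "stack_wine"]:
        # Print the command; to actually run it from the repository root, use
        # subprocess.run(build_download_cmd(task=name), check=True)
        print(shlex.join(build_download_cmd(task=name)))
```

Building the command as a list rather than a single string avoids shell quoting issues if the list is later passed to `subprocess.run`.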
Evaluation
python scripts/eval.py task=task_name snapshot=path_to_snapshot
Training
python scripts/train.py task=task_name
For detailed parameter settings, please refer to [launch.yaml](src/cfgs/launch.yaml).
Key parameters include:
- num_train_steps: total training steps (default: 20000)
- batch_size: training batch size (default: 128)
- task: task name (must be specified)
- demos: number of demonstrations per task (default: 100)
- eval_every_steps: evaluation interval in steps (default: 10000)
- vis_every_steps: visualization interval in steps (default: 2000)
- save_every_steps: model checkpoint interval in steps (default: 10000)
- num_eval_episodes: number of episodes per evaluation (default: 25)
You can customize these parameters by editing src/cfgs/launch.yaml directly, or override them via command line arguments (e.g., python scripts/train.py task=push_button batch_size=64).
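To make the override behaviour concrete, here is a small illustrative sketch of how `key=value` arguments layer on top of defaults. It is not the repository's code: the actual merging is handled by the configuration framework against src/cfgs/launch.yaml, and `DEFAULTS` and `apply_overrides` are hypothetical names that only mirror the parameters listed above.

```python
# Defaults copied from the parameter list above; `task` has no default.
DEFAULTS = {
    "num_train_steps": 20000,
    "batch_size": 128,
    "task": None,
    "demos": 100,
    "eval_every_steps": 10000,
    "vis_every_steps": 2000,
    "save_every_steps": 10000,
    "num_eval_episodes": 25,
}


def apply_overrides(defaults, overrides):
    """Return a new config dict with key=value overrides applied (illustrative only)."""
    cfg = dict(defaults)  # copy so the defaults stay untouched
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or key not in cfg:
            raise KeyError(f"unknown or malformed override: {item!r}")
        # Keep integer parameters as integers; everything else stays a string.
        cfg[key] = int(raw) if isinstance(cfg[key], int) else raw
    if cfg["task"] is None:
        raise ValueError("task must be specified")
    return cfg


if __name__ == "__main__":
    cfg = apply_overrides(DEFAULTS, ["task=push_button", "batch_size=64"])
    for key, value in cfg.items():
        print(f"{key}: {value}")
```

Parameters that are not overridden keep their defaults, which is why a command line only needs to name `task` plus whatever differs from launch.yaml.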
Note on Open-Source Implementation
The open-source implementation slightly differs from the original version reported in the paper. We reconstructed the entire training and evaluation pipeline for better clarity and reproducibility.
During this process, a few settings were adjusted:
- Both the latent loss and action loss were changed from L2 to L1.
- The multi-token prediction head was reduced from 5 tokens to 2 tokens.
These updates generally lead to improved success rates across most tasks. As a result, your observed performance (e.g., 100% on “push button”) may exceed the numbers reported in the paper.
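For readers unfamiliar with the distinction, the sketch below contrasts the two loss types on a toy prediction. It is a plain-Python illustration under assumed mean reduction, not the repository's loss code, and the example vectors are made up. It shows one well-known property (L1 grows linearly with the error, L2 quadratically) without implying that this is the authors' reason for the change.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy data (made up): three small errors and one large one.
pred = [0.0, 0.0, 0.0, 0.0]
target = [0.1, 0.1, 0.1, 2.0]

l1 = l1_loss(pred, target)
l2 = l2_loss(pred, target)

# Fraction of each loss contributed by the single large error.
outlier_share_l1 = abs(pred[-1] - target[-1]) / (l1 * len(pred))
outlier_share_l2 = (pred[-1] - target[-1]) ** 2 / (l2 * len(pred))

if __name__ == "__main__":
    print(f"L1 loss: {l1:.4f}, outlier share: {outlier_share_l1:.3f}")
    print(f"L2 loss: {l2:.4f}, outlier share: {outlier_share_l2:.3f}")
```

On this toy input the single large error accounts for a larger fraction of the L2 loss than of the L1 loss, so L2 training is pulled more strongly toward fitting that one element.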
Updated Results (Open-Source Version)
For reference, below are the task-level success rates of the open-source implementation compared with those reported in the paper. With the modified training configuration, the open-source version achieves a higher success rate on 8 of the 10 tasks (Stack Wine and Reach Target are lower) and a higher average.
| Task | Paper version | Open-Source version |
|------|--------------|----------------------|
| Stack Wine | 0.80 | 0.76 |
| Turn Tap | 0.56 | 0.72 |
| Open Drawer | 0.88 | 0.96 |
| Push Button | 0.76 | 1.00 |
| Pick Up Cup | 0.80 | 0.92 |
| Take Lid | 0.80 | 0.92 |
| Press Switch | 0.44 | 0.52 |
| Reach Target | 0.84 | 0.72 |
| Sweep Dust | 0.92 | 0.96 |
| Open Box | 0.76 | 0.96 |
| Average | 0.756 | 0.844 |
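The averages in the last row can be re-derived from the per-task numbers. The snippet below is a small check written for this purpose (the variable names are ours, and the values are copied from the table); it also lists which tasks moved in each direction.

```python
# (paper, open_source) success rates copied from the table above.
RESULTS = {
    "Stack Wine": (0.80, 0.76),
    "Turn Tap": (0.56, 0.72),
    "Open Drawer": (0.88, 0.96),
    "Push Button": (0.76, 1.00),
    "Pick Up Cup": (0.80, 0.92),
    "Take Lid": (0.80, 0.92),
    "Press Switch": (0.44, 0.52),
    "Reach Target": (0.84, 0.72),
    "Sweep Dust": (0.92, 0.96),
    "Open Box": (0.76, 0.96),
}

paper_avg = sum(p for p, _ in RESULTS.values()) / len(RESULTS)
open_avg = sum(o for _, o in RESULTS.values()) / len(RESULTS)

# Tasks where the open-source version is higher or lower than the paper.
improved = [task for task, (p, o) in RESULTS.items() if o > p]
regressed = [task for task, (p, o) in RESULTS.items() if o < p]

if __name__ == "__main__":
    print(f"paper average:       {paper_avg:.3f}")
    print(f"open-source average: {open_avg:.3f}")
    print("improved:", ", ".join(improved))
    print("regressed:", ", ".join(regressed))
```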
Directory Structure
- scripts/: Training, evaluation, data/snapshot downloading scripts
- src/: Main source code directory, including the following subfolders and files:
  - cfgs/: Configuration files
  - dataset/: Dataset loading and preprocessing
  - envs/: Simulation environment
  - methods/: Algorithm implementations
    - coa/: Chain-of-Action
    - act/: ACT (To-do)
    - dp/: Diffusion policy (To-do)
    - base.py, utils.py: Common base classes and utilities
  - utils.py, logger.py, video.py: General utilities and main control scripts
  - workspace.py: Training workflow
- exp_local/: Local experiment results
  - checkpoints/: Model weights
  - eval_videos/: Evaluation videos
  - train.log: Training log
  - .hydra/: Configuration snapshots and Hydra management files
- README.md: Project documentation
Acknowledgement
This repository is built upon the robobase framework. The reproduced results of ACT and Diffusion Policy (DP) are based on the implementations provided in that repository.
Citation
@inproceedings{zhang2025chainofaction,
author = {Zhang, Wenbo and Hu, Tianrun and Zhang, Hanbo and Qiao, Yanyuan and Qin, Yuchu and Li, Yang and Liu, Jiajun and Kong, Tao and Liu, Lingqiao and Ma, Xiao},
title = {Chain-of-Action: Trajectory…
Notability
5.0/10. New repo with moderate traction