Fourier ActionNet Dataset

Fourier, Shanghai Jiao Tong University

ActionNet includes bimanual manipulation of diverse objects in randomized tabletop scenes.

Introduction

The pursuit of generalist robots capable of human-like ability and adaptability hinges on the availability of rich, diverse, and scalable training data across different embodiments. While many efforts have been made to build large and diverse robotic datasets, their predominant focus on gripper-based systems limits their applicability to true humanoid platforms. At Fourier, we believe in the transformative potential of dexterous hands on a humanoid form factor, the embodiment we envision for achieving general-purpose embodied intelligence. To advance research toward this vision, we proudly introduce the Fourier ActionNet Dataset, a pioneering dataset for dexterous bimanual manipulation.

With over 30K teleoperated trajectories, or around 140 hours of interaction data, ActionNet is one of the largest datasets of humanoid bimanual manipulation with dexterous hands. The dataset focuses on bimanual manipulation of diverse objects in randomized tabletop scenes and is collected with a teleoperation system to obtain high-quality human demonstrations. The dataset is shared under the CC BY-NC-SA 4.0 license, while the accompanying training pipeline and evaluation tools are open-sourced under the Apache 2.0 license.

Data Collection Setup

Teleoperation setup

The dataset is gathered using a teleoperation system that replicates the humanoid's first-person perspective. Because the operator sees exactly what the robot sees across diverse tasks, objects, and environments, the captured demonstrations closely resemble natural human actions, making them more effective for training.

Environment & Hardware

This dataset captures humanoid performance in tabletop tasks such as pick-and-place, pouring, and insertion across diverse environments with varying conditions, objects, and scene layouts. It incorporates multiple humanoid robots and dexterous hands to enhance variability and adaptability. Vision data is collected using the OAK-D W 97 camera, whose wide field of view matches the humanoid's perspective. The dataset includes recordings from Fourier GR1-T1, GR1-T2, and GR2 humanoids, along with two types of Fourier dexterous hands with six and twelve degrees of freedom. By integrating diverse hardware configurations, this dataset supports robust learning and adaptability in real-world scenarios.

Teleoperation

The teleoperation system utilizes Vision Pro as the primary control device, enabling VR-based operation to ensure data collection from the humanoid's first-person perspective. In this system, a robot operator manipulates the humanoid using real-time visual feedback from its viewpoint, aligning their perspectives for more natural and accurate data collection. Additionally, the operator's actions, including hand movements, are directly transmitted to the humanoid, ensuring that the performed skills closely resemble real human execution, enhancing the humanoid's learning and adaptability.

Training

To embrace the exciting future of humanoids, we welcome everyone interested in humanoid development to join us in advancing this field. As part of this effort, we are providing our training pipeline, which includes the ACT, DP, and iDP3 algorithms. Our training pipeline and data format follow the LeRobotDatasetV2 structure for a convenient and well-organized framework. For full details, please check out our fork, which provides the scripts needed to visualize our data and convert it into the LeRobotDatasetV2 format. We also provide an implementation of iDP3 in the LeRobot framework.
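
To give a feel for the data format, below is a minimal sketch (not the exact pipeline code) of loading a converted ActionNet dataset through LeRobot's LeRobotDataset class and iterating over batches; the dataset id and observation keys are placeholders, and the import path follows the upstream LeRobot layout, so refer to our fork for the exact names produced by the conversion scripts.

import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Placeholder dataset id -- substitute the ActionNet dataset converted
# with the scripts in our fork.
dataset = LeRobotDataset("fourier/actionnet-example")

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,
)

for batch in loader:
    # Each sample bundles camera observations, proprioceptive state, and
    # the teleoperated action; exact keys depend on the conversion config.
    images = batch["observation.images.head"]  # assumed camera key
    state = batch["observation.state"]
    actions = batch["action"]
    # ... ACT / DP / iDP3 forward pass and loss go here ...
    break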

Data Collection & Model Evaluation

Our dataset is collected using a VR-based teleoperation system. The system is available in our official Teleoperation repository. This repository provides resources for data collection, as well as model deployment scripts that use the same codebase. As a result, policies trained with our dataset and training pipeline can be seamlessly deployed within the same system, minimizing discrepancies between training and real-world execution.

Data Distribution

The dataset provides diverse tasks, objects, durations, and skills:

  1. Task Distribution: The chart shows the distribution of different task types. Noisy manipulation is the most frequent, followed by pick-and-place, cabinet interaction, precise placement, and pouring. The main focus is on tabletop bimanual manipulation tasks, with a variety of actions to ensure robust skill learning.
  2. Item Distribution: A diverse range of objects, including tools, household items, and food, is used during collection. Our dataset features common household and office objects such as cups, clamps, and measuring tapes, ensuring robust object-handling skills.
  3. Duration Distribution: The histogram shows the distribution of task durations. With most trajectories around 15 s long, the dataset contains a balanced mix of short and long tasks, allowing models to learn both brief and prolonged interactions efficiently (a sketch for recomputing these durations follows this list).
  4. Skill Distribution: Skills range from reaching and placing to holding and more. The dataset captures both fundamental and complex actions, covering a wide spectrum of robotic skills.
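
For reference, the duration statistics above can be recomputed directly from the episode index of a LeRobotDatasetV2-format copy of the data. The sketch below assumes the standard LeRobotDataset attributes (fps, num_episodes, episode_data_index) and uses a placeholder dataset id.

import numpy as np
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("fourier/actionnet-example")  # placeholder id

durations = []
for ep in range(dataset.num_episodes):
    # episode_data_index maps each episode to its [from, to) frame range
    start = dataset.episode_data_index["from"][ep].item()
    end = dataset.episode_data_index["to"][ep].item()
    durations.append((end - start) / dataset.fps)  # length in seconds

print(f"median trajectory length: {np.median(durations):.1f}s")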

Data Annotation

We annotated the data automatically with a Vision-Language Model (Qwen2.5-VL-7B). Specifically, 26 frames are sampled from each recorded video and passed to the model, which generates an annotation for each video stream from a prompt like: "What verbal instruction would you give the robot to perform the task in the video?"

The generated annotations were then manually reviewed and verified. Through this process, all 30,000 data samples are annotated with concise prompts averaging 53 words each, allowing the dataset to be used for imitation learning, VLA, and world model training.
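
A simplified sketch of this annotation step is shown below, assuming Qwen2.5-VL-7B-Instruct is run through Hugging Face transformers and qwen-vl-utils; the frame paths and the uniform-sampling helper are illustrative, while the 26-frame budget and the prompt follow the description above.

import numpy as np
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def sample_frames(frame_paths, num_frames=26):
    """Uniformly sample num_frames frames from one recorded video stream."""
    idx = np.linspace(0, len(frame_paths) - 1, num_frames).astype(int)
    return [frame_paths[i] for i in idx]

# Illustrative frame paths for a single episode.
frames = sample_frames([f"episode_0001/frame_{i:04d}.jpg" for i in range(450)])

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": frames},
        {"type": "text",
         "text": "What verbal instruction would you give the robot to perform the task in the video?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
annotation = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                    skip_special_tokens=True)[0]
print(annotation)  # reviewed manually before being attached to the episode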

Model Evaluation

To assess the quality of our dataset, we trained imitation learning algorithms on it across multiple tasks. Specifically, we evaluated the performance of three popular models: DP, ACT, and iDP3.

Notably, all three models demonstrated strong performance on the Fourier GR1-T1, GR1-T2, and GR2 humanoids:

Model Comparison

Performance of DP, ACT, and iDP3 across various tasks.

[Rollout videos: DP, ACT, and iDP3 on two representative tasks]

Humanoids Comparison

Same model applied across different humanoids: GR1-T1, GR1-T2, and GR2.

[Rollout videos: ACT deployed on GR1-T1, GR1-T2, and GR2]

BibTeX

@article{fourier2025actionnet,
  author    = {{Fourier ActionNet Team} and Yao Mu},
  title     = {ActionNet: A dataset for dexterous bimanual manipulation},
  year      = {2025},
}