
physical-superintelligence-lab / Psi0

Welcome to Psi-Zero, a humanoid VLA model working towards universal humanoid intelligence.

View on GitHub · 475 stars · 11 forks · 1 issue

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing physical-superintelligence-lab/Psi0 in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.
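The on-demand loading idea can be illustrated with a toy sketch. This is not RepoMind's actual implementation; the class and paths below are made up purely to show the behavior: whole files enter the context only when an analysis first touches them, rather than being pre-chunked into an index.

```python
from pathlib import Path
import tempfile

class LazyRepoContext:
    """Toy model of on-demand context loading: a whole source file is
    read into memory only the first time an analysis asks for it."""
    def __init__(self, repo_root: Path):
        self.repo_root = repo_root
        self.loaded: dict[str, str] = {}  # rel_path -> full file contents

    def get(self, rel_path: str) -> str:
        if rel_path not in self.loaded:   # lazy: load once, as a whole file
            self.loaded[rel_path] = (self.repo_root / rel_path).read_text()
        return self.loaded[rel_path]

# Usage on a throwaway "repo":
root = Path(tempfile.mkdtemp())
(root / "main.py").write_text("print('hello')\n")
ctx = LazyRepoContext(root)
print(len(ctx.loaded))     # 0 -> nothing loaded yet
print(ctx.get("main.py"))  # file contents, loaded on demand
print(len(ctx.loaded))     # 1
```

Keeping files whole avoids the fragmentation a chunk-based RAG index introduces, at the cost of deferring all I/O to analysis time.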

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/physical-superintelligence-lab/Psi0)

Repository Overview (README excerpt)


Ψ₀: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Contributors: Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, Yue Wang

$\Psi_0$ is an open vision-language-action (VLA) model for dexterous humanoid loco-manipulation. The model first learns task semantics and visual representations from large-scale human egocentric videos, and is then post-trained on a smaller amount of real-world teleoperated robot data to learn the general dynamics of the embodiment. The resulting foundation model can acquire new long-horizon dexterous loco-manipulation skills by fine-tuning on as few as 80 trajectories. ***Our key finding is that what matters is scaling the right data in the right way.***

At the top, the $\Psi_0$ model consists of two end-to-end trained components: a vision-language backbone (System-2) and a multimodal diffusion transformer (System-1) action expert. The backbone is based on Qwen's Qwen3-VL-2B-Instruct, which extracts vision-language features from observations and instructions. These features condition a flow-based multimodal diffusion transformer inspired by Stable Diffusion 3. The action expert (≈500M parameters) predicts future whole-body action chunks, enabling efficient fusion of visual, linguistic, and action representations. At the lowest level (System-0), an RL-based tracking controller executes the predicted lower-body action commands, ensuring stable and precise physical control.
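The three-level stack described above can be sketched in toy form. Everything here is illustrative: the shapes, the hash-based stand-in for the Qwen3-VL feature extractor, and the fixed linear "flow" standing in for the ~500M-parameter diffusion transformer are all assumptions, not the real components.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_features(image, instruction, dim=64):
    """Stand-in for the System-2 backbone: map (image, instruction)
    to a conditioning vector. The real model is Qwen3-VL-2B-Instruct."""
    h = (abs(hash(instruction)) % 1000) / 1000.0
    return np.full(dim, h) + image.mean()

def flow_step(chunk, cond, dt=0.1):
    """One Euler step of a toy conditional flow. The real System-1
    predicts the velocity field with a multimodal diffusion transformer;
    a fixed pull toward a conditioning-derived target stands in for it."""
    target = np.tanh(cond[: chunk.shape[-1]])
    return chunk + dt * (target - chunk)

def predict_action_chunk(image, instruction, horizon=16, act_dim=32, steps=10):
    """Denoise a whole-body action chunk from noise, conditioned on
    vision-language features."""
    cond = vlm_features(image, instruction)
    chunk = rng.standard_normal((horizon, act_dim))  # start from noise
    for _ in range(steps):
        chunk = flow_step(chunk, cond)
    return chunk  # lower-body rows would go to the System-0 RL tracker

chunk = predict_action_chunk(np.zeros((8, 8, 3)), "pick up the cup")
print(chunk.shape)  # (16, 32)
```

The design point this mirrors is that System-2 runs once per observation while System-1 iterates cheaply over the action chunk, and only the lowest level touches the hardware control loop.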
Table of Contents

• Finetune Ψ₀ on Unitree G1 Humanoid Robot: Installation · Data Collection · Fine-Tuning · Open-Loop Evaluation · Deployment
• Baselines: GR00T N1.6 · OpenPi π0.5 · InternVLA-M1 · H-RDT · EgoVLA · Diffusion Policy · ACT
• Simulation: Install SIMPLE · Data Generation · Fine-Tuning · Evaluation in SIMPLE
• Reproduce Ψ₀: Pre-Training and Post-Training
• Checkpoints
• Troubleshooting
• Citation

Finetune Ψ₀ on Unitree G1 Humanoid Robot

Installation

Clone the project and change directory to the project root. We use uv to manage Python dependencies; install it if it is not already installed, then set up the $\Psi_0$ environment.

> ℹ️ We manage the $\Psi_0$ environment and all the baselines through uv, and they all share the same code. See Environment Management for more details.

Test the installation: a version number should be displayed. Also verify that the shared stack is importable.

Data Collection

> 📂 We open-sourced all 9 real-world tasks. You can download the data directly and jump to Fine-Tuning.

> 🔥 We first release our internal test data collection pipeline, which uses the Apple Vision Pro to teleoperate a Unitree G1 humanoid robot with two Dex3-1 hands. See the detailed teleoperation guide: Real-World Teleoperation Guide.

Pre-Processing: Convert Raw Data to LeRobot Format

After collection you should observe a similar folder structure. Then:

• Edit the task description file with the required format.
• Run the conversion script.
• Calculate stats.
• Create $\Psi_0$-format stats (currently just a copy of the computed stats).

Now the data is ready for fine-tuning $\Psi_0$.

> ✈️ If the training environment is already configured, launch training directly.

Fine-Tuning

> ✔️ This assumes the data has already been collected and processed; we can now fine-tune the $\Psi_0$ model.

> There is a known issue when loading our real data; apply this fix first.

> 📝 Here we illustrate using the pre-collected data from the Hugging Face psi-data repository.

Set up the environment variables as documented; they will be loaded automatically in Python.
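The "create $\Psi_0$-format stats" step above is described as simply a copy of the computed dataset stats. A hedged sketch of that idea follows; the `meta/stats.json` layout and the `psi0_stats.json` filename are assumptions, not the repo's actual file layout.

```python
# Hedged sketch of the stats-copy step; paths/filenames are assumed.
import json
import shutil
import tempfile
from pathlib import Path

def make_psi0_stats(dataset_dir: Path) -> Path:
    src = dataset_dir / "meta" / "stats.json"       # assumed LeRobot stats path
    dst = dataset_dir / "meta" / "psi0_stats.json"  # assumed Psi_0 filename
    shutil.copyfile(src, dst)                       # "simply a copy for now"
    return dst

# Usage on a throwaway dataset directory:
root = Path(tempfile.mkdtemp()) / "my_task"
(root / "meta").mkdir(parents=True)
(root / "meta" / "stats.json").write_text(json.dumps({"action": {"mean": 0.0}}))
out = make_psi0_stats(root)
print(out.name)  # psi0_stats.json
```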
Download the collected real-world data and extract it.

> 👀 If you want to visualize the episodes, refer to Data Visualization in the examples.

Launch the training script.

> 🖥️ You can always change which GPUs are used.

> ⚠️ Please maintain a reasonable global batch size = device batch size × number of GPUs × gradient accumulation steps. We use a global batch size of 128 throughout all real-world and simulation experiments.

Open-Loop Evaluation

Load the training dataset and run model inference to see how well the model fits the training data.

Deployment

• Serve $\Psi_0$ (RTC mode)
• Start the $\Psi_0$ client (RTC mode)

For the detailed real-world deployment environment setup, please also refer to the dedicated documentation: Real-World Teleoperation Guide.

Baselines

GR00T
• Install the env · training · serving a checkpoint · open-loop eval on a trained checkpoint using ground truth

InternVLA-M1
• Install the env · training · serving a checkpoint

Simulation

We use SIMPLE to benchmark $\Psi_0$ and all the baselines.

> 📢 SIMPLE is an easy-to-use humanoid benchmarking simulator built on the MuJoCo physics engine with Isaac Sim rendering.

Install SIMPLE: [Coming soon]

Data Generation

> 📂 We also provide 5 pre-collected whole-body humanoid loco-manipulation tasks on Hugging Face psi-data. If you want to use the existing simulation data, jump to Fine-Tuning.

Motion-Planning-Based Data Generation: [Coming soon]

Teleoperation in Simulator: [Coming soon]

Fine-Tuning

Download the SIMPLE task data and extract it.

> 💡 Don't forget to do this first, before following the commands below.

> 👀 If you want to visualize the episodes, refer to Data Visualization in the examples.

Start training.

> Please set up the environment variables if you have not done so yet.

Training will create a run directory under the project root. If your GPU has limited VRAM, you can trade some speed for lower optimizer-step memory usage.
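The global-batch-size rule above can be checked with a few lines. This is a minimal sketch assuming the stated formula (global = device batch × GPUs × gradient accumulation); the function name is illustrative, not a flag of the training script.

```python
# Keep the global batch size at 128, as used throughout the experiments,
# by solving for the gradient-accumulation steps. Names are illustrative.
GLOBAL_BATCH = 128

def grad_accum_steps(device_batch: int, num_gpus: int) -> int:
    per_optimizer_step = device_batch * num_gpus
    assert GLOBAL_BATCH % per_optimizer_step == 0, "pick sizes that divide 128"
    return GLOBAL_BATCH // per_optimizer_step

print(grad_accum_steps(device_batch=8, num_gpus=4))   # 4
print(grad_accum_steps(device_batch=16, num_gpus=8))  # 1
```

Shrinking the device batch (e.g., on low-VRAM GPUs) and raising the accumulation steps keeps the effective batch, and thus the training recipe, unchanged.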
Evaluation in SIMPLE

Serve $\Psi_0$, then run the open-loop evaluation (offline): examples/simple/openloop_eval.ipynb

Run the Evaluation Client

> If the server is started on a remote machine, set up an ssh port forward first.

> Once the port forward is in place, open a new terminal to test whether the server is up.

Launch the eval client through Docker. The policy rollout videos will be saved in the output folder.

> Evaluating a single episode can take 6–10 minutes because SIMPLE us…
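The "test if the server is up" step can be sketched as a small health check to run after the port forward. The URL, port, and `/health` path are assumptions about the serving setup, not documented endpoints of the $\Psi_0$ server.

```python
# Hedged sketch: poll the forwarded endpoint before launching the eval
# client. Endpoint and path are assumptions about the serving setup.
import urllib.error
import urllib.request

def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

print(server_is_up("http://localhost:8000/health"))
```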