# facebookresearch/vjepa2

PyTorch code and models for V-JEPA 2 self-supervised learning from video.
## News

- 🆕 **[2026-03-16]** :fire: V-JEPA 2.1 is released :fire: A new family of models trained with a novel recipe that learns high-quality and temporally consistent dense features!
- **[2025-06-25]** V-JEPA 2 is released.

## V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Meta FAIR

Mahmoud Assran\*, Adrien Bardes\*, David Fan\*, Quentin Garrido\*, Russell Howes\*, Mojtaba Komeili\*, Matthew Muckley\*, Ammar Rizvi\*, Claire Roberts\*, Koustuv Sinha\*, Artem Zholus\*, Sergio Arnaud\*, Abha Gejji\*, Ada Martin\*, Francois Robert Hogan\*, Daniel Dugas\*, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier\*, Yann LeCun\*, Michael Rabbat\*, Nicolas Ballas\*

\*Core Team

Official PyTorch codebase for V-JEPA 2, V-JEPA 2-AC, and V-JEPA 2.1.

V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
## V-JEPA 2.1 Pre-training

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

V-JEPA 2.1 improves the training recipe to focus on learning high-quality and temporally consistent dense features, as highlighted by PCA visualizations.

The V-JEPA 2.1 approach leverages: (1) **Dense Predictive Loss**, a masking-based self-supervision objective where all tokens (both visible/context and masked tokens) contribute to the self-supervised training loss; (2) **Deep Self-Supervision**, which applies the self-supervised loss at multiple intermediate representations of the encoder; and (3) **Multi-Modal Tokenizers** for images and videos. We also show that our approach benefits from (4) **Model and data scaling**.

V-JEPA 2.1 performance across dense and global prediction tasks:

## V-JEPA 2 Pre-training

**(Top)** The encoder and predictor are pre-trained through self-supervised learning from video using a masked latent feature prediction objective, leveraging abundant natural videos to bootstrap physical world understanding and prediction. **(Bottom)** Performance of V-JEPA 2 on downstream understanding and prediction tasks.

| Benchmark | V-JEPA 2 | Previous Best |
| --- | --- | --- |
| EK100 | 39.7% | 27.6% (PlausiVL) |
| SSv2 (Probe) | 77.3% | 69.7% (InternVideo2-1B) |
| Diving48 (Probe) | 90.2% | 86.4% (InternVideo2-1B) |
| MVP (Video QA) | 44.5% | 39.9% (InternVL-2.5) |
| TempCompass (Video QA) | 76.9% | 75.3% (Tarsier 2) |

## V-JEPA 2-AC Post-training

**(Top)** After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. **(Bottom)** Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.
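The dense predictive loss described above can be contrasted with a masked-only objective in a toy sketch. This is not the repository's implementation — the arrays, shapes, and loss form below are illustrative stand-ins — it only shows the difference between averaging the per-token error over masked tokens versus over *all* tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 16 tokens with 8-dim latent features (stand-ins for a video
# tokenizer's patch embeddings; real models use far more tokens and dims).
num_tokens, dim = 16, 8
targets = rng.normal(size=(num_tokens, dim))                       # target features
predictions = targets + 0.1 * rng.normal(size=(num_tokens, dim))   # predicted features

# Random mask: True = token was hidden from the context encoder.
mask = rng.random(num_tokens) < 0.75

# Per-token squared error between predicted and target features.
per_token_err = ((predictions - targets) ** 2).mean(axis=1)

# A masked-only loss averages error over masked tokens only; a dense
# predictive loss averages over all tokens, visible and masked alike.
masked_only_loss = per_token_err[mask].mean()
dense_loss = per_token_err.mean()

print(f"masked-only loss: {masked_only_loss:.4f}")
print(f"dense loss (all tokens): {dense_loss:.4f}")
```

Deep self-supervision would additionally apply such a loss at several intermediate encoder layers rather than only at the output.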
| Method | Reach | Grasp (Cup) | Grasp (Box) | Pick-and-Place (Cup) | Pick-and-Place (Box) |
| --- | --- | --- | --- | --- | --- |
| Octo | 100% | 10% | 0% | 10% | 10% |
| Cosmos | 80% | 0% | 20% | 0% | 0% |
| V-JEPA 2-AC | 100% | 60% | 20% | 80% | 50% |

## Models

### V-JEPA 2 and V-JEPA 2.1 on HuggingFace

See our HuggingFace collection for V-JEPA 2.

### V-JEPA 2 Pretrained Checkpoints

| Model | #Parameters | Resolution | Download Link | Pretraining Config |
| --- | --- | --- | --- | --- |
| ViT-L/16 | 300M | 256 | checkpoint | configs |
| ViT-H/16 | 600M | 256 | checkpoint | configs |
| ViT-g/16 | 1B | 256 | checkpoint | configs |
| ViT-g/16 384 | 1B | 384 | checkpoint | configs |

### V-JEPA 2.1 Pretrained Checkpoints

| Model | #Parameters | Resolution | Download Link | Pretraining Config |
| --- | --- | --- | --- | --- |
| ViT-B/16 | 80M | 384 | checkpoint | configs |
| ViT-L/16 | 300M | 384 | checkpoint | configs |
| ViT-g/16 | 1B | 384 | checkpoint | configs |
| ViT-G/16 | 2B | 384 | checkpoint | configs |

### Pretrained backbones (via PyTorch Hub)

Please install PyTorch, timm, and einops locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.

### Pretrained checkpoints on HuggingFace

You can also use our pretrained checkpoints on HuggingFace for V-JEPA 2.

## Evaluation Attentive Probes

We share the trained attentive probes for two of our visual understanding evals (Something-Something v2 and Diving48) and the action anticipation eval EPIC-KITCHENS-100.

| Model | Eval | Checkpoint | Training Config | Inference Config | Result |
| --- | --- | --- | --- | --- | --- |
| ViT-L/16 | SSv2 | checkpoint | config | config | 73.7% |
| ViT-L/16 | Diving48 | checkpoint | config | config | 89.0% |
| ViT-L/16 | EK100 | checkpoint | config | config | 32.7 R@5 |
| ViT-g/16 384 | SSv2 | checkpoint | config | config | 77.3% |
| ViT-g/16 384 | Diving48 | checkpoint | config | config | 90.2% |
| ViT-g/16 384 | EK100 | checkpoint | config | config | 39.7 R@5 |

## V-JEPA 2-AC

Our action-conditioned checkpoint was trained from the ViT-g encoder.

| Model | Download Link | Training Config |
| --- | --- | --- |
| ViT-g/16 | checkpoint | config |

### Pretrained action-conditioned backbone (via PyTorch Hub)

Please install PyTorch, timm, and einops locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.
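The model names in the checkpoint tables encode the patch size: in `ViT-X/16`, the "/16" means each frame is cut into non-overlapping 16×16-pixel patches, so the token count per frame follows directly from the input resolution. A small sketch of that arithmetic (the helper function below is our own, not from this codebase, and it ignores the temporal tokenization video models also apply):

```python
# Tokens per frame for a ViT-X/16 backbone: the "/16" is the patch size,
# so a square frame is split into (resolution // 16) ** 2 patches.
def patches_per_frame(resolution: int, patch_size: int = 16) -> int:
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by the patch size")
    side = resolution // patch_size
    return side * side

print(patches_per_frame(256))  # 16 x 16 grid -> 256 patches
print(patches_per_frame(384))  # 24 x 24 grid -> 576 patches
```

This is why moving the same ViT-g/16 backbone from 256 to 384 resolution, as in the tables above, more than doubles the tokens (and attention cost) per frame without changing the parameter count.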
See `energy_landscape_example.ipynb` for an example notebook computing the energy landscape of the pretrained action-conditioned backbone using a robot trajectory collected in our lab. To run this notebook, you'll also need to install Jupyter and SciPy in your conda environment.

## Getting Started

### Setup

**Note to macOS users:** V-JEPA 2 relies on , which does not support macOS (and, unfortunately, is also no longer under development). To run the V-JEPA 2 code on macOS, you will need a different implementation. We do not make specific recommendations, although some users have reported the use of (see PR 1) or (see PR 31). We…
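To give a feel for what an "energy landscape" means here, the toy sketch below mimics planning from a goal in latent space: an energy scores each candidate action by the distance between the predicted next latent and the goal latent, and planning picks the lowest-energy action. Everything in it is a hypothetical stand-in — the linear dynamics, the random-shooting planner, and all names (`predict`, `energy`) are ours, not the notebook's or the repository's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the real system an encoder maps frames to
# latents and a predictor rolls latents forward conditioned on an action.
# Here both are faked with a fixed linear map so the example runs anywhere.
dim, action_dim = 4, 2
A = np.eye(dim)
B = rng.normal(size=(dim, action_dim))

def predict(z, action):
    """Toy action-conditioned dynamics in latent space."""
    return A @ z + B @ action

def energy(z, action, z_goal):
    """Energy = squared distance between predicted latent and goal latent."""
    diff = predict(z, action) - z_goal
    return float(diff @ diff)

z0 = rng.normal(size=dim)
true_action = rng.normal(size=action_dim)
z_goal = predict(z0, true_action)  # a goal reachable by `true_action`

# Planning by random shooting: sample candidates, keep the lowest-energy one.
candidates = rng.normal(size=(512, action_dim))
energies = [energy(z0, a, z_goal) for a in candidates]
chosen = candidates[int(np.argmin(energies))]

print("lowest candidate energy:", min(energies))
print("energy of the true action:", energy(z0, true_action, z_goal))
```

Plotting `energy` over a grid of actions gives a landscape like the one the notebook computes, only with a real encoder and predictor in place of the linear toy.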