
Mido Assran, Adrien Bardes, David P. Fan | ArXiv.org | (2025)

Key Takeaways

Sample Definition And Size

The study pre-trains V‑JEPA 2 on a dataset comprising over 1 million hours of internet video and images. For robotic planning, the action‑conditioned variant V‑JEPA 2‑AC is post‑trained using less than 62 hours of unlabeled robot videos from the Droid dataset. No additional robot data from deployment environments was used. ([arxiv.org](https://arxiv.org/abs/2506.09985))

Study Type

This is a self‑supervised learning study involving large‑scale pre‑training of a joint‑embedding predictive architecture (V‑JEPA 2) on video data, followed by post‑training of an action‑conditioned world model (V‑JEPA 2‑AC) for zero‑shot robotic planning. ([arxiv.org](https://arxiv.org/abs/2506.09985))

Conflicts Of Interest

No conflicts of interest are declared in the arXiv metadata. ([arxiv.org](https://arxiv.org/abs/2506.09985))

Results Summary

Key findings include: V‑JEPA 2 achieves 77.3% top‑1 accuracy on Something‑Something v2 (motion understanding) and 39.7 recall‑at‑5 on Epic‑Kitchens‑100 (human action anticipation). When aligned with a large language model (~8B parameters), it attains 84.0 on PerceptionTest and 76.9 on TempCompass (video question answering). V‑JEPA 2‑AC enables zero‑shot robotic pick‑and‑place planning on Franka arms without environment‑specific data or task‑specific training. ([arxiv.org](https://arxiv.org/abs/2506.09985))

Abstract

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

Referenced In

🫀 New foundation model: EchoJEPA, trained on 18M heart ultrasounds, uses latent prediction instead of pixel reconstruction.

⸂⸂⸜(രᴗര๑)⸝⸃⸃ Hey everyone!! 👋 Always interested in seeing new ways technology might contribute to the healthcare space. Anyways, this study here recently caught my attention, and I’m curious to hear your thoughts. 

Reading an echocardiogram is part art, part science.

The images are noisy, patients move, and the quality of the captured video varies wildly. Ultrasound recordings have "speckle" 👉 that grainy, flickering pattern dominating every frame. It's not like simple film grain or JPEG artefacts; it's more like random interference that contains zero anatomical information.

Yet for years, we've trained AI to reconstruct it pixel by pixel, essentially forcing models to become experts at guessing static. We did this by porting natural-video techniques (VideoMAE, MAE, contrastive learning) into medical imaging, assuming that data scale would eventually bridge the domain gap.

Here, they implement a slowly evolving "teacher" network, an exponential moving average (EMA), that naturally learns to ignore flickering speckle while locking onto temporally stable structures: chamber geometry, valve motion, wall thickening. The model achieves state-of-the-art performance on left ventricular ejection fraction (LVEF) estimation and right ventricular systolic pressure (RVSP) prediction.
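For anyone unfamiliar with the EMA "teacher" idea, here is a minimal sketch of the usual update rule, assuming the common form teacher = m * teacher + (1 - m) * student. The function name `ema_update`, the momentum value and the toy weights are purely illustrative; none of this is taken from the EchoJEPA or V-JEPA 2 code.

```python
def ema_update(teacher, student, momentum=0.99):
    """Blend student weights into the teacher.

    Assumed rule (hypothetical helper, not from the paper's code):
        new_teacher = momentum * teacher + (1 - momentum) * student
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy example: the student weights jitter from step to step,
# while the teacher only moves a small fraction toward them each time.
teacher = [0.0, 0.0]
student_steps = [[1.0, -1.0], [1.2, -0.8], [0.8, -1.2], [1.0, -1.0]]

for student in student_steps:
    teacher = ema_update(teacher, student, momentum=0.9)

print(teacher)
```

With a momentum close to 1, the teacher changes slowly, so short-lived fluctuations in the student have little effect on it. That is the "slowly evolving" behaviour described above.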

👉 The Result? 

  • 78% accuracy with just 1% of labels. 

  • Only 2.3% degradation under simulated "difficult patient" conditions (acoustic shadowing, depth attenuation).

  • Zero-shot pediatric performance. The model beats fine-tuned pediatric models without ever seeing a child’s heart.

EchoJEPA adapts Meta's V-JEPA 2 architecture with two critical domain-specific modifications, to temporal resolution and augmentation strategy, and beats masked autoencoding by 27% on ejection fraction estimation.

With EchoJEPA providing automated echocardiography analysis, expert-level cardiac assessment becomes accessible in resource-limited settings. This is especially true for people whose echocardiograms may deviate from the standard, such as those with obesity or lung disease, and even children.
