
Alif Munim, Adibvafa Fallahpour, Teodora Szasz | ArXiv.org | (2026)

Abstract

Foundation models for echocardiography often struggle to disentangle anatomical signal from the stochastic speckle and acquisition artifacts inherent to ultrasound. We present EchoJEPA, a foundation model trained on 18 million echocardiograms across 300K patients, representing the largest pretraining corpus for this modality to date. By leveraging a latent predictive objective, EchoJEPA learns robust anatomical representations that ignore speckle noise. We validate this using a novel multi-view probing framework with frozen backbones, where EchoJEPA outperforms leading baselines by approximately 20% in left ventricular ejection fraction (LVEF) estimation and 17% in right ventricular systolic pressure (RVSP) estimation. The model also exhibits remarkable sample efficiency, reaching 79% view classification accuracy with only 1% of labeled data versus 42% for the best baseline trained on 100%. Crucially, EchoJEPA demonstrates superior generalization, degrading by only 2% under physics-informed acoustic perturbations compared to 17% for competitors. Most remarkably, its zero-shot performance on pediatric patients surpasses fully fine-tuned baselines, establishing latent prediction as a superior paradigm for robust, generalizable medical AI.

Tags

Sample Definition And Size

The study pretrained EchoJEPA on 18 million echocardiograms from approximately 300,000 patients, representing the largest pretraining corpus for echocardiography to date ([arxiv.org](https://arxiv.org/abs/2602.02603?utm_source=openai)).

Study Type

This is a foundation model development study employing self-supervised learning with a latent predictive objective, evaluated via a novel multi-view probing framework using frozen backbones. It is not a clinical trial but a methodological AI model development and evaluation study ([arxiv.org](https://arxiv.org/abs/2602.02603?utm_source=openai)).
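As a rough illustration of what probing with a frozen backbone means, the sketch below trains only a small linear head on top of fixed embeddings. It is not the paper's framework: `frozen_encoder`, `pool_views`, the mean-pooling fusion rule, and the toy clips and labels are all hypothetical stand-ins chosen for this example.

```python
def frozen_encoder(clip):
    """Stand-in for a pretrained backbone; nothing in here is ever trained."""
    return [sum(clip) / len(clip), max(clip) - min(clip)]


def pool_views(view_embeddings):
    """Average embeddings across the views of one study (assumed fusion rule)."""
    n = len(view_embeddings)
    return [sum(dim) / n for dim in zip(*view_embeddings)]


def predict(w, b, x):
    """Linear head: weighted sum of the pooled embedding plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(features, targets, lr=0.1, epochs=2000):
    """Fit only the head (w, b) by per-sample gradient steps on squared error."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            err = predict(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Toy data: three studies, two views each; clips are short lists of numbers.
studies = [
    [[0.0, 0.2], [0.2, 0.4]],
    [[0.4, 0.8], [0.6, 0.6]],
    [[0.9, 1.1], [0.8, 1.2]],
]
targets = [1.4, 2.2, 3.0]  # made-up regression labels, not real LVEF values

# Embeddings are computed once; the encoder receives no gradient updates.
features = [pool_views([frozen_encoder(v) for v in study]) for study in studies]
w, b = train_linear_probe(features, targets)
mse = sum((predict(w, b, x) - y) ** 2 for x, y in zip(features, targets)) / len(targets)
```

Because the encoder is never updated, any difference in probe accuracy between two backbones reflects the quality of their pretrained representations rather than task-specific fine-tuning.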

Conflicts Of Interest

No conflicts of interest or funding disclosures are provided in the arXiv metadata or abstract. The paper acknowledges support from the Simons Foundation and member institutions, but no competing interests are declared ([arxiv.org](https://arxiv.org/abs/2602.02603)).

Results Summary

Key findings include: approximately 20% reduction in error for left ventricular ejection fraction (LVEF) estimation and 17% reduction for right ventricular systolic pressure (RVSP) estimation compared to leading baselines; view classification accuracy of 79% using only 1% labeled data versus 42% for the best baseline trained on 100%; only ~2% performance degradation under acoustic perturbations versus ~17% for competitors; and superior zero-shot performance on pediatric patients, outperforming fully fine-tuned baselines ([arxiv.org](https://arxiv.org/abs/2602.02603)).

Referenced In

🫀New foundation model: EchoJEPA is trained on 18M heart ultrasounds and uses latent prediction instead of pixel reconstruction.

⸂⸂⸜(രᴗര๑)⸝⸃⸃ Hey everyone!! 👋 Always interested in seeing new ways technology might contribute to the healthcare space. Anyways, this study here recently caught my attention, and I’m curious to hear your thoughts. 

Reading an echocardiogram is part art, part science.

The images are noisy, patients move, and video quality varies wildly. Ultrasound recordings have "speckle" 👉 that grainy, flickering pattern dominating every frame. It's not like simple film grain or JPEG artefacts; it's more like random interference that carries almost no anatomical information.

Yet for years, we've ported natural-video techniques (VideoMAE, MAE, contrastive learning) into medical imaging, assuming that data scale would eventually bridge the domain gap. With the reconstruction-based methods, that means training AI to rebuild speckle pixel by pixel, essentially forcing models to become experts at guessing static.

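Here, the authors use a latent predictive objective instead: rather than reconstructing raw pixels, the model predicts the embeddings produced by a slowly-evolving "teacher" network (an exponential moving average, or EMA, of the trained network's weights). That way it naturally learns to ignore flickering speckle while locking onto temporally stable structures: chamber geometry, valve motion, wall thickening. The result is state-of-the-art performance on left ventricular ejection fraction (LVEF) estimation and right ventricular systolic pressure (RVSP) prediction.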
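To make the EMA idea concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the momentum of 0.9, the L1 distance, and the toy one-weight "network" are all assumptions for illustration, since the post does not give the paper's actual hyperparameters.

```python
def ema_update(teacher, student, momentum=0.99):
    """Blend teacher weights toward the student: t <- m*t + (1-m)*s."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def latent_l1_loss(predicted, target):
    """Mean absolute error between predicted and teacher embeddings."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Toy run: the student weight jitters from step to step (a stand-in for
# noise-driven updates), while the teacher only follows the slow trend.
teacher = [0.0]
student_steps = [[1.0], [-1.0], [1.0], [-1.0]]
for student in student_steps:
    teacher = ema_update(teacher, student, momentum=0.9)

# The loss compares embeddings, never pixels, so there is no incentive
# to reproduce frame-to-frame speckle.
example_loss = latent_l1_loss([1.0, 2.0], [1.0, 4.0])
```

In this toy run the student weight flips between +1 and -1 while the teacher stays close to 0, which is the smoothing effect described above.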

👉 The Result? 

  • 79% view classification accuracy with just 1% of the labels.

  • Only 2.3% degradation under simulated "difficult patient" conditions (acoustic shadowing, depth attenuation).

  • Zero-shot pediatric performance: the model beats fully fine-tuned baselines on pediatric patients, without ever seeing a child’s heart.
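For readers wondering what the simulated "difficult patient" conditions in the list above might look like, here is a hedged toy sketch. It is not the paper's perturbation pipeline: the exponential decay model, the rectangular shadow, and every parameter value (`alpha`, `strength`, the column range) are assumptions picked for illustration.

```python
import math


def depth_attenuation(frame, alpha=0.1):
    """Dim each row by exp(-alpha * row_index); deeper rows get darker."""
    return [[px * math.exp(-alpha * r) for px in row] for r, row in enumerate(frame)]


def acoustic_shadow(frame, col_start, col_end, start_row, strength=0.8):
    """Darken columns [col_start, col_end) from start_row downward."""
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for r in range(start_row, len(out)):
        for c in range(col_start, col_end):
            out[r][c] *= 1.0 - strength
    return out


# A uniform 4x4 "frame" (row 0 is nearest the probe), attenuated with depth
# and then shadowed in its two middle columns from row 2 down.
frame = [[1.0] * 4 for _ in range(4)]
perturbed = acoustic_shadow(depth_attenuation(frame, alpha=0.5), 1, 3, 2)
```

A robustness check in this spirit would compare a model's error on `frame`-like inputs against its error on the `perturbed` versions and report the difference as the degradation.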

EchoJEPA adapts Meta's V-JEPA2 architecture with two critical domain-specific modifications, one to temporal resolution and one to augmentation strategy, and beats masked autoencoding by 27% on ejection fraction estimation.

By automating echocardiography analysis, EchoJEPA could broaden access to expert-level cardiac assessment in resource-limited settings. This is especially true for people whose echocardiograms may deviate from the standard, like those with obesity or lung disease, and even children.
