ViT for Medical Imaging

A vision transformer (ViT) that detects pathology in chest X-rays, reaching radiologist-level F1 on three of fourteen findings.

AUC (CheXpert): 0.912

F1 (Pneumonia): 0.86

GPU hours: 1,200

The problem

CNNs underperform on chest X-ray pathologies that depend on long-range spatial context, and labeled medical data is scarce.

To address both, the model is first pre-trained as a self-supervised masked autoencoder (MAE) on 500k unlabeled chest X-rays, then fine-tuned with supervision using a class-balanced focal loss.
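The two ingredients named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's training code. It assumes the class-balanced weight follows the common "effective number of samples" form, (1 - beta) / (1 - beta^n), that the focal loss is applied per finding as a binary loss, and that MAE masking hides a random subset of patches. The function names, `beta=0.999`, `gamma=2.0`, and `mask_ratio=0.75` are illustrative defaults, not values taken from this project.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Pick the patch indices an MAE-style encoder would NOT see.

    mask_ratio is an assumed default; the project's value is not stated.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def class_balanced_weight(n_samples, beta=0.999):
    """Inverse effective number of samples: (1 - beta) / (1 - beta**n).

    Rare findings (small n_samples) receive larger weights.
    """
    return (1.0 - beta) / (1.0 - beta ** n_samples)


def focal_loss(prob, label, weight=1.0, gamma=2.0):
    """Binary focal loss for a single finding.

    prob is the predicted probability of the positive class.
    The (1 - p_t)**gamma factor down-weights easy, well-classified examples.
    """
    p_t = prob if label == 1 else 1.0 - prob
    p_t = min(max(p_t, 1e-7), 1.0)  # clamp to avoid log(0)
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical usage: 196 patches assumes a 224x224 image cut into 16x16 patches.
masked = random_patch_mask(196, mask_ratio=0.75, seed=0)

# Hypothetical positive counts for a common and a rare finding.
w_common = class_balanced_weight(50_000)
w_rare = class_balanced_weight(500)
loss_rare = focal_loss(prob=0.3, label=1, weight=w_rare)
```

With these definitions, a missed rare finding contributes more to the loss than a missed common one, and confident correct predictions contribute very little. That is the usual motivation for combining the two terms on imbalanced data.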

The model beat the ResNet-50 baseline by 6.2 AUC points on CheXpert and reached radiologist-level F1 on three of the fourteen findings.
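For readers less familiar with the two metrics, the sketch below shows how AUC and F1 can be computed for a single finding. It is illustrative only: the function names and toy data are hypothetical, and how the project aggregates scores across findings is not stated here. It also assumes that one "AUC point" means 0.01 AUC, so a 6.2-point gain would correspond to 0.062.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. O(P * N), fine for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def f1_score(labels, preds):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data for one finding (made up for illustration).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_f1 = f1_score(toy_labels, toy_preds)
```

AUC is threshold-free and measures ranking quality, while F1 depends on the decision threshold (0.5 in the toy example). This is one reason a model can lead on AUC overall yet match radiologist F1 on only some findings.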