May 2026

CLIP-Style Multimodal Retrieval

A from-scratch CLIP-style model trained on 4M image-text pairs for fast cross-modal search.

- 71.4% R@1 (test)
- 190M parameters
- 38h train time

The problem

Public CLIP weights are trained on broad web data, so they underperform on specialized domains; domain-specific retrieval needs custom training.

The approach

A two-tower contrastive architecture trained with mixed precision across 8×A100 GPUs, using a learned temperature (logit scale) parameter.
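The core of a two-tower contrastive setup is a symmetric InfoNCE objective: paired image and text embeddings should score higher than all mismatched pairs in the batch, with a learned temperature scaling the logits. The function below is a minimal NumPy sketch of that loss (the function name and signature are illustrative, not from this project):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, logit_scale):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) arrays of image/text embeddings.
    logit_scale: scalar temperature multiplier (in CLIP-style training this
    is the exp of a trainable parameter).
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = logit_scale * img @ txt.T  # (B, B); diagonal = matched pairs

    def xent_diag(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image->text and text->image directions.
    return (xent_diag(logits) + xent_diag(logits.T)) / 2
```

As the temperature grows, matched pairs dominate the softmax and the loss approaches zero; the learned logit scale lets the model tune this sharpness during training.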

Results

Reached 71.4% R@1 on a held-out test set, only 3 points behind a 6× larger baseline.
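R@1 (recall at 1) measures how often the top-ranked retrieval is the correct pair. A minimal sketch of the metric for image-to-text retrieval, assuming row i of each matrix is a matched pair (helper name is illustrative):

```python
import numpy as np

def recall_at_1(img_emb, txt_emb):
    """Fraction of images whose nearest text (by cosine similarity)
    is their paired caption. Row i of each (B, D) array is a matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                 # (B, B) cosine similarities
    preds = sims.argmax(axis=1)        # index of best-matching text per image
    return float(np.mean(preds == np.arange(len(preds))))
```

The same function computes text-to-image recall by swapping the arguments; reported numbers often average the two directions.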