Many SSL methods bear a resemblance to value-based RL: they learn features for an image by predicting targets generated from nearby views, e.g., from a different crop or color augmentation. We explore a method that directly casts pre-training on unlabeled image data as an RL problem. Learning in this way resembles crop-consistency SSL, but offers a simple lever for using captions or curated image data to shape feature learning towards "rewards" of interest. Our experiments demonstrate improved representations when training on video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.
We define an RL problem over unlabeled images using a Markov chain perspective on image augmentations. An agent receives a view of an image $x$ and takes actions by applying an image transformation that changes the view (e.g., zooming out, panning left, rotating the image, cropping to a subview), aiming to find views that maximize the likelihood $p(\ell \mid x)$ of some specific semantic "annotation" of interest $\ell$ (e.g., find the gentleman in the green suit, or a kite, or a boy playing).
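A minimal sketch of this transition structure (helper names here are illustrative, not from our codebase): the "action" connecting two crops is simply the second crop's bounding box expressed in the coordinate frame of the first.

```python
import random

def sample_crop_bbox(img_w, img_h, min_frac=0.3):
    """Sample a random crop bounding box (x0, y0, x1, y1) in pixel coordinates."""
    w = int(img_w * random.uniform(min_frac, 1.0))
    h = int(img_h * random.uniform(min_frac, 1.0))
    x0 = random.randint(0, img_w - w)
    y0 = random.randint(0, img_h - h)
    return (x0, y0, x0 + w, y0 + h)

def relative_action(bb1, bb2):
    """Encode bb2 in the coordinate frame of bb1: the pan/zoom action
    a = bb_{1->2} that transforms view x1 into view x2."""
    x0, y0, x1, y1 = bb1
    w, h = x1 - x0, y1 - y0
    u0, v0, u1, v1 = bb2
    return ((u0 - x0) / w, (v0 - y0) / h, (u1 - x0) / w, (v1 - y0) / h)

# A transition in the augmentation MDP: state x1, action bb_{1->2}, next state x2.
bb1 = sample_crop_bbox(640, 480)
bb2 = sample_crop_bbox(640, 480)
a_12 = relative_action(bb1, bb2)
```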
As in other SSL methods, take two random crops of an image: $x_1$ by cropping to bounding box $\mathbf{bb}_1$, and similarly $x_2$ from $\mathbf{bb}_2$. We interpret $(x=x_1, a=\mathbf{bb}_{1\to2}, x'=x_2)$ as a transition in our environment: applying a panning transformation to the view $x_1$ to create $x_2$. Learning a value function corresponds to, for any $\ell$, using the model's outputs at one crop $x_2$ to generate a target prediction for the other crop $x_1$: $$\min D\Big(Q_{\text{AB}}(\ell \mid x_1, a=\mathbf{bb}_{1\to2}),\; (1-\gamma)\,p(\ell \mid x_2) + \gamma \max_{a'} Q^{\text{target}}(\ell \mid x_2, a')\Big)$$
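A minimal sketch of this objective, taking $D$ to be cross-entropy and approximating the $\max_{a'}$ over a set of sampled candidate actions. All names are illustrative: `target_net` stands in for a slowly-updated copy of the model (e.g., an EMA), `p_l_x2` for the base annotation distribution $p(\ell \mid x_2)$, and the $\gamma$ value is arbitrary.

```python
import torch
import torch.nn.functional as F

def ab_loss(q_net, target_net, x1, x2, a_12, candidate_actions, p_l_x2, gamma=0.9):
    """One bootstrapped update: fit Q_AB(l | x1, a=bb_{1->2}) to the target
    (1 - gamma) * p(l | x2) + gamma * max_{a'} Q_target(l | x2, a')."""
    with torch.no_grad():
        # Evaluate the slow-moving target network at x2 for each candidate action a'.
        q_next = torch.stack([F.softmax(target_net(x2, a), dim=-1)
                              for a in candidate_actions])   # [A, B, L]
        bootstrap = q_next.max(dim=0).values                 # max over a': [B, L]
        target = (1.0 - gamma) * p_l_x2 + gamma * bootstrap  # mix in the base "reward"
    # Cross-entropy between the bootstrapped target and Q_AB(l | x1, a).
    log_q = F.log_softmax(q_net(x1, a_12), dim=-1)
    return -(target * log_q).sum(dim=-1).mean()
```

Note that at $\gamma = 0$ this reduces to directly distilling $p(\ell \mid x_2)$ across crops; larger $\gamma$ leans more heavily on the bootstrapped targets.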
In our paper, we evaluate annotation bootstrapping on a number of datasets where crop-consistency SSL methods like SimCLR and DINO tend to suffer. Across EpicKitchens, COCO, and CC12M, we find that bootstrapping annotations improves over several weakly supervised and self-supervised base losses ($\text{AB}_{CLIP}$ over CLIP, $\text{AB}_{SimCLR}$ over SimCLR, $\text{AB}_{DINO}$ over DINO). The gap is greatest when rewards correspond to textual captions, where annotation bootstrapping significantly outperforms other approaches that combine self-supervision with captions. Below, we show results for CC12M; we refer to the paper for results on the other datasets.
| Pretrain Dataset | Method | ImageNet | Avg Cls* | Clevr/Depth | Clevr/Count |
|---|---|---|---|---|---|
| CC12M (no captions) | MAE | 61.3 | 75.4 | 82.8 | 90.4 |
| | I-JEPA | 60.0 | 76.0 | 80.1 | 90.0 |
| | SimCLR | 67.3 | 79.0 | 76.5 | 89.4 |
| | $\text{AB}_{SimCLR}$ (Ours) | 68.0 (+0.7) | 79.5 (+0.4) | 79.5 (+3.0) | 89.6 (+0.2) |
| | DINO | 68.9 | 80.9 | 79.3 | 87.6 |
| | $\text{AB}_{DINO}$ (Ours) | 70.6 (+1.8) | 82.2 (+1.3) | 80.4 (+1.1) | 89.9 (+2.4) |
| CC12M (w/ captions) | CLIP | 69.5 | 82.8 | 70.0 | 84.4 |
| | CLIP +Aug | 72.6 | 85.0 | 72.7 | 87.0 |
| | SLIP (+SimCLR) | 72.0 | 84.3 | 72.4 | 87.2 |
| | SiLC (+DINO) | 72.8 | 85.0 | 74.4 | 88.2 |
| | $\text{AB}_{CLIP}$ (Ours) | 74.1 (+4.6) | 85.6 (+2.8) | 78.1 (+8.1) | 91.9 (+7.4) |