Visual Pre-Training on Unlabeled Images using Reinforcement Learning

Dibya Ghosh, Sergey Levine

Summary

Many self-supervised learning (SSL) methods bear a close resemblance to value-based RL: they learn features for an image by predicting targets generated from nearby views, e.g., a different crop or color augmentation of the same image. We explore a method that directly casts pre-training on unlabeled image data as an RL problem. Learning in this way resembles crop-consistency SSL, but offers a simple lever for using captions or curated image data to shape feature learning towards "rewards" of interest. Our experiments demonstrate improved representations when training on video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.

Method

We define an RL problem over unlabeled images using a Markov chain perspective on image augmentations. An agent receives a view $x$ of an image and takes actions by applying an image transformation that changes the view (e.g., zooming out, panning left, rotating the image, or cropping to a subview), with the goal of finding views that maximize the likelihood $p(\ell|x)$ of some semantic "annotation" of interest $\ell$ (e.g., find the gentleman in the green suit, or a kite, or a boy playing).
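
To make the view-MDP concrete, below is a minimal sketch of how such transitions could be constructed when the view transformations are crops. The helper names (`sample_box`, `relative_box`, `sample_transition`) and the `(x0, y0, w, h)` box encoding are our own illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the "view MDP": states are crops of an image, and the action
# is the relative pan/zoom transform between two crops. The helpers below are
# illustrative, not the authors' released code.
import random

def sample_box(img_w, img_h, min_frac=0.2):
    """Sample a random crop (x0, y0, w, h) inside an img_w x img_h image."""
    frac = random.uniform(min_frac, 1.0)
    w, h = max(1, int(img_w * frac)), max(1, int(img_h * frac))
    return (random.randint(0, img_w - w), random.randint(0, img_h - h), w, h)

def relative_box(bb_from, bb_to):
    """Express bb_to in the coordinate frame of bb_from: the 'action' bb_{from->to}."""
    x0, y0, w0, h0 = bb_from
    x1, y1, w1, h1 = bb_to
    return ((x1 - x0) / w0, (y1 - y0) / h0, w1 / w0, h1 / h0)

def sample_transition(img_w, img_h):
    """One environment transition (x, a, x'): two crops plus the transform between them."""
    bb1, bb2 = sample_box(img_w, img_h), sample_box(img_w, img_h)
    return bb1, relative_box(bb1, bb2), bb2
```

Other families of view transformations from the description above (rotations, zooms, pans) could be encoded as actions in the same way.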

Take two random crops of an image, as in any other SSL method: $x_1$ by cropping to bounding box $\mathbf{bb}_1$, and similarly $x_2$ by cropping to $\mathbf{bb}_2$. We interpret $(x=x_1, a=\mathbf{bb}_{1\to2}, x'=x_2)$ as a transition in our environment: applying a panning transformation to the view $x_1$ to create $x_2$. Learning a value function corresponds to, for every $\ell$, using the model's outputs at one crop $x_2$ to generate a target prediction for the other crop $x_1$: $$\min_{Q_{AB}} D\Big(Q_{AB}(\ell | x_1, a=\mathbf{bb}_{1\to2}),\; (1-\gamma)\,p(\ell | x_2) + \gamma \max_{a'} Q^{\text{target}}(\ell | x_2, a')\Big)$$
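
A minimal sketch of this bootstrapped objective is below, assuming a PyTorch-style setup in which `q_net(x, a)` returns logits over annotations for $Q_{AB}$, `q_target_net(x)` returns logits for every candidate action $a'$, and `annotation_model(x)` approximates $p(\ell|x)$; we use a cross-entropy form for the divergence $D$. All function names and shapes here are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn.functional as F

def annotation_bootstrap_loss(q_net, q_target_net, annotation_model,
                              x1, x2, a_12, gamma=0.9):
    """Bootstrapped loss for a crop pair (x1, x2) with relative-crop action a_12.

    q_net(x, a)         -> (B, L) logits for Q_AB(l | x, a)
    q_target_net(x)     -> (B, A, L) logits for Q^target(l | x, a') over candidate actions a'
    annotation_model(x) -> (B, L) logits approximating p(l | x)
    """
    # Online prediction Q_AB(l | x1, a = bb_{1->2})
    pred_log_probs = F.log_softmax(q_net(x1, a_12), dim=-1)          # (B, L)

    with torch.no_grad():
        # "Reward" term: (1 - gamma) * p(l | x2)
        p_l_x2 = F.softmax(annotation_model(x2), dim=-1)             # (B, L)
        # Bootstrap term: gamma * max_{a'} Q^target(l | x2, a')
        q_tgt = F.softmax(q_target_net(x2), dim=-1)                  # (B, A, L)
        bootstrap = q_tgt.max(dim=1).values                          # (B, L)
        target = (1 - gamma) * p_l_x2 + gamma * bootstrap

    # D(target, prediction): cross-entropy, one common choice of divergence
    return -(target * pred_log_probs).sum(dim=-1).mean()
```

As in standard value-based RL, $Q^{\text{target}}$ would typically be a slowly-updated copy (e.g., an exponential moving average) of the online network.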

Results

In our paper, we evaluate annotation bootstrapping on a number of datasets where crop-consistency SSL methods like SimCLR and DINO tend to suffer. Across datasets like EpicKitchens, COCO, and CC12M, we find that bootstrapping annotations improves over several weakly-supervised and self-supervised base losses ($\text{AB}_{CLIP}$ over CLIP, $\text{AB}_{SimCLR}$ over SimCLR, $\text{AB}_{DINO}$ over DINO). The gap is greatest when rewards correspond to textual captions, where annotation bootstrapping significantly outperforms other approaches that combine self-supervision with captions. Below, we show results for CC12M; we refer to the paper for results on the other datasets.

Pretrain Dataset     | Method                        | ImageNet     | Avg Cls*     | Clevr/Depth  | Clevr/Count
---------------------|-------------------------------|--------------|--------------|--------------|-------------
CC12M (no captions)  | MAE                           | 61.3         | 75.4         | 82.8         | 90.4
                     | I-JEPA                        | 60.0         | 76.0         | 80.1         | 90.0
                     | SimCLR                        | 67.3         | 79.0         | 76.5         | 89.4
                     | $\text{AB}_{SimCLR}$ (Ours)   | 68.0 (+0.7)  | 79.5 (+0.4)  | 79.5 (+3.0)  | 89.6 (+0.2)
                     | DINO                          | 68.9         | 80.9         | 79.3         | 87.6
                     | $\text{AB}_{DINO}$ (Ours)     | 70.6 (+1.8)  | 82.2 (+1.3)  | 80.4 (+1.1)  | 89.9 (+2.4)
CC12M (w/ captions)  | CLIP                          | 69.5         | 82.8         | 70.0         | 84.4
                     | CLIP +Aug                     | 72.6         | 85.0         | 72.7         | 87.0
                     | SLIP (+SimCLR)                | 72.0         | 84.3         | 72.4         | 87.2
                     | SiLC (+DINO)                  | 72.8         | 85.0         | 74.4         | 88.2
                     | $\text{AB}_{CLIP}$ (Ours)     | 74.1 (+4.6)  | 85.6 (+2.8)  | 78.1 (+8.1)  | 91.9 (+7.4)