Many SSL methods bear a resemblance to value-based RL: they learn features for an image by predicting targets generated from nearby views, e.g., from a different crop or color augmentation. We explore a method that directly casts pre-training on unlabeled image data as an RL problem. Learning in this way resembles crop-consistency SSL, but offers a simple lever for using captions or curated image data to shape feature learning towards "rewards" of interest. Our experiments demonstrate improved representations when training on video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.
We define an RL problem over unlabeled images using a Markov chain perspective on image augmentations. An agent receives a view of an image $x$ and takes actions by applying an image transformation that changes the view (e.g., zooming out, panning left, rotating the image, cropping to a subview), aiming to find views that maximize the likelihood $p(\ell \mid x)$ of some specific semantic "annotation" of interest $\ell$ (e.g., find the gentleman in the green suit, or a kite, or a boy playing).
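A minimal sketch of this transition structure (helper names here are illustrative, not from our codebase): the "action" connecting two crops is simply the second crop's bounding box expressed in the coordinate frame of the first.

```python
import random

def sample_crop_bbox(img_w, img_h, min_frac=0.3):
    """Sample a random crop bounding box (x0, y0, x1, y1) in pixel coordinates."""
    w = int(img_w * random.uniform(min_frac, 1.0))
    h = int(img_h * random.uniform(min_frac, 1.0))
    x0 = random.randint(0, img_w - w)
    y0 = random.randint(0, img_h - h)
    return (x0, y0, x0 + w, y0 + h)

def relative_action(bb1, bb2):
    """Encode bb2 in the coordinate frame of bb1: the pan/zoom action
    a = bb_{1->2} that transforms view x1 into view x2."""
    x0, y0, x1, y1 = bb1
    w, h = x1 - x0, y1 - y0
    u0, v0, u1, v1 = bb2
    return ((u0 - x0) / w, (v0 - y0) / h, (u1 - x0) / w, (v1 - y0) / h)

# A transition in the augmentation MDP: state x1, action bb_{1->2}, next state x2.
bb1 = sample_crop_bbox(640, 480)
bb2 = sample_crop_bbox(640, 480)
a_12 = relative_action(bb1, bb2)
```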
As in other SSL methods, take two random crops of an image: $x_1$ by cropping to bounding box $\mathbf{bb}_1$, and similarly $x_2$ from $\mathbf{bb}_2$. We interpret $(x=x_1, a=\mathbf{bb}_{1\to2}, x'=x_2)$ as a transition in our environment: applying a panning transformation to the view $x_1$ to create $x_2$. Learning a value function corresponds to, for any $\ell$, using the model's outputs at one crop $x_2$ to generate a target prediction for the other crop $x_1$: $$\min D\Big(Q_{\text{AB}}(\ell \mid x_1, a=\mathbf{bb}_{1\to2}),\; (1-\gamma)\,p(\ell \mid x_2) + \gamma \max_{a'} Q^{\text{target}}(\ell \mid x_2, a')\Big)$$
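A minimal sketch of this objective, taking $D$ to be cross-entropy and approximating the $\max_{a'}$ over a set of sampled candidate actions. All names are illustrative: `target_net` stands in for a slowly-updated copy of the model (e.g., an EMA), `p_l_x2` for the base annotation distribution $p(\ell \mid x_2)$, and the $\gamma$ value is arbitrary.

```python
import torch
import torch.nn.functional as F

def ab_loss(q_net, target_net, x1, x2, a_12, candidate_actions, p_l_x2, gamma=0.9):
    """One bootstrapped update: fit Q_AB(l | x1, a=bb_{1->2}) to the target
    (1 - gamma) * p(l | x2) + gamma * max_{a'} Q_target(l | x2, a')."""
    with torch.no_grad():
        # Evaluate the slow-moving target network at x2 for each candidate action a'.
        q_next = torch.stack([F.softmax(target_net(x2, a), dim=-1)
                              for a in candidate_actions])   # [A, B, L]
        bootstrap = q_next.max(dim=0).values                 # max over a': [B, L]
        target = (1.0 - gamma) * p_l_x2 + gamma * bootstrap  # mix in the base "reward"
    # Cross-entropy between the bootstrapped target and Q_AB(l | x1, a).
    log_q = F.log_softmax(q_net(x1, a_12), dim=-1)
    return -(target * log_q).sum(dim=-1).mean()
```

Note that at $\gamma = 0$ this reduces to directly distilling $p(\ell \mid x_2)$ across crops; larger $\gamma$ leans more heavily on the bootstrapped targets.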
In our paper, we evaluate annotation bootstrapping on a number of datasets where crop-consistency SSL methods like SimCLR and DINO tend to suffer. Across EpicKitchens, COCO, and CC12M, we find that bootstrapping annotations improves over several weakly supervised and self-supervised base losses ($\text{AB}_{CLIP}$ over CLIP, $\text{AB}_{SimCLR}$ over SimCLR, $\text{AB}_{DINO}$ over DINO). The gap is greatest when rewards correspond to textual captions, where annotation bootstrapping significantly outperforms other approaches that combine self-supervision with captions. Below, we show results for CC12M; we refer to the paper for results on the other datasets.
| Pretrain Dataset | Method | ImageNet | Avg Cls* | Clevr/Depth | Clevr/Count |
|---|---|---|---|---|---|
| CC12M (no captions) | MAE | 61.3 | 75.4 | 82.8 | 90.4 |
| | I-JEPA | 60.0 | 76.0 | 80.1 | 90.0 |
| | SimCLR | 67.3 | 79.0 | 76.5 | 89.4 |
| | $\text{AB}_{SimCLR}$ (Ours) | 68.0 (+0.7) | 79.5 (+0.4) | 79.5 (+3.0) | 89.6 (+0.2) |
| | DINO | 68.9 | 80.9 | 79.3 | 87.6 |
| | $\text{AB}_{DINO}$ (Ours) | 70.6 (+1.8) | 82.2 (+1.3) | 80.4 (+1.1) | 89.9 (+2.4) |
| CC12M (w/ captions) | CLIP | 69.5 | 82.8 | 70.0 | 84.4 |
| | CLIP +Aug | 72.6 | 85.0 | 72.7 | 87.0 |
| | SLIP (+SimCLR) | 72.0 | 84.3 | 72.4 | 87.2 |
| | SiLC (+DINO) | 72.8 | 85.0 | 74.4 | 88.2 |
| | $\text{AB}_{CLIP}$ (Ours) | 74.1 (+4.6) | 85.6 (+2.8) | 78.1 (+8.1) | 91.9 (+7.4) |