Offline RL Policies Should be Trained to be Adaptive

Our paper studies the theoretical and practical utility of learning memory-based policies (policies that change through an episode) in offline reinforcement learning. Using Bayesian tools, we prove that this dependency on memory is necessary to be optimal and discuss how these memory-based "adaptive" policies must be trained. As a first step towards learning optimal adaptive strategies, we propose an ensemble-based offline RL algorithm for learning adaptive policies and demonstrate their favorable properties in challenging image-based offline RL problems.

Locked Doors

In this task, an agent is placed in a room with four doors, only one of which is unlocked. The agent's observation includes an image whose label corresponds to the unlocked door, so the task requires both proper perception (classifying the image) and control (opening the appropriate door). On this task, any standard offline RL method learns a state-based policy, which means that it only ever tries a single door for any given image. But this is sub-optimal! Our agent learns an adaptive strategy -- trying one door and, if it fails, updating its belief vector and trying another one. This leads to much better performance than conservative offline methods.
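To make the difference between the two strategies concrete, here is a minimal sketch of the "try a door, update the belief, try another" loop. It is an illustration only, not the ensemble-based algorithm from the paper: the belief is a hand-written probability vector standing in for an image classifier's output, and the names `update_belief` and `run_episode` are hypothetical.

```python
import numpy as np

NUM_DOORS = 4


def update_belief(belief, failed_door):
    """Bayesian update after observing that `failed_door` is locked.

    Assumes `belief` is a probability vector with some mass on doors
    other than `failed_door` (e.g. a softmax output, which is strictly
    positive everywhere).
    """
    posterior = np.array(belief, dtype=float)
    posterior[failed_door] = 0.0  # a door that failed cannot be the unlocked one
    total = posterior.sum()
    if total <= 0.0:
        raise ValueError("belief has no mass left on the remaining doors")
    return posterior / total


def run_episode(belief, unlocked_door, adaptive, max_attempts=NUM_DOORS):
    """Return the number of attempts needed to open the door, or None on failure."""
    belief = np.array(belief, dtype=float)
    for attempt in range(1, max_attempts + 1):
        door = int(np.argmax(belief))  # act greedily under the current belief
        if door == unlocked_door:
            return attempt
        if adaptive:
            # Adaptive policy: use the failure as evidence.
            belief = update_belief(belief, door)
        # State-based policy: belief is unchanged, so the same door is retried.
    return None


# Hypothetical classifier output that is confidently wrong: door 1 is unlocked.
initial_belief = [0.5, 0.3, 0.15, 0.05]
print(run_episode(initial_belief, unlocked_door=1, adaptive=True))   # 2
print(run_episode(initial_belief, unlocked_door=1, adaptive=False))  # None
```

In this toy setting the state-based policy keeps retrying the door the classifier prefers, while the adaptive policy recovers from the misclassification on its second attempt.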

Procgen Mazes

Procgen Mazes is a challenging maze-solving task, requiring perception from 64x64 images and generalization across different maze layouts and textures. The offline problem requires an agent to learn from transitions collected in 1000 maze layouts and produce a policy that works in new, unseen mazes. Standard offline RL methods learn to try one path to the goal, leading to catastrophic failure whenever this path fails to reach it. Our agent starts by trying one path, but if that fails, the belief vector update leads it to try another. As a result, the agent solves mazes 10% more frequently than the conservative solution.

Full Abstract

Offline RL algorithms must account for the fact that the dataset they are provided may leave many facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but rather on all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.

[Paper PDF] [arXiv]

Citation

Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, Sergey Levine
Offline RL Policies Should be Trained to be Adaptive. In ICML 2022.
