Paper Review: Offline reinforcement learning: Tutorial, review, and perspectives on open problems


This post reviews a survey paper on offline reinforcement learning and collects related papers.

What are offline reinforcement learning algorithms

  1. Offline reinforcement learning utilizes previously collected data (i.e., data-driven reinforcement learning on a static dataset), without additional online data collection.

  2. The training process does not interact with the MDP at all, and the policy is only deployed after being trained.

  3. Also named:

    • Fully off-policy: off-policy algorithms still often employ additional interaction (i.e., online data collection) during the learning process, so the term “fully off-policy” is sometimes used to indicate that no additional online data collection is performed.
    • Batch reinforcement learning

Why use offline reinforcement learning algorithms

  1. The process of reinforcement learning involves iteratively collecting experience by interacting with the environment, typically with the latest learned policy, and then using that experience to improve the policy. In many settings, this sort of online interaction is impractical, either because data collection is expensive (e.g., in robotics, educational agents, or healthcare) or because it is dangerous (e.g., in autonomous driving or healthcare).

  2. Offline reinforcement learning methods equipped with powerful function approximation may turn large datasets into generalizable and powerful decision-making engines, effectively allowing anyone with a large enough dataset to turn it into a policy that optimizes a desired utility criterion.

Challenges in offline reinforcement learning algorithms

  1. Offline reinforcement learning requires a sufficient understanding of the MDP dynamics entirely from a fixed dataset.

  2. Learning entirely from offline data, without any additional on-policy interaction, is difficult. High-dimensional and expressive function approximation generally exacerbates this issue, since function approximation leaves the algorithms vulnerable to distributional shift (a small numeric sketch of this failure mode follows this list).

  3. Offline reinforcement learning should produce a policy that improves on the behavior in the dataset, which means the learned policy's actions must differ from the corresponding actions in the dataset. However, current machine learning methods are based on the assumption that the data is independent and identically distributed, and aim to match the underlying distribution of the given dataset.

  4. There is no possibility of further exploration within the algorithm because of the reliance on a fixed dataset; little can be done about coverage unless the dataset itself is enriched.
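
To make the distributional-shift issue in point 2 concrete, here is a minimal sketch (my own illustration with assumed numbers, not an example from the paper): an expressive approximator is fit to Q-values on the narrow action range covered by the data, and a greedy max is then taken over a much wider range.

```python
import numpy as np

rng = np.random.default_rng(0)

# True Q-values at a single state: a simple concave function of the action.
def true_q(a):
    return -(a - 0.3) ** 2

# The static dataset only covers actions in [0.0, 0.6] (the behavior policy's support).
a_data = rng.uniform(0.0, 0.6, size=50)
q_data = true_q(a_data) + 0.01 * rng.normal(size=a_data.shape)

# Fit an expressive approximator (here a degree-6 polynomial) to the offline data.
q_hat = np.poly1d(np.polyfit(a_data, q_data, deg=6))

# Policy improvement queries the approximator over ALL candidate actions,
# including actions far outside the dataset's support.
a_grid = np.linspace(-1.0, 2.0, 1001)
a_greedy = a_grid[np.argmax(q_hat(a_grid))]

print(f"greedy action: {a_greedy:.2f}  (dataset support was [0.0, 0.6])")
print(f"predicted Q: {q_hat(a_greedy):.3f}   true Q: {true_q(a_greedy):.3f}")
# The max over actions tends to exploit whichever region the approximator
# happens to over-estimate; with expressive models this is usually a region
# far from the data, where the predicted value is unreliable.
```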

How to achieve offline reinforcement learning

  1. Direct estimation of the policy's return via importance sampling (see the importance-sampling sketch after this list).
    • These estimators suffer from high variance because of the importance weights, especially in sequential settings with long horizons and high-dimensional state and action spaces.
  2. Estimation of a value function via dynamic programming (see the fitted Q-iteration sketch after this list).

  3. Model-based offline reinforcement learning.
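
For the importance-sampling route, below is a minimal sketch of the per-trajectory estimator. It is not the paper's code; the `trajectories` layout and the `target_policy_prob` callback are illustrative assumptions.

```python
import numpy as np

def is_estimate(trajectories, target_policy_prob, gamma=0.99):
    """Per-trajectory importance-sampling estimate of the target policy's return.

    trajectories: list of trajectories, each a list of
        (state, action, reward, behavior_prob) tuples, where behavior_prob is
        the probability the data-collecting (behavior) policy gave the action.
    target_policy_prob: function (state, action) -> probability under the
        policy we want to evaluate.
    """
    per_traj_estimates = []
    for traj in trajectories:
        weight = 1.0            # product of per-step importance ratios
        ret = 0.0
        discount = 1.0
        for state, action, reward, behavior_prob in traj:
            weight *= target_policy_prob(state, action) / behavior_prob
            ret += discount * reward
            discount *= gamma
        per_traj_estimates.append(weight * ret)
    # The product of per-step ratios is what drives the variance up with the
    # horizon and with the mismatch between the two policies.
    return float(np.mean(per_traj_estimates))

# Toy usage: two one-step trajectories logged by a uniform behavior policy.
trajs = [[("s0", 0, 1.0, 0.5)], [("s0", 1, 0.0, 0.5)]]
target = lambda s, a: 0.9 if a == 0 else 0.1   # target policy prefers action 0
print(is_estimate(trajs, target))              # ≈ 0.9, the target policy's expected reward
```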
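
For the dynamic-programming route, here is a minimal tabular sketch of offline fitted Q-iteration under assumed dataset and MDP conventions; it is just the basic idea, not the paper's algorithm.

```python
import numpy as np

def offline_fitted_q(dataset, n_states, n_actions, gamma=0.99, n_iters=100):
    """Tabular offline fitted Q-iteration.

    dataset: list of (s, a, r, s_next, done) transitions collected offline.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Aggregate bootstrapped targets for every (s, a) pair seen in the dataset.
        target_sum = np.zeros((n_states, n_actions))
        target_cnt = np.zeros((n_states, n_actions))
        for s, a, r, s_next, done in dataset:
            target = r if done else r + gamma * np.max(q[s_next])
            target_sum[s, a] += target
            target_cnt[s, a] += 1
        seen = target_cnt > 0
        q[seen] = target_sum[seen] / target_cnt[seen]
        # Note: np.max(q[s_next]) may bootstrap from (s_next, a') pairs that
        # never appear in the dataset, which is exactly where distributional
        # shift bites.
    return q

# Toy usage: a 2-state, 2-action chain with a handful of logged transitions.
data = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 1, True), (0, 0, 0.0, 0, False)]
print(offline_fitted_q(data, n_states=2, n_actions=2, gamma=0.9))
```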