UBC Researchers Introduce ‘First-Explore’: A Two-Policy Learning Approach to Rescue Meta-Reinforcement Learning from Failed Explorations

Reinforcement learning (RL) is now applied across much of science and technology, either as a core methodology or to optimize existing processes and systems. Despite this broad adoption, even in highly advanced fields, RL still lags in some fundamental skills. Sample inefficiency is one such problem that limits its potential. In simple terms, RL needs thousands of episodes to learn reasonably basic tasks, such as exploration, that humans master in just a few attempts. Meta-RL addresses this problem by equipping an agent with prior experience. The agent remembers the events of previous episodes to adapt to new environments, which makes it more sample-efficient. Meta-RL improves on standard RL because it learns to explore and can learn highly complex strategies far beyond the ability of standard RL, such as learning new skills or conducting experiments to learn about the current environment.

Having discussed the strengths of memory-based Meta-RL, let’s turn to what limits it. Traditional Meta-RL approaches aim to maximize the cumulative reward across all episodes in a given sequence, which means finding an optimal balance between exploration and exploitation. Generally, this balance means prioritizing exploration in early episodes so that the information gathered can be exploited later. The problem is that even state-of-the-art methods get stuck in local optima while exploring, especially when an agent must sacrifice immediate reward in pursuit of higher subsequent reward. In this article, we discuss a recent study that claims to remove this problem from Meta-RL.

Researchers at the University of British Columbia presented “First-Explore, Then Exploit,” a Meta-RL approach that separates exploration from exploitation by learning two distinct policies. The explore policy informs the exploit policy, which maximizes episode return; neither policy is trained to maximize cumulative reward on its own, but the two are combined after training to do so. Because the explore policy is trained solely to inform the exploit policy, immediate rewards no longer discourage exploration, even when the current exploit policy is still poor. The explore policy first performs successive episodes, each conditioned on the context of the current exploration sequence: previous actions, rewards, and observations. It is incentivized to produce episodes that, when added to the current context, lead to high-return exploit-policy episodes. The exploit policy then uses the context gathered by the explore policy over n episodes to produce high-return episodes.
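To make the explore-then-exploit hand-off concrete, here is a minimal sketch in Python. It is not the authors' implementation: the two policies are hand-written stand-ins for learned ones, each episode is a single bandit pull, and the names run_first_explore, env_step, ARM_REWARDS, and k_explore are hypothetical. It also assumes that only explore episodes are added to the context.

```python
# Hypothetical, noise-free arm rewards for a toy 3-armed bandit.
ARM_REWARDS = [0.2, 0.9, 0.5]

def env_step(action):
    """One single-step 'episode': pull an arm and return its reward."""
    return ARM_REWARDS[action]

def explore_policy(context):
    """Stand-in for a learned explore policy: try an arm not yet in the context."""
    tried = {action for action, _ in context}
    for arm in range(len(ARM_REWARDS)):
        if arm not in tried:
            return arm
    return 0  # everything tried; any arm will do

def exploit_policy(context):
    """Stand-in for a learned exploit policy: pick the best arm seen so far."""
    if not context:
        return 0
    return max(context, key=lambda pair: pair[1])[0]

def run_first_explore(env_step, explore_policy, exploit_policy,
                      n_episodes, k_explore):
    """Explore for k_explore episodes, then exploit for the remainder."""
    context = []   # (action, reward) pairs; the article's context also holds observations
    total = 0.0
    for episode in range(n_episodes):
        if episode < k_explore:
            action = explore_policy(context)
            reward = env_step(action)
            context.append((action, reward))  # assumption: only explore episodes feed the context
        else:
            action = exploit_policy(context)
            reward = env_step(action)
        total += reward
    return total, context

total, context = run_first_explore(env_step, explore_policy, exploit_policy,
                                   n_episodes=5, k_explore=3)
```

In this toy run the explore phase accepts two low-reward pulls in order to find the best arm, and the exploit phase then repeats that arm. How many explore episodes to run is left as a free parameter here.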

The official implementation of First-Explore uses a GPT-2-style causal transformer architecture. Both policies share parameters and differ only in the final-layer head.
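The shared-parameters idea can be illustrated with a toy model. The sketch below is an assumption-laden stand-in: it replaces the transformer with a single random linear layer plus ReLU, and the class name TwoHeadPolicy and its methods are hypothetical. It shows only the structure, one shared trunk feeding two separate output heads.

```python
import random

class TwoHeadPolicy:
    """Toy shared-trunk model with separate 'explore' and 'exploit' heads.

    The trunk here is one random linear layer with ReLU, standing in for
    the transformer body; it is not the architecture from the paper.
    """

    def __init__(self, in_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                    for _ in range(rows)]

        self.trunk = matrix(hidden_dim, in_dim)   # shared by both policies
        self.heads = {
            "explore": matrix(n_actions, hidden_dim),
            "exploit": matrix(n_actions, hidden_dim),
        }

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    def features(self, x):
        """Shared representation used by both heads."""
        return [max(0.0, h) for h in self._matvec(self.trunk, x)]

    def logits(self, x, mode):
        """Action logits from the head selected by mode."""
        return self._matvec(self.heads[mode], self.features(x))

policy = TwoHeadPolicy(in_dim=4, hidden_dim=8, n_actions=3)
x = [1.0, 0.5, -0.5, 2.0]
explore_logits = policy.logits(x, "explore")
exploit_logits = policy.logits(x, "exploit")
```

Both calls reuse the same trunk features, so only the final projection differs between the two policies.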

For experimentation, the authors evaluated First-Explore on three RL environments of varying difficulty: Bandits with One Fixed Arm, Dark Treasure Rooms, and Ray Maze. Bandits with One Fixed Arm is a multi-armed bandit problem in which one arm gives a fixed reward but has no exploratory value, so exploring the other arms means forgoing immediate reward. The second domain is a grid-world environment in which an agent that cannot see its surroundings looks for randomly positioned rewards. The final environment is the most challenging of the three and also highlights the learning capabilities of First-Explore beyond Meta-RL. It consists of randomly generated mazes with three reward positions.
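The fixed-arm bandit can be sketched as a small Python class to show why exploring costs immediate reward. The details are assumptions made for illustration and are not taken from the paper: arm 0 is the fixed arm with a reward known in advance, the other arms draw their mean rewards from a standard normal distribution, and the names FixedArmBandit and explore_then_pick are hypothetical.

```python
import random

class FixedArmBandit:
    """Toy bandit: arm 0 pays a fixed, known reward; other arms are unknown."""

    def __init__(self, fixed_reward, n_arms, seed=0, noise_std=0.0):
        self.rng = random.Random(seed)
        self.fixed_reward = fixed_reward
        self.noise_std = noise_std
        # Assumption: unknown arm means are drawn from N(0, 1).
        self.means = [fixed_reward] + [
            self.rng.gauss(0.0, 1.0) for _ in range(n_arms - 1)
        ]

    def pull(self, arm):
        if arm == 0:
            return self.fixed_reward  # reveals nothing new about the environment
        return self.means[arm] + self.rng.gauss(0.0, self.noise_std)

def explore_then_pick(bandit):
    """Pull every unknown arm once, then return the best arm observed."""
    observed = {0: bandit.fixed_reward}  # the fixed arm needs no exploration
    for arm in range(1, len(bandit.means)):
        observed[arm] = bandit.pull(arm)
    return max(observed, key=observed.get)

bandit = FixedArmBandit(fixed_reward=0.5, n_arms=5, seed=1)
best_arm = explore_then_pick(bandit)
```

With a fixed reward of 0.5 and unknown arms averaging 0 under these assumptions, each exploratory pull is expected to pay less than the fixed arm. An agent that only chases immediate reward would therefore never leave arm 0, even though one of the other arms may be better.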

First-Explore achieved twice the total reward of Meta-RL approaches in the Fixed Arm Bandit domain. The margin rose to 10 times in the second environment and 6 times in the last. Besides Meta-RL approaches, First-Explore also substantially outperformed other RL methods in settings that require forgoing immediate reward.

Conclusion: First-Explore offers an effective solution to the immediate-reward problem that plagues traditional Meta-RL approaches. It separates exploration and exploitation into two policies that, combined after training, maximize cumulative reward, which Meta-RL was unable to achieve regardless of the training method. However, it also faces some challenges that pave the way for future research: the inability to explore the future, disregard for negative rewards, and long-sequence modeling. It will be interesting to see how these problems are resolved and whether doing so improves the efficiency of RL in general.


Check out the Paper. All credit for this research goes to the researchers of this project.

Adeeba Alam Ansari is currently pursuing her Dual Degree at the Indian Institute of Technology (IIT) Kharagpur, earning a B.Tech in Industrial Engineering and an M.Tech in Financial Engineering. With a keen interest in machine learning and artificial intelligence, she is an avid reader and an inquisitive individual. Adeeba firmly believes in the power of technology to empower society and promote welfare through innovative solutions driven by empathy and a deep understanding of real-world challenges.