Distributional Safety Critic for Stochastic Latent Actor-Critic
THIAGO SILVA MIRANDA
HEDER SOARES BERNARDINO
When employing reinforcement learning techniques in real-world applications, it is often desirable to constrain the agent so that it does not perform actions that could lead to damage, harm, or otherwise unwanted scenarios. To specify and enforce these constraints, current state-of-the-art safe reinforcement learning algorithms rely on the constrained Markov decision process framework, which uses a cost function to inform the agent of how unsafe each transition is. In particular, recent approaches focus on developing safe behavior under conditions where the full observability assumption is relaxed and, instead of having access to the true state of the environment, the agent receives observations with incomplete information. In this vein, we develop a method that combines distributional reinforcement learning techniques with methods used to facilitate learning in partially observable environments. Our approach, called distributional safe stochastic latent actor-critic (DS-SLAC), uses an implicit quantile network as its safety critic and learns from a stochastic latent representation of the environment. We evaluate the performance of DS-SLAC on four Safety-Gym tasks. DS-SLAC obtains better results than those of state-of-the-art algorithms in two of the evaluated environments, while developing a safe policy in three of them. Lastly, we also identify the main challenges of performing distributional reinforcement learning in the safety-constrained partially observable setting.
Reinforcement Learning, Safety, Distributional RL
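To make the distributional safety critic concrete, the following is a minimal NumPy sketch of an implicit quantile network of the kind the abstract describes: sampled quantile fractions are passed through a cosine embedding, combined with latent-state features, and trained with the quantile Huber loss. All class names, layer sizes, and the random-weight initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_embedding(taus, n_cos=64):
    # Embed quantile fractions tau in [0, 1] with cosine features:
    # phi_j(tau) = cos(pi * j * tau), j = 0..n_cos-1 (IQN-style embedding).
    j = np.arange(n_cos)
    return np.cos(np.pi * j[None, :] * taus[:, None])  # shape (K, n_cos)

class IQNSafetyCritic:
    """Illustrative implicit quantile network over cost returns (random weights)."""

    def __init__(self, feat_dim, n_cos=64):
        self.W_tau = rng.normal(size=(n_cos, feat_dim)) * 0.1
        self.W_out = rng.normal(size=(feat_dim, 1)) * 0.1

    def __call__(self, z, taus):
        # z: latent-state features, shape (feat_dim,); taus: quantile fractions, shape (K,).
        phi = np.maximum(cosine_embedding(taus, self.W_tau.shape[0]) @ self.W_tau, 0.0)
        h = phi * z[None, :]              # Hadamard product: state x quantile embedding
        return (h @ self.W_out).ravel()   # one predicted cost-return quantile per tau

def quantile_huber_loss(pred, target, taus, kappa=1.0):
    # Pairwise TD errors between target samples and predicted quantiles,
    # weighted asymmetrically by |tau - 1{u < 0}| (standard distributional RL loss).
    u = target[None, :] - pred[:, None]
    huber = np.where(np.abs(u) <= kappa, 0.5 * u**2, kappa * (np.abs(u) - 0.5 * kappa))
    return np.mean(np.abs(taus[:, None] - (u < 0)) * huber / kappa)

critic = IQNSafetyCritic(feat_dim=8)
taus = rng.uniform(size=16)
q = critic(rng.normal(size=8), taus)          # quantiles of the cost-return distribution
loss = quantile_huber_loss(q, rng.normal(size=8), taus)
```

Because the critic outputs a distribution over cost returns rather than a single expected cost, constraints can in principle be enforced on quantiles or tail measures of cost, rather than only on the mean.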