Model-Based Reinforcement Learning for Infinite-Horizon Discounted Constrained Markov Decision Processes

In many real-world reinforcement learning (RL) problems, the learning agent must satisfy safety constraints in addition to maximizing its objective. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy: (i) the GM-CRL algorithm, which has access to a generative model, and (ii) the UC-CRL algorithm, which learns the model using an upper-confidence-style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, with respect to both objective maximization and constraint satisfaction.
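The model-based idea behind the generative-model setting can be illustrated with a minimal sketch. It is not the GM-CRL algorithm itself: it only shows how a transition model might be estimated by querying a generative model a fixed number of times per state-action pair, and how the discounted objective and constraint values of a fixed policy can then be compared under the true and estimated models. The toy two-state CMDP, the uniform policy, and all names (`estimate_model`, `policy_value`, `sample_next_state`, `true_p`) are assumptions made for illustration, and the step that optimizes over policies is omitted.

```python
import random


def estimate_model(sample_next_state, n_states, n_actions, n_samples):
    """Empirical transition estimate from n_samples generative-model calls per (s, a)."""
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def policy_value(p, signal, policy, gamma, start, n_iters=500):
    """Discounted expected cumulative signal (reward or cost) of a fixed stochastic policy,
    computed by iterative policy evaluation."""
    n_states = len(p)
    v = [0.0] * n_states
    for _ in range(n_iters):
        v = [
            sum(
                policy[s][a]
                * (signal[s][a] + gamma * sum(p[s][a][t] * v[t] for t in range(n_states)))
                for a in range(len(policy[s]))
            )
            for s in range(n_states)
        ]
    return v[start]


# Hypothetical toy CMDP: 2 states, 2 actions (all numbers are illustrative).
true_p = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
reward = [[1.0, 0.0], [0.0, 0.5]]   # objective signal
cost = [[0.0, 1.0], [1.0, 0.0]]     # safety-cost signal
policy = [[0.5, 0.5], [0.5, 0.5]]   # uniform policy, fixed for evaluation only
gamma = 0.9

rng = random.Random(0)


def sample_next_state(s, a):
    # Stand-in for the generative model: draws s' from the true kernel P(. | s, a).
    return rng.choices([0, 1], weights=true_p[s][a])[0]


p_hat = estimate_model(sample_next_state, n_states=2, n_actions=2, n_samples=2000)

v_r_true = policy_value(true_p, reward, policy, gamma, start=0)
v_r_hat = policy_value(p_hat, reward, policy, gamma, start=0)
v_c_true = policy_value(true_p, cost, policy, gamma, start=0)
v_c_hat = policy_value(p_hat, cost, policy, gamma, start=0)

print("objective value: true", v_r_true, "estimated", v_r_hat)
print("constraint value: true", v_c_true, "estimated", v_c_hat)
```

Increasing `n_samples` shrinks the gap between the true and estimated values for both the objective and the cost; a sample complexity result of the kind described above quantifies how many samples suffice for a desired accuracy in both quantities.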

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence

