Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Research output: Contribution to journal › Article › Research › peer-review



We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

Original language: English
Pages (from-to): 115-128
Number of pages: 14
Journal: Journal of Artificial Intelligence Research
Issue number: 1
Publication status: Published - 23 Jan 2020

Bibliographical note

Publisher Copyright:
© 2020 AI Access Foundation. All rights reserved.
