We consider the model-free reinforcement learning problem and study the popular Q-learning algorithm with linear function approximation for finding the optimal policy. We provide finite-time error bounds for the convergence of Q-learning with linear function approximation under an assumption on the sampling policy. Unlike some prior work in the literature, we do not make the unnatural assumption that the samples are i.i.d. (since they are Markovian), and do not require an additional projection step in the algorithm. To show this result, we first consider a more general nonlinear stochastic approximation algorithm under Markovian noise, and derive a finite-time bound on the mean-square error, which we believe is of independent interest. The key idea of our proof is to use Lyapunov drift arguments and exploit the geometric mixing property of the underlying Markov chain.
Siva Theja Maguluri is an Assistant Professor in the School of Industrial and Systems Engineering at Georgia Tech. Before that, he was a Research Staff Member in the Mathematical Sciences Department at IBM T. J. Watson Research Center. He obtained his Ph.D. and MS in ECE as well as MS in Applied Math from UIUC, and B.Tech in Electrical Engineering from IIT Madras. His research interests are broadly in Applied Probability, Optimization and Reinforcement Learning, and include Scheduling, Resource Allocation and Revenue Optimization in a variety of systems including Data Centers, Cloud Computing, Wireless Networks, Block Chains, Ride hailing systems etc. He is a co-recipient of the “Best Publication in Applied Probability” award, presented by INFORMS Applied probability society every two years.