RTP-Q: A Reinforcement Learning System with Time Constraints Exploration Planning for Accelerating the Learning Rate

Gang ZHAO  Shoji TATSUMI  Ruoying SUN  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E82-A   No.10   pp.2266-2273
Publication Date: 1999/10/25
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Artificial Intelligence and Knowledge
Keyword: reinforcement learning, planning, reacting, exploration, exploitation


Reinforcement learning is an efficient method for solving Markov Decision Processes in which an agent improves its performance by using scalar reward values, and it offers a high capability for reactive and adaptive behavior. Q-learning is a representative reinforcement learning method that is guaranteed to obtain an optimal policy but needs numerous trials to achieve it. The k-Certainty Exploration Learning System realizes active exploration of an environment, but its learning process is separated into two phases, and estimated values are not derived while the environment is being identified. The Dyna-Q architecture makes fuller use of a limited amount of experience and, by learning and planning under constrained time, achieves a better policy with fewer environment interactions while the environment is being identified; however, its exploration is not active. This paper proposes the RTP-Q reinforcement learning system, which adapts an efficient method for exploring an environment into time-constrained exploration planning and combines it into an integrated system of learning, planning, and reacting, aiming for the best of both methods. By improving the performance of exploring the environment and refining the model of the environment, the RTP-Q learning system accelerates the learning rate for obtaining an optimal policy. Experimental results on navigation tasks demonstrate that the RTP-Q learning system is efficient.
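
For readers unfamiliar with the Dyna-Q idea mentioned above (direct Q-learning from real experience, interleaved with a bounded number of planning updates drawn from a learned model), the following is a minimal tabular sketch. It is not the RTP-Q system proposed in the paper: the exploration here is plain epsilon-greedy, and the time-constrained exploration planning is not reproduced. The corridor environment, the function names `corridor_step` and `dyna_q`, the use of a fixed count `planning_steps` as a stand-in for a time budget, and all hyperparameter values are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def corridor_step(s, a, n=6):
    """Hypothetical toy environment: states 0..n-1 in a row, goal at n-1.

    Returns (reward, next_state); reward is 1.0 only on reaching the goal.
    """
    s2 = min(max(s + a, 0), n - 1)
    return (1.0 if s2 == n - 1 else 0.0), s2


def dyna_q(step, start, goal, actions, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q with a fixed planning budget per real step."""
    rng = random.Random(seed)
    q = defaultdict(float)   # Q[(state, action)], initialised to 0
    model = {}               # learned deterministic model: (s, a) -> (r, s')

    def greedy(s):
        # Greedy action with random tie-breaking.
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2):
        # One-step Q-learning backup.
        target = r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start
        while s != goal:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2 = step(s, a)
            update(s, a, r, s2)        # learning from real experience
            model[(s, a)] = (r, s2)    # refine the model of the environment
            for _ in range(planning_steps):
                # Planning: replay a remembered transition from the model.
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return q


q = dyna_q(corridor_step, start=0, goal=5, actions=[-1, 1])
policy = {s: max([-1, 1], key=lambda a: q[(s, a)]) for s in range(5)}
print(policy)
```

Setting `planning_steps=0` reduces the sketch to ordinary Q-learning, which gives a simple way to compare how many environment interactions each variant needs on the same task.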