Convergence of the Q-ae Learning on Deterministic MDPs and Its Efficiency on the Stochastic Environment
Gang ZHAO, Shoji TATSUMI, Ruoying SUN
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences
Publication Date: 2000/09/25
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Algorithms and Data Structures
Keywords: Q-learning, Q-ae learning, exploration, dynamic programming, planning
Reinforcement Learning (RL) is an efficient method for solving Markov Decision Processes (MDPs) without a priori knowledge of the environment; RL methods can be classified into exploitation-oriented and exploration-oriented methods. Q-learning is a representative exploration-oriented RL method. It is guaranteed to obtain an optimal policy; however, it needs numerous trials to do so because Q-learning has no action-selecting mechanism. To accelerate the learning rate of Q-learning and to realize both exploitation and exploration during the learning process, the Q-ee learning system has been proposed; it uses a pre-action-selector, an action-selector, and back propagation of Q values to improve the performance of Q-learning. However, Q-ee learning is suitable only for deterministic MDPs, and its convergence to an optimal policy has not been proved. In this paper, after discussing different exploration methods, we replace the pre-action-selector of Q-ee learning with Active Exploration Planning (AEP), a method that implements active exploration of the environment; we call the resulting learning system Q-ae learning. With this replacement, Q-ae learning not only retains the advantages of Q-ee learning but is also adapted to stochastic environments. Moreover, for deterministic MDPs, this paper presents and proves the condition under which an agent obtains the optimal policy by Q-ae learning. Further, discussions and experiments show that, by adjusting the relation between the learning factor and the discount rate, the exploration of a stochastic environment can be controlled. Experimental results on the exploration rate and the correct rate of learned policies also illustrate the efficiency of Q-ae learning in stochastic environments.
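For readers unfamiliar with the baseline, the following is a minimal sketch of standard tabular Q-learning with a naive epsilon-greedy selector, run on a hypothetical deterministic chain MDP invented here for illustration. It is the baseline algorithm the abstract refers to, not the paper's Q-ee or Q-ae method; all names, parameters, and the toy environment are assumptions, not taken from the paper.

```python
import random

def q_learning(n_states, n_actions, step, goal,
               alpha=0.5, gamma=0.9, epsilon=0.2,
               episodes=300, start=0, seed=0):
    """Tabular Q-learning with a simple epsilon-greedy action selector."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = start
        while s != goal:
            # epsilon-greedy: explore with probability epsilon, else greedy
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r = step(s, a)
            # standard Q-learning update (terminal state has value 0)
            target = r + (0.0 if s2 == goal else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# Toy deterministic chain MDP: states 0..4, action 1 moves right,
# action 0 moves left; reward 1 only on entering goal state 4.
def chain_step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0)

Q = q_learning(n_states=5, n_actions=2, step=chain_step, goal=4)
policy = [max(range(2), key=lambda i: Q[s][i]) for s in range(4)]
```

Note that before any reward has been observed, the greedy tie-break here drifts away from the goal, so the earliest episodes are very long; this illustrates the "numerous trials" weakness the abstract attributes to plain Q-learning, which the pre-action-selector of Q-ee learning, and AEP in Q-ae learning, are designed to address.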