Learning the Balance between Exploration and Exploitation via Reward
Tetsuya YOSHIDA, Koichi HORI, Shinichi NAKASUKA
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences
Publication Date: 1999/11/25
Print ISSN: 0916-8508
Type of Manuscript: Special Section PAPER (Special Section on Concurrent Systems Technology)
Keywords: multi-agent system, reinforcement learning, reward, exploration, exploitation
This paper proposes a new method for improving cooperation in concurrent systems within the framework of Multi-Agent Systems (MAS) by using reinforcement learning. When subsystems work independently and concurrently, achieving appropriate cooperation among them is important for the effectiveness of the overall system. Treating subsystems as agents makes it easy to deal explicitly with their interactions, since these can be modeled naturally as communication among agents with intended information. In our approach, agents try to learn, via reward, the appropriate balance between exploration and exploitation, which is important in distributed and concurrent problem solving in general. By focusing on how reward is given in reinforcement learning rather than on the learning equation, we define two kinds of reward in the context of cooperation between agents, in contrast to reinforcement learning within the framework of a single agent. Reward for insistence by an individual agent facilitates exploration, and reward for concession to other agents facilitates exploitation. The cooperation method was examined through experiments on the design of micro satellites, and the results showed that letting agents themselves learn the appropriate balance between insistence and concession was effective to some extent in facilitating cooperation among agents. The results also suggested the possibility of using the relative magnitude of these rewards as a new control parameter for the overall behavior of the MAS.
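As a rough illustration only, the idea of two separately weighted rewards can be sketched in Python. The abstract does not give the paper's reward definitions or learning equation, so everything here is an assumption made for the sketch: the class name `NegotiatingAgent`, the parameters `insist_reward`, `concede_reward`, and `learning_rate`, and the simple incremental value update.

```python
from dataclasses import dataclass, field

INSIST = "insist"    # keep pushing the agent's own proposal (tied to exploration in the abstract)
CONCEDE = "concede"  # accept another agent's proposal (tied to exploitation in the abstract)


@dataclass
class NegotiatingAgent:
    # Hypothetical parameters: magnitudes of the two rewards and a learning rate.
    insist_reward: float = 1.0
    concede_reward: float = 1.0
    learning_rate: float = 0.5
    values: dict = field(default_factory=lambda: {INSIST: 0.0, CONCEDE: 0.0})

    def reward(self, action: str, succeeded: bool) -> float:
        """Return the reward for an action; unsuccessful actions earn nothing."""
        if not succeeded:
            return 0.0
        return self.insist_reward if action == INSIST else self.concede_reward

    def update(self, action: str, succeeded: bool) -> None:
        """Move the action's value estimate toward the observed reward."""
        r = self.reward(action, succeeded)
        self.values[action] += self.learning_rate * (r - self.values[action])

    def preferred_action(self) -> str:
        """Greedy choice between the two actions; ties go to INSIST."""
        return max((INSIST, CONCEDE), key=lambda a: self.values[a])


# Assumed usage: a larger concession reward tilts the agent toward conceding.
agent = NegotiatingAgent(insist_reward=1.0, concede_reward=2.0)
agent.update(INSIST, succeeded=True)
agent.update(CONCEDE, succeeded=True)
print(agent.preferred_action())
```

In this toy setting, changing the ratio of `insist_reward` to `concede_reward` shifts which action the agent comes to prefer, which loosely mirrors the abstract's suggestion that the relative magnitude of the two rewards could act as a control parameter.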