Learning the Balance between Exploration and Exploitation via Reward

Tetsuya YOSHIDA, Koichi HORI, Shinichi NAKASUKA

Publication
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences, Vol.E82-A, No.11, pp.2538-2545
Publication Date: 1999/11/25
Print ISSN: 0916-8508
Type of Manuscript: Special Section PAPER (Special Section on Concurrent Systems Technology)
Keyword: multi-agent system, reinforcement learning, reward, exploration, exploitation

Summary: 
This paper proposes a new method for improving cooperation in concurrent systems within the framework of Multi-Agent Systems (MAS) by using reinforcement learning. When subsystems work independently and concurrently, achieving appropriate cooperation among them is important for improving the effectiveness of the overall system. Treating subsystems as agents makes it easy to deal explicitly with the interactions among them, since these interactions can be modeled naturally as communication of intended information among agents. In our approach, agents try to learn, via reward, the appropriate balance between exploration and exploitation, which is important in distributed and concurrent problem solving in general. By focusing on how reward is given in reinforcement learning rather than on the learning equation, we define two kinds of reward in the context of cooperation between agents, in contrast to reinforcement learning within the framework of a single agent. Reward for insistence by an individual agent facilitates exploration, and reward for concession to other agents facilitates exploitation. We examined the cooperation method through experiments on the design of micro satellites. The results showed that letting the agents themselves learn the appropriate balance between insistence and concession was effective to some extent in facilitating cooperation among agents. The results also suggest that the relative magnitude of these two rewards could serve as a new control parameter for the overall behavior of the MAS.
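
The summary does not give the learning equation or the exact reward definitions, so the following is only a toy sketch of the idea that the relative magnitude of the two rewards can steer an agent's behavior. Everything in it is an assumption made for illustration: the two-action model ("insist" or "concede"), the names `Agent`, `r_insist`, `r_concede`, and `train`, the constant rewards, and the simple incremental value update, which is a generic textbook rule and not the paper's method.

```python
from dataclasses import dataclass, field

# Hypothetical two-action model: an agent either insists on its own
# proposal or concedes to the proposal of another agent.
ACTIONS = ("insist", "concede")


@dataclass
class Agent:
    r_insist: float   # assumed reward magnitude for insisting
    r_concede: float  # assumed reward magnitude for conceding
    alpha: float = 0.1  # step size of the incremental update
    q: dict = field(default_factory=lambda: {a: 0.0 for a in ACTIONS})

    def reward(self, action: str) -> float:
        # Constant reward per action; a real system would derive this
        # from the outcome of the interaction between agents.
        return self.r_insist if action == "insist" else self.r_concede

    def update(self, action: str) -> None:
        # Generic incremental value estimate: Q <- Q + alpha * (r - Q).
        self.q[action] += self.alpha * (self.reward(action) - self.q[action])

    def greedy(self) -> str:
        # Action with the highest estimated value.
        return max(ACTIONS, key=lambda a: self.q[a])


def train(agent: Agent, steps: int = 50) -> Agent:
    # Alternate the two actions so that both values get estimated.
    for t in range(steps):
        agent.update(ACTIONS[t % 2])
    return agent


# Larger reward for insistence: the agent leans toward insisting.
insister = train(Agent(r_insist=1.0, r_concede=0.5))
# Larger reward for concession: the agent leans toward conceding.
conceder = train(Agent(r_insist=0.5, r_concede=1.0))
print(insister.greedy(), conceder.greedy())
```

In this sketch only the ratio of `r_insist` to `r_concede` differs between the two agents, which is the sense in which the relative magnitude of the rewards acts as a control parameter. How insistence and concession map onto exploration and exploitation in the actual satellite design experiments is not something this toy model captures.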