KL-UCB-Based Policy for Budgeted Multi-Armed Bandits with Stochastic Action Costs

Ryo WATANABE  Junpei KOMIYAMA  Atsuyoshi NAKAMURA  Mineichi KUDO  

IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences   Vol.E100-A   No.11   pp.2470-2486
Publication Date: 2017/11/01
Online ISSN: 1745-1337
DOI: 10.1587/transfun.E100.A.2470
Type of Manuscript: PAPER
Category: Mathematical Systems Science
budgeted multi-armed bandits,  asymptotically optimal policy,  regret analysis,  

Full Text: PDF(1.6MB)>>
Buy this Article

We study the budgeted multi-armed bandit problem with stochastic action costs. In this problem, a player not only receives a reward but also pays a cost for an action of his/her choice. The goal of the player is to maximize the cumulative reward he/she receives before the total cost exceeds the budget. In the classical multi-armed bandit problem, a policy called KL-UCB is known to perform well. We propose KL-UCB-SC, an extension of this policy for the budgeted bandit problem. We prove that KL-UCB-SC is asymptotically optimal for the case of Bernoulli costs and rewards. To the best of our knowledge, this is the first result that shows asymptotic optimality in the study of the budgeted bandit problem. In fact, our regret upper bound is at least four times better than that of BTS, the best known upper bound for the budgeted bandit problem. Moreover, an empirical simulation we conducted shows that the performance of a tuned variant of KL-UCB-SC is comparable to that of state-of-the-art policies such as PD-BwK and BTS.