Multi-Armed Bandits in Multi-Agent Networks
© 2017 IEEE. This paper addresses the multi-armed bandit problem in a multi-player framework. Players explore a finite set of arms with stochastic rewards, and the reward distribution of each arm is player-dependent. The goal is to find the best global arm, i.e., the one with the largest expected reward averaged over players. To achieve this goal, we develop a distributed variant of the well-known UCB1 algorithm. Constrained by the network structure, players exchange information locally to estimate the global rewards, using a confidence bound that depends on the network characteristics. At each round, each player votes for an arm, and the majority vote is played as the network action. The whole network gains the reward of the network action, with the aim of maximizing global welfare. The performance of the algorithm is measured via the notion of network regret. We prove that the regret scales logarithmically with the time horizon and inversely with the spectral gap of the network. Our algorithm is optimal in the sense that, in a complete network, it scales down the regret of its single-player counterpart by the network size. We present numerical experiments that verify our theoretical results.
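The mechanism described in the abstract can be sketched in a few lines: each player keeps a local estimate of every arm's global mean, mixes estimates with neighbors through a row-stochastic matrix encoding the network (a consensus step), forms a UCB index, and the arm winning the majority vote is played by the whole network. The sketch below is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes Bernoulli rewards and substitutes the plain UCB1 bonus for the paper's network-dependent confidence bound; all function and variable names are hypothetical.

```python
import numpy as np

def distributed_ucb_majority(mu, A, T, seed=0):
    """Hypothetical sketch of a distributed UCB1 variant with majority voting.

    mu : (N, K) array of player-dependent expected rewards (Bernoulli arms assumed)
    A  : (N, N) row-stochastic mixing matrix encoding the network
    T  : time horizon
    Returns the cumulative network reward and the per-arm play counts.
    """
    rng = np.random.default_rng(seed)
    N, K = mu.shape
    est = np.zeros((N, K))   # each player's running estimate of the global means
    counts = np.ones(K)      # network-level play count per arm

    # Initialization: the network plays each arm once.
    for k in range(K):
        est[:, k] = (rng.random(N) < mu[:, k]).astype(float)

    total_reward = 0.0
    for t in range(K, T):
        # Consensus step: average estimates with neighbors via the mixing matrix.
        est = A @ est
        # UCB index; the paper's bound also depends on the spectral gap,
        # but here the standard UCB1 bonus stands in for it.
        bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
        votes = np.argmax(est + bonus, axis=1)
        arm = np.bincount(votes, minlength=K).argmax()  # majority vote -> network action
        # Every player draws its own reward from the chosen arm.
        rewards = (rng.random(N) < mu[:, arm]).astype(float)
        total_reward += rewards.mean()  # network gains the averaged reward
        # Incremental update of each player's local estimate for the played arm.
        est[:, arm] += (rewards - est[:, arm]) / (counts[arm] + 1.0)
        counts[arm] += 1
    return total_reward, counts
```

On a complete network (uniform mixing matrix) with one arm clearly best on average, the majority vote concentrates on that arm, so its play count dominates, consistent with the logarithmic-regret behavior the abstract describes.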
author list (cited authors)
Shahrampour, S., Rakhlin, A., & Jadbabaie, A.