Paper “Distributed Consensus-Based Multi-Agent Temporal-Difference Learning”, by M.S. Stanković, M. Beko and S.S. Stanković, has been published in Automatica!
The paper proposes two new distributed consensus-based algorithms for temporal-difference learning in multi-agent Markov decision processes. The algorithms are of the off-policy type and are aimed at linear approximation of the value function. With agents’ observations restricted to local data and communications limited to their small neighborhoods, each algorithm consists of: a) local updates of the parameter estimates based on either the standard TD(λ) or the emphatic ETD(λ) algorithm, and b) a dynamic consensus scheme implemented over a time-varying lossy communication network. The algorithms are completely decentralized, allowing efficient parallelization as well as applications in which the agents may have different behavior policies and different initial state distributions while evaluating a common target policy. It is proved, under nonrestrictive assumptions, that the proposed algorithms weakly converge to the solutions of the mean ordinary differential equation (ODE) common to all the agents. It is also proved that the whole system can be stabilized by a proper choice of the network and that the parameter estimates weakly converge to consensus. Simulation results are also presented, illustrating the main properties of the algorithms and providing comparisons with similar existing schemes.
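To give a rough feel for the two-component structure described above, here is a minimal, hypothetical sketch in Python: each agent performs a local off-policy TD(λ) update with linear function approximation and eligibility traces, and the agents then mix their parameter estimates through a consensus (weighted-averaging) step over a communication graph. The toy policies, random features, step sizes, and the fixed doubly stochastic weight matrix are all illustrative assumptions; the paper itself treats emphatic ETD(λ) as well, time-varying lossy networks, and weak-convergence analysis, none of which is reproduced here.

import numpy as np

rng = np.random.default_rng(0)

n_agents, n_features = 4, 5
step_size, lam, gamma = 0.05, 0.9, 0.95

behavior = np.array([0.5, 0.5])   # each agent's own behavior policy over 2 actions (assumed)
target = np.array([0.7, 0.3])     # common target policy being evaluated (assumed)

theta = rng.normal(size=(n_agents, n_features))   # one parameter vector per agent
traces = np.zeros((n_agents, n_features))         # eligibility traces per agent

# Doubly stochastic consensus weights for a ring of 4 agents (fixed here for simplicity).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

for t in range(200):
    # a) Local off-policy TD(lambda) update from each agent's own (simulated) transition.
    for i in range(n_agents):
        phi = rng.normal(size=n_features)        # feature vector of the current state
        phi_next = rng.normal(size=n_features)   # feature vector of the next state
        reward = rng.normal()
        a = rng.choice(2, p=behavior)
        rho = target[a] / behavior[a]            # importance-sampling ratio (off-policy correction)
        delta = reward + gamma * theta[i] @ phi_next - theta[i] @ phi
        traces[i] = rho * (gamma * lam * traces[i] + phi)
        theta[i] = theta[i] + step_size * delta * traces[i]
    # b) Dynamic consensus step: each agent averages its estimate with its neighbors'.
    theta = W @ theta

print("spread of agents' estimates:", np.max(np.abs(theta - theta.mean(axis=0))))

The consensus step is what couples the otherwise independent local learners: even though each agent sees only its own data and its neighbors' parameter estimates, the repeated averaging drives the estimates toward a common value, which is the effect the convergence results in the paper make precise.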