Targeted Search Control in AlphaZero for Effective Policy Improvement

Post by Rémi Coulom »

By Alexandre Trudeau and Michael Bowling

https://arxiv.org/pdf/2302.12359.pdf

AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample-efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.
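
The mechanism is simple enough to sketch. Below is a minimal Python sketch of the archive-based search control the abstract describes: each self-play game starts from a state sampled from an archive of previously visited states rather than always from the initial position. The class name, the archive capacity, the reservoir-style insertion, and the probability of still starting from the true initial state are my own assumptions for illustration, not details taken from the paper.

import random

class StartStateArchive:
    """Hypothetical archive of 'states of interest' for start-state sampling."""

    def __init__(self, capacity=10000, p_initial=0.2):
        self.capacity = capacity      # assumed fixed archive size
        self.p_initial = p_initial    # assumed chance of using the true initial state
        self.states = []

    def add(self, state):
        # Bound memory: once the archive is full, overwrite a random slot.
        if len(self.states) < self.capacity:
            self.states.append(state)
        else:
            self.states[random.randrange(self.capacity)] = state

    def sample_start(self, initial_state):
        # Usually resume from a state deeper in the game tree; occasionally
        # fall back to the initial state so openings are still trained on.
        if not self.states or random.random() < self.p_initial:
            return initial_state
        return random.choice(self.states)

# Toy usage with integers standing in for game states. In a real training
# loop, states visited during self-play would be add()-ed, and each new
# game would begin from sample_start() instead of the empty board.
archive = StartStateArchive(capacity=100, p_initial=0.2)
for s in range(50):
    archive.add(s)
print(archive.sample_start(initial_state=0))

If this reading is right, it also explains the shorter-trajectory point: games that begin deeper in the tree end sooner, so the same self-play budget yields more independent value targets.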