AlphaGo

Alvaro
Posts: 5
Joined: Fri Jan 01, 2016 11:03 am

AlphaGo

Post by Alvaro » Fri Jan 29, 2016 10:43 am

In case you live under a rock: a couple of days ago the good guys at Google DeepMind published a deeply mind-boggling paper in Nature about their computer go player, AlphaGo.

https://storage.googleapis.com/deepmind ... ing-go.pdf

It involves a few innovative ideas, but the biggest one is incorporating into the MCTS paradigm an evaluation function almost identical to what I had in mind (see my previous thread).

I should mention that their program is estimated to be 1,200 Elo points stronger than CrazyStone, the strongest commercially available opponent. [Is there an emoticon for eyes falling out of their sockets?]

Well, on the one hand someone beat me to it. On the other, I can learn how to do it from them. :)

From a quick read of their paper, I gather these are the steps of the training process:
(1) Train a "traditional" NN to compute a probability distribution for the next move, similar to what's described in three recent papers (I posted links to them in my previous thread).
(2) Make a copy of that NN and tweak it using reinforcement learning, to reward moves that win games and penalize moves that lose them (the network from (1) was just trying to imitate humans). As opponents they use randomly picked previous versions of their own network; some variety in the opponents is supposed to make their NN more robust, which makes sense.
(3) Generate a position using the first network (I think the exact procedure is: pick a random number of moves to be played; use the NN from (1) to play that many moves; make a random legal move), then run the network from (2) against itself to the end of the game and label the position by the result. Rinse. Repeat... THIRTY MILLION TIMES!!!
(4) Use the data generated in (3) to train a NN that can be used as an evaluation function (a rough code sketch of (2) and (4) follows this list).
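
Here is roughly how I picture the learning updates in steps (2) and (4), in Python/PyTorch. The tiny linear "networks", all the names, and the shapes are mine, purely to make the idea concrete; the paper uses deep convolutional nets over 19x19 feature planes:

import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19 * 19  # one output per point; passing omitted for brevity

class PolicyNet(nn.Module):
    # Stand-in for their deep CNN; maps a board encoding to move log-probs.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(BOARD, BOARD)

    def forward(self, x):
        return F.log_softmax(self.fc(x), dim=-1)

def reinforce_update(policy, optimizer, states, moves, outcome):
    # Step (2): policy gradient. outcome is +1.0 if this player won the
    # game, -1.0 if it lost; moves from won games get their log-probability
    # pushed up, moves from lost games pushed down.
    logp = policy(states)                        # (T, BOARD)
    chosen = logp.gather(1, moves.unsqueeze(1))  # log-prob of moves played
    loss = -(outcome * chosen).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

class ValueNet(nn.Module):
    # Step (4): regress a position to the expected game result in [-1, 1].
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(BOARD, 1)

    def forward(self, x):
        return torch.tanh(self.fc(x))

def value_update(value_net, optimizer, states, results):
    # states: the thirty million positions from step (3);
    # results: their +1.0 / -1.0 self-play labels.
    loss = F.mse_loss(value_net(states).squeeze(1), results)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The point is just that (2) is a plain outcome-driven policy-gradient update, while (4) is ordinary regression on the self-play labels from (3).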

The network from (1) is used to create initial probability distributions for moves in their MCTS tree. The network from (4) is used at the leaves of the MCTS tree instead of running playouts to completion. Well, not really: They do both and take the average, which seems to be way stronger. This might indicate that their evaluation function still has some serious weaknesses that are ameliorated by mixing the answer with the results of random playouts.
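
For concreteness, the leaf evaluation they describe amounts to something like the sketch below; the even weighting is exactly the "take the average" above, and the function name is mine:

def evaluate_leaf(value_estimate, rollout_result, lam=0.5):
    # Mix the value network's estimate of a leaf with the result of a fast
    # rollout played from it. Both inputs are from the side to move's
    # perspective, in [-1, 1]; lam = 0.5 reduces to a plain average.
    return (1.0 - lam) * value_estimate + lam * rollout_result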

So perhaps their evaluation function is still not good enough to make alpha-beta search a good alternative to MCTS. If they end up publishing the trained networks from (1) and (4), I will certainly try it myself. Networks like (1) have already been made available by other researchers, so there is a chance this may happen. If not, I may try to reproduce what they did, but with limited resources ["...thirty million..." still echoing in my head].

Rémi Coulom
Posts: 113
Joined: Tue Feb 12, 2008 8:31 pm

Re: AlphaGo

Post by Rémi Coulom » Fri Jan 29, 2016 11:03 am

Hi Alvaro,

Nice summary, thanks. One more detail: the supervised-learning part alone used 50 GPUs for 3 weeks, for 340 million training steps with a mini-batch size of 16. Time to invest in some hardware, I guess.

Rémi

Daniel Shawul
Posts: 2
Joined: Sun Jan 31, 2016 5:06 am

Re: AlphaGo

Post by Daniel Shawul » Sun Jan 31, 2016 7:39 pm

Thanks, Alvaro, for the nice summary. Their results are indeed mind-boggling. When I first heard of CNNs a year or so ago, I thought they were just another way of improving the move selection phase of MCTS...

StefanK
Posts: 1
Joined: Tue Feb 02, 2016 5:09 pm

Re: AlphaGo

Post by StefanK » Tue Feb 02, 2016 5:19 pm

"So perhaps their evaluation function is still not good enough to make alpha-beta search a good alternative to MCTS."

If I understood it right, AlphaGo's value network returns only win or loss.
But maybe you're onto something. If the same position is played out multiple times, then a network could be trained to estimate the winrate.
Considering the marvelous efficiency of alpha-beta, that would certainly be an ironic return to the roots.
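
Something like this rough sketch, where play_out stands for whatever fast playout policy is available (the names and the repeat count are placeholders of mine):

def winrate_target(position, play_out, n=32):
    # Play the same position out n times and regress on the empirical
    # winrate, instead of labeling it with a single win/loss result.
    wins = sum(play_out(position) for _ in range(n))
    return wins / n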

Rémi Coulom
Posts: 113
Joined: Tue Feb 12, 2008 8:31 pm

Re: AlphaGo

Post by Rémi Coulom » Tue Feb 02, 2016 6:06 pm

StefanK wrote:If I understood it right, AlphaGo's value network returns only win or loss.
It is trained from win/loss data, but it returns a probability of winning.
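
If I remember the paper correctly, the value network is trained by regression on the +/-1 game outcome, and with squared error the optimal output is the conditional expectation E[z | s] = 2 P(win | s) - 1, which rescales directly to a probability. As a one-line sketch (the function name is mine):

def win_probability(v):
    # v: value-network output in [-1, 1], trained by least squares on
    # +/-1 results; the least-squares optimum is E[z | s], so this
    # recovers the implied probability of winning.
    return (v + 1.0) / 2.0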
