Implementing and improving the AlphaZero algorithm
Rémi Coulom
Posts: 127
Joined: Tue Feb 12, 2008 8:31 pm


Post by Rémi Coulom » Wed Nov 20, 2019 10:54 pm

I'll try to explain TD(1) as simply as possible.

The idea is to tune the evaluation function so that it predicts game outcomes as accurately as possible.

Depending on the game, the nature of the game outcome might be a bit different. Some Othello programs try to predict and maximize the final score. It is also possible to predict the probability of winning if the outcome is either win or loss. For games with draws, such as chess, the evaluation might return three outcome probabilities (win, draw, loss). TD(1) works in any of these scenarios.

TD(1) works by playing plenty of self-play games that serve as the training data. The more the better. It is not necessary to play very strong self-play games. For my Crazy Zero experiments, I usually use a search tree with about 400 nodes for each move. 3 or 4 plies of alpha-beta search might be enough for a traditional alpha-beta engine. It is important to avoid duplicates in the self-play games. Using a large set of random balanced starting positions is one good way; choosing opening moves at random is another.

Then, positions from this training data are sampled at random, and the parameters of the evaluation function are tuned by gradient descent to match the outcome of the game.

The principle of gradient descent consists in defining an error function, and updating weights to reduce the error function.

When trying to predict the probability of winning, the error function is typically E = -log(probability_of_actual_result_according_to_evaluation_function). E is always positive. E is zero if the evaluation function predicted the actual result with probability 1, and reaches +infinity if the evaluation function predicted the wrong result with 100% certainty.
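As a minimal sketch (the function name and the White/Black convention are my own choices, not from the post), this error function looks like:

```python
import math

def log_loss(p_win, actually_won):
    """E = -log(probability assigned to the actual result).

    p_win is the evaluation's predicted probability that the side
    we are scoring for wins; actually_won says whether it did."""
    p_actual = p_win if actually_won else 1.0 - p_win
    return -math.log(p_actual)

# A perfect prediction gives zero error:
perfect = log_loss(1.0, True)      # 0.0
# A confident wrong prediction is punished heavily:
bad = log_loss(0.999, False)       # about 6.9, and -> infinity as p -> 1
```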

When trying to predict a score, the quadratic error function is more common: E = (predicted_score - actual_score)^2. In fact, AlphaGo used the quadratic error function to learn its probability of winning. It is a bit unusual, but the quadratic error function may be used to learn the probability of winning as well.
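The quadratic error is even simpler; a sketch, where the probability-of-winning case just encodes the actual result as a 0/1 target:

```python
def quadratic_loss(predicted, actual):
    # E = (predicted - actual)^2. Works for final scores,
    # and also for win probabilities against a 0/1 outcome.
    return (predicted - actual) ** 2

e = quadratic_loss(0.7, 1.0)  # predicted 70% win, game was won
```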

Gradient descent works with a learning rate that indicates how much evaluation parameters will change at each update. It is usual to start learning with a high learning rate (the highest that does not produce divergence), and decay it during learning. Typical decay is exponential (for instance, divide the learning rate by 2 every hour for one day). The equation to update one parameter (w) of the evaluation function is like this:
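A sketch of that exponential decay schedule (the starting rate and half-life here are made-up placeholders; tune them for your engine):

```python
def learning_rate(t_hours, lr0=0.1, half_life_hours=1.0):
    # Divide the learning rate by 2 every half_life_hours.
    # lr0 = 0.1 is an arbitrary example value, not a recommendation.
    return lr0 * 0.5 ** (t_hours / half_life_hours)

lr_start = learning_rate(0.0)
lr_later = learning_rate(1.0)   # halved after one hour
```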

w <- w - learning_rate * (dE/dw)

dE/dw is the gradient of the error with respect to w. It can be estimated approximately by (E(w + epsilon) - E(w - epsilon)) / (2 * epsilon). If you know how to compute the derivative analytically, you can replace this numerical approximation with the exact formula.

In intuitive terms, this equation says: if increasing the evaluation parameter w a little also increases the error E, then decrease w a little. Conversely, if increasing the evaluation parameter w decreases the error E, then increase the evaluation parameter a little.
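Putting the update rule and the numerical gradient together, a toy sketch (the error function E(w) = (w - 3)^2 is an invented example whose minimum is at w = 3):

```python
def numerical_gradient(error_fn, w, epsilon=1e-4):
    # Central-difference estimate of dE/dw.
    return (error_fn(w + epsilon) - error_fn(w - epsilon)) / (2.0 * epsilon)

def sgd_step(error_fn, w, learning_rate):
    # w <- w - learning_rate * dE/dw
    return w - learning_rate * numerical_gradient(error_fn, w)

# Repeated steps drive w toward the minimum of the error.
w = 0.0
for _ in range(100):
    w = sgd_step(lambda w_: (w_ - 3.0) ** 2, w, 0.1)
# w is now very close to 3.0
```

In a real engine, `error_fn` would evaluate a sampled position and compare against the game outcome, and the update would loop over all evaluation parameters.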

post-scriptum 1: traditional hand-made heuristics for alpha-beta engines were not always designed to predict the game outcome. For instance it is usual in chess to use an evaluation function in centipawns. In order to apply TD(1) to such an evaluation function, it is necessary to convert such a value to a probability of winning (between 0 and 1). The usual way to do it is to apply the logistic function to the centipawn evaluation:

probability_of_winning = 1.0 / (1.0 + exp(-evaluation * scale))

With scale being a well-chosen constant.
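A sketch of that conversion (the 1/400 scale is a made-up placeholder; fit the constant to your own data):

```python
import math

def win_probability(centipawns, scale=1.0 / 400.0):
    # Logistic squash of a centipawn evaluation into (0, 1).
    # Note the minus sign: a large positive score must map near 1.
    return 1.0 / (1.0 + math.exp(-centipawns * scale))

p_equal = win_probability(0)      # 0.5: an equal position
p_ahead = win_probability(400)    # comfortably above 0.5
```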

post-scriptum 2: the weight update equation can be refined by using a different learning rate for each weight, adaptively tuned to how sensitive the error is to this weight. See for instance the Adam algorithm.
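For reference, a single-weight sketch of the standard Adam update (hyperparameter defaults follow the usual published values; this is an illustration, not the post's implementation):

```python
import math

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for one weight. state = (m, v, t): running
    # averages of the gradient and squared gradient, plus step count.
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, (m, v, t)

w, state = adam_step(1.0, 2.0, (0.0, 0.0, 0))
```

The effective step size adapts per weight: weights with consistently large gradients get smaller steps, which is exactly the refinement described above.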
