We did some interpretability on Leela Zero, a superhuman Go model. With a technique similar to the logit lens, we found that the residual structure of Leela Zero induces a preferred basis throughout network, giving rise to persistent, interpretable channels. By directly analyzing the weights of the policy and value heads, we found that the model stores information related to the probability of the pass move along the top edge of the board, and those related to the board value in checkerboard patterns. We also took a deep dive into a specific Go technique, the ladder, and identified a very small subset of model components that are causally responsible for the model’s judgement of ladders.
An abstract strategy board game for two players
1 post • Page 1 of 1
https://www.lesswrong.com/posts/FF8i6SL ... ela-zero-2