AAI_2025_Capstone_Chronicles_Combined
Evaluating Deep Learning Model Convergence in Chess via Nash Equilibria
Abstract

This paper investigates the shortcomings of traditional deep learning evaluation metrics—such as test set accuracy and F1-score—in the context of chess position classification. While these metrics are widely used to assess convergence and generalization, they often fail to capture deeper issues related to distributional shift and robustness. Using a dataset of over 300 million expert-level chess positions from 2024 games, we train a ResNet-based model to predict game outcomes from individual board states. Despite favorable training performance and stable test metrics, a round-robin tournament between model snapshots reveals a surprising degradation in strategic robustness over time. We compute the Maximum Entropy Nash Equilibrium (Maxent Nash) from these self-play results and show that earlier, less-trained models consistently outperform their later counterparts in head-to-head matches. This divergence underscores the limitations of relying solely on static test sets and highlights the need for dynamic evaluation methods that account for adversarial and out-of-distribution scenarios. We argue that Maxent Nash provides a more insightful and real-time perspective on convergence in zero-sum game settings like chess.

Introduction

The game of chess has been seen as a frontier for programming and algorithms since the first chess program, "Turochamp", was written by Alan Turing and David Champernowne in 1948. Turochamp was never completed because the algorithm's complexity exceeded the capabilities of early computers (Michie, 2000). Though chess is a perfect-information game with a relatively small set of rules and pieces, the combinatorial complexity and long-term strategic motifs of strong human chess play eluded early chess programs. As such, computer chess was once considered the Drosophila of AI. It wasn’t until the match between