Learning to Play: Reinforcement Learning and Game Strategies

Some of the most compelling demonstrations of machine learning in recent years have come from game-playing agents. DeepMind’s AlphaGo defeated a world Go champion in 2016; a few years later, AlphaStar reached Grandmaster level in the real-time strategy game StarCraft II; and reinforcement-learning agents now routinely surpass human performance in a broad range of Atari games. What makes games such a useful testbed is that the rules are fixed, the goal is clear, and performance is easy to measure, which makes it possible to compare strategies precisely and to learn from experience in a controlled setting.

In this project, we begin from the basics. Before reaching for deep neural networks, we consider two complementary questions: whether it is possible to determine the best strategy for a simple game by reasoning carefully about its structure, and whether a computer can arrive at that same strategy without being told the rules explicitly, simply by playing the game many thousands of times and learning from the outcomes. These two questions sit at the heart of reinforcement learning, the branch of machine learning concerned with learning to make good decisions through interaction with an environment. The first question leads us to Monte Carlo estimation; the second to Q-learning. The following links illustrate what can be achieved when these ideas are applied at scale.

Group project

Pig is a simple dice game that is easy to describe but surprisingly rich to analyse. On each turn a player repeatedly rolls a standard six-sided die; they may choose to hold at any point, banking whatever total they have accumulated this turn, or they may roll again to try to build a higher score. If they roll a 1 at any point, however, they score nothing for that turn and play passes to their opponent. The first player to reach a cumulative score of 100 wins.

The central strategic question concerns when a player should choose to hold. Roll too conservatively and the opponent catches up; roll too aggressively and the player repeatedly scores nothing on a 1. A simplified version of this trade-off, ignoring the opponent’s score, can be made precise: if the current turn total is \(k\), the expected outcome from holding is exactly \(k\), while the expected outcome from rolling once more is \((5/6)(k + 4)\), since the die shows a value in \(\{2, 3, 4, 5, 6\}\) (mean 4) with probability \(5/6\), and otherwise scores nothing. Setting these equal gives a break-even turn total of \(k = 20\), below which rolling is favourable on average and above which holding is.

Figure 1:The expected outcome of holding versus rolling one more time, as a function of the current turn total (simplified single-player version). The two strategies break even at a turn total of 20.

The group project will explore this question using two complementary approaches, and will also examine the deeper question of what it means for a strategy to be optimal in the first place.

The first approach is Monte Carlo estimation. We define a value function \(V(i, j, k)\), the probability of winning from the state in which the active player has \(i\) points banked, the opponent has \(j\) points, and \(k\) points have been accumulated this turn. Rather than computing this analytically, we estimate it empirically: from any given state, we simulate a large number of complete games under a fixed opponent policy and record the fraction in which the active player wins. A hold/roll decision for each state then follows by comparing the estimated win probability of holding against that of rolling again. Of course, the quality of these estimates depends on the number of simulations used, which raises natural questions about accuracy and computational cost.

The second approach is Q-learning. Rather than estimating values state by state, we let an agent play the game repeatedly and learn from the outcomes. Q-learning (Watkins and Dayan, 1992) is a model-free reinforcement learning algorithm that maintains estimates \(Q(s, a)\) of the expected future reward for taking action \(a\) in state \(s\), updating them using the Bellman equation each time a reward is observed. Starting from random play, the agent’s Q-values converge to an optimal policy as experience accumulates, without the rules ever being specified explicitly.

A third, cross-cutting theme concerns the design of the reward signal. A reward of \(+1\) for winning and \(0\) for losing is the most natural choice, but it is not the only one: one could reward the player for each point banked, or penalise a bust on a 1 explicitly. These choices affect what the agent learns, and raise the question of what “optimal” actually means in this context. Maximising the probability of winning against a specific opponent is not the same as maximising expected score, and the two objectives can lead to different strategies. The group project will implement both the Monte Carlo and Q-learning approaches, compare the strategies they produce, and investigate how the choice of reward signal shapes the resulting policy.

Expected goals

By the end of the group project, we will have:

learned the key concepts of Markov decision processes (MDPs): states, actions, rewards, policies, and the Bellman equation;
implemented a Monte Carlo value estimation approach for Pig and used it to derive a hold/roll policy;
implemented a Q-learning agent that learns a strategy for Pig through self-play;
explored the design of the reward signal, comparing at least two reward formulations and discussing their effect on the learned policy;
considered what “optimal” means in this context and how the choice of objective relates to the choice of reward;
compared the Monte Carlo-estimated policy to the Q-learning policy and investigated the effect of the learning rate and exploration parameter (\(\varepsilon\)-greedy exploration) on convergence speed;
produced clear visualisations of the estimated value function and learned policy (for example, hold/roll regions as a function of game state);
written a group report presenting the mathematical background and the computational experiments.

Individual project

The group project establishes the core toolkit of Monte Carlo estimation and Q-learning on a small, tractable game. The individual project invites each student to take this toolkit in a direction of their own choosing. Possible directions include, but are not limited to, the following.

Translating Q-tables into human-useable strategies

A Q-learning agent represents its policy implicitly through the entries of a large table of Q-values, one for each (state, action) pair. Such a representation, while computationally convenient, is not directly interpretable by a human player. An individual project in this direction could investigate methods for extracting simplified, human-readable decision rules from a learned Q-table, for instance by fitting a threshold rule to the hold/roll boundary, or by applying a decision tree to the differences in Q-values across states. The aim is to understand how much policy quality is sacrificed when the full Q-table is compressed into a form that a person could actually use during play, and whether any simple rule comes close to matching the learned policy’s performance.

Transfer learning and curriculum design

Transfer learning asks whether knowledge acquired in one setting can accelerate learning in a related but more complex setting. Reverse curriculum learning is a specific strategy in which the agent begins training on a simplified or structured variant of the problem and progressively adapts to the full task, exploiting structural similarities to improve convergence. An individual project in this direction could implement a curriculum of Pig variants, for instance varying the winning threshold or the number of dice, and investigate whether training on an ordered sequence of tasks leads to faster or more stable convergence than training on the target task from scratch.

Yahtzee

Yahtzee is a dice game with a considerably larger state space than Pig: on each turn a player rolls five dice, re-rolling subsets up to twice, with the aim of filling thirteen scoring categories in the most favourable way. The optimal strategy can still be solved by dynamic programming, but the state space is large enough that careful implementation and efficient state representation matter substantially. An individual project in this direction could implement the full dynamic-programming solution and compare its performance against heuristic strategies.

Gridworld and more general MDPs

Gridworld is a standard benchmark environment in reinforcement learning in which an agent navigates a grid, collecting rewards and avoiding penalties, and must learn a navigation policy. It provides a clean setting for comparing Q-learning with other algorithms such as SARSA or deep Q-networks (DQN), and for studying exploration-exploitation trade-offs more carefully. An individual project could also consider continuous-state variants in which function approximation is required.

Multi-agent settings

In Pig the two players do not interact directly: the opponent’s strategy affects only when play passes, not the dice rolls themselves. More complex games (for instance, poker or two-player card games) involve explicit interaction, bluffing, and the need to model the opponent. An individual project in this direction could explore game-theoretic solution concepts such as Nash equilibria, and investigate whether reinforcement learning agents converge to them in practice.

Deep reinforcement learning

Once the core Q-learning algorithm is understood, a natural extension is to replace the tabular Q-function with a neural network, enabling the agent to handle larger state spaces. Of course, this additional representational power comes with new challenges: training instability, hyperparameter sensitivity, and the need for replay buffers. An individual project could implement a simple DQN for a game of the student’s choice and study when and why the approximation helps.

Mode of operation and evidence of learning

This project is primarily computational. Students are expected to implement algorithms in R or Python and to present their work in a reproducible format; R Markdown or a Jupyter Notebook is encouraged throughout. The emphasis is on understanding why the algorithms work, not merely on producing working code, and on interpreting results critically rather than simply reporting them.

Group phase (Epiphany term)

The group will meet weekly with the supervisor, divide implementation tasks, and maintain a shared record of progress. Evidence of learning takes the form of a group presentation and poster, and a collaborative written report covering the mathematical background, implementation details, and experimental results.

Individual phase (Easter term)

Students work independently on their chosen extension, meeting the supervisor individually. Evidence of learning takes the form of an individual written report. The report should include a clear problem statement, a description of the method, computational experiments with well-chosen visualisations, and a critical discussion of the results.

Prerequisites and resources

Prerequisites

Statistical Inference II.
An interest in games and in programming.
Statistical computing using R or Python (essential).
Familiarity with basic probability (expected values, conditional probability).

Resources

Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.
Neller, T. W., and Presser, C. G. M. (2004). Optimal play of the dice game Pig. The UMAP Journal, 25(1), 25-47.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4, 237-285.
Wiering, M. A., and Van Otterlo, M. (Eds.) (2012). Reinforcement Learning: Adaptation, Learning, and Optimization. Springer.

Area of research

Reinforcement learning; decision theory; computational statistics; artificial intelligence.