Connie Trojan

How to Lose Blackjack (Optimally)
Thu, 05 May 2022

This post is all about the game of blackjack. What is the optimal strategy? How much money can you expect to win on average, if you play optimally? Should you decide to play at all?

The first section will outline the formulation of the rules we’ll be considering. We’ll then mathematically define the task at hand by formulating the problem as a Markov decision process, and solve it with linear programming.


Blackjack

Blackjack is a card game where the aim is to collect cards totalling a value higher than that of the dealer without exceeding 21. The game begins with the dealer dealing a card to each player and themself. The cards have the following values: 2-10 are worth their face values, picture cards (jack, queen, king) are worth 10, and aces can be counted as either 1 or 11 according to choice. On their turn, players have two choices: hit (be dealt another card from the deck) or stick (stop drawing cards). If the total value of a player’s cards exceeds 21 they go bust and immediately lose, receiving a reward of -1. After all players have opted to stick, it is the dealer’s turn. The dealer plays according to a fixed strategy where they hit on any total up to 16 and stick on 17 or above. All remaining players win (receiving a reward of 1) if the dealer goes bust. Otherwise, remaining hands win if their total value is higher than the dealer’s, draw (receiving a reward of 0) if it is the same, or lose if it is lower.

We’ll assume that the cards are dealt from a deck that is sufficiently large that the probabilities of drawing any particular card do not change significantly over the course of the game, so that draws from the deck can be modelled as independent and identically distributed and there is no benefit to keeping track of the history of cards dealt to yourself or other players. This is not an unrealistic assumption for casino play, where cards are dealt from a shoe of several decks that is periodically re-shuffled, and card-counting is frowned upon anyway!

This means that each player can be considered to play independently against the dealer. It also means that only three pieces of information are required to decide which action to take next: the player’s current total (with an ace counted as 11 unless doing so results in a total over 21), the dealer’s card, and whether or not the player has an ace that is currently being counted as an 11. The last situation is known as having a usable ace or a soft hand, since the player can save themself from going bust by instead counting their ace as a 1.
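As a quick sketch of this bookkeeping (the function name and card encoding are mine, not from any standard implementation), the player's total and usable-ace flag can be computed from a hand as follows:

```python
def hand_state(cards):
    """Compute (total, usable_ace) for a list of card values.

    Cards are encoded as 2-10 for number and picture cards and 1 for aces;
    one ace is counted as 11 whenever that keeps the total at 21 or below.
    """
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace

# An (ace, 6) hand is a soft 17; drawing a 10 turns it into a hard 17.
```

Note that at most one ace can ever be usable: two aces counted as 11 would already exceed 21.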


Formulation as a Markov decision process

A Markov decision process (MDP) is a sequential decision-making process where outcomes depend partly on the decisions made and partly on chance. At each time step, the player finds themself in some state S \in \mathcal{S} and must select some action A from an action set \mathcal{A}(S). After taking an action, the process moves to the next time step, moving randomly to some new state S' (the probability of moving to any particular new state will likely depend on the action taken) and awarding a reward R(S,S') to the player.

In blackjack, the possible states are triples (s,d,a), where s \in \{2,\ldots,21\} is the player’s current total, d \in \{A,2,\ldots,10\} is the dealer’s card, and a \in \{0,1\} is whether the player has a usable ace (a=1) or not (a=0). In each of these states, the player has two possible actions: hit or stick. We also have three special terminal states: WIN, DRAW, and LOSE. When a player reaches one of these states they receive a reward of 1, 0, or -1 respectively and can no longer take any actions or receive further rewards.

The possible state transitions are as follows:

If the player sticks on some total s, they transition to WIN with probability \mathbb{P}(\text{dealer goes bust or sticks on total less than s}), DRAW with probability \mathbb{P}(\text{dealer sticks on total of s}), or LOSE with probability \mathbb{P}( \text{dealer sticks on total greater than s}). These probabilities can be computed exactly by recursion (this is a little fiddly since there’s lots of cases to consider, but is possible since you can’t visit the same state twice), or estimated by simulating a large number of hands for the dealer and observing how often they stick on 17, 18, 19, 20, or 21 and how often they go bust (much easier!). The exact probabilities are represented in the figure below:

Note that these are mostly dominated by the fact that drawing a 10 occurs with probability 4/13, so starting from a 7 the most likely outcome is to end up sticking on 17, but starting from a 6 the most likely outcome is to go bust after hitting on 16.
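The simulation approach is only a few lines of code. Here's a minimal sketch, assuming an infinite deck (ace through 9 each drawn with probability 1/13, ten-valued cards with probability 4/13):

```python
import random
from collections import Counter

def draw():
    # Infinite-deck draw: ace=1 and 2-9 each with probability 1/13,
    # 10/jack/queen/king all count as 10, so 10 has probability 4/13.
    return min(random.randint(1, 13), 10)

def hand_total(cards):
    # Count one ace as 11 whenever that does not bust the hand.
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def dealer_outcome(first_card):
    """Play the dealer's fixed strategy (hit below 17) to completion."""
    cards = [first_card]
    while hand_total(cards) < 17:
        cards.append(draw())
    total = hand_total(cards)
    return "bust" if total > 21 else total

# Estimate the dealer's stick/bust probabilities starting from a 2, say:
random.seed(0)
n = 100_000
counts = Counter(dealer_outcome(2) for _ in range(n))
probs = {outcome: count / n for outcome, count in counts.items()}
```

With 100,000 simulated hands the estimates are accurate to within a few tenths of a percent, which is plenty for comparing against the exact recursion.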

If the player hits, the transitions are a little more complicated. First a card of value c is drawn from the deck:

  • If s \leq 10, the player cannot go bust and we transition to state (s+c, d, a) (if an ace is drawn, we count it as an 11 so c=11 and a = 1).
  • If s \geq 11 then we always count new aces as 1 (since counting it as an 11 would cause us to go bust). It is now possible for the player to go bust:
    • If c \leq 21 - s, we transition to state (s+c, d, a).
    • If c > 21 - s and a=1, we must use the ace and transition to (s+c-10,d,0) .
    • If c > 21 - s and a=0, go bust: transition to LOSE and receive reward -1.
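Putting the cases together, the hit transition can be transcribed directly into code (names are mine; LOSE stands for the terminal losing state):

```python
def hit_transition(s, d, a, c):
    """Next state after hitting on total s (dealer card d, usable-ace flag a)
    and drawing a card of value c, where an ace is drawn as c=1."""
    if s <= 10:
        if c == 1:                  # a new ace is counted as 11
            return (s + 11, d, 1)
        return (s + c, d, a)
    # s >= 11: new aces always count as 1
    if c <= 21 - s:
        return (s + c, d, a)
    if a == 1:                      # demote the usable ace to save the hand
        return (s + c - 10, d, 0)
    return "LOSE"                   # bust
```

For example, hitting on a hard 15 with a 10 goes bust, while the same draw on a soft 15 just demotes the ace and leaves the total at 15.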

Optimal strategy

Our objective is to find the decision strategy that maximises the expected reward. This can be represented as a function \pi^*(S) of the current state that tells us the best decision to take. To find this, it will be useful to define one last concept: the value v(S) of being in a particular state, i.e. the expected reward if we start in that state and play optimally. For an MDP, these values exist and are uniquely defined as satisfying the Bellman optimality equations:

\begin{aligned}v(S) &= \mathbb{E}(R(S,S') + v(S') \, | \, S, A = \pi^*(S)) \, .\end{aligned}

That is to say, the value of being in state S is the expected value of the immediate reward received plus the value of the state we transition to, given that we start in state S and take the optimal action \pi^*(S).

In particular, since \pi^*(S) chooses the action that maximises expected reward, this means that:

\begin{aligned}v(S) &= \max_{A \in \mathcal{A}(S)}\mathbb{E}(R(S,S') + v(S') \, | \, S, A)\\ &= \max_{A \in \mathcal{A}(S)} \sum\limits_{S' \in \mathcal{S}} \mathbb{P}(S' \, | \, S,A) (R(S,S') + v(S')) \, . \end{aligned}

The v(S) therefore satisfy the following inequalities:

\begin{aligned}v(S) \, &\geq \, \sum\limits_{S' \in \mathcal{S}} \mathbb{P}(S' \, | \, S,A) (R(S,S') + v(S')) \quad \quad \forall A \in \mathcal{A}(S) \, ,\end{aligned}

with equality for A = \pi^*(S) .

This means we can find all of the state values by solving the following linear program:

\begin{aligned}&\min\limits_{v \in \mathbb{R}^{|\mathcal{S}|}} \; \sum\limits_{S \in \mathcal{S}} \,v(S), \\ \\ &\text{subject to:} \\ & v(S) \, \geq \, \sum\limits_{S' \in \mathcal{S}} \mathbb{P}(S' \, | \, S,A) (R(S,S') + v(S')) \quad \quad \forall A \in \mathcal{A}(S) , \; S \in \mathcal{S}\end{aligned}

The minimisation objective and \geq constraints mean that the solution to the problem will be the unique set of values that satisfy the Bellman equations. The linear program can be efficiently solved with standard linear-programming software as long as the state and action spaces aren’t too large, which is the case for our formulation of blackjack.
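To sketch how this LP is assembled, here is a toy two-state MDP solved with scipy's linprog (a stand-in for the full blackjack state space; the construction of the constraint rows is exactly the same, just with more states and transition probabilities):

```python
import numpy as np
from scipy.optimize import linprog

# Toy episodic MDP: from state 0 we can cash out for reward 1 or 2;
# from state 1 we can cash out for reward 1, or move (reward 0) to state 0.
# Terminal states have value 0, so each Bellman constraint reads
#   v(S) >= expected reward + expected value of the next state,
# which in linprog's <= form is  -v(S) + sum_S' P(S'|S,A) v(S') <= -r(S,A).
A_ub = np.array([
    [-1.0,  0.0],   # v0 >= 1   (state 0, cash out for 1)
    [-1.0,  0.0],   # v0 >= 2   (state 0, cash out for 2)
    [ 0.0, -1.0],   # v1 >= 1   (state 1, cash out for 1)
    [ 1.0, -1.0],   # v1 >= v0  (state 1, move to state 0)
])
b_ub = np.array([-1.0, -2.0, -1.0, 0.0])

res = linprog(c=np.ones(2), A_ub=A_ub, b_ub=b_ub)
v = res.x   # optimal values: v(0) = 2, v(1) = 2
```

The minimisation pushes each v(S) down until its best action's constraint binds, so the solver returns v(0) = v(1) = 2: from either state, the best play is worth 2.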

We will then be able to recover the optimal strategy \pi^* as:

\begin{aligned}\pi^*(S) &= \argmax_{A \in \mathcal{A}(S)} \mathbb{E}(R(S,S') + v(S') \, | \, S, A) \, .\end{aligned}

What does all of this look like for blackjack? Using the above method, the values of all of the possible blackjack states are as follows:

And, to answer one of the questions I asked to start with, the optimal blackjack strategy is:

Note that you should obviously also always hit on totals up to 11.


Time to play blackjack?

Not quite. The other question of relevance is how much money you can expect to win! Since we know the probabilities of all of the possible starting states, as well as the values of starting in all of them, we can also compute the expected value of the game. That value is… -0.0466. Unfortunately, even playing optimally we can expect to lose money. This is largely due to the fact that if the player goes bust they lose immediately – we don’t check to see if the dealer would have gone bust as well so there is no chance of a draw (if we did then we would expect to win a princely 0.136). The same holds true when the player has the option of splitting or doubling, although the house edge is smaller. This is by design – if there were no house edge then blackjack would not be a casino staple.

So unfortunately, the real winning move is not to play at all. Alternatively, if you’re playing with friends, make sure you volunteer to be the dealer!


Further reading

  • – Kunal Menda
  • Reinforcement Learning: An Introduction – Sutton, R. S. and Barto, A. G.

The Prisoner’s Dilemma on Groundhog Day
Mon, 18 Apr 2022

The prisoner’s dilemma is a famous problem in game theory. The situation is as follows: you and an accomplice have been arrested on suspicion of a serious crime. The prosecutors have sufficient evidence to convict both of you on a lesser charge but offer both of you a bargain in the hopes of a conviction on the serious charge. If you betray your accomplice and testify that they committed the crime then you will get off with a lesser sentence. You must make your decisions in isolation without communicating, but you are aware that your accomplice has been offered the same bargain. If only you take the bargain then you will serve no time in prison while your accomplice serves 3 years. If both of you stay silent then you will both serve 1 year on the lesser charge, and if you both testify against each other then you will both serve 2 years. What do you do?

The context and exact numbers in this formulation are unimportant – the key features are that mutual co-operation is better than mutual betrayal, while the best and worst outcomes come on either side of a unilateral betrayal. You could reformulate the problem in many contexts – for example, two rival companies might have to decide between spending either a high or low amount on advertising, given that they will only get a greater market share if they spend more than their competitor. While the collective “best” option might be to both spend a small amount and share the demand equally, each company might be tempted to spend a higher amount in the hopes of dominating the market and pocketing a greater profit. If both do so, then they will each have a similar number of customers as if they had both spent the smaller amount, but will be out the extra sum of advertising money.

All formulations of the dilemma have one thing in common: no matter what your fellow “player” chooses, you are always better off betraying them than co-operating – if they choose co-operation, you walk away with the best possible outcome if you opt to betray, and if they choose to betray then you can avoid the worst by betraying them in return. Unfortunately, the other player can follow the same line of reasoning themselves, so that if both of you act “rationally” you will both choose to betray the other despite the fact that mutual co-operation is better for both of you. The story doesn’t end here, however – the situation becomes much more interesting when you might have the opportunity to play again against the same person. Does having the chance to punish your competitor for breaking co-operation change the situation? This blog post will show that in the case that the game is repeated indefinitely, the answer is yes.


Nash Equilibrium

It turns out that when the prisoner’s dilemma is repeated indefinitely, there is no longer a clear strategy that dominates any other. To compare different strategies it will be useful to consider the game theoretic concept of a Nash equilibrium.

In a 2-player game, player 1 can select any action A^1 in an action set \mathcal{A}^1. Similarly, player 2 can choose an action A^2 from set \mathcal{A}^2. In the prisoner’s dilemma both players have the same action set, \mathcal{A} = \{co-operate, betray\}. Once all players have made their decision, players receive rewards R^1(A^1, A^2) and R^2(A^1, A^2) depending on the actions chosen. The key objective in solving such games is to identify Nash equilibrium policies, defined as pairs of strategies where neither player can guarantee a better expected outcome by unilaterally switching to a different strategy.

Nash equilibria always exist for matrix games (there may also be more than one), but unless the game is zero-sum (R^1 = -R^2), players may have different payoffs in different Nash equilibria. Note that Nash equilibrium strategies are often mixed, meaning that they can define a distribution over possible actions rather than identifying one action as optimal. For example, consider the two-player zero-sum game with the following payoffs for player 1:

                        Player 1 chooses A1    Player 1 chooses A2
Player 2 chooses A1             1                      -1
Player 2 chooses A2            -1                       1

Since the game is zero-sum, the payoffs for player 2 are the above multiplied by -1.

Here, if player 1 chooses action 1 (or 2), then if player 2 can guess their strategy they would be able to guarantee a payoff of -1. However, if player 1 randomly chooses either option with probability 1/2 then their expected payoff is 0 regardless of player 2’s strategy. By using this strategy, player 1 can guard against any potential extra knowledge or scheming on player 2’s part. In fact, if both players choose this strategy then we have a Nash equilibrium.
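We can verify this numerically. A quick sketch, with rows of the payoff matrix indexed by player 1's action and columns by player 2's (matching actions pay player 1 a reward of 1, as in the table above):

```python
import numpy as np

# Payoffs to player 1: rows = player 1's action, columns = player 2's action.
R1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])

p1 = np.array([0.5, 0.5])   # player 1 mixes uniformly over the two actions

# Player 1's expected payoff against each of player 2's pure strategies:
expected = p1 @ R1          # both entries are 0
```

Since the expected payoff is 0 against either pure strategy, it is also 0 against any mixture of them, so player 2 has no way to exploit the uniform mix.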

In the prisoner’s dilemma, your personal aim is to minimise the time you spend in prison, and you suppose that the same is true of your accomplice. We can specify the “payoffs” of the prisoner’s dilemma by the following table:

                               You choose co-operate    You choose betray
Opponent chooses co-operate            -1                       0
Opponent chooses betray                -3                      -2

Here, the optimal strategies (betray, betray) form the only Nash equilibrium point in the game.


The Iterated Prisoner’s Dilemma

If we repeat the prisoner’s dilemma game a fixed number of times, then mutual betrayal remains the only rational choice and hence the only equilibrium strategy. We can see this by considering what happens in the last time period and working backwards – the last time you play the game, the situation is exactly the same as when you play only once, since there is no future strategy to consider and your opponent will never get to retaliate against a betrayal. Given this, in the penultimate time you play, there is also no incentive to co-operate since you know that a rational opponent will betray you next even if you co-operate now. By backwards induction it remains rational to betray in every round of the game.

The key issue here is that there is a fixed termination time, a point beyond which there are no consequences to consider beyond the immediate payoffs awarded by the game. This disappears if the game will be played infinitely many times, or if neither player knows which round will be the last. As long as both consider there is a sufficiently large chance of playing again, the potential future rewards will matter more than the outcome of any single game. It remains true that mutual betrayal is the only stationary Nash equilibrium strategy (a stationary strategy is one that doesn’t depend on previous outcomes). However, if both players remember past events then there is incentive to co-operate, and it turns out that there are many possible Nash equilibrium strategies.

Take the so-called grim trigger strategy, for example – in this strategy, you choose to co-operate in the first game and continue to do so until your opponent chooses to betray you, after which you never co-operate again. If both players choose this strategy, then we have a Nash equilibrium: clearly, if you know that your opponent will choose this strategy, then you will not benefit from choosing to betray them unprompted as you will collect a good reward once and then be stuck in mutual betrayal forevermore. Your best bet is any strategy (including grim trigger) which co-operates in response to a co-operative opponent, as then you will consistently get the better mutual co-operation reward.

Another Nash equilibrium strategy that aims for mutual co-operation is the tit-for-tat strategy – here, you co-operate in the first game and then always choose the action your opponent played last. This would probably serve better than grim trigger in practice against an opponent who doesn’t know exactly which strategy you are playing, since it offers the possibility of reconciliation without being too forgiving. It is a little harder to see that this is a Nash equilibrium strategy – notice that if your opponent chooses betrayal unprompted then they can only leave the resulting cycle of mutual betrayal by co-operating once, despite knowing they will be betrayed. As long as the reward for betraying a co-operative opponent and then co-operating knowing you will be betrayed is less than that for mutually co-operating both times instead (as is the case in most formulations of the problem), then there is nothing to be gained by betraying a tit-for-tat player.
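These strategies are easy enough to simulate. A minimal sketch using the payoffs from the table above (the strategy implementations and names are mine):

```python
# Payoffs (negated prison years) for (my action, their action):
# C = co-operate, B = betray.
PAYOFF = {("C", "C"): -1, ("C", "B"): -3, ("B", "C"): 0, ("B", "B"): -2}

def tit_for_tat(my_history, their_history):
    # Co-operate first, then copy the opponent's last move.
    return their_history[-1] if their_history else "C"

def grim_trigger(my_history, their_history):
    # Co-operate until betrayed once, then betray forever.
    return "B" if "B" in their_history else "C"

def always_betray(my_history, their_history):
    return "B"

def play(strategy1, strategy2, rounds=100):
    h1, h2, total1, total2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = strategy1(h1, h2), strategy2(h2, h1)
        total1 += PAYOFF[(a1, a2)]
        total2 += PAYOFF[(a2, a1)]
        h1.append(a1)
        h2.append(a2)
    return total1, total2
```

Over 100 rounds, mutual tit-for-tat sustains co-operation for a total of -100 each, while mutual betrayal racks up -200 each; a lone betrayer facing tit-for-tat gains on the first round and pays for it ever after.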

So, what is the “optimal” way to play in the indefinitely iterated prisoner’s dilemma? The answer to this is not actually known, and may well depend on what you know about your opponent. Clearly the two strategies suggested in the previous section are good options, and have the potential to get you a much better payoff than the betrayal strategy, despite the fact that this is also a potential Nash equilibrium strategy. If you aren’t sure of your opponent’s strategy, then tit-for-tat might be the better option since you’d rather end up in mutual co-operation than betrayal even if you are betrayed at some point early on. Indeed, this strategy generally does well in iterated prisoner’s dilemma competitions. If there is a chance of miscommunication then you might choose to play tit-for-tat but with a chance of co-operation even if you heard that your opponent has just betrayed you, so as to have the possibility of avoiding becoming stuck in mutual betrayal or alternating betrayal and co-operation.


Further Reading

  • – James Gleick
  • – Nicholas R. Miller

Vector Spaces and Teaching Your Computer to Read
Sun, 03 Apr 2022

The key issue in using text data is the sheer number of words we have to learn about! To make matters worse, we do not have the same amount of information about each word. This is because the relative frequencies of words are incredibly skewed – in a given corpus, only a small number of words will make up the majority of the word count. We’ll illustrate this with an example corpus: the Sherlock Holmes series. This has a total word count of around half a million, with just over 17,500 unique words. The sorted frequencies for each word are plotted below, and you can already see that the distribution is heavily skewed towards the more common words. In fact, the 100 most common words make up almost 60% of the total word count, while half of the words in the vocabulary appear only once or twice.

This issue persists even for very large text corpora (the Oxford English Corpus has 2 billion words and the top 100 words still make up half of that count), and causes enormous problems for dealing with text data since we have lots of information about a handful of words and only a little about the rest. Modern language models usually handle this problem by being careful about how they choose to represent words. Simple methods treat each word as its own distinct entity: a string of characters or an index in a dictionary. In reality, though, words are related to each other, and a more informative representation would capture similarities and differences in word meanings.

Word embeddings (or word vectors) are representations of words as points in a vector space, where words with similar meanings are represented by points that are close together. This reduces the dimensionality of text datasets and makes it possible to transfer knowledge between words with similar meanings, effectively increasing the amount of data we have about each word.


Example – Food Vectors

To illustrate this, here is an example of how you might represent foods as points (or vectors) in 2d space:

Here, I’ve decided the two key pieces of information about any foodstuff are temperature (hot/cold) and state (solid/liquid). This places meals that you’d consume in similar situations close together in space. If we measure vector similarity by the cosine similarity (the cosine of the angle between them), we can compute a score for how similar certain words are on a scale from 1 (same meaning) to -1 (opposite meanings). In our example, similarity(soup,stew) = cos(10°) ≈ 0.98, while similarity(soup,salad) = cos(180°) = -1.
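Cosine similarity itself is a one-liner. A sketch with made-up 2d co-ordinates (the exact points in my figure may differ, but soup and salad are placed as near-opposites):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: dot product over norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical (temperature, solidity) co-ordinates for illustration:
foods = {
    "soup":  np.array([0.9, -0.8]),
    "stew":  np.array([0.9, -0.5]),
    "salad": np.array([-0.9, 0.8]),
}

sim_soup_stew = cosine_similarity(foods["soup"], foods["stew"])    # close to 1
sim_soup_salad = cosine_similarity(foods["soup"], foods["salad"])  # exactly -1
```

Because cosine similarity only depends on direction, a big bowl of soup and a small cup of soup (the same vector scaled) would score a perfect 1.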

This representation also gives rise to some interesting observations, since mathematical operations like addition and subtraction have natural definitions for vectors. For example, considering the words as vectors gives sense to the sum “yoghurt – cold + hot”, which has the answer…

soup.

Obviously, this representation is not perfect (is there a meaningful difference between soup and hot yoghurt?), but it’s not hard to imagine that if we increased the number of dimensions – by adding on extra directions like sweet/savoury, for example – it would be good enough to represent most of the meaningful differences and similarities between foods. In practice, when we use a model to learn word embeddings, the individual co-ordinates do not correspond to easily understood concepts like they did in the example above. However, we can still find interesting linear relationships between words: relationships like “king – man + woman = queen” still hold, and directions can be found that correspond to grammatical ideas like tense, so that “eat + <past tense> = ate”.


word2vec

The key question now is how to find such a representation! This is usually done by training a model for some classification task that is easy to evaluate, and using the resulting fitted parameters as word embeddings. Perhaps the most famous example is word2vec. In its “skip-gram with negative sampling” variant, the prediction task is to estimate how likely any given two words are to appear near each other in a sentence. The motivation here is that words are likely to have similar meanings if they appear in similar contexts – if I tell you “I ate phlogiston for dinner”, you’ll be able to tell from context (proximity to the words ate and dinner) that phlogiston is most likely a food and could hazard a guess at how it would be used in other sentences.

We can easily get positive examples from the text by taking pairs of words that did appear near each other, and negative examples by randomly selecting some noise words. The embeddings are fitted to maximise classification accuracy on this dataset, and the model is designed so that the resulting embeddings have high cosine similarity if the words they represent appear in similar contexts.
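Generating this training data can be sketched as follows (toy corpus, and uniform negative sampling for simplicity; real word2vec draws negatives from a smoothed unigram distribution, and the helper name is mine):

```python
import random

def training_pairs(tokens, window=2, n_negative=2, seed=0):
    """Positive (target, context, 1) pairs from a sliding context window,
    plus randomly drawn (target, noise_word, 0) negative examples."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i == j:
                continue
            pairs.append((target, tokens[j], 1))        # observed co-occurrence
            for _ in range(n_negative):
                pairs.append((target, rng.choice(vocab), 0))  # noise word
    return pairs

pairs = training_pairs("i ate phlogiston for dinner".split())
positives = [p for p in pairs if p[2] == 1]
```

In this toy sentence, "phlogiston" gets positive pairs with "ate" and "dinner", which is exactly the contextual evidence the classifier uses to place it near other food words.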

We’ll use the Sherlock Holmes corpus again as an example, using a word2vec implementation for Python to learn 100-dimensional word vectors. The results are more easily visualised by projecting them into 2 dimensions with principal component analysis – the resulting projection preserves some of the structure of the vector space, including some clusters of similar words. See below for a visualisation of 100 common words from the dataset:

Some clusters of words with high cosine similarity have been highlighted. The model did well at grouping words with similar meanings together – some examples are listed in the table below. The model was also reasonably successful at grouping words by syntactic meaning – nouns, adjectives, and verbs were usually grouped together, with verbs even grouped by tense as well as meaning.

Word       Most similar to:
you        ye (0.85), yourselves (0.83)
say        saying (0.87), bet (0.85)
said       answered (0.88), cried (0.83)
brother    father (0.88), son (0.88)
sister     mother (0.93), wife (0.92)
coat       overcoat (0.94), waistcoat (0.94)
crime      murder (0.87), committed (0.85)
holmes     macdonald (0.76), mcmurdo (0.75)

Of course, those were the highlights of this particular set of embeddings – this corpus is actually on the small side so the semantic similarities found for some words were complete nonsense. The most useful thing about word embeddings, however, is how transferable they are across corpora – it is common to use word embeddings trained on a big corpus in a language model for a small corpus, and many sets of pre-trained word embeddings are available for this purpose.

An interesting property of word embeddings is that the embedding spaces for different languages often share a similar structure – embeddings trained with word2vec for different languages have a similar geometric structure, and it is even possible to learn a linear map between the embedding spaces that allows for translation of words. The matrix for this map can be trained using a list of translations for some common words, and the resulting projections are surprisingly effective at translating words, even allowing for the detection of errors in a translation dictionary.
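Learning the translation map is just linear least squares: given matched embedding matrices X (source language) and Z (target language) for the dictionary words, find the W minimising ||XW - Z||. A sketch with random stand-in embeddings, where the target space really is a rotation of the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: 20 dictionary word pairs in 5-dimensional spaces.
X = rng.normal(size=(20, 5))
true_map, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random rotation
Z = X @ true_map                                     # "target language" vectors

# Learn the translation matrix from the dictionary pairs by least squares:
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# Any source-language vector can now be mapped into the target space:
translated = X[0] @ W
```

In practice the two spaces are only approximately related by a linear map, so the translated vector is matched to its nearest neighbour (by cosine similarity) among the target-language embeddings.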


Further Reading

  • Speech and Language Processing (Chapter 6: Vector Semantics and Embeddings) – Dan Jurafsky and James H. Martin
  • Statistics and Data Science for Text Data – Connie Trojan (my dissertation)
  • Distributed Representations of Words and Phrases and their Compositionality – Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean
  • Exploiting Similarities among Languages for Machine Translation – Tomas Mikolov, Quoc V. Le, and Ilya Sutskever

Particle Filtering and COVID-19 (Part 2 – The Bootstrap Filter)
Mon, 21 Mar 2022

This is the second part of a series on using particle filtering in epidemiology. In part 1 we saw an example of how to model observed case numbers as a partially observed random process and formulated the problem of inferring the true case numbers as a filtering problem. Unfortunately, exact inference is rarely possible for realistic models and was very computationally expensive even in our relatively simple example! This post will introduce the bootstrap particle filter, an approximate method that is much more computationally efficient and can be used in cases where it is not possible to compute the true filtering distribution.


Sequential Importance Sampling

A common way of dealing with intractable integrals or complicated probability distributions is by using Monte Carlo methods – in our case, we can simulate from the probability distribution p(x_t \,|\, y_{1:t-1}) for x_t since we know the dynamics of the disease (we can do this sequentially by simulating x_t from time 0 up to t according to the binomial updates we defined earlier). We can also compute p(y_t \,|\, x_t) directly since we know the distribution of the observation process. These are the ingredients required to use the Monte Carlo technique of importance sampling to approximate the filtering distribution – if we simulate a large number of samples (known as “particles”) for x_t we can weight them proportionally to p(y_t \,|\, x_t) and estimate properties of the distribution (for example, the mean) by treating our weighted particles as if they were a sample from the filtering distribution.

Unfortunately, it is quite inefficient to use importance sampling out of the box in this setting – most of our simulations will not be close to what really happened in the epidemic and will therefore have very small weights. We need to find a lot of plausible possibilities for the true sequence of case numbers in order to get a good estimate for the filtering distribution, otherwise we will just end up being very overconfident about the one particle we simulated that is somewhat close to the truth. This problem gets worse at each timestep as the number of possibilities for the path taken by the hidden states increases – in fact, the number of particles required to produce good estimates increases exponentially in t. In our epidemiology example (see below for a comparison of the estimated filtering distribution with the true distribution) importance sampling did well initially, but the shrinking credible intervals after t = 10 indicate that the sample weights are becoming increasingly concentrated on just a few particles.

[Figure: sequential importance sampling estimate vs the exact filtering distribution]

One issue with this approach is that we keep and use all of the particles, no matter how unlikely they are to be close to the truth. For example, in our epidemic we know that the true number of cases must be at least as big as the number of cases we observed, so any simulated particle that does not satisfy this at each time point will have a weight of 0 since it cannot possibly reflect the true number of cases.


The Bootstrap Filter

The bootstrap filter is a way of sequentially generating particles that are more concentrated in areas of high density of p(y_t \, | \, x_t). Instead of continuing to propagate all of our particles forward and sequentially updating their weights, the bootstrap filter uses the weights to resample the particles, creating a new generation of particles that can all be given equal weight. This allows particles with negligible weight to be eliminated, while we simulate several descendants of the particles with higher weights. At each timestep, we sample the next generation of particles independently and with replacement, with each particle having a probability proportional to its weight of being selected.
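The resampling step itself is a single weighted draw. A sketch with a toy particle set (the values and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy current generation: particle values and their unnormalised weights.
particles = np.array([3.0, 7.0, 8.0, 20.0])
weights = np.array([0.0, 5.0, 4.0, 1.0])
probs = weights / weights.sum()

# Resample with replacement, proportional to weight. The new generation gets
# equal weights; zero-weight particles are eliminated, and heavy particles
# may be duplicated several times.
idx = rng.choice(len(particles), size=len(particles), p=probs)
new_particles = particles[idx]
```

After resampling, each surviving particle is propagated forward independently through the model dynamics before the next weighting step.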

We’ll illustrate this on our running example via an animation – the gif below shows the propagation of 4 particles, with the true number of cases indicated by the dashed line. The dots representing each current particle are scaled according to their weights, with x’s indicating removed particles. The resulting estimate of the filtering distribution is shown at the end.

Using 100 particles (see the figure below), we obtain a pretty reasonable approximation for the exact filter mean and quantiles, and for over 1000 particles the two are virtually indistinguishable. The only difference is that the bootstrap filter algorithm takes only seconds to run (even with a huge number of particles), while the exact distribution took several hours to compute!

[Figure: bootstrap filter estimate (100 particles) vs the exact filtering distribution]

In addition, the weights within the particle filtering algorithm can be used to estimate the data likelihood. If we do not know the true values of p and p_{obs} this can be used to estimate which values are likely given our data as we can run the particle filter for different possibilities for the parameters. For example, we can roughly locate the maximum likelihood estimators in our example by estimating the data likelihood on a grid of possible parameter values (see below for an approximate contour plot of the likelihood, with the rough location of the maximum indicated).

[Figure: approximate contour plot of the estimated data likelihood over the parameters, with the rough location of the maximum indicated]

Optimising for estimates of p and p_{obs} is easy enough to do in our simple example with only two parameters, though much harder for more complicated models. Also, just estimating the parameters isn’t enough if we also want to perform inference on the hidden states, since simply substituting these into a particle filter wouldn’t reflect the fact that we are uncertain about what the true parameter values are – our uncertainty about the parameters results in a higher degree of uncertainty about the hidden states. A neat way of performing inference on the parameters and hidden states jointly is by using particle Markov chain Monte Carlo.


Further Reading

  1. How could particle filter track Thanos? – explaining particle filter without mathematics – Ziyang Yang
  2. – Arnaud Doucet and Adam M. Johansen

Particle Filtering and COVID-19 (Part 1 – The Filtering Problem) /stor-i-student-sites/connie-trojan/2022/03/14/particle-filtering-and-covid-19-part-1-the-filtering-problem/ Mon, 14 Mar 2022 21:28:32 +0000 /stor-i-student-sites/connie-trojan/?p=319

You’ve probably heard a lot about “particle filtering” in the last few years in the context of mask wearing. What you might not know is that particle filtering is also the name of a family of algorithms useful in epidemiology – in this context, “filtering” refers to statistical inference on quantities that cannot be observed directly and the “particles” are simulated possibilities for this hidden information. Particle filtering is used for inference on noisy or partially observed data – for example, in the epidemiology context the spread of a disease can be modelled as a random process which we can only partially observe, since not all infected individuals can be tested for the disease and tests are not 100% accurate.

Estimating true case numbers and monitoring the transmission rate of a disease are vital to understanding and containing its spread – in the COVID-19 pandemic such statistical analyses have been critical in informing government policy. As such, there has been much interest recently in using particle methods to analyse COVID-19 data. This blog post will introduce the filtering problem in the context of analysing a small simulated epidemic dataset, focusing on the task of predicting the true number of cases at a given time.


Epidemic Modelling

Our running example will be a realisation from the Reed-Frost epidemic model: we introduce a disease to a closed population of fixed size N and monitor who is susceptible (S), infected (I), or recovered (R) as time goes on. Initially, all are susceptible to the disease, meaning they can become infected by contact with an infectious (I) individual. After they recover from the disease they are classified as recovered (R) and cannot be infected again. It is assumed that the infectious period of the disease is short compared to its incubation period, so that individuals infected at time t will infect others at time t+1 and then recover. We assume for simplicity that all susceptible individuals independently have probability p of coming into contact with and being infected by any given infectious individual in this time. The epidemic will begin with the introduction of a single infectious individual at time 0.

Putting these assumptions together, the number of infections at each timestep will be binomially distributed, with parameters depending on the current number of susceptible individuals and the probability each one has of being infected. If we denote the number of susceptible individuals at time t by S_t and the number of new infections by I_t, we can consider the unobserved “hidden state” in the process to be x_t = (S_t, I_t), since the number of recovered individuals at time t can be computed from S_t and I_t as R_t = N - S_t - I_t.

The probability of any one susceptible individual escaping infection at time t + 1 is (1 − p)^{I_t}, since they would have to avoid contact with each of the I_t infectious individuals to escape infection. The distribution of new infections at each timestep t \geq 1 is I_{t+1} | (S_t, I_t) ∼ \text{Binom}(S_t, 1−(1−p)^{I_t} ), with S_{t+1} = S_t −I_{t+1} since the recently infected people are no longer susceptible to the disease. Since the epidemic starts with the arrival of a single infectious person, we have that I_0 = 1 and S_0 = N.

This model describes the underlying dynamics of the disease, but in a real epidemic we have an additional complication – at no point can we directly observe I_t or S_t! We will assume a fixed probability p_{obs} of detecting any given infection, reflecting how likely an infectious individual is to take a test and how likely the test is to detect the presence of the disease. This means that the number of cases we actually observe (we’ll call this y_t) is binomially distributed, given by y_t | x_t ∼ \text{Binom}(I_t, p_{obs}).
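The generative model above can be sketched in a few lines (a hypothetical helper assuming numpy; variable names follow the text, and this is not the code used to produce the figures):

```python
import numpy as np

def simulate_reed_frost(N, p, p_obs, T, seed=0):
    """Simulate hidden states x_t = (S_t, I_t) and observed case counts y_t
    from the Reed-Frost model with partial observation described above."""
    rng = np.random.default_rng(seed)
    S, I = N, 1                        # S_0 = N, one initial infective
    hidden, observed = [(S, I)], []
    for _ in range(T):
        # each susceptible escapes infection w.p. (1 - p)^I, so the number
        # of new infections is Binom(S, 1 - (1 - p)^I)
        I = int(rng.binomial(S, 1.0 - (1.0 - p) ** I))
        S -= I
        hidden.append((S, I))
        # each infection is independently detected with probability p_obs
        observed.append(int(rng.binomial(I, p_obs)))
    return hidden, observed
```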

Clear as mud? To illustrate all of that, the figure above shows a simulation from this model terminating at time T = 30, with N = 1000, p = 0.0015, and p_{obs} = 0.2, which will be used as a running example. Clearly, the observed number of cases gives an incomplete picture of the epidemic, as the true number of cases reaches much higher. If we know p_{obs} = 0.2 (for the moment we will assume we do), it is tempting to estimate the true number of cases by scaling the observed case numbers by a factor of 1/0.2 – indeed, if we only had one datapoint this would be the best guess we could possibly make for the true number of cases since on average we detect 20% of infections. The dashed line on the graph indicates the outcome of using this approach – it is not too terrible as an estimate of the true number of cases but is extremely variable.

Such sudden and extreme swings in the number of cases are very unlikely in the Reed-Frost model! Ideally, we’d like to take this knowledge into account in our guesses for the number of cases. This requires using all of our observations so far to get a better idea of the situation – keeping track of the total number of infections detected to date, for instance, gives an indication of how fast the disease is currently spreading. At around time t=11, say, we can guess that over 30% of the population has already been exposed to the disease (by multiplying the total number of infections detected by 1/0.2), and use our knowledge of the Reed-Frost model to infer that we are close to the peak and the disease will soon start dying out.

Formally, we can use all of the available information to make inferences about the true hidden state by using the filtering distribution, a probability distribution that captures our knowledge (and uncertainty) about the hidden states at time t given our sequence of observations up to that time. We can use this distribution to compute a best guess for the true number of cases or to estimate the probability that this took a certain value. It is important to consider this as a probability distribution, since when there is randomness in the observation process it is not possible to be completely certain of what the hidden states actually were.


The Filtering Distribution

The filtering distribution is the name given to the distribution p(x_t \, | \, y_1,…,y_t), where as before x_t is the hidden state and y_t our noisy or partial sequence of observations. This can be calculated recursively – supposing we know the filtering distribution for time t − 1, we can obtain the filtering distribution for time t using Bayes’ theorem:

p(x_t \, | \, y_{1:t} ) = \frac{ p(y_t \,|\, x_t) \, p(x_t \,|\, y_{1:t-1}) }{ p(y_t \,|\, y_{1:t-1}) } \, ,

where the predictive distribution p(x_t \,|\, y_{1:t-1}) can be found by integrating (or summing if the state space is discrete, as it is in our example) over possibilities for x_{t-1} and the normalising constant p(y_t \,|\, y_{1:t-1}) by integrating over possibilities for x_t. Unfortunately, in most cases the story does not end here, as we typically cannot evaluate the required integrals exactly. Note that since the population in our running example is closed and has fixed size, the state space for this model is finite, with around N possibilities for each of S_t and I_t and hence a state space of order N². This means that we can compute the exact filtering distribution by exhaustively summing over every possibility, although this can become prohibitively computationally expensive for large populations – in our small example this took six hours to calculate! This exact filtering distribution is represented below, and we can see right away that its mean is a much better guess for the true number of cases than we got by just scaling up the observed number of cases.
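For a generic finite state space, the recursion can be sketched as follows (a hypothetical helper assuming numpy; in our example the pair x_t = (S_t, I_t) would need to be flattened into a single index, and it is the size of that state space that makes the exact computation slow):

```python
import numpy as np

def discrete_filter(prior, transition, obs_lik, ys):
    """Exact filtering recursion for a model with K discrete states.

    prior      : initial distribution over the K states
    transition : K x K matrix, transition[i, j] = p(x_t = j | x_{t-1} = i)
    obs_lik    : obs_lik(y, j) = p(y | x_t = j)
    Returns the list of filtering distributions p(x_t | y_{1:t}).
    """
    filt = np.asarray(prior, dtype=float)
    K = len(filt)
    out = []
    for y in ys:
        predict = filt @ transition            # p(x_t | y_{1:t-1}): sum over x_{t-1}
        update = predict * np.array([obs_lik(y, j) for j in range(K)])
        filt = update / update.sum()           # divide by p(y_t | y_{1:t-1})
        out.append(filt)
    return out
```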

We will explore what to do when the exact calculation is not possible (or is computationally infeasible) in part 2 of this post, and we will see that in our example a very good approximation can be obtained in mere seconds.


Further Reading

  1. How NASA used the Kalman Filter in the Apollo program – Jack Trainer
  2. – Arnaud Doucet, Nando de Freitas & Neil Gordon 

Stable Marriage and Kidney Donation /stor-i-student-sites/connie-trojan/2022/02/08/stable-marriage-and-kidney-donation/ Tue, 08 Feb 2022 16:59:51 +0000 /stor-i-student-sites/connie-trojan/?p=278 The situation for kidney transplants is somewhat different to that of other organs – since humans only need one kidney to survive, it is possible to receive a kidney from a living donor. Transplants from live donors are preferable, since they typically have a higher chance of success than a transplant from a deceased donor. Many patients are able to find a willing donor (a family member, for example) but such a transplant can only take place if the patient and donor are compatible. If this is not the case, a kidney exchange can be arranged – for example, a paired exchange where two patient-donor pairs exchange donated kidneys.

Given a large number of incompatible patient-donor pairs, can we organise a large-scale kidney exchange so that as many patients as possible receive a compatible kidney? This idea was suggested by Roth, Sönmez, and Ünver in 2002, who derived a solution from the work of Gale and Shapley on matching problems. Roth and Shapley jointly won the Nobel prize in economics in 2012 for their contributions to the field.

We’ll start by looking at the stable marriage problem, a simple matching problem solved by Gale and Shapley in the 1950s. We’ll then reformulate the problem for the kidney donation setting and look at how to solve it using similar ideas.


The Stable Marriage Problem

The formulation of the stable marriage problem as originally solved by Gale and Shapley is as follows: we wish to play matchmaker between a group of n women and one of n men. The individuals in each group would like to marry a member of the other, and have a personal preference order on members of the other group. The aim is to produce a matching (bijection) between the two groups that is stable, i.e. where no man and woman who are not married to each other under the matching mutually prefer each other over their assigned spouses.

Gale and Shapley showed that such a matching always exists, and can be constructed with the Gale-Shapley algorithm, in which members of one group propose to their favourite candidates, who provisionally accept the best offer made to them.

To illustrate this, we’ll look at a simple example with 6 people.

First, the men propose to their first choices, and the women who receive a proposal provisionally accept the one from their favourite suitor. In this case, a and b both propose to A, who decides to accept a. C accepts her only suitor, c. The rejected men then propose to their second choices – here b must make a second proposal, to C this time.

The process repeats until everyone is engaged, with C accepting the new proposal from b and c finally being engaged to B.

The algorithm always terminates and constructs a valid matching, since men do not propose to the same woman more than once and everyone is compatible with every member of the other group. The resulting matching is also always stable – for example, if b prefers A to his current partner he must have proposed to her at some earlier stage and been rejected in favour of someone she liked better, so A must like her current partner more than b.
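A minimal implementation of the proposing version of the algorithm is sketched below. The preference lists in the test are hypothetical (the post does not give them in full), but are chosen to be consistent with the small example above:

```python
def gale_shapley(men_prefs, women_prefs):
    """Gale-Shapley stable matching with men proposing.

    men_prefs / women_prefs: dicts mapping each person to their preference
    list over the other group (most preferred first).
    Returns a dict mapping each woman to her matched man.
    """
    # precompute each woman's ranking of the men for O(1) comparisons
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    free = list(men_prefs)                    # currently unengaged men
    next_choice = {m: 0 for m in men_prefs}   # index of next woman to try
    engaged = {}                              # woman -> provisional fiancé
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m                    # w provisionally accepts
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])           # w trades up; old fiancé is freed
            engaged[w] = m
        else:
            free.append(m)                    # w rejects m; he tries again later
    return engaged
```

Each man proposes to each woman at most once, so the loop terminates after at most n² proposals.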


The Kidney Exchange Problem

This matching problem has some similarities to the stable marriage problem – we want to establish a matching between a group of n kidney donors and n patients in need of transplant, and each patient has a preference order on kidneys according to compatibility.

There are some additional complications, however – not every kidney is compatible with every patient (this is why the patient-donor pairs cannot simply perform a direct exchange). This means it may not always be possible to match every patient to a live donor – if a compatible kidney is not currently available, a patient may choose to instead be given a high priority spot on the waiting list in exchange for their donor’s kidney. The donors themselves do not have any preferences about where their donated kidney goes, but will not donate it unless their partner receives a kidney or a high priority spot on the waiting list in exchange.

The problem setup is as follows: each patient has a preference list, reflecting their preferences over the set of kidneys compatible with them and ending in w if the patient and donor are willing to exchange their kidney for a priority place on the waiting list. w is always last on the preference list since transplant from a compatible living donor is always preferable. A kidney exchange occurs as a cycle (a closed loop where each kidney is donated to the previous patient in the cycle) or a chain ending in w (where the last member of the donation chain accepts a spot on the waiting list). Roth et al. presented the following algorithm (called the Top Trading Cycles and Chains or TTCC algorithm) to solve this problem, building on work by Gale and Shapley.

Each pair first points to their first choice of kidney. We represent this as a directed graph with arrows pointing from each pair to the pair with their chosen kidney. If a cycle is formed then all members of the cycle are removed and their transplants can take place. If no cycle was formed then the longest chain ending in w is selected (with ties broken by e.g. patient priority) and the corresponding transplants can take place. The free kidney belonging to the first pair in the chain is left available for selection.

This process then repeats, with pairs selecting their second (third, fourth, …) preferences if their first has been removed from the system. At each stage it is guaranteed that at least one cycle or chain will be formed. The process terminates when either all patients have left the system, or when all patients remaining in the system have exhausted their list of options, in which case they may choose to wait for the next exchange to be run. Unclaimed kidneys (found at the start of a w-chain) are donated to patients on the waiting list.
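The cycle-finding core of the procedure can be sketched as follows. This is a simplified version of TTCC handling only cycles (no w-chains or tie-breaking rules), under the assumption that every remaining patient always has an acceptable kidney available:

```python
def top_trading_cycles(prefs):
    """Simplified Top Trading Cycles: the cycles-only special case of TTCC.

    prefs: dict mapping each patient-donor pair to their preference list
    over the other pairs' kidneys (most preferred first). Assumes each
    remaining pair always has an acceptable kidney still in the pool.
    Returns a dict mapping each pair to the pair whose kidney they receive.
    """
    remaining = set(prefs)
    allocation = {}
    while remaining:
        # every remaining pair points at their favourite remaining kidney
        target = {i: next(k for k in prefs[i] if k in remaining)
                  for i in remaining}
        # following the pointers from any node must eventually revisit a
        # node, and the revisited portion is a cycle
        seen, node = [], next(iter(remaining))
        while node not in seen:
            seen.append(node)
            node = target[node]
        cycle = seen[seen.index(node):]
        # everyone in the cycle receives the kidney they pointed at
        for i in cycle:
            allocation[i] = target[i]
        remaining -= set(cycle)
    return allocation
```

Each pass removes at least one cycle from the pool, so the procedure terminates; the full TTCC algorithm of Roth et al. additionally handles chains ending in a waiting-list spot.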

To illustrate this we will consider a simple example with 7 patient-donor pairs.

In the first round, each patient points to their first choice of kidney. A cycle has formed (highlighted in red), so patients (and kidneys) 1, 2 and 3 can immediately leave the system and take part in transplants.

Next, patient 4 must select their second choice – in this case, no more compatible kidneys are available and they settle for a spot on the waiting list. A w-chain is formed (highlighted in red) and patients 4, 5 and 6 can leave the system, although we have yet to decide what to do with donor 6’s kidney.

In the next round patient 7 chooses their next best available choice (donor 6’s kidney) and the w-chain from earlier can be extended.

Finally, all patients have been allocated a kidney or a spot on the waiting list. The remaining kidney from donor 7 will be given to the highest priority compatible patient on the waiting list.

This algorithm does not always result in a matching for all patients and donors, but it does always result in a matching that is Pareto efficient – there is no other matching that is better for some patients without being worse for others. Hence the matching is stable in the sense that no other would be unanimously preferred by all patients, and no subgroup can all improve their situation by coming to a different agreement amongst themselves. The algorithm is also strategy-proof – no patient can achieve a better outcome by lying about their preferences.


Further Reading

  1. – Ian Rose
  2. – Jordi Massó
  3. – D. Gale and L. S. Shapley
  4. – Alvin E. Roth, Tayfun Sönmez, M. Utku Ünver
