All models are wrong.: July 2010

On my previous post about Paul the octopus, a commenter asked a couple of questions which I thought merited a separate post to address. The first:

"There is of course a detail that you've (likely intentionally) overlooked, which is that the chance of winning a football match is not usually exactly 50-50. Given that Germany are one of the world's top teams they could be expected to win more matches than they lose.

Having done a bit of research, I've discovered that since Paul started making his predictions (at least for the public) at the start of Euro 2008, Germany have won 22 games, lost only seven and drawn four. Their win record thus stands at 66.7% over the last two years, which is probably a fairer representation of their chances of victory in a randomly determined match.

Can your analysis take account of this?"

This is an interesting question, and it boils down (as happens surprisingly often with probability) to a matter of perspective.

Suppose you have a friend called Peter who knows a bit about football. He's successfully predicted the results of the same six games that Paul has. Since Peter knows about football, he knows that the chance of Germany beating Australia (for example) was probably not exactly 50%. Does this matter?

Well, not really. The analysis we carried out last time was testing a specific hypothesis - that Paul was picking teams at random. This was our 'null' hypothesis, our default state of belief, if you will. Our 'alternative' hypothesis was that he has done better than you'd expect him to by chance alone. In testing this I claimed that Paul's chance of predicting the winner - if he's just picking at random - is 50/50. Crucially, this doesn't depend on the real chances of either outcome. This might not seem intuitive at first, but imagine Paul was picking the team after the game had happened - at this point the winner is known, so if he's picking at random he has a 50% chance of picking the right team. Since Paul's picking doesn't (we presume) interfere with the outcome of the game, if we're assuming he's picking 'blind' then it doesn't matter whether he chooses before or after the result is determined.

So what about Peter? We could test the same hypothesis and we would come to the same conclusion. The only difference is that we're not (as) impressed because we would expect him to be doing better than chance anyway - he has extra information to help make his decisions. Paul, meanwhile, is just an octopus, and so no-one would expect him to know anything (except possibly how to count to eight).

On a separate note, and with regards to the probabilities we've calculated telling us that Paul has some apparently incredible ability, it's worth stressing that that isn't what we've shown either. All we've done is shown that if Paul was picking at random (as - call me a sceptic - he probably was) he's just got quite lucky. This in itself isn't really that remarkable though - Paul was only brought to our attention after a string of successful predictions. There may well have been hundreds of other octopuses/coins/babies making similar predictions and getting them wrong, and we've just got to see the one who got them right. If you see a golfer hit a hole in one it seems remarkably improbable, but if you think about all the millions of shots that didn't go in, that single event occurring doesn't seem so incredible.

But anyway, on to the second question:

"Secondly, I would note that there are three possible results in most football matches (win, loss, draw) rather than two, although there seems to be no way for Paul to predict anything other than a win or loss. So far none of the matches he has made predictions for have resulted in a draw, but the possibility exists nonetheless. How does that affect the overall dataset?"

This is a good point (and one which I ignored previously for the sake of keeping things simple), and an interesting one to discuss.

As we've discussed, if our (null) hypothesis remains that Paul is picking at random, his probability of picking either team is just 0.5. However, since it's not certain that one of those teams will go on to win the game, his chance of picking the winner is actually going to be less than that. For instance, if 2 in 3 games end in one of the two teams winning, Paul then has a 1 in 3 chance of picking the team that wins, a 1 in 3 chance of picking the team that loses, and a 1 in 3 chance of there being no winning team to be picked at all.

What this amounts to is that Paul's chances of correct predictions are in fact even lower than those we'd already calculated, but unless you are willing to believe an octopus has been keeping an eye on the football pages of Bild, the chances are he's just very lucky.

An octopus named Paul has been making the news due to his alleged ability to correctly predict the winner of Germany's international football matches. It started with Euro 2008, where he supposedly called 4 of Germany's 6 games correctly. The BBC reported this as "nearly 70%", which is perhaps being a little generous, as a 70% success rate sounds rather more impressive than correctly making 4 out of 6 50/50 guess.

For the World Cup Paul has (apparently) correctly picked the results of Germany's 5 games up until tonight's semi-final, where he has controversially chosen Spain to triumph. So is Paul a Predicting Phenomenon, or just lucky?

We'll start with his World Cup picks where he's got 5 out of 5 right (so far). Our null hypothesis is that Paul is merely picking at random, and since each pick is a 50/50 choice this is the same as saying the probability he picks correctly is 0.5. The probability of getting 5 correct selections is then the same as tossing a coin 5 times and getting 5 heads. This is easy to compute, as we just multiply the probabilities together to get 0.5 x 0.5 x 0.5 x 0.5 x 0.5 = (0.5)⁵ = 1/32 or about 3%. That seems pretty unlikely (although not too astronomical).

Amusingly, were we performing a statistical hypothesis test, we would in fact be likely to say that the data are not consistent with the null hypothesis that Paul is picking at random. This is because the probability that he would have got all 5 predictions correct is less than 5%, the standard cut-off used in hypothesis testing (we would say "the data are significant at the 5% level"). Of course, this highlights the danger of the common practice of just looking at a p-values (which is what our probability above is) and concluding that the null hypothesis must be true or false - it would take a rather stronger run of successes to convince most people that an octopus could really correctly predict football results. A 5% significance level means that even if our null hypothesis is true, an outcome will appear 'significant' (and we would question the null hypothesis) if the chances of it happening are less than 1 in 20. This really isn't that unlikely.

We do have more data, however, thanks to Paul's Euro 2008 picks. This takes his record to 9 correct out of 11 - is this statistically significant as well? Once again we want to calculate the probability that Paul would get this success rate picking at random, but it's slightly harder to work out this time. What we want to know is the probability that Paul would be at least this successful were he picking at random. So whereas before we just had to calculate the probability of 5 heads from 5 tosses, here we need to calculate the probability of 9 heads from 11 tosses, 10 heads from 11 tosses, and 11 heads from 11 tosses; adding these three probabilities up will tell us how 'lucky' Paul is.

So 11 heads from 11 tosses is easy, like the case with 5 out of 5, it's just 0.5 multiplied by itself 11 times. What about 10 heads, or 9? Things get a little trickier. Whilst there is only one way to get 11 heads from 11 tosses, there are several ways to get 9 heads. This might sound odd, but if you imagine tossing a coin twice, you can get either:

1) Tails followed by tails (TT)
2) Heads followed by heads (HH)
3) Tails followed by heads (TH)
4) Heads followed by tails (HH)

All of these outcomes are equally likely, but two of them (HT and TH) correspond to getting one head and one tail, and it's this which makes computing the probability of 9 heads from 11 tosses a bit tricky. Fortunately there's a simple formula for calculating this, known as the binomial coefficient. I'll spare you the details (since it's mathsy, and you can read Wikipedia if you like), and tell you how to use Google to get the number you want. Just type in "x choose y" and Google will tell you how many ways there are to get y heads from x coin tosses. Here, we want 11 choose 9, which gives us 55 ways to get 9 heads from 11 tosses. The probability of getting any one particular combination of 9 heads and 2 tails is just 0.5 multiplied by itself 11 times; once for each 50/50 coin toss. Since there are 55 different ways of doing this we then want 55 times this to allow for each possibility. So the final probability that one would get 9 heads from 11 tosses is 55 x (0.5)¹¹, about 2.7% or 1 in 37.

Similarly, we see there are 11 ways to get 10 heads from 11 coin tosses, so the probability of exactly 10 heads is 11 x (0.5)¹¹, about 1 in 186.

We can now put these three probabilities together and add them up to give Paul's prediction p-value as (1 + 11 + 55) x (0.5)¹¹ = 3.3% or about 1 in 30.

So even taking Paul's two mistakes into account, his four extra correct picks mean the chances of him managing his record at random have only increased marginally, and his punditry powers which remain statistically significant.

So how important is the match tonight? He's selected Germany's opponents Spain to progress, and of course if he's right it will be further evidence that his powers are not merely down to chance, so what if he's wrong? It would take his world cup prediction record to 5 right out of 6, would this still be statistically significant? The probability of getting at least 5 out of 6 right is calculated the same as our example above with 9 out of 11. The probability of 6 out of 6 is just (0.5)⁶, and there are 6 ways of getting 5 heads from 6 tosses, so the probability of exactly 5 out of 6 is 6 x (0.5)⁶. Adding these together we get a probability of about 11%, or 1 in 9. With just one wrong choice his picking would stop being statistically significant.

What about his lifetime record? That would go to 9 out of 12. I'll spare you the maths now and just tell you that the probability of getting at least 9 out of 12 right is 7.3%, or 1 in 14. Again, statisticians would stop heralding Paul as the mussel eating messiah.

So if he's wrong tonight he'll seem unremarkable (to p-value cultists, at least), whilst if he's right he'll be pushed further towards probabilistic stardom. This does of course demonstrate the dangers of trying to perform statistics entirely through p-values (which many practitioners do), and how susceptible they can be towards even a single result one way or the other.

Now I'm off to watch the football where I'll be rooting for the Germans. If they win the World Cup it means England come joint second, right?

All models are wrong.

Saturday 10 July 2010

Mega Football Versus Small Octopus

Wednesday 7 July 2010

I'd like to be, under the sea...

Followers

Blog Archive

About Me