Introduction
In my previous article we derived Bayes’ theorem from conditional probability. If you are unfamiliar with Bayes’ theorem, I highly recommend reading that article before carrying on with this one:
In this article we will use Bayes’ theorem to update our beliefs, and show how, with more data, we become more certain of our hypothesis.
Bayes’ Theorem
We can write Bayes’ theorem as follows:
P(H|D) = P(D|H) × P(H) / P(D)
- P(H) is the probability of our hypothesis, known as the prior. This is how likely our hypothesis is before we see any evidence/data.
- P(D|H) is the likelihood: the probability of observing our evidence/data given that our hypothesis is true.
- P(H|D) is the probability that our hypothesis is correct given our data/evidence. This is commonly known as the posterior.
- P(D) is the probability of our evidence/data. This is referred to as the normalising constant and is the sum of the products of the likelihoods and priors over all hypotheses:
P(D) = Σᵢ P(D|Hᵢ) × P(Hᵢ)
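To make these pieces concrete, here is a minimal Python sketch of the calculation for a discrete set of hypotheses. The hypothesis names and numbers are made up purely for illustration, not taken from the article:

```python
# A minimal sketch of Bayes' theorem over a discrete set of hypotheses.
# "H1"/"H2" and the probabilities below are illustrative placeholders.

def posterior(priors, likelihoods):
    """Return P(H|D) for each hypothesis given P(H) and P(D|H)."""
    # Normalising constant P(D): sum of likelihood x prior over all hypotheses
    p_data = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / p_data for h in priors}

priors = {"H1": 0.5, "H2": 0.5}        # P(H)
likelihoods = {"H1": 0.8, "H2": 0.3}   # P(D|H)
print(posterior(priors, likelihoods))  # P(H|D)
```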
Bayesian Updating
We can use Bayes’ theorem to update our belief in a hypothesis when new evidence comes to light.
For example, given some data D that contains a single data point d₁, our posterior is:
P(H|d₁) = P(d₁|H) × P(H) / P(d₁)
Let’s say we now acquire another data point d₂, so we have more evidence with which to update our belief (posterior). Our old posterior now becomes our new prior, because it represents our belief in the hypothesis before seeing d₂:
P(H|d₁, d₂) = P(d₂|H) × P(H|d₁) / P(d₂)
In terms of updating, we say that the posterior is proportional to the product of the likelihood and prior:
P(H|D) ∝ P(D|H) × P(H), i.e. posterior ∝ likelihood × prior
This updating may seem arbitrary, but let’s now go through an example to make it more concrete.
We normally omit the denominator P(D) as it is just a normalising constant that makes the probabilities sum to 1. There is a great thread on Stack Exchange that explains this very well.
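As a rough sketch of this updating loop in Python, the posterior computed from d₁ simply becomes the prior when d₂ arrives. The hypotheses and likelihood values below are hypothetical placeholders chosen only to show the mechanics:

```python
# A sketch of sequential Bayesian updating: the posterior from d1 becomes
# the prior for d2. Hypotheses and probabilities are illustrative only.

def update(prior, likelihood):
    """One Bayesian update over a discrete set of hypotheses."""
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    p_data = sum(unnormalised.values())  # normalising constant P(d)
    return {h: v / p_data for h, v in unnormalised.items()}

prior = {"H1": 0.5, "H2": 0.5}          # P(H)
likelihood_d1 = {"H1": 0.7, "H2": 0.2}  # P(d1 | H)
likelihood_d2 = {"H1": 0.6, "H2": 0.4}  # P(d2 | H)

posterior_d1 = update(prior, likelihood_d1)            # P(H | d1)
posterior_d1_d2 = update(posterior_d1, likelihood_d2)  # P(H | d1, d2): old posterior is the new prior
print(posterior_d1, posterior_d1_d2)
```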
Example
Let’s say I have three different dice with three different number ranges:
- Dice 1: 1–4
- Dice 2: 1–6
- Dice 3: 1–8
We randomly select a dice and roll it three times. Using these rolls (data), we can compute how likely it is that we picked up dice 1, 2 or 3 after each roll (posterior).
First Roll
In the first roll, we get the number 4. What is the probability that we selected dice 1, 2 or 3?
We can compute this using Bayes’ theorem as follows:
P(Dice 1 | roll 4) = (1/4 × 1/3) / P(roll 4) ≈ 0.46
P(Dice 2 | roll 4) = (1/6 × 1/3) / P(roll 4) ≈ 0.31
P(Dice 3 | roll 4) = (1/8 × 1/3) / P(roll 4) ≈ 0.23
with P(roll 4) = (1/4 + 1/6 + 1/8) × 1/3 = 13/72 ≈ 0.18
Where we have used the following values:
- Prior, P(Dice), is simply 1/3 (≈ 0.33) because each dice has an equal chance of being selected.
- Likelihood, P(roll 4 | Dice), is just the probability of rolling a 4 for each dice.
- Probability of the data (normalising constant), P(roll 4), is just the sum of the products of each likelihood and its prior.
So after this first roll, dice 1 is the most likely. This makes sense because it only has 4 outcomes, so for any roll between 1 and 4 it will be the most likely dice.
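To check the arithmetic of this first update, here is a short Python sketch using exact fractions. The priors and likelihoods follow directly from the dice ranges listed above:

```python
from fractions import Fraction as F

# Equal prior for each dice, P(Dice) = 1/3
priors = {"Dice 1": F(1, 3), "Dice 2": F(1, 3), "Dice 3": F(1, 3)}
# Likelihood P(roll 4 | Dice): a 4 can come up on any of the three dice
likelihoods = {"Dice 1": F(1, 4), "Dice 2": F(1, 6), "Dice 3": F(1, 8)}

p_roll_4 = sum(likelihoods[d] * priors[d] for d in priors)  # normalising constant
posteriors = {d: likelihoods[d] * priors[d] / p_roll_4 for d in priors}
for d, p in posteriors.items():
    print(d, p, float(p))  # Dice 1: 6/13 ≈ 0.46, Dice 2: 4/13 ≈ 0.31, Dice 3: 3/13 ≈ 0.23
```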
Second Roll
Using the same dice, we now roll a second time and get a 2. However, we now have new priors: the posteriors calculated above after rolling a 4.
Therefore, the probabilities are now:
P(Dice 1 | roll 2) = (1/4 × 0.46) / P(roll 2) ≈ 0.59
P(Dice 2 | roll 2) = (1/6 × 0.31) / P(roll 2) ≈ 0.26
P(Dice 3 | roll 2) = (1/8 × 0.23) / P(roll 2) ≈ 0.15
With this new information, dice 1 has become increasingly likely to be the one we picked up.
Third Roll
We roll a third time with our chosen dice and get a 5. Using our previous posterior as our new prior, the probabilities are now:
P(Dice 1 | roll 5) = (0 × 0.59) / P(roll 5) = 0
P(Dice 2 | roll 5) = (1/6 × 0.26) / P(roll 5) ≈ 0.70
P(Dice 3 | roll 5) = (1/8 × 0.15) / P(roll 5) ≈ 0.30
The probability that it is dice 1 is now 0 because it is impossible to roll a 5 with dice 1. Hence, given these three data points (rolls), dice 2 is the one we most likely picked up!
The hypothesis with the highest posterior is known as the maximum a posteriori (MAP) estimate. It is the Bayesian analogue of the maximum likelihood estimate and corresponds to the mode of the posterior distribution.
The three posterior values for the dice form a posterior distribution, albeit a very small one!
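Putting the whole example together, here is a short Python sketch that runs all three updates in sequence (rolls of 4, 2 and 5 as above) and reports the MAP:

```python
from fractions import Fraction as F

dice = {"Dice 1": 4, "Dice 2": 6, "Dice 3": 8}  # number of faces per dice
rolls = [4, 2, 5]                               # the observed data

belief = {d: F(1, 3) for d in dice}             # equal prior for each dice
for roll in rolls:
    # P(roll | dice) is 1/faces if the roll is possible on that dice, otherwise 0
    likelihoods = {d: (F(1, faces) if roll <= faces else F(0)) for d, faces in dice.items()}
    p_roll = sum(likelihoods[d] * belief[d] for d in dice)               # normalising constant
    belief = {d: likelihoods[d] * belief[d] / p_roll for d in dice}      # posterior -> new prior
    print(f"After rolling {roll}:", {d: round(float(p), 3) for d, p in belief.items()})

map_estimate = max(belief, key=belief.get)      # mode of the posterior distribution
print("MAP:", map_estimate)                     # Dice 2
```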
Conclusion
In this article we have shown how you can use Bayes’ theorem to update your beliefs when presented with new data. This way of doing statistics mirrors how we think as humans: new information can either reinforce or change our beliefs.
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no “fluff” or “clickbait,” just pure actionable insights from a practicing Data Scientist.
Connect With Me!
Something Extra!
My parents live in East Sussex and last weekend I went to visit them. During my trip we had dinner in a town named Tunbridge Wells. Do you know who lived there? Thomas Bayes!
If you are in the area it is well worth checking out!
Photo by author.