Posted on October 9, 2017 by Yash.



Maximum Likelihood

Maximum Likelihood Estimation (MLE) is prone to overfitting when the number of samples is small. Suppose a coin is tossed 5 times and you have to estimate the probability of heads. The maximum likelihood estimate is simply #Heads / #Total tosses, which follows from assuming the samples were generated by a binomial distribution. Now consider an experiment where, out of 5 tosses, we end up with all 5 heads, or with 4 heads and 1 tail. The MLE would then be 1.0 or 0.8, which we know is inaccurate: a fair coin has two equally likely outcomes, heads or tails, so the unbiased probability of heads should be 0.5. As the number of coin tosses increases, the estimate converges toward a more realistic value. To incorporate this belief, a conjugate prior is introduced. Below we illustrate with an experiment how we can approach the true probability of heads by increasing the number of samples and harnessing the conjugate prior known as the Beta distribution.
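As a quick illustration of how unstable the MLE is at small sample sizes, here is a minimal Python sketch; the fair-coin bias of 0.5, the random seed, and the sample sizes are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
true_p = 0.5                     # assumed fair coin

for n in [5, 50, 5000]:
    tosses = rng.binomial(1, true_p, size=n)  # 1 = heads, 0 = tails
    mle = tosses.sum() / n                    # MLE: #Heads / #Total tosses
    print(f"n={n:5d}  MLE of P(heads) = {mle:.3f}")
```

With only 5 tosses the estimate can easily come out as 0.8 or 1.0; with thousands of tosses it settles near 0.5.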

Conjugate Prior

A conjugate prior is a prior distribution over the latent variable whose mathematical form matches that of the likelihood, so that the posterior belongs to the same family as the prior. In the above scenario of a coin toss, if we assume that the probability of heads, p, is itself a random variable drawn from a probability distribution, then we can say that p was picked according to the Beta distribution:

$$\mathrm{Beta}(p;\,\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$$

Here alpha and beta are parameters that govern the shape of the Beta distribution. The Beta density has the same form as the binomial likelihood of observing h heads and t tails in n = h + t tosses:

$$P(D \mid p) = \binom{h+t}{h}\, p^{h}\,(1-p)^{t}$$

Hence, multiplying the likelihood by the prior, we can derive the probability of the coin toss as a posterior formulation:

$$P(p \mid D) \propto p^{h}(1-p)^{t}\cdot p^{\alpha-1}(1-p)^{\beta-1} \;\Longrightarrow\; P(p \mid D) = \mathrm{Beta}(p;\,\alpha+h,\,\beta+t)$$

which essentially implies that the posterior probability of heads depends on the number of heads and the number of tails observed. The mean of this Beta distribution, (alpha + h) / (alpha + beta + n), is the probability estimate, while its variance measures the uncertainty involved. We can see that the uncertainty or variance reduces as the number of samples increases.
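For concreteness, here is the update worked through on the small-sample case from above, assuming a weak prior of alpha = beta = 2 (this prior choice is an assumption for illustration) and the 4-heads, 1-tail outcome:

$$P(p \mid D) = \mathrm{Beta}(p;\,2+4,\,2+1) = \mathrm{Beta}(p;\,6,\,3), \qquad \mathbb{E}[p] = \frac{6}{9} \approx 0.67, \qquad \operatorname{Var}[p] = \frac{6\cdot 3}{9^{2}\cdot 10} \approx 0.022$$

The posterior mean of about 0.67 is noticeably less extreme than the MLE of 0.8, and both the mean and the variance keep moving toward 0.5 and 0 respectively as more tosses are folded in.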

Python Experiment

Let us now simulate this using a sample of the data collected at the University of California, Berkeley, where 40,000 coin tosses were performed:
https://www.stat.berkeley.edu/~aldous/Real-World/coin_tosses.html

Define the parameters of the distribution and the experiment. We consider the coin tosses at several different sample sizes to see how the mean and variance of the posterior behave as the number of samples increases. The toss results are loaded into a data frame, and we then define quantities such as the mean and the variance of the Beta distribution.
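A minimal sketch of this setup, assuming the Berkeley tosses have been saved locally as a CSV with one 0/1 outcome per row; the file name berkeley_coin_tosses.csv, the column name outcome, the prior parameters, and the sample sizes are all hypothetical choices, not taken from the original post:

```python
import numpy as np
import pandas as pd

# Weakly informative Beta prior (hypothetical choice).
alpha_0, beta_0 = 2.0, 2.0

# Hypothetical local copy of the Berkeley data: one row per toss, 1 = heads, 0 = tails.
df = pd.read_csv("berkeley_coin_tosses.csv")
tosses = df["outcome"].to_numpy()

# Sample sizes at which we will inspect the posterior.
sample_sizes = [10, 100, 1000, 10000, 40000]

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var
```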

Let us now update the values of alpha and beta as we increase the number of samples.
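Continuing the sketch above, the posterior parameters at each sample size are just the prior parameters plus the observed head and tail counts:

```python
posteriors = []
for n in sample_sizes:
    heads = int(tosses[:n].sum())
    tails = n - heads
    a, b = alpha_0 + heads, beta_0 + tails      # Beta(alpha + h, beta + t)
    mean, var = beta_mean_var(a, b)
    posteriors.append((n, a, b))
    print(f"n={n:6d}  alpha={a:8.0f}  beta={b:8.0f}  mean={mean:.4f}  var={var:.2e}")
```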

We can now plot the resulting posterior densities: estimates from the smaller samples are shown below as dotted lines, and the more accurate estimates from the larger samples as solid lines.
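One way to reproduce such a plot, continuing from the snippets above and using scipy.stats.beta for the posterior densities; the cut-off of 1,000 samples for switching from dotted to solid lines is an arbitrary choice:

```python
import matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist

p_grid = np.linspace(0.0, 1.0, 500)
for n, a, b in posteriors:
    style = ":" if n < 1000 else "-"            # dotted for small samples, solid for large
    plt.plot(p_grid, beta_dist.pdf(p_grid, a, b), style, label=f"n={n}")

plt.axvline(0.5, color="grey", linewidth=0.5)   # probability of heads for a fair coin
plt.xlabel("Probability of heads")
plt.ylabel("Posterior density")
plt.legend()
plt.show()
```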

You can see that as the sample size grows, the mean centers around 0.5 and the variance becomes much lower. This is an example of how using conjugate priors and formulating a Bayesian posterior helps us arrive at the true estimate.

Excited about this content? Read more by going to our Colaberry Blog
