Posted on October 9, 2017 by Harish Krishnamurthy.

# Maximum Likelihood

Maximum Likelihood Estimation (MLE) suffers from overfitting when the number of samples is small. Suppose a coin is tossed 5 times and you have to estimate the probability of heads. The maximum likelihood estimate is simply (#Heads / #Total coin tosses); this is the estimate obtained if we assume the samples were generated according to the binomial distribution, as shown in Figure 2. Now consider an experiment where, out of 5 tosses, we ended up with all 5 heads, or with 4 heads and 1 tail. The MLE would then be 1.0 or 0.8, which we know is not accurate: a fair coin has only two equally likely outcomes, heads or tails, so the unbiased probability of heads should be 0.5. We also know that as the number of coin tosses increases, we end up with a more realistic estimate of the probability. To incorporate this belief, a conjugate prior is introduced. Here we illustrate, with an experiment, how we can arrive at the true probability of the coin toss by increasing the number of samples and harnessing the conjugate prior known as the Beta distribution.
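The small-sample failure described above is easy to reproduce. A minimal sketch (not the post's original code) computing the MLE for the two 5-toss outcomes mentioned:

```python
# 1 = heads, 0 = tails.  The two 5-toss outcomes discussed above:
all_heads = [1, 1, 1, 1, 1]
four_heads = [1, 1, 1, 1, 0]

def mle(tosses):
    """Maximum likelihood estimate of P(heads): #Heads / #Tosses."""
    return sum(tosses) / len(tosses)

print(mle(all_heads))   # 1.0
print(mle(four_heads))  # 0.8
```

With only 5 samples the estimate lands far from the fair-coin value of 0.5, no matter how the tosses come out.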

# Conjugate Prior

A conjugate prior is a distribution over the latent variable whose mathematical form matches that of the likelihood. In the above scenario of a coin toss, if we assume the probability $\mu$ of heads is itself a random variable drawn from a probability distribution, we can say that $\mu$ was picked according to:

$$p(\mu \mid \alpha, \beta) = \mathrm{Beta}(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \mu^{\alpha - 1} (1 - \mu)^{\beta - 1}$$

where $\alpha$ and $\beta$ are parameters that shape the Beta distribution. The Beta distribution has the same functional form as the binomial likelihood; for $m$ heads in $N$ tosses:

$$p(m \mid N, \mu) = \binom{N}{m}\, \mu^{m} (1 - \mu)^{N - m}$$

Hence, we can derive the probability of heads as a posterior formulation:

$$p(\mu \mid m, N, \alpha, \beta) \propto \mu^{m + \alpha - 1} (1 - \mu)^{(N - m) + \beta - 1}$$

that is, the posterior is again a Beta distribution, $\mathrm{Beta}(\mu \mid m + \alpha, (N - m) + \beta)$,

which essentially implies that the probability of heads depends on the number of heads and the number of tails observed. The mean and variance of the Beta distribution indicate the probability estimate and the uncertainty around it:

$$\mathbb{E}[\mu] = \frac{\alpha}{\alpha + \beta}, \qquad \mathrm{var}[\mu] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

We can see that the uncertainty, i.e. the variance, shrinks as the number of samples increases.
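The shrinking variance is easy to check numerically. A small sketch, assuming a fair coin (about half heads) and a uniform prior $\alpha = \beta = 1$:

```python
# Posterior Beta(m + alpha, (N - m) + beta) for a fair coin with a
# uniform prior (alpha = beta = 1); m = heads observed out of N tosses.
alpha0, beta0 = 1.0, 1.0

for N in (10, 100, 1000, 10000):
    m = N // 2                              # fair coin: about half heads
    a, b = m + alpha0, (N - m) + beta0
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    print(f"N={N:5d}  mean={mean:.3f}  variance={var:.2e}")
```

The posterior mean stays at 0.5 while the variance drops by roughly an order of magnitude for each tenfold increase in samples.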

# Python Experiment

Let us now simulate this with a sample of the data collected at the University of California, Berkeley, where 40,000 coin tosses were performed:

https://www.stat.berkeley.edu/~aldous/Real-World/coin_tosses.html

First, define the parameters of the distribution and the experiment. We consider coin tosses at several sample-size levels to see how the mean and variance behave as the number of samples increases. The dataframe with the toss results is shown on the right. We then define quantities such as the mean and the variance of the Beta distribution.
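A sketch of this setup (not the post's original code): since the exact file format of the Berkeley dataset isn't reproduced here, the snippet simulates 40,000 fair-coin tosses as a stand-in; swap in the downloaded sequence if you have it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the Berkeley data: 40,000 simulated fair-coin tosses
# (1 = heads, 0 = tails).  Replace with the real downloaded sequence.
tosses = rng.integers(0, 2, size=40_000)
df = pd.DataFrame({"toss": tosses})

# Sample-size levels at which we will examine the posterior,
# and the hyperparameters of a uniform Beta prior.
levels = [10, 100, 1000, 10_000, 40_000]
alpha0, beta0 = 1.0, 1.0

print(df["toss"].value_counts())
```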

Let us now compute and plot the Beta distribution for the various values of alpha and beta obtained as the number of samples increases.
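The posterior update itself is one line per parameter: add the observed heads to alpha and the observed tails to beta. A sketch, again using simulated tosses as a stand-in for the real data and a uniform prior:

```python
import numpy as np

rng = np.random.default_rng(42)
tosses = rng.integers(0, 2, size=40_000)   # stand-in for the real data

alpha0, beta0 = 1.0, 1.0
levels = [10, 100, 1000, 10_000, 40_000]

posterior = []
for n in levels:
    heads = int(tosses[:n].sum())
    a = alpha0 + heads                     # prior alpha + #heads
    b = beta0 + (n - heads)                # prior beta  + #tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    posterior.append((n, a, b, mean, var))
    print(f"n={n:6d}  alpha={a:8.0f}  beta={b:8.0f}  "
          f"mean={mean:.4f}  var={var:.2e}")
```

Each row shows the posterior tightening around the empirical heads fraction as more tosses are folded in.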

We can now plot these values iteratively: the estimates from smaller samples are shown below in dotted lines, and the more accurate estimates from larger samples are plotted in dark lines.
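One way to produce such a plot (a sketch, not the post's original code; the output filename is arbitrary and the tosses are simulated stand-ins):

```python
import matplotlib
matplotlib.use("Agg")                       # render without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(42)
tosses = rng.integers(0, 2, size=40_000)   # stand-in for the real data

alpha0, beta0 = 1.0, 1.0
mu = np.linspace(0, 1, 500)

# Small-sample posteriors in dotted lines, large-sample ones in solid.
for n, style in [(10, ":"), (100, ":"), (1000, "-"), (40_000, "-")]:
    heads = int(tosses[:n].sum())
    pdf = beta_dist.pdf(mu, alpha0 + heads, beta0 + (n - heads))
    plt.plot(mu, pdf, style, label=f"n={n}")

plt.xlabel(r"$\mu$ (probability of heads)")
plt.ylabel("posterior density")
plt.legend()
plt.savefig("beta_posteriors.png")
```

The dotted small-sample curves are wide and wander; the solid large-sample curves are tall, narrow spikes centered near 0.5.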

You can see that at values closer to 0.5 the variance is much lower and the mean is centered there. This is an example of how using conjugate priors and formulating a Bayesian posterior helps us arrive at the true estimate.