Likelihood and Maximum Likelihood Estimation

HOME I'm planning a series of posts looking at what happens under the hood when we analyse a data set using some of the estimation functions in the wiqid package. I'll focus mainly on Bayesian methods, but this first post will look at the likelihood, which is used for both Bayesian analysis and maximum likelihood estimation.

We'll use a simple occupancy model. It has just two parameters and both must between 0 and 1. That means that we can plot all possible combinations of the two parameters in a simple two-dimensional graph. As we'll see we need to add a third dimension, but three is still manageable.

The data set and model

The data are detection/non-detection observation for Blue Ridge two-line salamanders (Eurycea wilderae) in Great Smoky Mountains National Park, and are described and analysed by MacKenzie et al (2006). A sample of 39 sites were visited on 5 separate occasions, and the number of occasions when at least one salamander was found recorded. Here are the data, sorted in descending order:

y <- c(4, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

n <- 5

sum(y > 0) # 18

We assume that salamanders are present or absent at a site on all 5 occasions, and they were seen on < 5 occasions because probability of detection is < 1. We also assume that salamanders are not recorded when they are not present, ie, we have false absences but no false presences. Salamanders were seen at 18 sites out of 39 (46%), but many of the remaining 21 sites will have salamanders present too.

We want to estimate the probability that a site is occupied by salamanders, allowing for the effect of imperfect detection. We need to jointly estimate probability of occupancy (\(\psi\)) and probability of detecting salamanders on a single visit if they are present (\(\pi\)). The simplest model assumes that \(\psi\) is the same for all sites and \(\pi\) is the same for all sites and all occasions.

We'll see a lot of probabilities in the next few pages, so I'll refer to the parameters as "occupancy" and "detection", and drop the "probability of..." bit.

The likelihood

The likelihood is the term for the probability of observing the actual data, \(\bf y\), given the model and specific parameter values, in this case \(\Pr({\bf y} | \psi, \pi)\).

We can calculate this for each site with:

likeSite <- psi * dbinom(y, n, pi) + ifelse(y==0, 1 - psi, 0)

The first term is the probability that the site is occupied AND we detected salamanders on y occasions, using dbinom to calculate the latter. A site where y = 0 might be occupied but we detected nothing OR it might not be occupied, so when y = 0 we add in the probability of non-occupancy, 1 - psi. The overall likelihood is the product of the contributions for each site:

like <- prod(likeSite)

Because this situation is very simple, we can actually plot the whole likelihood surface, for all values of  (\(\psi\)) and  (\(\pi\)) from 0 to 1. Using steps of 0.005 will give us a nice-looking plot, but we'll avoid the values 0 and 1, as the data would be impossible in those cases

PSI <- PI <- seq(0.0025, 1, 0.005)
# [1] 0.0025 0.9975
# [1] 200

Now we do a double loop to fill a 200 x 200 matrix with the values for the likelihood:

LIKE <- matrix(NA, 200, 200)
for (i in 1:200) {
  for(j in 1:200) {
    LIKE[i, j] <- prod(PSI[i] * dbinom(y, n, PI[j]) + ifelse(y==0, 1 - PSI[i], 0))
# [1] 1.851698e-197  9.141915e-22

Likelihoods can get very small (ie, close to zero), as you can see. There is a limit to the smallest number your computer can handle: do .Machine$double.xmin to see the limit for your computer, mine is 2.225074e-308, so no problem for this data set. But in general it's safer to work with log likelihoods.

Do a plot of the "terrain" and add contours:

image(LIKE, xlab="occupancy, psi", ylab="detection, pi", main="Likelihood surface",
    col = terrain.colors(12))
contour(LIKE, add=TRUE, drawlabels=FALSE)


The likelihood surface has an elongated "hill" in the lower-right portion of the plot. We can find the highest point of the hill, and that will give us the maximum likelihood estimates (MLEs) of \(\psi\) and  \(\pi\). We need arrayInd to get the row and column indices of the maximum value:

( where <- arrayInd(which.max(LIKE), dim(LIKE)) )
#     [,1] [,2]
# [1,] 120   52
PSI[where[1]]   # MLE of psi
# [1] 0.5975
PI[where[2]]     # MLE of pi
# [1] 0.2575
abline(v = PSI[where[1]], h = PI[where[2]], col='white') # add to the plot

Finding the maximum likelihood

With our simple model with 2 parameters limited to (0, 1), we can calculate the likelihood for 40,000 possible combinations and find the highest. With more parameters and parameters not limited to (0, 1), we soon run to billions of combinations.

The usual way of finding the maximum likelihood is to use a algorithm which climbs to the top of the hill as directly as possible. We can do that with the nlm or optim functions in R. First we need to write a function to calculate the likelihood:

  • psi and pi are passed as a two-element vector;
  • optim has trouble with very small numbers, so we use the log(likelihood);
  • since optim finds the minimum instead of the maximum, the function returns the negative log(likelihood).
get_negLogLike <- function(x, y, n) {
  psi <- x[1]
  pi <- x[2]
  like <- prod(psi * dbinom(y, n, pi) + ifelse(y==0, 1 - psi, 0))
get_negLogLike(c(0.7, 0.3), y=y, n=5)  # check
# 49.8396

optim(c(0.5,0.5), get_negLogLike, y=y, n=5)
# $par
# [1] 0.5946092 0.2587594
# etc...

The two values reported as $par are the MLEs of \(\psi\) and  \(\pi\) respectively. Compare with the values of 0.5975 and 0.2575 that we got from our all-combinations approach above.

The Nelder–Mead simplex algorithm

So how does optim work?

The default algorithm is the Nelder-Mead or simplex algorithm. With our example, we are working in 2 dimensions, so our simplex is a triangle, with each vertex defined by a value for  \(\psi\) and a value for  \(\pi\). We'll look at multidimensional simplices later. The plot below shows the lower-right portion of our likelihood surface with a starting triangle around (0.5, 0.5).

Here's the Nelder-Mead algorithm to find the maximum likelihood:

1. Calculate the likelihood at each vertex and find the lowest (circled); this vertex will be replaced with a higher one.

2. Reflect the lowest vertex to a new point on the opposite side of the triangle. Calculate the likelihood for the reflected point and see if it's higher or lower than the remaining vertices. If it's intermediate, use the blue triangle for the next step.

3. If the reflected point has the highest likelihood, try extending the triangle, going further from the original triangle. Calculate the likelihood for the extended point: if this is best, use the yellow triangle for the next step, otherwise use the blue triangle.

4. If the reflected point has the lowest likelihood, try contracting the triangle, moving only half way to the opposite side. If this has higher likelihood than the reflected point, use the pink triangle for the next step, otherwise use the blue triangle.


5. Calculate the difference in likelihood between the new point and the original, discarded vertex (circled). If the difference is less than the target tolerance, stop, otherwise go to 1 and repeat.

Let's see how this works for our example data set:

There are a couple of extension steps as the algorithm heads for the ridge. It overshoots and there's a contraction step before it moves up along the ridge. Once the top of the hill is inside the triangle, a series of contraction steps close in on the maximum value. You can download an mp4 file with the animation here.

The code for the examples on this page is here. If you have problems running ImageMagick for the animation, there's a workaround here.

Some odd points

Higher dimension simplices

If you have more then 2 parameters, your simplex will be more ... complex. With 3 parameters, it's a tetrahedron, which you can probably imagine being reflected. With more, visualization is no longer possible for most people. With 10 parameters, each vertex is defined by a value for each parameter, and the simplex has 11 vertices. The process of finding the vertex with the lowest likelihood and reflecting, extending or contracting is analogous.

Markov chains

The Nelder-Mead algorithm generates a first-order Markov chain. Each step depends only on the outcome of the previous step, and is not affected by any of the steps before that. Just like the game of snakes and ladders (or "chutes and ladders"), where you go next depends only on where you are now and the next die roll; it doesn't matter how many ladders you've climbed or how many snakes you slid down before getting to your present position.

Though perhaps I should add that, unlike snakes and ladders, and unlike MCMC, there is nothing random about the Nelder-Mead algorithm. But MCMC is the topic of a future post...


MacKenzie, D.I., Nichols, J.D., Royle, A.J., Pollock, K.H., Bailey, L.L., & Hines, J.E. (2006) Occupancy estimation and modeling : inferring patterns and dynamics of species occurrence Elsevier Publishing.

Updated 19 Feb 2015 by Mike Meredith