Point estimates: mean, median or mode?

The output of a Bayesian analysis is a complete description of the posterior probability distribution of each of the parameters we are interested in. But mostly we need to summarise the distribution with a single number, a "point estimate". We can use the posterior mean, median or mode. Which is the best?

The answer will depend on the situation, and specifically on what are the consequences if the point estimate differs from the true value. We'll explore this with a simple example.

How many coins in the jar?

We've entered a competition to guess the number of coins in a glass jar, with a prize of $100. We're allowed to examine it closely, but not to take off the top. We have weighed it and tried to estimate the proportion of each type of coin. We still have doubts, but we have produced a Bayesian probability distribution of the number of coins.

For the competition, we need to enter a single number; we can't enter the whole distribution. Should we use the mode or the median or the mean? The distribution is not symmetrical, so the three values are different.

In fact, the best estimate depends on the rules of the competition, which the organisers are still arguing over. They all agree that an exactly correct guess earns $100, but not on whether close guesses should also be rewarded.

Ana says, "Simple. If they are wrong, they get $0."

Beth and Carl worry that no one may get exactly the right answer, and then everyone will complain. Beth suggests a sliding scale, where the prize money is reduced in line with the error. She suggests giving $99 if the error is just one coin, $98 if it's 2, and $1 if the error is 99.

Carl wants to give more money to people who are very close: if the error is only 3 or 4, they would lose less than a dollar. But he thinks Beth is too generous to those with big errors, and he would not reward errors of 45 or more. He wants the loss to be proportional to the squared error.

The three proposals are shown in the graph below:
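A plot along these lines can be drawn with a few lines of R. This is only a sketch: the function names are made up for illustration, and the constant in Carl's scheme is an assumption, chosen so that the prize just reaches $0 at an error of 45.

```r
# Prize (in dollars) as a function of the error in the guess.
# Function names are hypothetical; Carl's constant 100/45^2 is an
# assumption chosen so that the prize reaches zero at an error of 45.
prize_ana  <- function(e) ifelse(e == 0, 100, 0)          # all or nothing
prize_beth <- function(e) pmax(0, 100 - abs(e))           # $1 lost per coin
prize_carl <- function(e) pmax(0, 100 - 100 * (e / 45)^2) # squared error

err <- -110:110
plot(err, prize_beth(err), type = "l", lty = 2,
     xlab = "Error (coins)", ylab = "Prize ($)")
lines(err, prize_carl(err), lty = 3)
segments(0, 0, 0, 100, lwd = 2)   # Ana: a single spike at zero error
legend("topright", legend = c("Ana", "Beth", "Carl"),
       lty = c(1, 2, 3), lwd = c(2, 1, 1))
```

Under this assumed constant, an error of 3 coins costs about 44 cents with Carl's scheme, compared with $3 under Beth's.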

With Ana's all-or-nothing plan, we should use the mode of the probability distribution. That's the value with the highest probability of being correct.

With the other schemes, we need to think about possibly getting a smaller prize for being nearly right, and the mode is not the best for that. It turns out that with Beth's linear loss, the median is the best bet, and with Carl's squared-error loss, we need the mean.
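This can be checked numerically. The sketch below uses a made-up, right-skewed distribution for the number of coins (not the one from the competition) and plain, uncapped versions of the three loss functions, then finds the guess with the smallest expected loss under each. Because guesses must be whole numbers, the best guess under squared-error loss should be the integer nearest the mean.

```r
# Hypothetical posterior for the number of coins: a right-skewed
# distribution on 200..600 (an assumption for illustration only).
n <- 200:600
p <- dnbinom(n - 200, size = 3, mu = 61)
p <- p / sum(p)                        # renormalise after truncating at 600

# Expected loss of every possible guess g under a given loss function
expected_loss <- function(loss) {
  sapply(n, function(g) sum(p * loss(g - n)))
}

L0 <- function(e) as.numeric(e != 0)   # Ana: all or nothing
L1 <- function(e) abs(e)               # Beth: linear in the error
L2 <- function(e) e^2                  # Carl: squared error (uncapped)

best <- c(L0 = n[which.min(expected_loss(L0))],
          L1 = n[which.min(expected_loss(L1))],
          L2 = n[which.min(expected_loss(L2))])

# Summaries computed directly from the distribution
post_mode   <- n[which.max(p)]
post_median <- n[which(cumsum(p) >= 0.5)[1]]
post_mean   <- sum(n * p)

print(best)
print(c(mode = post_mode, median = post_median, mean = post_mean))
```

The two printed vectors should line up: the L0 winner with the mode, the L1 winner with the median, and the L2 winner with the rounded mean.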

Here we are using loss functions: Ana's is L0, Beth's L1 and Carl's L2. In the words of Rasmus Bååth: "Behind every great point estimate stands a minimized loss function." See his blog for more on loss functions and point estimates. For more on mean, median and mode, see here.

What does this mean in practice?

I can't imagine an example in wildlife statistics where Ana's L0 function applies, where you need to be spot on and any small error is a disaster. The mode is not going to be the best summary of an asymmetric posterior distribution.

Let's also note that maximum likelihood estimation gives us the mode of the likelihood curve, again not optimal unless the distribution is symmetrical.

In real situations, small errors are unimportant; problems begin when the errors are big. And I think mostly we have to deal with Carl's L2 loss function rather than Beth's smoothly declining L1 loss function. So the most appropriate point estimate will be the mean.

Other considerations

It's complicated to calculate the mode from MCMC output. I think this is the main reason it is not included in the output of typical MCMC packages.
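One common workaround, sketched here as an illustration and not as what any particular package does, is to take the peak of a kernel density estimate of the draws. The draws below are simulated from a gamma distribution as a stand-in for real MCMC output, and the result depends on the bandwidth that density() chooses, which is part of what makes the mode awkward.

```r
# Stand-in for MCMC output: simulated draws from a skewed distribution
# (gamma with shape 3 and rate 1, whose true mode is 2).
set.seed(42)
draws <- rgamma(1e5, shape = 3, rate = 1)

# Kernel density estimate, then the location of its highest point
dens     <- density(draws)
mode_est <- dens$x[which.max(dens$y)]

print(c(mode = mode_est, median = median(draws), mean = mean(draws)))
```

The mean and median need a single function call each, while the mode estimate changes with the smoothing settings.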

The median is easy to explain: it's the "50-50" point, with a 50% probability that the true value is higher and a 50% probability that it is lower. The mean of a probability distribution is much less intuitive.

The median is unaffected by monotonic transformations. If theta is a vector of posterior draws of a probability on the logit scale,
all.equal(median(plogis(theta)), plogis(median(theta)))
TRUE

This does not work with the mean or mode.
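A quick check with simulated values makes the contrast concrete. The draws of theta here are made up for illustration; an odd number of draws is used so that the median is one of the draws and the equality is exact.

```r
# Made-up draws on the logit scale (illustration only)
set.seed(1)
theta <- rnorm(10001, mean = 1, sd = 1.5)

# Median: the same whichever scale we summarise on
median_ok <- isTRUE(all.equal(median(plogis(theta)), plogis(median(theta))))

# Mean: back-transforming the mean is not the mean of the back-transformed draws
mean_ok <- isTRUE(all.equal(mean(plogis(theta)), plogis(mean(theta))))

print(c(median = median_ok, mean = mean_ok))
print(c(mean_of_probs = mean(plogis(theta)),
        prob_of_mean  = plogis(mean(theta))))
```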

Updated 20 Nov 2019 by Mike Meredith