Conjugate distributions

Conjugate distributions provide useful tricks for combining informative priors with likelihoods to produce posterior distributions. In the days before powerful computers and clever algorithms, they were often the only way. They only work for a single variable. Nevertheless, Gibbs sampling, which the wiqid package uses when possible, builds on the idea of conjugate distributions.

As our example, we'll use estimation of detection probability from data for repeat visits to a site which is known to be occupied by our target species. First, we'll describe the beta distribution, then see how that can be combined with our data. A discussion of priors will follow, and we'll finish with brief descriptions of conjugate priors for other types of data.

The beta distribution

The beta distribution only covers values between 0 and 1, so it's useful for probabilities and proportions. It has two shape parameters, usually called \(\alpha\) and \(\beta\), but in R's *beta family of functions they are termed shape1 and shape2. Both must have values greater than zero (though they can be tiny), and the shapes that we are mostly interested in have both parameters greater than 1.

Some example curves are shown on the right.
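
You can also draw a few curves directly with R's dbeta function; here's a minimal base-R sketch (the shape values are arbitrary examples):

# Plot a few beta densities with different shape parameters
curve(dbeta(x, 2, 2), from = 0, to = 1, ylim = c(0, 3), xlab = "x", ylab = "Density")
curve(dbeta(x, 5, 2), add = TRUE, col = "red")
curve(dbeta(x, 2, 5), add = TRUE, col = "blue")
curve(dbeta(x, 0.8, 0.8), add = TRUE, col = "darkgreen")  # U-shaped
legend("top", c("beta(2,2)", "beta(5,2)", "beta(2,5)", "beta(0.8,0.8)"),
  col = c("black", "red", "blue", "darkgreen"), lty = 1)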

To see more examples for yourself, copy and paste these two lines into the R Console (you will need to be online):

library(shiny)
runUrl("http://mikemeredith.net/blog/201502/visBeta.zip")

A new web page with a shiny app should appear (if you get a "Failed to download..." error, see the workaround here). The default controls use the mode of the curve and the concentration; the concentration is a measure of how narrow the curve is, ie, it's the opposite of the spread. See John Kruschke's book (section 6.2.1) for a description.
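
For reference, Kruschke's mode-and-concentration parameterization converts to shape1 and shape2 like this (my reading of his section 6.2.1; I'm assuming the app does the same, and betaShapes is just a throwaway helper name):

# Convert mode (omega) and concentration (kappa >= 2) to beta shape parameters:
# shape1 = omega * (kappa - 2) + 1, shape2 = (1 - omega) * (kappa - 2) + 1
betaShapes <- function(omega, kappa) {
  stopifnot(kappa >= 2, omega >= 0, omega <= 1)
  c(shape1 = omega * (kappa - 2) + 1,
    shape2 = (1 - omega) * (kappa - 2) + 1)
}
betaShapes(0.2, 5)  # mode 0.2, concentration 5: beta(1.6, 3.4)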

Update: With version 0.11.1 of the shiny package, it's possible to break a slider by pulling it too far to the left. This will be fixed in the next version; in the meantime, treat sliders gently!

A plot of a curve with mode 0.2 and concentration 5 is shown below. Click on the 'Shapes' button and a beta(0.8, 0.8) curve appears, which is U-shaped. U-shaped curves can't be produced with the Mode sliders.

[Figure: beta curve with mode 0.2 and concentration 5, from the shiny app]

The binomial distribution

The binomial distribution is useful for binary data, such as detection/nondetection. Suppose you laid out a transect as in the salamander study we looked at in the last post, and visited it 5 times looking for salamanders. If you saw salamanders on 3 occasions out of 5, your estimate of the probability of detection would be \(\hat{\pi} = 3/5 = 0.6\). But that's only a point estimate, and there's a range of plausible values for \(\pi\).

The first step is to look at the likelihood, the probability of observing 3 out of 5 with various values of \(\pi\). We can plot that with the same shiny app: move the 'data' sliders away from zero, with the right end at 5 and the left end at 3. The grey curve is the likelihood. It has actually been scaled so that the area under the curve is 1, for comparison with the beta curve.

The curve is quite broad, with a maximum (the grey dot) at \(\pi = 0.6\), confirming that 0.6 is the maximum likelihood estimate, \(\hat{\pi}\).
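
Here is my reconstruction of what the app plots (not its actual code): the binomial likelihood on a grid of \(\pi\) values, rescaled numerically so the area under the curve is 1.

# Binomial likelihood for 3 successes in 5 trials, scaled to integrate to 1
p <- seq(0, 1, length.out = 201)
lik <- dbinom(3, size = 5, prob = p)
scaled <- lik / (sum(lik) * (p[2] - p[1]))  # crude numerical normalization
plot(p, scaled, type = "l", col = "grey", lwd = 2,
  xlab = expression(pi), ylab = "Scaled likelihood")
abline(v = 3/5, lty = 3)  # the maximum likelihood estimate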

The binomial and the beta

With the likelihood displayed, try adjusting the Prior sliders so that the red curve matches the grey curve.

The first thing to note is that you can get them to match exactly. The two distributions are obviously related, and are described as conjugate. If you experiment with different data, you'll also see that the beta distribution that matches the binomial likelihood has parameters equal to successes + 1 and failures + 1.
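
You can confirm the match numerically: the binomial likelihood for k successes in n trials is proportional to \(\pi^k (1 - \pi)^{n-k}\), and normalizing that gives exactly the Beta(k + 1, n - k + 1) density. A quick check:

# The ratio of the likelihood to the beta density is constant,
# so the two curves differ only by a scale factor
k <- 3; n <- 5
p <- seq(0.01, 0.99, by = 0.01)
ratio <- dbinom(k, n, p) / dbeta(p, k + 1, n - k + 1)
range(ratio)  # both ends the same value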

The data slider has a maximum at 10 by default, but you can increase this with the last slider in the column.

What if you only did a single trial? Try setting the data slider to 1 trial and 0 or 1 success. You'll need the Shapes sliders to match a beta distribution to these.

Of course you can't get data from zero trials, but you might be able to guess what the likelihood curve ought to look like.

Using a beta distribution as the prior

If your prior distribution for a binomial model is a beta distribution, then the posterior is also a beta distribution. Moreover, the parameters of the posterior are easy to calculate:

posterior shape1 = prior shape1 + number of successes
posterior shape2 = prior shape2 + number of failures
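
In code, the update is a one-liner (updateBeta is my own throwaway helper, not a wiqid function):

# Conjugate beta-binomial update: returns the posterior shape parameters
updateBeta <- function(shape1, shape2, successes, failures) {
  c(shape1 = shape1 + successes, shape2 = shape2 + failures)
}
updateBeta(1, 1, successes = 3, failures = 2)  # uniform prior -> Beta(4, 3)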

In the shiny app, check the "Show posterior" box. Now the red curve, which you control with the sliders, is the prior, and the blue dashed curve shows the posterior distribution.

We'll use the shiny app to see how the choice of prior affects the posterior in a variety of cases. But first let's be clear that you should decide on your prior before seeing the data, because humans are very bad at pretending they don't know. We tend to include everything we know, including the data if we've had a peek at it, and then we're producing an informal posterior, not a genuine prior.

A strongly informative prior

Analysis of the salamander study described in the previous post produces a posterior probability for the detection parameter which is approximately Beta(15, 45), which has mode 0.24 and concentration 60. (Use the slider at the bottom of the column to increase the upper limit on concentration.) Adding another 5 trials with 3 successes (likelihood mode 0.6) produces a posterior which is very close to the prior, shifted just a little to the right.
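
With the updateBeta helper above, the arithmetic is: Beta(15, 45) plus 3 successes and 2 failures gives Beta(18, 47), with mode 17/63, about 0.27, just a little to the right of the prior mode of 0.24.

post <- updateBeta(15, 45, successes = 3, failures = 2)
post  # Beta(18, 47)
(post["shape1"] - 1) / (sum(post) - 2)  # posterior mode, about 0.27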

Weaker priors

That would be appropriate if we were working in the same area with the same species, but not if we're working with a different species or in different habitat. The prior curve is very close to zero for all values of detection outside the range 0.1 to 0.5: it says we are 99.9% certain that detection falls in this range. That's too strong; we need a broader prior distribution, and we can get one by reducing the concentration. But how much should we reduce it?

Try this in the shiny app: keep the mode at 0.24 and set the data sliders for 30 trials with 18 successes. If the prior concentration is high (> 30), the posterior is closer to the prior; if it's low (< 30), the posterior is closer to the data (ie, the likelihood). When concentration = trials, they have equal weight.

Think of the prior concentration as the "equivalent sample size", the number of new trials needed for the prior and the new data to have equal influence on the posterior.
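
To see this numerically (a rough sketch; the concentrations other than 30 are arbitrary choices): keep the prior mode at 0.24 and watch the posterior mode move from the prior mode towards the MLE (18/30 = 0.6) as the prior concentration drops.

# Posterior mode for a mode-0.24 prior at various concentrations,
# after observing 18 successes in 30 trials
k <- 18; n <- 30
for (kappa in c(90, 30, 10)) {
  sh <- betaShapes(0.24, kappa)  # helper defined above
  postMode <- (sh[1] + k - 1) / (kappa + n - 2)
  cat("concentration", kappa, ": posterior mode", round(postMode, 2), "\n")
}
# concentration 90 gives 0.33; 30 gives 0.43 (roughly midway); 10 gives 0.52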

Please note that I'm not suggesting that you put your study data into the shiny app and adjust the prior until you find something that fits the data! This should be a "what if" exercise done before data collection: if my data showed a different result from the prior, at what point would I believe the data rather than the prior?

A uniform prior

If you move the concentration all the way down to 2 (the minimum), you have a uniform prior, Beta(1, 1). Now the posterior coincides with the (scaled) likelihood. This looks like the ideal "noninformative" prior when we have no idea what the real value could be.

Thomas Bayes himself used a uniform prior, and Laplace formally expressed the idea that, when in a state of ignorance, we should assign an equal probability to each possibility. But there are important objections:

1. As Ronald Fisher would put it, “Not knowing the chance of mutually exclusive events and knowing the chance to be equal are two quite different states of knowledge.”

2. What is uniform on one scale is not uniform on another. On a logit scale - which we use to relate detection to covariates - a Beta(1,1) distribution becomes a bell-shaped distribution with standard deviation 1.8. And a flat distribution on the logit scale corresponds to a Beta(0, 0) distribution.
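
A quick demonstration of that bell shape (standard R functions only; qlogis is R's logit function, and the seed is arbitrary):

# The logit of a uniform (ie, Beta(1,1)) variable has a standard logistic
# distribution: bell-shaped, with sd = pi/sqrt(3), about 1.81
set.seed(42)
x <- qlogis(runif(1e5))
sd(x)  # close to 1.81
hist(x, breaks = 50, main = "Beta(1,1) on the logit scale")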

In spite of these objections, uniform priors seem to be the norm for Bayesian analysis in ecology.

U-shaped priors

Two U-shaped priors have been recommended for their mathematical properties. The Jeffreys prior is a Beta(0.5, 0.5) distribution, which gives the same amount of information whatever scale the parameter is transformed to. The Haldane prior is Beta(0, 0). If you try this in the shiny app (where it's actually Beta(0.0001, 0.0001), as zero values for the parameters cause divide-by-zero errors), you will see that the posterior mean corresponds to the maximum likelihood estimate.
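
A quick check of that last claim, using the same near-zero shape values as the app:

# With a (near-)Haldane prior, the posterior mean is essentially the MLE, k/n
k <- 3; n <- 5
eps <- 0.0001  # stand-in for the improper Beta(0, 0)
(eps + k) / (2 * eps + n)  # posterior mean: essentially 3/5 = 0.6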

These U-shaped distributions put half the prior weight near 0 and half near 1, with values around 0.5 having the lowest probability. In practice, we usually know that detection is not zero (we do sometimes detect the animal) and it's not 1 (we sometimes miss it), so these priors are not plausible.

Choice of prior

You should always clearly state the prior you used. Your readers (editors, referees, supervisor, donor, ...) will reject your conclusions if they do not accept your prior. In ecology, uniform priors are uncontroversial.

If your study is a follow-up to previous research and you can assume that the parameter value has not changed, you can combine the old and new data in a single analysis, rather than using the posterior of the old analysis as the prior for the new. This focuses debate on whether the parameter is the same, rather than on what prior to use. It also works if the output from the first analysis was in the form of an MCMC chain rather than a beta distribution.

Other conjugate pairs

For count data with a Poisson likelihood, the conjugate prior is the gamma distribution. You can run a shiny app for gamma/Poisson distributions with the following code:

library(shiny)
runUrl("http://mikemeredith.net/blog/201502/visGamma.zip")

The updating rule for the Poisson/gamma combination is:

posterior shape = prior shape + total count
posterior rate = prior rate + number of units

where the total count is, say, the total goals scored in n matches or the total number of ticks found on n rats, and the number of units is the number of matches or rats examined, n.
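
The same kind of one-line updater works here (updateGamma is another throwaway helper, and the example numbers are made up):

# Conjugate gamma-Poisson update: returns the posterior shape and rate
updateGamma <- function(shape, rate, totalCount, nUnits) {
  c(shape = shape + totalCount, rate = rate + nUnits)
}
# eg, a vague Gamma(1, 0.1) prior updated with 25 ticks found on 10 rats
updateGamma(1, 0.1, totalCount = 25, nUnits = 10)  # Gamma(26, 10.1)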

For the mean of normally distributed data with known variance, the conjugate prior is also a normal distribution. For the precision (the reciprocal of the variance) of a normal distribution with known mean, the conjugate prior is the gamma distribution. In practice, we usually know neither the mean nor the variance, but these relationships are the raw materials for a Gibbs sampler.
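
For the normal-mean case, the update is precision-weighted averaging; this is the textbook formula, not wiqid code, and the numbers below are invented:

# Normal prior for a mean, data variance (sigma^2) known:
# posterior precision = prior precision + n / sigma^2
# posterior mean = precision-weighted average of prior mean and sample mean
updateNormalMean <- function(priorMean, priorSD, xbar, n, sigma) {
  priorPrec <- 1 / priorSD^2
  dataPrec <- n / sigma^2
  postMean <- (priorPrec * priorMean + dataPrec * xbar) / (priorPrec + dataPrec)
  c(mean = postMean, sd = sqrt(1 / (priorPrec + dataPrec)))
}
updateNormalMean(priorMean = 0, priorSD = 10, xbar = 4.2, n = 25, sigma = 2)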

Wikipedia has a long list of conjugate priors for other distributions.

Updated 11 April 2015 by Mike Meredith