Thinking about prior distributions

Bayesian analysis requires priors. I see that as a huge practical advantage, especially when producing information for management: we can incorporate information derived from more than 100 years of past research. The tradition in ecology, however, is to use priors which give minimal information. Truly 'uninformative' priors do not exist - with the possible exception of a uniform prior for a probability - so care is needed to provide information which makes biological sense in the context of your own analysis.

As a colleague has pointed out, there are many specific types of priors described in the literature, and rarely a proper discussion of the reasons for choosing that particular distribution and those parameters. This is understandable with large complex models, such as multi-species occupancy models, with many priors and hyperpriors, but remains a problem. This post aims to give you some ways to explore possible prior distributions and assess their suitability for your analysis.

Stick to what you know

Set priors on quantities that have biological meaning. You will have an intuitive feel for the plausible range of values. They may be transformed - perhaps as a log or a logit - in the model, but set priors for the real values.

This applies specifically to standard deviations (SDs) for normal and t distributions, where JAGS uses the precision, \(\tau = 1/SD^2\), instead of the SD. I do have a hazy idea of what an SD value implies - about 2/3 of the values will be within 1 SD of the mean and almost all within 2 SDs - but no idea what a value of \(\tau\) means. So I put a prior on the SD and convert that to \(\tau\) for use in dnorm or dt. For example:

SD ~ dunif(0, 10)
tau <- 1/SD^2
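
As a sketch of how this slots into a model (the node names here are hypothetical, not from a specific analysis), the JAGS code might look like:

model {
  SD ~ dunif(0, 10)     # prior on the SD, a scale we can reason about
  tau <- 1 / SD^2       # convert to precision for dnorm
  mu ~ dnorm(0, 0.0001)
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
}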

When is uniform informative?

Probabilities have to be between 0 and 1, so a uniform prior on a probability is proper, and it should not be informative, right? But once you transform your probabilities to the logit scale, they look very different:

p <- runif(1e6)
hist(p, freq=FALSE, main="", col='skyblue')
lp <- qlogis(p)
sd(lp)
[1] 1.81596
hist(lp, freq=FALSE, main="", xlab="logit(p)", col='skyblue')
curve(dnorm(x, 0, sd(lp)), add=TRUE, col='red', lwd=2)

On the logit scale our "uninformative" uniform prior becomes a really tight distribution with an SD of 1.8, which would be judged to be highly informative! For a prior on the logit scale, we'd want a normal prior with SD 5 or 10. Let's see what that means on the probability scale:

q <- rnorm(1e6, 0, 5)
hist(q, freq=FALSE, main="", xlab="logit(p)", col='skyblue')
p <- plogis(q)
hist(p, freq=FALSE, main="", col='skyblue')

Now our minimally-informative, broad prior on the logit scale becomes highly informative - and nonsensical - on the probability scale. I can't think of an example where values near 0 and 1 are most plausible and those in the middle are least plausible.

In a logistic regression model the intercept is the logit-scale equivalent of the probability when all covariates equal 0. With centred covariates that corresponds to the probability at average covariate values - something meaningful - so put a prior on the probability and then convert it to the logit form. For example:

p ~ dbeta(1, 1)
lp <- logit(p)

Similar logic applies to log-transformed density or abundance: put a prior on the biological quantity and convert as necessary for the regression.
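
As a sketch (node names and the range are hypothetical, for illustration only), a JAGS prior for a density D might be set on the natural scale and then logged for the regression:

D ~ dunif(0, 50)     # plausible range, e.g. animals per km2
log.D <- log(D)      # intercept on the log scale for the regression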


Look at the distribution you use

A key part of assessing the suitability of a prior distribution is to plot it. I also find it helpful to generate 20-30 random values from the distribution and plot them as a rug.

In a recent post on cross-validation we used a half-Cauchy hyper-prior with scale parameter 2.25 for the coefficients of psi and p. Let's see what that looks like, and compare it with a normal distribution and a t distribution with 4 degrees of freedom. We'll use the dt2 and rt2 functions in the wiqid package, as they allow us to specify the location and scale.

xx <- seq(0, 10, length=101)
den0 <- dnorm(xx, 0, 2.25)  # Normal
den1 <- dt2(xx, loc=0, scale=2.25, df=1) # Cauchy
den4 <- dt2(xx, loc=0, scale=2.25, df=4) # t_4
plot(xx, den0, type='l', lwd=2, lty=2, col='red',
  xlab="Parameter", ylab="Probability Density")
lines(xx, den1, lwd=2)
lines(xx, den4, lwd=2, lty=2, col='blue')
legend('topright', c("Gaussian", "t_4", "Cauchy"), lwd=2,
  lty=c(2,2,1), col=c('red', 'blue', 'black'), bty='n')
rug1 <- abs(rt2(20, loc=0, scale=2.25, df=1))
rug4 <- abs(rt2(20, loc=0, scale=2.25, df=4))
rug(rug1)
rug(rug4, col='blue')
range(rug1)
[1]    0.2337097 2865.6930550
range(rug4)
[1] 0.2971721 6.6131075

In the plot we see that most of the mass for the Cauchy distribution is below 5, but the tails allow for very large values - here we have values in the thousands, which are surely preposterous. The t_4 distribution gives much more sensible values.

The t distribution in dt2 is parameterised by its scale, \(\sigma\); in JAGS, dt takes the precision \(\tau = 1/\sigma^2\), just as dnorm does. For the normal distribution \(\sigma\) equals the SD, but that is not true for the t distribution, of which the Cauchy is the special case with 1 degree of freedom. Indeed, for the Cauchy both the mean and the SD are undefined, though the location and scale are well defined. That in itself is a reason to prefer a t distribution with 4 degrees of freedom over the Cauchy - it does have a mean, an SD and a skew.
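
A quick way to see this in base R (a sketch: rt(n, df) draws from the standard t, so multiplying by 2.25 gives the scaled version):

set.seed(123)
sd(2.25 * rt(1e5, df=1))  # Cauchy: huge and unstable from run to run
sd(2.25 * rt(1e5, df=4))  # t_4: settles near 2.25 * sqrt(2), about 3.2

Run the Cauchy line a few times with different seeds and the sample SD jumps by orders of magnitude, because the theoretical SD does not exist; the t_4 sample SD barely moves.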


Updated 28 Oct 2019 by Mike Meredith