A simple Gibbs sampler

As discussed in the last post, conjugate distributions provide an easy way to calculate posterior distributions for a single parameter, such as detection of a species during a single visit to a site where it is present. If we have more than one unknown parameter in our model - as with a simple occupancy model, where we have detection and occupancy - we may still be able to use conjugacy via a Gibbs sampler.

Gibbs sampling works if we can describe the posterior distribution for each parameter given values for all the other parameters in the model.

We can see this with the salamanders data that we used for maximum likelihood estimation in a previous post. In this study, \(S = 39\) sites were visited on \(K = 5\) occasions and searched for salamanders. Salamanders were observed at \(x_{obs} = 18\) sites, but detection is low (at 12 of those sites they were detected on only 1 occasion out of 5), so we suspect that the actual number of occupied sites is \(x > x_{obs}\). The total number of detections was \(d = 30\).

If we knew the actual number of occupied sites, \(x\), then the occupancy likelihood is binomial with \(x\) successes in \(S\) trials, and the detection likelihood is binomial with \(d\) successes in \(x \times K\) trials (note that only visits to occupied sites count as trials). But we don't know \(x\).
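In the distribution shorthand used later on this page (and writing \(\psi\) for the probability a site is occupied and \(\pi\) for the probability of detection on a single visit to an occupied site, as in the next paragraph), this is \(x \sim \text{Binomial}(S, \psi)\) for occupancy and \(d \sim \text{Binomial}(x \times K, \pi)\) for detection.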

If we knew occupancy, \(\psi\), and detection, \(\pi\), we could calculate the conditional probability that a site where salamanders were not observed was actually occupied, ie, \(\psi_0 =\) Pr(occupied | not detected). The number of additional occupied sites - sites occupied but with no detections - \(x_{add}\), follows a binomial distribution with \((S - x_{obs})\) trials and probability of success \(\psi_0\). And \(x = x_{obs} + x_{add}\). But we don't know \(\psi\) and \(\pi\).
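Written out, \(\psi_0\) is just Bayes' rule applied to a single site (this is the same expression that appears in the R code at the foot of the page): \(\psi_0 = \dfrac{\psi (1 - \pi)^K}{\psi (1 - \pi)^K + (1 - \psi)}\), where the numerator is the probability that a site is occupied but the salamanders are missed on all \(K\) visits.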

Gibbs sampling

This is the Gibbs sampling algorithm:

Start by choosing arbitrary starting values for  \(\psi\) and \(\pi\). Then:

1. Use the current values of \(\psi\) and \(\pi\) to calculate the conditional probability of occupancy, \(\psi_0\). Draw a random value from the \(\text{Binomial}(S-x_{obs}, \psi_0)\) distribution and use this as \(x_{add}\). Calculate a new value for  \(x = x_{obs} + x_{add}\).

2a. The parameters of the beta posterior for \(\psi\) are prior shape1 + \(x\), prior shape2 + \((S - x)\). Draw a random value from this distribution and use this as the new value of \(\psi\).

2b. The parameters of the beta posterior for \(\pi\) are prior shape1 + \(d\), prior shape2 + \((x \times K - d)\). Draw a random value from this distribution and use this as the new value of \(\pi\). (The priors for  \(\psi\) and \(\pi\) could be different.)

Repeat steps 1, 2a and 2b thousands of times, storing the values for \(\psi\) and \(\pi\).
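Collecting the full conditionals from steps 1, 2a and 2b in distribution notation (writing \(a_\psi, b_\psi\) and \(a_\pi, b_\pi\) for the prior shape parameters; this is only a restatement of the steps above): \(x_{add} \sim \text{Binomial}(S - x_{obs}, \psi_0)\), \(\psi \sim \text{Beta}(a_\psi + x,\ b_\psi + S - x)\), and \(\pi \sim \text{Beta}(a_\pi + d,\ b_\pi + x K - d)\).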

The R code to do this is at the foot of the page. The animation below shows how this works for the salamander data. I used uniform Beta(1, 1) priors and 0.5 as the starting values for both occupancy and detection.

As we see, the sampler quickly moves away from the starting point, and generates a series of points in an ellipse centered at about \(\psi = 0.6, \pi = 0.25\). You can download an MP4 version of the animation here.

 

Three ways to describe distributions

First - formula, usually in a shorthand form giving the type of distribution and the parameter values, eg, \(y \sim \text{Binomial}(6, 0.2)\) or \(x \sim \text{Normal}(167, 8.2)\); here "~" means "is distributed as". Plug the value of \(x\) or \(y\) into the corresponding R function to get the probability (density): dbinom(4, 6, 0.2) or dnorm(180, 167, 8.2).

Second - graph, an actual plot or a vector of \(x\)'s and a corresponding vector of \(y\)'s. The \(x\)'s are values of the parameter, usually several hundred and usually equally-spaced, and the \(y\)'s are the corresponding probability densities. The density function produces this as its output, but you can make your own vectors. Use the approx function to get the probability density for a particular value of the parameter.
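For example, a minimal sketch using the Normal(167, 8.2) distribution from above (the grid limits and number of points here are arbitrary choices):

xs <- seq(130, 210, length.out=500)  # values of the parameter
ys <- dnorm(xs, 167, 8.2)            # corresponding probability densities
plot(xs, ys, type='l')
approx(xs, ys, xout=180)$y           # close to dnorm(180, 167, 8.2)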

Third - sample, a large sample of random draws from the target distribution. For example, I wanted to see what a uniform Beta(1, 1) distribution looked like when converted to the logistic ("logit") scale. I created a sample of 100,000 random draws with the rbeta function, converted to logit, and checked the histogram and mean and standard deviation:

prob <- rbeta(1e5, 1, 1)   # 100,000 draws from a uniform Beta(1, 1)
hist(prob)                 # flat on the probability scale
logit <- qlogis(prob)      # convert to the logit scale
hist(logit)                # bell-shaped curve centered on 0
mean(logit)                # close to 0
sd(logit)                  # about 1.8

Gibbs sampler output

The first few values in each of the output vectors are affected by the arbitrarily chosen starting point, so we discard the initial values, referred to as "burn-in". Although only 1000 values are shown in the plot above, I did 10,100 iterations, and we'll discard the first 100 values:

psi <- psi[101:10100]
pi <- pi[101:10100]

Now we have a pair of vectors of 10,000 random draws from the posterior distributions for \(\psi\) and \(\pi\). We can plot histograms and calculate means and credible intervals. If you have the wiqid package installed, you can do this:

library(wiqid)
par(mfrow=1:2)
plotPost(psi)
plotPost(pi)
par(mfrow=c(1, 1))

Without wiqid, just do:

hist(psi)
mean(psi)
quantile(psi, c(0.025, 0.975))

...and the same for pi.

So far we have only looked at marginal posterior probabilities for each parameter separately. The scatter plot shows that they are related. If we think detection is relatively low, we'll expect many of the sites where salamanders were not seen to be occupied, \(x_{add}\) will be relatively high, and so will \(\psi\). If we think detection is high,  \(x_{add}\) and \(\psi\) will be low.

We can get the correlation coefficient with

cor(psi, pi)
[1] -0.5490722

And we can use Ben Bolker's HPDregionplot function to plot a two-dimensional 95% credible interval (a credible region):

library(emdbook)
plot(psi, pi, pch=19, cex=0.1)
HPDregionplot(cbind(psi, pi), col="red", lwd=3, add=TRUE)

Now let's look in more detail at the output vectors.

MCMC - Markov chain Monte Carlo

The trace plots below show the first 200 values in my chains for \(\psi\) and \(\pi\), after discarding the first 100 values as burn-in, with the red line indicating the mean of the vector. These are based on random numbers, so your values will be different.

par(mfrow=1:2)
plot(psi[1:200], type='l')
abline(h=mean(psi), col='red')
plot(pi[1:200], type='l')
abline(h=mean(pi), col='red')
par(mfrow=c(1, 1))

And here are the first 6 values in my vectors for  \(\psi\) and \(\pi\).

head(cbind(psi, pi))
           psi        pi
[1,] 0.6405463 0.2876150
[2,] 0.4438618 0.2526907
[3,] 0.5311369 0.2939948
[4,] 0.6424184 0.3032949
[5,] 0.4546159 0.2504523
[6,] 0.5936303 0.2485805

As should be clear from the algorithm, the set of values in each row depends only on the previous row. What happened before that is irrelevant. This serial dependency shows up in the trace plots, though it's not obvious from just 6 rows of values.

This type of sequence is called a Markov chain.

The values are random draws from the respective posterior distributions. In fact, computers don't do random, so pseudorandom numbers are used instead. This type of computer simulation was first used in the 1940s and was dubbed Monte Carlo simulation, after the famous casino in Monaco.

The draws are random, but not independent. They are autocorrelated, the autocorrelation decreasing as the lag (the distance between the values) increases. We can see autocorrelation plots with the acf function:
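For example, to see the autocorrelation plots for both chains side by side (a minimal sketch):

par(mfrow=1:2)
acf(psi)   # autocorrelation at increasing lags for psi
acf(pi)    # ...and for pi
par(mfrow=c(1, 1))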

The correlation between each value and the one immediately before (lag = 1) is high, about 0.65 for \(\psi\) and 0.50 for \(\pi\), then declines to be very small at a lag of about 15. Lack of independence means that our sample is effectively smaller than it seems. The coda package has a function to calculate the effective size of an MCMC sample:

library(coda)
effectiveSize(psi)  # 1422
effectiveSize(pi)   # 2110

So although the sample size is 10,000, the effective size is much less. A thousand or so is fine for reliable point estimates, but not for 95% credible intervals. These depend on a small number of very high or very low values, and a single 'clump' of high or low values can affect the estimate. For good 95% CrIs you need effective sample sizes of 10,000 or more.
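One rough way to see how long a run that would take (a sketch, not part of the original analysis; it assumes psi still holds the 10,000 post-burn-in draws):

essRate <- effectiveSize(psi) / length(psi)  # effective draws per iteration, about 0.14 here
ceiling(10000 / essRate)                     # iterations needed for an effective size of 10,000, roughly 70,000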

 

Markov chain Monte Carlo (MCMC) methods revolutionised Bayesian analysis when the first software package, BUGS (Bayesian analysis Using Gibbs Sampling), became available in the mid-1990s. As Gilks et al (1994, p. 169, original emphasis) put it, "Gibbs sampling succeeds because it reduces the problem of dealing simultaneously with a large number of intricately related unknown parameters and missing data into a much simpler problem of dealing with one unknown quantity at a time, sampling each from its full conditional distribution."

In a future post, I'll look at using a "random walk" to generate an MCMC sample. This does not involve conjugate priors.

The R code for the Gibbs sampler

Here's the code for the Gibbs sampler for the salamanders analysis:

# The data:
S <- 39    # Number of sites
K <- 5     # Number of visits to each site
xObs <- 18 # Number of sites where salamanders were detected
d <- 30    # Total number of detections

# Priors: independent beta(1, 1) priors
priPsi <- c(1, 1)
priPi <- c(1, 1)

nIter <- 1000               # Number of iterations (10100 were used for the analysis above)
psi <- pi <- numeric(nIter) # Objects to hold results
psi[1] <-  pi[1] <- 0.5     # Starting values

for(i in 2:nIter) {
  # 1. Calculate prob(occupied | not detected):
  psi0 <- (psi[i-1] * (1 - pi[i-1])^K) / (psi[i-1] * (1 - pi[i-1])^K + (1 - psi[i-1]))
  # ...and draw number of additional sites occupied
  xAdd <- rbinom(1, S - xObs, psi0)
  x <- xObs + xAdd
  # 2a. Draw new psi from beta(occupied+prior, unoccupied+prior)
  psi[i] <- rbeta(1, x + priPsi[1], S - x + priPsi[2])
  # 2b. Draw new pi from beta(detected+prior, undetected+prior).
  pi[i] <- rbeta(1, d + priPi[1], x * K - d + priPi[2])
}

# Plot the results
plot(psi, pi, xlim=c(0.2, 1), ylim=c(0, 0.8), type='l', col='grey',
  xlab="occupancy, psi", ylab="detection, pi")
points(psi, pi, pch=16, cex=0.1)
points(psi[1], pi[1], pch=16, col='red')  # starting point

The code for everything on this page, including the animated plot, is here.

References

Gilks, W. R., Thomas, A., & Spiegelhalter, D. J. (1994) A language and program for complex Bayesian modelling. The Statistician, 43(1):169-177.

Updated 27 Feb 2015 by Mike Meredith