MCMC samples with a random walk

HOME In the last post we looked at a way to use conjugate distributions for several parameters via a Gibbs sampler. The output from this was an MCMC sample of random draws from the posterior distribution. We can produce similar MCMC samples without using conjugate distributions with a method often called "Metropolis-within-Gibbs".

The idea for the sampler was developed by Nicholas Metropolis and colleagues in a paper in 1953. This was before the Gibbs sampler was proposed, but it uses the same idea of updating the parameters one by one. A better name would be "componentwise random walk Metropolis sampler". The rules for the random walk ensure that a large number of samples will be a good description of the posterior distribution.

We'll illustrate this with the same salamander data that we used before. For Gibbs sampling we used conjugacy, so we did not need to calculate the likelihood, we just needed the number of successes and failures. For the random walk method we do need to calculate likelihood, so take a look at the post where we looked at maximum likelihood estimation.

Recall that 39 sites were visited on 5 separate occasions, and the number of occasions when at least one salamander was found recorded. Here are the data, sorted in descending order:

y <- c(4, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
n <- 5

We want to estimate the probability that a site is occupied by salamanders (\(\psi\)) and probability of detecting salamanders on a single visit if they are present (\(\pi\)). The simplest model assumes that \(\psi\) is the same for all sites and \(\pi\) is the same for all sites and all occasions.

Exploring with a random walk

Imagine you want to explore a mountain covered in dense forest. You have a GPS receiver and an altimeter, and you want to spend most of your time exploring the upper part of the mountain, but you also want to investigate the lower areas. You have already reached the mountain and pitched camp in the forest. This is what you do each day:

Choose a point at a random distance from your camp and go there to check the elevation.

If the new point is higher than your camp, move camp to the new location.

If it's lower, you might still move. If it's a bit lower, you'd probably move; if it's a lot lower, you're unlikely to move. The probability of moving is the ratio of the elevations at the new location and the old location. If you decide not to move, go back to the old camp and try again next day.

By following this plan, the number of nights you spend on different parts of the mountain will be proportional to the elevation.

Metropolis sampling

This is the component-wise Metropolis random walk algorithm:

Start by choosing arbitrary starting values for  \(\psi\) and \(\pi\). Then:

1. Update \(\psi\) with these steps:

a. Draw  a random value from a proposal distribution which is symmetrical around the current value of \(\psi\). I drew from a  \(\text{Uniform}(-\delta, \delta)\) distribution and added the value to the current \(\psi\). This is the candidate value.

b. Calculate likelihood prior for the current value of \(\psi\) and for the candidate value, using the prior distribution for \(\psi\).

c. Calculate the ratio: candidate likelihood prior / current likelihood prior.

d. Decide whether to accept or reject the candidate value:

If the ratio in step c is greater than 1, accept.

If the ratio is less than one, draw a random value from \(\text{Uniform}(0, 1) \) and accept if this is less than the ratio.

e. If you accept, the new value of \(\psi\) = the candidate value; if you reject, it is the same as the old value of \(\psi\).

2. Update \(\pi\) using the same steps. Use the updated value of \(\psi\) to calculate the likelihood.

Repeat steps 1 and 2 thousands of times, storing the values for \(\psi\) and \(\pi\).

Two important features make this algorithm simple:

(1) the proposal distribution is symmetric; if it were asymmetric, we would have to include the relative probabilities of moving in either direction in the calculation.

(2) the prior distributions for  \(\psi\) and \(\pi\) are independent; in principle we should include all the priors in the calculation of likelihood prior, but provided they are independent, the prior for the parameter(s) not being updated are the same for the current value and the candidate value, and cancel out in the ratio.

The R code for the Metropolis sampler for the salamander data is at the foot of the page. The animation below shows how this works for the salamander data. I used uniform Beta(1, 1) priors and 0.5 as the starting values for both occupancy and detection. For the maximum step size, \(\delta\), I used 0.1 for both (though as we shall see, these are not the optimal values).

The general pattern is similar to the Gibbs sampler, but this time it takes several steps to move away from the starting point, and many of the steps are either vertical (constant \(\psi\)) or horizontal (constant \(\pi\)). You can download an MP4 version here.

Checking... and tuning

Just as with the Gibbs sampler, we discard the first values (burn-in):

psi <- psi[101:10100]
pi <- pi[101:10100]

We can look at trace plots for the Metropolis output and check the effective sample size, just as we did for the Gibbs output:

library(coda)
effectiveSize(psi)  # 366
effectiveSize(pi)   # 677

This is based on 10,000 iterations! For the Gibbs sampler, the effective sizes were 1,400 and 2,100. Effective samples of less than 1000 are scarcely adequate descriptions of the posterior distribution.

As the trace plots show, the sampler is moving in tiny steps, especially for \(\psi\). The autocorrelation is high, so effective sample sizes are small. We check the acceptance rates, the proportion of candidate values that are accepted:

mean(diff(psi) != 0)  # 0.80
mean(diff(pi) != 0)   # 0.60

Most of the candidate values are being accepted, which is a symptom of having a step size that is too small. With the Metropolis sampler we can control the step size. For the run we are looking at I used a maximum step size of 0.1, so the average step size will be half that, 0.05.

Step size up to 0.6

I reran the analysis with a maximum step size of 0.6. This gave acceptance rates of 0.27 for \(\psi\) and 0.12 for \(\pi\). Here are the trace plots:

Now the sampler takes big steps, but most of the time it isn't moving at all; it's trying to step too far and most of the candidate values are rejected. This also results in high autocorrelation and small effective sample sizes, now 788 and 512.

Tuning the Metropolis sampler

I did a series of runs with different combinations of step sizes and the one with the lowest autocorrelation and highest effective sample size used 0.3 for \(\psi\) and 0.2 for \(\pi\).

The effective sample sizes now are 1240 and 1420. The acceptance rates were 0.50 and 0.36. The optimum value for acceptance rates for updating parameters one at a time is 0.44, but rates between 0.15 and 0.50 are usable.

This process of adjusting the step size is know as tuning or adapting the sampler, and software such as WinBUGS or JAGS include an adaptation phase before starting the main Markov chain. A practical advantage of the Gibbs sampler is that it does not need tuning, as all values generated are accepted.

Metropolis sampler output

Now that we have an MCMC sample with reasonable effective size, we plot histograms and calculate means and credible intervals. If you have the wiqid package installed, you can do this:

library(wiqid)
par(mfrow=1:2)
plotPost(psi)
plotPost(pi)
par(mfrow=c(1, 1))

You can also look at correlations between the MCMC chains and plot a credible region, just as we did for the Gibbs sampler, with the same results to within Monte Carlo error.

R code for the Metropolis sampler

Here's the code for the Metropolis sampler for the salamanders analysis:

# The data:
y <- c(2, 1, 1, 4, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
n <- 5

# Function to calculate likelihood:
get_like <- function(psi, pi, y, n) {
  prod(psi * dbinom(y, n, pi) + ifelse(y==0, 1 - psi, 0))
}

# Priors: independent beta(1, 1) priors
priorPsi <- function(psi, shape1=1, shape2=1) dbeta(psi, shape1, shape2)
priorPi <- function(pi, shape1=1, shape2=1) dbeta(psi, shape1, shape2)
# Change shape1 = 1 and shape2 = 1 to match your prior distribution.

# Run the random walk sampler
nIter <- 10100               # number of iterations
stepmaxPsi <- 0.3            # maximum step size (after tuning)
stepmaxPi <- 0.2
psi <- pi <- numeric(nIter) # objects to hold results
psi[1] <-  pi[1] <- 0.5     # starting values
likelihood <- get_like(psi[1], pi[1], y, n)  # initial likelihood

for(i in 2:nIter) {
  # 1. Update psi:
  # Generate candidate value
  cand <- psi[i-1] + runif(1, -stepmaxPsi, stepmaxPsi)
  # If candidate value is outside [0,1], keep the old value of psi
  if (cand < 0 | cand > 1) {
    psi[i] <- psi[i-1]
  } else {
    # Calculate likelihood at candidate value
    candLlh <- get_like(cand, pi[i-1], y, n)
    # Calculate likelihood * prior at old value and candidate value
    jointOld <- likelihood * priorPsi(psi[i-1]) 
    jointCand <- candLlh * priorPsi(cand) 
    # Acceptance probability is ratio, or 1 if ratio > 1
    A <- min(1, jointCand / jointOld)
    # Decide whether to accept or not
    if(A > runif(1)) {   # if accepted
      psi[i] <- cand
      likelihood <- candLlh
    } else {
      psi[i] <- psi[i-1]
    }
  }
  
  # 1. Update pi:
  # Generate candidate value
  cand <- pi[i-1] + runif(1,  -stepmaxPi, stepmaxPi)
  # If candidate value is outside [0,1], keep the old value of pi
  if (cand < 0 | cand > 1) {
    pi[i] <- pi[i-1]
  } else {
    # Calculate likelihood at candidate value
    candLlh <- get_like(psi[i], cand, y, n)
    # Calculate likelihood * prior at old value and candidate value
    jointOld <- likelihood * priorPi(pi[i-1])
    jointCand <- candLlh * priorPi(cand)
    # Acceptance probability is ratio, or 1 if ratio > 1
    A <- min(1, jointCand / jointOld)
    # Decide whether to accept or not
    if(A > runif(1)) {   # if accepted
      pi[i] <- cand
      likelihood <- candLlh
    } else {
      pi[i] <- pi[i-1]
    }
  }
}
  
# Plot the results
# ----------------
plot(psi, pi, xlim=c(0.2, 1), ylim=c(0, 0.8), type='l', col='grey',
  xlab="occupancy, psi", ylab="detection, pi")
points(psi, pi, pch=16, cex=0.1)
points(psi[1], pi[1], pch=16, col='red')  # starting point

The code for everything on this page, including the animated plot, is here.

 

References

Metropolis N, Rosenbluth A, MN R, Teller E (1953). "Equation of State Calculations by Fast Computing Machines." Journal of Chemical Physics, 21, 1087-1092.

Updated 5 March 2015 by Mike Meredith