Cross-Validation for Multi-Species Occupancy Models

The issue of model selection for complex Bayesian models is still very much open, so I was interested in a paper by Kristen Broms, Mevin Hooten and Ryan Fitzpatrick (2016, Ecology). There is good stuff there about using information criteria to compare models, but in this post I want to focus on their use of cross-validation. Most of the goodness-of-fit measures we use do not work for individual trials, ie, 1/0 data, and trials need to be aggregated so that the data are of the form: y successes in n trials. The scoring methods used by Broms et al do work with single trials.

They provided the code in their supplementary materials, but that is tailored to their study of 26 species of fish at 113 sites in a river basin in Colorado. This is an attempt to simplify it and make it easier for other researchers to adapt it to their own study.

This assumes that you have read the original paper and the supplements for details of the study and the mathematical approaches. I will not repeat those here.

The data

The data are provided in the supplementary materials to the paper. You can download a reformatted version with:

load(url("http://mikemeredith.net/data/fishData.RData"))
ls()
README
str(Y)
str(Psi.cov)
str(X)

Here Y is a 3-dimensional array, species x sites x surveys, with 1 where the species was detected, otherwise 0. The number of surveys differed among sites, up to a maximum of maxSurveys = 24. The vector nSurveys gives the number of surveys for each site. The Y array has NA where fewer than 24 surveys were conducted.

The number of species, nSpecies = 26, is the number of native species known to occur in the river basin, though two of these were not caught during the surveys.

Psi.cov is a matrix with the sites as rows and columns for site covariates. The first column, headed int, is a column of ones which will correspond to the intercept in the analysis. The remaining columns correspond to the vectors prefixed Psi.* in the original data.

X is a 3-dimensional array, sites x surveys x variables, with the covariates used for detection probability. X[, , 1] contains ones and will correspond to the intercept; the remainder are the matrices named X.* in the original data. Again, NAs appear where fewer than 24 surveys were conducted.

Your data will be quite different, but you should be able to wrangle them into similar matrices and arrays.
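
As a rough sketch, assembling the inputs for your own study might look like the code below. The objects siteCovs (a data frame of site covariates), effortMat (a sites x surveys matrix with a survey-level covariate) and nDetCovs are hypothetical placeholders, not part of the fish data.

# Sketch only: build a site-covariate matrix with an intercept column...
Psi.cov <- cbind(int = 1, as.matrix(scale(siteCovs)))       # sites x covariates

# ...and a sites x surveys x covariates array for detection, intercept first
X <- array(NA, dim = c(nSites, maxSurveys, 1 + nDetCovs))
X[, , 1] <- 1          # intercept layer
X[, , 2] <- effortMat  # one survey-level covariate; NA where no survey was done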

The models

The data file, "fishData.RData", contains lists specifying the covariates in the five models used by Broms et al (2016), which you can display with:

modelPsi
modelP

I used the covariate names for these model definitions, although just using the column numbers would also work, at least until you modify the matrix or array by adding or removing columns. I think names are less likely than numbers to lead to undetected errors.
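
For your own study you would set up something analogous: a named list with one character vector of column names per model, each starting with "int". The covariate names below are invented for illustration; inspect modelPsi and modelP for the actual definitions used by Broms et al.

# Hypothetical model definitions (not the ones used in the paper)
myModelPsi <- list(
  null = c("int"),
  full = c("int", "elev", "flow"))
myModelP <- list(
  null = c("int"),
  full = c("int", "effort"))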

The JAGS file

We'll write a single JAGS file to do all the analyses. The model implemented and the fold will be determined by passing a subset of the Y, Psi.cov and X objects to JAGS as data.

The data block

To simplify the data input, we'll use the data block in the JAGS file to calculate the number of sites and coefficients for each run. Code in the data block is run once only, at the start of the model creation process. See the JAGS User Manual for details.

data {
  # set up the constants used in the model.
  Dpsi <- dim(Psi.cov)
  nPsiCoef <- Dpsi[2]
  Dp <- dim(Det.cov)
  nDetCoef <- Dp[3]
  DY <- dim(Y)
  nPossSpp <- DY[1]
  nSites <- DY[2]

  muPsiCoef.tau <- 1 / 2.25 ^ 2
  muDetCoef.tau <- 1 / 2.25 ^ 2
  t.tau <- 1 / 2.25 ^ 2
}

The model code

Here's the whole code block for the model. This does not use data augmentation and limits the number of species to 26. We'll add explanations afterwards.

model {
  for (i in 1:nPossSpp) {
    ### LIKELIHOOD
    # Ecological model
    for (j in 1:nSites) {
      logit(psi[i, j]) <- sum(psiCoef[i, ] * Psi.cov[j, ]) # same code for all models
      Z[i, j] ~ dbern(psi[i, j])
      # Observation model
      for (k in 1:nSurveys[j]) {
        logit(detect[i, j, k]) <- sum(detCoef[i, ] * Det.cov[j, k, ])
        true.detect[i, j, k] <- detect[i, j, k] * Z[i, j]
        Y[i, j, k] ~ dbern(true.detect[i, j, k])
      }
    }

    ### PRIORS for coefficients for each species
    for (m in 1:nPsiCoef){
      psiCoef[i, m] ~ dnorm(muPsiCoef[m], tauPsiCoef[m])
    }
    for (m in 1:nDetCoef){
      detCoef[i, m] ~ dnorm(muDetCoef[m], tauDetCoef[m])
    }
  }

  ### HYPER-PRIORS for coefficients
  for(m in 1:nPsiCoef){
    muPsiCoef[m] ~ dnorm(0, muPsiCoef.tau)
    halfsigmaPsiCoef[m] ~ dt(0, t.tau, 1)      # Cauchy distribution
    sigmaPsiCoef[m] <- abs(halfsigmaPsiCoef[m])  # folded or half-Cauchy
    tauPsiCoef[m] <- pow(sigmaPsiCoef[m], -2)
  }
  for(m in 1:nDetCoef){
    muDetCoef[m] ~ dnorm(0, muDetCoef.tau)
    halfsigmaDetCoef[m] ~ dt(0, t.tau, 1)
    sigmaDetCoef[m] <- abs(halfsigmaDetCoef[m])
    tauDetCoef[m] <- 1 / (sigmaDetCoef[m] * sigmaDetCoef[m])
  }
}

The likelihood is standard occupancy stuff, with an ecological model for the latent occupancy state, Z, and an observation model which accounts for probability of detection.

We use vector multiplication with the vector of coefficients for the species and the vector of covariate values for the site or survey. The first element in the covariate vector is always 1 and the first coefficient is thus the intercept. This code will work with covariate matrices with different numbers of columns for the different models.

The priors for the coefficients for each species are normal with mean and SD to be estimated. Priors for the SDs (ie, the hyperpriors) use a half-Cauchy distribution.
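
If you want to see what that hyperprior looks like: dt(0, t.tau, 1) with the sign folded corresponds to a half-Cauchy with scale 2.25 (from t.tau in the data block). A quick plot in R, just as a check, not part of the analysis:

# Density of a half-Cauchy with scale 2.25: a Cauchy folded at zero,
# so the density on the positive half-line is doubled
curve(2 * dcauchy(x, location = 0, scale = 2.25), from = 0, to = 20,
      xlab = "sigma", ylab = "prior density")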

We save the data block and the model code to a file named "MSOM_KnownSps.jags".
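
One way to do that from R, assuming the data block and model code above have been pasted into a single character string called modelText (a placeholder name):

# Write the JAGS code to the file used in the jags() calls below
cat(modelText, file = "MSOM_KnownSps.jags")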

Running the analysis

We can run all the models to check convergence before starting the cross-validation. The following code will do this; it took 3.3 hrs on my 4-core laptop. If you have more cores, you will likely want to run the models in parallel. See the next section for code to do this.

Note that we provide starting values for Z, as usual, and also for the hyper-parameter SDs to keep them below 5, as random draws from a Cauchy distribution can be very large.

library(jagsUI)
wanted <- c("muPsiCoef", "sigmaPsiCoef",
            "muDetCoef", "sigmaDetCoef",
            "detCoef", "psiCoef")
modelOut <- list()
for(i in seq_along(modelPsi)) {
  inits <- function() list(Z = matrix(1, nSpecies, nSites),
       halfsigmaPsiCoef = runif(length(modelPsi[[i]]), 0, 5),
       halfsigmaDetCoef = runif(length(modelP[[i]]), 0, 5))

  jdata <- list(Y = Y, nSurveys = nSurveys,
   Psi.cov = Psi.cov[, modelPsi[[i]]], Det.cov = X[, , modelP[[i]]])

  modelOut[[i]] <- jags(jdata, inits, wanted, "MSOM_KnownSps.jags",
    n.chains=3, n.adapt=1000, n.iter= 5000, parallel=TRUE)
}
for(i in seq_along(modelOut))
  wiqid::diagPlot(modelOut[[i]], param=1:18, main=names(modelPsi)[i])

Running cross-validation

Dividing the data set into folds

We'll do 5-fold cross-validation, dividing the 113 sites into 5 groups, called "folds". One fold is held out and the model is fitted to the remaining folds; the parameter estimates are then used to make predictions for the held-out sites, which are compared with the actual observations. Each site is allocated to one fold, which we can do with the following code (no extra packages needed):

nFolds <- 5
( foldSize <- nSites %/% nFolds + 1 )
tmp <- matrix(0, nFolds, foldSize)
set.seed(123) # note: the default RNG for sample.int changed in R 3.6.0, so results differ between R versions
for(i in 1:foldSize)
  tmp[, i] <- sample.int(nFolds)
foldID <- as.vector(tmp)[1:nSites]
table(foldID)

The method used here is not the same as in the paper. Each fold gets one randomly-selected site from among the first five in the data, then one from the second five, one from the third five, and so on. If the data are arranged in groups or in a logical order (first to last or west to east, say), this gives more representative folds than pure random allocation (sketched below).
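
For comparison, pure random allocation of sites to folds would be a one-liner:

# Purely random allocation (not used here), shown only for comparison
foldID.random <- sample(rep(1:nFolds, length.out = nSites))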

Scoring the fit of the predictions to the observed data

Broms et al used 4 methods for scoring the difference between the predictions and the observations. See the paper for details. The 0-1 loss compares the observations to 0-1 predictions while the others compare with a probability. We prepare simple functions to calculate the different scores:

# probs : an array of probabilities
# preds : an array of 1/0 predictions
# bool  : a matching array of observed TRUE/FALSE or 1/0 values

# output: scalar, the sum or mean score

# 0-1 loss function
score01 <- function(preds, bool)
  sum(abs(preds - bool), na.rm=TRUE)

# Area under ROC curve
auroc <- function(probs, bool) {
  ok <- is.finite(bool) & is.finite(probs)  # remove NAs
  bool <- bool[ok]
  probs <- probs[ok]
  (mean(rank(probs)[bool==1]) - (sum(bool)+1)/2) / sum(bool==0)
}

# Brier score
brier <- function(probs, bool) {
  sum(bool * (1 - probs)^2 + (1 - bool) * probs^2, na.rm=TRUE)
}

# Log(likelihood) score for Bernoulli trials
loglikBern <- function(probs, bool) {
  sum(dbinom(bool, 1, probs, log=TRUE), na.rm=TRUE)
}
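
A quick check with toy values (invented here, just to confirm that the functions accept single-trial 1/0 data):

# Six Bernoulli trials with made-up probabilities and outcomes
probs <- c(0.9, 0.7, 0.4, 0.2, 0.1, 0.6)
obs   <- c(1,   1,   0,   0,   0,   1)
preds <- rbinom(length(probs), 1, probs)  # simulated 1/0 predictions

score01(preds, obs)     # number of mismatches, smaller is better
auroc(probs, obs)       # 1 = perfect ranking, 0.5 = no better than chance
brier(probs, obs)       # smaller is better
loglikBern(probs, obs)  # larger (closer to zero) is better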

In our case, the probability we want is the unconditional probability of detection. The usual probability of detection in occupancy models (detect in the code above) is conditional on the site being occupied. We assume that if the site is not occupied the species will not be detected, ie, no false positives. For scoring, we need the unconditional probability of detection, ucpd, ie, the probability that the site is occupied and the species is detected. Hence ucpd = psi * detect. Broms et al refer to this as integrated.probability.

(What's the probability of detecting a polar bear in downtown New York? Answers: (1) Practically zero, polar bears don't occur in New York. (2) Practically one, polar bears are big, white animals, easily detected; also scary, folks will be running and screaming. Both are right: (2) is the conditional probability of detection, which ignores the (tiny) probability that the bear would be there; (1) is the unconditional probability of detection.)

Setting up the parallel processing

We have five models and we need to run each five times, once for each fold. The 25 model runs are independent, so can easily be run in parallel.

  • With just 4 cores (my laptop), run the folds in series but run the 3 chains in JAGS in parallel.
  • If you have 8 cores (my old desktop), you can run each fold on a different core, but don't use parallel=TRUE in the call to JAGS.
  • With 20 cores or more (my new desktop), run the folds in parallel and the chains in parallel too.

The code below sets up parallel processing using foreach for my 24-core desktop. Each fold runs in a separate process, and each of these spawns 3 new processes for the 3 chains, so we have 20 processes in all.

library(parallel)
library(foreach)
library(doParallel)
detectCores()  # On my big desktop: 24
ncores <- 5  # 5 folds running in parallel assuming >5 cores.
  # If >= 15 cores, run 3 parallel chains per fold, otherwise do in serial.
cl <- makeCluster(ncores)
registerDoParallel(cl)

Running the models and estimating the scores

On my desktop I used an outer foreach loop for each of the folds, and an inner for loop for the 5 models; this took about 15 mins with n.iter=500 and 1.8 hrs with n.iter=5000. For the laptop, I ran the folds in series instead of parallel by changing %dopar% to %do%; with n.iter=500 it took 1.8 hrs.

system.time(
result <- foreach(fold = 1:5, .errorhandling='pass', .packages=c("jagsUI")) %dopar% {

  TEST <- foldID == fold  # Just for readability
  TRAIN <- foldID != fold
  
  # matrix of results for this fold
  CVscores <- matrix(fold, nrow=length(modelPsi), ncol=5)
  rownames(CVscores) <- names(modelPsi)
  colnames(CVscores) <- c("fold", "score01", "Log_score", "AUC", "Brier")

  # Run all 5 models
  for(m in seq_along(modelPsi)) {

    inits <- function() list(Z = matrix(1, nSpecies, sum(TRAIN)),
         halfsigmaPsiCoef = runif(length(modelPsi[[m]]), 0, 5),
         halfsigmaDetCoef = runif(length(modelP[[m]]), 0, 5))

    jdata <- list(Y = Y[, TRAIN, ], nSurveys = nSurveys[TRAIN],
      Psi.cov = Psi.cov[TRAIN, modelPsi[[m]]], Det.cov = X[TRAIN, , modelP[[m]]])

    wanted <- c("psiCoef", "detCoef") # all we need for score calculation

    out <- jags(jdata, inits, wanted, "MSOM_KnownSps.jags", codaOnly=wanted,
        # n.chains=3, n.adapt=1000, n.iter= 5000, n.thin=5, parallel=TRUE)
        n.chains=3, n.adapt=100, n.iter= 500, parallel=TRUE)
           # only use parallel=TRUE if you have enough cores

    # Work through the MCMC chains to calculate psi, detect, and ucpd,
    #  plus predictions for Z and Y.

    nIter <- length(out$sims.list$deviance)
    fPsi.cov <- Psi.cov[TEST, modelPsi[[m]]] # covariate matrices for this fold and model
    fDet.cov <- X[TEST, , modelP[[m]]]
    sc01 <- llk <- AUC <- br <- numeric(nIter)   # objects to hold the output
    psi <- Zpred <- matrix(NA, nSpecies, sum(TEST))
    detect <- true.detect <- Ypred <- ucpd <- array(NA, c(nSpecies, sum(TEST), maxSurveys))

    for(iter in 1:nIter) {
      psiCoef <- out$sims.list$psiCoef[iter,,]
      detCoef <- out$sims.list$detCoef[iter,,]
      # The code here mimics the JAGS code:
      for (i in 1:nSpecies) {
        # ecological model
        for (j in 1:sum(TEST)) {
          psi[i, j] <- plogis(sum(psiCoef[i, ] * fPsi.cov[j, ]))
          Zpred[i, j] <- rbinom(1, 1, psi[i, j])
          # observation model
          for (k in 1:nSurveys[TEST][j]) {
            detect[i, j, k] <- plogis(sum(detCoef[i, ] * fDet.cov[j, k, ]))
            true.detect[i, j, k] <- detect[i, j, k] * Zpred[i, j]
            Ypred[i, j, k] <- rbinom(1, 1, true.detect[i, j, k])
            ucpd[i, j, k] <- detect[i, j, k] * psi[i, j]
          }
        }
      }
      sc01[iter] <- score01(Ypred, Y[, TEST, ])
      llk[iter] <- loglikBern(ucpd, Y[, TEST, ])
      AUC[iter] <- auroc(ucpd, Y[, TEST, ])
      br[iter] <- brier(ucpd, Y[, TEST, ])
    } # end of MCMC chains

    CVscores[m, 'score01'] <- mean(sc01)
    CVscores[m, 'Log_score'] <- -2 * mean(llk)
    CVscores[m, 'AUC'] <- mean(AUC)
    CVscores[m, 'Brier'] <- mean(br)
  } # end "Run all 5 models"


  CVscores
}  )  # end of fold processing

result
tmp <- simplify2array(result)
apply(tmp, 1:2, mean) # Compare with Broms et al Table 2
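
Once everything has finished, shut down the worker processes started earlier:

stopCluster(cl)  # release the cluster created with makeCluster()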

The first part of the code is similar to the code in the "Running the analysis" section above, except that we only pass the data for the training sites to JAGS and we only harvest the coefficients, which we need to calculate the scores.

The post-processing uses nested loops. It would be more efficient and faster to use R's functions for handling matrices and arrays, but they can take a lot of memory. Working patiently through the MCMC iterations one by one keeps memory demands within reason; it needs about 600 MB of memory. Using the same nested loops as the JAGS model makes it easy to see that we are doing the same calculations.

We simulate predicted values of Y for the test sites, Ypred, and compare these with the actual values to obtain the 0-1 score.

The unconditional probability of detection, ucpd, is compared with the actual values to obtain the log-likelihood, AUC and Brier scores.

        fold score01 Log_score    AUC   Brier
limited    3 600.22     2032.4  0.889  309.80
mid1       3 560.98     1955.5  0.902  295.00
mid2       3 508.32     1913.9  0.913  272.94
full       3 504.76     1946.3  0.911  277.02
extra      3 602.21     2038.0  0.888  311.29

We don't get the same values as in Table 2 of the paper (indeed, I don't get the same values every time I run this code), but they are in the same ballpark and lead to the same conclusions: the 0-1 score is best (lowest) for the FULL model, while the other three scores are best for the MIDDLE-2 model.


I hope this code will be useful if you want to do cross-validation with multi-species occupancy models with survey-level covariates, where you have to work with the individual 1/0 data points. (It's different if you don't have survey-level covariates and can aggregate across occasions to get a number of detections out of a number of occasions.) A big thank you to Kristen Broms, Mevin Hooten and Ryan Fitzpatrick for the original code in the publication.

Updated 18 Aug 2019 by Mike Meredith