Royle-Nichols models for occupancy

In occupancy modelling we are interested in the probability that a species is present at a location, taking account of the possibility that it may be there but not detected. So we need to jointly estimate the probability of occupancy and the probability of detection. In many studies we would expect the probability of detection to be affected by the number of individuals at the location, so it would make sense to include that as a covariate in the analysis. But of course we don't know how many are present: if we did, we would not be trying to estimate occupancy. The Royle-Nichols model (Royle & Nichols 2003) treats the number available for detection at a site as a latent variable which can be estimated from detection/non-detection data.

In the simplest model, the number available for detection at site \( i \), \( N_i \), is modelled as being drawn from a Poisson distribution with parameter \( \lambda \). This is the biological model. The observation model assumes that detection of each individual is a Bernoulli trial, each individual being detected with probability \( r \). The data do not show which individuals were detected, just whether the species was detected: the species is recorded as detected if at least one individual is observed.

To get from the probability of detecting one individual, \( r \), to the probability of detecting the species, \( p \), think of the probability of non-detection: The probability of not detecting an individual is \( 1 - r \) and if \( N_i \) individuals are present, the probability of detecting none of them is \( (1 - r)^{N_i} \). Detecting "at least one" is equivalent to detecting "not none", and the probability of that is \( p = 1 - (1 - r)^{N_i} \).
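To make the relationship concrete, here is a quick R check (the values of \( r \) and \( N \) are purely illustrative):

```r
r <- 0.1            # probability of detecting one individual
N <- 0:4            # possible numbers of individuals at a site
p <- 1 - (1 - r)^N  # probability of detecting the species at least once
round(p, 3)
# 0.000 0.100 0.190 0.271 0.344
```

With \( N = 0 \) the species cannot be detected (\( p = 0 \)), and \( p \) increases with \( N \): sites with more individuals are easier to detect.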

The model parameters to be estimated are \( \lambda \) and \( r \). The probability of occupancy, \( \psi \), is derived from \( \lambda \):

\[ \psi = {\rm I\!P}(N_i > 0) = 1 - e^{-\lambda} \]
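This follows because, under the Poisson model, \( {\rm I\!P}(N_i = 0) = e^{-\lambda} \). A quick numerical check in R, using the value of \( \lambda \) from the simulations below:

```r
lambda <- 1.5
psi <- 1 - exp(-lambda)               # probability that a site holds at least one individual
round(psi, 3)                         # 0.777
all.equal(psi, 1 - dpois(0, lambda))  # TRUE: same as 1 minus the Poisson mass at zero
```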

As usual, we can model \( \lambda_i \) as a function of site covariates, and \( r_{it} \) as a function of site and survey covariates, but we assume that detection probability is the same for all individuals at site \( i \) and time \( t \) and detections are independent.

Maximum likelihood estimation for the Royle-Nichols occupancy model is available with the occSSrn function in the wiqid package and occuRN in unmarked. It was difficult to get the Bayesian version to work in WinBUGS: it could be done by working with \( r' = 1 - r \) and \( y' = 1 - y \) (i.e. \( y' = 1 \) if the species is not detected, and \( y' = 0 \) if it is detected).

Implementation in JAGS

Simulating data

Let's begin by simulating a data set to work with. That process itself will clarify how the model works.

# Parameters for simulated data:
nSite <- 200   # number of sites
nRep <- 10     # number of replicate surveys
lambda <- 1.5  # mean number of individuals at each site
r <- 0.1       # probability of detection of one individual

set.seed(2017)
# Biological process:
N <- rpois(nSite, lambda)  # Latent number available for detection
mean(N) # Mean N for this sample (not same as lambda)
[1]  1.36
mean(N > 0) # True occupancy for this sample
[1]  0.725

# Observation process:
p <- 1 - (1-r)^N
y <- rbinom(nSite, nRep, p) # number of occasions the species was detected at each site
head(y)
[1]  3 1 2 2 2 2
mean(y > 0)  # naive estimate of occupancy 
[1] 0.555
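The naive estimate is well below the true 0.725 because occupied sites are often missed in all nRep surveys. We can quantify this by computing, for each occupied site, the chance of at least one detection across the nRep surveys (the simulation code is repeated so this block is self-contained):

```r
set.seed(2017)
nSite <- 200; nRep <- 10; lambda <- 1.5; r <- 0.1
N <- rpois(nSite, lambda)
p <- 1 - (1 - r)^N          # per-survey probability of detecting the species
pStar <- 1 - (1 - p)^nRep   # probability of at least one detection in nRep surveys
mean(pStar[N > 0])          # about 0.79 for this simulated sample
```

Multiplying by the realised occupancy gives 0.725 x 0.79 = 0.57, in line with the naive estimate of 0.555 above.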

Maximum likelihood estimation

We can now fit the Royle-Nichols model with the occSSrn0 function in wiqid; occSSrn0 requires only the vector y and the number of replicates at each site, not the full detection/nondetection matrix, and runs quickly:

library(wiqid)
occSSrn0(y, n=rep(nRep, nSite))
Call: occSSrn0(y = y, n = rep(nRep, nSite))

Real values:
             est  lowCI  uppCI
psiHat    0.6830 0.5755 0.7856
lambdaHat 1.1488 0.8569 1.5401
rHat      0.1143 0.0859 0.1506

AIC: 1424.452 

The estimates of \( \lambda \) and \( \psi \) are a bit low for this realisation. We can compare with the standard psi(.)p(.) model:

occSS0(y, n=rep(nRep, nSite))
Call: occSS0(y = y, n = rep(nRep, nSite))

Real values:
          est  lowCI  uppCI
psiHat 0.6247 0.5402 0.7023
pHat   0.1969 0.1717 0.2247

AIC: 1431

The AIC value for this model is about 6.5 units higher, indicating that the Royle-Nichols model would give better predictions than the standard model.

The JAGS model

Here is the JAGS model I used. Note that in this model z[i] is only used to calculate the proportion of occupied sites in the sample, it does not appear in the calculation of probability of detection.

model {
  # Likelihood
  for (i in 1:nSite) {
    # Process model
    N[i] ~ dpois(lambda)
    z[i] <- step(N[i] - 1) # z = 1 if N > 0, i.e. the site is occupied

    # Observation model
    p[i] <- 1 - pow(1 - r, N[i])
    y[i] ~ dbin(p[i], nRep)
  }
  # Priors
  lambda ~ dunif(0, 10)
  r ~ dunif(0, 1)
	
  # Derived quantities
  psi.sample <- mean(z[])   # Proportion of occupied sites in the sample
  psi <- 1 - exp( -lambda)  # Prob(N > 0 | lambda), probability that a random site is occupied.
}

For this trial run I used a Uniform(0, 10) prior for \( \lambda \), and will check that the upper limit does not constrain the posterior. For a run with real data that you wished to report, it would be better to use a gamma prior with sensible parameters.
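For example, such a gamma prior would be specified like this in the JAGS code (the shape and rate values here are placeholders, to be chosen from prior knowledge of plausible abundances):

```
  lambda ~ dgamma(2, 1)  # prior mean = shape/rate = 2, SD = sqrt(shape)/rate = 1.4
```

Unlike the uniform, this places no hard upper limit on \( \lambda \), while still downweighting implausibly large values.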

This is saved in the file named "modelRN.txt" in the current working directory.

Running the model from R

library(jagsUI)
# The data
str(JAGSdata <- list(y = y, nRep = nRep, nSite = nSite))
List of 3
 $ y    : int [1:200] 3 1 2 2 2 2 0 2 1 0 ...
 $ nRep : num 10
 $ nSite: num 200
# Starting values
Nst <- as.numeric(y > 0)  # N must be >0 if y > 0, so start at 1
inits <- function() list(N = Nst, lambda=runif(1,0,3)) 
# Parameters to monitor
params <- c("lambda", "r", "psi", "psi.sample", "N")
# Takes < 2 mins on my laptop:
out <- jags(JAGSdata, inits, params, "modelRN.txt", DIC=FALSE,
  n.chains = 3, n.adapt=1000, n.burnin=0, n.iter=100000, parallel=TRUE)

Here are the first few rows of the output:

            mean    sd  2.5%   50% 97.5% overlap0 f Rhat  n.eff
lambda     1.183 0.184 0.884 1.163 1.600    FALSE 1    1  60856
r          0.114 0.016 0.083 0.114 0.146    FALSE 1    1  31678
psi        0.689 0.054 0.587 0.687 0.798    FALSE 1    1  36368
psi.sample 0.689 0.041 0.620 0.685 0.780    FALSE 1    1  58580
N[1]       2.110 0.932 1.000 2.000 4.000    FALSE 1    1  80386
N[2]       1.390 0.638 1.000 1.000 3.000    FALSE 1    1 300000
N[3]       1.703 0.807 1.000 2.000 4.000    FALSE 1    1 255018

For this data set, the estimates of \( \lambda \) and \( \psi \) are a bit low, as we saw for the ML estimates. I'd like to compare the estimates of \( N \) with the true values for each of the 200 sites, but plotting \( \hat{N} \) versus \( N \) (graph not shown) is not very informative because integers are involved. Instead, we'll tabulate the mean estimate \( \hat{N} \) for each of the true values of \( N \):

Nhat <- unlist(out$mean)[-(1:4)]
plot(jitter(N), Nhat, ylim=range(0, Nhat))
abline(0, 1, col='red')
abline(v=mean(N), h=mean(Nhat), lty=3)

# Get mean(Nhat) for each value of N:
table(N)
N
 0  1  2  3  4  6  7 
55 70 40 23 10  1  1 
tapply(Nhat, N, mean)
        0         1         2         3         4         6         7 
0.3663258 1.0717541 1.6843473 1.9255339 2.2463930 2.1042233 4.2685067 

There are only 2 sites with \( N > 4 \), so we shouldn't take too much notice of those estimates. The estimate for \( N = 0 \) is inevitably biased high (any values > 0 in the MCMC chain pull the posterior mean above zero), while those for \( N > 1 \) appear to be biased low, at least for this simulated sample.

Investigating accuracy

Probability of occupancy

I ran the simulation 1000 times (with different seeds) and estimated occupancy using both the RN model and the null model (the psi(.)p(.) model). Comparing the estimates with the data-generating value, \( \psi = 1 - e^{-\lambda} = 0.777 \), the null model showed a bias of -0.077 and the RN model was almost unbiased (0.006).

Root mean squared errors (RMSEs) calculated using the data-generating value were 0.087 for the null model and 0.053 for the RN model.

Some of the error here is sampling error: the proportion of occupied sites in the sample is not the same as the data-generating \( \psi \). Comparing estimates with the proportion in the sample, psiSample, seems a fairer way to test the two methods. Defining "error" as the difference between the estimate and psiSample gives RMSEs of 0.083 for the null model and 0.044 for the RN model.
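The bias and RMSE figures quoted here can be computed along these lines, where est stands for the vector of 1000 estimates from one model and target for either the data-generating \( \psi \) or psiSample (this is a sketch of the calculation with made-up numbers, not the full simulation code):

```r
bias <- function(est, target) mean(est - target)
rmse <- function(est, target) sqrt(mean((est - target)^2))

# toy example with three made-up estimates:
est <- c(0.70, 0.75, 0.80)
bias(est, 0.777)   # -0.027
rmse(est, 0.777)   # about 0.049
```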

While doing this, I used the loo package (Vehtari et al 2016) to estimate the looic and waic model selection criteria. There were a few issues with this, which I'll look at in a separate post, but in all 1000 simulations the RN model was preferred over the null model by both criteria.

I also ran the ML analyses, using the occSS0 and occSSrn0 functions in wiqid. These gave very similar results, as one would expect, since we are running the Bayesian analysis with flat priors. AIC for the RN model was lower than for the null model for 86% of the samples.

Now the acid test: how does the RN model perform when the probability of detection does not depend on the number available for detection, merely on whether that is > 0? I did 1000 simulations with data generated under the same biological model and a detection model with \( y_i \sim {\rm Binomial}(n_i, \, p z_i) \), where \( z_i = 1 \) when \( N_i > 0 \) and 0 otherwise, and \( p = 0.179 \), the expected value of \( 1 - (1 - r)^{N_i} \) for \( N_i > 0 \). Exactly the same code was used for the analysis. Now the null model has lower bias (0.003 vs 0.124) and smaller RMSE (0.031 vs 0.130) when compared with psiSample. But both looic and waic indicated that the RN model was better for all except 1 of the data sets. MLE analysis again gave very similar results, except that AIC was lower for the null model for 88% of the samples.
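For reference, the null-model data in this check can be generated as follows (a sketch with an arbitrary seed, not the seeds used in the 1000 simulations):

```r
set.seed(101)  # arbitrary seed, for illustration only
nSite <- 200; nRep <- 10; lambda <- 1.5
p <- 0.179                        # E[1 - (1 - r)^N | N > 0] for lambda = 1.5, r = 0.1
N <- rpois(nSite, lambda)         # same biological model as before
z <- as.numeric(N > 0)            # occupancy indicator
y <- rbinom(nSite, nRep, p * z)   # detection depends only on occupancy, not on N
```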

Based on these simulations, choosing the correct model gives more accurate results and lower bias. AIC picks the right model more than 80% of the time, but not always. LOO and WAIC were almost always lower for the RN model, even when the data came from the null model.

Numbers at each site

For the same 1000 simulations of the RN model, I checked the estimates of the number of individuals available for detection at each site, as I did above. All 1000 simulations had sites with values of \( N \) between 0 and 4, and 931 had sites with \( N = 5 \). Higher values were uncommon, so are not included here. The mean estimates across all the simulations are in the table below:

True N           0     1     2     3     4     5
Mean estimate 0.61  1.44  1.97  2.39  2.74  3.08

For this particular scenario, it seems that the estimates are more or less unbiased for \( N = 2 \), but are positively biased for \( N < 2 \) and negatively biased for \( N > 2 \). The results would certainly be different with other values for \( \lambda \) and \( r \) and for different numbers of replicates. But the take home message would likely be similar: (a) don't regard the output as unbiased estimates of abundance, or even "number of animals available for detection", at each location; but (b) the estimates may still be useful to investigate trends in abundance with habitat covariates.


References

Royle, J. A. and J. D. Nichols (2003). "Estimating abundance from repeated presence-absence data or point counts." Ecology 84(3): 777-790.

Vehtari, A., A. Gelman and J. Gabry (2016). "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC." Statistics and Computing. doi: 10.1007/s11222-016-9696-4.

Updated 27 November 2017 by Mike Meredith