Jottings: Goodness of Fit (GoF) assessment

Updated 5 June, more on Bayesian p-values and SCR. Started 28 May 2018.

GoF analysis is (well, should be) an important step in modelling, but it is often forgotten when the emphasis is on choosing the best model by information criteria (AIC, WAIC, etc). There is no reason to believe that the best model in the set proposed is any good.


Reviewers are increasingly asking for GoF information, and rightly so. In a recent study, count data were fitted with a Poisson regression. A reviewer asked for GoF, which turned out to be appallingly bad. A quasi-Poisson fit, however, was good and gave a more parsimonious (and sensible) best model.

For straightforward (Gaussian) linear models, the idea of "explained variance" (R²) is useful, but applying that to logistic or Poisson models is not so simple. Once you have hierarchical models, it becomes messy (but see Nakagawa, S. & Schielzeth, H. (2013) A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4, 133-142).

Basic approach

We compare the observed data with the values predicted by the fitted model (the "expected" values). The discrepancy is summarised with an omnibus statistic such as the χ² statistic or the Freeman-Tukey discrepancy. To assess the significance of the statistic (if that's the objective), we use a parametric bootstrap: generate a thousand or so simulated data sets from the fitted model and calculate the discrepancy statistic for each. If the discrepancy for the real data is typical of those from the simulated data, the model is a good fit. The GoF p-value is the proportion of simulated data sets with a discrepancy greater than that observed.
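As a concrete sketch of the bootstrap, here is a minimal Python version for a Poisson model with the Pearson χ² discrepancy, assuming the fitted means are already available; the function and variable names are illustrative, not from any package:

```python
import math, random

def rpois(lam, rng):
    # Knuth's algorithm: a simple Poisson draw, fine for small lambda
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def chisq_disc(obs, exp):
    # Pearson chi-square discrepancy between observed and expected counts
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def boot_gof_p(obs, exp, nsim=1000, seed=42):
    # Parametric bootstrap: simulate data sets from the fitted model (here,
    # independent Poisson counts with means 'exp') and see how often the
    # simulated discrepancy reaches the observed one
    rng = random.Random(seed)
    d_obs = chisq_disc(obs, exp)
    d_sim = [chisq_disc([rpois(e, rng) for e in exp], exp)
             for _ in range(nsim)]
    # GoF p-value: proportion of simulated discrepancies >= observed
    return sum(d >= d_obs for d in d_sim) / nsim
```

A small p-value means the real data are more discrepant than almost all data sets the fitted model itself generates, ie, a poor fit.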

We can run a similar check with an MCMC analysis, generating posterior predictive 'data' at each step of the MCMC process and calculating discrepancies for actual and predicted data. The series of discrepancies can be compared with a scatter plot and a "Bayesian p-value" (originally "posterior predictive p-value", see Gelman, A., Meng, X.-L., & Stern, H. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-807). See for example Kéry & Schaub (2012) pp.225-226 and 236; King et al (2010) pp.138-140. The jagsUI package has a function, pp.check, to harvest and display the results of this.
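The Bayesian p-value itself is just the proportion of MCMC steps at which the discrepancy for the replicated data exceeds that for the actual data; a minimal sketch (the names are illustrative):

```python
def bayesian_p(d_obs, d_rep):
    # d_obs[s] and d_rep[s] are the discrepancies for the actual and the
    # replicated data at MCMC step s. Values near 0.5 indicate a good fit;
    # values near 0 or 1 indicate a poor fit.
    return sum(r > o for r, o in zip(d_rep, d_obs)) / len(d_obs)
```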

Kéry & Royle (2016, AHM book, pp.192-198) have an example comparing Bayesian and bootstrap methods. Royle et al (2014, SCR book, pp.232-241) give an extensive discussion based on Bayesian p-values - see below.

In a Bayesian framework, we would not simply look at this single discrepancy measure and declare victory if p > 0.05 (telling us that we can't reject the null hypothesis that the discrepancy is zero...). We can make full use of these posterior predictions to investigate the model fit and try to improve the model.

Bayesian p-values and occupancy models

For an occupancy model without survey-specific covariates, the code to generate the posterior predictive data might look like this:

y[i] ~ dbinom(p * z[i], n[i])      # observed detections at site i
yPred[i] ~ dbinom(p * z[i], n[i])  # posterior-predictive replicate

where y[i] is the number of occasions out of n[i] that the species was detected at site i, z[i] is an indicator of occupancy, 1 if the site is occupied, p is the probability of detection on one occasion if the site is occupied, and yPred[i] is the posterior prediction.

For sites where y[i] > 0 (ie, the species was detected), z[i] == 1, so yPred[i] is based on the knowledge that the site is occupied, and the value of psi[i] (the probability that site i is occupied given covariates) doesn't enter into the calculation. If the site is occupied, we don't even look at the model prediction. This cannot be a good test of the adequacy of the occupancy part of the model. For sites where y[i] = 0 (ie, the species was not detected), z[i] flips between 0 and 1 in proportion to the conditional (on the data) probability of occupancy, which is not what we want either.
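To see how z[i] behaves at sites with no detections: conditional on y[i] = 0, the probability that z[i] = 1 is psi(1-p)^n / (psi(1-p)^n + 1 - psi). A quick check of that formula (illustrative code, not from any package):

```python
def pr_occupied_given_no_detection(psi, p, n):
    # P(z = 1 | y = 0): occupied but missed on all n occasions,
    # relative to the total probability of zero detections
    missed = psi * (1 - p) ** n
    return missed / (missed + 1 - psi)
```

With psi = 0.5, p = 0.3 and n = 3 occasions this gives about 0.26, so for such a site z[i] = 1 at roughly a quarter of the MCMC steps.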

Bayesian p-values and SCR models

Royle et al (2014, SCR book, pp.232-241) consider only this method of assessing fit. Individual binary 0/1 data points \(y_{ijk}\) contain insufficient information to allow fit assessment, so they suggest aggregating over 1 or 2 dimensions. They give 3 options and apply them to the fisher data:

  • individual x trap frequencies (aggregated over occasions, \(y_{ij.}\));
  • individual encounter frequencies (aggregated over occasions and traps, \(y_{i..}\));
  • trap frequencies (aggregated over animals and occasions, \(y_{.j.}\)).

For each, they calculate an expected value (estimated as p*n) and obtain Freeman-Tukey discrepancies between observed and expected. They also draw replicate data and get the Freeman-Tukey discrepancy between that and expected. Discrepancies are summed for each statistic, giving a pair of values (observed and replicated) at each step of the MCMC chain.
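For reference, the Freeman-Tukey discrepancy and the three aggregations can be computed along these lines (a sketch; a real analysis would use R or numpy arrays):

```python
import math

def freeman_tukey(obs, exp):
    # Sum of (sqrt(observed) - sqrt(expected))^2 over all cells
    return sum((math.sqrt(o) - math.sqrt(e)) ** 2 for o, e in zip(obs, exp))

def aggregate(y):
    # y[i][j][k]: 0/1 detection of animal i in trap j on occasion k;
    # returns y_ij (summed over occasions), y_i (over occasions and traps)
    # and y_j (over animals and occasions)
    n_i, n_j = len(y), len(y[0])
    y_ij = [[sum(y[i][j]) for j in range(n_j)] for i in range(n_i)]
    y_i = [sum(row) for row in y_ij]
    y_j = [sum(y_ij[i][j] for i in range(n_i)) for j in range(n_j)]
    return y_ij, y_i, y_j
```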

I tried this for a data set where males and females had very different p0 and sigma (differing by a factor of 2.5), with p0(sex) sigma(sex) having lower AICc than p0(.) sigma(.). Bayesian p-values indicated that the p0(.) sigma(.) model was entirely adequate, and there was hardly any difference between the p-values for the 2 models. Further investigation showed that a behavioural response model, p0(sex+bk) sigma(sex), had an even lower AICc (deltaAICc = 11), so the null model really was not adequate. Neither was the Bayesian p-value as an assessment tool!

Assessing different aspects of the model

The simplest approach is to "eyeball" posterior predictive plots superimposed on the data. The plotPostPred function in the BEST package (Kruschke 2013) plots a bundle of curves showing the estimated distribution (in this case a t-distribution) together with a histogram of the data. In a regression setting, posterior predictions of the response for a series of covariate values are added to the scatterplot of the data (Kruschke 2014, DBDA2, Fig.2.6 p.29). It's important that such plots present the uncertainty in the posterior estimates and that is easily done.

Gelman et al (2014, BDA, section 6.3) suggest how simulated data can be used to check minimum and maximum values in the data set, and Gelman & Hill (2007, pp.158ff) extend this to checks for zero counts.
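These checks fit a common pattern: pick a test statistic, compute it for the actual data and for each simulated replicate, and see where the observed value falls. A minimal sketch with hypothetical names:

```python
def ppp_value(stat, y_obs, y_reps):
    # Proportion of replicate data sets whose statistic is at least as
    # large as the observed one; values near 0 or 1 flag that aspect of
    # the model as suspect
    t_obs = stat(y_obs)
    return sum(stat(y) >= t_obs for y in y_reps) / len(y_reps)

def n_zeros(y):
    # Gelman & Hill's zero-count check for count data
    return sum(v == 0 for v in y)
```

Passing `max` or `min` as the statistic gives the BDA-style checks of extreme values; `n_zeros` gives the zero-count check.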

Closer to our own field, Ergon & Gardner (2014) simulate posterior predictive data and use it to check different aspects of the model:

  • spatial capture-recapture model - number of unique traps each animal was caught in,
  • survival model - number of individuals later captured,
  • dispersal model - changes in mean location of capture across primary sessions.

They simulated the data sets in R after fitting the model in JAGS, with a function that can be used either with values from the posterior distribution of parameters (single rows from the MCMC chain after converting to a matrix) or with specific user-chosen values.

Although an interesting starting point, theirs is a CJS model (only tracking survival of marked animals) and is a monster with 11 parameters to estimate.

A workable GoF strategy for open cucumber models

  • Use the actual habitat mask, trap locations and trap usage data.
  • Use posterior draws for Nsuper, recruitment, survival, p0 and sigma.
  • Assign random ACs within the habitat mask, and alive/dead status using recruitment and survival.
  • Get capture histories as usual.
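The steps above might be sketched as follows. This is illustrative Python only (a real implementation would use R with the actual mask and trap usage data), and all the names - Nsuper, beta, phi, p0, sigma - are placeholders for whatever parametrisation the fitted model actually uses; beta here is simplified to per-session entry probabilities:

```python
import math, random

def sim_ppd_scr(draw, mask_pts, traps, n_occ, rng):
    # One posterior-predictive data set from a single posterior draw.
    # draw: dict with Nsuper (superpopulation size), beta (per-session
    # entry probabilities), phi (survival), p0 and sigma (half-normal
    # detection); mask_pts and traps are lists of (x, y) coordinates.
    n_sess = len(draw["beta"])
    data = []
    for _ in range(draw["Nsuper"]):
        ac = rng.choice(mask_pts)  # random activity centre in the mask
        # entry session, then survive session-to-session with prob phi
        s = rng.choices(range(n_sess), weights=draw["beta"])[0]
        alive = [False] * n_sess
        alive[s] = True
        while s + 1 < n_sess and rng.random() < draw["phi"]:
            s += 1
            alive[s] = True
        # capture history: half-normal detection around the activity centre
        y = [[[0] * n_occ for _ in traps] for _ in range(n_sess)]
        for t in range(n_sess):
            if not alive[t]:
                continue
            for j, (tx, ty) in enumerate(traps):
                d2 = (ac[0] - tx) ** 2 + (ac[1] - ty) ** 2
                p = draw["p0"] * math.exp(-d2 / (2 * draw["sigma"] ** 2))
                for k in range(n_occ):
                    if rng.random() < p:
                        y[t][j][k] = 1
        data.append(y)
    return data
```

Repeating this for each posterior draw gives the set of replicate data sets to compare with the real capture histories.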

For the closed SCR model for the data set with big male/female differences, I tried the following checks, comparing the actual data with 1000 simulated posterior-predictive data sets:

  1. animals captured,
  2. total captures,
  3. maximum captures per trap for an individual animal, \(\max(y_{ij.})\),
  4. maximum captures per animal, \(\max(y_{i..})\),
  5. number of traps visited by animals, \(\sum(y_{.j.} > 0)\),
  6. maximum captures per trap, \(\max(y_{.j.})\).
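For a single-session data set, all six statistics can be computed directly from the capture array (a sketch with illustrative names):

```python
def scr_checks(y):
    # y[i][j][k]: 0/1 capture of animal i in trap j on occasion k
    n_i, n_j = len(y), len(y[0])
    y_ij = [[sum(y[i][j]) for j in range(n_j)] for i in range(n_i)]
    y_i = [sum(row) for row in y_ij]
    y_j = [sum(y_ij[i][j] for i in range(n_i)) for j in range(n_j)]
    return {
        "animals captured": sum(tot > 0 for tot in y_i),
        "total captures": sum(y_i),
        "max captures per trap per animal": max(map(max, y_ij)),
        "max captures per animal": max(y_i),
        "traps visited": sum(tot > 0 for tot in y_j),
        "max captures per trap": max(y_j),
    }
```

Applying the same function to the real data and to each simulated data set gives one p-value per check.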

Again there was no difference between the p0(.) sigma(.) and p0(sex) sigma(sex) models. However, check #3 showed a big difference between actual and simulated data: in the actual data, no animal was caught more than once in a specific trap, something that occurred in only 2% of the simulated data sets. This subsequently led to investigation of the p0(sex+bk) sigma(sex) model.

Some specific checks for male and female data would likely be needed, but in this case it would be complicated as the sex of one captured individual was unknown.

I used a similar procedure for a very long study (where closure is questionable) with the period divided into sessions. The above checks showed adequate fit, but a "naïve survival" check - the number of individuals captured again later - for a conventional closed model picked up the fact that survival was less than 1.
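The "naïve survival" statistic can be as simple as counting the animals captured again after their first session (an illustrative sketch):

```python
def naive_survival_stat(seen):
    # seen[i][t]: 1 if animal i was captured in session t; counts the
    # animals captured again in some session after their first capture
    count = 0
    for row in seen:
        sessions = [t for t, v in enumerate(row) if v]
        if sessions and sessions[-1] > sessions[0]:
            count += 1
    return count
```

Compare the observed count with the same statistic from data simulated under the closed model; a markedly lower observed value points to mortality or emigration during the study.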