Jottings: Goodness of Fit (GoF) assessment 

Updated 5 June, more on Bayesian p-values and SCR.
Started 28 May 2018. GoF analysis is (well, should be) an important step in modelling, but is often forgotten when the emphasis is on choosing the best model using information criteria (AIC, WAIC, etc.). There is no reason to believe that the best model in the set proposed is any good.
Reviewers are increasingly asking for GoF information, and rightly so. In a recent study, count data were fitted with a Poisson regression. A reviewer asked for GoF, which turned out to be appallingly bad. A quasi-Poisson fit however was good, and gave a more parsimonious (and sensible) best model.

For straightforward (Gaussian) linear models, the idea of "explained variance" (R^{2}) is useful, but applying that to logistic or Poisson models is not so simple. Once you have hierarchical models, it becomes messy (but see Nakagawa, S. & Schielzeth, H. (2013) A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4, 133-142).

Basic approach

We compare the observed data with the values predicted by the fitted model (the "expected" values). The discrepancy is summarised with an omnibus statistic such as the χ^{2} statistic or the Freeman-Tukey discrepancy. To assess the significance of the statistic (if that's the objective), we use a parametric bootstrap: generate a thousand or so simulated data sets from the fitted model and calculate the discrepancy statistic for each. If the discrepancy for the real data is typical of those from the simulated data, the model is a good fit. The GoF p-value is the proportion of simulated data sets with a discrepancy greater than that observed.

We can run a similar check with an MCMC analysis, generating posterior predictive 'data' at each step of the MCMC process and calculating discrepancies for actual and predicted data. The series of discrepancies can be compared with a scatter plot and a "Bayesian p-value" (originally "posterior predictive p-value", see Gelman, A., Meng, X.L., & Stern, H. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-807). See for example Kéry & Schaub (2012) pp. 225-226 and 236; King et al (2010) pp. 138-140. Kéry & Royle (2016, AHM book, pp. 192-198) have an example comparing Bayesian and bootstrap methods. Royle et al (2014, SCR book, pp. 232-241) give an extensive discussion based on Bayesian p-values - see below.

In a Bayesian framework, we would not simply look at this single discrepancy measure and declare victory if p > 0.05 (telling us only that we can't reject the null hypothesis that the discrepancy is zero...). We can make full use of these posterior predictions to investigate the model fit and try to improve the model.

Bayesian p-values and occupancy models

For an occupancy model without survey-specific covariates, the code to generate the posterior predictive data might look like this:
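A rough sketch of the idea in plain Python (psi, p, n_sites and n_occasions are illustrative names, not the original code): one posterior draw of psi and p produces one replicated data set of per-site detection counts.

```python
import random

def replicate_occupancy(psi, p, n_sites, n_occasions):
    """Generate one posterior-predictive data set for a basic occupancy
    model, using a single posterior draw of psi (occupancy probability)
    and p (detection probability). Returns detection counts per site."""
    y_rep = []
    for _ in range(n_sites):
        # draw the latent occupancy state for the replicated site
        z = 1 if random.random() < psi else 0
        # binomial(n_occasions, p) detections if occupied, else 0
        dets = sum(1 for _ in range(n_occasions) if random.random() < p) if z else 0
        y_rep.append(dets)
    return y_rep
```

In BUGS/JAGS the same thing is usually done inside the model code, e.g. with a replicate stochastic node such as y.rep[i,j] ~ dbern(z[i] * p) alongside the likelihood, so that a replicated data set is produced at every MCMC step.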
Bayesian p-values and SCR models

Royle et al (2014, SCR book, pp. 232-241) only consider this method of assessing fit. Individual binary 0/1 data points from \(y_{ijk}\) contain insufficient information to allow fit assessment, so they suggest aggregating over one or two dimensions. They give three options and apply these to the fisher data.
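The mechanics can be sketched in a few lines of plain Python (illustrative names; in practice the observed and expected counts come from the aggregated data and the fitted model at each MCMC step):

```python
import math

def freeman_tukey(observed, expected):
    # Freeman-Tukey discrepancy: sum over cells of (sqrt(obs) - sqrt(exp))^2
    return sum((math.sqrt(o) - math.sqrt(e)) ** 2
               for o, e in zip(observed, expected))

def bayesian_p(d_obs_list, d_rep_list):
    # Proportion of MCMC steps where the replicated discrepancy exceeds
    # the observed one; values near 0.5 suggest adequate fit, values
    # near 0 or 1 suggest problems.
    n = len(d_obs_list)
    return sum(r > o for o, r in zip(d_obs_list, d_rep_list)) / n
```

For example, `bayesian_p([1.2, 0.8], [1.5, 0.7])` returns 0.5, the "ideal" value.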
For each, they calculate an expected value (estimated as p*n) and obtain Freeman-Tukey discrepancies between observed and expected. They also draw replicate data and get the Freeman-Tukey discrepancy between that and expected. Discrepancies are summed for each, giving a pair of values (observed and replicated) for each statistic at each step of the MCMC chain.

I tried this for a data set where the two sexes had very different p0 and sigma (differing by a factor of 2.5), with p0(sex) sigma(sex) having lower AICc than p0(.) sigma(.). Bayesian p-values indicated that the p0(.) sigma(.) model was entirely adequate, and there was hardly any difference between the p-values for the two models. Further investigation showed that a behavioural response model, p0(sex+bk) sigma(sex), had an even lower AICc (deltaAICc = 11), so the null model really was not adequate. Neither was the Bayesian p-value adequate as an assessment tool!

Assessing different aspects of the model

The simplest approach is to "eyeball" posterior predictive plots superimposed on the data. Gelman et al (2014, BDA, section 6.3) suggest how simulated data can be used to check minimum and maximum values in the data set, and Gelman & Hill (2007, pp. 158ff) extend this to checks for zero counts. Closer to our own field, Ergon & Gardner (2014) simulate posterior predictive data and use it to check different aspects of the model.
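These simple checks are easy to script. A sketch (plain Python, illustrative names) of a generic posterior-predictive p-value for any test statistic, applied to the minimum, maximum and zero-count statistics mentioned above:

```python
def check_stat(stat, y_obs, y_reps):
    """Posterior-predictive p-value for a test statistic: the proportion
    of replicated data sets whose statistic is at least as large as the
    observed one."""
    t_obs = stat(y_obs)
    return sum(stat(y_rep) >= t_obs for y_rep in y_reps) / len(y_reps)

# example statistics in the spirit of Gelman et al and Gelman & Hill
stat_min = min
stat_max = max
n_zeros = lambda y: sum(v == 0 for v in y)
```

Extreme p-values for any of these statistics flag an aspect of the data (e.g. zero inflation) that the model fails to reproduce.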
They simulated the data sets in R after fitting the model in JAGS, with a function that can be used either with values from the posterior distribution of parameters (single rows from the MCMC chain after converting it to a matrix) or with specific user-chosen values. Although an interesting starting point, theirs is a CJS model (only tracking survival of marked animals) and is a monster with 11 parameters to estimate.

A workable GoF strategy for SCR models
For a closed SCR model of the data set with big male/female differences, I tried a series of checks, comparing the actual data with 1000 simulated posterior-predictive data sets.
Again there was no difference between the p0(.) sigma(.) and p0(sex) sigma(sex) models. However, check #3 showed a big difference between actual and simulated data: in the actual data, no animal was caught more than once in a specific trap, a pattern which occurred in only 2% of the simulated data sets. This subsequently led to investigation of the p0(sex+bk) sigma(sex) model. Some specific checks for male and female data would likely be needed, but in this case that would be complicated, as the sex of one captured individual was unknown.

I used a similar procedure for a very long study (where closure is questionable) with the period divided into sessions. The above checks showed adequate fit, but a "naïve survival" check - the number of individuals captured in later sessions - for a conventional closed model picked up the fact that survival was less than 1.
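The repeat-capture check is easy to compute from the capture histories. A sketch (plain Python, assuming the data are held as counts y[i][j] of captures of animal i in trap j, summed over occasions; these names are illustrative):

```python
def max_captures_same_trap(y):
    # y[i][j] = number of captures of animal i in trap j (summed over occasions)
    return max(max(row) for row in y)

def prop_without_repeats(y_reps):
    # Proportion of replicated data sets in which, as in the observed data,
    # no animal is caught more than once in the same trap.
    return sum(max_captures_same_trap(y) <= 1 for y in y_reps) / len(y_reps)
```

A value like the 2% above says the fitted model produces repeat captures in the same trap far more often than the data show, pointing towards a trap-specific behavioural response (the bk term).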