Correlated covariates

There is usually some correlation between the covariates that we want to include in our model, a phenomenon known as "collinearity" or "multicollinearity". This is not a problem if the correlation is small, but you need to be careful if the absolute correlation between two covariates is > 0.7. You can avoid problems by discarding one of a pair of correlated covariates, but that can be a mistake if both have a real biological impact. We'll look at a toy example where both covariates should be included.

A toy example

Suppose we have measurements of some variable, such as tree size expressed as diameter at breast height (DBH), at 9 sites. DBH depends on temperature and rainfall, both of which we have scaled so that they range from -2 to +2.

rain <- c(2, 1, 1, 0, 0, -1, -1, -2, -2)
temp <- c(-2, -2, -1, -1, 0, 0, 1, 1, 2)
dbh <- 2.5 + 1 * temp + 0.5 * rain
# Add a small error term, otherwise 'lm' gives strange results
set.seed(2017)
( dbh <- dbh + rnorm(9, 0, 0.02) )

The data are shown in the plot below, where the diameter of the circles is proportional to DBH.

The plot shows that tree size increases with rainfall (for a given temperature) and also with temperature (for a given rainfall). The (absolute) correlation between rainfall and temperature is high, 0.93, and this value is also shown in the plot.
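The original plotting code is not shown here; a minimal sketch of something similar, with circle size scaled to DBH (the scaling via 'cex' is my own choice), is:

plot(temp, rain, cex = dbh, xlab = "Temperature (scaled)", ylab = "Rainfall (scaled)")
cor(temp, rain)  # -0.93; absolute value 0.93, as quoted above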

Fitting a linear model with both rainfall and temperature as predictors appears to work well, and we recover the coefficients we used to generate DBH, with only a small effect of the random error we added:

coef( tr <- lm(dbh ~ temp + rain) )
(Intercept)         temp         rain  
     2.4995       1.0032       0.5097 

Using just one covariate

With such a high correlation, perhaps we should choose one of the covariates for our model and ignore the other. The plots below show the effect of ignoring one covariate:
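The original plots are not reproduced here; a minimal sketch of the two single-covariate plots (the details are my own) is:

op <- par(mfrow = c(1, 2))
plot(temp, dbh, main = "Temperature only")
plot(rain, dbh, main = "Rainfall only")
par(op)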

The plot for rainfall-only seems odd: it appears that high rainfall is associated with small trees; that is indeed true, but this is due to sites with high rainfall having low temperatures. This shows up when we run linear regressions:

coef(t0 <- lm(dbh ~ temp))
(Intercept)        temp 
  2.2810770   0.5299035 
coef(r0 <- lm(dbh ~ rain))
(Intercept)        rain 
  2.0695785  -0.4218396

The coefficient for temperature is underestimated (0.53 instead of 1.00) and the coefficient for rainfall is now negative (-0.42 instead of +0.50). These results are just wrong. In this example, the two covariates have separate effects on DBH and both need to be included. The correlation is due to our choice of sites: had we included sites with temperature and rainfall both high or both low, the correlation would be much lower.

Model                            Intercept  Temperature  Rainfall
dbh ~ temp + rain                     2.50         1.00      0.51
dbh ~ temp                            2.28         0.53
dbh ~ rain                            2.07                  -0.42
Truth (data-generating values)        2.50         1.00      0.50

In this case, if you want to disentangle the effects of temperature and rainfall, you must include both covariates in the model.
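As a quick check of the claim that the high correlation is due to our choice of sites (this check is my own addition, not in the original), adding two hypothetical sites where temperature and rainfall are both high or both low greatly reduces the correlation:

rain2 <- c(rain, 2, -2)
temp2 <- c(temp, 2, -2)
cor(temp2, rain2)  # about -0.27, compared with -0.93 for the original 9 sites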

Which model should we use to make predictions?

Akaike's information criterion (AIC) attempts to select the best model for predictions:

AIC(tr, t0, r0)
   df        AIC
tr  4 -38.223970
t0  3   6.544253
r0  3  18.695476

The model with both covariates is clearly the best, but as we shall see in a moment, it's important to include the covariances when making predictions.
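To see how the covariances enter a prediction from the 'lm' fit, here is a sketch (my own check, not in the original) for a site with temperature = 0.5 and rainfall = 1.5, the same values used for the Bayesian prediction later:

newdat <- data.frame(temp = 0.5, rain = 1.5)
predict(tr, newdata = newdat, se.fit = TRUE)$se.fit  # uses the full vcov(tr)
X <- c(1, 0.5, 1.5)                                  # intercept, temp, rain
sqrt(t(X) %*% vcov(tr) %*% X)                        # the same value, by hand
sqrt(sum(X^2 * diag(vcov(tr))))                      # ignoring the covariances: too small here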

Why is collinearity a problem?

Richard McElreath (2016, p.142) has a toy example where the length of a person's leg is used to predict their height. This works fine for one leg. Left and right legs have slightly different lengths, and if both are included the coefficients have huge confidence intervals and neither is significant. This arises because the left leg provides no information that is not provided by the right leg, and vice versa. (If you wanted to include both legs, a good strategy would be to use the mean leg length as one covariate and the difference in leg lengths as a second covariate.)
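Here is a rough simulation in the spirit of that example (the numbers and code are my own, not McElreath's):

set.seed(42)
N <- 100
height <- rnorm(N, 170, 10)
leg_prop <- runif(N, 0.4, 0.5)                   # leg length as a proportion of height
leg_left <- leg_prop * height + rnorm(N, 0, 0.5)
leg_right <- leg_prop * height + rnorm(N, 0, 0.5)
summary(lm(height ~ leg_left))                   # one leg: a precise slope estimate
summary(lm(height ~ leg_left + leg_right))       # both legs: SEs hugely inflated
# Reparameterise as mean leg length plus left-right difference:
leg_mean <- (leg_left + leg_right) / 2
leg_diff <- leg_left - leg_right
summary(lm(height ~ leg_mean + leg_diff))        # the mean carries the signal, with a sensible SE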

The estimates of coefficients of correlated covariates are themselves correlated; let's look at the correlation matrix for the DBH model:

cov2cor(vcov(tr))
            (Intercept)      temp      rain
(Intercept)   1.0000000 0.6546537 0.6546537
temp          0.6546537 1.0000000 0.9285714
rain          0.6546537 0.9285714 1.0000000

The correlation between the coefficients for temp and rain has the same magnitude (0.93) as the correlation between the covariates, although here the sign is flipped. As a result, the variances are inflated, leading to larger SEs, wider confidence intervals, and higher p values. The "variance-inflation factor" (VIF) is calculated with the vif function in the car package:

library(car)
vif(tr)
    temp     rain 
7.259259 7.259259 
sqrt(vif(tr))
    temp     rain 
2.694301 2.694301 

The variances are inflated by a factor of 7.3, so standard errors and confidence interval widths are larger by a factor of 2.7 as a result of the collinearity. For this artificial example the variances are tiny even after inflation. (For more information on VIF, see Fox (2002, p.216ff).)
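With just two covariates, the VIF has a simple closed form, VIF = 1/(1 - r^2), where r is the correlation between the covariates; a quick check (my own addition) reproduces the value given by vif:

1 / (1 - cor(temp, rain)^2)  # 7.259259, the same as vif(tr)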

Does this apply to Bayesian estimation too?

Let's run a simple model with broad priors in JAGS and see what happens.

writeLines("model {
for(i in 1:length(dbh)) {
  dbh[i] ~ dnorm(mu[i], tau)
  mu[i] <- b0 + bR*rain[i] + bT*temp[i]
  }
  b0 ~ dnorm(0, 0.01)
  bR ~ dnorm(0, 0.01)
  bT ~ dnorm(0, 0.01)
  sigma ~ dunif(0, 5)
  tau <- pow(sigma, -2)
} " , con="model.txt")

jagsData <- list(dbh=dbh, rain=rain, temp=temp)
wanted <- c("b0", "bT", "bR", "sigma")
library(jagsUI)
( out <- jags(jagsData, NULL, wanted, "model.txt", DIC=FALSE,
  n.chains=3, n.adapt=100, n.iter=1000, n.burnin=0) )

And here are the summaries of the marginal posterior distributions:

       mean    sd  2.5%   50% 97.5% overlap0 f  Rhat n.eff
b0    2.499 0.014 2.470 2.499 2.528    FALSE 1 1.000 15000
bT    1.003 0.021 0.961 1.003 1.046    FALSE 1 1.000 15000
bR    0.510 0.022 0.467 0.510 0.553    FALSE 1 1.000 15000
sigma 0.029 0.012 0.016 0.027 0.061    FALSE 1 1.001  8836

The marginal posteriors never tell the whole story, and in this case it's especially important to look at cross-correlations:

library(coda)
crosscorr(out$samples)
                b0         bT         bR        sigma
b0    1.0000000000 0.65631381 0.65498018 0.0002794896
bT    0.6563138090 1.00000000 0.92849410 0.0062084503
bR    0.6549801822 0.92849410 1.00000000 0.0037814604
sigma 0.0002794896 0.00620845 0.00378146 1.0000000000
with(out$sims.list, plot(bT, bR))

Again we see that the correlation between the posteriors for the two coefficients has the same magnitude (0.93) as the correlation between temperature and rainfall in the data, and this shows up in the plot too:

One huge advantage of using MCMC chains to represent posterior distributions is that it is easy to generate corresponding chains for derived quantities. So if we want to predict the DBH of trees when (for example) temperature = 0.5 and rainfall = 1.5, we can simply do this:

mu <- with(out$sims.list, b0 + bT * 0.5 + bR * 1.5)
hist(mu)
mean(mu)
[1] 3.765288
quantile(mu, c(0.025, 0.5, 0.975))
    2.5%      50%    97.5% 
3.660236 3.765288 3.872797

How precise is our prediction compared with the point estimates of bT and bR? We can compare the coefficients of variation (CVs):

out$sd$bT / out$mean$bT
[1] 0.02138588
out$sd$bR / out$mean$bR
[1] 0.04229678
sd(mu) / mean(mu)
[1] 0.01409571

We see that the prediction is more precise than the individual coefficients on which it is based. This is because the CVs of the coefficients refer to marginal distributions which ignore cross-correlations, while the calculation of mu takes cross-correlation into account.
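As a further check (my own addition), combining the marginal posterior SDs as if the coefficients were independent does not reproduce sd(mu); because the cross-correlations are positive here, ignoring them understates the prediction SD:

with(out, sqrt(sd$b0^2 + (0.5 * sd$bT)^2 + (1.5 * sd$bR)^2))  # roughly 0.037, using the values above
sd(mu)                                                        # roughly 0.053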

References

McElreath, R. (2016) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, Boca Raton, FL.

Fox, J. (2002) An R and S-PLUS Companion to Applied Regression. Sage Publications, Thousand Oaks, CA.

Updated 24 Jan 2017 by Mike Meredith