# Categorical covariates in JAGS

I've recently been asked about coding categorical covariates in JAGS. You can do it in the usual frequentist manner, with dummy variables for all but one of the categories, but JAGS allows other, more easily interpreted ways of coding, especially if there is only one categorical covariate in the model and it uses a logit or log link. Many of our wildlife analyses have logistic models under the hood.

In R, the class factor is used for categorical variables. A factor has a levels attribute with the names of the categories, in alphabetical order by default, and the values are stored as integers giving the level number.
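As a reminder of how factors work, here's a toy example (not part of the analysis below) showing the levels attribute and the underlying integer codes:

```r
# A factor stores category labels as 'levels' and the values as integers
x <- factor(c("dog", "cat", "dog", "bird"))
levels(x)       # "bird" "cat" "dog" - alphabetical order by default
as.integer(x)   # 3 2 3 1 - the stored level numbers
x == "dog"      # TRUE FALSE TRUE FALSE - comparisons use the labels
```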

### A toy example

I'll generate some simulated data for a logistic regression model (as that generalises to a lot of our other models). We have four players with different degrees of skill and a continuous covariate which does nothing.

playerNames <- c("Andy", "Bob", "Chan", "Dora")
skill <- c(0.3, 0.6, 0.7, 0.8)
set.seed(2017)
player.char <- sample(playerNames, size=200, replace=TRUE, prob=c(1,3,2,2))
player <- factor(player.char)
summary(player)
Andy  Bob Chan Dora
25   84   52   39
success <- rbinom(200, 1, p=skill[as.integer(player)])
fluff <- rnorm(200)
tapply(success, player, mean) # Check: should be close to 'skill'
Andy       Bob      Chan      Dora
0.2800000 0.5833333 0.7115385 0.8461538 

I've used character strings for the player names and then for the levels of the player factor (they appear in the summary), which I think is much safer than using numbers for categories. A factor can easily be converted to integers if necessary, as in the call to rbinom above. Note that the number of observations per category varies, something that happens often with real data.
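To see why numeric category labels are risky, try converting a factor built from numbers back to integers:

```r
# as.integer() on a factor returns level numbers, NOT the original values
y <- factor(c(10, 2, 10))
levels(y)                     # "2" "10" - sorted, then stored as text
as.integer(y)                 # 2 1 2 - level numbers, not 10 2 10
as.numeric(as.character(y))   # 10 2 10 - the safe way to recover the values
```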

### Frequentist analysis with glm

We'll quickly run a glm analysis for the full model:

glm(success ~ player + fluff, family='binomial')

Call:  glm(formula = success ~ player + fluff, family = "binomial")

Coefficients:
(Intercept)      playerBob     playerChan     playerDora        fluff
-0.9589         1.2927         1.8726         2.6850       0.1198

Degrees of Freedom: 199 Total (i.e. Null);  195 Residual
Null Deviance:      263.6
Residual Deviance: 239.2        AIC: 249.2

Notice that we have coefficients for Bob, Chan and Dora, but not for Andy. The intercept refers to Andy's plays and the other coefficients give the difference between Andy and each of the other players. It isn't possible to have an intercept and a coefficient for each player, as they would be confounded: you could add any value you liked to the intercept provided you subtracted the same value from the player coefficients and you'd get exactly the same result.
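You can verify the confounding directly: shifting the intercept and all the player coefficients by the same amount in opposite directions leaves every linear predictor unchanged. A quick sketch using the glm estimates above:

```r
# Linear predictors are identical under the two parameterisations
b0 <- -0.9589
bPlayer <- c(Andy=0, Bob=1.2927, Chan=1.8726, Dora=2.6850)  # Andy = reference
shift <- 0.5
lp1 <- b0 + bPlayer
lp2 <- (b0 + shift) + (bPlayer - shift)  # add to intercept, subtract from coefs
all.equal(lp1, lp2)  # TRUE - the data cannot distinguish the two
```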

### The null model - no covariates

To begin with, let's run a JAGS model with no covariates. In this case we need to estimate a single probability of success, p, and do not need the logit link.

modelText <- "
model {
for(i in 1:length(success)) {
success[i] ~ dbern(p)
}

# Priors
p ~ dbeta(1,1)
}"
writeLines(modelText, con="modelNull.txt")

Here I'm using a Beta prior, which is the appropriate distribution for a probability, as it is restricted to the range 0 to 1. Beta(1,1) is a uniform prior with little information, and no one is likely to question that choice, except perhaps to suggest a more informative prior - which could easily be coded with, say, p ~ dbeta(2,3). Many people use Uniform(0,1) instead of Beta(1,1); that's mathematically equivalent, but JAGS does not recognise it as a Beta distribution, so it cannot take advantage of conjugacy when sampling.
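For intuition about these priors, it helps to look at their quantiles; the Beta(2,3) here is just an example of a mildly informative choice (its mean is 0.4):

```r
# Quantiles of a flat Beta(1,1) and a mildly informative Beta(2,3) prior
qbeta(c(0.025, 0.5, 0.975), 1, 1)            # 0.025 0.500 0.975 - flat
round(qbeta(c(0.025, 0.5, 0.975), 2, 3), 3)  # most mass below 0.7
curve(dbeta(x, 2, 3), from=0, to=1)          # visualise the prior (plot not shown)
```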

JAGSdata <- list(success=success)
wanted <- "p"

library(jagsUI)
outNull <- jags(JAGSdata, NULL, wanted, "modelNull.txt",
n.chains=3, n.adapt=100, n.iter=2000)
outNull
...
mean    sd    2.5%     50%   97.5% overlap0 f  Rhat n.eff
p          0.629 0.033   0.563   0.629   0.692    FALSE 1 1.000  3668
...
plot(outNull)

This works fine and the plots (not shown) indicate good mixing and convergence.
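Because the Beta prior is conjugate for the Bernoulli likelihood, the posterior here is also available in closed form, which gives a handy check on the MCMC output: with a Beta(1,1) prior the posterior is Beta(1 + successes, 1 + failures). The count of 126 successes below is a hypothetical figure consistent with the output above:

```r
# Exact conjugate posterior for p, assuming 126 successes in 200 trials
k <- 126; n <- 200
postMean <- (1 + k) / (2 + n)
round(postMean, 3)                                   # 0.629
round(qbeta(c(0.025, 0.975), 1 + k, 1 + n - k), 3)   # close to the MCMC interval
```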

### One categorical covariate

If we have only one covariate and that covariate is categorical, the simplest strategy is to estimate a value for p for each category. We do not use the logit link, and dealing directly with p simplifies interpretation and discussion of priors.

modelText <- "
model {
for(i in 1:length(success)) {
success[i] ~ dbern(p[player[i]])
}

# Priors
for(i in 1:4) {
p[i] ~ dbeta(1,1)
}
}"
writeLines(modelText, con="modelPlayer.txt")

Again I'm using a Beta(1,1) prior, which could easily be replaced by an informative beta prior, a different one for each player if you wanted.

JAGSdata <- list(success=success, player=as.integer(player))
str(JAGSdata)
List of 2
$ success: int [1:200] 0 1 0 0 1 1 0 1 1 1 ...
$ player : int [1:200] 1 3 3 2 4 4 2 3 3 2 ...
wanted <- "p"

The factor player was converted to integers for JAGS, with Andy = 1, Bob = 2, etc. This is used as the index into the p vector so that the right p is used for each observation in the success vector.
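This nested indexing is worth a moment's thought; in plain R it looks like this:

```r
# Nested indexing: the player code picks the right element of p
p <- c(0.3, 0.6, 0.7, 0.8)   # one value per player: Andy, Bob, Chan, Dora
playerIdx <- c(1, 3, 3, 2)   # the first few values of as.integer(player)
p[playerIdx]                 # 0.3 0.7 0.7 0.6 - the p used for each observation
```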

outPlayer <- jags(JAGSdata, NULL, wanted, "modelPlayer.txt",
n.chains=3, n.adapt=100, n.iter=2000)
outPlayer
...
mean    sd    2.5%     50%   97.5% overlap0 f  Rhat n.eff
p[1]       0.299 0.087   0.144   0.294   0.487    FALSE 1 1.001  1627
p[2]       0.582 0.053   0.479   0.582   0.684    FALSE 1 1.000  3484
p[3]       0.705 0.061   0.578   0.708   0.817    FALSE 1 1.000  6000
p[4]       0.830 0.058   0.706   0.835   0.927    FALSE 1 1.001  2855
...
# Compare with the proportion of successes for each player:
tapply(success, player, mean)
Andy       Bob      Chan      Dora
0.2800000 0.5833333 0.7115385 0.8461538 

The posterior means are close to the actual proportions of successes in the data, though pulled somewhat towards 0.5, as we would expect: with a uniform prior the posterior mode equals the observed proportion of successes, but the posterior distribution is skewed, so the mean lies a little closer to 0.5.
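That shrinkage can be checked analytically: with a Beta(1,1) prior the posterior mean for each player is (successes + 1)/(attempts + 2). Using the counts recovered from the summaries above:

```r
# Exact Beta-posterior means per player, from the observed counts
attempts  <- c(Andy=25, Bob=84, Chan=52, Dora=39)
successes <- c(Andy=7,  Bob=49, Chan=37, Dora=33)    # attempts * observed proportion
round(successes / attempts, 3)                # the observed proportions
round((successes + 1) / (attempts + 2), 3)    # close to the JAGS posterior means
```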

### More than one covariate

With more than one covariate we do need to use the logit link. There are several strategies, but with only one categorical covariate, we can keep it simple by using a different intercept for each category.

The intercept for each player is logit(probability of success) when fluff = 0, which is in fact the mean of fluff. So we should be able to provide a sensible prior for the probability of success. In the code below, I've specified a uniform prior, Beta(1,1), for the intercepts on the probability scale, then converted to the logit scale for use in the regression. You could use an informative Beta prior for each player if you wished.
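To see what this prior looks like on the logit scale, you can do the conversion in R, where qlogis is the logit function and plogis its inverse:

```r
# A Beta(1,1) prior on the probability scale, transformed to the logit scale
set.seed(42)
pPrior <- rbeta(1e4, 1, 1)     # prior draws for the probability of success
bPrior <- qlogis(pPrior)       # same conversion as logit() in JAGS
summary(bPrior)                # centred on 0, with long tails
round(plogis(c(-5, 0, 5)), 4)  # 0.0067 0.5 0.9933 - back to probabilities
```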

#### One intercept per category

modelText <- "
model {
for(i in 1:length(success)) {
logit(p[i]) <- bPlayer[player[i]] + bFluff * fluff[i]
success[i] ~ dbern(p[i])
}

# Priors
for(i in 1:4) {
pPlayer[i] ~ dbeta(1,1)         # prior specified on probability scale...
bPlayer[i] <- logit(pPlayer[i]) # ... then converted to logit scale.
}
bFluff ~ dunif(-5,5)
}"
writeLines(modelText, con="model1.txt")

Here I'm using my favourite uniform priors, dunif(-5,5), for logistic regression coefficients when continuous covariates are standardised (fluff was generated with mean = 0 and SD = 1). These may not be appropriate for the analysis you report, but it is easy to check whether the prior is constraining the posterior.
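Checking for such constraint is straightforward: look at where the posterior draws fall relative to the prior bounds. Here I use simulated stand-in draws (matching the bFluff posterior mean and SD from this analysis) rather than a fitted object:

```r
# Is a dunif(-5, 5) prior constraining the posterior? Check the draws.
set.seed(99)
draws <- rnorm(6000, 0.123, 0.166)  # stand-in for out1$sims.list$bFluff
range(draws)                        # well inside (-5, 5)
mean(abs(draws) > 4.5)              # 0 - no mass piling up near the bounds
```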

JAGSdata <- list(success=success, player=as.integer(player), fluff=fluff)
str(JAGSdata)
List of 3
$ success: int [1:200] 0 1 0 0 1 1 0 1 1 1 ...
$ player : int [1:200] 1 3 3 2 4 4 2 3 3 2 ...
$ fluff  : num [1:200] 0.48 0.688 -0.201 0.101 -0.577 ...
wanted <- c("bPlayer", "bFluff")

out1 <- jags(JAGSdata, NULL, wanted, "model1.txt",
n.chains=3, n.adapt=100, n.iter=2000)
out1
...
             mean    sd    2.5%     50%   97.5% overlap0     f  Rhat n.eff
bPlayer[1] -1.013 0.448  -1.924  -0.998  -0.157    FALSE 0.989 1.002  5123
bPlayer[2]  0.337 0.221  -0.095   0.338   0.764     TRUE 0.936 1.002  1195
bPlayer[3]  0.943 0.316   0.345   0.931   1.588    FALSE 0.999 1.000  6000
bPlayer[4]  1.835 0.466   1.011   1.811   2.827    FALSE 1.000 1.000  6000
bFluff      0.123 0.166  -0.196   0.122   0.448     TRUE 0.771 1.000  6000
...

The bPlayer coefficients are on the logit scale, so we need to convert to probabilities (for the cases where fluff = 0) for comparison with the known probabilities:

logitP1 <- out1$sims.list$bPlayer
P1 <- plogis(logitP1)
colMeans(P1)
[1] 0.275 0.582 0.715 0.853

#### One category as reference category

The method above won't work if you have more than one categorical covariate. An alternative is to use the strategy adopted by default by glm: the model has an intercept which refers to one of the categories, and coefficients for the difference between this reference category and each of the others. We adapt our code by adding the intercept, b0, and setting the coefficient for the reference category to zero.

modelText <- "
model {
for(i in 1:length(success)) {
logit(p[i]) <- b0 + bPlayer[player[i]] + bFluff * fluff[i]
success[i] ~ dbern(p[i])
}

# Priors
b0 ~ dunif(-5,5)
bPlayer[1] <- 0 # First level is reference category
for(i in 2:4) {
bPlayer[i] ~ dunif(-5,5) # Difference between each player and the reference player
}
bFluff ~ dunif(-5,5)
}"
writeLines(modelText, con="model2.txt")

Here we might say that b0 is the intercept for the reference player, so we can provide a sensible prior for that, eg, p0 ~ dbeta(1,1) ; b0 <- logit(p0). But I have no intuitive feel for how you would choose priors for the differences (on the logit scale) among players.
The data list, JAGSdata, is the same, and we add b0 to the parameters in wanted. The output looks like this:

             mean    sd    2.5%     50%   97.5% overlap0     f  Rhat n.eff
b0         -0.974 0.444  -1.891  -0.970  -0.117    FALSE 0.988 1.018   147
bPlayer[1]  0.000 0.000   0.000   0.000   0.000    FALSE 1.000    NA     1
bPlayer[2]  1.312 0.495   0.366   1.309   2.294    FALSE 0.996 1.014   172
bPlayer[3]  1.916 0.540   0.881   1.904   3.017    FALSE 1.000 1.011   255
bPlayer[4]  2.773 0.634   1.563   2.749   4.068    FALSE 1.000 1.008   276
bFluff      0.124 0.165  -0.196   0.123   0.447     TRUE 0.773 1.001  2184

Notice that the effective sample sizes, n.eff, are very low - we ran 3 chains of 2000 iterations = 6000 total - and the trace plots suggest poor mixing. The main reason for this is that the reference category, Andy, has relatively few observations, only 25 out of 200. That means it's difficult to estimate the intercept, and hence difficult to estimate the other coefficients.

We can improve things by making Bob, who has the most observations, the reference category. That could be done by changing the JAGS code (set bPlayer[2] <- 0), but it's easier to change the data by relevelling the player factor so that Bob comes first:

player_Bob <- relevel(player, ref="Bob")
summary(player_Bob)
Bob Andy Chan Dora
 84   25   52   39
JAGSdata <- list(success=success, player=as.integer(player_Bob), fluff=fluff)

We run the model with the new data and get:

             mean    sd    2.5%     50%   97.5% overlap0     f  Rhat n.eff
b0          0.337 0.222  -0.078   0.329   0.783     TRUE 0.941 1.000  4153
bPlayer[1]  0.000 0.000   0.000   0.000   0.000    FALSE 1.000    NA     1
bPlayer[2] -1.349 0.517  -2.431  -1.332  -0.387    FALSE 0.998 1.000  6000
bPlayer[3]  0.599 0.383  -0.151   0.593   1.364     TRUE 0.942 1.000  4634
bPlayer[4]  1.467 0.503   0.510   1.453   2.502    FALSE 0.999 1.001  2687
bFluff      0.124 0.166  -0.199   0.123   0.450     TRUE 0.770 1.001  2864

Effective sample sizes are much better and the diagnostic plots now look good.
To get the estimated skill levels (with fluff = 0) we need to add the intercept, b0, to the bPlayer coefficients:

logitP2 <- out2$sims.list$bPlayer + out2$sims.list$b0
P2 <- plogis(logitP2)
colMeans(P2)
[1] 0.581 0.276 0.714 0.850

... remembering that Bob is now first in the list and Andy is second.

#### Categorical coefficients sum to 0

Choosing a reference category can make interpretation of the output difficult, especially if there are several categorical covariates, each with its own reference category. An alternative is to have a (non-zero) coefficient for each category but to ensure that they add up to 0 - or, if you prefer, that their mean is 0. Here's the JAGS model:

modelText <- "
model {
for(i in 1:length(success)) {
logit(p[i]) <- b0 + bPlayer[player[i]] + bFluff * fluff[i]
success[i] ~ dbern(p[i])
}

# Priors
b0 ~ dunif(-5,5)
for(i in 1:4) {
temp[i] ~ dnorm(0, 0.01)
}
bPlayer <- temp - mean(temp)
bFluff ~ dunif(-5,5)
}"
writeLines(modelText, con="model3.txt")

The bPlayer coefficients are generated in two steps: first we get a set of temp values drawn from broad normal distributions, then we subtract their mean from each. This time I'm not using my dunif priors for temp, as the posteriors would in fact be bumping up against the limits. The normal prior with mean 0 has the effect of keeping them within a reasonable range.
The details of running the model are the same, and here is the output:

             mean    sd    2.5%     50%   97.5% overlap0     f  Rhat n.eff
b0          0.531 0.187   0.174   0.528   0.912    FALSE 0.998 1.000  6000
bPlayer[1] -1.506 0.377  -2.279  -1.498  -0.803    FALSE 1.000 1.001  3560
bPlayer[2] -0.189 0.244  -0.674  -0.190   0.291     TRUE 0.787 1.002  2120
bPlayer[3]  0.406 0.296  -0.164   0.401   0.986     TRUE 0.919 1.000  6000
bPlayer[4]  1.289 0.381   0.597   1.277   2.102    FALSE 1.000 1.001  1909
bFluff      0.124 0.170  -0.207   0.122   0.453     TRUE 0.768 1.000  6000

We can check that the bPlayer coefficients sum to zero, or very nearly so:

tst <- rowSums(out3$sims.list$bPlayer)
range(tst)
[1] -1.065814e-14  1.065814e-14

This time the individual bPlayer coefficients tell us the difference in skill for each player from the mean of all players (on the logit scale). The mean with fluff = 0 is given by the intercept. So to recover estimates of skill we need to add the intercept:

logitP3 <- out3$sims.list$bPlayer + out3$sims.list$b0
P3 <- plogis(logitP3)
colMeans(P3)
[1] 0.283 0.584 0.714 0.852

Notice that we are using the "mean of all players" here; we do not take into account that the less-skilled Andy only makes a small number of attempts. It would be possible to use a weighted mean, with weights proportional to the number of attempts, if that was needed.
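If you did want a weighted mean, one way - a sketch only, which I have not tested - is to pass the weights in as data and use JAGS's inprod function to subtract the weighted mean of the temp values. Here wt would be supplied in the data list as, say, as.vector(table(player))/200:

```r
# Untested sketch: sum-to-zero constraint replaced by a weighted-mean
# constraint, with weights wt passed in via the data list.
modelTextW <- "
model {
for(i in 1:length(success)) {
logit(p[i]) <- b0 + bPlayer[player[i]] + bFluff * fluff[i]
success[i] ~ dbern(p[i])
}

# Priors
b0 ~ dunif(-5,5)
for(i in 1:4) {
temp[i] ~ dnorm(0, 0.01)
}
bPlayer <- temp - inprod(wt, temp)  # subtract the weighted mean
bFluff ~ dunif(-5,5)
}"
writeLines(modelTextW, con="modelWeighted.txt")
```

With this version, b0 is the logit of the success probability for an attempt drawn at random from the data, rather than for a player drawn at random.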

Updated 1 February 2018 (better object names and typos fixed) by Mike Meredith