Representing categorical variables in Stan


@bgoodri shows how to represent categorical variables in Stan.

I have the following syntactically correct unpooled logistic regression model with both varying slopes and varying intercepts:

data {
  int<lower=0> N; // number of samples
  int<lower=0> numCities; // number of cities
  int<lower=0> numSources; // number of sources
  int<lower=0> numTypes; // number of types
  int<lower=0> numSeasons; // number of seasons
  int<lower=0> numMonths; // number of months
  int<lower=0> numYears; // number of years
  int<lower=0> numStatuses; // number of species statuses
  int<lower=0, upper=1> mislabelled[N]; // vector of mislabelled samples
  int<lower=0, upper=numCities> city[N]; // vector of cities
  int<lower=0, upper=numSources> source[N]; // vector of sources
  int<lower=0, upper=numTypes> type[N]; // vector of types
  int<lower=0, upper=numSeasons> season[N]; // vector of seasons
  int<lower=0, upper=numMonths> month[N]; // vector of months
  int<lower=0, upper=numYears> year[N]; // vector of years
  int<lower=0, upper=numStatuses> status[N]; // vector of species statuses

parameters {

 // varying intercept
  vector[numCities] alpha_city;
  vector[numSources] alpha_source;
  vector[numTypes] alpha_type;
  vector[numSeasons] alpha_season;
  vector[numMonths] alpha_month;
  vector[numYears] alpha_year;
  vector[numStatuses] alpha_status;

  // varying slopes
  vector[numCities] beta_city;
  vector[numSources] beta_source;
  vector[numTypes] beta_type;
  vector[numSeasons] beta_season;
  vector[numMonths] beta_month;
  vector[numYears] beta_year;
  vector[numStatuses] beta_status;

model {
  vector[N] eta;

  // priors for varying interceptss 
  alpha_city ~ normal(0, 1);
  alpha_source ~ normal(0, 1);
  alpha_type ~ normal(0, 1);
  alpha_season ~ normal(0, 1);
  alpha_month ~ normal(0, 1);
  alpha_year ~ normal(0, 1);
  alpha_status ~ normal(0, 1);

  // priors for varying slopes 
  beta_city ~ normal(0, 1);
  beta_source ~ normal(0, 1);
  beta_type ~ normal(0, 1);
  beta_season ~ normal(0, 1);
  beta_month ~ normal(0, 1);
  beta_year ~ normal(0, 1);
  beta_status ~ normal(0, 1);

  for (i in 1:N) {
    eta[i] = alpha_city[city[i]] +
             alpha_source[source[i]] +
             alpha_type[type[i]] +
             alpha_season[season[i]] +
             alpha_month[month[i]] +
             alpha_year[year[i]] +
             alpha_status[status[i]] +
             beta_city[city[i]] + 
             beta_source[source[i]] +
             beta_type[type[i]] +
             beta_season[season[i]] +
             beta_month[month[i]] +
             beta_year[year[i]] +

  // likelihood
  mislabelled ~ bernoulli_logit(eta);

My (verbose) model contains only categorical predictors and one categorical response, which both must be coded as dummy/indicator variables.

However, @bgoodri doesn’t indicate bounds in the data block between 0 and 1.

When I try to fit some fake data coded as 0/1

N <- 1000
numCities <- 25
numSources <- 3
numTypes <- 4
numSeasons <- 4
numMonths <- 12
numYears <- 2
numStatuses<- 5
mislabelled <- sample(0:1, N, replace = TRUE)
city <- sample(0:1, N, replace = TRUE)
source <- sample(0:1, N, replace = TRUE)
type <- sample(0:1, N, replace = TRUE)
season <- sample(0:1, N, replace = TRUE)
month <- sample(0:1, N, replace = TRUE)
year <- sample(0:1, N, replace = TRUE)
status <- sample(0:1, N, replace = TRUE)

(fit <- stan("mislabelled_unpooled.stan", data = list(N = N, 
                                         numCities = numCities,
                                         numSources = numSources,
                                         numTypes = numTypes,
                                         numSeasons = numSeasons,
                                         numMonths = numMonths,
                                         numYears = numYears,
                                         numStatuses = numStatuses,
                                         mislabelled = mislabelled, 
                                         city = city, 
                                         source = source, 
                                         type = type, 
                                         season = season, 
                                         month = month, 
                                         year = year, 
                                         status = status),
         iter = 5000, control = list(max_treedepth = 15)))

I get an Exception error

 Exception: []: accessing element out of range. index 0 out of range; expecting index to be between 1 and 25; index position = 1alpha_city  (in 'model40248cfcbc3_mislabelled_unpooled' at line 70)"

which stems from the fact that 0/1 doesn’t match the bounds in the data block.

Line 70 in my model corresponds to the specification of eta.

Setting the <upper> limit in the data block to numCities, etc. won’t work since the dummy structure is not preserved.

Am I missing something?

This isn’t about the bounds in the data block. This is because you are trying to access element 0 of an array, e.g. alpha_city[0]. The first element in Stan arrays is 1, not 0 (e.g. alpha_city[1]). As in the linked example, the lower bound is 1 rather than 0. If you add 1 to all of the indicators, then you shouldn’t have this issue.

Note also that the data you’re generating does not actually span the possible values. For example, you specify 25 cities, but the city indicators can only be 0 or 1. Instead, try this.

city <- sample(1:numCities, N, replace = TRUE)

@simonbrauer Thanks. This works.

However, doesn’t doing

city <- sample(1:numCities, N, replace = TRUE)

destroy the requirement of a dummy variable needing to be either 0 or 1 as in @bgoodri SO post (see the original poster’s question?

@bgoodri shows two different ways to do the same thing.

The first version is dummy-coded, where all but one level of each category gets its own column and they are coded 0/1. So with 25 cities, there would be 24 variables, all of them 0/1.

The second approach uses one set of indices for each variable. So regardless of the number of cities, you only have one variable. But the values should range from 1 to 25 (or whatever your maximum is).

Again, they’re doing the same thing as far as the model is concerned. But you may find one or the other easier based on how comfortable you are coding R or Stan. For example, the first version scales up very nicely if you add additional variables because they’re just tacked on to X. But it becomes harder (as far as writing code) to put specific priors on particular categorical variables if you do that.


Got it.

My model obviously falls into @bgoodri’s second solution. I’ll keep your other points in mind and accept your response as the solution to my question.