Fitting a logistic model with only categorical variables

I’m attempting to fit the following syntactically-correct logistic model in Stan using HMC via stan(), but am getting an error. The model contains only categorical predictors as well as a categorical response, which I have coded as dummy variables.

data {
  int<lower = 0> N; // number of samples
  int<lower = 0, upper = 1> mislabelled[N]; // vector of mislabelled samples
  int<lower = 1, upper = 25> city[N]; // vector of cities
  int<lower = 1, upper = 3> source[N]; // vector of sources
  int<lower = 1, upper = 4> season[N]; // vector of seasons
  int<lower = 1, upper = 12> month[N]; // vector of months
  int<lower = 1, upper = 5> status[N]; // vector of species statuses
}

parameters {
  real beta0;
  vector[24] beta1;
  vector[2] beta2;
  vector[3] beta3;
  vector[11] beta4;
  vector[4] beta5;
}

model {
  vector[N] eta;
  beta0 ~ normal(0, 1);
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  beta3 ~ normal(0, 1);
  beta4 ~ normal(0, 1);
  beta5 ~ normal(0, 1);
  eta = beta0 + beta1[city[1:N]] + beta2[source[1:N]] + beta3[season[1:N]] + beta4[month[1:N]] + beta5[status[1:N]];
  mislabelled ~ bernoulli_logit(eta);
}

Here is my fake data:

library(rstan)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE) # only have to compile once unless code is changed

N <- 10
mislabelled <- sample(0:1, N, replace = TRUE) 
city <- sample(1:25, N, replace = TRUE)
source <- sample(1:3, N, replace = TRUE) 
season <- sample(1:4, N, replace = TRUE)
month <- sample(1:12, N, replace = TRUE)
status <- sample(1:5, N, replace = TRUE)

(fit <- stan("mislabelled.stan", data = list(N = N, mislabelled = mislabelled, city = city, source = source, season = season, month = month, status = status), iter = 2000))

The error I am getting is an out-of-range error.

Exception: vector[multi] indexing: accessing element out of range. index 3 out of range; expecting index to be between 1 and 2  (in 'model3ee2204da41_mislabelled' at line 28)

Line 28 is where the equation for eta is specified.

I’m a bit confused since I don’t have any indices between 1 and 2 as the error seems to suggest.

Any ideas on what’s going on here?

Seems your beta vector size does not compatible with the sample data? Sample data maximum value is always larger than the beta value length.

1 Like

I’m not sure follow…

The beta vectors have length one less than the number of observations for a given vector. So, in the case of city, there are 25 cities, but one city is used by the model as the reference city (as is usually the case with categorical dummy variables in general). Hence, why beta1 is of length 24.

Could you perhaps clarify?

From your data simulation:

source <- sample(1:3, N, replace = TRUE)

From your Stan program

vector[2] beta2;
...
beta2[source[1:N]]

The maximum allowable index for beta2 is 2. The index must be “between” 1 and 2 inclusive, i.e. it must be either 1 or 2. But you are passing it an index of up to 3.

Ah.

So, the length of all beta vectors defined in the parameters block need to be equal to the upper limit for the respective covariate defined the the data block. That is:

parameters {
  real beta0;
  vector[25] beta1;
  vector[3] beta2;
  vector[4] beta3;
  vector[12] beta4;
  vector[5] beta5;
}

Implementing the above fixes things and the model runs as expected.