Fitting a logistic model with only categorical variables

jphill01 · March 3, 2023, 9:44pm

I’m attempting to fit the following syntactically-correct logistic model in Stan using HMC via stan(), but am getting an error. The model contains only categorical predictors as well as a categorical response, which I have coded as dummy variables.

data {
  int<lower = 0> N; // number of samples
  int<lower = 0, upper = 1> mislabelled[N]; // vector of mislabelled samples
  int<lower = 1, upper = 25> city[N]; // vector of cities
  int<lower = 1, upper = 3> source[N]; // vector of sources
  int<lower = 1, upper = 4> season[N]; // vector of seasons
  int<lower = 1, upper = 12> month[N]; // vector of months
  int<lower = 1, upper = 5> status[N]; // vector of species statuses
}

parameters {
  real beta0;
  vector[24] beta1;
  vector[2] beta2;
  vector[3] beta3;
  vector[11] beta4;
  vector[4] beta5;
}

model {
  vector[N] eta;
  beta0 ~ normal(0, 1);
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  beta3 ~ normal(0, 1);
  beta4 ~ normal(0, 1);
  beta5 ~ normal(0, 1);
  eta = beta0 + beta1[city[1:N]] + beta2[source[1:N]] + beta3[season[1:N]] + beta4[month[1:N]] + beta5[status[1:N]];
  mislabelled ~ bernoulli_logit(eta);
}

Here is my fake data:

library(rstan)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE) # only have to compile once unless code is changed

N <- 10
mislabelled <- sample(0:1, N, replace = TRUE) 
city <- sample(1:25, N, replace = TRUE)
source <- sample(1:3, N, replace = TRUE) 
season <- sample(1:4, N, replace = TRUE)
month <- sample(1:12, N, replace = TRUE)
status <- sample(1:5, N, replace = TRUE)

(fit <- stan("mislabelled.stan", data = list(N = N, mislabelled = mislabelled, city = city, source = source, season = season, month = month, status = status), iter = 2000))

The error I am getting is an out-of-range error.

Exception: vector[multi] indexing: accessing element out of range. index 3 out of range; expecting index to be between 1 and 2  (in 'model3ee2204da41_mislabelled' at line 28)

Line 28 is where the equation for eta is specified.

I’m a bit confused since I don’t have any indices between 1 and 2 as the error seems to suggest.

Any ideas on what’s going on here?

mingqian.guo · March 3, 2023, 11:11pm

Seems your beta vector size does not compatible with the sample data? Sample data maximum value is always larger than the beta value length.

jphill01 · March 3, 2023, 11:40pm

I’m not sure follow…

The beta vectors have length one less than the number of observations for a given vector. So, in the case of city, there are 25 cities, but one city is used by the model as the reference city (as is usually the case with categorical dummy variables in general). Hence, why beta1 is of length 24.

Could you perhaps clarify?

jsocolar · March 4, 2023, 2:13am

From your data simulation:

source <- sample(1:3, N, replace = TRUE)

From your Stan program

vector[2] beta2;
...
beta2[source[1:N]]

The maximum allowable index for beta2 is 2. The index must be “between” 1 and 2 inclusive, i.e. it must be either 1 or 2. But you are passing it an index of up to 3.

jphill01 · March 4, 2023, 2:38am

Ah.

So, the length of all beta vectors defined in the parameters block need to be equal to the upper limit for the respective covariate defined the the data block. That is:

parameters {
  real beta0;
  vector[25] beta1;
  vector[3] beta2;
  vector[4] beta3;
  vector[12] beta4;
  vector[5] beta5;
}

Implementing the above fixes things and the model runs as expected.

Topic		Replies	Views
Representing categorical variables in Stan Modeling	4	1053	March 8, 2023
Error in categorical_logit function Modeling rstan	1	382	June 14, 2023
Simple Multinomial Logistic Regression Performance Modeling performance	5	1006	September 30, 2020
Categorical Predictor with Measurement Error Regression Modeling specification	2	412	January 15, 2021
Efficient implementation of the for loop? Modeling rstan , fitting-issues	1	596	February 18, 2022

Fitting a logistic model with only categorical variables

Related topics