Here:
@bgoodri shows how to represent categorical variables in Stan.
I have the following syntactically correct unpooled logistic regression model with both varying slopes and varying intercepts:
data {
int<lower=0> N; // number of samples
int<lower=0> numCities; // number of cities
int<lower=0> numSources; // number of sources
int<lower=0> numTypes; // number of types
int<lower=0> numSeasons; // number of seasons
int<lower=0> numMonths; // number of months
int<lower=0> numYears; // number of years
int<lower=0> numStatuses; // number of species statuses
int<lower=0, upper=1> mislabelled[N]; // vector of mislabelled samples
int<lower=0, upper=numCities> city[N]; // vector of cities
int<lower=0, upper=numSources> source[N]; // vector of sources
int<lower=0, upper=numTypes> type[N]; // vector of types
int<lower=0, upper=numSeasons> season[N]; // vector of seasons
int<lower=0, upper=numMonths> month[N]; // vector of months
int<lower=0, upper=numYears> year[N]; // vector of years
int<lower=0, upper=numStatuses> status[N]; // vector of species statuses
}
parameters {
// varying intercept
vector[numCities] alpha_city;
vector[numSources] alpha_source;
vector[numTypes] alpha_type;
vector[numSeasons] alpha_season;
vector[numMonths] alpha_month;
vector[numYears] alpha_year;
vector[numStatuses] alpha_status;
// varying slopes
vector[numCities] beta_city;
vector[numSources] beta_source;
vector[numTypes] beta_type;
vector[numSeasons] beta_season;
vector[numMonths] beta_month;
vector[numYears] beta_year;
vector[numStatuses] beta_status;
}
model {
vector[N] eta;
// priors for varying interceptss
alpha_city ~ normal(0, 1);
alpha_source ~ normal(0, 1);
alpha_type ~ normal(0, 1);
alpha_season ~ normal(0, 1);
alpha_month ~ normal(0, 1);
alpha_year ~ normal(0, 1);
alpha_status ~ normal(0, 1);
// priors for varying slopes
beta_city ~ normal(0, 1);
beta_source ~ normal(0, 1);
beta_type ~ normal(0, 1);
beta_season ~ normal(0, 1);
beta_month ~ normal(0, 1);
beta_year ~ normal(0, 1);
beta_status ~ normal(0, 1);
for (i in 1:N) {
eta[i] = alpha_city[city[i]] +
alpha_source[source[i]] +
alpha_type[type[i]] +
alpha_season[season[i]] +
alpha_month[month[i]] +
alpha_year[year[i]] +
alpha_status[status[i]] +
beta_city[city[i]] +
beta_source[source[i]] +
beta_type[type[i]] +
beta_season[season[i]] +
beta_month[month[i]] +
beta_year[year[i]] +
beta_status[status[i]];
}
// likelihood
mislabelled ~ bernoulli_logit(eta);
}
My (verbose) model contains only categorical predictors and one categorical response, which both must be coded as dummy/indicator variables.
However, @bgoodri doesn’t indicate bounds in the data
block between 0 and 1.
When I try to fit some fake data coded as 0/1
N <- 1000
numCities <- 25
numSources <- 3
numTypes <- 4
numSeasons <- 4
numMonths <- 12
numYears <- 2
numStatuses<- 5
mislabelled <- sample(0:1, N, replace = TRUE)
city <- sample(0:1, N, replace = TRUE)
source <- sample(0:1, N, replace = TRUE)
type <- sample(0:1, N, replace = TRUE)
season <- sample(0:1, N, replace = TRUE)
month <- sample(0:1, N, replace = TRUE)
year <- sample(0:1, N, replace = TRUE)
status <- sample(0:1, N, replace = TRUE)
(fit <- stan("mislabelled_unpooled.stan", data = list(N = N,
numCities = numCities,
numSources = numSources,
numTypes = numTypes,
numSeasons = numSeasons,
numMonths = numMonths,
numYears = numYears,
numStatuses = numStatuses,
mislabelled = mislabelled,
city = city,
source = source,
type = type,
season = season,
month = month,
year = year,
status = status),
iter = 5000, control = list(max_treedepth = 15)))
I get an Exception error
Exception: []: accessing element out of range. index 0 out of range; expecting index to be between 1 and 25; index position = 1alpha_city (in 'model40248cfcbc3_mislabelled_unpooled' at line 70)"
which stems from the fact that 0/1 doesn’t match the bounds in the data
block.
Line 70 in my model corresponds to the specification of eta
.
Setting the <upper>
limit in the data
block to numCities
, etc. won’t work since the dummy structure is not preserved.
Am I missing something?