Trying to write a four level multi-level model, encountering error

Hi, I’m trying to code up a four level multi-level model. I’m not sure if my error is computational or my ‘statistical thinking’ about the problem is flawed.

I’m looking at prices of goods within stores, which are within banners (supermarket chains, ex. Safeway or Whole Foods), within regions.

I’m working in Rstan, importing the stan file via stanc() and stan_model(). The model compiles fine, but I encounter the error:

Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) : 
  Exception: mismatch in dimension declared and found in context; processing stage=data initialization; variable name=region_level_lookup; position=0; dims declared=(5); dims found=(17)  (in 'model50482148368c_four_level' at line 24)

failed to create the sampler; sampling not done

So the error refers to the fact that, for the region-level lookup, the model expects there to be 5 dims (equal to the number of banners) and 17 are found. The reason that there are 17 is because there are 17 different banner/region combinations, for example, there are Whole Foods in both the West Coast and East Coast region.

Note that this isn’t a problem for the banner level lookup, where each of the 17 stores fall into one of 5 banners, for a total of 17 store/banner combinations.

I suspect that the error has something to do with the “lookups” are performed within the “transformed parameters” block.

Either that, or I may be making some more fundamental error around how multi-level models are conceptualized.


I’ve attached stan data rdump, and the stan model is below.

And here is my stan model

  data {
   // Define variables in data
   // Number of level-1 observations (an integer)
   int<lower=0> N_obs;
   // level 1 categorial predictor
   int upc_id[N_obs];
   //Number of Level 1 categorial predictors
   int<lower=0> N_upc;
   // Continuous outcome
   real Price[N_obs];
   // Number of level-2 clusters
   int<lower=0> N_stores;
   // Number of level-3 clusters
   int<lower=0> N_banners;
   // Number of level-4 clusters
   int<lower=0> N_regions;
   // Cluster IDs (for all levels)
   int<lower=1> store_id[N_obs];
   int<lower=1> banner_id[N_obs];
   int<lower=1> region_id[N_obs];
   // Level 3 look up vector for level 2
   int<lower=1> banner_level_lookup[N_stores];
   // Level 4 look up vector for level 3
   int<lower=1> region_level_lookup[N_banners];
 }

parameters {
  // Population intercept
  real beta_0;

  // Population Slope- a different slope for each factor
  //vector[N_upc] beta_1;

  // Level-1 errors
  real<lower=0> sigma_e0;
  // Note that the subscripts changed between the two_level and 3 level model, not _j represents the 3rd level not the 2nd
  
  // Level-2 random effect
  vector[N_stores] u_0ijk;
  real<lower=0> sigma_u0ijk;

  // Level-3 random effect
  vector[N_banners] u_0jk;
  real<lower=0> sigma_u0jk;
  
  // Level-4 random effect
  vector[N_regions] u_0k;
  real<lower=0> sigma_u0k;
}

transformed parameters  {

  // Varying intercepts
  vector[N_stores] beta_0ijk;
  vector[N_banners] beta_0jk;
  vector[N_regions] beta_0k;

  // Level-2- start from the population and work your way down (inverse)!
  beta_0k = beta_0 + u_0k; //population -> regions

  // Level-3 //regions -> banners
  beta_0jk = beta_0k[region_level_lookup] + u_0jk;
  
  // Level-4 //banners -> stores
  beta_0ijk = beta_0jk[banner_level_lookup] + u_0ijk;

}

model {
  // Prior part of Bayesian inference

  // Random effects distribution
  u_0k  ~ normal(0, sigma_u0k);
  u_0jk ~ normal(0, sigma_u0jk);
  u_0ijk ~ normal(0, sigma_u0ijk);

  // Likelihood part of Bayesian inference
   Price ~ normal(beta_0ijk[store_id], sigma_e0);
}

generated quantities {
  vector[N_obs] log_lik;
  for (n in 1:N_obs) log_lik[n] = normal_lpdf(Price[n] | beta_0ijk[store_id][n], sigma_e0);
}

I should add that I’ve built a 3 level model and it works fine.

stan_data_dump.R (224.2 KB)

1 Like

It looks like it’s failing because there’s an inconsistency in the dimensions of the data that you’re declaring and passing.

In the stan model, you’re declared:

int<lower=0> N_banners;
int<lower=1> region_level_lookup[N_banners];

So the integer array region_level_lookup should be length N_banners, but in the data you’re passing:

> N_banners
[1] 5
> length(region_level_lookup)
[1] 17

So you’ve told stan to expect an integer array of length 5, but then you’ve passed an array of length 17

1 Like

OK, I can see that mistake. I guess I should just ask- is there a way to do a 4-level multi-level model in Stan? If so, what is my code missing/getting wrong?

Multilevel modelling isn’t my area, so someone else will have to weigh in on that sorry. Are you able to specify your model in brms or rstanarm? That could save you a lot of time if it’s an option

Hey! I think you need something like this:


...

  // in transformed parameters...

  // Building levels - Non-centered parameterization 
  // Level-4
  vector[N_stores] beta_0ijk = sigma_u0ijk * u_0ijk;
  // Level-3
  vector[N_banners] beta_0jk = sigma_0jk * u_0jk;
  // Level-2
  vector[N_regions] beta_0k = sigma_0k * u_0k;
  // define in "model" or transformed parameters
  // mapping from each level to obs/N_obs
  vector[N_obs] mu = beta_0 + beta_0k[region_id] + beta_0jk[banner_id] + beta_0ijk[store_id];
  
...

  // in model block...

  Price ~ normal(mu, sigma_e0);
  
  // Prior on beta_0?!
  // assuming Price is on log-scale...
  beta_0 ~ normal(xx, xx);

  // ... because of non-centered parameterization
  u_0k  ~ std_normal();
  u_0jk ~ std_normal();
  u_0ijk ~ std_normal();

  // Needs priors on sigmas! Maybe exponential(1), or normal(0, 2.5) or something like that...
  // (assuming Price is on log-scale)
  sigma_0k  ~ exponential(1);
  sigma_u0jk ~ exponential(1);
  sigma_u0ijk ~ exponential(1);

  // Need prior for sigma_e0!
  // (assuming Price is on log-scale)
  sigma_e0 ~ exponential(1);
  
...

So basically just specify each level, and then map them all together to the Price[N_obs] variable via mu[N_obs]. You can use mu[n] in generated quantities as well. Also, I have laid out a non-centered parameterization, which I think works better most of the time, especially if you have many levels.

Looks like you want to add covariates as well. Check out the chapter in the Stan user guide about non-centered parameterization of the multivariate normal. That’ll come in handy for that.

Hope that helps! :)
Cheers!

Edit: Yes, such a model would probably be easier to to in brms or rstanarm, as @andrjohns pointed out. For example, rstanarm::stan_lmer(Price ~ 1 + (1|store) + (1|banner) +(1|region), data = ..., ...) will most likely give you the fastest possible implementation of such a model.

2 Likes

The model you proposed is a two-level multi-level model with a number of random effects. This will partially pool for each variable but not deliver what I’m looking for. What I’m trying to build is a model with multiple nested hierarchies: stores within banners within regions. In the model I’m proposing, the betas themselves are dependent on the betas at higher levels, for example, the effect of the banner depends on the region that it is in. Hence the need for a “banner_level_lookup” variable.

1 Like

For this you simply need to define the IDs as interactions:

  • Level 1: region_id
  • Level 2: region_id:banner_id
  • Level 3: region_id:banner_id:store_id

and the easiest way is to define those IDs like this right away (outside of Stan). The code is then like the code I posted above.

I’m not familiar with how to define an interaction outside of a regression equation. Are you saying I create new variables? So for level 2, if there are 4 regions and 8 banners, there could be up to 32 different ID’s, and I would create a new variable called “level 2”?

Hey!

Yeah, right! For example in R what I do to make these “hierarchical factors” is:

region_banner_factor <- paste0(region_id, "_", banner_id)
region_banner_factor <- factor(region_banner_factor)
region_banner_levels <- levels(region_banner_factor)
region_banner_id <- as.numeric(region_banner_factor)

# ...and to convert back from IDs to factors just do
region_banner_levels[region_banner_id]

… and the same for the three-way interaction with store. Then just use region_id, region_banner_id, and region_banner_store_id in your Stan model.

This should be equivalent to what the lme4 syntax (and thus rstanarm::stan_{g}lmer) for nested hierarchical models does. The lme4 syntax would be (1|region_id/banner_id/store_id).

Cheers!
Max