Extremely slow-fitting hierarchical regression model - optimization suggestions?

Hi,

I’m having difficulty fitting the following model: it samples extremely slowly, even though I’ve run models with comparable structure, numbers of data points, and numbers of parameters before. I suspect it has something to do with the way I’m trying to put random intercepts for subject into the lexicon_intercept and grammar_intercept terms, but I can’t quite figure out what might be going wrong. Any advice for changes (great or small) would be greatly appreciated.

data { 
  int<lower = 1> N_trials;
  int<lower = 1> N_subjects;
  int<lower = 1, upper = N_subjects> subject[N_trials];

  int<lower = 0,upper = 1> shift_or_not[N_trials];
  real frequency[N_trials]; 
  real weight[N_trials];
  real know_remote[N_trials];
  real cosine_sim[N_trials];
}
parameters {
  vector[N_subjects] lexicon_intercept_subject_adjustor;
  real lexicon_intercept;
  vector[N_subjects] grammar_intercept_subject_adjustor;
  real grammar_intercept;
  real b_weight;
  real b_cosine;
  real b_freq;
}
model {
  real lexicon ;
  real grammar;
  real dep;
  
  target += normal_lpdf(lexicon_intercept_subject_adjustor | 0, 2);
  target += normal_lpdf(grammar_intercept_subject_adjustor |  0, 2);

  target += normal_lpdf(lexicon_intercept | 0, 2);
  target += normal_lpdf(grammar_intercept | 0, 2);  
  target += normal_lpdf(b_weight | 0, 1); 
  target += normal_lpdf(b_cosine | 0, 1);  
  target += normal_lpdf(b_freq | 0, 1);  
  for(n in 1:N_trials){
    for (s in 1:N_subjects){
    lexicon = (lexicon_intercept+ lexicon_intercept_subject_adjustor[s] + b_freq*frequency[n]+ b_cosine*cosine_sim[n]);
    grammar = (grammar_intercept + grammar_intercept_subject_adjustor[s] + b_weight*weight[n]);
    dep = inv_logit((know_remote[n] * lexicon)+(know_remote[n] * (1-lexicon) * grammar)+((1-know_remote[n]) * grammar));
    shift_or_not[n] ~ bernoulli_logit(dep);
  }
}
}


R code is here:

library(rstan)
library(tidyverse)

n_subjects <- simplified_data%>% 
  select(BetterSubject) %>% 
  distinct() %>% 
  summarise(
    count = n()
  )

data_list_simple <-  list(N_trials = nrow(simplified_data),
         shift_or_not = simplified_data$shift_or_not,
         weight = simplified_data$SWP,
         LocalBase = simplified_data$LocalBase.x,
         frequency = simplified_data$ZeroFilledRemoteFreq,
         know_remote = simplified_data$know_remote,
         cosine_sim = simplified_data$ZerpFilledCosineSim,
         subject = simplified_data$BetterSubject,
         N_subjects = n_subjects$count
         )


Data used is here: data.csv (250.2 KB)

Create a large vector of logits and then make a single vectorised call to bernoulli_logit. That is, rewrite the model likelihood so that the bernoulli_logit statement sits outside of the nested loop.
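To illustrate the shape of that pattern only, here is a minimal toy logistic regression; the names N, y, x, alpha, and beta are made up for the example and are not part of the model in this thread:

data {
  int<lower = 1> N;
  int<lower = 0, upper = 1> y[N];
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  vector[N] eta = alpha + beta * x;  // one logit per observation
  alpha ~ normal(0, 2);
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(eta);          // single vectorised statement, no loop over observations
}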

(I haven’t read any other details of your model)


Hi Canaan,

I believe the culprit is this loop:

  for(n in 1:N_trials){
    for (s in 1:N_subjects){
    lexicon = (lexicon_intercept+ lexicon_intercept_subject_adjustor[s] + b_freq*frequency[n]+ b_cosine*cosine_sim[n]);
    grammar = (grammar_intercept + grammar_intercept_subject_adjustor[s] + b_weight*weight[n]);
    dep = inv_logit((know_remote[n] * lexicon)+(know_remote[n] * (1-lexicon) * grammar)+((1-know_remote[n]) * grammar));
    shift_or_not[n] ~ bernoulli_logit(dep);
  }
}

Here, each entry of shift_or_not is receiving N_subjects different sampling statements (one per iteration of the inner loop over subjects), rather than one per entry. Setting up the model to use a single large vector of logits, like @wds15 suggested above, would help make sure that this isn’t happening.
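For concreteness, the double-counting could also be removed while keeping a loop, by looping over trials only and using subject[n] to pick out each trial’s intercept adjustment, so that every shift_or_not[n] appears in exactly one statement. A sketch using the variable names above (the fully vectorised version discussed below is still preferable):

  for (n in 1:N_trials) {
    real lexicon = lexicon_intercept + lexicon_intercept_subject_adjustor[subject[n]]
                   + b_freq * frequency[n] + b_cosine * cosine_sim[n];
    real grammar = grammar_intercept + grammar_intercept_subject_adjustor[subject[n]]
                   + b_weight * weight[n];
    // no inv_logit here: bernoulli_logit applies it internally
    shift_or_not[n] ~ bernoulli_logit(know_remote[n] * lexicon
                                      + know_remote[n] * (1 - lexicon) * grammar
                                      + (1 - know_remote[n]) * grammar);
  }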


Hi @wds15 and @andrjohns - thanks so much for your helpful replies. Unfortunately, I’m not exactly sure how to implement the vectorized call you suggested, although I think I understand why it should be used. Would you mind giving me a small example to show how I might do this?

Thanks very much!

Hi again,

I gave this a try on my own, and got something that at least compiles; it’s still extremely slow (but the estimated total time is down from before). Did I do this right / any other tips? I read that putting all the predictors into a matrix is very fast, but I’m not sure how to go about that either.

data { 
  int<lower = 1> N_trials;
  int<lower = 1> N_subjects;
  int<lower = 1, upper = N_subjects> subject[N_trials];

  int<lower = 0,upper = 1> shift_or_not[N_trials];
  real frequency[N_trials]; 
  real weight[N_trials];
  real know_remote[N_trials];
  real cosine_sim[N_trials];
}
parameters {
  vector[N_subjects] lexicon_intercept_subject_adjustor;
  real lexicon_intercept;
  vector[N_subjects] grammar_intercept_subject_adjustor;
  real grammar_intercept;
  real b_weight;
  real b_cosine;
  real b_freq;
}
model {
  real lexicon ;
  real grammar;
  real dep;
  vector[N_trials] BIG_dep;
  target += normal_lpdf(lexicon_intercept_subject_adjustor | 0, 2);
  target += normal_lpdf(grammar_intercept_subject_adjustor |  0, 2);

  target += normal_lpdf(lexicon_intercept | 0, 2);
  target += normal_lpdf(grammar_intercept | 0, 2);  
  target += normal_lpdf(b_weight | 0, 1); 
  target += normal_lpdf(b_cosine | 0, 1);  
  target += normal_lpdf(b_freq | 0, 1);  
  for(n in 1:N_trials){
    for (s in 1:N_subjects){
    lexicon = (lexicon_intercept+ lexicon_intercept_subject_adjustor[s] + b_freq*frequency[n]+ b_cosine*cosine_sim[n]);
    grammar = (grammar_intercept + grammar_intercept_subject_adjustor[s] + b_weight*weight[n]);
    dep = inv_logit((know_remote[n] * lexicon)+(know_remote[n] * (1-lexicon) * grammar)+((1-know_remote[n]) * grammar));
    BIG_dep[n] = dep;
  }
}
    shift_or_not ~ bernoulli_logit(BIG_dep);

}


generated quantities{
  vector[N_trials] yrep;
  for (n in 1:N_trials){
    real lexicon = (lexicon_intercept + b_freq*frequency[n]+ b_cosine*cosine_sim[n]);
    real grammar = (grammar_intercept + b_weight*weight[n]);
    real dep = inv_logit((know_remote[n] * lexicon)+(know_remote[n] * (1-lexicon) * grammar)+((1-know_remote[n]) * grammar));
    yrep[n] = bernoulli_logit_rng(dep);
    
  }
}
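
(For reference, the “predictors in a matrix” idea mentioned above usually means a plain design-matrix logistic regression along the lines of the sketch below. The names N, K, X, y, alpha, and beta are illustrative only, and this form does not directly capture the lexicon/grammar structure of the model in this thread.)

data {
  int<lower = 1> N;
  int<lower = 1> K;
  matrix[N, K] X;                 // one row per trial, one column per predictor
  int<lower = 0, upper = 1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(alpha + X * beta);  // the matrix-vector product gives all N logits at once
}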

Hi Canaan,

A few things here. You still have the same issue with the indexing in your loop: because you’re looping over both N_trials and N_subjects, each entry of BIG_dep gets overwritten N_subjects times, so the model is fitted using only the values from the last subject. You can actually do these calculations without looping at all, by using the subject variable to select the right intercept for the right individual:

  lexicon = (lexicon_intercept+ lexicon_intercept_subject_adjustor[subject]
              + b_freq * frequency + b_cosine * cosine_sim);
  grammar = (grammar_intercept + grammar_intercept_subject_adjustor[subject] + b_weight * weight);
  dep = (know_remote .* lexicon) + (know_remote .* (1-lexicon) .* grammar)
                    +((1-know_remote) .* grammar);

Here, lexicon, grammar, and dep are all declared as vector[N_trials]. You’ll also notice that I’ve removed the inv_logit from dep; this is applied implicitly by the bernoulli_logit distribution, so you don’t need to do it yourself.
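Concretely, the two forms are related by bernoulli_logit(y | \eta) = bernoulli(y | inv_logit(\eta)), where inv_logit(\eta) = 1 / (1 + e^{-\eta}), so applying inv_logit yourself and then calling bernoulli_logit would push the linear predictor through the logistic function twice.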

You should also change the target += normal_lpdf to ~ normal(), unless you’re planning on calculating things like Bayes factors, as the ~ normal() can be more efficient. To add to this, rather than using normal(0,1), Stan has an std_normal() distribution which will be more efficient.
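
For example, the following three statements all place the same standard normal prior on b_weight (you would use only one of them): the first keeps the normalising constants (useful for things like bridge sampling), the second drops constant terms that don’t affect sampling, and the third uses a specialised standard-normal implementation.

  target += normal_lpdf(b_weight | 0, 1);  // keeps normalising constants
  b_weight ~ normal(0, 1);                 // drops constants not needed for sampling
  b_weight ~ std_normal();                 // same prior, specialised implementation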

Down in the generated quantities:

generated quantities{
  vector[N_trials] yrep;
  for (n in 1:N_trials){
    real lexicon = (lexicon_intercept + b_freq*frequency[n]+ b_cosine*cosine_sim[n]);
    real grammar = (grammar_intercept + b_weight*weight[n]);
    real dep = inv_logit((know_remote[n] * lexicon)+(know_remote[n] * (1-lexicon) * grammar)+((1-know_remote[n]) * grammar));
    yrep[n] = bernoulli_logit_rng(dep);
    
  }
}

Rather than re-computing dep here, you can move the computation to the transformed parameters block and re-use it in generated quantities. Also, bernoulli_logit_rng is vectorised, so you can call it in the same way as the main sampling statement.

If I make those changes to your syntax, the model runs for me in about 5 minutes (for 2000 iterations).

Full syntax:

data { 
  int<lower = 1> N_trials;
  int<lower = 1> N_subjects;
  int<lower = 1, upper = N_subjects> subject[N_trials];
  int<lower = 0,upper = 1> shift_or_not[N_trials];
  vector[N_trials] frequency; 
  vector[N_trials] weight;
  vector[N_trials] know_remote;
  vector[N_trials] cosine_sim;
}
parameters {
  vector[N_subjects] lexicon_intercept_subject_adjustor;
  real lexicon_intercept;
  vector[N_subjects] grammar_intercept_subject_adjustor;
  real grammar_intercept;
  real b_weight;
  real b_cosine;
  real b_freq;
}

transformed parameters {
  vector[N_trials] dep;

  {
  vector[N_trials] lexicon;
  vector[N_trials] grammar;
  lexicon = (lexicon_intercept+ lexicon_intercept_subject_adjustor[subject]
              + b_freq * frequency + b_cosine * cosine_sim);
  grammar = (grammar_intercept + grammar_intercept_subject_adjustor[subject] + b_weight * weight);
  dep = (know_remote .* lexicon) + (know_remote .* (1-lexicon) .* grammar)
                    +((1-know_remote) .* grammar);
  }
}

model {
  lexicon_intercept_subject_adjustor ~ normal(0, 2);
  grammar_intercept_subject_adjustor ~ normal(0, 2);
  lexicon_intercept ~ normal(0, 2);
  grammar_intercept ~ normal(0, 2);  
  b_weight ~ std_normal(); 
  b_cosine ~ std_normal();  
  b_freq ~ std_normal(); 

  shift_or_not ~ bernoulli_logit(dep);
}
generated quantities{
  int yrep[N_trials];
  
  yrep = bernoulli_logit_rng(dep);
}

Wow, thanks so much @andrjohns! I apologize for these being such basic questions, but I’m just starting out hand-coding models and have a lot to learn. I really appreciate you taking the time to help me out. I’ll give this code a shot and try to work backwards to understand the principles underlying it more clearly.

No worries, we all started from the same place!


Actually, @andrjohns, I have one more (hopefully quick) question:

I’m struggling with extremely low ESS for some parameters, which I suspect is due to using the wrong parameterization (centered or non-centered) for the random intercepts in lexicon and grammar. However, I’m having a bit of trouble wrapping my head around how to change things in the model I wrote; would you mind providing a short example or a few tips on how to think about this? I’ve seen several pages with MWEs elsewhere on the forum, but I’m having trouble applying them given the vectorization I have going on in my model.

It probably isn’t a non-/centered parameterisation issue, since your distributions are N(0,2) and the non-centered parameterisation expresses distributions as functions of N(0,1). You tend to see it used with hierarchical models that have priors N(\mu,\sigma^2) where \mu and \sigma also have priors themselves.

If you want to try it out in your model, you’d use the following:

parameters{
  vector[N_subjects] lexicon_intercept_subject_adjustor_raw;
}

transformed parameters {
  vector[N_subjects] lexicon_intercept_subject_adjustor = lexicon_intercept_subject_adjustor_raw * 2;
}

model {
  lexicon_intercept_subject_adjustor_raw ~ std_normal();
}

The other thing to do is double-check the math for the lexicon, grammar, and dep variables, and make sure that they’re all being constructed correctly.

If that doesn’t help, then try increasing the adapt_delta value to something like 0.95 or 0.99 (and also the max_treedepth to 15 if you run into treedepth issues) and running the chains for longer.
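
With rstan, those options are passed through the control argument of stan(); for example (the file name here is just a placeholder for wherever the model above is saved):

fit <- stan(
  file = "model.stan",   # placeholder path to the Stan program above
  data = data_list_simple,
  iter = 4000,           # run the chains for longer than the default 2000 iterations
  control = list(adapt_delta = 0.99, max_treedepth = 15)
)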

Ah, I see - OK, I’ll give this a shot, thanks!