Very slow multilevel model

performance

#1

Hi,

I am fitting a multilevel model to a collection of J curves with total N data points. I have a group level predictor grp_level_x for each of the J curves.
And I am fitting the following functional form:
y = (A*((x-B)/(100-B))^(-n))

The following stan code runs very slow and I am unable to run it long enough to see if this works. I had implemented an unpooled version before and that gave me reasonable results so now I was trying to get a multilevel model working. I am relatively new to stan so any help is appreciated. Thanks,

data {

int<lower=4> N;
  int<lower=0> J;
  vector[N] y;
  vector[N] x;
  int curveId[N];
  vector[J] grp_level_x;
}
transformed data{

}
parameters {
   vector<lower = 0.25,upper=5>[J] n;
   vector<lower=0>[J] A;
   vector[J] B;
   real<lower=0> eps;
   
   // B model
   real alpha_B;
   real beta_B;
   real<lower=0> eps_B;
   
   // n model
   real alpha_n;
   real beta_n;
   real<lower=0> eps_n;
   
   // A model
   real alpha_A;
   real beta_A;
   real<lower=0> eps_A;
}
transformed parameters {

}
model {
  
  vector[N] y_hat;
  
  for (i in 1:N){
    y_hat[i] = (A[curveId[i]]*((x[i] - B[curveId[i]])/(100. - B[curveId[i]]))^(-n[curveId[i]]));
  }
  
  for (j in 1:J){
      B[j] ~ lognormal(alpha_B + beta_B*grp_level_x[j],eps_B);
      n[j] ~ lognormal(alpha_n + beta_n*grp_level_x[j],eps_n)T[0.25,10];
      A[j] ~ lognormal(alpha_A + beta_A*grp_level_x[j],eps_A);
  }
  y ~ normal(y_hat,eps);
}

#2

First off, add priors for eps_*, alpha_*, and beta_*. If you don’t know anything else, just do normal(0.0, approrpriate_guess_of_scale).

If this doesn’t work, and since your simple models work, what I’d do is:

  1. Just do J independent fits in one model and see if it works. Something like:
for (j in 1:J){
      B[j] ~ normal(0.0, 2.0);
      n[j] ~ normal(0.0, 2.0);
      A[j] ~ normal(0.0, 2.0);
  }

Or whatever your prior was in the original fits. I think lognormals are kinda heavy-tailed, so if you just need a zero avoiding prior, maybe switch to a gamma?

  1. Make it hierarchical, but don’t do group level predictors
for (j in 1:J){
      B[j] ~ lognormal(alpha_B, eps_B);
      n[j] ~ lognormal(alpha_n, eps_n)T[0.25,10];
      A[j] ~ lognormal(alpha_A,eps_A);
  }
  1. Add in the group level predictors

Maybe they need centered or something. Hopefully something in there works!

edit: fixed formatting


#3

Thanks Ben. I had been following a similar route but as soon as I jump to hierarchical (even #2 above) the code gets stuck. If I run 4 chains, a couple of them run very fast but the remaining stay stuck for multiple hours.
One other question I have is regarding my likelihood function

y_ij ~ A_j ((x - b_j)/(100 - b_j)) ^(-n_j)

It’s not really defined for x’s smaller than b_j. For each of the J curves the minimum value of x is different.How do I specify that and is this something which might be causing stan to get stuck?

Thanks,
Amit


#4

Priors on the hyperparameters didn’t help at all?

It sounds like you want to constrain all the values of B to be greater than the minimum x in each curve fit?

Vector constraints aren’t coded in Stan yet, but you can do this stuff manually. Check the “Transformations of Constrained Variables” section in the manual.

Something like this, maybe:

data {
  real max_x_in_group[J];  // Pass in the maximum x values in each group (B can't go less than this)
}

parameters {
  real B_unconstrained[J];
}

// Transform from unconstrained to constrained space
transformed parameters {
  real B[J];
  for(i in 1:J) {
    B[i] = exp(B_unconstrained[i]) + max_x_in_group[i];
  }
}

model {
  // Add in Jacobian adjustment
  for(i in 1:J) {
    target += B_unconstrained[i];
  }
  ...
}

#5

No, the priors on hyperparameters didn’t help. Thanks for the transformation trick.


#6

That’s a bummer. I had high hopes


#7

The scale of declaration doesn’t match the truncation (5 and 10).

For speed, you should vectorize everything other than the truncation one.

Have you tried simulated data? Often problems can arise if the model’s not well specified.


#8

Yes Bob, I have been working with the simulated data now for past one week. I am running into issues but have found most of the answers by searching through the forum discussions. I am still not there but making progress one step at a time!


#9

How large are N and J Here? How many observations per group are you simulating? I have a strange feeling that there is some kind of unidentifiability for the (A_{j}, \alpha_{A}, \beta_{A}, \epsilon_{A}) tuples that may be present for small numbers of observations per group. It almost feels like you want some kind of sum-to-zero constraint on A_{j, \text{raw}}, where A_{j} = A_{j, \text{raw}} + \exp\{\alpha_{A} + \beta_{A}z_{j}\} , where z_{j} is your group level predictor. This doesn’t quite work as written though (the distribution for A_{j, \text{raw}} is a truncated normal with lower bound \exp\{\alpha_{A} + \beta_{A}z_{j}\}, which is very strange). Fair chance I’m talking complete rubbish though.

Do you have some pairs plots of (A_{j}, \alpha_{A}, \beta_{A}, \epsilon_{A}) for a single j? For each of the three curve defining parameters?


#10

I have been able to get the hierarchical model to work.

However, in the current setup I do not have a generated quantities block in my training code. I am using a separate generative code to make predictions on my test data by feeding in the fitted mean values. I would like to keep the training and testing part separate. I am currently doing the following to run the generative code:

sm.sampling(data=data_gen,iter=1, chains=1, verbose=False, n_jobs=1, algorithm=“Fixed_param”)

  1. I have to run this in a loop to get multiple realizations and it also throws out bunch of warnings.
  2. Also, I can only feed in the mean values to make predictions.

Is there a better way to do this?

@hhau : I have been to able to make the model work and it is providing reasonable estimates.


#11

Don’t make predictions based on the means – you did the full Bayesian inference so use the full posterior!

Why not? This is what you want to do.

Just break the data into two pieces in the data block if you really want to make sure you aren’t mixing anything:

data {
  vector[N1] data_set_1;
  vector[N2] data_set_2;
}

And then just use data_set_2 in the generated quantities and data_set_1 in the first bit.


#12

Without knowing which warnings, it’s impossible to say whether you should be worried about them or not.

Why? Is it an efficiency concern? It’s not “cheating” if that’s what you’re worried about. In fact, it’s typical to use the inputs of the held-out data, not the outputs, as there’s information there.

We’re about to roll out a standalone generated quantities option that will let you run generated quantities on previously fits.

Until then, you can do full Bayesian posterior inference in R or Python on the outside, but you have to be careful with averaging over the draws and with parameterizations of things like densities.


#13

Thanks @Bob_Carpenter @bbbales2
It’s an efficiency concern. My model is a hierarchical model where I have group level predictors. So, the final use of my model is to generate curves and their bounds for unseen group level predictors. I want to use the fitted model to make these predictions and not have to go through the fitting phase.
The standalone version is what I was looking for. Thanks for the info.


#14

To emulate this, just iterate through your posterior draws and use the parameter values of each draw to generate posterior predictive distributions.

So like if your parameters were:

parameters {
  real a;
}

And you generated 4000 post warmup draws, then just iterate through each of the 4000 posterior values of a and do your generated quantities that way.


#15

Yes, that will work for me! Thanks for the suggestion.