Differences in estimating random effects between Stan and NONMEM

Hi there,
I apologize ahead of time if my question is not well formed as I’m quite new to Stan (1 month of us). My questions involve a nonlinear mixed effects models on a large data set (25 K data points). Based on a template for a two-level linear mixed effects model in stan ( https://rpubs.com/kaz_yos/stan-multi-1), the individual random effects are declared as parameters to be estimated. Using this template, I also declared the random effects as parameters in my model (which I can post if requested). However, I have approximately 10K of them. The run time in stan for this model is likely to be around a month. My questions are:

  1. Is my model taking so long because it is attempting to estimate a posterior with those 10K parameters over 25K data points?
  2. If it is, then is there a way to bypass estimating these random effects, but rather only estimate the standard deviation of these random effects, thereby dropping my total parameters from about 10K to less than 50? This is what NONMEM does and the same model takes about 1 day to estimate the parameters.
  3. If this is not the source of the nearly order of magnitude greater run time compared to NONMEM, then what is the bottleneck?

Any insight would be appreciated. Thanks.

We’re working on max marginal likelihood and max posterior mode for Stan, but we’re not there yet.

How many iterations are you running that you’re estimating a month run time?

At scale, you have to be careful to ensure everything’s vectorized and avoids recompution and has efficient variables (like Cholesky factors rather than covariance matrices). You also have to worry about statistical efficiency in terms of mixing in hierarchical models, and the parameterization and priors can have a big effect here.

Making the code more efficient by vectorizing, declaring constants, avoiding recomputing quantities, etc., and also making it more statistically efficient with good parameterizations and priors are both

Thanks for the quick reply Bob. So if I understand you correctly, those modes are not implemented in stan yet, but will bypass estimating those parameters individually? To answer your other questions:

I’m running 4 K iterations (2K warmup), with alpha_delta = .99, eps = .05, and max_tree_depth = 15.

Here is my model.


	int N;             			// number of total Observations
	int P;              			// number of subjects
 	int IDp[N];            			// Patient ID 
 	int IDs[N];            			// Study ID 
 	int SEX[N];
 	int AGE[N];
 	int COMED[N];
 	real APOE4[N];
  vector[N] time;      			// time of observation (years)
	real S[N];           			// measured ADAScog score


  	//Fixed effects and covariates 
  	real<lower=0,upper=1> theta_S0;         // Mean baseline score for average individual, 
  	real theta_r;                           // Mean progression rate for average individual, 
  	real<lower=0> theta_SEX;
  	real theta_AGE;
  	real<lower=0> theta_APOE4_b;
  	real theta_APOE4_r;
  	real theta_COMED;
  	real<lower=0,upper=100> tau;           
  	real<lower=0,upper=20> beta;

  	//Inter-patient re
  	vector[P] eta_pb;                     	// re for patient baseline
  	vector[P] eta_pr;            		      // re for patient rate
	  real<lower=0> omega_pb;               	// std for patient baseline
	  real<lower=0> omega_pr;               	// std for patient rate
transformed parameters{
  	vector[N] baseline_cov;
  	vector[N] rate_cov;
  	vector[N] S0;
  	vector[N] r;
  	vector<lower=0,upper=1>[N] muS;
    vector[N] ones;
  	for(i in 1:N) {

      	baseline_cov[i] = theta_S0*theta_SEX*SEX[i]*(1+theta_APOE4_b*(APOE4[i]-.72));
      	rate_cov[i] = theta_r*(1+theta_AGE*(AGE[i]-75))*(1+theta_APOE4_r*(APOE4[i]-.72))*(1+theta_COMED*(1-COMED[i]));
      	S0[i] = baseline_cov[i]*exp(eta_pb[IDp[i]]);
      	r[i] = rate_cov[i] + eta_pr[IDp[i]];
      	muS[i] = S0[i]/(S0[i]^beta +(1-S0[i]^beta)*exp(-beta*r[i]*time[i]))^(1/beta);


    	tau~uniform(0,100); //orig uniform(0,100)

	// Likelihood
    	S ~beta(muS*tau, (1-muS)*tau);

I did vectorize in the past when I used a simple base model ( beta = 1), but I don’t think there is vectorized exponentiation in stan, correct? I’m not to sure what is optimal in terms of how parameters and priors are declared, but if you have some general rules of thumb that would be great. If there are any other glaringly obvious bottlenecks here, feel free to critique it. Thanks.

First and foremost, the maximum marginal likelihood algorithms like that used by default in NONMEM are approximate algorithms. They work best when your model is conditionally Gaussian, otherwise they can give significantly biased posterior mean estimates and even worse uncertainty estimates. The Hamiltonian Monte Carlo algorithm used in Stan is slower but much more robust – it will take extra time to try to get a reasonable answer.

The fact that Stan is taking so long relative to NONMEM indicates that your model is very much not conditionally Gaussian and, while maximum marginal likelihood will be faster, it is unlikely to yield an extremely accurate answer.

This is also consistent with your model specification itself – eta_pb and eta_pr can be non-centered, and that may significantly speed up your model. The uniform priors can also put too much mass on extreme values and slow things down. We recommend taking a look at the “Optimizing Stan Models” section of the Stan manual, and you might also want to check out https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations.

1 Like

For models like this, you should at least be starting with stan_lmer in rstanarm or brm in brms, which will usually go almost as fast as possible.

options(mc.cores = parallel::detectCores())
post <- stan_lmer(Y_ij ~ X_1ij + (1 | cluster))

If you want to reduce the wall time by a couple percent, you can write a Stan model yourself, but you need to have a good benchmark.