Estimating covariance of random effects


I first want to say that these forums have been very helpful for me, who is a very novice user to both stan, and statistical modeling. I had a slow running model (40 days) that with some simple tweeks went down to 4 days. With that being said, given my inexperience, my question is likely extremely basic. I want to ask the devs rather than self-teach due to a very tight work deadline.

I have a 3 level hierarchical mixed effects model that represents disease progression of Alzheimer’s. There is inter-study variability (19 studies) and inter-subject variability (4485 individuals) with about 25,0000 total observations. I need to estimate the covariance between the inter-study variability on the baseline component of the model and the rate component of the model. The same goes for the inter-subject variability. I’m not sure how to do that. Can someone advise going forward? Thanks.

Ps. If there are other noticeable features of my code that is not very well optimized or blatantly wrong, feel free to critique.

Here is my code:


	int N;             			// number of total Observations
	int P;              			// number of subjects
	int M;                    // number of studies
 	int IDp[N];            			// Patient ID 
 	int IDs[N];            			// Study ID 
 	int SEX[N];
 	int AGE[N];
 	int COMED[N];
 	real APOE4[N];
  vector[N] time;      			// time of observation (years)
	real S[N];           			// measured ADAScog score


  	//Fixed effects and covariates 
  	real<lower=0,upper=1> theta_S0;         // Mean baseline score for average individual
  	real theta_r;                                          // Mean progression rate for average individual
  	real<lower=0> theta_SEX;
  	real theta_AGE;
  	real<lower=0> theta_APOE4_b;
  	real theta_APOE4_r;
  	real theta_COMED;
  	real<lower=0> tau;            
  	real<lower=0> beta;

  	//Inter-patient re
  	vector[P] eta_pb;                     	// re for patient baseline
  	vector[P] eta_pr;            		      // re for patient rate
  	vector[M] eta_sb;
  	vector[M] eta_sr;
	  real<lower=0> omega_pb;               	// std for patient baseline **Want to estimate covariance between 
                                                                                 omega_pb and omega_pr as well as omega_sb and omega_sr**
	  real<lower=0> omega_pr;               	// std for patient rate
	  real<lower=0> omega_sb;
	  real<lower=0> omega_sr;
transformed parameters{
  	vector[N] baseline_cov;
  	vector[N] rate_cov;
  	vector[N] S0;
  	vector[N] r;
  	vector<lower=0,upper=1>[N] muS;
    vector[N] ones;
    for(i in 1:N) ones[i] = 1;
  	for(i in 1:N) {

      	baseline_cov[i] = theta_S0*(1+theta_SEX*(SEX[i]-1))*(1+theta_APOE4_b*(APOE4[i]-.72));
      	rate_cov[i] = theta_r*(1+theta_AGE*(AGE[i]-75))*(1+theta_APOE4_r*(APOE4[i]-.72))*(1+theta_COMED*(1-COMED[i]));
      	S0[i] = baseline_cov[i]*exp(eta_pb[IDp[i]] + eta_sb[IDs[i]]);
      	r[i] = rate_cov[i] + eta_pr[IDp[i]] + eta_sr[IDs[i]];
      	muS[i] = S0[i]/(S0[i]^beta +(1-S0[i]^beta)*exp(-beta*r[i]*time[i]))^(1/beta);

    	omega_pb~normal(0,1); //orig uniform(0,1)
    	omega_pr~normal(0,1); //orig uniform(0,1)
    	omega_sb~normal(0,1); //orig uniform(0,1)
    	omega_sr~normal(0,1); //orig uniform(0,1)

    	tau~normal(80,8); //orig uniform(0,100)

	// Likelihood
    	S ~beta(muS*tau, (1-muS)*tau);

[edit: escaped code]


Wow this is right up my alley. I’ve been working on a disease progression model of Alzheimer’s in Stan that’s based off of this model. We actually submitted it to Stancon (hoping it gets accepted).

We were modeling a different aspect of the disease though. It looks like you’re modeling ADAS (normalized between 0 and 1) as a function of a bunch of covariates. Can you describe what you’re trying to do with the model and what exactly that transformed parameter block is doing? Or if you have a reference, you can link it.


This paper below is, for now, the exact model used here minus the incorporation of estimating covariance:

Let me know if you have some insights into my post request. Thanks!


That’s a risky combination! What is the inference problem and how large an effective sample size do you need for it?

Lots of ways to speed this up. Please ask before running a Stan program for 40 days!

Use a non-centered parameterization for your random effects (like eta_pr) unless the number of groups is large.

Try to rescale parameters so that the posterior is on the unit scale so thateach parameter is roughly normal(0, 1) marginally in the posterior.

Share reused expressions and vectorize in the for loop.

Define ones as transformed data using vector[N] ones = rep_vector(1, N). But then dont even do that because you don’t appear to use it for anything.

Move as much of the computation in that loop as you can into data. For example, (AGE[i] - 75) can be precomputed. And remove duplicate computation. You can also employ vectorization.

If SEX is a boolean value (you don’t supply constraints, which I’d recommend), then you don’t want to do selection by subtracting 1 and multiplying. Use a branch that drops the term when it isn’t used—its much more efficient to not compute things.

I don’t know if it makes sense in your model, but you’re making lots of things be greater than one by constraining them to be positive then adding one. You can also constrain them to be greater than one directly, though it changes the meaning of the priors a bit.

I have no idea what all that code’s doing, of course, so this isn’t advice about the modeling. If the model isn’t well specified for the data, there can be problems.


Thanks Bob for the useful suggestions. I plan on working those modifications in very soon. By the way, the model was originally taking ~40 days, but now is down to 4 days from suggestions you made a couple months ago.

Aside from that, can you comment on how I would estimate the covariance between omega_pb adn omega_pr as well as omega_sb and omega_sr?



Those are scalar parameters. You can just compute the sample covariance in the draws to estimate posterior covariance (covariance of random variables for the parameters conditioned on data).