Cross-classified Hierarchical Model, Heterogeneous Variances

Hello all,

I am struggling to expand code that estimates a cross-classified hierarchical model. When I estimate a model with homogeneous variances and fixed effects that are constant across groups (assumptions that are probably too stringent to be realistic), the code runs in approximately 12-16 hours per replication, with four chains of 50,000 iterations (on a virtual Linux machine with 6 cores). Parameter recovery is good.

I am trying to modify that code so that the variances are heterogeneous across groups, and to also allow covariates to have differential effects across groups (a model that might be more realistic).

Unfortunately, the modified code is exceptionally slow. After some trial and error, it appears that the fixed effects are not the issue, but the variances are. The bottleneck seems to be the multiplication of an offset by a vector of values at each level (as opposed to the homogeneous-variance case, where the residual was a single value), i.e. modmu_offset*mod_resid[tt] in the transformed parameters block. Any suggestions for making the code more efficient?

I would greatly appreciate any suggestions.

The code is below, and a trial data set that I am using for parameter recovery is attached: pr_hetfe_hetre_2_respfile.csv (604.8 KB).

mymodel<- "

data { 
int<lower=1> N;            // number of observations 

int<lower=1> J;            // number of students 
int<lower=1> K;            // number of items 
int<lower=1> P;            // number of families
int<lower=1> M;      	     // number of item models
int<lower=1> Q;		         // number of templates

int<lower=1> X;            // length of multten array
int<lower=1> Y;            // length of threedig array

int<lower=1> A;            // length of vv array
int<lower=1> B;            // length of hh array

int peeps[N];                  // student giving response n 
int<lower=1,upper=K> item[N];  // item for response n 
int<lower=0,upper=1> score[N];   // correctness for response n 

int<lower=0,upper=J> pid[J];   // Person ID number
int<lower=0,upper=K> iid[K];   // Item ID number 
int<lower=0,upper=P> fid[P];   // family ID number   
int<lower=0,upper=M> mid[M];   // item model ID number
int<lower=0,upper=Q> tid[Q];   // template ID number

int<lower=1,upper=P> parent[K]; 		//indexes items to families
int<lower=1,upper=M> mm[P];     		//indexes families to item models
int<lower=1,upper=Q> tt[M];	        //indexes item models to templates

int multten[X];                 // Array of indices for families - numbers are some multiple of ten
int threedig[Y];                // Array of indices for families - numbers are maximum of three digits

int vv[A];	//Array of indices for imodels - display format is verbal
int hh[B];	//Array of indices for imodels - display format is horizontal

}

parameters {

vector[J] uj;              		

vector <lower=0> [P] sigma_item;
vector <lower=0> [M] fam_resid;
vector <lower=0> [Q] mod_resid;

real betai_offset;
real fammu_offset;
real modmu_offset;

vector[Q] template_mu;
vector[Q] disp_horiz;
vector[Q] disp_verb;
vector[M] char_m10;                    //fixed effects of content characteristics 
vector[M] char_d3;                     //fixed effects of content characteristics

}

transformed parameters{
vector[N] eta;
vector[K] betai;
vector[P] family_mu;
vector[M] model_mu;

// varying item model difficulty across item models within templates
// decomposition of item model means into a template mean and fixed effects

model_mu = template_mu[tt] + modmu_offset*mod_resid[tt]; 
model_mu[vv] = model_mu[vv] + disp_verb[tt[vv]];   // verbal-display effect for verbal item models
model_mu[hh] = model_mu[hh] + disp_horiz[tt[hh]];  // horizontal-display effect for horizontal item models

// varying item family difficulty across item families within models
// decomposition of family means into a model mean and fixed effects

family_mu = model_mu[mm] + fammu_offset*fam_resid[mm];
family_mu[multten] = family_mu[multten] + char_m10[mm[multten]];
family_mu[threedig] = family_mu[threedig] + char_d3[mm[threedig]];

// item difficulties parameterized as random, with parent-specific means
betai = family_mu[parent] + betai_offset*sigma_item[parent]; 

//log odds of a correct response
eta = uj[peeps]-betai[item];

}

model { 

//hyperprior
betai_offset ~ normal(0,1);
fammu_offset ~ normal(0,1);
modmu_offset ~ normal(0,1);

//prior on person ability (uj), which fixes the scale for the item difficulty parameters
uj ~ normal(0,1);

//weakly informative priors on the scale parameters at each level
sigma_item ~ cauchy(0,2.5);
fam_resid ~ cauchy(0,2.5);
mod_resid ~ cauchy(0,2.5);

//likelihood function                                         
score ~ bernoulli_logit(eta);

}
"

my_stan_code <- stanc(model_code=mymodel)
my_compiled_model <- stan_model(stanc_ret = my_stan_code,verbose=FALSE)

## READ IN DATA FILE AS DD

N <- length(unique(dd$obs))
K <- length(unique(dd$item))
J <- length(unique(dd$peeps))
P <- length(unique(dd$family))
M <- length(unique(dd$imodel))
Q <- length(unique(dd$template))

peeps <- dd$peeps
item <- dd$item
score <- dd$score

### People ID - Length J
pid <- sort(unique(dd[c("peeps")])[,"peeps"])
### Item ID - Length K 
iid <- sort(unique(dd[c("item")])[,"item"])
### Family ID - Length P
fid <- sort(unique(dd[c("family")])[,"family"])
### Item Model ID - Length M
mid <- sort(unique(dd[c("imodel")])[,"imodel"])
### Template ID - Length Q
tid <- sort(unique(dd[c("template")])[,"template"])

### Item - Family Index - Length K, P Unique Values
### For each row in parent[], it returns the family ID number
dd<-dd[with(dd,order(item)),]
parent <- unique(dd[,c("item","family")])[,"family"]
### Family - Item Model Index - Length P, M Unique Values
### For each row in mm[], it returns the model ID number
dd<-dd[with(dd,order(family)),]
mm <- unique(dd[c("family","imodel")])[,"imodel"]
### Item Model - Template Index - Length M, Q Unique Values
dd<-dd[with(dd,order(imodel)),]
tt <- unique(dd[c("imodel","template")])[,"template"]

### Content Characteristics
### For each row denoting a family, are the numbers multiples of 10 or are they three digits
### Array of indices

dd<-dd[with(dd,order(family)),]

multten<-unique(dd[dd$m10==1,c("family")])
threedig<-unique(dd[dd$d3==1,c("family")])

X<-length(multten)
Y<-length(threedig)

### Form Characteristics 
### For each row denoting an item model, is the item displayed as a verbal representation or arranged horizontally
### Base category is comprised of items displayed in numeric format
### Arrays of indices

dd<-dd[with(dd,order(imodel)),]

vv<-unique(dd[dd$verbal==1,c("imodel")])
hh<-unique(dd[dd$horiz==1,c("imodel")])

A<-length(vv)
B<-length(hh)

#########################
### Estimate Model
#########################

myfit <-sampling(my_compiled_model,
              data = c("N","J","K","P","M","Q",
                       "A","B","X","Y",
                       "peeps","item","pid",
                       "iid","fid","mid","tid",
                       "parent","score","mm","tt",
                       "multten","threedig",
                       "vv","hh"), 
              pars = c("uj","sigma_item","fam_resid","mod_resid","template_mu","char_m10","char_d3","disp_verb","disp_horiz"),
              control = list(stepsize=0.001, adapt_delta=0.999, max_treedepth=15),
              iter = 50, warmup=25, chains = 1, thin=1, save_warmup=TRUE,
              verbose = T) 

Going back through previous posts (I have admittedly been wrestling with this model code for a while), one solution to group-specific random effects would be to specify the offsets as vectors, i.e., vector [M] modmu_offset; and in the model block, use the .* operator to multiply the elements of that vector by the corresponding elements of the mod_resid vector.
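For concreteness, the change I have in mind would look something like this (just the two relevant lines, untested; the same pattern would apply to fammu_offset and betai_offset):

  vector[M] modmu_offset;                               // in parameters: one offset per item model (was: real modmu_offset)

  model_mu = template_mu[tt] + modmu_offset .* mod_resid[tt];   // elementwise .* instead of scalar *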

Not sure if the .* approach is the best option, though, and I'm curious to hear whether anyone has suggestions for making the estimation of the fixed effects more efficient as well.

It seems unlikely that you need this many draws from each chain.

What are your n_effs? For most situations (where you only need a couple digits of precision on your estimates), n_effs in the hundreds are fine, which usually don’t take more than a few thousand draws in total to get.

Taking that number down to 2000 or 1000 seems like one effective way to speed up the model haha.


Thanks for the feedback!

After reviewing autocorrelations of the parameters, I made the decision to thin my chains with an interval of 10, yielding 10,000 observations (excluding warm-up, four chains x 2500 observations each). My effective sample sizes are around 6,000 for the variances, which are the most difficult to estimate.

I had run chains that long in an effort to minimize the number of divergent transitions (still getting a few)… But it seems as though divergences may just… happen?

If I look at my other diagnostics (graphical and numeric), the chains mix well, and it seems as though I could run shorter chains (I appear to reach stationarity across all of my parameters by around 2,000 iterations). I could even run more, shorter chains if I wanted to boost the effective sample size?

Also - thanks for the laughs :)

Don’t thin your samples in Stan, and probably don’t work with the autocorrelations yourself. Just have faith in the n_eff estimates and use those if you wanna get error estimates on your outputs. Have a look at your se_means in your output. I’m guessing the error estimates are way smaller than they’d need to be? Anything in the thousands of n_effs seems like a really big number. Maybe describe how you’re deciding that you need thousands of samples (assuming n_eff is accurate) and I can convince you that you don’t need that many haha?

Divergent transitions will probably just happen at a roughly constant rate. Increasing the number of warmup iterations could help avoid them, but it’s not a foolproof strategy by any stretch of the imagination. There’s an option in RStan to adapt the timestep more aggressively that can help with this (adapt_delta = 0.99 is usually the prescribed thing). Feel free to ask if you wanna know why this helps :D. Are you using RStan?
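If it helps, in RStan those checks and the adapt_delta setting look roughly like this (a sketch; myfit is the fit object from your script above, and standata is a stand-in for however you pass your data):

  library(rstan)

  # print() shows se_mean and n_eff for every parameter you saved
  print(myfit, pars = c("sigma_item", "fam_resid", "mod_resid"))

  # or pull the summary table directly
  fit_summary <- summary(myfit)$summary
  fit_summary[, c("mean", "se_mean", "n_eff", "Rhat")]

  # more aggressive step-size adaptation to cut down on divergences
  myfit2 <- sampling(my_compiled_model, data = standata,
                     chains = 4, iter = 2000,
                     control = list(adapt_delta = 0.99))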

Ben,

Thank you again!

On thinning: Reading through various discussions here and in other forums (including the previous iteration of this forum in the Google Group), it seems like there is some debate on whether or not to thin chains. Is there a particular reason not to thin chains in Stan, specifically?

On effective sample sizes: Do you have any recommended resources for “rules of thumb” for sample size? I have seen folks cite approximately 400 per parameter in order to obtain sufficient precision of estimates (unfortunately, with no apparent attribution) and others cite an effective sample size of 4,000, citing Raftery & Lewis, 1992.

Your divergence comments are really helpful. After reviewing some of the graphical summaries, and doing a little more digging, it does seem as though I should expect some divergent transitions with regularity. I have set adapt_delta to 0.99 (after some initial trial and many errors).

Best,

Shauna

Thinning chains is throwing away work :P. If you have a really low n_eff on a parameter even though you’ve got a few thousand samples from your MCMC, it might be best to look at your model before you start running the sampler for a really long time. There’s probably a parameterization to change, or a prior to tighten.

For effective sample sizes, what I do is just run the simulations long enough so that my plots aren’t significantly affected by MCMC noise. So I’m basically always covered by the default settings (4 chains, 1k samples each).

The se_mean estimates on the print output will give you an idea of the scales of the MCMC error in your estimates. You can look at that, or just generate the quantity you’re interested in, repeat the fit, and look at it again and see if the differences are small enough.
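For example, something along these lines (a rough sketch; standata is a placeholder for your data, and I’m arbitrarily picking the first element of mod_resid as the quantity of interest):

  # fit the same model twice with different seeds and compare the quantity you
  # actually care about; if the difference is negligible at the precision you
  # need, the MCMC error is small enough
  fit_a <- sampling(my_compiled_model, data = standata, chains = 4, iter = 1000, seed = 1)
  fit_b <- sampling(my_compiled_model, data = standata, chains = 4, iter = 1000, seed = 2)

  mean(extract(fit_a, pars = "mod_resid")$mod_resid[, 1])
  mean(extract(fit_b, pars = "mod_resid")$mod_resid[, 1])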

MCMC has changed a lot since 1992, so I’d ditch the rules of thumb and just try to think about what sort of tests you can do in the context of your model to prove to yourself your MCMC estimates are accurate enough (in my case that’s where the visual tests on plots come in).

Does adapt_delta = 0.99 get rid of your divergences?


My first step (or several steps) was to modify the parameterizations of my models based on feedback from this list, which greatly reduced the correlations between estimated parameters and improved the speed of estimation.

Changing the value did not get rid of my divergences, though it did reduce their number relative to the default setting. Regardless of chain length, I have had the greatest challenges with estimating the variance parameters. I am assuming this is not unusual?

After running a series of tests, it would appear that I could safely use the first 1,000 iterations as burn-in and be confident that I am estimating means from the desired distribution, and that those means would be well-estimated. Maybe the 1,000 as burn-in is conservative, but I am OK with taking a conservative approach where estimation is concerned :)

Not unusual at all. And it’s why we want to monitor slow-moving parameters.

When there’s not much data and a not very informative prior, the non-centered parameterizations are necessary. If there’s a lot of data, then centered parameterizations are necessary.


Is there any chance you could use a non-centered parameterization to fix this? There’s a lot to be said about divergences, but as a place to start, have you seen this: Diagnosing Biased Inference with Divergences? It describes the simplest case of a hierarchical model benefiting from a non-centered parameterization.

Yeah, like Bob said, it’s pretty standard for those to be the hard things to estimate.


Bob and @bbbales2, thank you both for your thoughts and feedback.

I have been working on a couple of things since I last wrote - hence the delay in my reply.

I did update the code I was running to include non-centered parameterizations at each level; within the model block, each offset vector is sampled from a N(0,1) distribution:

...
parameters {
...
vector[K] betai_offset;
vector[P] fammu_offset;
vector[M] modmu_offset;
...

// varying item model difficulty across item models within templates
model_mu = template_mu[tt] + modmu_offset .* mod_resid[tt]; 
...

// varying item family difficulty across item families within models
family_mu = model_mu[mm] + fammu_offset .* fam_resid[mm];
...

// item difficulties parameterized as random, with parent-specific means
betai = family_mu[parent] + betai_offset .* sigma_item[parent]; 

The linked analysis by @betanalpha greatly informed my thinking about divergences, along with the initial feedback on my models from @bgoodri – who I believe had suggested that I include these “offsets” – though I have to admit I didn’t understand the implications of using non-centered parameterizations at the time (and though I wouldn’t say I fully understand them now, I certainly have a much better idea!).

Yeah, so just to be clear:

parameters{
  vector[M] modmu_offset;
}

model {
  modmu_offset ~ normal(0, 1);
  model_mu = template_mu[tt] + modmu_offset .* mod_resid[tt]; 
}

is the non-centered version of this:

parameters{
  vector[M] model_mu;
}

model {
  model_mu ~ normal(template_mu[tt], mod_resid[tt]);
}

Non-centering usually works better when you have less data. Did this make your divergences go away? That’s always the goal.

Ben, yes. Thankfully, even the truncated code was sufficiently clear to make that connection; this is good news!

These changes did not make my divergences go away. The greatest challenges were with the variance parameters at levels 2 and 3. I am wondering whether, even with the non-centered parameterizations, the priors could be specified differently (better?) to improve estimation performance. I used cauchy(0,2.5), but should I change this? I’ve noticed that in many hierarchical examples, cauchy(0,5) is common. I have read a number of papers (many of which are linked here: Prior Choice Recommendations · stan-dev/stan Wiki · GitHub), but I am still trying to map those recommendations onto my particular problem.

Thanks again!

We really need to update that priors page, haha. At one point Cauchys were popular around here, but absent any other information I think you should just go with a normal prior.

I’m not sure why we don’t use Cauchys anymore. It’s probably as simple as this: in most situations the heavy tails don’t buy us anything, and sampling them is super hard. There’s an alternative parameterization for Cauchys in the manual that can help as well, but for your case I’d start by switching them out.

Wait, never mind, I just hadn’t searched far enough. From another thread (Why Cauchy as prior for residual error in ODE models?), check out: https://betanalpha.github.io/assets/case_studies/weakly_informative_shapes.html . I too should read this…
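In your model the swap would just be the three prior lines, something like this (a sketch; the <lower=0> declarations already truncate these to half-normals, and the scale of 1 is only a placeholder you’d want to set from what you know about the variances):

  // weakly informative half-normal priors on the scale parameters
  sigma_item ~ normal(0, 1);
  fam_resid ~ normal(0, 1);
  mod_resid ~ normal(0, 1);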


The model behavior summarized in the linked article is very similar to what I saw in the initial runs of my own models: sampled values far greater than anything I would have expected, and variance estimates that tended to be heavily biased.

Given the context of the model, I have some idea of how big the variances should be at each level, and I think that going with a normal distribution (lighter tails) is entirely appropriate. I also imagine it will significantly speed up computation… Time to check on this, and I will report my results here (at the moment I’m collecting estimates under three conditions: Cauchy priors with a centered parameterization, Cauchy priors with a non-centered parameterization, and normal priors with a non-centered parameterization). That said, even though I have reduced my chain length for these preliminary runs, the models still take a while to run… :)


I just completed three distinct runs of my model, with the following modifications:

  • Centered parameterization with Cauchy priors for the residuals at each level
  • Non-Centered parameterization with Cauchy priors for the residuals at each level
  • Non-Centered parameterization with Normal priors for the residuals at each level

The results of this effort reflect what I have read elsewhere: the quality of model estimates needs to be examined with not just one but several statistics. In every case my model converged, and the resulting effective sample sizes and R-hat values were “good”, with effective sample sizes around 4,000 and R-hat values less than 1.1 for each parameter. Suffice it to say, these values didn’t tell the whole story, because the quality of the estimates was NOT the same across runs! The number of divergences in the first condition was 20; moving to a non-centered parameterization reduced that to 5. Using a non-centered parameterization and normal priors eliminated the divergences entirely (adapt_delta was set to 0.99 in all cases), and the model with normal priors converged in a tenth of the time.
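For anyone who wants to tally divergences on their own fits, RStan exposes them along these lines (a sketch, not the exact code I ran):

  # number of post-warmup divergent transitions (available in recent versions of rstan)
  rstan::get_num_divergent(myfit)

  # equivalently, from the sampler parameters
  sp <- rstan::get_sampler_params(myfit, inc_warmup = FALSE)
  sum(sapply(sp, function(x) sum(x[, "divergent__"])))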

As an illustration - as promised! - here is a quick plot that shows the posterior distribution for one of the variance parameters across all three conditions, with the generating value also shown. (Thank you, @betanalpha for including such useful code in your demonstrations and articles, because I leveraged that code here with some modifications.)

[Rplot: posterior distributions of one variance parameter under the three conditions, with the generating value marked]

I limited the plot to draws less than or equal to 5, but you can clearly see the impact of what I had thought were small changes to the model…

Thanks, all, for the feedback and input!


Glad to hear it all worked out! Thanks for being so patient!

That’s a cool plot of the posterior. That does tell a pretty compelling story!

Thanks for following up. If the centered and non-centered Cauchy were supposed to be reparameterizing the same variable, then it looks like one of them didn’t converge properly.

@Bob_Carpenter, I don’t believe that the centered parameterization converged properly. I was not able to effectively recover even the mean structure of the model (I am still having difficulty recovering the variances, even with continued efforts to tweak the non-centered parameterization).