Help with specification of hierarchical priors?

Good morning,

This is another post in my efforts to really understand how to specify hierarchical models in Stan. It is linked to previous posts on hierarchical modeling and prior specification. (You can find the most recent discussion here: Options for Priors on Random Effects with Non-Centered Parameterizations. Previous versions of the code can be seen in this post: Cross-classified Hierarchical Model, Heterogenous Variances.)

The model code that I have works well. I am able to recover mean parameters, and the biggest challenge I have is recovering the variances within the model - which is to be expected given the model structure.

Based on suggestions from great folks like @betanalpha and @bgoodri, I have attempted to specify hierarchical priors on the variances (actually the standard deviations) in my model. This has, however, led to two less-than-ideal outcomes that I think might both stem from how I have specified the priors/hyperpriors:

  1. The code runs MUCH more slowly than when I specified the priors directly; and
  2. I am getting the following message when I run my model for several chains: Rejecting initial value: Log probability evaluates to log(0), i.e. negative infinity. Stan can't start sampling from this initial value.

My code is below. The specification of the hyperpriors in the model block (and the corresponding declaration of those parameters) is the only newly added piece of code, and it is what has led to the slowdown and the initialization problems.

It may very well be that I have specified the hyperpriors incorrectly and that there is a simple way to improve performance. It also might be that the model is just more difficult to estimate. Either way, this group has been great, and any suggestions would be greatly appreciated.

data {
  int<lower=1> N;                  // number of observations
  int<lower=1> J;                  // number of students
  int<lower=1> K;                  // number of items
  int<lower=1> P;                  // number of families
  int<lower=1> M;                  // number of item models
  int<lower=1> Q;                  // number of templates

  int<lower=1> X;                  // length of multten array
  int<lower=1> Y;                  // length of threedig array

  int<lower=1> A;                  // length of vv array
  int<lower=1> B;                  // length of hh array

  int peeps[N];                    // student giving response n
  int<lower=1,upper=K> item[N];    // item for response n
  int<lower=0,upper=1> score[N];   // correctness for response n

  int<lower=0,upper=J> pid[J];     // person ID number
  int<lower=0,upper=K> iid[K];     // item ID number
  int<lower=0,upper=P> fid[P];     // family ID number
  int<lower=0,upper=M> mid[M];     // item model ID number
  int<lower=0,upper=Q> tid[Q];     // template ID number

  int<lower=1,upper=P> parent[K];  // indexes items to families
  int<lower=1,upper=M> mm[P];      // indexes families to item models
  int<lower=1,upper=Q> tt[M];      // indexes item models to templates

  int multten[X];                  // family indices - numbers are some multiple of ten
  int threedig[Y];                 // family indices - numbers are at most three digits

  int vv[A];                       // item model indices - display format is verbal
  int hh[B];                       // item model indices - display format is horizontal
}

parameters {
  vector[J] uj;

  real<lower=0> ub_si;
  real<lower=0> ub_fr;
  real<lower=0> ub_mr;
  real<lower=0> ub_st;

  real<lower=0> sigma_item;
  real<lower=0> fam_resid;
  real<lower=0> mod_resid;
  real<lower=0> sigma_temp;     // when level 4 is treated as random, large sample condition

  vector[K] betai_offset;
  vector[P] fammu_offset;
  vector[M] modmu_offset;
  vector[Q] tempmu_offset;

  // vector[Q] template_mu;     // when template means are treated as fixed, small sample (empirical) condition

  real grand_mean;              // when level 4 is treated as random, large sample condition

  real disp_horiz;
  real disp_verb;
  real char_m10;                // fixed effects of content characteristics
  real char_d3;                 // fixed effects of content characteristics
}

transformed parameters {
  vector[N] eta;
  vector[K] betai;
  vector[P] family_mu;
  vector[M] model_mu;
  vector[Q] template_mu;

  // when the fourth level is treated as random
  template_mu = grand_mean + tempmu_offset*sigma_temp;

  // varying item model difficulty across item models within templates
  // decomposition of model means into a template mean and fixed effects
  model_mu = template_mu[tt] + modmu_offset*mod_resid;
  model_mu[vv] = model_mu[vv] + disp_verb;
  model_mu[hh] = model_mu[hh] + disp_horiz;

  // varying item family difficulty across item families within models
  // decomposition of family means into a model mean and fixed effects
  family_mu = model_mu[mm] + fammu_offset*fam_resid;
  family_mu[multten] = family_mu[multten] + char_m10;
  family_mu[threedig] = family_mu[threedig] + char_d3;

  // item difficulties parameterized as random, with parent-specific means
  betai = family_mu[parent] + betai_offset*sigma_item;

  // log odds of a correct response
  eta = uj[peeps] - betai[item];
}

model {
  // hyperpriors on standard deviations
  ub_si ~ uniform(0, 2.5);
  ub_fr ~ uniform(0, 2.5);
  ub_mr ~ uniform(0, 2.5);
  ub_st ~ uniform(0, 2.5);        // when level 4 is treated as random

  // priors on standard normal deviates for the non-centered parameterization at levels 3, 4
  betai_offset ~ normal(0, 1);
  fammu_offset ~ normal(0, 1);
  modmu_offset ~ normal(0, 1);
  tempmu_offset ~ normal(0, 1);

  // prior on random variable theta to scale the item difficulty parameters
  uj ~ normal(0, 1);

  // priors on standard deviations
  sigma_item ~ normal(0, ub_si);
  fam_resid ~ normal(0, ub_fr);
  mod_resid ~ normal(0, ub_mr);
  sigma_temp ~ normal(0, ub_st);  // when level 4 is treated as random

  // likelihood
  score ~ bernoulli_logit(eta);
}

The uniform priors on ub_si, ub_fr, ub_mr, and ub_st are a problem. If you really want uniform priors, you should specify the upper limit in the parameter declaration:

real<lower=0, upper=2.5> ub_si;

Otherwise you are telling Stan that ub_si could be, say, 5, while at the same time your prior assigns zero density to ub_si = 5, so the log probability evaluates to log(0). Hence the warning.

This is the technical reason for the error. You probably want to avoid a hard constraint anyway; I would use a half-normal, half-Student-t, or exponential prior.

parameters{
   real<lower=0> ub_si;
}
model{
   ub_si ~ normal(0, 1);
}

implies approximately a 99% probability that 0 < ub_si < 2.5, as a quick simulation in R confirms:

quantile(abs(rnorm(10000, 0, 1)), c(.9, .95, .99))
     90%      95%      99% 
1.636037 1.952216 2.510354 

@Stijn, this makes good sense, and the fix is easy enough. What I was trying to do was to specify ub_si, ub_fr, ub_mr, ub_st using a uniform prior (I will need to see in what article/discussion I originally saw this specification suggested) and then use them as the upper bounds (creative naming of parameters on my part!) for the priors on my standard deviations, i.e. sigma_item ~ normal(0, ub_si), rather than specifying those upper bounds directly.

My understanding was this specification might help me to effectively recover variances in the model I am working with. Was this a case of “good idea, poor execution”? Again, thanks for the help with avoiding uniform priors!

I honestly did not look at the whole model; I just quickly scanned for known issues with that error message. From your description I would still go with the normals, but the uniform can work if you define the upper limit in the declaration - then you do not need the ~ uniform statement.

As @stijn notes, if you want to use uniform priors then the parameter declarations need to match the bounds. At the same time, uniform priors have undesirable properties – here they'll push up against the upper boundary, which will make sampling difficult. Using a half-normal with standard deviation set to half the bound will have roughly the same containment without the sampling issues at the upper boundary.
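To make the substitution concrete, here is a minimal sketch (the scale 1.25 is half of the original 2.5 bound; under a half-normal(0, 1.25), roughly 95% of the prior mass falls below 2.5):

parameters {
   real<lower=0> ub_si;       // the lower bound in the declaration makes this a half-normal
}
model {
   ub_si ~ normal(0, 1.25);   // roughly the same containment as uniform(0, 2.5), with no hard edge
}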

Modeling deeper hierarchies like this is challenging in general as the deeper you go the more the variance tends to accumulate and explode by the time you get to the individual parameters, so you have to very carefully tune the priors to maintain reasonable distributions all the way through the hierarchy. Additionally you have to be careful to ensure that the nested means are identifiable.
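One way to see the accumulation in the additive construction above (ignoring the fixed effects): each non-centered level contributes its variance independently,

\[
\operatorname{Var}(\beta_i) = \sigma_{\text{temp}}^2 + \sigma_{\text{mod}}^2 + \sigma_{\text{fam}}^2 + \sigma_{\text{item}}^2,
\]

so with unit-scale priors on each standard deviation the prior scale of the individual difficulties grows like the square root of the number of levels, and the per-level priors have to shrink as the hierarchy deepens to keep the implied scale of \beta_i reasonable.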

Personally I prefer to make the nested standard deviations hierarchical while keeping everything centered around a single population mean to avoid identifiability issues. For individuals nested in families nested in neighborhoods this might look something like:

individual_tilde ~ normal(0, 1);
family_tilde ~ normal(0, 1);
neighborhood_tilde ~ normal(0, 1.5);
neighborhood_sigma ~ normal(0, 0.5);
mu ~ normal(0, 0.1);

individual_effect[n] = mu + individual_tilde[n] * family_tilde[family[n]] * neighborhood_tilde[neighborhood[n]] * neighborhood_sigma;
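For context, here is one way that fragment might sit in a complete Stan program. This is only a sketch under my reading of the construction - the data block, loop, and comments are mine, not from the original post:

data {
  int<lower=1> N;                         // individuals
  int<lower=1> F;                         // families
  int<lower=1> G;                         // neighborhoods
  int<lower=1, upper=F> family[N];        // family of individual n
  int<lower=1, upper=G> neighborhood[N];  // neighborhood of individual n
}
parameters {
  real mu;                                // single population mean
  real<lower=0> neighborhood_sigma;       // only explicit scale parameter
  vector[N] individual_tilde;
  vector[F] family_tilde;
  vector[G] neighborhood_tilde;
}
transformed parameters {
  vector[N] individual_effect;
  for (n in 1:N)
    individual_effect[n] = mu
      + individual_tilde[n]
        * family_tilde[family[n]]
        * neighborhood_tilde[neighborhood[n]]
        * neighborhood_sigma;
}
model {
  individual_tilde ~ normal(0, 1);
  family_tilde ~ normal(0, 1);
  neighborhood_tilde ~ normal(0, 1.5);
  neighborhood_sigma ~ normal(0, 0.5);
  mu ~ normal(0, 0.1);
  // ... likelihood over the observed outcomes goes here
}

The product family_tilde[family[n]] * neighborhood_tilde[neighborhood[n]] * neighborhood_sigma acts as the effective standard deviation for individual n, which is what makes the nested standard deviations hierarchical while mu remains the single population mean.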

Dear @betanalpha,

I just want to make sure I fully understand the parameterization in your response.

Intuitively, I would have parameterized a non-centered hierarchical prior with per-level standard deviations like:

individual_effect[n] = population_mean
                     + individual_tilde[n] * individual_sigma
                     + family_tilde[family[n]] * family_sigma
                     + neighborhood_tilde[neighborhood[n]] * neighborhood_sigma;

Would this be equivalent to your parameterization?

If so, why multiply the tildes instead of summing them up?

Thank you,

These are different models. Your code implements a multilevel model, one level for each classification, but ignores the nesting. This is a common approach and often works well. My code implements a nested hierarchical model that models individuals within families within neighborhoods. This is less common but when implemented correctly can lead to cleaner inferences with respect to the nesting structure.
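Seen side by side, with placeholder names, the two constructions differ only in how the levels combine:

// multilevel: one additive term per classification, each with its own scale
individual_effect[n] = mu
                     + individual_tilde[n] * individual_sigma
                     + family_tilde[family[n]] * family_sigma
                     + neighborhood_tilde[neighborhood[n]] * neighborhood_sigma;

// nested hierarchical: each level multiplicatively scales the levels below it
individual_effect[n] = mu
                     + individual_tilde[n]
                       * family_tilde[family[n]]
                       * neighborhood_tilde[neighborhood[n]]
                       * neighborhood_sigma;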


All,

I wanted to weigh in on this thread because I’ve been relatively silent since this was last posted.

@betanalpha - this structure gives me a lot to think about with respect to the specification of hierarchical priors. I'm wondering how that specification might change the behavior of some of the models I'm currently working with - in particular, what “cleaner” inferences might look like for the problem at hand (I have some ideas for how I would LIKE this specification to improve my estimates!). As soon as I can spare some VM and brain space, I will be running some comparisons.

I will return to the thread and post an update when I have completed some initial simulation runs investigating the behavior of the different variance specifications given the data I am working with… or if I need to come back, @betanalpha, to check that I am implementing your suggested variance structure correctly on an applied example which is proving difficult to estimate given the sample size at each level. Thank you again!

Best,
Shauna

Happy to help! Simulation studies are really helpful for understanding the consequences of more complex latent models like these, but make sure to generate simulations by sampling both parameters and data from the joint distribution \pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta). Sampling the parameters from the prior is critical to understanding the consequences of the prior.
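For example, here is a minimal sketch of that workflow for a stripped-down, single-level version of the model earlier in the thread. The _sim names and the flattened prior are mine, not from the original posts; run it with algorithm=fixed_param since nothing is being estimated:

generated quantities {
  real sigma_item_sim;
  vector[J] uj_sim;
  vector[K] betai_sim;
  int score_sim[N];

  // first draw the parameters from their priors ...
  sigma_item_sim = fabs(normal_rng(0, 1.25));             // half-normal draw via fabs
  for (j in 1:J) uj_sim[j] = normal_rng(0, 1);
  for (k in 1:K) betai_sim[k] = normal_rng(0, sigma_item_sim);

  // ... then draw the data conditional on those parameters
  for (n in 1:N)
    score_sim[n] = bernoulli_rng(inv_logit(uj_sim[peeps[n]] - betai_sim[item[n]]));
}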


Modeling deeper hierarchies like this is challenging in general as the deeper you go the more the variance tends to accumulate and explode by the time you get to the individual parameters, so you have to very carefully tune the priors to maintain reasonable distributions all the way through the hierarchy. Additionally you have to be careful to ensure that the nested means are identifiable.

I’m working on a problem where I have observations of chemical concentrations of water samples from well facilities, in water distribution systems, in counties and years; and I’ve noticed, after prior predictive simulation (thanks for the example of this in your online tutorials, Michael!), that great care is needed to formulate the priors at each level so that the resulting prior predictive distribution doesn’t say something ludicrous about the overall distribution of concentrations. Likewise, I’m dealing with a log-normal likelihood in this example, which maybe makes it even more tedious (but interesting).
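One way to see why the log-normal case compounds so quickly (the level names here are placeholders for the facility/system/county structure): every additive level on the log scale becomes multiplicative on the concentration scale,

\[
\log y \sim \mathcal{N}\!\left(\mu + \delta_{\text{facility}} + \delta_{\text{system}} + \delta_{\text{county}},\ \sigma^2\right)
\quad\Longrightarrow\quad
\operatorname{median}(y) = e^{\mu}\, e^{\delta_{\text{facility}}}\, e^{\delta_{\text{system}}}\, e^{\delta_{\text{county}}},
\]

so even modest per-level scales can stretch the prior predictive over several orders of magnitude.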

I’m late to the thread, but I would really like to better understand this parameterization, Michael (or anyone else).

individual_tilde ~ normal(0, 1);
family_tilde ~ normal(0, 1);
neighborhood_tilde ~ normal(0, 1.5);
neighborhood_sigma ~ normal(0, 0.5);
mu ~ normal(0, 0.1);
individual_effect[n] = mu + individual_tilde[n] * family_tilde[family[n]] * neighborhood_tilde[neighborhood[n]] * neighborhood_sigma;

I was hoping to get some more context for the ideas behind this code, which seems like it may be paraphrased - or perhaps I’m just dim. For example, why is neighborhood_sigma seemingly the only scale parameter? Why is the scale of neighborhood_tilde 50% larger than that of family_tilde, which is on the same scale as individual_tilde? Sorry if the answers are obvious, but I’m hoping for a hint that triggers the light bulb and ties it all together. Also, if anyone has or knows of any online code examples, or applications in the literature where this type of approach to nested hierarchies is taken, I’d be grateful to hear about them.

Thanks, in any case, for the info already in this thread. It has been really interesting and eye opening to think about.

-Roy