N(0,1) Terms in Hierarchical Model Fail to Converge (Rhat > 1.1)

Hey all,

I’m trying to fit a hierarchical model to some price data in Stan. The model is basically a random effects model with partially pooled dummies for every product type (UPC) and county (stfips) in a particular area of the United States. The dependent variable is the log of the unit price for a product. There are roughly 140,000 observations of individual transactions aggregated over some time period.

Unfortunately I can’t attach the data or be much more specific due to confidentiality concerns. But the Stan program I’m attempting to run is attached here: p_model_8.stan (2.4 KB)

This model had successfully fit before on a subset of the data that I am attempting to use now. However, once I expanded the data to include more and slightly different types of products, I began to have issues.

Specifically, I encounter two problems that I can’t seem to explain:

  1. The model takes 1-2 days to fit.
  2. More significantly, the fit always has several of the Upc_Epsilon terms failing to converge between chains, with Rhat values > 1.1. The Upc_Sigma term also consistently has high Rhat values, generally between 1.2 and 1.7.

The second problem in particular flummoxes me, since the Upc_Epsilon terms should just be normally distributed noise that doesn’t depend on the data. The first problem makes it hard to run multiple models and debug problems.

Things I have attempted so far to diagnose this problem:

  1. Increased the maximum treedepth, as discussed here.
  2. Removed the Upc_Sigma term from the model. Obviously this removes the Upc_Sigma’s convergence issues, but the Upc_Epsilon terms still have high Rhat values.
  3. Changed the priors on the sigma terms from Cauchy priors to student-t priors with 3 degrees of freedom, and, when that failed, to gamma(2, 0.5) priors, in line with a recommendation I read in this Psychometrika paper.
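
Concretely, the sequence of priors on a scale parameter like Upc_Sigma went roughly like this (the scales shown here are just placeholders, not the exact numbers in the attached file):

// assuming the sigmas are declared with <lower=0>, so the cauchy and
// student_t statements act as half distributions
Upc_Sigma ~ cauchy(0, 2.5);          // original
// Upc_Sigma ~ student_t(3, 0, 2.5); // attempt 1
// Upc_Sigma ~ gamma(2, 0.5);        // attempt 2 (Psychometrika recommendation)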

None of these attempts have resolved the issue so far. Being a relatively new user of Stan, I’m not quite sure how to proceed from here. Any suggestions for fixing this are greatly appreciated.

Additionally, this is my first post on this forum. I’ve read the forum etiquette guidelines and searched for similar posts before posting this question, but if I’ve missed anything please let me know and I will try to correct it.

Thank you!

A couple of things you could try/think about.

  • It could be that you need a longer warmup period.

  • I suspect that fixing the effect of one group to zero is not the most efficient way to avoid linear dependency. Since you have an overall mean, alpha, you could just drop Area_Mean and Upc_Mean (which is the same as setting them to zero) and let the first group be free.

  • You could try a centered parameterisation instead of the non-centered one. With more data the centered parameterisation could be more appropriate (https://arxiv.org/pdf/1312.0906.pdf). You already used the centered parameterisation for Sigma. There is a small sketch of what this could look like after this list.

  • I know you checked this, but I think it might be worthwhile to investigate your priors on the sigmas. These Cauchy priors, in combination with the exponential transformation, give huge variation on the natural scale of the prices. To a lesser extent the same argument holds for Sigma and its lognormal prior. You could use the results from your successful analysis to get an idea of what likely values for the sigmas are. With posterior predictive checks, you might also try to figure out whether the implied multiplicative effect of the lognormal holds.

  • Once you get everything to converge, you can vectorise the likelihood statement. This should help speed things up.

P_vec ~ lognormal(Mu, Sigma[Area_Ind]);
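
On the parameterisation point above, here is a sketch of what a centered version of, say, the UPC effects could look like (I’m guessing at the names in your attached model):

// centered: sample the effects directly around their mean
Upc_Effect ~ normal(Upc_Mean, Upc_Sigma);

// instead of the non-centered form
// Upc_Epsilon ~ normal(0, 1);
// Upc_Effect = Upc_Mean + Upc_Sigma * Upc_Epsilon;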

Anything that touches the data depends on the data in a Bayesian model.

I’d also check the lognormals—these are a bit tricky in terms of where their mass is.

You can also vectorize the definition of Mu, and you don’t need to save it as a transformed parameter. The vectorizations are particularly useful with large data sizes.
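
For example, something along these lines, with the index arrays guessed from the names in this thread:

model {
  // Mu as a local variable, so it is not saved in the output
  vector[N] Mu = alpha
                 + Area_Effect[Area_Ind]
                 + Upc_Effect[Upc_Ind];
  // ... priors go here ...
  P_vec ~ lognormal(Mu, Sigma[Area_Ind]);
}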

You definitely want to use a non-centered parameterization of

Sigma ~ lognormal(Sigma_LogMean, 1);

That’d be:

Sigma_std ~ normal(0, 1);
Sigma = exp(Sigma_LogMean + Sigma_std);

but you’ll need to convert everything from an array to a vector. This breaks the dependency between Sigma and Sigma_LogMean in the prior. The broad student-t priors aren’t going to help convergence.
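
In full, with Sigma as a vector (sizes guessed from the Sigma[Area_Ind] indexing above), that could look like:

parameters {
  real Sigma_LogMean;
  vector[n_areas] Sigma_std;
}
transformed parameters {
  // the one-liner only works once Sigma is a vector rather than an array
  vector[n_areas] Sigma = exp(Sigma_LogMean + Sigma_std);
}
model {
  Sigma_std ~ normal(0, 1);  // implies Sigma ~ lognormal(Sigma_LogMean, 1)
  // plus whatever prior you already have on Sigma_LogMean
}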

You can also use vector operations to define other quantities, e.g.,

Area_Effect[2:n_areas] = Area_Mean + Area_Sigma * Area_Epsilon;

Hey all,

Thanks for the thoughtful responses. I’ve attempted to implement most of these suggestions, and the time it takes to fit the model has decreased dramatically, to about 3 hours per run. But the model is still not converging when I run it on all of the data.

I’ve attached the revised version I’ve been using here: pm8_noncentered2.stan (1.8 KB). When that didn’t work, I tried changing the priors for the variability parameters to gamma(2, 0.1), which didn’t change the results.

I’ve identified one subset of the data (subset by broad product category) on which the revised model converges almost instantly, and another subset on which the model runs for several hours and throws several divergent transitions. So it looks like the type of product included matters for some reason.

That’s about as efficient as you’re going to make this.

How well the model fits the data, and in particular whether the data is consistent with both the prior and the likelihood, will determine how quickly things sample. So you want to check that your lognormal distribution makes sense for the P_vec you get. Also, hierarchical variance models can be tricky, and the student-t(3) priors are still very wide. You might want to try this with just normal priors. My guess is that the Sigma terms are not going to be well identified, and there are a lot of degrees of freedom in their fit with fairly loose priors.
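
One way to check the lognormal is a posterior predictive check in generated quantities, something like this (assuming Mu is still available there, i.e., kept as a transformed parameter or recomputed):

generated quantities {
  vector[N] P_rep;
  for (n in 1:N)
    P_rep[n] = lognormal_rng(Mu[n], Sigma[Area_Ind[n]]);
}

Then compare draws of P_rep against the observed P_vec; if the tails or the skew are badly off, the lognormal is probably working against you.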