Stan does not converge

time count treatment group pair_id
0 59 0 1 1
4 118.8333 0 1 2
8 143.8333 0 1 3
12 133.25 0 1 4
16 118.1667 0 1 5
20 105.6667 0 1 6
24 96 0 1 7
28 96.66666 0 1 8
32 88.08334 0 1 9
36 82.58334 0 1 10
40 76.91666 0 1 11
44 76 0 1 12
48 77.25 0 1 13
0 17.5 0 2 14
4 25.58333 0 2 15
8 29.41667 0 2 16
12 28.16667 0 2 17
16 26.25 0 2 18
20 23.5 0 2 19
24 22.41667 0 2 20
28 24.75 0 2 21
32 26 0 2 22
36 25.16667 0 2 23
40 24.33333 0 2 24
44 26.16667 0 2 25
48 27.16667 0 2 26
0 13.625 0 3 27
4 16.125 0 3 28
8 16.125 0 3 29
12 17.375 0 3 30
16 14.5 0 3 31
20 14.875 0 3 32
24 15.375 0 3 33
28 15.875 0 3 34
32 14.875 0 3 35
36 18.25 0 3 36
40 15.875 0 3 37
44 17.75 0 3 38
48 19 0 3 39
0 7.333333 0 4 40
4 10.08333 0 4 41
8 9.083333 0 4 42
12 8.583333 0 4 43
16 7.666667 0 4 44
20 7.5 0 4 45
24 7.333333 0 4 46
28 5.583333 0 4 47
32 7.083333 0 4 48
36 7.166667 0 4 49
40 8.333333 0 4 50
44 10.91667 0 4 51
48 11.33333 0 4 52
0 51.25 1 1 1
4 93 1 1 2
8 134 1 1 3
12 142.125 1 1 4
16 135.5 1 1 5
20 125.625 1 1 6
24 113.125 1 1 7
28 107.25 1 1 8
32 103.125 1 1 9
36 104.375 1 1 10
40 103.125 1 1 11
44 110.375 1 1 12
48 114.25 1 1 13
0 12.91667 1 2 14
4 22.16667 1 2 15
8 25.08333 1 2 16
12 26.41667 1 2 17
16 26.33333 1 2 18
20 26.16667 1 2 19
24 25.83333 1 2 20
28 28.08333 1 2 21
32 27.75 1 2 22
36 26.08333 1 2 23
40 28.58333 1 2 24
44 27.91667 1 2 25
48 28.58333 1 2 26
0 10.91667 1 3 27
4 11.83333 1 3 28
8 13.41667 1 3 29
12 14.5 1 3 30
16 16 1 3 31
20 17.08333 1 3 32
24 16.33333 1 3 33
28 19.08333 1 3 34
32 17.16667 1 3 35
36 16.25 1 3 36
40 18.08333 1 3 37
44 20.16667 1 3 38
48 22.33333 1 3 39
0 3.5 1 4 40
4 6.666667 1 4 41
8 6.666667 1 4 42
12 8 1 4 43
16 8.666667 1 4 44
20 10.41667 1 4 45
24 10.41667 1 4 46
28 10.91667 1 4 47
32 10.75 1 4 48
36 11.5 1 4 49
40 11.58333 1 4 50
44 11.16667 1 4 51
48 13 1 4 52

This is my raw data. The data has 4 groups; within each group, subjects are randomly treated (denoted 1) or untreated (denoted 0). They are also paired within each group, so I set up multilevel models. The aim is to **estimate the effect of treatment (theta) on count in each group**.
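
In symbols, the model I have in mind within each group is

count_i ~ normal(alpha_pair[i] + theta * treatment_i, sigma_y)
alpha_j ~ normal(mu_a, sigma_a)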

I wrote Stan code for analyzing each group individually and for analyzing all groups collectively. However, neither model converges completely. Can someone help me solve this issue?

Stan code for analyzing each group individually:

data {
	int n;		// number of obs
	int J;		// number of pairs
	vector[n] T;	// treatment
	vector[n] y; 	// count
	int<lower=1, upper=J> pair_id[n];	// pair indicator
}

parameters {
	vector[J] alpha;
	real theta;
	real sigma_y;
	real mu_a;
	real sigma_a;
}

model {
	theta ~ normal(0, 10);
	sigma_y ~ uniform(0, 10);
	sigma_a ~ uniform(0, 10);
	mu_a ~ normal(0, 10);

	alpha ~ normal(mu_a, sigma_a);

	for (i in 1:n) {
		y[i] ~ normal(alpha[pair_id[i]] + T[i]*theta, sigma_y);
	}
}

Stan code for analyzing all groups collectively:

data {
	int n;		// number of obs
	int J;		// number of pairs
	int K;		// number of groups
	vector[n] T;	// treatment
	vector[n] y; 	// count
	int<lower=1, upper=K> group[n];	// group indicator
	int<lower=1, upper=K> pair_group[J]; // pair group indicator
	int<lower=1, upper=J> pair_id[n];	// pair indicator
}

parameters {
	vector[J] alpha;
	real theta[K];
	real sigma_y[K];
	real mu_a[K];
	real sigma_a[K];
}

model {
	for (k in 1:K) {
		theta[k] ~ normal(0, 10);
		sigma_y[k] ~ uniform(0, 10);
		sigma_a[k] ~ uniform(0, 10);
		mu_a[k] ~ normal(0, 10);
	}

	for (j in 1:J) {
		alpha[j] ~ normal(mu_a[pair_group[j]], sigma_a[pair_group[j]]);
	}

	for (i in 1:n) {
		y[i] ~ normal(alpha[pair_id[i]] + T[i]*theta[group[i]], sigma_y[group[i]]);
	}
}

I imagine the posterior you’re trying to explore is simply too flat, causing the chains to meander about where they initialize, thanks to the extremely flat priors over the parameters of interest and the small amount of data available to fit this hierarchical model.

The location hyperparameter for alpha, mu_a, can be almost anything when the scale of its hyperprior is 10, and the scale hyperparameter for alpha, sigma_a, has a uniform(0, 10) prior, meaning a value near 0 is a priori just as likely as one near 10.

I imagine alpha is basically unconstrained by its prior over (-∞, ∞). With very little data to constrain the likelihood, my suspicion is that it would take the model an unreasonable number of warmup iterations to find the typical set, and that, even if it did, you’d have to draw a lot of samples with aggressive thinning to keep the retained samples reasonable and informative.

Why such highly unconstrained priors?

Hi Corey,

Thank you very much for your reply.

I used noninformative priors simply because there is nothing I can use to constrain the parameters. I also fit the model using lmer and got some results for alpha, sigma_a, etc. Do you think it is a good idea to reset the priors based on those results?

If those results you got from lmer were fit on the same data, I would not use them to set your priors here. However, you do not need such uninformative priors. Absent any other information, you can use a standard normal for something like mu_a and a strictly positive prior like a half-normal, gamma, or lognormal for something like sigma_a. Since these are hyperpriors tasked with providing hyperparameters that regularize the model, using these more informative distributions helps constrain your normal prior for, e.g., alpha.
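
For example, something like this (a sketch; it assumes sigma_a is declared with a <lower=0> constraint, which makes the normal prior act as a half-normal, and the scales may need adjusting to match your data):

mu_a ~ normal(0, 1);     // standard normal hyperprior on the location
sigma_a ~ normal(0, 1);  // half-normal on the scale, given real<lower=0> sigma_a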

Comments on Stan code

The general consensus is that it’s a bad idea to indent computer code with tabs, because they display differently in different editors and viewers.

You are using an older version of RStan with a deprecated array syntax. We don’t use that array syntax any more. It should be, for example,

array[K] real theta;
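
Applied to the integer index declarations in your data block, that becomes:

array[n] int<lower=1, upper=K> group;
array[J] int<lower=1, upper=K> pair_group;
array[n] int<lower=1, upper=J> pair_id;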

I would recommend moving to cmdstanr if you’re only doing inference rather than trying to build a precompiled package. It keeps up with Stan releases and is easier to install (because it doesn’t have to maintain binary compatibility with your build of R).

First, you can rewrite your model in vectorized form as

theta ~ normal(0, 10); 
sigma_y ~ uniform(0, 10);  
sigma_a ~ uniform(0, 10);
mu_a ~ normal(0, 10);
alpha ~ normal(mu_a[pair_group], sigma_a[pair_group]);  
y ~ normal(alpha[pair_id] + T .* theta[group], sigma_y[group]);

Note the .* for elementwise products. That makes it easier to read and maybe a bit faster, but it’s not going to solve your problem.

Problematic declarations

There are a couple problems with the variable declarations. First, if you have a variable that needs to be positive, it needs to be declared with a <lower=0> constraint, e.g.,

array[K] real<lower=0> sigma_y;

But given that you’re putting a uniform(0, 10) prior on it, it also needs an upper bound, because Stan requires the model to have support (positive density, finite log density) wherever the parameters satisfy their declared constraints. There’s no constraint on sigma_y, but it only has support on (0, 10), so sigma_y needs to be declared with <lower=0, upper=10> to be consistent. In general, I would recommend a softer constraint than a hard interval: if the data’s consistent with sigma values near 10, the hard boundary will cause both statistical and computational issues.
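
Concretely, you can either make the declaration match the interval prior:

array[K] real<lower=0, upper=10> sigma_y;

or, better, keep only the lower bound and use a soft prior such as a half-normal (the scale here is just an example):

array[K] real<lower=0> sigma_y;  // declaration
sigma_y ~ normal(0, 5);          // in the model block; half-normal given the constraint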

The second issue is the centered parameterization of the hierarchical prior. To switch to a non-centered parameterization, you can declare alpha with an affine transform, with mu_a and sigma_a declared before it in the parameters block (this may also require a more recent version of Stan than you’re using).

vector<offset=to_vector(mu_a[pair_group]), multiplier=to_vector(sigma_a[pair_group])>[J] alpha;

This is only going to work if you have multiple observations for each pair group. Otherwise, the best solution’s going to be singular.
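
If your version of Stan doesn’t support container-valued offsets and multipliers, you can write the same non-centering out by hand (a sketch; the other declarations stay as they are, and you drop the explicit alpha ~ normal(...) statement):

parameters {
  vector[J] alpha_std;  // standardized pair effects
}
transformed parameters {
  vector[J] alpha = to_vector(mu_a[pair_group])
                    + to_vector(sigma_a[pair_group]) .* alpha_std;
}
model {
  alpha_std ~ std_normal();  // implies alpha[j] ~ normal(mu_a[pair_group[j]], sigma_a[pair_group[j]])
}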

Priors

These are not “non-informative.” We call them weakly informative because they fix a rough scale for the parameters. For example, normal(0, 10) is not consistent with a value of 100: that prior says you would be surprised if the value fell outside the (-30, 30) range (three standard deviations).

Thank you Bob, this is very helpful.

Could you recommend some books for learning Stan? I picked it up primarily by watching videos and using ChatGPT, so I think I need some systematic learning if I want to use it in future work. Thank you very much.
Z

Thanks Corey, I re-adjusted the priors and finally got Stan to converge to some meaningful numbers, similar to the ones from lmer, actually.

Although for the most part it doesn’t teach Stan directly, the intro to Bayesian modeling and workflow that I like is Richard McElreath’s Statistical Rethinking. I also wrote an introduction to Stan that includes an intro to Bayes, probability theory, and Monte Carlo methods:

It uses Python, in case that’s a dealbreaker. There’s a lot more if you go to our website and look under the “Learning Resources” pulldown. The case studies are particularly useful.

The best resource describing how we actually do stats in Stan is this, which @andrewgelman and @avehtari are converting into a book: