Stan does not converge

time count treatment group pair_id
0 59 0 1 1
4 118.8333 0 1 2
8 143.8333 0 1 3
12 133.25 0 1 4
16 118.1667 0 1 5
20 105.6667 0 1 6
24 96 0 1 7
28 96.66666 0 1 8
32 88.08334 0 1 9
36 82.58334 0 1 10
40 76.91666 0 1 11
44 76 0 1 12
48 77.25 0 1 13
0 17.5 0 2 14
4 25.58333 0 2 15
8 29.41667 0 2 16
12 28.16667 0 2 17
16 26.25 0 2 18
20 23.5 0 2 19
24 22.41667 0 2 20
28 24.75 0 2 21
32 26 0 2 22
36 25.16667 0 2 23
40 24.33333 0 2 24
44 26.16667 0 2 25
48 27.16667 0 2 26
0 13.625 0 3 27
4 16.125 0 3 28
8 16.125 0 3 29
12 17.375 0 3 30
16 14.5 0 3 31
20 14.875 0 3 32
24 15.375 0 3 33
28 15.875 0 3 34
32 14.875 0 3 35
36 18.25 0 3 36
40 15.875 0 3 37
44 17.75 0 3 38
48 19 0 3 39
0 7.333333 0 4 40
4 10.08333 0 4 41
8 9.083333 0 4 42
12 8.583333 0 4 43
16 7.666667 0 4 44
20 7.5 0 4 45
24 7.333333 0 4 46
28 5.583333 0 4 47
32 7.083333 0 4 48
36 7.166667 0 4 49
40 8.333333 0 4 50
44 10.91667 0 4 51
48 11.33333 0 4 52
0 51.25 1 1 1
4 93 1 1 2
8 134 1 1 3
12 142.125 1 1 4
16 135.5 1 1 5
20 125.625 1 1 6
24 113.125 1 1 7
28 107.25 1 1 8
32 103.125 1 1 9
36 104.375 1 1 10
40 103.125 1 1 11
44 110.375 1 1 12
48 114.25 1 1 13
0 12.91667 1 2 14
4 22.16667 1 2 15
8 25.08333 1 2 16
12 26.41667 1 2 17
16 26.33333 1 2 18
20 26.16667 1 2 19
24 25.83333 1 2 20
28 28.08333 1 2 21
32 27.75 1 2 22
36 26.08333 1 2 23
40 28.58333 1 2 24
44 27.91667 1 2 25
48 28.58333 1 2 26
0 10.91667 1 3 27
4 11.83333 1 3 28
8 13.41667 1 3 29
12 14.5 1 3 30
16 16 1 3 31
20 17.08333 1 3 32
24 16.33333 1 3 33
28 19.08333 1 3 34
32 17.16667 1 3 35
36 16.25 1 3 36
40 18.08333 1 3 37
44 20.16667 1 3 38
48 22.33333 1 3 39
0 3.5 1 4 40
4 6.666667 1 4 41
8 6.666667 1 4 42
12 8 1 4 43
16 8.666667 1 4 44
20 10.41667 1 4 45
24 10.41667 1 4 46
28 10.91667 1 4 47
32 10.75 1 4 48
36 11.5 1 4 49
40 11.58333 1 4 50
44 11.16667 1 4 51
48 13 1 4 52

This is my raw data. The data has 4 groups; within each group, subjects are randomly treated (denoted 1) or untreated (denoted 0). They are also paired within each group, so I set up multilevel models. The aim is to **estimate the effect of treatment (theta) on count in each group**.
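
In symbols, the model I have in mind within each group is

count_i ~ normal(alpha_pair[i] + theta * treatment_i, sigma_y)
alpha_j ~ normal(mu_a, sigma_a)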

I wrote Stan code for analyzing each group individually and for analyzing all groups collectively. However, neither model converges completely. Can someone help me solve this issue?

Stan code for analyzing each group individually:

data {
	int n;		// number of obs
	int J;		// number of pairs
	vector[n] T;	// treatment
	vector[n] y; 	// count
	int<lower=1, upper=J> pair_id[n];	// pair indicator
}

parameters {
	vector[J] alpha;
	real theta;
	real sigma_y;
	real mu_a;
	real sigma_a;
}

model {
	theta ~ normal(0, 10);
	sigma_y ~ uniform(0, 10);
	sigma_a ~ uniform(0, 10);
	mu_a ~ normal(0, 10);

	alpha ~ normal(mu_a, sigma_a);

	for (i in 1:n) {
		y[i] ~ normal(alpha[pair_id[i]] + T[i]*theta, sigma_y);
	}
}

Stan code for analyzing all groups collectively:

data {
	int n;		// number of obs
	int J;		// number of pairs
	int K;		// number of groups
	vector[n] T;	// treatment
	vector[n] y; 	// count
	int<lower=1, upper=K> group[n];	// group indicator
	int<lower=1, upper=K> pair_group[J]; // pair group indicator
	int<lower=1, upper=J> pair_id[n];	// pair indicator
}

parameters {
	vector[J] alpha;
	real theta[K];
	real sigma_y[K];
	real mu_a[K];
	real sigma_a[K];
}

model {
	for (k in 1:K) {
		theta[k] ~ normal(0, 10);
		sigma_y[k] ~ uniform(0, 10);
		sigma_a[k] ~ uniform(0, 10);
		mu_a[k] ~ normal(0, 10);
	}

	for (j in 1:J) {
		alpha[j] ~ normal(mu_a[pair_group[j]], sigma_a[pair_group[j]]);
	}

	for (i in 1:n) {
		y[i] ~ normal(alpha[pair_id[i]] + T[i]*theta[group[i]], sigma_y[group[i]]);
	}
}

I imagine the posterior you’re trying to explore is simply too flat, causing the chains to meander about where they initialize, thanks to the extremely flat priors over the parameters of interest and the small amount of data available to fit this hierarchical model.

The location hyperparameter for alpha, mu_a, can be almost anything when the scale of its hyperprior is 10, and the scale hyperparameter for alpha, sigma_a, has a uniform(0, 10) prior, meaning a value near 0 is a priori just as likely as one near 10.

I imagine alpha is basically unconstrained by its prior over (-∞, ∞). With very little data to constrain the likelihood, my suspicion is that it would take the model an unreasonable number of warmup iterations to find the typical set, and that, even if it did, you’d have to draw a lot of samples with aggressive thinning to keep the retained samples reasonable and informative.

Why such highly unconstrained priors?

Hi Corey,

Thank you very much for your reply.

I used noninformative priors simply because there is nothing I can use to constrain the parameters. I also fit the model using lmer and got some results for alpha, sigma_a, etc. Do you think it is a good idea to reset the priors based on those results?

If those results you got from lmer were fit on the same data, I would not use them to set your priors here. However, you do not need such uninformative priors. Absent any other information, you can use a standard normal for something like mu_a and a strictly positive prior like a half-normal, gamma, or lognormal for something like sigma_a. Since these are hyperpriors tasked with providing hyperparameters that regularize the model, using these more informative distributions helps constrain your normal prior for, e.g., alpha.
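
For example, something like this (a sketch; it assumes sigma_a is declared with a <lower=0> constraint, which makes the normal prior act as a half-normal, and the scales may need adjusting to match your data):

mu_a ~ normal(0, 1);     // standard normal hyperprior on the location
sigma_a ~ normal(0, 1);  // half-normal on the scale, given real<lower=0> sigma_a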

Comments on Stan code

The general consensus is that it’s a bad idea to indent computer code with tabs, because they display differently in different editors and viewers.

You are using an older version of RStan with a deprecated array syntax. We don’t use that array syntax any more. It should be, for example,

array[K] real theta;
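
Applied to the integer index declarations in your data block, that becomes:

array[n] int<lower=1, upper=K> group;
array[J] int<lower=1, upper=K> pair_group;
array[n] int<lower=1, upper=J> pair_id;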

I would recommend moving to cmdstanr if you’re only doing inference rather than trying to build a precompiled package. It keeps up with Stan releases and is easier to install (because it doesn’t have to maintain binary compatibility with your build of R).

First, you can rewrite your model in vectorized form as

theta ~ normal(0, 10); 
sigma_y ~ uniform(0, 10);  
sigma_a ~ uniform(0, 10);
mu_a ~ normal(0, 10);
alpha ~ normal(mu_a[pair_group], sigma_a[pair_group]);  
y ~ normal(alpha[pair_id] + T .* theta[group], sigma_y[group]);

Note the .* for elementwise products. That makes it easier to read and maybe a bit faster, but it’s not going to solve your problem.

Problematic declarations

There are a couple problems with the variable declarations. First, if you have a variable that needs to be positive, it needs to be declared with a <lower=0> constraint, e.g.,

array[K] real<lower=0> sigma_y;

But given that you’re putting a uniform(0, 10) prior on it, it also needs an upper bound, because Stan requires the model to have support (positive density, finite log density) wherever the parameters satisfy their declared constraints. There’s no constraint on sigma_y, but it only has support on (0, 10), so sigma_y needs to be declared with <lower=0, upper=10> to be consistent. In general, I would recommend a softer constraint than a hard interval: if the data’s consistent with sigma values near 10, the hard boundary will cause both statistical and computational issues.
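
Concretely, you can either make the declaration match the interval prior:

array[K] real<lower=0, upper=10> sigma_y;

or, better, keep only the lower bound and use a soft prior such as a half-normal (the scale here is just an example):

array[K] real<lower=0> sigma_y;  // declaration
sigma_y ~ normal(0, 5);          // in the model block; half-normal given the constraint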

The second issue is the centered parameterization of the hierarchical prior. To switch to a non-centered parameterization, you can declare alpha with an affine transform, with mu_a and sigma_a declared before it in the parameters block (this may also require a more recent version of Stan than you’re using).

vector<offset=to_vector(mu_a[pair_group]), multiplier=to_vector(sigma_a[pair_group])>[J] alpha;

This is only going to work if you have multiple observations for each pair group. Otherwise, the best solution’s going to be singular.
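
If your version of Stan doesn’t support container-valued offsets and multipliers, you can write the same non-centering out by hand (a sketch; the other declarations stay as they are, and you drop the explicit alpha ~ normal(...) statement):

parameters {
  vector[J] alpha_std;  // standardized pair effects
}
transformed parameters {
  vector[J] alpha = to_vector(mu_a[pair_group])
                    + to_vector(sigma_a[pair_group]) .* alpha_std;
}
model {
  alpha_std ~ std_normal();  // implies alpha[j] ~ normal(mu_a[pair_group[j]], sigma_a[pair_group[j]])
}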

Priors

These are not “non-informative.” We call them weakly informative because they fix a rough scale for the parameters. For example, normal(0, 10) is not consistent with a value of 100: that prior says you would be surprised if the value fell outside the (-30, 30) range (three standard deviations).

Thank you Bob, this is very helpful.

Could you recommend some books for learning Stan? I picked it up primarily by watching videos and using ChatGPT, so I think I need some systematic learning if I want to use it in future work. Thank you very much.
Z

Thanks Corey, I re-adjusted the priors and finally got Stan to converge to some meaningful numbers, similar to the ones from lmer, actually.

Although for the most part it doesn’t teach Stan directly, the intro to Bayesian modeling and workflow that I like is Richard McElreath’s Statistical Rethinking. I also wrote an introduction to Stan that includes an intro to Bayes, probability theory, and Monte Carlo methods:

It uses Python, in case that’s a dealbreaker. There’s a lot more if you go to our website and look under the “Learning Resources” pulldown. The case studies are particularly useful.

The best resource describing how we actually do stats in Stan is this, which @andrewgelman and @avehtari are converting into a book: