Scale hyperparameter "bounces" off zero - Linear regression

Hi!

I’m having a hard time fitting a hierarchical model with STAN (it’s my fault, not STAN’s). The problem can be summarized as follows: I have a set of individuals separated into T subgroups, each one with N individuals, and I apply a different treatment to each subgroup. For each individual, I have a set of M data pairs (x,y) and the proposed model is:

y \sim \mathcal{N}\left(\theta_0 + \theta_1.x,\sigma_y\right).

On the other hand, the link between individuals under the same treatment is established by putting priors on the parameters \theta_0 and \theta_1:

\theta_0 \sim \mathcal{N}\left(\alpha_0,\beta_0\right),
\theta_1 \sim \mathcal{N}\left(\alpha_1,\beta_1\right).

After running the sampler, the results are satisfactory in many aspects, but it happens that the posterior distribution of \beta_1 is very concentrated near the value 0, and therefore each set of N posterior samples of of \theta_1 (one per individual, given a sample of \left(\alpha_1,\beta_1\right) for a particular treatment) lies very close to the sampled value of \alpha_1. At first, I thought this simply implies that there are no appreciable differences in the values of \theta_1 for individuals under the same treatment, since the value of \beta_1 defines the dispersion of this parameter. However, after reviewing the traceplots I notice that the trace of \beta_1 seems to “bounce” off the value 0. Is this a problem with the model, or is it something that normally occurs when there are no differences between the values ​​of \theta_1?

I’m using the interface cmdstanpy in Python 3.9. Here’s the code:

functions {
matrix filter_nan(vector x,vector y,int K) { // Returns input vectors after deleting values where y is NaN
int N = size(x); // Size of input vectors
matrix[2,K] xy;  // Output matrix
int j = 0; // Auxiliar counter

for(i in 1:N) { // Loop over all input values
if(is_nan(y[i])) { // Ignore NaNs
}
else {
j += 1; // Increase auxiliar counter
xy[1,j] = x[i]; // Store x value
xy[2,j] = y[i]; // Store y value
}
}

return xy;
}
}
data {
int<lower=0> T;  // Number of treatments
int<lower=0> N;  // Number of individuals per treatment
int<lower=0> M;  // Number of data points per individual
vector[M] x;     // Explanatory variable
real y[T,N,M];   // Response variable
int K[T,N];      // Number of not-NaN values for each individual

real mean_x; // Mean value of variable x
real sd_x;   // Standard deviation of variable x
real mean_y; // Mean value of variable y
real sd_y;   // Standard deviation of variable y
}
transformed data {
vector[M] x_std = (x - mean_x)/sd_x; // Standardized explanatory variable
}
parameters {
// Hyperparameters
vector[T] alpha_0; // Mean of theta_0 distribution
vector[T] alpha_1; // Mean of theta_1 distribution
real<lower=0> beta_0[T]; // Standard deviation of theta_0 distribution
real<lower=0> beta_1[T]; // Standard deviation of theta_1 distribution

// Parameters
matrix[T,N] theta_0_raw;  // Standardized y-intercept of linear model, after reparametrization
matrix[T,N] theta_1_raw;  // Standardized slope of linear model, after reparametrization
real<lower=0> sigma_y;    // Standard deviation of linear model (same for all individuals and all treatments)
}
transformed parameters {
matrix[T,N] theta_0;  // Standardized y-intercept of linear model, before reparametrization
matrix[T,N] theta_1;  // Standardized slope of linear model, before reparametrization

for (i in 1:T) {
theta_0[i] = alpha_0[i] + beta_0[i]*theta_0_raw[i]; // Reparametrization of theta_0
theta_1[i] = alpha_1[i] + beta_1[i]*theta_1_raw[i]; // Reparametrization of theta_1
}
}
model {
// Hyperpriors
alpha_0 ~ normal(0,1); // Prior of hyperparameter alpha_0
alpha_1 ~ normal(0,1); // Prior of hyperparameter alpha_1
beta_0 ~ normal(0,1);  // Prior of hyperparameter beta_0
beta_1 ~ normal(0,1);  // Prior of hyperparameter beta_1

// Priors
for (i in 1:T) {
theta_0_raw[i] ~ std_normal();   // Prior of parameter theta_0_raw
theta_1_raw[i] ~ std_normal();   // Prior of parameter theta_1_raw
}
sigma_y ~ normal(0,1); // Prior of parameter sigma_y_std

// Likelihood
for(i in 1:T) { // Treatments loop
for(j in 1:N) { // Individuals loop
if(K[i,j] != 0) {
matrix[2,K[i,j]] xy = filter_nan(x_std,to_vector(y[i,j]),K[i,j]); // Filter NaNs
vector[K[i,j]] y_std = to_vector((xy[2] - mean_y)/sd_y); // Standardized response variable
vector[K[i,j]] mu_y_std = theta_0[i,j] + theta_1[i,j]*to_vector(xy[1]); // Linear model

y_std ~ normal(mu_y_std,sigma_y); // Likelihood (normal)
}
}
}
}


I based my choice of the prior of \beta_1 on the guidelines given in the STAN documentation on GitHub. I also tested a \text{Gamma}(2,0.1) as suggested in the STAN’s User Guide, but the issue persists.

Here is the traceplot (result of trace_plot() method of package arviz; the rows, in order, correspond to \alpha_0, \alpha_1, \beta_0, \beta_1, \theta_{0,raw}, \theta_{1,raw}, \sigma_y, \theta_0, \theta_1):

Do you mean the traceplots for beta_0_raw? If so, there’s nothing pathological-looking for each chain individually given that they are positive-bound with a half-normal prior, but it does appear that the chains aren’t quite mixing (and certainly many of the other parameters are also not mixing either).

1 Like

Thanks for your answer, Mike. I mean parameter beta_1, which is the scale in the distribution of theta_1 (the 4th traceplot counting from above). If the chains aren’t mixing well, what could be the reason?

What do the diagnostics look like for your model run? Like R-hat, effective sample sizes, divergent transitions, etc.

1 Like

R-hat, computed using arviz.stats.diagnostics.rhat with method “rank” as recommended by Vehtari et al. (2019):

alpha_0 --> [1.0005769649827772, 1.0054072745092795, 1.0023537339076731, 0.9997855428028828]
alpha_1 --> [1.001864355125387,  1.0002215806140762, 1.001409586288863,  1.0008983684618922]
beta_0 -->  [1.0019336712313256, 1.0023003422465009, 1.0007937310605026, 1.0004301100316715]
beta_1 -->  [1.0005593339982681, 1.0012431220563995, 1.0048257725411378, 1.0002552628181007]
sigma_y --> 0.9996408560494754


Effective sample sizes, computed using arviz.stats.diagnostics.ess:

alpha_0 --> [385.1719463862104, 471.14315173768296, 1053.620438202132,  1330.9781345520635]
alpha_1 --> [5533.91355121317,  5616.106077686486,  4824.792580444725,  5061.8671337863625]
beta_0 -->  [600.4433065181482, 802.7004026449933,  1231.4454483759562, 1307.845002511886]
beta_1 -->  [1448.33035271917,  940.8664031204794,  1189.7891103158381, 1576.446733292215]
sigma_y --> 3894.943574084451


Divergent transitions, computed using np.sum(mcmc_obj.method_variables()['divergent__']), where mcmc_obj is the CmdStanMCMC object:

0

Nothing looks too amiss with these diagnostics and you say the results are “satisfactory in many aspects”, so perhaps its enough to say that the fit is sufficient?

I’m not 100% sure what you mean by:

The traceplot itself never goes below 0 because \beta_1 is defined in your model as a strictly positive quantity, so no samples below 0 are kept. I don’t believe there is any issue with the estimates of \beta_1 being very small.

1 Like

Thanks, @amas0 . I also thought that a high density near zero is not necessarily an issue, but rather indicates that the differences between individuals are insignificant relative to parameter \theta_1 (complete pooling would work fine). What made me doubt is that this does not happen for parameter \theta_0, whose distribution has a width \beta_0 “located far” from zero. I had never come across such a situation, and so I hesitated (I didn’t want to make the mistake of interpreting the phenomenon incorrectly as a consequence of having made a mistake in fitting the statistical model). Thank you all so much again for your answers. We can close this thread.

1 Like

Yeah, I believe the takeaway here is that if your model is to be trusted, then within a treatment you have very consistent random slopes, \theta_1, i.e. given a treatment, your outcome scales consistently with the covariate x. On the other hand, you have a range of valid random intercepts, \theta_0, within each treatment. More or less, the fit tells you that within a treatment, intercepts vary at the individual level, but the scaling with x is consistent across individuals.

From there it’s up to your domain expertise and interrogating the model/data to see if that is consistent with what you expect.