Scale hyperparameter "bounces" off zero - Linear regression


I’m having a hard time fitting a hierarchical model with STAN (it’s my fault, not STAN’s). The problem can be summarized as follows: I have a set of individuals separated into T subgroups, each one with N individuals, and I apply a different treatment to each subgroup. For each individual, I have a set of M data pairs (x,y) and the proposed model is:

y \sim \mathcal{N}\left(\theta_0 + \theta_1.x,\sigma_y\right).

On the other hand, the link between individuals under the same treatment is established by putting priors on the parameters \theta_0 and \theta_1:

\theta_0 \sim \mathcal{N}\left(\alpha_0,\beta_0\right),
\theta_1 \sim \mathcal{N}\left(\alpha_1,\beta_1\right).

After running the sampler, the results are satisfactory in many aspects, but it happens that the posterior distribution of \beta_1 is very concentrated near the value 0, and therefore each set of N posterior samples of of \theta_1 (one per individual, given a sample of \left(\alpha_1,\beta_1\right) for a particular treatment) lies very close to the sampled value of \alpha_1. At first, I thought this simply implies that there are no appreciable differences in the values of \theta_1 for individuals under the same treatment, since the value of \beta_1 defines the dispersion of this parameter. However, after reviewing the traceplots I notice that the trace of \beta_1 seems to “bounce” off the value 0. Is this a problem with the model, or is it something that normally occurs when there are no differences between the values ​​of \theta_1?

I greatly appreciate any help you can give me.

I’m using the interface cmdstanpy in Python 3.9. Here’s the code:

functions {
  matrix filter_nan(vector x,vector y,int K) { // Returns input vectors after deleting values where y is NaN
    int N = size(x); // Size of input vectors
    matrix[2,K] xy;  // Output matrix
    int j = 0; // Auxiliar counter
    for(i in 1:N) { // Loop over all input values
      if(is_nan(y[i])) { // Ignore NaNs
      else {
        j += 1; // Increase auxiliar counter
        xy[1,j] = x[i]; // Store x value
        xy[2,j] = y[i]; // Store y value
    return xy;
data {
  int<lower=0> T;  // Number of treatments
  int<lower=0> N;  // Number of individuals per treatment
  int<lower=0> M;  // Number of data points per individual
  vector[M] x;     // Explanatory variable
  real y[T,N,M];   // Response variable
  int K[T,N];      // Number of not-NaN values for each individual
  real mean_x; // Mean value of variable x
  real sd_x;   // Standard deviation of variable x
  real mean_y; // Mean value of variable y
  real sd_y;   // Standard deviation of variable y
transformed data {
  vector[M] x_std = (x - mean_x)/sd_x; // Standardized explanatory variable
parameters {
  // Hyperparameters
  vector[T] alpha_0; // Mean of theta_0 distribution
  vector[T] alpha_1; // Mean of theta_1 distribution
  real<lower=0> beta_0[T]; // Standard deviation of theta_0 distribution
  real<lower=0> beta_1[T]; // Standard deviation of theta_1 distribution
  // Parameters
  matrix[T,N] theta_0_raw;  // Standardized y-intercept of linear model, after reparametrization
  matrix[T,N] theta_1_raw;  // Standardized slope of linear model, after reparametrization
  real<lower=0> sigma_y;    // Standard deviation of linear model (same for all individuals and all treatments)
transformed parameters {
  matrix[T,N] theta_0;  // Standardized y-intercept of linear model, before reparametrization
  matrix[T,N] theta_1;  // Standardized slope of linear model, before reparametrization
  for (i in 1:T) {
    theta_0[i] = alpha_0[i] + beta_0[i]*theta_0_raw[i]; // Reparametrization of theta_0
    theta_1[i] = alpha_1[i] + beta_1[i]*theta_1_raw[i]; // Reparametrization of theta_1
model {
  // Hyperpriors
  alpha_0 ~ normal(0,1); // Prior of hyperparameter alpha_0
  alpha_1 ~ normal(0,1); // Prior of hyperparameter alpha_1
  beta_0 ~ normal(0,1);  // Prior of hyperparameter beta_0
  beta_1 ~ normal(0,1);  // Prior of hyperparameter beta_1
  // Priors
  for (i in 1:T) {
    theta_0_raw[i] ~ std_normal();   // Prior of parameter theta_0_raw
    theta_1_raw[i] ~ std_normal();   // Prior of parameter theta_1_raw
  sigma_y ~ normal(0,1); // Prior of parameter sigma_y_std
  // Likelihood
  for(i in 1:T) { // Treatments loop
    for(j in 1:N) { // Individuals loop
      if(K[i,j] != 0) {
        matrix[2,K[i,j]] xy = filter_nan(x_std,to_vector(y[i,j]),K[i,j]); // Filter NaNs
        vector[K[i,j]] y_std = to_vector((xy[2] - mean_y)/sd_y); // Standardized response variable
        vector[K[i,j]] mu_y_std = theta_0[i,j] + theta_1[i,j]*to_vector(xy[1]); // Linear model
        y_std ~ normal(mu_y_std,sigma_y); // Likelihood (normal) 

I based my choice of the prior of \beta_1 on the guidelines given in the STAN documentation on GitHub. I also tested a \text{Gamma}(2,0.1) as suggested in the STAN’s User Guide, but the issue persists.

Here is the traceplot (result of trace_plot() method of package arviz; the rows, in order, correspond to \alpha_0, \alpha_1, \beta_0, \beta_1, \theta_{0,raw}, \theta_{1,raw}, \sigma_y, \theta_0, \theta_1):

Do you mean the traceplots for beta_0_raw? If so, there’s nothing pathological-looking for each chain individually given that they are positive-bound with a half-normal prior, but it does appear that the chains aren’t quite mixing (and certainly many of the other parameters are also not mixing either).

1 Like

Thanks for your answer, Mike. I mean parameter beta_1, which is the scale in the distribution of theta_1 (the 4th traceplot counting from above). If the chains aren’t mixing well, what could be the reason?

What do the diagnostics look like for your model run? Like R-hat, effective sample sizes, divergent transitions, etc.

1 Like

Thanks for your answer, @amas0. Below are the diagnostics:

R-hat, computed using arviz.stats.diagnostics.rhat with method “rank” as recommended by Vehtari et al. (2019):

alpha_0 --> [1.0005769649827772, 1.0054072745092795, 1.0023537339076731, 0.9997855428028828]
alpha_1 --> [1.001864355125387,  1.0002215806140762, 1.001409586288863,  1.0008983684618922]
beta_0 -->  [1.0019336712313256, 1.0023003422465009, 1.0007937310605026, 1.0004301100316715]
beta_1 -->  [1.0005593339982681, 1.0012431220563995, 1.0048257725411378, 1.0002552628181007]
sigma_y --> 0.9996408560494754

Effective sample sizes, computed using arviz.stats.diagnostics.ess:

alpha_0 --> [385.1719463862104, 471.14315173768296, 1053.620438202132,  1330.9781345520635]
alpha_1 --> [5533.91355121317,  5616.106077686486,  4824.792580444725,  5061.8671337863625]
beta_0 -->  [600.4433065181482, 802.7004026449933,  1231.4454483759562, 1307.845002511886]
beta_1 -->  [1448.33035271917,  940.8664031204794,  1189.7891103158381, 1576.446733292215]
sigma_y --> 3894.943574084451

Divergent transitions, computed using np.sum(mcmc_obj.method_variables()['divergent__']), where mcmc_obj is the CmdStanMCMC object:


Nothing looks too amiss with these diagnostics and you say the results are “satisfactory in many aspects”, so perhaps its enough to say that the fit is sufficient?

I’m not 100% sure what you mean by:

The traceplot itself never goes below 0 because \beta_1 is defined in your model as a strictly positive quantity, so no samples below 0 are kept. I don’t believe there is any issue with the estimates of \beta_1 being very small.

1 Like

Thanks, @amas0 . I also thought that a high density near zero is not necessarily an issue, but rather indicates that the differences between individuals are insignificant relative to parameter \theta_1 (complete pooling would work fine). What made me doubt is that this does not happen for parameter \theta_0, whose distribution has a width \beta_0 “located far” from zero. I had never come across such a situation, and so I hesitated (I didn’t want to make the mistake of interpreting the phenomenon incorrectly as a consequence of having made a mistake in fitting the statistical model). Thank you all so much again for your answers. We can close this thread.

1 Like

Yeah, I believe the takeaway here is that if your model is to be trusted, then within a treatment you have very consistent random slopes, \theta_1, i.e. given a treatment, your outcome scales consistently with the covariate x. On the other hand, you have a range of valid random intercepts, \theta_0, within each treatment. More or less, the fit tells you that within a treatment, intercepts vary at the individual level, but the scaling with x is consistent across individuals.

From there it’s up to your domain expertise and interrogating the model/data to see if that is consistent with what you expect.