Putting a prior on the constant separately in a regression

Hi, I am having a weird issue: I am running a hierarchical model in Stan with the following likelihood:

model {
   for (n in 1:N) {
    m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
  }
  
  uno = rep_vector(1, N);
  d = sigma2_t*w + sigma2_c*(uno-w);
  
  y ~ normal(m, d);
} 

Basically, the expected value of the normal is a regression on a set of data plus a constant. I used to put the constant directly in the data as a column of 1’s, but then I decided to take it outside and give it an uninformative prior. Note that the other regressors have a non-centered prior that depends on cluster-specific variables.

I decided to take out the constant because I wanted to do the QR reparameterization, and I saw that the intercept is not supposed to be included in the data matrix. However, when I take out the constant the estimated treatment effect becomes null, while in the previous model it was positive and coincided with the frequentist estimate.
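For reference, the setup I was following is basically the (non-hierarchical) QR example from the Stan User’s Guide, where the intercept is kept outside the predictor matrix. This is just that plain sketch, not my actual model:

data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] x;   // predictors WITHOUT a column of ones
  vector[N] y;
}
transformed data {
  // thin QR decomposition, scaled as in the User's Guide
  matrix[N, K] Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(x) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;            // intercept, kept out of the QR'd matrix
  vector[K] theta;       // coefficients on Q_ast
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + Q_ast * theta, sigma);
}
generated quantities {
  vector[K] beta = R_ast_inverse * theta;  // coefficients on the original x
}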

What am I doing wrong?

Can you show a snippet of your design matrix (the thing you’re doing the QR transform on), both before and after you’ve removed the intercept/constant?

Hi Mike, thanks for all the help on my posts.

So this is the matrix of individual-level data without the constant (with the constant I would just add a column of ones on the left):

Data Matrix

My hierarchical model has only two levels: individuals and schools (treatment was randomly assigned at the school level). As you can see from above, schools have their own slopes (betas) and intercepts (alphas). I used to have the constant in the data matrix, so that the intercept was one of the betas, and in that case we got correct results. I do not understand why taking out the constant and giving it the same prior changes the results; it does not make much sense.
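Just to spell out the equivalence I have in mind, writing the constant column’s coefficient as \beta_{0,g[n]}: before I had

m_n = \beta_{0,g[n]} + \sum_{k=1}^{K} \beta_{k,g[n]} x_{nk} + \text{effect}\cdot w_n

and now I have

m_n = \alpha_{g[n]} + \sum_{k=1}^{K} \beta_{k,g[n]} x_{nk} + \text{effect}\cdot w_n,

so the mean of the likelihood is identical; the only thing that changes is where the intercept sits and where its prior is attached.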

Looking at the results of a simulation I noticed something weird (I’ll post the image below): the intercepts of the last schools behave very strangely and do not converge.

As you can see, from the 91st school to the 121st the alphas take absurd values and do not converge. What could that be due to? I want to stress that the two models are (in theory) perfectly equivalent.

One last thing, and I apologize for the long reply: this also happens in the model without QR.

If I understand it correctly, the intercept had a prior on it before but no longer does. Is that correct? Also, w is a vector of (data) weights on [0, 1] that weights two different standard deviation parameters?

If those two things are correct, I wonder if the absurd school intercepts are the ones with w close to 0 or 1, and something funky is going on with sigma2_t or sigma2_c that makes those intercepts imprecise. If one of those two goes towards infinity, I assume the intercepts could also spread out infinitely. I wouldn’t put my money on that being the issue, but it may be worth checking.
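One quick way to check is to look at the fraction of treated units per school and compare it against the schools whose intercepts blow up. Just a sketch using the data names from your snippet (this could equally be done in R; since it has no parameters it would need to be run with the fixed-parameter sampler):

data {
  int<lower=0> N;
  int<lower=1> J;
  int<lower=1, upper=J> g[N];
  vector[N] w;
}
generated quantities {
  // fraction of treated units per school
  vector[J] n_total = rep_vector(0, J);
  vector[J] n_treated = rep_vector(0, J);
  vector[J] prop_treated;
  for (n in 1:N) {
    n_total[g[n]] += 1;
    n_treated[g[n]] += w[n];
  }
  prop_treated = n_treated ./ n_total;  // NaN for any school with no observations
}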

Can you post your full code?

Here is the complete code; let me know if I need to explain any of the context.


data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> g[N];  // group for individual
  matrix[N, K] x;             // individual predictors
  matrix[J,L] u;                 // group predictors
  vector[N] y;                 // outcomes
  
  vector[N] w;                      // treatment assigned
}

transformed data{
  matrix[L,J] u_transp;                 // group predictors transposed
  u_transp = u';
}

parameters {
  real alpha[J];
  matrix[K, J] z;               // this will be the basis for the beta's
                                // transformation
  
  matrix[K,L] gamma;              // school coefficients
  vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale
  
  vector[L] eta;
  
  real effect;                      // super-population average treatment effect
  real<lower=0> sigma2_t;            // residual SD for the treated
  real<lower=0> sigma2_c;            // residual SD for the control
  
  cholesky_factor_corr[K] L_Omega;
 
}

transformed parameters{
   
   vector<lower=0>[K] tau = 2.5 * tan(tau_unif);
   matrix[K,J] beta;            // matrix of school-specific coefficients (one column per school)
   
  
  // Here we define beta; note that since it is a transformation of a
  // multivariate normal, we do not need to give a prior to beta, but rather
  // to z, to the scale component (tau), and to the correlation component
  // of the Cholesky decomposition (L_Omega)
  
  beta = gamma * u_transp + diag_pre_multiply(tau, L_Omega) * z;
  
}

model {
  
  vector[J] c;
  vector[N] m;
  vector[N] d;
  vector[N] uno;
  
  
  // Here we assign priors to the variances of the beta_j's, to the correlation
  // component L_Omega and to the random component z
  
  tau_unif ~ uniform(0,pi()/2);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(z) ~ std_normal();
  
  
  // Here we model the likelihood of our outcome variable y
  
  effect ~ normal(0, (9.11134)^2);
  sigma2_c ~ inv_gamma(1, 0.01);          
  sigma2_t ~ inv_gamma(1, 0.01);
  
  for (j in 1:J) {
    c[j] = eta' * u_transp[ ,j];
  }
  
  alpha ~ normal(c,(9.11134)^2);
  
  for (n in 1:N) {
    m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
  }
  
  uno = rep_vector(1, N);
  d = sigma2_t*w + sigma2_c*(uno-w);
  
  y ~ normal(m, d);

}

generated quantities{
  real effect_fs;                      // finite-sample average treatment effect
  real effect_qte25;                   // quantile treatment effect at p = 0.25
  real effect_qte50;                   // quantile treatment effect at p = 0.50
  real effect_qte75;                   // quantile treatment effect at p = 0.75
  real y0[N];                       // potential outcome if W = 0
  real y1[N];                       // potential outcome if W = 1
  real effect_unit[N];                 // unit-level treatment effect
  
  for(n in 1:N){
    real mu_c = alpha[g[n]] + x[n, ] * beta[, g[n]];            
    real mu_t = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect;
    if(w[n] == 1){                
      y0[n] = normal_rng(mu_c , sigma2_c);
      y1[n] = y[n];
    }else{                        
      y0[n] = y[n];       
      y1[n] = normal_rng(mu_t, sigma2_t);
    }
    effect_unit[n] = y1[n] - y0[n];
  }
  effect_fs = mean(effect_unit);
  effect_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25); 
  effect_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50); 
  effect_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75); 
}

P.S.: this is the version without QR (same results, same issue), so that we get rid of one layer of complexity.

Ah, so this is the model from SUG 1.13 with some slight elaboration. I discovered recently that I’d been misunderstanding that model for years and am still working out what structure it reflects. Any chance you could post your data? That might help me figure out your issue as well as the model structure.

Yes, sure.

ordered_groups.csv (40.5 KB) This assigns individual observations to their cluster
individual_level_variables.csv (191.0 KB) This are the individual level variables
school_level_variables.csv (5.0 KB) This are the school level variables

This is the R script to run the model:

setwd("...")
rm(list=ls())


library(StanHeaders)
library(rstan)
library(rstanarm)
options(mc.cores = parallel::detectCores())

individual_level_variables <- read.csv("individual_level_variables.csv", sep = ";")
school_level_variables <- read.csv("school_level_variables.csv", sep = ";")

N <- 5135
K <- 5
J <- 121
L <- 12

y <- individual_level_variables[,8]
w <- individual_level_variables[,2]
x <- individual_level_variables[,3:7]
u <- school_level_variables[,2:13]

ordered_groups <- read.csv("ordered_groups.csv", sep = ";")
jj <- c(1:5135)
for (n in c(1:5135)) {
  jj[n] <- ordered_groups[n,2]
}


stan_data <- list(N = N, K = K, J = J, L = L, x = as.matrix(x), y = y, g = jj, u = as.matrix(u), w = w)

fit_mod1 <- stan(file = "Cholensky Decomposition.stan",
                 data = stan_data,
                 iter = 200, chains = 4, warmup = 100, seed = 69,
                 sample_file = file.path("mysimulation.csv"),
                 control = list(adapt_delta = 0.95, max_treedepth = 12),
                 cores = 8)

csvfiles <- c("mysimulation_1.csv", "mysimulation_2.csv", "mysimulation_3.csv",
              "mysimulation_4.csv")
fit_mod1 <- read_stan_csv(csvfiles)

param <- c("effect", "effect_fs", "effect_qte25", "effect_qte50", "effect_qte75", "alpha")
print(fit_mod1, pars=param, probs = c(0.1, 0.5, 0.9), digits = 3)

And lastly, this is the Stan model:

functions { 
  real quantile(vector x, real p){
    int n;            // length of vector x
    real index;       // integer index of p
    int lo;           // lower integer cap of the index
    int hi;           // higher integer cap of the index
    real h;           // generated weight between the lo and hi
    real qs;          // weighted average of x[lo] and x[hi]
    n = num_elements(x);           
    index = 1 + (n - 1)*p;             
    lo = 1; 
    while ((lo + 1) < index) 
      lo = lo + 1; 
    hi = lo + 1; 
    h = index - lo; 
    qs = (1 - h)*sort_asc(x)[lo] + h*sort_asc(x)[hi];     
    return qs;                
  }
  

}

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> g[N];  // group for individual
  matrix[N, K] x;             // individual predictors
  matrix[J,L] u;                 // group predictors
  vector[N] y;                 // outcomes
  
  vector[N] w;                      // treatment assigned
}

transformed data{
  matrix[L,J] u_transp;                 // group predictors transposed
  u_transp = u';
}

parameters {
  real alpha[J];
  matrix[K, J] z;               // this will be the basis for the beta's
                                // transformation
  
  matrix[K,L] gamma;              // school coefficients
  vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale
  
  vector[L] eta;
  
  real effect;                      // super-population average treatment effect
  real<lower=0> sigma2_t;            // residual SD for the treated
  real<lower=0> sigma2_c;            // residual SD for the control
  
  cholesky_factor_corr[K] L_Omega;
 
}

transformed parameters{
   
   vector<lower=0>[K] tau = tan(tau_unif);
   matrix[K,J] beta;            // matrix of school-specific coefficients (one column per school)
   
  
  // Here we define beta; note that since it is a transformation of a
  // multivariate normal, we do not need to give a prior to beta, but rather
  // to z, to the scale component (tau), and to the correlation component
  // of the Cholesky decomposition (L_Omega)
  
  beta = gamma * u_transp + diag_pre_multiply(tau, L_Omega) * z;
  
}

model {
  
  vector[J] c;
  vector[N] m;
  vector[N] d;
  vector[N] uno;
  
  
  // Here we assign priors to the variances of the beta_j's, to the correlation
  // component L_Omega and to the random component z
  
  tau_unif ~ uniform(0,pi()/2);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(z) ~ std_normal();
  
  
  // Here we model the likelihood of our outcome variable y
  
  effect ~ normal(0, (9.11134)^2);
  sigma2_c ~ inv_gamma(1, 0.01);          
  sigma2_t ~ inv_gamma(1, 0.01);
  
  for (j in 1:J) {
    c[j] = eta' * u_transp[ ,j];
  }
  
  alpha ~ normal(c,(9.11134)^2);
  
  for (n in 1:N) {
    m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
  }
  
  uno = rep_vector(1, N);
  d = sigma2_t*w + sigma2_c*(uno-w);
  
  y ~ normal(m, d);

}

generated quantities{
  real effect_fs;                      // finite-sample average treatment effect
  real effect_qte25;                   // quantile treatment effect at p = 0.25
  real effect_qte50;                   // quantile treatment effect at p = 0.50
  real effect_qte75;                   // quantile treatment effect at p = 0.75
  real y0[N];                       // potential outcome if W = 0
  real y1[N];                       // potential outcome if W = 1
  real effect_unit[N];                 // unit-level treatment effect
  
  for(n in 1:N){
    real mu_c = alpha[g[n]] + x[n, ] * beta[, g[n]];            
    real mu_t = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect;
    if(w[n] == 1){                
      y0[n] = normal_rng(mu_c , sigma2_c);
      y1[n] = y[n];
    }else{                        
      y0[n] = y[n];       
      y1[n] = normal_rng(mu_t, sigma2_t);
    }
    effect_unit[n] = y1[n] - y0[n];
  }
  effect_fs = mean(effect_unit);
  effect_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25); 
  effect_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50); 
  effect_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75); 
}

Thanks a lot for your interest in the issue and all your help :)

Update: there might be an issue in the ordered groups file, as some clusters do not have any observations; I am working to fix it ASAP.


OK, so there was a problem with the data, which we fixed. The model now behaves better, but we only ran a small number of iterations (100). We are now running 1000 iterations with 1000 warmup and will see what happens.

These are the correct data:

individual_level_variables.csv (191.0 KB)
school_level_variables_correct.csv (4.8 KB)
ordered_groups_correct.csv (60.5 KB)

And this is the model and the script:

Script for Standardized data.R (2.7 KB)
Cholesky standardized.stan (4.3 KB)

We have decided to standardize all the data, as it makes it easier to give priors and decorrelates the data a bit.
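In case it helps to see the idea, this is roughly what standardizing an outcome looks like when done inside Stan. It is only a toy sketch with made-up names, not our actual model (in our case the data are standardized before they reach Stan):

data {
  int<lower=0> N;
  vector[N] y;
}
transformed data {
  real y_mean = mean(y);
  real y_sd = sd(y);
  vector[N] y_std = (y - y_mean) / y_sd;   // standardized outcome
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ std_normal();
  sigma ~ std_normal();   // half-normal thanks to the lower bound
  y_std ~ normal(mu, sigma);
}
generated quantities {
  // back-transform to the original scale of y
  real mu_orig = y_mean + y_sd * mu;
  real sigma_orig = y_sd * sigma;
}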

This is the code of the model if you want to just take a peek:

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> g[N];  // group for individual
  matrix[N, K] x;             // individual predictors
  matrix[J,L] u;                 // group predictors
  vector[N] y;                 // outcomes
  
  vector[N] w;                      // treatment assigned
}

transformed data{
  matrix[L,J] u_transp;                 // group predictors transposed
  u_transp = u';
}

parameters {
  real alpha[J];
  matrix[K, J] z;               // this will be the basis for the beta's
                                // transformation
  
  matrix[K,L] gamma;              // school coefficients
  vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale
  real<lower=0> sigma2_alpha;            // residual for the constant
  
  vector[L] eta;
  
  real effect;                      // super-population average treatment effect
  real<lower=0> sigma2_t;            // residual SD for the treated
  real<lower=0> sigma2_c;            // residual SD for the control
  
  cholesky_factor_corr[K] L_Omega;
 
}

transformed parameters{
   
   vector<lower=0>[K] tau = tan(tau_unif);
   matrix[K,J] beta;            // matrix of school-specific coefficients (one column per school)
   
  
  // Here we define beta; note that since it is a transformation of a
  // multivariate normal, we do not need to give a prior to beta, but rather
  // to z, to the scale component (tau), and to the correlation component
  // of the Cholesky decomposition (L_Omega)
  
  beta = gamma * u_transp + diag_pre_multiply(tau, L_Omega) * z;
  
}

model {
  
  vector[J] c;
  vector[N] m;
  vector[N] d;
  vector[N] uno;
  
  
  // Here we assign priors to the variances of the beta_j's, to the correlation
  // component L_Omega and to the random component z
  
  tau_unif ~ uniform(0,pi()/2);
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(z) ~ std_normal();
  to_vector(gamma) ~ std_normal();
  
  
  // Here we model the likelihood of our outcome variable y
  
  effect ~ std_normal();
  sigma2_c ~ inv_gamma(1, 0.01);          
  sigma2_t ~ inv_gamma(1, 0.01);
  sigma2_alpha ~ inv_gamma(1, 0.01);
  
  for (j in 1:J) {
    c[j] = eta' * u_transp[ ,j];
  }
  
  alpha ~ normal(c,sigma2_alpha);
  
  for (n in 1:N) {
    m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
  }
  
  uno = rep_vector(1, N);
  d = sigma2_t*w + sigma2_c*(uno-w);
  
  y ~ normal(m, d);

}

generated quantities{
  real effect_fs;                      // finite-sample average treatment effect
  real effect_qte25;                   // quantile treatment effect at p = 0.25
  real effect_qte50;                   // quantile treatment effect at p = 0.50
  real effect_qte75;                   // quantile treatment effect at p = 0.75
  real y0[N];                       // potential outcome if W = 0
  real y1[N];                       // potential outcome if W = 1
  real effect_unit[N];                 // unit-level treatment effect
  
  for(n in 1:N){
    real mu_c = alpha[g[n]] + x[n, ] * beta[, g[n]];            
    real mu_t = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect;
    if(w[n] == 1){                
      y0[n] = normal_rng(mu_c , sigma2_c);
      y1[n] = y[n];
    }else{                        
      y0[n] = y[n];       
      y1[n] = normal_rng(mu_t, sigma2_t);
    }
    effect_unit[n] = y1[n] - y0[n];
  }
  effect_fs = mean(effect_unit);
  effect_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25); 
  effect_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50); 
  effect_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75); 
}

Again, I apologize for the issue, but whenever you transfer data from Stata to Excel to R, things can happen. By the way, if you have suggestions on how to improve the model, I would be very happy to hear them.

For what it’s worth, you can use Stan more directly from Stata if desired via Stata’s python integration.


Glad you found the issue. I’ve had trouble on the backend where Excel likes to interpret coef* (SE) as an equation to be calculated. This is very relatable.

You may also want to check out the haven R package and the read_dta() function specifically (if you have not already). It would cut Excel out of the equation entirely.


I did not know that. For what it’s worth, I prefer R a lot more than Stata, so I am trying to learn the former and become more independent of the latter. But thanks a lot, this might turn out to be useful in the future.

I will definitely do that. This being the first time I have used Stan in R (and R in general), I approached this issue very naively, and these are the results, I guess.

Update: the model now seems to have a persistent downward bias in its estimates and a very low effective sample size, around 213 with 500 post-warmup draws per chain (2000 total post-warmup draws).

Plotting the autocorrelation, I saw that it is likely the cause of the low ESS (ignore the title).

The mystery thickens: how does standardizing the data increase the autocorrelation? Shouldn’t it be the opposite? Also, the downward bias seems to remain, although with the ESS being so low it might just be a coincidence.

Also, in this model I applied the QR decomposition to the individual-level data as well, which makes it even weirder.

Lastly, it takes a lot less time to run.

I am no Lieutenant Columbo; what do I make of this evidence?

This reminded me… when I did a review of Bayesian meta-analyses a few years back, the range of wacky software apparently used was surprising. Three people claimed to have run MCMC in Excel (well, I suppose you could…), and another three in Lotus 1-2-3 (ask your parents). Sadly, nobody used a TI-83.

This ESS relative to the nominal number of draws is low, but not outrageously low. How many warmup iterations are you running per chain? And what is the evidence for a downward bias?

So I ran a simulation with 4 chains, 1000 iterations per chain, and 500 warmup. The weird thing is that it is much faster than before standardization, but with high autocorrelation (why?) and low ESS.

The presumed bias comes from the fact that we are trying to replicate the result of the RCT, so it is relative to the frequentist estimate. The paper is by Duflo and coauthors and is pretty robust.

I am now trying to make all the standard-deviation priors uninformative (flat), since I read in a paper by Gelman and coauthors that inverse-gamma priors are not very good in this context.

I doubt @andrewgelman would recommend uniform(0, Inf) as an appropriate alternative, though. If the data are standardized to have unit sd, I like weibull(2, 1) for a not-peaked-at-zero sd term (like measurement noise), and maybe normal(0, 3) for an sd term where zero truly reflects the most credible amount of variation (rare IMO).
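To make that concrete, here is a minimal toy sketch (my own names, not your model), assuming y has already been standardized to unit sd:

data {
  int<lower=0> N;
  vector[N] y;   // standardized outcome
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ std_normal();
  sigma ~ weibull(2, 1);    // not peaked at zero; most of the mass is below ~2
  // if zero variation really were the most credible value, something like
  // sigma ~ normal(0, 3);  // half-normal thanks to the lower bound
  y ~ normal(mu, sigma);
}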

I was reading this paper and they seem to suggest using either a uniform or a half-Cauchy:

“we recommend starting with a noninformative uniform prior density on standard deviation parameters” (page 527)

“When more prior information is desired, for instance to restrict \sigma away from very large values, we recommend working within the half-t family of prior distributions, which are more flexible and have better behavior near 0, compared to the inverse-gamma family. A reasonable starting point is the half-Cauchy family” (page 528)

But maybe you are right; they did not mention the case of standardized data.
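In Stan terms, I believe the half-Cauchy recommendation is just this (a toy prior-only sketch; if I understand correctly, the tau_unif ~ uniform(0, pi()/2) plus tau = tan(tau_unif) construction in my model is the same kind of half-Cauchy prior, up to the scale factor, written in a reparameterized way):

parameters {
  real<lower=0> sigma;   // the lower bound turns the Cauchy prior into a half-Cauchy
}
model {
  sigma ~ cauchy(0, 2.5);
}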

Take care when reading that paper to note when it is talking about priors for the variance (\sigma^2) versus priors for the standard deviation (\sigma).
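Concretely, the distinction looks like this (a toy prior-only sketch with my own names):

parameters {
  real<lower=0> sigma;      // standard deviation
  real<lower=0> sigma_sq;   // variance
}
model {
  // prior placed on the standard deviation:
  sigma ~ cauchy(0, 2.5);
  // prior placed on the variance; a normal likelihood would then need sqrt(sigma_sq):
  sigma_sq ~ inv_gamma(1, 0.01);
  // e.g. y ~ normal(mu, sqrt(sigma_sq));
}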