Putting prior on the constant separately in a regression

Can you show a snippet of your design matrix (the thing you’re doing the QR transform on) both before and after you’ve removed the intercept/constant?]

Hi Mike, thanks for all the help on my posts.

So this is the matrix of individual-level data without the constant (with the constant I would just add a column on the left

My hierarchical model has only two levels: individuals and schools (treatment was randomly assigned to schools). As you can see from above, schools have specific slopes (betas) and intercepts (alphas). I used to have the constant in the data matrix, so that the intercept would be one of the betas and in that case we had correct results. I do not understand why taking out the constant and giving it the same prior changes the results, it does not make much sense.

By looking at the results of a simulation I noticed a weird thing, I’ll post the image below, the interecepts in the last schools behave very weirdly and do not converge.

As you can see, from the 91th school to the 121st the alphas have absurd values and do not converge. What would that be due to? I want to stress that the two model are perfectly (in theory) equivalent.

Last thing, I apoligize for the long reply, this happens also in the model without QR.

If I understand it correctly, the intercept had a prior on it before but no longer does. Is that correct? Also, w is a vector of (data) weights on [0,1] which weight two different standard parameters?

If those two are correct, I wonder if the absurd school intercepts are those with a w close to 0 or 1 and something funky is going on with sigma2_t or sigma2_c that might make the intercepts imprecise. If one of those two go towards infinity, I assume the intercepts could also spread out infinitely. I wouldn’t put my money on that being the issue, but maybe worth checking.

Can you post your full code?

Here is the complete code, let me know if I need to explain some of the context.


data {
int<lower=0> N;              // num individuals
int<lower=1> K;              // num ind predictors
int<lower=1> J;              // num groups
int<lower=1> L;              // num group predictors
int<lower=1,upper=J> g[N];  // group for individual
matrix[N, K] x;             // individual predictors
matrix[J,L] u;                 // group predictors
vector[N] y;                 // outcomes

vector[N] w;                      // treatment assigned
}

transformed data{
matrix[L,J] u_transp;                 // group predictors transposed
u_transp = u';
}

parameters {
real alpha[J];
matrix[K, J] z;               // this will be the basis for the beta's
// transformation

matrix[K,L] gamma;              // school coefficients
vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale

vector[L] eta;

real effect;                      // super-population average treatment effect
real<lower=0> sigma2_t;            // residual SD for the treated
real<lower=0> sigma2_c;            // residual SD for the control

cholesky_factor_corr[K] L_Omega;

}

transformed parameters{

vector<lower=0>[K] tau = 2.5 * tan(tau_unif);
matrix[K,J] beta;            // vector of school-specific coefficients

// Here we define beta, not that since it is the transformation of a
// multivariate normal, we do not need to give a prior to beta, but rather
// to z and to the variance component (tau) and the correlation component
// of the Cholensky decomposition (L_Omega)

beta = gamma * u_transp + diag_pre_multiply(tau, L_Omega) * z;

}

model {

vector[J] c;
vector[N] m;
vector[N] d;
vector[N] uno;

// Here we assign priors to the variances of the beta_j's, to the correlation
// component L_Omega and to the random component z

tau_unif ~ uniform(0,pi()/2);
L_Omega ~ lkj_corr_cholesky(2);
to_vector(z) ~ std_normal();

// Here we model the likelihood of our outcome variable y

effect ~ normal(0, (9.11134)^2);
sigma2_c ~ inv_gamma(1, 0.01);
sigma2_t ~ inv_gamma(1, 0.01);

for (j in 1:J) {
c[j] = eta' * u_transp[ ,j];
}

alpha ~ normal(c,(9.11134)^2);

for (n in 1:N) {
m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
}

uno = rep_vector(1, N);
d = sigma2_t*w + sigma2_c*(uno-w);

y ~ normal(m, d);

}

generated quantities{
real effect_fs;                      // finite-sample average treatment effect
real effect_qte25;                   // quantile treatment effect at p = 0.25
real effect_qte50;                   // quantile treatment effect at p = 0.50
real effect_qte75;                   // quantile treatment effect at p = 0.75
real y0[N];                       // potential outcome if W = 0
real y1[N];                       // potential outcome if W = 1
real effect_unit[N];                 // unit-level treatment effect

for(n in 1:N){
real mu_c = alpha[g[n]] + x[n, ] * beta[, g[n]];
real mu_t = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect;
if(w[n] == 1){
y0[n] = normal_rng(mu_c , sigma2_c);
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t, sigma2_t);
}
effect_unit[n] = y1[n] - y0[n];
}
effect_fs = mean(effect_unit);
effect_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25);
effect_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50);
effect_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75);
}


P.S.: this is the version without QR (same results, same issue), so that we get rid of one layer of complexity.

Ah, so this is the model from SUG 1.13 with some slight elaboration. I discovered recently that I’d been misunderstanding that model for years and am still working out what structure it reflects. Any chance you could post your data? That might help me figure out your issue as well as the model structure.

Yes, sure.

ordered_groups.csv (40.5 KB) This assigns individual observations to their cluster
individual_level_variables.csv (191.0 KB) This are the individual level variables
school_level_variables.csv (5.0 KB) This are the school level variables

This is the R script to run the model :

setwd("...")
rm(list=ls())

library(rstan)
library(rstanarm, options(mc.cores = parallel::detectCores()))

individual_level_variables <- read.csv("individual_level_variables.csv", sep = ";")
school_level_variables <- read.csv("school_level_variables.csv", sep = ";")

N <- 5135
K <- 5
J <- 121
L <- 12

y <- individual_level_variables[,8]
w <- individual_level_variables[,2]
x <- individual_level_variables[,3:7]
u <- school_level_variables[,2:13]

ordered_groups <- read.csv("ordered_groups.csv", sep = ";")
jj <- c(1:5135)
for (n in c(1:5135)) {
jj[n] <- ordered_groups[n,2]
}

stan_data <- list(N = N, K = K, J = J, L = L, x = as.matrix(x), y = y, g = jj, u = as.matrix(u), w = w)

fit_mod1 <- stan(file = "Cholensky Decomposition.stan",
data = stan_data,
iter = 200, chains = 4, warmup = 100, seed = 69,
sample_file = file.path("mysimulation.csv"),
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 8)

csvfiles <- c("mysimulation_1.csv", "mysimulation_2.csv", "mysimulation_3.csv",
"mysimulation_4.csv")

param <- c("effect", "effect_fs", "effect_qte25", "effect_qte50", "effect_qte75", "alpha")
print(fit_mod1, pars=param, probs = c(0.1, 0.5, 0.9), digits = 3)


And lastly, this is the Stan model:

functions {
real quantile(vector x, real p){
int n;            // length of vector x
real index;       // integer index of p
int lo;           // lower integer cap of the index
int hi;           // higher integer cap of the index
real h;           // generated weight between the lo and hi
real qs;          // weighted average of x[lo] and x[hi]
n = num_elements(x);
index = 1 + (n - 1)*p;
lo = 1;
while ((lo + 1) < index)
lo = lo + 1;
hi = lo + 1;
h = index - lo;
qs = (1 - h)*sort_asc(x)[lo] + h*sort_asc(x)[hi];
return qs;
}

}

data {
int<lower=0> N;              // num individuals
int<lower=1> K;              // num ind predictors
int<lower=1> J;              // num groups
int<lower=1> L;              // num group predictors
int<lower=1,upper=J> g[N];  // group for individual
matrix[N, K] x;             // individual predictors
matrix[J,L] u;                 // group predictors
vector[N] y;                 // outcomes

vector[N] w;                      // treatment assigned
}

transformed data{
matrix[L,J] u_transp;                 // group predictors transposed
u_transp = u';
}

parameters {
real alpha[J];
matrix[K, J] z;               // this will be the basis for the beta's
// transformation

matrix[K,L] gamma;              // school coefficients
vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale

vector[L] eta;

real effect;                      // super-population average treatment effect
real<lower=0> sigma2_t;            // residual SD for the treated
real<lower=0> sigma2_c;            // residual SD for the control

cholesky_factor_corr[K] L_Omega;

}

transformed parameters{

vector<lower=0>[K] tau = tan(tau_unif);
matrix[K,J] beta;            // vector of school-specific coefficients

// Here we define beta, not that since it is the transformation of a
// multivariate normal, we do not need to give a prior to beta, but rather
// to z and to the variance component (tau) and the correlation component
// of the Cholensky decomposition (L_Omega)

beta = gamma * u_transp + diag_pre_multiply(tau, L_Omega) * z;

}

model {

vector[J] c;
vector[N] m;
vector[N] d;
vector[N] uno;

// Here we assign priors to the variances of the beta_j's, to the correlation
// component L_Omega and to the random component z

tau_unif ~ uniform(0,pi()/2);
L_Omega ~ lkj_corr_cholesky(2);
to_vector(z) ~ std_normal();

// Here we model the likelihood of our outcome variable y

effect ~ normal(0, (9.11134)^2);
sigma2_c ~ inv_gamma(1, 0.01);
sigma2_t ~ inv_gamma(1, 0.01);

for (j in 1:J) {
c[j] = eta' * u_transp[ ,j];
}

alpha ~ normal(c,(9.11134)^2);

for (n in 1:N) {
m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
}

uno = rep_vector(1, N);
d = sigma2_t*w + sigma2_c*(uno-w);

y ~ normal(m, d);

}

generated quantities{
real effect_fs;                      // finite-sample average treatment effect
real effect_qte25;                   // quantile treatment effect at p = 0.25
real effect_qte50;                   // quantile treatment effect at p = 0.50
real effect_qte75;                   // quantile treatment effect at p = 0.75
real y0[N];                       // potential outcome if W = 0
real y1[N];                       // potential outcome if W = 1
real effect_unit[N];                 // unit-level treatment effect

for(n in 1:N){
real mu_c = alpha[g[n]] + x[n, ] * beta[, g[n]];
real mu_t = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect;
if(w[n] == 1){
y0[n] = normal_rng(mu_c , sigma2_c);
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t, sigma2_t);
}
effect_unit[n] = y1[n] - y0[n];
}
effect_fs = mean(effect_unit);
effect_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25);
effect_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50);
effect_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75);
}


Thanks a lot for your interest in the issue and all your help :)

Update: there might be an issue in the ordered groups file as some clusters do not have observation, working now to fix it asap.

2 Likes

Ok, so there was a problem with the data which we fixed, now the model behaves better but we only did a small number of iterations (100). Now we are running 1000 iterations with 1000 warmups and see what happens.

These are the correct data:

individual_level_variables.csv (191.0 KB)
school_level_variables_correct.csv (4.8 KB)
ordered_groups_correct.csv (60.5 KB)

And this is the model and the script:

Script for Standardized data.R (2.7 KB)
Cholesky standardized.stan (4.3 KB)

We have decided to standardize all the data as it is easier to give priors and decorrelates the data a bit.

This is the code of the model if you want to just take a peak:

data {
int<lower=0> N;              // num individuals
int<lower=1> K;              // num ind predictors
int<lower=1> J;              // num groups
int<lower=1> L;              // num group predictors
int<lower=1,upper=J> g[N];  // group for individual
matrix[N, K] x;             // individual predictors
matrix[J,L] u;                 // group predictors
vector[N] y;                 // outcomes

vector[N] w;                      // treatment assigned
}

transformed data{
matrix[L,J] u_transp;                 // group predictors transposed
u_transp = u';
}

parameters {
real alpha[J];
matrix[K, J] z;               // this will be the basis for the beta's
// transformation

matrix[K,L] gamma;              // school coefficients
vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale
real<lower=0> sigma2_alpha;            // residual for the constant

vector[L] eta;

real effect;                      // super-population average treatment effect
real<lower=0> sigma2_t;            // residual SD for the treated
real<lower=0> sigma2_c;            // residual SD for the control

cholesky_factor_corr[K] L_Omega;

}

transformed parameters{

vector<lower=0>[K] tau = tan(tau_unif);
matrix[K,J] beta;            // vector of school-specific coefficients

// Here we define beta, not that since it is the transformation of a
// multivariate normal, we do not need to give a prior to beta, but rather
// to z and to the variance component (tau) and the correlation component
// of the Cholensky decomposition (L_Omega)

beta = gamma * u_transp + diag_pre_multiply(tau, L_Omega) * z;

}

model {

vector[J] c;
vector[N] m;
vector[N] d;
vector[N] uno;

// Here we assign priors to the variances of the beta_j's, to the correlation
// component L_Omega and to the random component z

tau_unif ~ uniform(0,pi()/2);
L_Omega ~ lkj_corr_cholesky(2);
to_vector(z) ~ std_normal();
to_vector(gamma) ~ std_normal();

// Here we model the likelihood of our outcome variable y

effect ~ std_normal();
sigma2_c ~ inv_gamma(1, 0.01);
sigma2_t ~ inv_gamma(1, 0.01);
sigma2_alpha ~ inv_gamma(1, 0.01);

for (j in 1:J) {
c[j] = eta' * u_transp[ ,j];
}

alpha ~ normal(c,sigma2_alpha);

for (n in 1:N) {
m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect*w[n];
}

uno = rep_vector(1, N);
d = sigma2_t*w + sigma2_c*(uno-w);

y ~ normal(m, d);

}

generated quantities{
real effect_fs;                      // finite-sample average treatment effect
real effect_qte25;                   // quantile treatment effect at p = 0.25
real effect_qte50;                   // quantile treatment effect at p = 0.50
real effect_qte75;                   // quantile treatment effect at p = 0.75
real y0[N];                       // potential outcome if W = 0
real y1[N];                       // potential outcome if W = 1
real effect_unit[N];                 // unit-level treatment effect

for(n in 1:N){
real mu_c = alpha[g[n]] + x[n, ] * beta[, g[n]];
real mu_t = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect;
if(w[n] == 1){
y0[n] = normal_rng(mu_c , sigma2_c);
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t, sigma2_t);
}
effect_unit[n] = y1[n] - y0[n];
}
effect_fs = mean(effect_unit);
effect_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25);
effect_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50);
effect_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75);
}


Again, I apologized for the issue, but whenever you transfer data from Stata to excel to R, things can happen. Btw if you have suggestions on how to improve the model I would be very happy to hear them.

For what it’s worth, you can use Stan more directly from Stata if desired via Stata’s python integration.

2 Likes

Glad you found the issue. I’ve had trouble on the backend where Excel likes to interpret coef* (SE) as an equation to be calculated. This is very relatable.

You may also want to check out the haven R package and the read_dta() function specifically (if you have not already). It would cut Excel out of the equation entirely.

1 Like

I did not know, for what is worth I prefer R a lot more than Stata, so I am trying to learn the former and become more independent from the latter. But thanks a lot, this might turn out useful in the future.

I will definitely do that. Being the first time I used Stan on R (and in general R) I approched this issue very naively and these are the results I guess.

Update: now the model seems to have a persisten downward bias in its estimates and a very low effective sample size, around 213 for post-warmup draws per chain=500, total post-warmup draws=2000.

Plotting the autocorrelation I saw that it is likely the cause of the low ess (ignore the title).

The mistery thickens: how does standardizing the data increase the autocorrelation? Shouldn’t it be the contrary? Also the downward bias seems to remain, though being the ess so low it might just be a coincidence.

Also, in this model I also applied QR decomposition on the individual level data, which makes it even weirder.

Lastly, it takes a lot less time to run.

I am no liutenent Colombo, what do I make of this evidence?

This reminded me… When I did a review of Bayesian meta-analyses a few years back, the range of wacky software apparently used was surprising. Three people claimed to have run MCMC in Excel (well, i suppose you could…), and another three in Lotus 1-2-3 (ask your parents). Sadly nobody used a TI-83.

This ESS per nominal sample is low but not outrageously low. How many warmup iterations are you running per chain? And what is the evidence for a downwards bias?

So I did a simulation with 4 Chains, 1000 iterations per chain and 500 warm-up. The weird thing is that it is much faster than before standardization, but with high autocorrelation (why?) and low ESS.

The presumed bias comes from the fact that we are trying to replicate the result of the RCT and it is relative to the frequentist estimate. The paper is by Duflo and coauthors and is pretty robust.

I am trying now to put all std deviations priors to be uninformative (flat) since I read in a paper by Gelman and coauthors that inverse gamma are not too good in this context.

I doubt @andrewgelman would recommend uniform(0, Inf) as an appropriate alternative though. If the data are standardized to have unit sd, I like weibull(2,1) for a not-peaked-at-zero sd term (like measurement noise), and maybe normal(0,3) for a sd term where zero truly reflects reflects the most credible amount of variation (rare IMO).

I was readng this paper and they seem to suggest using either a uniform or a half-Cauchy

“we recommend starting with a noninformative uniform prior density on standard deviation parameters”(page 527)
“When more prior information is desired, for instance to restrict \sigma away from very
large values, we recommend working within the half-t family of prior distributions,
which are more flexible and have better behavior near 0, compared to the inversegamma family. A reasonable starting point is the half-Cauchy family” (page 528)

But maybe you are right, they did not mention the case of standardized data.

Take care when reading that paper to note when it is talking of priors for variance (\sigma^2) versus priors for the standard deviation (\sigma).

Yes, I noticed that. I switched and put prior on std instead. I settled on Half Cauchy. The model does well now, though it still has a good deal of autocorrelation, even with QR. I really wonder where it might come from.