Hierarchical bayesian regression

I am trying to model 6 variable linear model using Hierarchical bayesian modelling, underlying equation is,

model_code = """
data {
    int<lower=1> N;
    int<lower=1> M;
    row_vector[M] m[N];
    row_vector[M] p[N];
    row_vector[M] nw[N];
    row_vector[M] gp[N];
    row_vector[M] c_ly[N];
    row_vector[M] c_lm[N];
    row_vector[M] cc[N];

parameters {
    real           alpha;
    vector[6]      beta;
    vector[6]      mu_beta;
    cov_matrix[6]  sigma_beta;
    cov_matrix[M]  sigma_err;

transformed parameters {
    row_vector[M] cc_hat[N];
    for(i in 1:N) {
        cc_hat[i] = alpha + m[i] * beta[1] + p[i] * beta[2] + nw[i] * beta[3] + gp[i] * beta[4] + c_ly[i] * beta[5] + c_lm[i] * beta[6];

model {
    alpha ~ normal(0, 1000);
    mu_beta ~ multi_normal(rep_vector(0.0, 6), diag_matrix(rep_vector(1000,6)));
    sigma_beta ~ inv_wishart(7, diag_matrix(rep_vector(0.00000001,6)));
    beta ~ multi_normal(mu_beta, sigma_beta);
    sigma_err ~ inv_wishart(M+1, diag_matrix(rep_vector(1000,M)));
    cc ~ multi_normal(cc_hat, sigma_err);

and i am getting errors(below), i tried a lot of variants, hardcoded matrix instead of sigma_err(doing this didn’t raise exception),

(when using the sigma_err as it is)

Exception: inv_wishart_lpdf: LDLT_Factor of random variable is not positive definite.  last conditional variance is nan.  (in 'unknown file name' at line 32)

and (this i get everytime)

WARNING:pystan:Rhat above 1.1 or below 0.9 indicates that the chains very likely have not mixed
WARNING:pystan:89 of 800 iterations saturated the maximum tree depth of 10 (11.1 %)
WARNING:pystan:Run again with max_treedepth larger than 10 to avoid saturation
WARNING:pystan:Chain 1: E-BFMI = 0.0484
WARNING:pystan:Chain 2: E-BFMI = 0.0173
WARNING:pystan:Chain 3: E-BFMI = 0.0157
WARNING:pystan:Chain 4: E-BFMI = 0.00741
WARNING:pystan:E-BFMI below 0.2 indicates you may need to reparameterize your model

Can somebody help me out ?
and how can i better initialised parameters?

1 Like

Others may have better, more focused suggestions, but I will start by noticing that your priors (specially the one on the intercept) are bit too wide to provide any sort of regularisation. It seems to me that current practice is to avoid Wishart priors unless you really know what you’re doing. Perhaps @bgoodri can say more about that last point.

1 Like

what can i do instead?

Hey! Can you share your data (or a representative subset)? I can probably have a look at it tonight, if you want.

1 Like

I think here would be a good place to start.


yeah sure, zip file will do?

ok. thanx for the pointer i check it out!!


Feel free to write back if you have any doubts. @Max_Mantei has already volunteered precious time to help but a few others including myself would be happy to provide further assistance. Happy modelling!

1 Like

Will do, thanks again. :)

subset.csv (32.2 KB)

1 Like


Thanks for sending the data! There are a few points I wanted to comment on. Please take everything with a pinch of salt, since I don’t know the specifics of your data or the model you try to fit.

  • The scales of the data are huge! Since all variables are strictly positive, I ended up taking logs of all of them. This is common practice in my field (Econ), not sure if it makes sense in your application – it’s a different functional form, but it is much easier to work on a log scale with such dispersed data.
  • The predictor variables are fairly highly correlated with each other. This can make the sampling process much harder. The QR parameterization can help with that (check out here and here).
  • 4 out of 6 explanatory variables are actually group level variables. You typically don’t want those to have group-varying coefficients. The Gelman/Hill textbook has a sub-chapter on varying intercepts and slope models with group-level predictors.
  • In Stan multi-level models are usually parameterized via the (marginal) standard deviations and the Cholesky factor of the correlation matrix. This is what @maxbiostat pointed out above. He also rightly pointed out, that tighter priors are often much better for fitting such models.

I’d propose a model like this:

\log(cc) \sim \text{Normal}(Q\tilde\beta_{m}, \sigma),

where \beta is array size M (for counties 1:M) of vectors length 3 (intercept, coefficient of log(c_ly), and coefficient of log(c_lm)), indexed at the appropriate m. Q is a N \times 3 matrix computed like this:

1. compute log(c_ly) and log(c_lm) and put them in a matrix with dimension [N,2]
2. subtract the column mean of each column (centering)
3. use QR decomposition and get Q: qr_thin_Q() function in Stan
4. multiply Q by sqrt(N - 1)
5. append a column of 1s to that matrix

The distribution on \tilde\beta_m is Multivariate Normal, where a non-centered parameterization is used:

\begin{aligned} \tilde\beta_m &= \mu_{\tilde\beta,m}' + \text{diag}(\sigma_\tilde\beta)L^\Omega_\tilde\beta Z \\ Z &\sim \text{MVN}(0, I), \end{aligned}

where \text{diag}(\sigma_\tilde\beta) is the diagonal matrix of the three standard deviations of \tilde\beta and L^\Omega_\tilde\beta is the Cholesky factor of their correlation matrix. The Z's are just a bunch of standard normal variables. The group-level predictors enter in \mu_{\tilde\beta,m}:

\mu_{\tilde\beta} = Q^{\text{M}}\tilde\gamma,

where Q^{\text{M}} is an M \times 5 predictor matrix with group-level variables and an intercept, \tilde\gamma is a 5 \times 3 matrix of coefficient, so that \mu_{\tilde\beta} has dimension M \times 3 — that’s why it’s indexed and transposed above. The matrix Q^{\text{M}} is built like this:

1. compute log(m), log(p), log(nw), log(gp) and put them in a matrix with dimension [M,4]
2. subtract the column mean of each column (centering)
3. use QR decomposition and get Q^M: qr_thin_Q() function in Stan
4. multiply Q^M by sqrt(M - 1)
5. append a column of 1s to that matrix

This implies (this is the centered parameterization, which is probably less computationally efficient):

\begin{bmatrix} \tilde\beta_{1,m} \\ \tilde\beta_{2,m} \\ \tilde\beta_{3,m} \\ \end{bmatrix} \sim \text{MVN} \left(Q^{\text{M}}_m\tilde\gamma, \Sigma_\tilde\beta\right),

where \Sigma_\tilde\beta = \text{diag}(\sigma_\tilde\beta)\Omega_\tilde\beta\text{diag}(\sigma_\tilde\beta), so loosely speaking, the group-level effects are “projected” onto the varying slopes and intercepts of the “individual” level.

Okay, so this was very detailed… haha. But I didn’t know how to put it more simply. In my trials it was really the combination of QR decomposition, group-level effects as “projection”/interaction, and non-centered parameterization that worked.

I can post the (mostly) working Stan code tomorrow.


Edit: The tildes on top of \tilde\beta and \tilde\gamma denote the fact that you are probably not interested in them per se, but rather the “real” coefficients \beta and \gamma. You can get those back by doing the steps for creating Q or Q^M, but doing the last 3 point differently:

3. use QR decomposition and get R: qr_thin_R() function in Stan
4. devide R by sqrt(M - 1)
5. compute the inverse of R --> R_inv = inverse(R) in Stan
6. compute beta[K-1] = R_inc * beta_tilde[2:K] 
(above assumes that the intercept column of 1s is in first position)

This is also described in the QR resources I linked to above. Note, that this will not undo the centering of the variables, i.e. your intercepts still have the “as if variables are centered” interpretation, which is also important if you want to calculate predictions or other quantities of interest.


Thank you for such a detailed post, I am going through it and references you suggested.
I will try to fit this model and let you know how well it fits.
Its very kind of you to give your time :)


Hey! I’m glad to help! :)

This is what I think is the most relevant section form the Gelman/Hill book about this:

(I hope it’s fine to post this… If not I’ll delete it.)

Here’s the Stan code. Running it with adapt_delta = 0.99 gave 0 divergent transitions and took about 45 seconds (4 chains, 2000 iterations each). Don’t hesitate to ask if some things are unclear – I think the description in my post above corresponds fairly closely to the code. There are a lot of +1s in the variable declarations. These come from dragging along the regression intercepts in the matrices.

  // to compute centered predictor matrices
  matrix center(matrix H){
    matrix[rows(H), cols(H)] R;
    for (c in 1:cols(H))
      R[:,c] = H[:,c] - mean(H[:,c]);
    return R;
data {
  int<lower=1> N;
  int<lower=1> M;
  int<lower=1> K;
  int<lower=1> K_county;
  matrix[N, K] X;
  matrix[M, K_county] X_county;
  vector[N] cc;
  int<lower=1, upper=M> county_index[N];
transformed data{
  matrix[N, K] Q = qr_thin_Q(center(X)) * sqrt(N - 1);
  matrix[M, K_county] Q_county = qr_thin_Q(center(X_county)) * sqrt(M - 1);
  matrix[K, K] R_inv = inverse(qr_thin_R(center(X)) / sqrt(N - 1));
  matrix[K_county, K_county] R_inv_county = inverse(qr_thin_R(center(X_county)) / sqrt(M - 1));
  matrix[N, K+1] x = append_col(rep_vector(1.0, N), Q);
  matrix[M, K_county+1] x_county = append_col(rep_vector(1.0, M), Q_county);
parameters {
  matrix[K_county+1, K+1] gamma_tilde;
  cholesky_factor_corr[K+1] L_Omega_beta_tilde;
  vector<lower=0>[K+1] sigma_beta_tilde;
  real<lower=0> sigma;
  //auxilliary variables
  vector[K+1] z[M];

transformed parameters {
  matrix[M, K+1] mu_beta_tilde = x_county * gamma_tilde;
  vector[K+1] beta_tilde[M];
  vector[N] mu;
  for (m in 1:M)
    beta_tilde[m] = mu_beta_tilde[m]' + diag_pre_multiply(sigma_beta_tilde, L_Omega_beta_tilde) * z[m];
  for (n in 1:N)
    mu[n] = x[n]*beta_tilde[county_index[n]];


model {

  gamma_tilde[1,1] ~ normal(5, 2.5);
  gamma_tilde[1,2] ~ normal(0, 1.5);
  gamma_tilde[1,3] ~ normal(0, 1.5);
  for(k in 2:K+1)
    gamma_tilde[k] ~ normal(0, 0.25);

  L_Omega_beta_tilde ~ lkj_corr_cholesky(4);
  sigma_beta_tilde ~ normal(0, 0.5);
  sigma ~ normal(0, 2.5);
  for (m in 1:M)
    z[m] ~ std_normal();
  cc ~ normal(mu, sigma);
generated quantities{
  matrix[K_county, K] gamma = R_inv_county * gamma_tilde[2:(K_county+1),2:(K+1)];
  vector[K] beta[M];
  for (m in 1:M)
    beta[m] = R_inv * beta_tilde[m, 2:3];

You’ll note that I handle the data a bit differently compared to your code. I know you are working in Python, but I’ll still post my R code – you’ll probably be able to see what’s going on. If not, feel free to ask.


data_subset <- read_csv("data/subset.csv")

# create the county level predictor matrix
county_vars <- data_subset %>% 
  select(county, m, p, nw, gp) %>% 
  group_by(county) %>% 
  summarize_all(.funs = ~ unique(.)) %>% 
  select(-county) %>% 

stan_data_county_log <- list(
  N = nrow(data_subset),
  M = length(unique(data_subset$county)),
  K = 2,
  K_county = 4,
  X = with(data_subset, cbind(log(c_ly), log(c_lm))),
  X_county = log(county_vars),
  cc = log(data_subset$cc),
  county_index = as.numeric(factor(data_subset$county))

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

m7 <- stan(
  file = "model/m-7.stan",
  data = stan_data_county_log,
  control = list(adapt_delta = 0.99)