Convergence issues of MCMC

Hello!

I am a complete beginner in Stan and Bayesian Regressions generally. With my setup, the chains do not converge. I have 9993 observations and it takes 4-5 days, in the end without the chain convergence. Could anyone help me out to reproduce the example or suggest any kind of solution? I post the correlation matrix of the data for the reproduction and the setup.

Here is my code:
data {
int<lower=0> N;
int<lower=0> NW;// dimension of exogenous covariates
int<lower=0> NZ; // dimension of instruments
vector[N] y; //outcome vector
vector[N] x; //endogenous variable
matrix[N, NW] w; // exogenous variable
matrix[N, NZ] z; // instruments
vector<lower=0, upper=1>[NZ] beta_z;

}

transformed data {
matrix[N, 2] Y;
Y[,1] = y;
Y[,2] = x;
}

parameters {
real<lower=0, upper=1> beta_x;
vector[2] beta_null;
matrix[NW, 2] beta_w;
vector[NZ] alpha_z;
cov_matrix[2] Omega;
}

transformed parameters {
matrix[N,2] mu;
mu[, 1] = beta_null[1] + beta_x * x + z *beta_x * beta_z + w * beta_w[, 1];
mu[, 2] = beta_null[2] + z * alpha_z + w * beta_w[, 2];

}

model {
//priors
beta_x ~ normal(0,1);
beta_null[1] ~ normal(0,1);
alpha_z ~ normal(0,1);
to_vector(beta_w) ~ normal(0, 1);
to_vector(Omega) ~ student_t(4,0,1);
for(n in 1:N) {
Y[n] ~ multi_student_t(4, mu[n],Omega);
}
}

stanoutput <- stan( file = “stanfile”, data = data, chains = 3)

###Data

N <- 9993

NW <- 5

NZ <- 4

y x w1 w2 w3 w4 w5 z1 z2
y 1.00000000 0.26396795 -0.2997271676 0.16760224 -0.012144523 0.31854680 0.30474905 0.038845574 0.01789145
x 0.26396795 1.00000000 0.1186446230 0.02934843 0.034378885 -0.20605879 -0.19810733 0.038781732 -0.04644700
w1 -0.29972717 0.11864462 1.0000000000 -0.04055944 -0.004459172 -0.14316250 -0.14501490 -0.007953769 -0.01566073
w2 0.16760224 0.02934843 -0.0405594401 1.00000000 -0.011347410 0.28869093 0.27350731 0.032998922 0.07557924
w3 -0.01214452 0.03437889 -0.0044591721 -0.01134741 1.000000000 -0.10056156 -0.09746229 -0.020801965 -0.00978715
w4 0.31854680 -0.20605879 -0.1431624999 0.28869093 -0.100561555 1.00000000 0.98289808 0.045713043 0.10861771
w5 0.30474905 -0.19810733 -0.1450149042 0.27350731 -0.097462286 0.98289808 1.00000000 0.044450397 0.11435677
z1 0.03884557 0.03878173 -0.0079537692 0.03299892 -0.020801965 0.04571304 0.04445040 1.000000000 0.33073711
z2 0.01789145 -0.04644700 -0.0156607270 0.07557924 -0.009787150 0.10861771 0.11435677 0.330737114 1.00000000
z3 -0.01022041 -0.07328704 -0.0164752663 0.05923491 0.009410119 0.12644242 0.13301806 0.157859773 0.47729682
z4 -0.03086815 -0.08324394 -0.0003506437 0.06839972 0.022309498 0.09969438 0.10391898 0.088635791 0.26799469
z3 z4
y -0.010220408 -0.0308681490
x -0.073287038 -0.0832439448
w1 -0.016475266 -0.0003506437
w2 0.059234909 0.0683997177
w3 0.009410119 0.0223094982
w4 0.126442424 0.0996943837
w5 0.133018062 0.1039189775
z1 0.157859773 0.0886357909
z2 0.477296821 0.2679946914
z3 1.000000000 0.5614843413
z4 0.561484341 1.0000000000

@James_Savage maybe you could have a look?

Hi @Maria, welcome to Stan Discourse!

This looks a bit odd. The exclusion restriction in IV is that z does not appear in the mean model for y. So I don’t know what z * beta_x * beta_z is doing in mu[,1]. Can you add some context?

Perhaps try:

You might want to make sure it’s working on a smaller sample (say, 200 observations) before throwing 9000 observations at it. It should definitely not take several days to compute! You can probably speed it up further by writing a matrix multivariate normal distribution (to save inversion of the Cholesky factor of the covariance matrix 9000 times). I think I have some code lying about somewhere that does that. Will post it here.

Hi @James_Savage ,

Thanks for the reply. Yes I know that it does not appear, though I need to observe the values of the endogenous variable coefficient when the instruments have the direct effect on the outcome. (effect varies from 0 to 1 concentrated on the interval [0,0.5], that is why beta prior). See https://personal.eur.nl/thurik/Research/Articles/Family%20background%20variables%20as%20instruments%20in%20EER.pdf p.520 for the reference… I would appreciate if you link me the matrix MVN instead of Cholesky decomposition.

The model does converge for 200 observations, just checked. Though with the error that ‘2920 transitions after warmup that exceeded the maximum treedepth’

Oh OK – in that paper they do not estimate \tilde{\gamma}, which is not identified. They assume a value for it and do sensitivity testing for different values.

If it’s converging for small samples and not for large samples, you might want to look at outliers. And potentially specify the sampling equation as a multivariate student t model rather than multivariate normal.

Hope this helps,
Jim

Your priors are probably bad ones too – simulate some data from your model with the given covariates and instruments by drawing from your prior, simulating the structural errors, then simulating mu, then simulating the data. Use that data + your covariates + instruments to fit the model, and see if it estimates cleanly. Repeat many times (you might want to read this https://arxiv.org/pdf/1804.06788.pdf).

And by “bad ones” I mean the priors in my post are probably bad ones :p.

Remember that the prior should always encode the important information that we don’t have in our data; good priors also aid computation.

Always feel free to tag me here – it’s a wonderful community and you’ll find people happy to help you build better models.

Thanks for taking time!! I will try your suggestions.:)

@James_Savage I try your suggestions, I assume multivariate student distribution, so that I do not need to use Cholesky decomposition. I also re-define priors. I rescale the data to the mean, my model still takes huge estimation time. Can one of you reproduce the example to see what’s going wrong there? I would appreciate so much… @bgoodri @James_Savage @Bob_Carpenter

Maria – are you estimating beta_z or assuming it? It is not identified-- you should be passing it in as data. Something like this (I noticed in your original code you specified the model chunk twice-- I assume that was an error?):

data {
  int<lower=0> N;
  int<lower=0> NW;// dimension of exogenous covariates
  int<lower=0> NZ; // dimension of instruments
  vector[N] y; //outcome vector
  vector[N] x; //endogenous variable
  matrix[N, NW] w; // exogenous variable
  matrix[N, NZ] z; // instruments
  real<lower = 0, upper = 1> beta_z;
}
transformed data {
  matrix[N, 2] Y;
  Y[,1] = y;
  Y[,2] = x;
}
parameters {
  real beta_x;
  vector[2] beta_null;
  matrix[NW, 2] beta_w;
  vector[NZ] alpha_z;
  vector<lower=0>[2] scale;
  cholesky_factor_corr[2] L_omega;
}
transformed parameters {
  matrix[N,2] mu;
  mu[, 1] = beta_null[1] + beta_x * x + z * beta_z * beta_x + w * beta_w[, 1];
  mu[, 2] = beta_null[2] + z * alpha_z + w * beta_w[, 2];
}
model {
  //priors
  beta_x ~ normal(0,1);
  beta_null ~ normal(0,1); 
  to_vector(beta_w) ~ normal(0,1);
  alpha_z ~ normal(0,1);
  scale ~ student_t(4,0,1);
  L_omega ~ lkj_corr_cholesky(4);
  for(n in 1:N) {
    Y[n] ~ multi_normal_cholesky(mu[n], diag_pre_multiply(scale,L_omega));
  }
}

oooh!! You are correct. I will try. Hopefully this should fix! (Yes that was an error.) And thank you so much!!!:)

So the estimation time is minimal now, but for some coefficients chains converge, for some - do not. Guess i should consider different priors.

One option i can do is to set improper priors of OLS. i do not have other ideas of what kind of priors i should set… would you suggest setting improper flat priors?

Maria –

Don’t do that! Are you getting divergent transitions, or is it just not converging?

A few possible problems:

  • Weak instruments: Is there much variation in your endogenous covariate? If you regress lm(x ~ z + ., data = w), is the coefficient on z large/significant?

  • Insufficient data: how small is your trial sample? Sometimes making it larger might help.

  • Outliers: have you plotted histograms of x and y? How do they look? Do either of them have sign restrictions (if either are necessarily positive, it can motivate taking logs etc). Are you running an income regression? If if everyone has a positive income, it’s worth taking the log of income, in which case the coefficient is interpreted as the % change in income from a year of education.

  • The prior on scale. Sometimes I find that priors that have little weight around zero can help (lognormal() or inv_gamma(), say).

Let me know if any of these work.

Hey,

Thanks for the detailed suggestions. Earlier in my first post i was getting just non-convergence. Now i re-tried the adapted code after removing outliers, getting rid of cholesky decomposition and estimating log of income, i get divergent transitions and i also need to increase maximum treedepth. I will try to increase treedepth and adapt_delta… If it does not help either, will let you know. I am reading a post that Stan is very fast, while my model took 10 hours. (9500 observations) My ednogenous variable-Education is consisting of the years: 0,3,4,6,8,12, and the distribution is not normal. That does not create any problem in maximum likelihood, do you think that could create problems here?. (To be more precise i get divergent transitions and when i launch shinystan, i see that the two chains converge, while the third does not).

I have a lot of binary variables, i do not see the point in centering them…(just read some people suggest to re-scale the data). Do you think it’s reasonable to center and scale the binary variables?

And yes i make sure that in the samples that i am using, the instrument has a strong and significant effect.

I was also thinking, i could take OLS coefficients as the mean of the parameters and spread it out. Would that make sense?

I update in the original code, the one that i am currently using

I saw a post suggesting non-centered parametrization. https://betanalpha.github.io/assets/case_studies/divergences_and_bias.html . Should i try?

And one more thing, the model works fine with small data (800), while it gives divergent transitions for the larger one.

@Maria

This seems to work for me (took some fixing but now it fits on 20000 observations using penalized maximum likelihood in about 10 seconds). It’ll take longer for MCMC, but probably not much more than 10 minutes or so.

Here’s the Stan code:

data {
  int<lower=0> N;
  int<lower=0> NW;// dimension of exogenous covariates
  int<lower=0> NZ; // dimension of instruments
  vector[N] y; //outcome vector
  vector[N] x; //endogenous variable
  matrix[N, NW] w; // exogenous variable
  matrix[N, NZ] z; // instruments
  vector<lower = 0, upper = 1>[NZ] beta_z;
  int sim_data;
}
transformed data {
  vector[2] Y[N];
  for(i in 1:N) {
    Y[i][1] = y[i];
    Y[i][2] = x[i];
  }
}
parameters {
  real beta_x;
  vector[2] beta_null;
  matrix[2, NW] beta_w;
  vector[NZ] alpha_z;
  vector<lower=0>[2] scale;
  corr_matrix[2] Omega;
}
transformed parameters {
  vector[2] mu[N];
  for(i in 1:N) {
    mu[i][1] = beta_null[1] + beta_x * x[i] + z[i] * beta_z * beta_x  + w[i] * beta_w[1]';
    mu[i][2] = beta_null[2] + z[i] * alpha_z + w[i] * beta_w[2]';
  }
}
model {
  //priors
  beta_x ~ normal(0,1);
  beta_null ~ normal(0,1); 
  to_vector(beta_w) ~ normal(0,1);
  alpha_z ~ normal(0,1);
  scale ~ inv_gamma(2,2);
  Omega ~ lkj_corr(4);
  
  if(sim_data==0) {
    Y ~ multi_normal(mu, quad_form_diag(Omega, scale));
  }
}
generated quantities {
  matrix[N, 2] Y_sim;
  matrix[2, 2] Sigma = quad_form_diag(Omega, scale);
  for(i in 1:N) {
    Y_sim[i] = multi_normal_rng(mu[i]', Sigma)';
  }
}

This simulates the data and fits the model

library(rstan)

# Simulate some fake data
N <- 20000
NZ <- 2
NW <- 2
Omega <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
tau <- c(3,2)
Sigma <- diag(tau) %*% Omega %*% diag(tau)
structural_shocks <- MASS::mvrnorm(N, c(0 ,0), Sigma)
intercepts <- c(3,4)
alpha_z <- rnorm(NZ)
beta_w <- matrix(rnorm(NW*2), 2, NW)
w <- matrix(rnorm(N*NW), N, NW)
z <- matrix(rnorm(N*NZ), N, NZ)
beta_x <- rnorm(1)
x <- intercepts[2] + z %*% alpha_z + w %*%beta_w[2,] + structural_shocks[,2]
y <- intercepts[1] + x * beta_x + w %*% beta_w[1,] + structural_shocks[,1]
mu <- data.frame(y = intercepts[1] + x * beta_x + w %*% beta_w[1,] ,x =  intercepts[2] + z %*% alpha_z + w %*%beta_w[2,])


# Check to make sure that I've constructed the data properly
# Run standard IV on it

plot(coef(AER::ivreg(y ~ x+ w[,1] + w[,2] | w[,1] + w[,2] + z[,1] + z[,2])), c(intercepts[1], beta_x, beta_w[1,]), col = c(1,2,1,1), main = "IV fit")
abline(0,1)

# Check to see that there's a bias -- fit with OLS
plot(coef(lm(y ~ x+ w[,1] + w[,2])), c(intercepts[1], beta_x, beta_w[1,]), col = c(1,2,1,1), main = "LM fit")
abline(0,1)

# OK -- the simulated data seems to work

compiled_model <- stan_model("IV_mvn.stan")

# Construct the data
data_list <- list(N = N, 
                  NW = NW, 
                  NZ = NZ, 
                  z = z, 
                  x = as.numeric(x), 
                  y = as.numeric(y),
                  w = w, 
                  beta_z = rep(0, NZ),
                  sim_data = 0)

# Fit it with penalized maximum likelihood -- takes a couple of seconds
estimated_model_maxlik <- optimizing(compiled_model, data = data_list)

gp <- function (f, p) {
  if (class(f) == "list") {
    f$par[grepl(p, names(f$par))]
  }
  else {
    yy <- rstan::get_posterior_mean(f, pars = p)
    yy[, ncol(yy)]
  }
}

# Make sure it's recovering the parameter estimates
ests <- c(gp(estimated_model_maxlik, "beta_null\\[1"),rrcl::gp(estimated_model_maxlik, "beta_x"), rrcl::gp(estimated_model_maxlik, "beta_w\\[1"))
plot(ests, c(intercepts[1], beta_x, beta_w[1,]), col = c(1,2,1,1), main = "Bayes fit")
abline(0,1)
# Seems to work very well

# Fit with MCMC -- will take some time, but not 10 hours!
estimated_model <- sampling(compiled_model, data = data_list, iter = 500, chains = 4, cores= 4)
1 Like