I’ll describe my model further below, but the basic problem is the following. I fit the model in Stan as a test run, with 3 chains and 10 iterations (run in parallel), and it finished in under 4 minutes. I then increased the number of iterations to 1000 and the number of chains to 4, and let it run overnight. However, when I check on the progress (in the “Viewer” tab of RStudio), Stan hasn’t reported anything past iteration 1/1000. Extrapolating from the test run, the first 100 iterations should finish in under 40 minutes, so with no progress reported overnight the sampler appears to be getting stuck on something. I’ve let it go for up to 48 hours with no progress observed.
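For reference, this is roughly how the model is being run (a sketch assuming rstan called from RStudio; the file name, the stan_data list, and the core setup are placeholders for my actual code):

library(rstan)
options(mc.cores = parallel::detectCores())  # run the chains in parallel

# test run: finishes in under 4 minutes
fit_test <- stan(file = "model.stan", data = stan_data, chains = 3, iter = 10)

# full run: never reports anything past "Iteration: 1 / 1000"
fit_full <- stan(file = "model.stan", data = stan_data, chains = 4, iter = 1000)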
Here are the details:
I have J=14 outcomes, K=23 predictors of interest, and P=21 additional covariates. I have been asked by my frequentist colleagues to fit J \times K different linear regression models, each one regressing one of the J outcomes on one of the K predictors, controlling for the P additional covariates. Inspired by this paper, I would instead like to fit these models all at once, in a Bayesian framework. The full model can be expressed as follows:
y_{ij}=\alpha_{jk} + \beta_{jk}X_i + \theta_{jk}Z_{ik} + \epsilon_{ijk}
Here, y_{ij} is the value of outcome j for unit i, X_i is the vector of covariates for unit i, and Z_{ik} is predictor k for unit i. We put structure on the priors for the \alpha_{jk}, \beta_{jk}, and \theta_{jk}, to borrow strength across both the outcomes and the predictors of interest, as follows:
\alpha_{jk} = \alpha_0 + \alpha_j + a_{jk}
\beta_{jk} = \beta_0 + \beta_j + b_{jk}
\theta_{jk} = \theta_0 + \theta^E_j + \theta^P_k + \theta^{EP}_{jk}
We put (normal) shrinkage priors across related parameters as appropriate, N(0,1) priors for hyper-means, and half-normal(0,1) priors on variance components.
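Spelled out for the \theta terms, for example (the \alpha and \beta components are handled the same way, and everything is implemented in non-centered form in the code below):

\theta^E_j \sim N(0, \tau_E^2), \quad \theta^P_k \sim N(0, \tau_P^2), \quad \theta^{EP}_{jk} \sim N(0, \tau_{EP}^2)

\theta_0 \sim N(0, 1), \quad \tau_E, \tau_P, \tau_{EP} \sim \text{half-}N(0, 1)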
All of the outcomes are on a similar scale (potential values are fractions from 0 to 1). I set up the data in “long” format, with one row for every combination of i, j, and k. Note that this means each outcome y_{ij} is repeated K times, which is admittedly a bit odd. X is centered and scaled, and Z is centered without scaling (centered separately for each k, actually), so that a 1-unit change remains interpretable. All of the Z predictors have the same potential range (from 1 to 12).
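For concreteness, here is a minimal sketch of how the long-format data and the Stan data list are built (assuming the raw data sit in an n_units x J outcome matrix Y, an n_units x K predictor matrix Z, and an n_units x P covariate matrix X, with J, K, and P holding the counts above; the names are placeholders):

n_units <- nrow(Y)
# one row per (i, j, k) combination, so N = n_units * J * K
grid <- expand.grid(i = seq_len(n_units), j = seq_len(J), k = seq_len(K))

stan_data <- list(
  N = nrow(grid),
  J = J, K = K, P = P,
  y = Y[cbind(grid$i, grid$j)],   # y_{ij}, repeated K times
  z = Z[cbind(grid$i, grid$k)],   # centered predictor k for unit i
  j = grid$j,
  k = grid$k,
  X = X[grid$i, , drop = FALSE]   # covariate rows repeated J*K times
)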
Some things that I’ve tried:
- Reducing the complexity of the model, for example by dropping the a_{jk} and b_{jk} components. We originally included unit-specific (i) effects, which we have since dropped, and we originally had a more complicated multilevel prior structure for the different effects (the outcomes and the predictors of interest are both “grouped”), but we took that out because it adds a lot more parameters to the model.
- Adding a prior across all of the variance components to try to stabilize them (the “uber-prior” in the code below).
So far nothing has worked. Any suggestions? One thing I’m curious about: could taking out some of the parameters that provide structure (in order to reduce model complexity) actually make the fit take longer, because that structure is helpful?
Here is the current Stan code, with the a_{jk} and b_{jk} terms removed:
data{
  int<lower=0> N;               // number of observations (long-format rows)
  int<lower=0> J;               // number of outcomes
  int<lower=0> K;               // number of predictors of interest
  int<lower=0> P;               // number of background covariates
  vector[N] y;                  // outcome values
  vector[N] z;                  // predictors of interest
  int<lower=1, upper=J> j[N];   // outcome index
  int<lower=1, upper=K> k[N];   // predictor index
  matrix[N, P] X;               // background covariates
}
parameters{
  real alpha_0;                          // intercept components
  vector[J] alpha_j_raw;
  row_vector[P] beta_0;                  // controls for covariates
  matrix[J, P] beta_j_raw;
  real theta_0;                          // overall effect
  vector[J] theta_e_raw;                 // outcome-specific term
  vector[K] theta_p_raw;                 // predictor-specific term
  matrix[J, K] theta_ep_raw;             // outcome-by-predictor interaction
  real<lower=0> sigma_alpha;             // variance terms
  real<lower=0> sigma_beta;
  vector<lower=0>[J] sigma_epsilon_raw;  // residual sd components
  real<lower=0> sigma_epsilon0;
  real<lower=0> tau_e;
  real<lower=0> tau_p;
  real<lower=0> tau_ep;
  real<lower=0> uber_mu;                 // uber-prior hyperparameters
  real<lower=0> uber_sigma;
}
transformed parameters {
  vector[J] alpha_j;
  matrix[J, P] beta_j;
  vector[J] theta_e;
  vector[K] theta_p;
  matrix[J, K] theta_ep;
  vector<lower=0>[J] sigma_epsilon;
  // Matt tricks (non-centered parameterization)
  alpha_j = alpha_j_raw * sigma_alpha;
  beta_j = beta_j_raw * sigma_beta;
  theta_e = theta_e_raw * tau_e;
  theta_p = theta_p_raw * tau_p;
  theta_ep = theta_ep_raw * tau_ep;
  sigma_epsilon = sigma_epsilon_raw * sigma_epsilon0;
}
model {
  vector[N] yHat;
  vector[N] ySd;
  vector[N] alpha;
  matrix[N, P] beta;
  vector[N] theta;
  for(i in 1:N){
    alpha[i] = alpha_0 + alpha_j[j[i]];
    beta[i] = beta_0 + beta_j[j[i]];
    theta[i] = theta_0 + theta_e[j[i]] + theta_p[k[i]] + theta_ep[j[i], k[i]];
    yHat[i] = alpha[i] + dot_product(X[i], beta[i]) + z[i] * theta[i];
    ySd[i] = sigma_epsilon[j[i]];
  }
  // likelihood
  y ~ normal(yHat, ySd);
  // priors
  alpha_0 ~ normal(0.5, 1);
  alpha_j_raw ~ normal(0, 1);
  beta_0 ~ normal(0, 1);
  for(c in 1:J){
    beta_j_raw[c] ~ normal(0, 1);
    theta_ep_raw[c] ~ normal(0, 1);
  }
  uber_mu ~ normal(0, 1);
  uber_sigma ~ normal(0, 1);
  theta_0 ~ normal(0, 1);
  theta_e_raw ~ normal(0, 1);
  theta_p_raw ~ normal(0, 1);
  sigma_alpha ~ normal(uber_mu, uber_sigma);
  sigma_beta ~ normal(uber_mu, uber_sigma);
  sigma_epsilon_raw ~ normal(uber_mu, uber_sigma);
  sigma_epsilon0 ~ normal(uber_mu, uber_sigma);
  tau_e ~ normal(uber_mu, uber_sigma);
  tau_p ~ normal(uber_mu, uber_sigma);
  tau_ep ~ normal(uber_mu, uber_sigma);
}