# Sampler getting stuck?

I’ll describe my model further below, but the basic problem is the following. I fit a model in Stan as a test run, with 3 chains and 10 iterations (in parallel), and the model fit in under 4 minutes. I then increased the number of iterations to 1000 and the number of chains to 4, and let it run overnight. However, when I check on the progress (in the “Viewer” tab of RStudio), Stan hasn’t reported anything past iteration 1/1000. In theory the first 100 iterations should finish in under 40 minutes, so with no progress being reported overnight, it’s telling me the sampler is getting stuck on something. I’ve let it go for up to 48 hours, with no progress observed.

Here are the details:

I have J=14 outcomes, K=23 predictors of interest, and P=21 additional covariates. I have been asked by my frequentist colleagues to fit J\times K different linear regression models, each one regressing one of the J outcomes on one of the K predictors, controlling for the additional P covariates. Inspired by this paper , I would instead like to fit these models all at once, in a Bayesian framework. The full model can be expressed as follows:

y_{ij}=\alpha_{jk} + \beta_{jk}X_i + \theta_{jk}Z_{ik} + \epsilon_{ijk}

Here, y_{ij} is the value of outcome j for unit i, X_i is the vector of covariates for unit i, and Z_{ik} is predictor k for unit i. We put structure on the priors for the \alpha_{jk}, \beta_{jk}, and \theta_{jk}, to borrow strength in both directions, as follows:

\alpha_{jk} = \alpha_0 + \alpha_j + a_{jk}
\beta_{jk} = \beta_0 + \beta_j + b_{jk}
\theta_{jk} = \theta_0 + \theta^E_j + \theta^P_k + \theta^{EP}_{jk}

We put (normal) shrinkage priors across related parameters as appropriate, N(0,1) priors for hyper-means, and half-normal(0,1) priors on variance components.

All of the outcomes are on a similar scale (potential values are fractions from 0 to 1). I set up the data in “long” format, with one row for every combination of i, j, and k. Note that this means that each outcome y_{ij} is repeated K times, which is kind of weird. X is centered and scaled, and Z is centered without scaling (centered separately for each k, actually), just to leave a 1-unit change as interpretable. All the Z predictors have the same potential range (from 1 to 12).

Some things that I’ve tried:

1. Reducing the complexity of the model. For example, dropping the a_{jk} and b_{jk} components. We also originally included unit-specific (i) effects, which we have since dropped. We also originally had a more complicated multi-level prior structure for the different effects (the outcomes and predictors of interest both are “grouped”), but took this out because it adds a lot more parameters to the model.
2. We added a prior across all of the variance components to try to stabilize them - the “uber-prior” below.

So far nothing has worked. Any suggestions? One thing I’m curious about, could taking out some of the parameters that provide structure (in order to reduce model complexity) actually make the fit take longer, because that structure is helpful?

Here is the current Stan code, with the a_{jk} and b_{jk} terms removed:

data{
int<lower=0> N;       // number of observations
int<lower=0> J;       // number of outcomes
int<lower=0> K;       // number of predictors of interest
int<lower=0> P;       // number of background covariates

vector[N] y;                       // outcome values
vector[N] z;                       // predictors
int<lower=0, upper = J> j[N];      // outcome index
int<lower=0, upper = K> k[N];      // predictor index
matrix[N, P] X;                    // background covariates

}

parameters{
real alpha_0;                 // intercept components
vector[J] alpha_j_raw;

row_vector[P] beta_0;                   // controls for covariates
matrix[J, P] beta_j_raw;

real theta_0;                    // overall effect
vector[J] theta_e_raw;           // outcome-specific term
vector[K] theta_p_raw;           // predictor-specific term
matrix[J, K] theta_ep_raw;       // intxn

real<lower=0> sigma_alpha;     // variance terms
real<lower=0> sigma_beta;
vector<lower=0>[J] sigma_epsilon_raw;
real<lower=0> sigma_epsilon0;
real<lower=0> tau_e;
real<lower=0> tau_p;
real<lower=0>tau_ep;

real<lower=0> uber_mu;
real<lower=0> uber_sigma;
}

transformed parameters {
vector[J] alpha_j;
matrix[J, P] beta_j;
vector[J] theta_e;
vector[K] theta_p;
matrix[J, K] theta_ep;
vector<lower=0>[J] sigma_epsilon;

// matt tricks
alpha_j = alpha_j_raw*sigma_alpha;
beta_j = beta_j_raw*sigma_beta;
theta_e = theta_e_raw*tau_e;
theta_p = theta_p_raw*tau_p;
theta_ep = theta_ep_raw*tau_ep;
sigma_epsilon = sigma_epsilon_raw*sigma_epsilon0;
}

model {
vector[N] yHat;
vector[N] ySd;
vector[N] alpha;
matrix[N, P] beta;
vector[N] theta;

for(i in 1:N){
alpha[i] = alpha_0 + alpha_j[j[i]];
beta[i] = beta_0 + beta_j[j[i]];
theta[i] = theta_0 + theta_e[j[i]] + theta_p[k[i]] + theta_ep[j[i], k[i]];

yHat[i] = alpha[i] + dot_product(X[i], beta[i]) + z[i]*theta[i];
ySd[i] = sigma_epsilon[j[i]];
}

// likelihood
y~normal(yHat, ySd);

alpha_0 ~ normal(0.5, 1);
alpha_j_raw ~ normal(0, 1);
beta_0 ~ normal(0, 1);

for(c in 1:J){
beta_j_raw[c] ~ normal(0, 1);
theta_ep_raw[c] ~ normal(0, 1);
}

uber_mu ~ normal(0, 1);
uber_sigma ~ normal(0, 1);

theta_0 ~ normal(0, 1);
theta_e_raw ~ normal(0, 1);
theta_p_raw ~ normal(0, 1);

sigma_alpha ~ normal(uber_mu, uber_sigma);
sigma_beta ~ normal(uber_mu, uber_sigma);
sigma_epsilon_raw ~ normal(uber_mu, uber_sigma);
sigma_epsilon0 ~ normal(uber_mu, uber_sigma);
tau_e ~ normal(uber_mu, uber_sigma);
tau_p ~ normal(uber_mu, uber_sigma);
tau_ep ~ normal(uber_mu, uber_sigma);
}


With that many predictors, you really need to consider using the QR reparameterization. Can you get the model to work with one outcome using rstanarm::stan_lmer?

I didn’t think about the number of predictors being a problem, thanks for this. I will try this, as well as fitting the model with all outcomes, but removing the covariates X, just to see if it will go.

Is there a way using stan_lm to specify the shrinkage prior across the $\theta$s? For the one-outcome case, it would just be \$\theta_k ~ N(\theta_0, \sigma_{\theta}), i.e. a ridge penalty.

You can do that with stan_glm. The prior in stan_lm shrinks a lot, but it is not exactly a ridge penalty. It is like a ridge penalty in Q-space, rather than X-space.

1 Like

Sorry, I may not have been clear. In the model I will have two types of “slopes”: \beta, which is of dimension P and are the coefficients for the covariates X, and \theta, which is of dimension K and are the coefficients for the “predictors of interest” Z. I would like a shrinkage prior on just the \theta's, i.e., \theta \sim N(\theta_0, \sigma_{\theta}). As far as I know, there is no way to specify that in either stan_glm or stan_lm. Is that correct?

I was able to fit the model using stan_lm with the standard R2 prior, it fit in under an hour, but I did get the following warnings:

> fit_lm  <- stan_lm(rsa_fmla, data = data, prior = R2(location = 0.2),
+                    iter = 1000, chains = 4, thin=1, cores = 4, init_r = .5,
+                    control = list(adapt_delta = 0.99, stepsize = 0.01, max_treedepth = 18))
There were 2000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 18. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceededExamine the pairs() plot to diagnose sampling problems
Markov chains did not converge! Do not analyze results!>


Of course since I only had one outcome, the model is 1/14 the size of the model that I actually want to fit. In order to fit the full model, would you say that the key will be implementing the QR reparameterization directly in Stan, following section 9.2 of the Stan manual? Do you have any suggestions on how I can do this, while staying true to the hierarchical prior structure that I described in my first post?

With a normal prior in stan_glm, you can specify a vector of scales so that they are smaller for the coefficients of the nuisance predictors and larger for the coefficients of interest. However, the distinction gets muddled if you do a QR reparameterization, although a QR reparameterization is almost always a good thing. You can put the nuisance predictors first and the predictor(s) of interest last. In the latest version of rstanarm, you can specify QR = TRUE and prior = normal(*, autoscale = FALSE), which ironically autoscales relative to the scale of the last predictor, so the prior on the last coefficient is unaffected by the QR reparameterization.

The prior in stan_lm shrinks globally without distinguishing among the nature of predictors. And it seems as if your data require that it takes very small steps. So, you will probably have to do the QR thing in the manual yourself in a Stan program. The manual scales by sqrt(N - 1), which corresponds to normal(*, autoscale = TRUE). To scale to the last variable, just divide all elements of R by its bottom-right cell and multiply all elements of Q by that number to cancel.

Or you could try a polar decomposition, which produces an orthogonal matrix that is closest to the original matrix in the least-squares sense and so arguably the priors you want to specify for the coefficients on the original matrix are appropriate for the coefficients on the orthogonal matrix (perhaps with a bit of extra uncertainty added).

Thanks, this is helpful. When doing the QR decomposition, do you have any suggestions on what type of prior to place on the coefficients of Q^* (theta in the Stan manual)? The manual does not specify a prior for these parameters. Am I correct in assuming that a normal prior is probably still appropriate, on the Q^*-scale?

Also FYI, the run that I started last night, which includes all outcomes but omits all of the nuisance covariates X, is progressing (it’s currently about 50% done). So it seems that if I can implement the QR decomposition to include the covariates more efficiently, it would help.

Is there any QR-type trick that would help with the \theta_{jk} parameters? It’s possible to express the model above with Z_{ik} being a big N\times K matrix with lots of zeroes (all elements in a row except for one are zero), but if we do a QR decomposition of this matrix then I don’t know how we would implement the hierarchical shrinkage prior on the Q^* scale (I don’t think this is possible).

The only prior that has been specifically derived for \boldsymbol{\theta} is the one used by stan_lm. It is maxent given an expected log R^2. Diffuse independent normals with mean zero tend to work fine if you scale the factors of the QR decomposition appropriately. @avehtari has in the past found that the Finnish Horseshoe prior continues to work well, even through it was motivated for the coefficients on X. You can think of \theta_k triangularly as being proportional to coefficient on the k-th column of X conditional on all the previous columns of X but not the subsequent columns of X. That is why the very last element of \theta is proportional to the last element of \beta, and if you scale by the bottom right element of R, then they are equal.

And yes, you should glue \mathbf{X} and \mathbf{Z} together, center their columns, do the QR decomposition, and then scale.

Let’s see if I’m actually understanding you correctly. Recall that I want do to the following:

y_{ijk} \equiv y_{ij} = \alpha_{jk} + X_i\beta_{jk} + z_{ik}\theta_{jk} + \epsilon_{ijk}

\alpha_{jk} = \alpha_0 + \alpha_j + a_{jk}
\beta_{jk} = \beta_0 + \beta_j + b_{jk}
\theta_{jk} = \theta_0 + \theta^E_j + \theta^P_k + \theta^{EP}_{jk}

\alpha_j \sim N(0,\sigma_{\alpha})
a_{jk} \sim N(0,\sigma_{a})
\beta_j \sim N(0,\sigma_{\beta})
b_{jk} \sim N(0,\sigma_{b})
\theta^E_j \sim N(0,\sigma^E_{\theta})
\theta^P_k \sim N(0,\sigma^P_{\theta})
\theta^{EP}_{jk} \sim N(0,\sigma^{EP}_{\theta})

Note that my "a"s look like "alpha"s… sorry about that, hopefully it’s clear where they differ. Also note that each \beta and b is of dimension P.

If I implement a QR decomposition on [1\, X\,Z] = Q^*Z^*, is it then possible to express the model as
y_l \equiv y_{ijk} = Q_l^*R_l^*\phi^*_{jk}, where Q_l^* and R_l^* are the l^{th} rows of Q^* and R^*? And then I can do the hierarchical prior on \phi^*_{jk} instead of it being a simple normal as discussed before?

Note that I edited my previous post, so if you saw the first version of it please disregard. Also in the formulation above, I think Z only needs to be the one column of the z_{ik} value that corresponds to each row.

Yeah, if you just want to QR decompose X, that is possible. Usually without the column of ones (if you have a scalar intercept) but with centering.

QR transformation can be considered as finding independent latent variables, and if only some of those latent variables have effectively non-zero coefficient then regularized (Finnish) horseshoe prior can be useful. There might be better transformations than QR, but at least in a few examples this has worked with QR, too.

Cool, thanks. The reason I suggested including the column of 1’s was because we want the intercept to vary by both j and k, and this seems like a way to allow that. Why may I ask do you think I should remove the 1’s?

The biggest thing was that I didn’t know that after applying the QR decomposition, the rows of the Q^* matrix still correspond to the observations from the original dataset. This allows me to implement the hierarchical prior in Q^*-space. Basically it will be:

\phi^*_{jk} = \phi_0 + \phi_j^E + \phi_k^P + \phi_{jk}^{EP}

The only thing I don’t like here is that there is no way to apply separate priors to the \alpha/\beta/\theta elements of \phi^*. Like in the above specification, how there are difference variance components for each of the normal priors.

I think I’ll try this both ways: (1) keeping the 1/X/Z components of the design matrix separate and only applying the QR decomposition to X, and (2) applying the decomposition to the full design matrix.

I’ll update on how it goes, and thanks again for your help!

The reason we don’t usually have a column of ones is that alpha is a separate parameter. So, you get stuff like vector[N] eta = alpha + Q * theta;. There is no need to multiply a “coefficient” by 1 N times.

I was suggesting including the 1’s in the design matrix that gets QR-decomposed, and not including a separate parameter for alpha. The thought was that you would gain efficiency due to the QR decomposition by combining them. But if that’s not true, then I agree there’s no reason to combine them.

Centering the columns of a design matrix without a column of ones before doing a QR decomposition is equivalent to doing a QR decomposition on a matrix with a column of ones in the first position. That is because the QR decomposition is orthogonalizing each column relative to all the columns previous to it.

hi @jgellar ! Fun to have a reason to look back at this and bug you. :)

how in rstanarm can we specify a ridge penalty for regression coefficients, i.e. in the prior = here: https://cran.r-project.org/web/packages/rstanarm/vignettes/priors.html

thanks!

Hi @shira!

Quick response because I’m on my phone, but I don’t want to forget to respond. It depends on exactly what your goals are and what your Xs are. If you’re just looking for something that does shrinkage across all the betas, I’m pretty sure that’s the default with stan_lm with the QR-decomposition. That’s not exactly ridge regression, but I think it has the same effect, and in most cases it’s probably better because you don’t have to worry about variables on different scales and things like that, since that’s all taken care of by QR, and it’s more efficient.

There are some cases where you might prefer the actual ridge (normal on the betas), if for some reason you want to keep all the Xs on their natural scale due to something about their interpretation or something. I don’t think there is a built-in way to do this in rstanarm - but @bgoodri could correct me if that’s not true. I wrote some add-on code to mgcv that allows you to fit this model as a GAM using mgcv::gam, and since rstanarm allows mgcv-style smooths it might work in rstanarm. (I may have already steered this at some point before but I don’t recall whether or not omit worked). Reach out to me next week (after the holiday) and if you want me to send that to you.

There is no classical ridge prior with an unknown ridge. You can do prior = laplace(...) to use a Laplace prior with a specified scale or prior = lasso(...) with an estimated scale and in either case you can specify auto_scale to put it on the standard deviation scale or the raw scale.