Ridge prior with QR decomposition


#1

I have a model with a relatively large p:n ratio, and I would like to implement a ridge-like penalty on the coefficients. I would also like to implement a QR-decomposition, for efficiency, following Section 2.2 of the Stan manual. Does it make sense to implement the ridge prior on the coefficients of the Q_ast matrix? I know this won’t be equivalent to the model that places the prior on beta, but would it have a similar effect?

Following the example in Section 2.2:

model {
  // likelihood
  y ~ normal(Q_ast * theta + alpha, sigma);
  // priors
  theta ~ normal(mu_theta, sd_theta);
}

#2

If you haven’t already, you should check out the case study for QR regression in Stan. QR decomposition should invoke roughly comparable scales for the thetas. Shrinkage also applied, but it’s a little harder to know apriori how it’ll behave. You can also put a prior directly on beta (computed in the transformed parameter block) - since R inverse is a linear transformation, you won’t need a Jacobian correction. This is also shown in the case study, I think.

Sorry, I don’t have so much time right now. If I’ll remember it, I’ll post a little code snippet later that I use to assess what implications my prior on theta have for the betas.


#3

Thanks Max for pointing me to the case study. I do like this approach better than how it’s described in the Stan manual - having beta as a transformed parameter rather than a generated quantity gives more flexibility in the modeling.


#4

I couple follow-up questions on QR.

It seems more natural to me to have the thing we are putting the prior on be a “parameter.” Instead of having beta_tilde be defined as a parameter and beta be defined as a “transformed parameter”, can I switch them, and have beta be the parameter and beta_tilde be transformed?

Also, before implementing the QR decomposition, I was fitting the model using a “Matt trick” on beta. i.e., in the “transformed parameters” section I would define beta = mu_beta + sigma_beta * beta_raw, where beta_raw has a N(0,1) prior. Note that this allows the prior mean for beta to be non-zero (so shrinkage of the betas can go towards a non-zero number).

Putting these two concepts together, I would like to set up my model as follows:

parameters {
    real alpha;
    real mu_beta;
    vector[P] beta_raw;
    real<lower = 0> sigma_beta;
    real<lower = 0> sigma;
}
transformed parameters {
    vector[P] beta = mu_beta + sigma_beta * beta_raw;
    vector[P] theta = R * beta;
}
model {
    // Priors
    mu_beta ~ normal(0,10);
    beta_raw ~ normal(0,1);
    sigma_beta ~ normal(0,1);

    // Likelihood
    y ~ normal(Q_ast * theta + alpha, sigma);
}

Does this make sense? Are there any trade-offs in doing it this way compared to the way it is modeled in the “case study”?


#5

Yes, I think your approach should work.

However, note that putting priors on \beta is probably less (computationally) efficient than putting them on \theta. The prior on \theta implicitly takes into account the (empirical) correlations in your predictor matrix, which a prior on \beta won’t (see case study). There won’t be much difference if these correlations are low – but then you wouldn’t be using QR decomposition, I guess. The columns in Q should all have unit standard deviation, so shrinkage should work fine with \theta.

Regarding NCP (or “Matt trick”), I would probably go on about it like this (put prior on \theta and shrink towards 0 and not towards \mu_{\beta}, but I don’t know if you did that intentionally):

parameters {
    real alpha;
    vector[P] theta_raw;
    real<lower = 0> sigma_theta;
    real<lower = 0> sigma;
}
transformed parameters {
    vector[P] theta = 0 + sigma_theta * theta_raw; // shrinkage towards zero...
    vector[P] beta = R_inv * theta ; // this can also be put in generated quantities
}
model {
    // Priors
    alpha ~ ... // I'd use the sample median and MAD of y for mean and sd respectively
    sigma ~ ...
    theta_raw ~ std_normal(); // equiv. to normal(0,1)
    sigma_theta ~ std_normal(); // equiv. to normal(0,1)

    // Likelihood
    y ~ normal(Q_ast * theta + alpha, sigma);
}

I guess you already read something about the Finnish horseshoe prior, which is the preferred approach towards shrinkage if you ask the people around here. (I don’t know much about it myself, I’m afraid.)