Hello everyone,
I have a question about the QR decomposition using the notation in the User’s Guide where \eta = x\beta = Q^*R^*\beta = Q^* \theta, \theta = {R^*}^{-1}\beta, where \eta is N \times 1, x is N \times X, and \beta is X \times 1. I am specifically unsure about whether or not it is possible or sensible to use the QR decomposition with the R2D2 prior.
Consider the case where the design matrix x has all predictors standardised to one standard deviation and the residual scale of \eta is \sigma = 1. Then the R2D2 prior for the regression coefficients \beta is as follows in Stan code (I believe):
data {
int N, X; // number of observations and predictors
matrix[N, X] x; // predictors
vector[N] eta; // response
}
transformed data {
matrix[N, X] Q_ast = qr_thin_Q(x) * sqrt(I - 1);
matrix[X, X] R_ast, R_ast_inv;
R_ast = qr_thin_R(x) / sqrt(I - 1);
R_ast_inv = inverse(R_ast_inv);
}
parameters {
real<lower=0, upper=1> R_sq;
simplex[X] phi;
vector[X] z;
}
transformed parameters {
real tau_sq = R_sq / (1 - R_sq);
vector[X] beta = z .* sqrt(phi * tau_sq)
}
model {
R_sq ~ beta(1, 1);
phi ~ dirichlet(ones_vector(X))
z ~ std_normal();
}
I was under the impression that the efficiency of the QR decomposition comes from that you can do Q^*\theta instead of Q^*R^*\beta. However, I believe that to scale beta
in the Stan program by R^* I end up having to multiply it for each iteration anyway, effectively getting back Q^*R^*\beta:
transformed parameters {
// scale beta back to be theta
vector[X] theta = R_ast * beta;
}
So my question is, assuming that I haven’t made any great errors above, can the efficiency of the QR decomposition still be exploited when using R2D2 priors?
Thanks!
Matt
2 Likes
The efficiency of the QR decomposition has to do with the fact that the correlations in the elements of \theta should be lower than the correlations in the elements of \beta. In the limit of very high correlations between predictors and lots of data, the corresponding elements of \beta become arbitrarily strongly anticorrelated along an arbitrarily long ridge, yielding an arbitrarily challenging geometry. QR avoids this. You should be declaring theta
as the parameter and beta
as the transformed parameter. You can then still apply the R2D2 prior (note that the linear transform requires no Jacobian adjustment provided that you don’t care about normalizing constants), unless I’m missing something important (always possible).
1 Like
OK, so by which (co)variances would I be scaling the R2D2 prior if I put that prior on theta
? Btw, I sent an email to your gmail account about an unrelated topic, not sure if you got it. Cheers.
Technically you can put R2D2 prior on \theta, but…
- In case of \beta it is assumed coefficients are exhangeable
- In case of \theta we know that they are ordered with decreasing expected explained variance, so you are changing the meaning of the prior
3 Likes
Just to clarify: you can place the R2D2 prior directly on \beta even when \beta is a transformed parameter in a QR context. And because the transformation is linear, you don’t need to worry about the Jacobian, provided you don’t care about normalizing constants (i.e. you aren’t working with Bayes factors).
I think I see what you’re saying, but my concern here would be that you don’t get any of the efficiency benefits because once you’ve put the prior back on \beta, you’re essentially just doing QR\beta = x\beta, no?
No, because you are not parameterizing in terms of \beta anymore, you are parameterizing in terms of \theta. The key is not in the matrix algebra, but rather in the choice of parameterization. It is true that by expressing a prior on \beta when you parameterize by \theta you will need to autodiff through the transformation, but this shouldn’t be terribly burdensome.
I’m sorry that I’m still not really understanding. Is there any chance you could write what you mean using the Stan program in the first post? I’m just having trouble understanding how you can place the prior on \beta without having to scale by R^* when using QR decomposition. Aki’s answer makes sense to me, but I’m just having trouble visualising your point.
I’m not an expert here, but what I’m suggesting is the following. Does this seem correct to you? Instead of
parameters {
vector[X] z;
}
transformed parameters {
vector[X] beta = z .* sqrt(phi * tau_sq);
}
model {
z ~ std_normal();
}
you can do
parameters {
vector[X] z;
}
transformed parameters {
vector[X] z = beta ./ sqrt(phi * tau_sq);
}
model {
z ~ std_normal();
target += // a Jacobian adjustment term here
}
Given this program, you can further modify it to do QR on beta
.
1 Like
Hey, I think my confusion came from that I didn’t realise you can put priors on transformed parameters! Thanks for that. Not sure if you saw in my post above, but I also sent an email to your gmail.
1 Like