Prior on R2

Hi Stanimals,

I am trying to follow the logic and derivation behind the idea of putting a prior on R^2 for linear regression models, and to write a step-by-step derivation to fully comprehend it.

I have to admit that I struggle a bit with the current description/derivation in the prior section of this vignette for rstanarm, in particular the identity relating \theta to \rho_k. I see how this works in the single-variable case (the best linear predictor interpretation of OLS) but not in the multiple regression case…

Is there an extended version of the vignette available somewhere, or any relevant literature pointers or derivations?

Otherwise I would ask my questions here…

@jonah @bgoodri?

\boldsymbol{\theta} is the vector of coefficients from regressing the outcome on the columns of \mathbf{Q}, which are orthogonal to each other. So, the coefficient on the k-th column is just the correlation \rho_k times the ratio of the standard deviation of the outcome to the standard deviation of that column of \mathbf{Q} (which is \frac{1}{\sqrt{N - 1}}): \theta_k = \rho_k \frac{\sigma_y}{1 / \sqrt{N - 1}} = \sqrt{N - 1} \, \sigma_y \, \rho_k.
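
As a quick numerical check of that identity, here is a small R sketch with simulated data (N, K, X_c, and y are placeholders here, not the clouds example further down):

set.seed(1)
N <- 100
K <- 3
X_c <- scale(matrix(rnorm(N * K), N, K), scale = FALSE)  # centered design
y <- rnorm(N)
Q <- qr.Q(qr(X_c))                  # thin Q factor, N x K
apply(Q, 2, sd)                     # each column: 1 / sqrt(N - 1)
theta_ols <- drop(crossprod(Q, y))  # OLS coefficients on Q, since Q'Q = I
theta_id <- sqrt(N - 1) * sd(y) * drop(cor(Q, y))  # the identity above
all.equal(theta_ols, theta_id)      # TRUE: exact because Q is centered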

The first part of section 2.1 in this paper might be helpful?

Thanks. Just another related question: does centering \mathbf{X} imply that \mathbf{Q} is centered as well (in addition to having uncorrelated columns)? If so, how can you show that easily? This is the only step missing for me to complete the derivation.

If \mathbf{X} without an intercept has centered columns, the Q factor in its QR decomposition is the same as the Q factor, after its first column has been removed, of the QR decomposition of a design matrix with an intercept in the first column and no centering. And the remaining columns must have mean zero in order to be orthogonal to the removed (constant) column.
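
This is easy to check numerically. A small R sketch with simulated data (each column of a Q factor is only determined up to a sign flip, so the comparison uses absolute values):

set.seed(1)
N <- 50
K <- 4
X <- matrix(rnorm(N * K), N, K)
Q_centered <- qr.Q(qr(scale(X, scale = FALSE)))  # Q of centered X, no intercept
Q_intercept <- qr.Q(qr(cbind(1, X)))[, -1]       # Q of [1 | X], first column removed
colMeans(Q_centered)                             # all ~ 0
max(abs(abs(Q_centered) - abs(Q_intercept)))     # ~ 0: same columns up to sign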

Thanks @bgoodri.

Is it possible to get a self-contained example (Stan and R code) to see what's actually implemented? I gave it a try, but I am sure this is not yet what you describe in the vignette and above.

data {
    int<lower=1> N;
    int<lower=1> K;
    matrix[N,K] X;                 // centered design matrix (no intercept)
    vector[N] y;
    real<lower=0> eta;             // LKJ shape parameter
    real<lower=0> s_y;             // sample standard deviation of y
}
transformed data {
    matrix[N,K] Q = qr_thin_Q(X);  // columns are orthogonal with mean zero
    matrix[K,K] R = qr_thin_R(X);
    matrix[K,K] R_inv = inverse(R);
}
parameters {
    cholesky_factor_corr[K+1] L;   // joint correlation matrix of (y, Q)
    real<lower=0> omega;           // multiplier for the marginal SD of y
}
transformed parameters {
    matrix[K+1,K+1] corr = multiply_lower_tri_self_transpose(L);
    vector[K] rho = corr[2:, 1];   // correlations between y and the columns of Q
    real<lower=0> sigma_y = omega * s_y;
    vector[K] theta = sqrt(N - 1) * sigma_y * rho;  // identity from above
    real R2 = dot_product(rho, rho);                // columns of Q are orthogonal
    // note: nothing here forces R2 <= 1, so sigma can evaluate to NaN
    real<lower=0> sigma = sigma_y * sqrt(1 - R2);   // residual SD
}
model {
    L ~ lkj_corr_cholesky(eta);
    target += -2 * log(omega);     // improper prior p(omega) proportional to omega^-2
    y ~ normal(Q * theta, sigma);
}
generated quantities {
    vector[K] beta = R_inv * theta;  // coefficients on the original scale of X
}

and here some R-code:

library(rstan)
data("clouds", package = "HSAUR3")

ols <- lm(rainfall ~ seeding * (sne + cloudcover + prewetness + echomotion) +
            time, data = clouds)
X <- model.matrix(ols)
X <- X[, -1]                     # drop the intercept column
X_c <- scale(X, scale = FALSE)   # center the predictors
y <- clouds$rainfall
stan_data <- list(
  N   = nrow(X_c),
  K   = ncol(X_c),
  y   = y,
  X   = X_c,
  eta = ncol(X_c) / 2,
  s_y = sd(y)
)
fit <- stan("~/Desktop/prior_r2.stan", data = stan_data,
            chains = 4, cores = 1, seed = 42)

Any help is appreciated.

The Stan code for that is fairly self-contained.

The R code is less so, but most of the action is actually in stan_biglm.fit.

Thanks for linking that paper. R2-D2 looks pretty cool.

I guess the possible advantage over a Regularized Horseshoe prior would be that you get to specify a prior on R^2 instead of on the proportion of non-zero coefficients. They also mention in that paper that the prior density on the coefficients is unbounded at zero, which leads to tighter shrinkage around zero.

Has anyone tried R2-D2 in Stan?
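
For a concrete starting point, here is a minimal Stan sketch of a simplified R2-D2-style prior. To keep it short it uses a Gaussian scale mixture instead of the double-exponential marginal from the paper, and the Dirichlet concentration and the prior on sigma are illustrative choices rather than anything from the paper:

data {
    int<lower=1> N;
    int<lower=1> K;
    matrix[N,K] X;       // assumed to have standardized columns
    vector[N] y;
    real<lower=0> a;     // shape parameters of the Beta prior on R^2
    real<lower=0> b;
}
parameters {
    real<lower=0, upper=1> R2;  // prior is placed directly on the model fit
    simplex[K] phi;             // Dirichlet split of the signal variance
    vector[K] z;                // standardized coefficients
    real<lower=0> sigma;
}
transformed parameters {
    real<lower=0> tau2 = R2 / (1 - R2);               // total signal-to-noise ratio
    vector[K] beta = z .* sqrt(sigma^2 * tau2 * phi); // per-coefficient scales
}
model {
    target += beta_lpdf(R2 | a, b);        // R^2 ~ Beta(a, b)
    phi ~ dirichlet(rep_vector(1.0, K));   // illustrative concentration
    z ~ std_normal();
    sigma ~ exponential(1);                // illustrative prior on sigma
    y ~ normal(X * beta, sigma);
}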