# Which Input for Bayesian R² when Using Stan (not rstanarm)

Dear All,

I am checking a hierarchical model for predictive power and want to compare the variance explained by two models, one with uninformative and one with weakly informative priors. I am using the bayes_R2() function from the rstanarm package. My Stan file produces generated quantities from a Laplace (double-exponential) likelihood, whose location parameter is calculated as a time-series process with controls. My question: should I calculate the Bayesian R² using the predicted means (“y_hat”) or the predicted values (“y3”) as y_pred?

I attach the Stan file for the model with uninformative priors, since it is shorter.

Thank you all so much in advance.

data {
  int<lower=1> N;                  // number of observations
  int<lower=1> J;                  // number of industries
  int<lower=1> T;                  // number of periods
  int<lower=1, upper=J> industry[N];
  int<lower=1, upper=T> year[N];
  vector[N] x;                     // RKT
  vector[N] y;                     // IFTC
  vector<lower=-0.5>[N - J] y2;    // IFTC for all but the first period
  vector[N] kl;
  vector[N] go;
  vector[N] pg;
}
parameters {
  vector[J] alpha;              // intercepts
  real beta;                    // autoregressive coefficient
  vector[2] gamma;              // slope parameters for RKT, RKT(-1)
  real delta;
  real eta;
  real mu_alpha;
  real mu_beta;
  real mu_gamma;
  real mu_delta;
  real mu_eta;
  real<lower=0> sigma;          // scale of the observation noise
  real<lower=0> sigma_alpha;    // scale of the intercepts
  real<lower=0> sigma_beta;     // scale of the slope
  real<lower=0> sigma_gamma;
  real<lower=0> sigma_delta;
  real<lower=0> sigma_eta;
  real theta;                   // moving-average coefficient
  real mu_theta;
  real<lower=0> sigma_theta;
}

model {
  vector[N - J] y_hat;    // predicted means
  vector[N] err;          // error at time t
  int pos;                // position in the raw dataset
  int local_pos;          // contiguous index into y_hat only
  pos = 0;
  local_pos = 0;

  for (j in 1:J) {
    err[(j - 1) * T + 1] = 0;    // no prediction for the first period of each group, so set its error to 0
    for (t in 2:T) {
      local_pos = local_pos + 1;
      pos = (j - 1) * T + t;     // periods of earlier groups plus the current period
      y_hat[local_pos] = alpha[industry[pos]] + y[pos - 1] * beta
                         + x[pos] * gamma[1] + x[pos - 1] * gamma[2]
                         + err[pos - 1] * theta
                         + log(kl[pos]) * delta + log(go[pos]) * eta;
                         // + log(pg[pos]) * iota;
      err[pos] = y2[local_pos] - y_hat[local_pos];
    }
  }

  alpha ~ normal(mu_alpha, sigma_alpha);
  beta ~ normal(mu_beta, sigma_beta);
  gamma ~ normal(mu_gamma, sigma_gamma);
  delta ~ normal(mu_delta, sigma_delta);
  eta ~ normal(mu_eta, sigma_eta);
  theta ~ normal(mu_theta, sigma_theta);
  sigma ~ cauchy(0, 2.5);
  mu_alpha ~ normal(0, 0.09);
  sigma_alpha ~ cauchy(0, 25);
  mu_beta ~ normal(0, 0.09);
  sigma_beta ~ cauchy(0, 25);
  mu_gamma ~ normal(0, 0.09);
  sigma_gamma ~ cauchy(0, 25);
  mu_delta ~ normal(0, 0.09);
  sigma_delta ~ cauchy(0, 25);
  mu_eta ~ normal(0, 1);
  sigma_eta ~ cauchy(0, 25);
  mu_theta ~ normal(0, 0.09);
  sigma_theta ~ cauchy(0, 25);

  y2 ~ double_exponential(y_hat, sigma);
}
generated quantities {
  vector[N - J] y_hat;    // predicted means
  vector[N - J] y3;       // posterior predictive draws
  vector[N] err;          // error at time t
  int pos;                // position in the raw dataset
  int local_pos;          // contiguous index into y_hat only
  pos = 0;
  local_pos = 0;

  for (j in 1:J) {
    err[(j - 1) * T + 1] = 0;    // no prediction for the first period of each group, so set its error to 0
    for (t in 2:T) {
      local_pos = local_pos + 1;
      pos = (j - 1) * T + t;     // periods of earlier groups plus the current period
      y_hat[local_pos] = alpha[industry[pos]] + y[pos - 1] * beta
                         + x[pos] * gamma[1] + x[pos - 1] * gamma[2]
                         + err[pos - 1] * theta
                         + log(kl[pos]) * delta + log(go[pos]) * eta;
                         // + log(pg[pos]) * iota;
      err[pos] = y2[local_pos] - y_hat[local_pos];
    }
  }

  for (m in 1:(N - J))
    y3[m] = double_exponential_rng(y_hat[m], sigma);
}


Hi,

You might want to check out this paper (specifically page 7): there the linear predictor (after applying the inverse link function) is used. In your case that would be y_hat, the predicted mean, if I’m not mistaken.

You probably know that the Bayesian R² only really makes sense in the linear-regression case: the more you deviate from normality, the less “useful” R² becomes. Its main advantages are that it is relatively cheap to compute and that people are somewhat familiar with it (at least conceptually). The paper talks about that a bit.
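To make that concrete, here is a minimal numpy sketch of the per-draw Bayesian R² (var of fit over var of fit plus var of residuals), computed from the predicted means (y_hat), not the predictive draws (y3). The draws matrix below is synthetic placeholder data, standing in for the S×N matrix you would get from something like rstan::extract(fit, "y_hat"):

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 1000, 50  # hypothetical: S posterior draws, N observations

# placeholder for the posterior draws of the linear predictor y_hat (S x N)
signal = np.linspace(-1.0, 1.0, N)
y_hat_draws = signal + rng.normal(0.0, 0.2, size=(S, N))

# placeholder for the observed outcome (Laplace noise around the signal)
y_obs = signal + rng.laplace(0.0, 0.3, size=N)

def bayes_r2(y, y_pred_draws):
    """Per-draw Bayesian R^2: var(predicted means) / (var(predicted means) + var(residuals))."""
    var_fit = y_pred_draws.var(axis=1)          # variance of the fit, one value per draw
    var_res = (y - y_pred_draws).var(axis=1)    # variance of the residuals, one value per draw
    return var_fit / (var_fit + var_res)

r2 = bayes_r2(y_obs, y_hat_draws)  # distribution of R^2 over posterior draws
```

Summarize r2 with its posterior mean or median and an interval. Note that with a Laplace likelihood you could alternatively plug in the model-based residual variance 2·sigma² per draw in place of var_res; using the y3 draws instead of y_hat would mix the observation noise into the fit variance, which is why the predicted means are the right input.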

Cheers!

Thank you!