Hi, I have an awkward issue with scaling for a Poisson model.
It's a Poisson model for counts of student engagement.
The covariates span very different ranges: some are in [0, 1], others in [1, 20000].
In order to make Stan run smoothly, I’d like to scale the covariates to all be in the same range. Two immediate thoughts on this:
Scale each covariate to be [0,1]
We have population in the model, so scale each covariate as a percentage of population
i.e.
student_count ~ pois( b[1] * percent_hispanic + b[2] * income + b[3] * sat_score)
Both will run fine in Stan. The problem comes when trying to estimate predicted counts. Because every covariate is now in a compressed range, the posterior samples for the counts are always too low. My intuition is that I need to un-scale the posterior samples in some way, but I'm not sure how. Any suggestions on how to attack this?
Don’t change the posterior samples unless you really know what you are doing. But the predictions should be fine if your population values are scaled in the same way as in your sample.
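For example, here is a minimal base-R sketch of that idea (dat and newdata are hypothetical data frames holding the covariates): keep the scaling constants from the sample and divide any new data by the same constants before predicting.

vars <- c("percent_hispanic", "income", "sat_score")
scale_max <- sapply(dat[vars], max)   # scaling constants, from the sample only

dat_scaled <- dat
for (v in vars) dat_scaled[[v]] <- dat[[v]] / scale_max[[v]]
# ... fit the Poisson model on dat_scaled ...

# To predict, divide the new covariates by the SAME constants:
newdata_scaled <- newdata
for (v in vars) newdata_scaled[[v]] <- newdata[[v]] / scale_max[[v]]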
But scaling the predictors marginally is not optimal. It often works much better to do a QR reparameterization of the design matrix X, which is discussed in the manual and is a strongly recommended option in rstanarm, where your model would be
library(rstanarm)
options(mc.cores = parallel::detectCores())
post <- stan_glm(student_count ~ percent_hispanic + income + sat_score,
                 data = something, family = poisson(), QR = TRUE)
The QR reparameterization of X makes the columns of Q have the same length and be orthogonal to each other. After inverting the reparameterization to obtain coefficients on X, you can do prediction on the original scale of the variables.
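For instance, a quick base-R sketch (simulated X, purely to illustrate the algebra) showing that the linear predictor is unchanged and that the coefficients on X come back through the inverse of R:

set.seed(1)
N <- 100; K <- 3
X <- matrix(rnorm(N * K), N, K)

qrX <- qr(X)
Q_ast <- qr.Q(qrX) * sqrt(N - 1)   # thin Q, rescaled as in the Stan manual
R_ast <- qr.R(qrX) / sqrt(N - 1)

# X %*% beta == Q_ast %*% theta whenever theta = R_ast %*% beta,
# so the coefficients on X are recovered as beta = solve(R_ast) %*% theta.
beta  <- c(0.5, -1, 2)
theta <- R_ast %*% beta
max(abs(X %*% beta - Q_ast %*% theta))    # ~ 0: same linear predictor
max(abs(solve(R_ast) %*% theta - beta))   # ~ 0: coefficients recovered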
Thanks for pointing me to the QR decomposition. Appreciate it.
I implemented it in the model; chains mix nicely, samples look good, etc. BUT the coefficients recreated in the generated quantities block have a median of zero or very close to zero. That seems highly unlikely.
Stan file pasted below. This is a hurdle model for student activity counts, with a campus-specific intercept. There are separate coefficients for each portion of the model (the probability of a zero, and the count if non-zero).
data {
  int<lower=0> N;                  // Number of rows
  int<lower=0> J;                  // Number of campuses
  int<lower=0> K;                  // Number of coefficients
  int<lower=0> Y[N];               // Count of students
  int<lower=1, upper=J> campus[N]; // Campus labels
  matrix[N, K] X;                  // Covariates
}
transformed data {
  // As per page 123 of the Stan language reference v15
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // Thin and scale the QR decomposition
  Q_ast = qr_Q(X)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(X)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {
  // Parameters for campus intercepts
  vector[J] theta_z_eta;
  vector[J] theta_c_eta;
  real<lower=0> theta_z_scale;
  real<lower=0> theta_c_scale;
  // For non-centered coefficient estimation
  vector[K] beta_z_eta;
  vector[K] beta_c_eta;
  real<lower=0> beta_z_scale;
  real<lower=0> beta_c_scale;
}
transformed parameters {
  vector[J] theta_z; // Campus intercept for zero-prob part
  vector[J] theta_c; // Campus intercept for count part
  vector[K] beta_z;  // Coefficients for zero part (Q scale)
  vector[K] beta_c;  // Coefficients for count part (Q scale)
  theta_z = theta_z_eta * theta_z_scale;
  theta_c = theta_c_eta * theta_c_scale;
  beta_z = beta_z_eta * beta_z_scale;
  beta_c = beta_c_eta * beta_c_scale;
}
model {
  real theta;
  real lambda;
  vector[N] xb_z;
  vector[N] xb_c;
  theta_z_eta ~ normal(0, 1);
  theta_z_scale ~ gamma(10, 10);
  theta_c_eta ~ normal(0, 1);
  theta_c_scale ~ gamma(10, 10);
  beta_z_eta ~ normal(0, 1);
  beta_z_scale ~ gamma(10, 10);
  beta_c_eta ~ normal(0, 1);
  beta_c_scale ~ gamma(10, 10);
  xb_z = Q_ast * beta_z;
  xb_c = Q_ast * beta_c;
  for (n in 1:N) {
    theta = theta_z[campus[n]] + xb_z[n];        // logit P(zero)
    lambda = exp(theta_c[campus[n]] + xb_c[n]);  // Poisson rate
    if (Y[n] == 0) {
      target += bernoulli_logit_lpmf(1 | theta);
    } else {
      // Non-zero: hurdle crossed, then zero-truncated Poisson
      target += bernoulli_logit_lpmf(0 | theta)
              + poisson_lpmf(Y[n] | lambda) - log1m_exp(-lambda);
    }
  }
}
generated quantities {
  vector[K] beta_zero;  // Zero-part coefficients on the original X scale
  vector[K] beta_count; // Count-part coefficients on the original X scale
  beta_zero = R_ast_inverse * beta_z;
  beta_count = R_ast_inverse * beta_c;
}
I have no way of judging whether that is plausible or not, but I tend not to argue with posterior distributions when it appears Stan is working well. I would look at the posterior predictive distribution of the outcome, especially to see whether the proportion of zeros is similar to that in the data.
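As a sketch, assuming fit is the stanfit object and X, campus, and Y are the data that were passed to Stan, the zero-proportion check could look like this (it uses the reconstructed original-scale coefficients, since X %*% beta_zero matches Q_ast %*% beta_z):

draws <- rstan::extract(fit)
S <- min(200, nrow(draws$beta_zero))   # number of posterior draws to use
zero_prop_rep <- numeric(S)
for (s in 1:S) {
  # The model codes theta as logit P(zero), so invert with plogis()
  p_zero <- plogis(draws$theta_z[s, campus] + as.vector(X %*% draws$beta_zero[s, ]))
  zero_prop_rep[s] <- mean(runif(nrow(X)) < p_zero)
}
hist(zero_prop_rep)
abline(v = mean(Y == 0), col = "red")  # observed proportion of zeros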
Strangely, no matter what I try, almost all of the coefficients (beta_zero and beta_count) have a mean value of zero. Many of the covariates are highly correlated with the dependent variable and have reasonable coefficient values using other regression tools. This leads me to believe that I did something wrong in the QR decomposition part. I tried to follow the Stan documentation exactly, but must have an error somewhere.
Does anyone see anything that might cause this? (Stan code in the earlier post.)
I’d recommend simulating data where you know the coefficients are not zero and seeing what happens.
If you have highly correlated predictors x1 and x2, and you have coefficients beta1 and beta2 with a symmetric prior, then you expect beta1 and beta2 to have a posterior mean of zero. Look at their sum, which won’t be expected to be zero if x1 and x2 are informative about y.
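A tiny simulation (made-up data, rstanarm just for convenience) shows the effect: each coefficient alone can straddle zero while their sum stays pinned down.

library(rstanarm)
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x1 and x2 correlated near 1
y  <- rpois(n, exp(0.5 * x1 + 0.5 * x2))

fit <- stan_glm(y ~ x1 + x2, family = poisson(), refresh = 0)
beta <- as.matrix(fit)[, c("x1", "x2")]
quantile(beta[, 1], c(0.05, 0.5, 0.95))       # wide, may well cover zero
quantile(rowSums(beta), c(0.05, 0.5, 0.95))   # concentrated near 1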
Still think I’m doing something wrong with the QR decomposition and subsequent coefficient reconstruction in Stan.
x1 and x2 are correlated with each other at 0.25
x1 and y are correlated at -0.145
x2 and y are correlated at -0.05
However, the resulting coefficients both have a range from -1e-06 to 3e-07, which is effectively 0.
I attempt to reconstruct the “un-QR” version of the coefficients in the generated quantities block, and my guess is that is where I am doing something wrong.
The beta_c coefficients for x1 and x2 seem to have reasonable ranges
beta_c[1] ranges from -0.68 to -0.27
beta_c[2] ranges from -0.27 to -0.22
Given the data and correlation between the covariates and y, those beta_c values seem “reasonable”. My guess is that somehow multiplying them by R_ast_inverse is causing the problem.
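One rough cross-check that avoids Stan's internals (assuming fit is the stanfit object and X, Y are the data passed to Stan): compare the reconstructed coefficients against an ordinary Poisson GLM on the raw X, since both are on the original covariate scale. The Y > 0 subset is only a crude stand-in for the truncated count part, and the GLM ignores the campus intercepts.

glm_count <- glm(Y ~ X, family = poisson, subset = Y > 0)
cbind(glm  = coef(glm_count)[-1],   # drop the intercept
      stan = colMeans(rstan::extract(fit)$beta_count))

If the two columns are on the same order of magnitude, the R_ast_inverse step is algebraically fine; note also that a covariate with a raw range like [1, 20000] will legitimately get a tiny coefficient on the original scale, which can look like zero in a summary.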