Scaling for poisson model

Hi, Have an awkward issue with some scaling for a poisson model.

This is a poisson model for counting student engagement.

The data are in many different ranges. Some covariates are [0,1], others are [1,20000].
In order to make Stan run smoothly, I’d like to scale the covariates to all be in the same range. Two immediate thoughts on this:

  • Scale each covariate to be [0,1]
  • We have population in the model, so scale each covariate as a percentage of population

i.e.
student_count ~ pois( b[1] * percent_hispanic + b[2] * income + b[3] * sat_score)

Both will run fine in Stan. The problem comes when trying to estimate predicted counts. As every covariate is now in the compressed range, the posterior samples for the counts are always too low. My intuition is that I need to un-scale the posterior samples in some way, but not sure how. Any suggestions on how to attack this.

Don’t change the posterior samples unless you really know what you are doing. But the predictions should be fine if your population values are scaled in the same way as in your sample.

But scaling the predictors marginally is not so optimal. It often works much better to do a QR reparameterization of the design matrix (X), which is discussed in the manual and is a strongly recommended option in rstanarm, where your model would be

library(rstanarm)
options(mc.cores = parallel::detectCores())
post <- stan_glm(student_count ~ percent_hispanic + income + sat_score, 
                 data = something, family = poisson(), QR = TRUE)

The QR reparameterzation of (X) makes the columns of Q have the same length and be orthogonal to each other. After inverting the reparameterization to obtain coefficients on X, you can do prediction in the original scale of the variables.

1 Like

Ben,

Thanks for pointing me to the QR decomposition. Appreciate it.

I implemented it in the model, chains mix nicely, samples look good, etc. BUT, the coefficients recreated in generated quantities have a median of zero or very close to zero. That seems highly unlikely.

Stan file pasted below. This is a hurdle model for student activity counts, with a campus specific intercept. There are separate coefficients for each portion of the model (the probability of a zero, and the count if non-zero)

data {
  int <lower=0> N; // Number of rows
  int <lower=0> J; // Number of campus
  int <lower=0> K; // Number of coef
  int <lower=0> Y[N]; // Count of students
  int<lower=1, upper=J> campus[N]; // Campus labels
  matrix[N,K] X;  // Covariates
}
transformed data{
  // As per page 123 of stan language reference v15
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(X)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(X)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {

  // Parameters for campus intercepts
  vector[J] theta_z_eta;
  vector[J] theta_c_eta;
  real<lower=0> theta_z_scale;
  real<lower=0> theta_c_scale;

  // For non-centered coefficient estimation
  vector[K] beta_z_eta;
  vector[K] beta_c_eta;
  real<lower=0> beta_z_scale;
  real<lower=0> beta_c_scale;

}

transformed parameters{
  vector[J] theta_z; // Campus intercept for zero prob part
  vector[J] theta_c; // Campus intercept for count part
  vector[K] beta_z;  // Coefficients for zero part
  vector[K] beta_c;  // Coefficients for count part

  theta_z = theta_z_eta * theta_z_scale;
  theta_c = theta_c_eta * theta_c_scale;

  beta_z = beta_z_eta * beta_z_scale;
  beta_c = beta_c_eta * beta_c_scale;

}

model {
  real theta;
  real lambda;
  vector[N] xb_z;
  vector[N] xb_c;

  theta_z_eta ~ normal(0,1);
  theta_z_scale ~ gamma(10,10);

  theta_c_eta ~ normal(0,1);
  theta_c_scale ~ gamma(10,10);

  beta_z_eta ~ normal(0,1);
  beta_z_scale ~ gamma(10,10);

  beta_c_eta ~ normal(0,1);
  beta_c_scale ~ gamma(10,10);

  xb_z = Q_ast * beta_z;
  xb_c = Q_ast * beta_c;

  for(n in 1:N){

    theta  =     theta_z[campus[n]] + xb_z[n];
    lambda = exp(theta_c[campus[n]] + xb_c[n]);

    if(Y[n]==0){
      target += bernoulli_logit_lpmf(1 | theta);
    }
    else{
      target += bernoulli_logit_lpmf(0 | theta) + poisson_lpmf(Y[n] | lambda) - log1m_exp(-lambda);
    }
  }
}
generated quantities{
  vector[K] beta_zero;
  vector[K] beta_count;

  beta_zero  = R_ast_inverse * beta_z;
  beta_count = R_ast_inverse * beta_c;

}

I have no way of judging whether that is plausible or not, but I tend to not argue with posterior distributions when it appears Stan is working well. I would be looking at the posterior predictive distribution of the outcome, especially to see if the proportion of zeros is similar to that in the data.

1 Like

Should you have also a common intercept?

It would be also better to use poisson_log_lpmf(theta_c[campus[n]] + xb_c[n]) as there is then no need to exponentiate.

Aki

Thanks Aki!

I was exponentiating in order to keep the lambda value positive.

Strangely, no matter what I try, almost all of the coeffcients (beta_zero and beta_count) have a mean value of zero. Many are highly correlated to the dependent variable, and have reasonable values using other regression tools. This leads me to believe that I did something wrong with the QR decomposition part. Tried to follow the Stand documents exactly, but must have an error somewhere.

Does anyone see anything that might cause this? (Stan code in earlier post)

I’d recommend simulating data where you know they’re not zero and see what happens.

If you have highly correlated predictors x1 and x2, and you have coefficients beta1 and beta2 with a symmetric prior, then you expect beta1 and beta2 to have a posterior mean of zero. Look at their sum, which won’t be expected to be zero if x1 and x2 are informative about y.

Bob, good point.

Still think I’m doing something wrong with the QR decomposition and subsequent coefficient reconstruction in Stan.

x1 and x2 are correlated with each other at 0.25
x1 and y are correlated at -0.145
x2 and y are correlated at -0.05

However the resulting coefficients both have a range from -1e-06 to 3e07 which is effectively 0

I attempt to re-construct the “Un-QR” version of the coefficients in the generated quantities block, and my guess is that is where I am doing something wrong.

generated quantities{
  vector[K] beta_count;
  beta_count = R_ast_inverse * beta_c;
}

The beta_c coefficients for x1 and x2 seem to have reasonable ranges
beta_c[1] ranges from -0.27 to -0.68
beta_c[2] ranges from -0.22 to -0.27

Given the data and correlation between the covariates and y, those beta_c values seem “reasonable”. My guess is that somehow multiplying them by R_ast_inverse is causing the problem.

Any thoughts?

OFFERING A $250 BOUNTY IF SOMEBODY WANTS TO FIX THIS STAN MODEL/FILE TO RUN AND GENERATE COEFFICIENTS THAT ARE “REASONABLE”.

The Catch is that I need this sorted out in the next 36 Hours. (Sunday Noon EDT)

If interested, please contact me directly. I’ll provide Stan code, data, and explanation ASAP.

(I hope this isn’t inappropriate to post here. If so, please let me know and I’ll delete this post.)

Fine by me. We have a jobs category, too :-)

Also, the above is equal to just qr_Q(X) as it’s size K. Not that the mild inefficiency of doing it your way will matter in transformed data.

I’d also recommend poisson_log_lpmf(Y[n] | log_lambda) rather than the poisson_lpfm(Y[n] | exp(log_lambda)) that you’re using now.

Thanks Bob.

Noah, were you able to solve this issue? I think the QR is causing a similar problem for me (see my post here: http://discourse.mc-stan.org/t/all-parameters-0-except-the-final-entry-in-the-model-matrix-regardless-of-what-that-value-is/1511/13