Standardizing predictors and outputs in a hierarchical model

Hello everyone,

First of all, thanks for the StanCon—it was a great experience.
Here is a question that I was discussing there, and many people suggested that I post it on the forum.

My question is about manual Section 28.11, Standardizing Predictors and Outputs (v. 2.17.0), where it is suggested that we standardize our inputs to speed up computation. The model written there is a very simple one-predictor, population-level model, y_n = alpha + beta * x_n + epsilon_n. I was wondering how to back out the coefficients in the generated quantities block when x is actually a hierarchical variable. That is, for example:

data {
    int<lower=0> N;
    int<lower=0> G; // Number of groups
    vector[N] y;
    vector[N] x;
    int<lower=1, upper=G> group[N]; // Group indicator for each observation
}
transformed data {
    vector[N] x_std;
    vector[N] y_std;
    x_std = (x - mean(x)) / sd(x);
    y_std = (y - mean(y)) / sd(y);
}
parameters {
    real alpha_std;
    real beta_std;
    real<lower=0> sigma_std;
}
model {
    alpha_std ~ normal(0, 10);
    beta_std ~ normal(0, 10);
    sigma_std ~ cauchy(0, 5);
    for (n in 1:N)
        y_std[n] ~ normal(alpha_std + beta_std * x_std[group[n]], 
                          sigma_std);
}

The formulas offered in the manual (page 361) will not be applicable in this case because of the block-diagonal structure in the covariance matrices. rstanarm seems to be doing it via autoscale = TRUE, but it is difficult to see how I should replicate that prior_summary and scaling/transforming part when I write my own lower-level Stan code.
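
In the simple non-hierarchical case, the manual backs out the original-scale parameters with something like the following generated quantities block, which is what I am trying to generalize:

generated quantities {
  // back-transform to the original scale (simple, non-hierarchical case)
  real beta = beta_std * sd(y) / sd(x);
  real alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x)) + mean(y);
  real<lower=0> sigma = sd(y) * sigma_std;
}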

Can someone offer me guidance on this, for example using the multivariate regression example in Section 9.13 (pp. 152-153)?

Sincerely,
Silvia

The formulas for obtaining unstandardized regression coefficients still hold in the hierarchical case, except that you would not usually standardize the predictors with the global mean and standard deviation. If you were to do something, it would usually involve the group-specific mean and group-specific standard deviation, in which case you would need to unstandardize with the group-specific statistics.
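
As a rough sketch: assuming hypothetical group-level statistics mean_x and sd_x (vectors of length G computed in transformed data), y standardized by its global mean_y and sd_y, and the single beta_std from the model above, the back-transformation then becomes group-specific:

generated quantities {
  // Original-scale coefficients differ by group because the
  // standardization of x was group-specific (mean_x, sd_x are
  // hypothetical quantities from transformed data).
  vector[G] beta;
  vector[G] alpha;
  real<lower=0> sigma = sd_y * sigma_std;
  for (g in 1:G) {
    beta[g] = beta_std * sd_y / sd_x[g];
    alpha[g] = sd_y * (alpha_std - beta_std * mean_x[g] / sd_x[g]) + mean_y;
  }
}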

However, the advice in that section of the manual is not very good. You should never standardize the outcome variable, and standardizing the predictors is not that great an idea either. It is okay to divide by constants to change the units, but dividing by the standard deviation is not that generative. What the rstanarm package actually does is change the scale of the priors rather than rescale the predictors, but even that only makes sense when you have a single predictor. In the usual case where you have multiple predictors, you generally want to use the QR = TRUE argument in rstanarm, which both orthogonalizes and standardizes the predictors, or do the same thing in your own code, which is explained in Section 9.2 of the Stan User Manual and in a case study.
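
To make that concrete, the QR approach looks roughly like this; a sketch only, assuming an N x K design matrix X with no intercept column (qr_thin_Q and qr_thin_R are available in newer Stan releases; older ones use qr_Q and qr_R):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  vector[N] y;
}
transformed data {
  // thin QR decomposition, rescaled so the columns of Q_ast are on a unit scale
  matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;
  vector[K] theta;      // coefficients on Q_ast
  real<lower=0> sigma;
}
model {
  // weakly informative priors on alpha, theta, and sigma would go here
  y ~ normal(alpha + Q_ast * theta, sigma);
}
generated quantities {
  vector[K] beta = R_ast_inverse * theta;  // coefficients on the original X
}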

Thank you Ben for the great reply—I should have mentioned that you already recommended QR reparameterization! Thank you for clarifying about the manual advice and rstanarm as well.

That is a bit strongly stated and it’s not that black and white.

Just be aware that it doesn’t work when n < p.

When should someone standardize the outcome besides the rare case where the data-generating process has a known standard deviation?

We need to work more on the n < p case. You can do the regression on Q instead of X, but there isn’t a unique transformation from the coefficients on Q back to the coefficients on X. I think the right solution is to take the top n rows of R to be given by the data and integrate over the bottom p - n rows. But if p >> n, that is a lot of unknowns to integrate over in the absence of explicit integrals.

When should someone choose weakly informative priors with data-dependent scaling? I know you accept this, as it’s in rstanarm. For me, standardizing and using a weakly informative prior with a fixed scale is the same as scaling the weakly informative prior based on the scale of the data. Whether one uses the sd, a robust sd, or the MAD is just a detail.
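
For example (with an arbitrary illustrative scale), putting a normal(0, 2.5) prior on the coefficient of x / sd(x) is the same model as putting a normal(0, 2.5 / sd(x)) prior on the coefficient of the unscaled x.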

Scaling by the standard deviation of the outcome is in rstanarm for the Gaussian case only, but I don’t agree with it.

It seems we should be more careful with the words, as this is a complicated issue: predictor vs. outcome, centering, scaling, standardizing, changing the data or the prior, whether we do it for interpretation, for prior information, or for computational reasons (my experience here is especially with GPs), and for both the predictors and the outcome it depends on the type of variable, etc. This is something worth several pages of discussion and examples (maybe some day this year, but I can't do it today).

So, there is no universal agreement that one should always standardize predictors and outcomes? Can you point me to any reading about this?

I thought standardizing could make computation easier without any real cost. @bgoodri, could you explain why standardizing can be a bad idea?

It isn’t a question of reading what someone else may have written; you can just look at the model from your other thread:

transformed data {
  real mean_y = mean(y);
  real<lower=0> sd_y = sd(y);
  vector[N] y_std = (y - mean_y) / sd_y;
  real mean_x = mean(x);
  real<lower=0> sd_x = sd(x);
  vector[N] x_std = (x - mean_x) / sd_x;
}

mean_y, sd_y, and thus y_std are not known with certainty but you are treating them as if they are known. Consequently, the posterior distribution of beta and sigma and any predictions derived from them will not reflect the genuine uncertainty in mean_y and sd_y. You will tend to predict future data less well than past data.

@bgoodri Thanks! That makes a lot of sense.

True, but 1) with sensible weakly informative priors, not very small n, and optimal computation there is likely to be no difference, and 2) the effect of assuming something to be known can be smaller than the effect of using bad priors or non-optimal computation, so if scaling makes it easier to get a more sensible prior or better computation, scaling can be better than not scaling. You can also flip this by not scaling the data but scaling the priors, and thinking about how the scaling affects your inference if you additionally make the prior very tight or quite wide (as in rstanarm). With a wide prior, assuming the scale of x and y to be known does not make much difference (unless n is really small). There is no universal agreement on priors, so there can't be universal agreement on the dual case of scaling variables.

I saw in Gelman (2008) that standardizing input variables (not the predictors) is recommended so that we do not need to worry about coefficients on very large or very small variables. Since I am applying my Stan model to different datasets with very different scales of input variables, does it make more sense to standardize the input variables rather than change the scale of the priors each time?

Changing the priors is about the same amount of work as changing the units of the predictors. The former makes it simpler to do posterior prediction, but that is about it.

I am very new to Stan programming and to Bayesian inference as well. What do you mean by changing the priors in this scenario; do you have some specific examples that I can learn from? I am currently using normal(0, 1) and cauchy(0, 1) priors. Would you mind explaining why these priors make my model work properly only with standardized input variables, yet not on the original scale? I tried wider prior distributions such as normal(0, 10) and cauchy(0, 5), but the results still aren't good for input variables on the original scale.

Below is a simplified version of my model:

model {
  // prior distributions
  sigma_eps ~ cauchy(0, 1);
  theta_mu ~ normal(0, 1);

  for (k in 1:K) {
    Theta[, k] ~ normal(0, 1);
  }

  // priors on subject-specific effects and the likelihood
  for (n in 1:N) {
    alpha[n] ~ multi_normal(zero_k, diag_matrix(rep_vector(1, K)));
    Y[n] ~ multi_normal(B[n] * theta_mu + B[n] * Theta * alpha[n],
                        diag_matrix(rep_vector(sigma_eps^2, V[n])));
  }
}

Consider a simple case of a predictor based on length expressed in meters vs. kilometers. The coefficient for kilometers has to be 1000 times larger. So that affects what prior you want to use given the expected scale of the coefficient.
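
For instance (purely as an illustration), if a normal(0, 5) prior is sensible for the coefficient on length in meters, the equivalent prior for the coefficient on length in kilometers would be normal(0, 5000), since the same effect corresponds to a coefficient 1000 times larger.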

Since the datasets I am applying this model to have very different scales from one dataset to another, is it still possible to find a “universal” prior distribution in this case?

rstanarm does a QR decomposition of the data matrix of predictors in order to standardize them all and remove correlations. Then it starts to make sense to use default priors. @betanalpha has a tutorial here.

Thank you, this link is very helpful.

When I tried to implement the QR decomposition in my code, I encountered another problem: in my model, Yi = Bi * (theta_mu + Theta * Alpha_i), where Bi has different dimensions for each subject i. When I use the QR decomposition on Bi, I get a different Qi and Ri for each subject i. My question is then how to get back the original parameters like Theta, since now Theta = Ri^(-1) * Theta_tilde, meaning that Theta would be different for each subject i; however, Theta is supposed to be the same across all subjects in my original model.

Am I using the QR decomposition correctly? If so, how should I handle the ragged input structure? Thanks!