Standardizing predictors and outputs in a hierarchical model

Hello everyone,

First of all, thanks for the StanCon—it was a great experience.
Here is a question that I was discussing there, and many people suggested that I post it on the forum.

My question is about manual Section 28.11, Standardizing Predictors and Outputs (v. 2.17.0), where it is suggested that we standardize our inputs to speed up computation. The model written there is a very simple one-predictor, population-level model, y_n = alpha + beta * x_n + epsilon_n. I was wondering how to back out the coefficients in the generated quantities block when x is actually a hierarchical variable. That is, for example:

data {
    int<lower=0> N;
    int<lower=0> G; // Number of groups
    vector[N] y;
    vector[N] x;
    int<lower=1, upper=G> group[N]; // Group indicator for each observation
}
transformed data {
    vector[N] x_std;
    vector[N] y_std;
    x_std = (x - mean(x)) / sd(x);
    y_std = (y - mean(y)) / sd(y);
}
parameters {
    real alpha_std;
    real beta_std;
    real<lower=0> sigma_std;
}
model {
    alpha_std ~ normal(0, 10);
    beta_std ~ normal(0, 10);
    sigma_std ~ cauchy(0, 5);
    for (n in 1:N)
        y_std[n] ~ normal(alpha_std + beta_std * x_std[group[n]], 
                          sigma_std);
}

The formulas offered in the manual (page 361) will not be applicable in this case because of the block-diagonal structure in the covariance matrices. rstanarm seems to be doing it via autoscale = TRUE, but it is difficult to see how I should replicate that prior_summary and scaling/transforming part when I write my own lower-level Stan code.
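
In the simple non-hierarchical case, the manual backs out the original-scale parameters with something like the following generated quantities block, which is what I am trying to generalize:

generated quantities {
  // back-transform to the original scale (simple, non-hierarchical case)
  real beta = beta_std * sd(y) / sd(x);
  real alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x)) + mean(y);
  real<lower=0> sigma = sd(y) * sigma_std;
}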

Can someone offer me guidance on this, for example using the multivariate regression example in Section 9.13 (pp. 152-153)?

Sincerely,
Silvia

The formulas for obtaining unstandardized regression coefficients still hold in the hierarchical case, except that you would not usually standardize the predictors with the global mean and standard deviation. If you were to do something, it would usually involve the group-specific mean and group-specific standard deviation, in which case you would need to unstandardize with the group-specific statistics.
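
As a rough sketch: assuming hypothetical group-level statistics mean_x and sd_x (vectors of length G computed in transformed data), y standardized by its global mean_y and sd_y, and the single beta_std from the model above, the back-transformation then becomes group-specific:

generated quantities {
  // Original-scale coefficients differ by group because the
  // standardization of x was group-specific (mean_x, sd_x are
  // hypothetical quantities from transformed data).
  vector[G] beta;
  vector[G] alpha;
  real<lower=0> sigma = sd_y * sigma_std;
  for (g in 1:G) {
    beta[g] = beta_std * sd_y / sd_x[g];
    alpha[g] = sd_y * (alpha_std - beta_std * mean_x[g] / sd_x[g]) + mean_y;
  }
}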

However, the advice in that section of the manual is not very good. You should never standardize the outcome variable, and standardizing the predictors is not that great an idea either. It is okay to divide by constants to change the units, but dividing by the standard deviation is not that generative. What the rstanarm package actually does is change the scale of the priors rather than rescale the predictors, but even that only makes sense when you have a single predictor. In the usual case where you have multiple predictors, you generally want to use the QR = TRUE argument in rstanarm, which both orthogonalizes and standardizes the predictors, or do the same thing in your own code, which is explained in Section 9.2 of the Stan User Manual and in a case study.
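
To make that concrete, the QR approach looks roughly like this; a sketch only, assuming an N x K design matrix X with no intercept column (qr_thin_Q and qr_thin_R are available in newer Stan releases; older ones use qr_Q and qr_R):

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  vector[N] y;
}
transformed data {
  // thin QR decomposition, rescaled so the columns of Q_ast are on a unit scale
  matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;
  vector[K] theta;      // coefficients on Q_ast
  real<lower=0> sigma;
}
model {
  // weakly informative priors on alpha, theta, and sigma would go here
  y ~ normal(alpha + Q_ast * theta, sigma);
}
generated quantities {
  vector[K] beta = R_ast_inverse * theta;  // coefficients on the original X
}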

Thank you Ben for the great reply—I should have mentioned that you already recommended QR reparameterization! Thank you for clarifying about the manual advice and rstanarm as well.

That is a bit strongly stated and it’s not that black and white.

Just be aware that it doesn’t work when n < p.

When should someone standardize the outcome besides the rare case where the data-generating process has a known standard deviation?

We need to work more on the n < p case. You can do the regression on Q instead of X, but there isn’t a unique transformation from the coefficients on Q back to the coefficients on X. I think the right solution is to take the top n rows of R to be given by the data and integrate over the bottom p - n rows. But if p >> n, that is a lot of unknowns to integrate over in the absence of explicit integrals.

When should someone choose weakly informative priors with data-dependent scaling? I know you accept this, as it’s in rstanarm. For me, standardizing and using a weakly informative prior with a fixed scale is the same as scaling the weakly informative prior based on the scale of the data. Whether one uses the sd, a robust sd, or the MAD is just a detail.
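
For example (with an arbitrary illustrative scale), putting a normal(0, 2.5) prior on the coefficient of x / sd(x) is the same model as putting a normal(0, 2.5 / sd(x)) prior on the coefficient of the unscaled x.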

Scaling by the standard deviation of the outcome is in rstanarm for the Gaussian case only, but I don’t agree with it.

It seems we should be more careful with the words, as this is a complicated issue: predictor vs. outcome, centering, scaling, standardizing, changing the data or the prior, whether we do it for interpretation, for prior information, or for computational reasons (my experience here is especially with GPs), and for both the predictors and the outcome it depends on the type of variable, etc. This is something worth several pages of discussion and examples (maybe some day this year, but I can't do it today).

So, there is no universal agreement that one should always standardize predictors and outcomes? Can you point me to any reading about this?

I thought standardizing could make computation easier without any real cost. @bgoodri, could you explain why standardizing can be a bad idea?

It isn’t a question of reading what someone else may have written; you can just look at the model from your other thread:

transformed data {
  real mean_y = mean(y);
  real<lower=0> sd_y = sd(y);
  vector[N] y_std = (y - mean_y) / sd_y;
  real mean_x = mean(x);
  real<lower=0> sd_x = sd(x);
  vector[N] x_std = (x - mean_x) / sd_x;
}

mean_y, sd_y, and thus y_std are not known with certainty but you are treating them as if they are known. Consequently, the posterior distribution of beta and sigma and any predictions derived from them will not reflect the genuine uncertainty in mean_y and sd_y. You will tend to predict future data less well than past data.

@bgoodri Thanks! That makes a lot of sense.

True, but 1) with sensible weakly informative priors, not very small n, and optimal computation there is likely to be no difference, and 2) the effect of assuming something to be known can be smaller than the effect of using bad priors or non-optimal computation, so if scaling makes it easier to get a more sensible prior or better computation, scaling can be better than not scaling. You can also flip this by not scaling the data but scaling the priors, and thinking about how the scaling affects your inference if you additionally make the prior very tight or quite wide (as in rstanarm). With a wide prior, assuming the scale of x and y to be known does not make much difference (unless n is really small). There is no universal agreement on priors, so there can't be universal agreement on the dual case of scaling variables.

I saw in Gelman (2008) that standardizing input variables (not the predictors) is recommended so that we do not need to worry about coefficients on very large or very small variables. Since I am applying my Stan model to different datasets with very different scales of input variables, does it make more sense to standardize the input variables rather than change the scale of the priors each time?

Changing the priors is about the same amount of work as changing the units of the predictors. The former makes it simpler to do posterior prediction, but that is about it.

I am very new to Stan programming and to Bayesian inference as well. What do you mean by changing the priors in this scenario; do you have some specific examples that I can learn from? I am currently using normal(0, 1) and cauchy(0, 1) priors. Would you mind explaining why these priors make my model work properly only with standardized input variables, yet not on the original scale? I tried wider prior distributions such as normal(0, 10) and cauchy(0, 5), but the results still aren't good for input variables on the original scale.

Below is a simplified version of my model:

model {
  // prior distributions
  sigma_eps ~ cauchy(0, 1);
  theta_mu ~ normal(0, 1);

  for (k in 1:K) {
    Theta[, k] ~ normal(0, 1);
  }

  // priors on subject-specific effects and the likelihood
  for (n in 1:N) {
    alpha[n] ~ multi_normal(zero_k, diag_matrix(rep_vector(1, K)));
    Y[n] ~ multi_normal(B[n] * theta_mu + B[n] * Theta * alpha[n],
                        diag_matrix(rep_vector(sigma_eps^2, V[n])));
  }
}

Consider a simple case of a predictor based on length expressed in meters vs. kilometers. The coefficient for kilometers has to be 1000 times larger. So that affects what prior you want to use given the expected scale of the coefficient.
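
For instance (purely as an illustration), if a normal(0, 5) prior is sensible for the coefficient on length in meters, the equivalent prior for the coefficient on length in kilometers would be normal(0, 5000), since the same effect corresponds to a coefficient 1000 times larger.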

Since the datasets I am applying this model to have very different scales from one dataset to another, is it still possible to find a “universal” prior distribution in this case?

rstanarm does a QR decomposition of the data matrix of predictors in order to standardize them all and remove correlations. Then it starts to make sense to use default priors. @betanalpha has a tutorial here.

Thank you, this link is very helpful.

When I tried to implement the QR decomposition in my code, I encountered another problem: in my model, Yi = Bi * (theta_mu + Theta * Alpha_i), where Bi has different dimensions for each subject i. When I use the QR decomposition on Bi, I get a different Qi and Ri for each subject i. My question is then how to get back the original parameters like Theta, since now Theta = Ri^(-1) * Theta_tilde, meaning that Theta would be different for each subject i; however, Theta is supposed to be the same across all subjects in my original model.

Am I using the QR decomposition correctly? If so, how should I handle the ragged input structure? Thanks!