Standardizing predictors and outputs in a hierarchical model


#1

Hello everyone,

First of all thanks for the StanCon—it was a great experience.
Here is a question I was discussing there; several people suggested that I post it on the forum.

My question is about Manual Section 28.11, Standardizing Predictors and Outputs (v. 2.17.0), where it is suggested that we standardize our inputs to speed up and stabilize sampling. The model written there is a very simple one-predictor, population-level model, y_n = alpha + beta * x_n + epsilon_n. I was wondering how to back out the coefficients in the generated quantities block when x is actually a group-level (hierarchical) variable. That is, for example:

data {
    int<lower=0> N;
    int<lower=0> G; // Number of groups
    vector[N] y;
    vector[G] x;                    // group-level predictor
    int<lower=1, upper=G> group[N]; // group indicator for each observation
}
transformed data {
    vector[G] x_std;
    vector[N] y_std;
    x_std = (x - mean(x)) / sd(x);
    y_std = (y - mean(y)) / sd(y);
}
parameters {
    real alpha_std;
    real beta_std;
    real<lower=0> sigma_std;
}
model {
    alpha_std ~ normal(0, 10);
    beta_std ~ normal(0, 10);
    sigma_std ~ cauchy(0, 5);
    for (n in 1:N)
        y_std[n] ~ normal(alpha_std + beta_std * x_std[group[n]], 
                          sigma_std);
}

The formulas offered in the manual (page 361) do not seem applicable in this case because of the block-diagonal structure of the covariance matrices. rstanarm seems to handle this via autoscale = TRUE, but it is difficult to see how to replicate the prior_summary output and the scaling/transforming machinery when writing my own lower-level Stan code.

Can someone offer guidance on this, for example using the multivariate regression example in Section 9.13 (pp. 152-153)?

Sincerely,
Silvia


#2

The formulas for obtaining unstandardized regression coefficients still hold in the hierarchical case, except that you would not usually standardize the predictors with the global mean and standard deviation. If you were to standardize at all, it would usually be with the group-specific mean and group-specific standard deviation, in which case you would also need to unstandardize with the group-specific statistics.
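
For the global standardization used in the model above, a minimal sketch of the back-transformation (following the manual's formulas, applied here to the poster's variable names) would be:

generated quantities {
    // back out the coefficients on the original scale, assuming the
    // global standardization from the transformed data block above
    real alpha = mean(y) + sd(y) * (alpha_std - beta_std * mean(x) / sd(x));
    real beta = beta_std * sd(y) / sd(x);
    real sigma = sd(y) * sigma_std;
}

With group-specific standardization, mean(x) and sd(x) would be replaced by the corresponding per-group statistics, yielding group-specific intercepts and slopes.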

However, the advice in that section of the manual is not very good. You should not be standardizing the outcome variable, ever, and standardizing the predictors is not that great an idea either. It is okay to divide by constants to change the units, but dividing by the standard deviation is not generative. What the rstanarm package actually does is change the scale of the priors rather than rescale the predictors, but even that only makes sense when you have a single predictor. In the usual case where you have multiple predictors, you generally want to use the QR = TRUE argument in rstanarm, which both orthogonalizes and standardizes the predictors, or do the same thing in your own code, as explained in section 9.2 of the Stan User Manual and in a case study.
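
For reference, here is a minimal sketch of that QR reparameterization for a plain linear regression, along the lines of the manual's QR section (the sqrt(N - 1) factor scales the columns of Q to have roughly unit standard deviation; the variable names are illustrative):

data {
    int<lower=0> N;
    int<lower=0> K;
    matrix[N, K] X;
    vector[N] y;
}
transformed data {
    // thin QR decomposition of the design matrix
    matrix[N, K] Q_ast = qr_Q(X)[, 1:K] * sqrt(N - 1);
    matrix[K, K] R_ast = qr_R(X)[1:K, ] / sqrt(N - 1);
    matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
    real alpha;
    vector[K] theta;        // coefficients on the orthogonalized Q scale
    real<lower=0> sigma;
}
model {
    y ~ normal(alpha + Q_ast * theta, sigma);
}
generated quantities {
    vector[K] beta = R_ast_inverse * theta;  // coefficients on the X scale
}

Because the columns of Q_ast are orthogonal and on a common scale, the sampler sees a much better conditioned posterior, and beta is recovered exactly in generated quantities.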


#3

Thank you, Ben, for the great reply. I should have mentioned that you had already recommended the QR reparameterization! Thanks also for the clarification about the manual's advice and about rstanarm.


#4

That is a bit strongly stated and it’s not that black and white.

Just be aware that it doesn't work when n < p.


#5

When should someone standardize the outcome besides the rare case where the data-generating process has a known standard deviation?

We need to work more on the n < p case. You can do the regression on Q instead of X, but there isn't a unique transformation from the coefficients on Q back to the coefficients on X. I think the right solution is to take the top n rows of R to be given by the data and integrate over the bottom p - n rows. But if p >> n, that is a lot of unknowns to integrate over in the absence of explicit integrals.


#6

When should someone choose weakly informative priors with data-dependent scaling? I know you accept this, as it's in rstanarm. To me, standardizing the data and using a weakly informative prior with a fixed scale is the same as leaving the data alone and scaling the weakly informative prior based on the scale of the data. Whether one uses the sd, a robust sd, or the MAD is just a detail.


#7

Scaling by the standard deviation of the outcome is in rstanarm for the Gaussian case only, but I don't agree with it.


#8

It seems we should be more careful with our words, as this is a complicated issue: predictor vs. outcome, centering vs. scaling vs. standardizing, changing the data vs. changing the prior, whether we do it for interpretation, for prior information, or for computational reasons (my experience here is especially with GPs), and for both the predictors and the outcome it depends on the type of variable, etc. This is worth several pages of discussion and examples (maybe some day this year, but I can't do it today).


#9

So, there is no universal agreement that one should always standardize predictors and outcomes? Can you point me to any reading about this?

I thought standardizing could make computation easier without any real cost. @bgoodri, could you explain why standardizing can be a bad idea?


#10

It isn’t a question of reading what someone else may have written; you can just look at the model from your other thread:

transformed data {
  real mean_y = mean(y);
  real<lower=0> sd_y = sd(y);
  vector[N] y_std = (y - mean_y) / sd_y;
  real mean_x = mean(x);
  real<lower=0> sd_x = sd(x);
  vector[N] x_std = (x - mean_x) / sd_x;
}

mean_y, sd_y, and thus y_std are not known with certainty, but you are treating them as if they were known. Consequently, the posterior distribution of beta and sigma, and any predictions derived from them, will not reflect the genuine uncertainty in mean_y and sd_y. You will tend to predict future data less well than past data.


#11

@bgoodri Thanks! That makes a lot of sense.


#12

True, but 1) with sensible weakly informative priors, a not-very-small n, and optimal computation, there is likely to be no difference, and 2) the effect of assuming something to be known can be smaller than the effect of using bad priors or non-optimal computation, so if scaling makes it easier to choose a sensible prior or to get better computation, scaling can be better than not scaling. You can also flip this around: leave the data unscaled but scale the priors, and then think about how that scaling affects your inference as you make the prior very tight or quite wide (as in rstanarm). With a wide prior, assuming the scale of x and y to be known does not make much difference (unless n is really small). There is no universal agreement on priors, so there can't be universal agreement on the dual question of scaling variables.
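
To make the "scale the priors instead of the data" alternative concrete, here is a minimal sketch in the spirit of rstanarm's autoscaling for a single-predictor Gaussian regression. The particular prior families and scale multipliers (10, 2.5, 1) are illustrative choices, not exact replicas of rstanarm's defaults:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  real<lower=0> sd_x = sd(x);
  real<lower=0> sd_y = sd(y);
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  // the data stay on their original scale; the priors carry the scaling
  alpha ~ normal(mean(y), 10 * sd_y);
  beta ~ normal(0, 2.5 * sd_y / sd_x);
  sigma ~ exponential(1 / sd_y);
  y ~ normal(alpha + beta * x, sigma);
}

The same caveat from post #10 applies: sd(x) and sd(y) are still treated as known, but with wide priors the effect of that assumption is small unless n is very small.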

