Re-scaling data and parameters to [0, 1]

Hello all, I’m trying to make use of Gaussian Processes for model calibration in Stan. Something that is often recommended is to re-scale your predictors to the interval [0, 1], which improves sampling efficiency [1,2]. By reading Stan’s user guide on standardisation I can see how to do that with my predictors using Stan syntax. The same page suggests that the priors could have been transformed as well.

What I’m wondering is how one would go about transforming them. I understand why the priors were not transformed in that example, since they were diffuse. In my case, some of the priors relate to the probability of occurrence of the predictors. If the priors are normal or uniform, I found it easy to change the parameters that define them to re-scale appropriately, but not for other distributions such as the Weibull. If you have informative priors, could you transform them in a similar way as you transform the data, but within the transformed parameters block? That is, give tf a normal(24, 4) prior in the model block and, in the transformed parameters block, apply a linear transformation to re-scale it: tf_std = (tf - tf_min) / (tf_max - tf_min), where tf_min and tf_max come from my computer model data.
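For concreteness, this is roughly what I have in mind (just a sketch; tf_min and tf_max would be passed in as data from my computer model runs):

```stan
data {
  real tf_min;                 // smallest tf value in the computer model data
  real tf_max;                 // largest tf value in the computer model data
}
parameters {
  real tf;                     // calibration input on its physical scale
}
transformed parameters {
  // min-max re-scaling; tf_std lies in [0, 1] whenever tf falls inside
  // the range covered by the simulator runs
  real tf_std = (tf - tf_min) / (tf_max - tf_min);
}
model {
  tf ~ normal(24, 4);          // informative prior stated on the original scale
  // tf_std would then be the input entering the GP covariance,
  // alongside the similarly re-scaled design inputs
}
```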

References to what I’m trying to reproduce:
[1] https://epubs.siam.org/doi/abs/10.1137/S1064827503426693
[2] https://www.sciencedirect.com/science/article/pii/S0378778818307539


Hey there! Sorry it took us a while to respond. I’m afraid I don’t have a good answer either… sorry. I also can’t access the papers you linked to.

Just a quick remark: the page of the user’s guide that you linked to talks about standardization (subtracting the mean and dividing by the standard deviation), not min-max scaling. The point with standardization is that it is a linear transformation, which is particularly easy to handle in combination with location-scale distributions. Also, priors for a GP are quite a different beast than priors for linear regression.
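For example, for a location-scale family like the normal, a linear transformation just shifts and scales the parameters:

\theta \sim N(\mu, \sigma) \quad \Rightarrow \quad \frac{\theta - a}{b} \sim N\left(\frac{\mu - a}{b}, \frac{\sigma}{b}\right), \quad b > 0

For a distribution that isn’t location-scale (like the Weibull you mention), rescaling by a positive constant stays within the family, but the shift by the minimum does not, which is probably where it gets awkward.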

Maybe the other Max @maxbiostat has some good ideas? :)

Cheers,
Max


Hi,

Welcome to the forums.

Yes, in the sense that the prior should be calibrated to be on the same scale as the model parameters, which in turn depend on the scale of the data. If you can provide a concrete example of your model and code, we can help you with the specifics of the prior transformation.


Hello both,

Thanks for your comments! For clarification, the GP is used as a surrogate model for my computationally expensive building physics code. I’m trying to calibrate some of the inputs of a model used to predict energy use in a building. Drawing from Higdon et al. (2004):

At various settings for x, observations y are made of the physical system which are modelled statistically using the simulator \eta(x, \theta) at the true calibration value \theta according to:

y(x_i) = \eta(x_i,\theta) + \delta(x_i) + \epsilon(x_i)

where the stochastic term \delta(x_i) (modelled as a GP) accounts for the discrepancy between the simulator \eta(x_i,\theta) and reality \zeta(x_i), and \theta denotes the “true,” but unknown, setting for the calibration inputs t. For the following explanation, I’m ignoring the \delta(x_i) term for simplicity.

Quite often the computational demands of the simulator make it impossible to run an MCMC-based estimation approach directly on the simulator, because of the number of simulations it would take to converge. Instead, a limited number of simulation runs may be used:

\eta(x^*_j,t^*_j), \, j=1,...,m.
  • Treat \eta(x,t) as unknown for pairs (x,t) that aren’t included in the m simulator runs

  • If x \in \mathbb{R}^p and t \in \mathbb{R}^l then \eta(x,t) maps \mathbb{R}^{p+l} to \mathbb{R}.

  • A standard prior model for an unknown function (\eta(x,t)) is a Gaussian Process (GP)

Assume that the mean \mu(x,t) of the GP is constant and specify a covariance function:

cov((x,t),(x',t')) = \frac{1}{\lambda_\eta} \exp\left\{ -\sum^{p}_{k=1} \beta^\eta_k |x_{k} - x'_{k}|^{\alpha} - \sum^{l}_{k'=1} \beta^\eta_{p+k'} |t_{k'} - t'_{k'}|^{\alpha} \right\}
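A rough Stan sketch of this covariance, just to show how I’m thinking of coding it (the function name cov_eta, the jitter argument and stacking the inputs into a single matrix xt = [x, t] are my own choices, not anything from the paper):

```stan
functions {
  // Power-exponential covariance of the GP emulator eta, as in the
  // expression above, evaluated over all pairs of rows of xt.
  matrix cov_eta(matrix xt,        // (n+m) x (p+l) matrix of stacked [x, t] inputs
                 vector beta_eta,  // p+l correlation parameters
                 real lambda_eta,  // marginal precision of the GP
                 real alpha,       // power; alpha = 2 gives the squared exponential
                 real jitter) {    // small diagonal constant for numerical stability
    int N = rows(xt);
    int D = cols(xt);
    matrix[N, N] K;
    for (i in 1:N) {
      K[i, i] = 1 / lambda_eta + jitter;
      for (j in (i + 1):N) {
        real s = 0;
        for (k in 1:D)
          s += beta_eta[k] * pow(fabs(xt[i, k] - xt[j, k]), alpha);
        K[i, j] = exp(-s) / lambda_eta;
        K[j, i] = K[i, j];
      }
    }
    return K;
  }
}
```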

If we assume that:

  • field observations: y = (y(x_1),...,y(x_n))^T

  • simulation outcomes: \eta = (\eta(x_1^*,t_1^*),...,\eta(x_m^*,t_m^*))^T

we can now define a joint n+m vector z = (y^T,\eta^T)^T and the likelihood is:

L(z|\theta,\mu,\lambda_\eta,\beta^\eta,\Sigma_y) \propto |\Sigma_z|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(z-\mu\boldsymbol{1}_{n+m})^T \Sigma^{-1}_z (z-\mu\boldsymbol{1}_{n+m})\right\}

where \boldsymbol{1}_{n+m} is the n+m vector of 1s and

\Sigma_z = \Sigma_\eta + \begin{pmatrix} \Sigma_y & 0 \\ 0 & 0 \end{pmatrix}

Conditioning on the augmented observation vector z results in the posterior:

\pi(\theta,\mu,\lambda_\eta,\beta^\eta|z) \propto L(z|\theta,\mu,\lambda_\eta,\beta^\eta,\Sigma_y)\pi(\theta)\pi(\mu)\pi(\lambda_\eta)\pi(\beta^\eta)
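To make the structure concrete, here is a rough sketch of how I picture the core of the Stan model (it relies on the cov_eta function sketched above, collapses \Sigma_y to a single noise standard deviation sigma_y, and every prior is just a placeholder; all data names are my own):

```stan
data {
  int<lower=1> n;                    // number of field observations
  int<lower=1> m;                    // number of simulator runs
  int<lower=1> p;                    // number of observable inputs x
  int<lower=1> l;                    // number of calibration inputs t
  matrix[n, p] x_field;              // x at which y was observed (re-scaled to [0, 1])
  matrix[m, p] x_sim;                // x used in the simulator runs (re-scaled to [0, 1])
  matrix[m, l] t_sim;                // t used in the simulator runs (re-scaled to [0, 1])
  vector[n] y;                       // field observations
  vector[m] eta;                     // simulator outputs
}
transformed data {
  vector[n + m] z = append_row(y, eta);   // joint data vector z = (y', eta')'
}
parameters {
  vector<lower=0, upper=1>[l] theta; // calibration inputs on the [0, 1] design scale
  real mu;                           // constant GP mean
  real<lower=0> lambda_eta;          // marginal precision of the emulator GP
  vector<lower=0>[p + l] beta_eta;   // correlation parameters
  real<lower=0> sigma_y;             // observation noise sd (stand-in for Sigma_y)
}
model {
  // stack the inputs: field rows use the calibration parameters theta,
  // simulator rows use the design values t_sim
  matrix[n + m, p + l] xt =
      append_row(append_col(x_field, rep_matrix(theta', n)),
                 append_col(x_sim, t_sim));
  matrix[n + m, n + m] Sigma_z = cov_eta(xt, beta_eta, lambda_eta, 2, 1e-8);
  for (i in 1:n)
    Sigma_z[i, i] += square(sigma_y);     // observation noise on the field rows only

  // placeholder priors -- the prior on theta is exactly what my question
  // about re-scaling is about
  theta ~ normal(0.5, 0.25);
  mu ~ normal(0, 1);
  lambda_eta ~ gamma(5, 5);
  beta_eta ~ gamma(2, 2);
  sigma_y ~ normal(0, 1);

  z ~ multi_normal(rep_vector(mu, n + m), Sigma_z);
}
```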

It is advised by Higdon et al. (2004) that:

  • The input points (x,t) are standardised to be contained in [0,1]^{p+l}

  • the data d are transformed so that \eta has a mean of 0 and a variance of 1

  • independent priors are chosen for \mu, \lambda_{\eta} and \beta^{\eta}

I hope the short description of the background theory clarifies things. My question relates to the fact that, since t are standardised to [0, 1], \pi(\theta) will need to be on the same scale. Since the model represents a building, the inputs t have a physical meaning: \theta_1 might relate to the boiler efficiency, for example, and \theta_2 might relate to the thermostat setpoint. If I have reason to believe that the thermostat setpoint may be represented by a Normal(24, 4.5), forming my prior, I’ll then need to re-scale this to be within [0, 1].
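Because the min-max transformation is linear, for the normal case the re-scaled prior follows directly (taking 4.5 as a standard deviation; t_{min} and t_{max} would come from the design of my simulator runs):

t_{std} = \frac{t - t_{min}}{t_{max} - t_{min}}, \qquad t \sim N(24, 4.5) \;\Rightarrow\; t_{std} \sim N\left(\frac{24 - t_{min}}{t_{max} - t_{min}}, \frac{4.5}{t_{max} - t_{min}}\right)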

Would your advice be to re-scale this outside of .stan and then use the re-scaled prior, or would you do that within .stan? I realise that standardisation is different from min-max scaling; I just thought that, since both of them are linear transformations, the process wouldn’t be that different if applied to the priors as well.

Many thanks,
Cali


Not universally (I didn’t check what algorithm was used in the references), and it can make the sampling efficiency much worse. In Stan’s MCMC implementation, the best efficiency is usually obtained when the posterior has approximately unit scale, as the adaptation phase is then more efficient. Suppose, for example, that the original scale of the predictor is in the thousands and the standard deviation of the lengthscale of the Gaussian process is around 1. If you now re-scale, the posterior sd of the lengthscale is around 1/1000, and if at the same time some other parameters have scale near 1, the adaptation of the mass matrix to take the different scales into account can take much longer.

In your case, based on the sampling efficiency argument I would not re-scale, especially as it seems to complicate your code and your thinking about priors.

Some other reasons to rescale (not necessarily to [0,1])

  • The predictor values have huge magnitude, so that there can be numerical issues due to the limitations of floating-point representation; then it’s reasonable to scale to a range that provides better numerical stability
  • If there are many predictors with widely different scales, but it is assumed that smoothness and relevance are similar, it may be easier to define the prior after re-scaling. This requires that the scaling is not sensitive to random realizations of the predictor values (e.g. the data set is not very small or thick-tailed)

Some reasons not to rescale

  • it’s more difficult to think about the priors on the scale of the predictor
  • the code gets more complicated
  • rescaling in the case of small data can be sensitive to random variation in the data

Since you have informative priors and the magnitudes are not huge, it seems it would be easier not to re-scale, and I don’t think this would affect the sampling speed in Stan.

I would advise starting with no re-scaling, and considering scaling only if you encounter problems.


Thank you for the detailed reply!

Is there a value for “widely different scales” you can think of? In my analysis, two extremes are a parameter that varies in the range [0, 1] and another that varies in the range [0, 330]. I’m not sure whether this would count as different enough to necessitate a transformation or not. I suspect it might be context specific and possibly the best way to find out is to simply try (which I’ll do).

So far I’ve done the rescaling by generating a large sample from my priors within R, performing the transformation on the generated sample, and then re-fitting the distribution of the transformed sample to identify the new parameters, which I now use as my priors on the [0, 1] scale. I understand this is only approximate and there might be other, more rigorous ways of doing it. I’ll compare against running the simulation without any rescaling and see what happens.

Thanks again for the insights!

Sorry, it depends.

This is relative to your prior. If you have a scale-free prior it doesn’t matter, but scale-free priors are most of the time less useful than (even weakly) informative priors. You can either scale the parameters or the priors based on the information.

Yes. It’s not possible to have a universal recommendation (like always rescaling to [0,1]) that could not be beaten by using some context-specific information.

Great, thank you for your informative reply!