Predict from a phylogenetic multilevel model trained on standardised data, but with new data that has not been standardised

When using non-standardised variables to train a model I have used non-standardised new data to make predictions

predict(f1, newdata = n, allow_new_levels = TRUE)

However, my model fits better when the variables are standardised, and I want to predict for a new data point that has not been standardised. For this situation I have been advised to un-standardise the estimated coefficients and use them to predict the new case. Is there a way to do this within predict.brmsfit, specifically when allowing new levels on a phylogenetic tree?

If not, do you have any recommendations for reading on, or implementing, what predict.brmsfit does “manually”, so that I can un-standardise myself? Or do you see a better solution?

Thank you!

Is there a reason you can’t just standardise the new data points as well before prediction using the same mean and standard deviation you used to standardise the original data?
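
A minimal sketch of that suggestion in base R, assuming the predictor was standardised with scale() before fitting. The values and the column name x_z are made up for illustration; f1 and n are the objects from the question above.

```r
# Original (training) predictor; values are illustrative only
x <- c(2.1, 3.4, 5.0, 6.2, 8.3)

# scale() stores the mean and sd it used as attributes
x_scaled <- scale(x)
centre <- attr(x_scaled, "scaled:center")  # mean of the training data
spread <- attr(x_scaled, "scaled:scale")   # sd of the training data

# New, non-standardised values are put on the training scale
x_star <- c(4.0, 10.0)
x_star_z <- (x_star - centre) / spread

# Hypothetical next step, assuming the model was fitted on a column x_z:
# n$x_z <- x_star_z
# predict(f1, newdata = n, allow_new_levels = TRUE)
```

Saving centre and spread alongside the fitted model means new data can be transformed later without needing the original data.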


Thanks for your reply @cgoold!

I thought of doing it that way, but I can’t make up my mind whether my research aim remains valid this way. Now that I think about it, is your suggestion logically identical to my initial question?

Thank you!

Hi @Andreas

I’m not familiar with phylogenetic analyses, so I can’t comment much on what is valid. What I understood was that you had some predictor variable x, which you standardised before fitting your model, e.g. if you were fitting a simple linear regression:

x_{z} = \frac{x - \bar{x}}{sd_{x}}

y \sim N(\alpha + \beta x_{z}, \sigma)

To make predictions y^{*} for new values of your predictor, x^{*}, you need:

y^{*} \sim N( \hat{\alpha} + \hat{\beta} x_{z}^{*}, \hat{\sigma})

where the hat parameters are the estimated values (e.g. posterior means), and x^{*}_{z} are the new x^{*} data points on the standardised scale used to estimate the model:

x^{*}_{z} = \frac{x^{*} - \bar{x}}{sd_{x}}

In R, this could look something like:

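# x = original predictor
# x_star = new predictor values
# N = number of predictions for each value of x_star
#     (set N to the number of posterior draws so each draw is used once)
# draws = posterior draws, with columns alpha, beta and sigma
# predictions = N x length(x_star) matrix of predicted y_star values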

x_star_z <- ( x_star - mean(x) )/sd(x)

predictions <- sapply(x_star_z, 
                      function(z) rnorm(N, 
                                        draws$alpha + draws$beta*z, 
                                        draws$sigma
                                       )
               )

This is correct! What confused me is the difference between

x_star_z <- ( x_star - mean(x) )/sd(x)

and

x_star_z <- ( x_star - mean(c(x, x_star)) )/sd(c(x, x_star))

which should I use?

However, if I’m not mistaken, I think the first alternative is logically equivalent to un-standardising the coefficients estimated with standardised predictors and then using them to predict from new (non-standardised) data. But I could be wrong!

@Andreas I don’t understand what your second line of code would mean (why subtract x_star from both means?).

And I don’t understand what you mean by (un)standardised coefficients here. Standardised coefficients are coefficients estimated from standardised data.

What I’m trying to do in the second line is to standardise when using the mean and standard deviation from all the data, i.e. x and x_star, instead of just x. This will naturally change the value of x_star_z and this is what confused me initially.

My initial confusion was this. I estimate coefficients from standardised data. But then how do I predict from new data? I need to standardize the new data somehow. You seem to suggest that I use the mean and sd from x. But why should I not use the mean and sd calculated from the joint data x and x_star?

Regarding un-standardisation, I’m not sure this is possible; I read about it somewhere. The idea is that maybe there is a way to back-track the standardisation procedure like this

x_star_z * sd(x) + mean(x)

but with the coefficients estimated on standardised data and then use these “reversed standardised” coefficients to predict with new data. But I don’t know if that’s possible.

Thanks for your patience!

I understand your original question. To predict from new data, you need to put that new data on the same scale as the scale used in the original fitting of the model. So you need to center the new x values around the old x values’ mean, and scale the new x values by the old x values’ standard deviation. That is the only way to make the two standardised variables equivalent.
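
As a rough check of this point, here is a sketch on simulated data, with ordinary least squares (lm) standing in for the brms model. All names and values are illustrative. With a Bayesian model the priors sit on the standardised scale, so the equivalence to a raw-scale fit would only be approximate.

```r
# Simulated data; everything here is illustrative, not from the thread
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)       # original (training) predictor
y <- 2 + 0.5 * x + rnorm(50, sd = 1)    # response
x_star <- c(4, 12, 25)                  # new, non-standardised values

# Fit the same model on the raw and on the standardised scale
x_bar <- mean(x)
x_sd  <- sd(x)
d <- data.frame(y = y, x = x, x_z = (x - x_bar) / x_sd)
fit_raw <- lm(y ~ x, data = d)
fit_z   <- lm(y ~ x_z, data = d)

# Option 1: standardise the new data with the OLD mean and sd
new_old <- data.frame(x_z = (x_star - x_bar) / x_sd)
pred_raw <- predict(fit_raw, newdata = data.frame(x = x_star))
pred_old <- predict(fit_z, newdata = new_old)

# Option 2: standardise with the mean and sd of x and x_star combined
x_all <- c(x, x_star)
new_joint <- data.frame(x_z = (x_star - mean(x_all)) / sd(x_all))
pred_joint <- predict(fit_z, newdata = new_joint)

# Compare each option with the raw-scale predictions
print(cbind(pred_raw, pred_old, pred_joint))
```

Option 1 reproduces the raw-scale predictions up to numerical tolerance. Option 2 does not, because the fitted coefficients refer to a scale defined by the old mean and standard deviation.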

To unstandardise a variable, you just rearrange the algebra. If:

x_{z} = \frac{x - \bar{x}}{sd_{x}}

is the standardised x variable, then you can get the original x back with:

x = x_{z} sd_{x} + \bar{x}
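
The same substitution can be applied to the linear predictor. As a sketch for the simple one-predictor regression above only (no group-level or phylogenetic terms), and writing sd_{x} for the standard deviation of the original x:

\hat{\alpha} + \hat{\beta} x^{*}_{z} = \hat{\alpha} + \hat{\beta} \frac{x^{*} - \bar{x}}{sd_{x}} = \left( \hat{\alpha} - \frac{\hat{\beta} \bar{x}}{sd_{x}} \right) + \frac{\hat{\beta}}{sd_{x}} x^{*}

So “un-standardised” coefficients would be an intercept of \hat{\alpha} - \hat{\beta} \bar{x} / sd_{x} and a slope of \hat{\beta} / sd_{x}. Applying them to the raw x^{*} gives exactly the same linear predictor as standardising x^{*} with the old mean and standard deviation.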

But I don’t think this helps your situation.

Perhaps if you post a minimal working example (code, data/fake data) then people on this forum can help you predict from the new data.

Actually, I think this gives me the answer I was looking for. Is there a reference you can think of on this topic, i.e. standardising new data for prediction?

Thank you!

I don’t think there is a reference, because it’s just moving between standardised and unstandardised variables. It’s just algebra and making sure the variables are on the same scale.