Poor results when fitting with rstan

Hi everyone!

I have been dealing with an apparently simple task, but I am stuck with some results I am not comfortable with. The task is the following: I have two features (x1 and x2) and a response variable z. I am interested in predicting the values of z from the known values of x1 and x2, so I am looking for some expression z = f(x1, x2). Let’s have a look at a plot of x1 vs. x2, color-coded by z, whose values range from -0.45 (blue) to +0.45 (red), with z \sim 0 in green:

So you can see there is a kind of gradient of z in the plot. As I don’t have an expression for the function f(x1, x2), I perform a step-by-step regression, starting just with linear terms, that is, z = b_0 + b_1 x1 + b_2 x2, and then adding more complex terms and comparing with LOO-CV. I keep adding new terms until the LOO-CV stops improving, so I end up with the expression:
z=b_0+b_1x1+b_2x2+b_3x1^2+b_4x2^2
That is, including the quadratic terms but not the product x1\cdot x2. So my stan model is:

model {
  vector[N] mu;
  mu = x * beta;
  y ~ normal(mu, sigma);
  beta ~ normal(0., 10.);
  sigma ~ cauchy(0., 10.);
}
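For completeness, the Stan program above expects the data matrix x to already contain the polynomial columns. A minimal sketch of how it might be assembled in R (x1, x2, and z here are simulated stand-ins for the real data):

```r
set.seed(1)
N  <- 200
x1 <- runif(N, -1, 1)
x2 <- runif(N, -1, 1)
z  <- 0.3 * x1 - 0.2 * x2 + 0.1 * x1^2 + rnorm(N, 0, 0.05)  # hypothetical response

# Columns: intercept, linear terms, quadratic terms (no x1*x2 product)
X <- cbind(1, x1, x2, x1^2, x2^2)

stan_data <- list(N = N, K = ncol(X), x = X, y = z)
```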

where x is the data matrix and \beta the vector of coefficients. But if I plot the results, I get:

This is the same plot as above, where I have only color-coded the points with z = +0.3, 0.0 and -0.3 in red, green and blue, respectively. I have also included the curves z = f(x1, x2) for these three values of z. The green curve reproduces the points well, but in the red case the curve lies above the points, and the opposite occurs in the blue case, where the curve lies below the corresponding points. So, if I use this fit to predict the values of z, the red points obtain a lower predicted value, and the blue points a higher one, as you can see in the following plot:

In this plot I represent the true values of z vs. the predicted ones using the fit. The solid line represents the 1:1 relationship and the dashed one the actual trend of the points. As you can see, larger (positive) values of z obtain lower predicted values, while lower (negative) values obtain higher predictions. I would expect the predictions to follow the 1:1 relationship.

Does anyone know where the problem is or what I should try? I am open to any ideas, so thank you for your time.


Hello. A couple of things will help folks answer this question. Can you post the full model code, the full call to run the model, either a snippet of the data or some simulated data for folks to play with, and the version of Stan you are using?

Thank you!

Three points:

First, in general we expect the plot of predicted versus observed to have slope (as estimated by regression) less than one. The largest observations tend to be larger than the largest predictions because they arise from the conjunction of large predictor values and large positive error terms. The smallest observations tend to be smaller than the smallest predictions because they arise from the conjunction of small predictor values and large negative error terms.

Here’s a little R script to demonstrate the issue:

set.seed(123)

m <- 1  # slope
b <- 0  # intercept
s <- 1  # residual standard deviation

x <- runif(1000, -1, 1)

predicted <- m*x + b
y <- rnorm(1000, predicted, s)

lm(y ~ predicted) # Slope near 1, as expected
lm(predicted ~ y) # Slope way less than 1.

If we increase the residual standard deviation (relative to the slope and the range spanned by the data x), the issue becomes more and more pronounced. With residual standard deviation 10 in the above example, the range of the observed values is roughly -30 to 30, while the range of the predicted values is still -1 to 1, so there’s no hope that the slope of predicted ~ observed could be as high as 1.
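A sketch of that large-residual case, continuing the same simulation:

```r
set.seed(123)

x <- runif(1000, -1, 1)
predicted <- x                   # slope 1, intercept 0
y <- rnorm(1000, predicted, 10)  # residual standard deviation of 10

range(predicted)                 # still within [-1, 1]
range(y)                         # roughly -30 to 30
coef(lm(predicted ~ y))[2]       # slope far below 1
```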

Second, to estimate the shape of an unknown smooth function whose domain is the Cartesian plane, one slick trick is to use a tensor product smooth, which is easy to fit with Stan via the R package brms. For details, see brms::t2().
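A sketch with simulated data (using mgcv’s t2(), which brms wraps; the data and the surface here are made up for illustration):

```r
library(mgcv)

set.seed(1)
dat <- data.frame(x1 = runif(300, -1, 1), x2 = runif(300, -1, 1))
dat$z <- with(dat, 0.3 * x1 - 0.2 * x2^2 + rnorm(300, 0, 0.05))  # hypothetical surface

fit <- gam(z ~ t2(x1, x2), data = dat)  # tensor product smooth

# The fully Bayesian analogue in brms (same formula syntax, fit with Stan):
# fit_brms <- brms::brm(z ~ t2(x1, x2), data = dat)
```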

Third, your intuition to do some sort of assessment of the quality of your fit is very smart. In a Bayesian setting, we often use posterior predictive checking to do this assessment. Some great examples of graphical posterior predictive checks with Stan are available in @jonah et al’s “Visualization in Bayesian workflow” paper here: https://arxiv.org/pdf/1709.01449.pdf
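A minimal sketch of one such graphical check with the bayesplot package (the yrep matrix here is a stand-in; in practice it would hold posterior predictive draws extracted from the Stan fit):

```r
library(bayesplot)

set.seed(2)
y    <- rnorm(100)  # stand-in for the observed response z
yrep <- matrix(rnorm(50 * 100, mean = rep(y, each = 50), sd = 0.1),
               nrow = 50)  # 50 fake posterior predictive draws (rows = draws)

ppc_dens_overlay(y, yrep)  # observed density vs. replicated densities
```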