# Missing data in quadratic regression

Hello,

I would like to fit a “simple” linear model, with quadratic terms, where some predictors are missing.

To setup the model in Stan, I’m using the following simulated data, as simple as possible.


N=30
MIS=10

X<-rnorm(N,2,1);
y<-rnorm(N,X^2,0.5);

df<-data.frame(
X=X,  # rows corresponding to I_obs=0 included for simplicity, but not used in stan
I_obs=c(rep(0,MIS),rep(1,N-MIS)),
y=y
)



It is clear that for each y value, there are two compatible X values, and I think the correct probability distribution for missing values is in many cases bimodal (each peak centered around \pm\sqrt{y[i]}, with more weight around +2 since the observed Xs are centered around 2.

In fact running the following model, for several X_mis variables, some chains converge around the postiive value, some chains around the negative one, but I don’t think this is the right approach, since I get all kinds of R^ warnings etc…

Any suggestion to tackle this problem?

thanks
Matteo

functions {
vector merge(array[] int I_obs,
vector X,
vector X_mis
)
{
int N=dims(X)[1];
vector[N] X_merge=X;

int k=0;

for (n in 1:N)
{

if (I_obs[n]==0)
{
k=k+1;
X_merge[n]=X_mis[k];
}

}

return(X_merge);
}

matrix expand( vector X) {

int N=dims(X)[1];
matrix[N, 2] X_full;

X_full[,1]=X;
X_full[,2]=X.*X;

return(X_full);

}

}

data {
int<lower=0> N;
vector[N] y;
vector[N] X;
array[N] int I_obs;

int<lower=0, upper=1> regress; // 0 -> no regression, 1-> regression
}

transformed data {

int totObs=0;

for (n in 1:N)
{
totObs=totObs+I_obs[n];

}

}

parameters {
vector[2] b;
real muX;
real<lower=1e-10> sigmaX;
vector [N-totObs] X_mis;

real<lower=1e-10> sigmaY;
}

model {
muX~normal(0,1);
sigmaX~normal(0,1);

b~normal(0,1);
sigmaY~normal(0,1);

vector [N] X_merge;

X_merge=merge(I_obs, X, X_mis);

X_merge~normal(muX,sigmaX);

if (regress)
y~normal(expand(X_merge)*b,sigmaY);
}



One option is to marginalize out the sign of the X_mis, similar to how mixture models are handled.
That’s not possible with merge and expand functions, so I’ll first remove them:

model {
muX~normal(0,1);
sigmaX~normal(0,1);
b~normal(0,1);
sigmaY~normal(0,1);

int k = 0;
for (i in 1:N) {
if (I_obs[i]) {
X[i] ~ normal(muX, sigmaX);
if (regress)
y[i] ~ normal(X[i]*b[1] + X[i]*X[i]*b[2], sigmaY);
} else {
k += 1;
X_mis[k] ~ normal(muX, sigmaX);
if (regress)
y[i] ~ normal(X_mis[k]*b[1] + X_mis[k]*X_mis[k]*b[2], sigmaY);
}
}
}


or in target += form

model {
...
int k = 0;
for (i in 1:N) {
if (I_obs[i]) {
...
} else {
k += 1;
if (regress)
target += normal_lpdf(X_mis[k] | muX, sigmaX) + normal_lpdf(y[i] | X_mis[k]*b[1] + X_mis[k]*X_mis[k]*b[2], sigmaY);
else
target += normal_lpdf(X_mis[k] | muX, sigmaX);
}
}
}


Now, replace the (unconstrained) X_mis with a positive-constrained absX_mis and write the model as a mixture of absX_mis and -absX_mis.

parameters {
vector[2] b;
real muX;
real<lower=1e-10> sigmaX;
vector<lower=0>[N-totObs] absX_mis;
real<lower=1e-10> sigmaY;
}
model {
muX~normal(0,1);
sigmaX~normal(0,1);
b~normal(0,1);
sigmaY~normal(0,1);

int k = 0;
for (i in 1:N) {
if (I_obs[i]) {
X[i] ~ normal(muX, sigmaX);
if (regress)
y[i] ~ normal(X[i]*b[1] + X[i]*X[i]*b[2], sigmaY);
} else {
k += 1;
if (regress)
target += log_sum_exp(
normal_lpdf(absX_mis[k] | muX, sigmaX) + normal_lpdf(y[i] | absX_mis[k]*b[1] + absX_mis[k]*absX_mis[k]*b[2], sigmaY),
normal_lpdf(-absX_mis[k] | muX, sigmaX) + normal_lpdf(y[i] | -absX_mis[k]*b[1] + (-absX_mis[k])*(-absX_mis[k])*b[2], sigmaY));
else
target += log_sum_exp(
normal_lpdf(absX_mis[k] | muX, sigmaX),
normal_lpdf(-absX_mis[k] | muX, sigmaX));
}
}
}

2 Likes

Your solution makes sense and seems to work.

thanks,
Matteo

Hello, I’m here again for a follow-up.

The problem I posted was maybe just a bit too simplified. In a more realistic setting, x_{0}=argmin_{x} [b_{1}x+b_2x^2]=-0.5\frac{b1}{b2} is not going to be known in advance and could be different from 0. For instance, the data generating process could be (with only 20 observed X values out of 200):

N=200
MIS=180

X<-rnorm(N,0,1);
y<-rnorm(N,(X-2)^2,0.5);

df<-data.frame(
X=X,
I_obs=c(rep(0,MIS),rep(1,N-MIS)),
y=y
)



I have modified the code to include include a shift x0=-0.5\frac{b1}{b2}, as shown below, but I get Rhat and e-bmfi warnings.

// based on suggestions by Niko Huurre on discourse.mc-stan Dec 14th 2023

data {
int<lower=0> N;
vector[N] y;
vector[N] X;
array[N] int I_obs;

int<lower=0, upper=1> regress; // 0 -> no regression, 1-> regression
}

transformed data {

int totObs=0;

for (n in 1:N)
{
totObs=totObs+I_obs[n];

}

print(N);

}

parameters {
vector[2] b;
real muX;
real<lower=1e-10> sigmaX;
vector<lower=0> [N-totObs] absX_mis;

real<lower=1e-10> sigmaY;
}

model {
muX~normal(0,1);
sigmaX~normal(0,1);

b~normal(0,1);
sigmaY~normal(0,1);

absX_mis~normal(0,1);

int k = 0;
real x0;
for (i in 1:N) {
if (I_obs[i]) {
X[i] ~ normal(muX, sigmaX);
if (regress)
y[i] ~ normal(X[i]*b[1] + X[i]*X[i]*b[2], sigmaY);
} else {
k += 1;
x0=-0.5*b[1]/b[2];
if (regress)
target += log_sum_exp(
normal_lpdf(x0+absX_mis[k] | muX, sigmaX) + normal_lpdf(y[i] | (x0+absX_mis[k])*b[1] + (x0+absX_mis[k])*(x0+absX_mis[k])*b[2], sigmaY),
normal_lpdf(x0-absX_mis[k] | muX, sigmaX) + normal_lpdf(y[i] | (x0-absX_mis[k])*b[1] + (x0-absX_mis[k])*(x0-absX_mis[k])*b[2], sigmaY));
else
target += log_sum_exp(
normal_lpdf(x0+absX_mis[k] | muX, sigmaX),
normal_lpdf(x0-absX_mis[k] | muX, sigmaX));
}
}
}


This is the diagnostic:

Processing csv files: C:/Users/patelmom/tmpstan/simple_2.1-202312151556-1-2828fe.csv, C:/Users/patelmom/tmpstan/simple_2.1-202312151556-2-2828fe.csv, C:/Users/patelmom/tmpstan/simple_2.1-202312151556-3-2828fe.csv, C:/Users/patelmom/tmpstan/simple_2.1-202312151556-4-2828fe.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.18, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.

Effective sample size satisfactory.

The following parameters had split R-hat greater than 1.05:
b[1], b[2], muX, absX_mis[1], absX_mis[8], absX_mis[9], absX_mis[18], absX_mis[21], absX_mis[23], absX_mis[24], absX_mis[26], absX_mis[29], absX_mis[41], absX_mis[43], absX_mis[46], absX_mis[57], absX_mis[64], absX_mis[65], absX_mis[72], absX_mis[75], absX_mis[78], absX_mis[100], absX_mis[101], absX_mis[105], absX_mis[108], absX_mis[111], absX_mis[113], absX_mis[118], absX_mis[119], absX_mis[120], absX_mis[122], absX_mis[126], absX_mis[129], absX_mis[134], absX_mis[135], absX_mis[137], absX_mis[140], absX_mis[143], absX_mis[144], absX_mis[145], absX_mis[147], absX_mis[150], absX_mis[154], absX_mis[160], absX_mis[162], absX_mis[163], absX_mis[175], absX_mis[176], sigmaY
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizating your model with additional prior information or a more effective parameterization.

Processing complete.


Thanks again for any suggestion.

Matteo

My guess is that the shift by x_0=-0.5\frac{b_1}{b_2} is problematic when b_2 \approx 0, which is allowed by your priors. And something like b[1]=-2,b[2]=0 is probably also consistent with the likelihood of the simulated data.

The model is effectively linear if \sigma_{X}\left|b_{2}\right|\ll\left|b_{1}\right|. In that range x0 is theoretically irrelevant but you should define it so that the offset stays near zero. For example, instead of

x0 = -0.5*b[1] * 1/b[2];


you could have

x0 = -0.5*b[1] * b[2]/(b[2]*b[2] + b[1]*b[1]*inv_var_x);


where

transformed data {
real inv_var_x = 1/variance(X);
}


By the way, I don’t think you should have

  absX_mis~normal(0,1);


The normal_lpdf(x0+absX_mis[k] | muX, sigmaX) term in the loop already defines the prior for absX_mis.

Niko, thanks. I think your diagnosis is correct, unfortunately the solution you propose seems just to improve things a bit but is not always working. I will keep studying (this toy problem is much harder than I thought), meanwhile any further hint would be appreciated!
thanks
Matteo

ps. by constraining the quadratic coefficient to be>0 (or <0), convergence issues are solved. It’s a half solution…