Problem estimating heteroskedastic model

Hello,

I am stumped by a problem that I thought upfront I understood. It is a generic linear modeling problem involving a specific type of heteroskedasticity; the issue isn’t specific to Stan, but I am hoping there is a way of ‘fixing’ estimation via Stan. Consider the following data generating model:
y_i=ax_i+(b_0+b_1x_i)y_i^\star
where
y_i^\star=hG_i+e\epsilon_i.
We assume that y^\star has unit variance via the assumptions that e=\sqrt{1-h^2}, that G has unit variance, and that \epsilon is (unobserved) white noise with unit variance. Expanding, we have
y_i=ax_i + b_0 h G_i + b_1 h x_i G_i + b_0 e \epsilon_i + b_1 e x_i \epsilon_i.

We can replace the products of coefficients and rewrite this as:
y_i=ax_i+\pi_0 G_i+\pi_1 G_i x_i +\lambda_0 \epsilon_i+\lambda_1 x_i \epsilon_i.
Estimation should be fairly straightforward (or so I thought), as
E(y_i)=ax_i+\pi_0 G_i+\pi_1 G_i x_i
since E(\epsilon)=0 and
V(y_i)=V((\lambda_0+\lambda_1 x_i)\epsilon)=(\lambda_0+\lambda_1 x_i)^2
since \epsilon has unit variance.
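
As a quick numerical check of those two expressions (just a sketch, using the same parameter values as the simulation code further down; exact numbers will vary with the seed):

## at fixed x and G, y should have mean a*x + h*G*(b0 + b1*x)
## and sd |(b0 + b1*x)*sqrt(1-h^2)| = |lambda0 + lambda1*x|
a <- 0.5; b0 <- 0.5; b1 <- 0.2; h <- sqrt(0.6); e <- sqrt(1 - h^2)
x <- 1; G <- 0.3
y <- a*x + (b0 + b1*x) * (h*G + e*rnorm(1e5))
c(mean(y), a*x + h*G*(b0 + b1*x))   # empirical vs implied mean
c(sd(y), abs((b0 + b1*x) * e))      # empirical vs implied sd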

We implement this in Stan (via rstan) as:

stanmodelcode <- "
     data {
       int<lower=0> N;
       real y[N];
       real g[N];
       real x[N];
     }      
     parameters {
       real a;
       real pi0;
       real pi1;
       real lambda0;
       real lambda1;
     } 
     transformed parameters {
       vector[N] mu;
       vector[N] sigma;
       for (i in 1:N) mu[i]=a*x[i]+pi0*g[i]+pi1*x[i]*g[i];
       for (i in 1:N) sigma[i]=(lambda0+lambda1*x[i]);
     }
     model {
       for (i in 1:N) y[i] ~ normal(mu[i],sigma[i]);
       a ~ normal(0,10);
       pi0 ~ normal(0,10);
       pi1 ~ normal(0,10);
       lambda0 ~ normal(0,10);
       lambda1 ~ normal(0,10);
     } 
     "

We take G to be standard normal. The issue has to do with the distribution of x. If x is drawn from a uniform distribution, all is well (focusing on recovery of the \pi_1 and \lambda_1 parameters). An illustration:

library(rstan)
sm<-stan_model(model_code=stanmodelcode,model_name='het_weirdness')
## simulate data from the generating model above; returns the implied
## pi1 = b1*h and lambda1 = b1*e along with the simulated data
simdata<-function(x,b0=.5,b1=.2,h=sqrt(.6),a=.5) {
    N<-length(x)
    e<-sqrt(1-h^2)
    G<-rnorm(N,0,1)
    eps<-rnorm(N,0,1)
    ystar<-h*G+e*eps
    y<-a*x+(b0+b1*x)*ystar 
    df<-data.frame(x=x,y=y,g=G)
    list(pi1=b1*h,lambda1=b1*e,df=df)
}

set.seed(8675309)
##
x.unif<-runif(2500,-1.5,1.5)
L<-simdata(x.unif)
pi1<-L$pi1
lambda1<-L$lambda1
df<-L$df
dat<-list(N=nrow(df),y=df$y,x=df$x,g=df$g)
fit.u<-sampling(sm,data=dat,pars=c('mu','sigma'),include=FALSE,chains=2,cores=2)
par(mfrow=c(1,2))
hist(extract(fit.u,'pi1')[[1]],xlim=c(0,.2))
abline(v=pi1,col='red',lwd=2)
hist(extract(fit.u,'lambda1')[[1]],xlim=c(0,.2))
abline(v=lambda1,col='red',lwd=2)

So far, so good [note: there are some warnings that I need to resolve and I’m open to suggestions there, but estimation seems to converge to truth]. Now, the trouble: if x is instead sampled from the standard normal, estimation goes awry.

x.norm<-rnorm(2500)
L<-simdata(x.norm)
pi1<-L$pi1
lambda1<-L$lambda1
df<-L$df
dat<-list(N=nrow(df),y=df$y,x=df$x,g=df$g)
fit.n<-sampling(sm,data=dat,pars=c('mu','sigma'),include=FALSE,chains=2,cores=2)
par(mfrow=c(1,2))
hist(extract(fit.n,'pi1')[[1]],xlim=c(0,.2))
abline(v=pi1,col='red',lwd=2)
hist(extract(fit.n,'lambda1')[[1]],xlim=c(0,.2))
abline(v=lambda1,col='red',lwd=2)

Note that the estimate of \lambda_1 is especially bad (the posterior is well below the true value).

We’re trying to figure out how to resolve this. It didn’t seem a priori like estimation should be so sensitive to the distribution of x. Yet, here we are. Frequentist MLE has the same problem. A collaborator has experimented with Mplus (wherein \epsilon is treated as a latent variable) and tells us that estimation via that software does not have this problem. Our preference is to not go that route; it seems like this is something we should be able to resolve via some tweak to the above. Any help greatly appreciated!!!

Ben


I looked at this a bit and I do not have a complete explanation but I hope I can make some useful observations. I am a novice with Stan, so treat my input accordingly.

  1. The model does not have any limit on the value of sigma; it is simply calculated from other quantities:
sigma[i]=(lambda0+lambda1*x[i]);

and lambda0, lambda1, and x can all be negative. The lack of a limit on sigma shows up during initialization, when negative values are tried in y[i] ~ normal(mu[i],sigma[i]); and an error is thrown.

  2. The uniform distribution that you tried does return the correct value of lambda1, but expanding its limits to [-4, 4] leads to a low estimate of lambda1, similar to using a standard normal for x.
x.unif<-runif(2500,-4,4) 
#... other code ... 
summary(fit.u)$summary[, "mean"]
      a          pi0          pi1        lambda0      lambda1         lp__ 
5.017469e-01 3.826576e-01 1.496183e-01 3.444115e-01 6.381666e-02 1.722301e+03 
  3. Using a narrower normal for x than was originally tried similarly leads to a more reasonable estimate of lambda1.
x.norm<-rnorm(2500, 0, 0.3)
#...other code ...
summary(fit.n)$summary[, "mean"]
        a          pi0          pi1      lambda0      lambda1         lp__ 
   0.5099905    0.3868222    0.1427578    0.3152581    0.1104819 1649.6851628 

Stan does seem to have more trouble fitting with x derived from a standard normal than from runif(2500, -4, 4), having hundreds of divergences with adapt_delta = 0.8.
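
For reference, the adapt_delta target can be raised above the 0.8 default through rstan's control argument (a hypothetical rerun; it assumes the sm and dat objects from the original post):

## smaller sampler steps via a higher adapt_delta acceptance target
fit.n2 <- sampling(sm, data = dat, pars = c('mu','sigma'), include = FALSE,
                   chains = 2, cores = 2, control = list(adapt_delta = 0.99))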

Musings of a novice: I am surprised the model even runs; with x as low as -4, lambda1 at about its true value of 0.12, and lambda0 at about 0.3, sigma comes out negative. I guess that rather than throwing an error, the fitting process learns to stay away from regions with impossible values and keeps lambda1 small.
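
A rough illustration of that point (my own quick check, not output from the fits above), using the true values lambda0 = 0.5*sqrt(0.4) ≈ 0.32 and lambda1 = 0.2*sqrt(0.4) ≈ 0.13 implied by simdata's defaults:

## fraction of x values where the implied sigma = lambda0 + lambda1*x is negative
lambda0 <- 0.5 * sqrt(0.4); lambda1 <- 0.2 * sqrt(0.4)
frac_neg <- function(x) mean(lambda0 + lambda1 * x < 0)
set.seed(1)
c(unif_1.5 = frac_neg(runif(1e5, -1.5, 1.5)),  # 0: sigma always positive
  norm_std = frac_neg(rnorm(1e5)),             # ~0.006: negative when x < -2.5
  unif_4   = frac_neg(runif(1e5, -4, 4)))      # ~0.19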

Later addition: lambda1 drops even more if x is drawn from runif(2500, -6, 4):

           a          pi0          pi1      lambda0      lambda1         lp__ 
   0.5016476    0.3818108    0.1506006    0.4067132    0.0289117 1259.5690834 

Here’s some simulated data with x drawn from a standard normal distribution.

[figure: simulated data, y against x]

The lines show the mean ± sd for the generating process. The noteworthy feature is at x=-\frac{b_0}{b_1}=-2.5, where the variance vanishes and the model prediction is deterministic.
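
For anyone who cannot see the image, the band can be reproduced from the generating process with the simdata defaults (a sketch of the plotted quantities, not the original figure code):

## y vs x with the generating-process mean a*x and sd |b0 + b1*x|;
## the band pinches to zero width at x = -b0/b1 = -2.5
a <- 0.5; b0 <- 0.5; b1 <- 0.2
x <- sort(rnorm(2500)); y <- a*x + (b0 + b1*x) * rnorm(2500)  # Var(y*) = 1
plot(x, y, pch = ".", col = "grey40")
lines(x, a*x + abs(b0 + b1*x), col = "red")
lines(x, a*x - abs(b0 + b1*x), col = "red")
abline(v = -b0/b1, lty = 2)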

A typical posterior draw from Stan looks like this:
[figure: typical posterior draw]

The zero-variance point is always estimated to the left of all the data points. The model cannot deal with it being inside the range of x.

I doubt your real data has any zero-variance region, so I think you’ll have to come up with a model that doesn’t predict such nonsense.
The minimal change I’d make is to add a small amount of irreducible noise to all data points. Normal noise adds in quadrature, so it’s

parameters {
    ...
    real<lower=0> lambda2;  // scale of the added irreducible noise
}
transformed parameters {
    ...
    for (i in 1:N)
        sigma[i] = hypot(lambda0 + lambda1*x[i], lambda2);
    ...
}

where hypot(a,b) is just shorthand for sqrt(a^2+b^2). This model is still multimodal because the likelihood is invariant under the change of sign \left(\lambda_0,\lambda_1\right)\leftrightarrow\left(-\lambda_0,-\lambda_1\right). A lower=0 constraint on either lambda0 or lambda1 should fix that.
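
If the quadrature point is not obvious, a quick R check with arbitrary stand-in scales s1 and s2:

## independent normal noises with sds s1 and s2 combine to sd sqrt(s1^2 + s2^2)
s1 <- 0.3; s2 <- 0.1
c(sd(rnorm(1e6, 0, s1) + rnorm(1e6, 0, s2)), sqrt(s1^2 + s2^2))  # both ~0.316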


Premise: I did not read the post in full.

A few things to think about:

  • If you plot an example of your real data, someone might point you to the best route.
  • Heteroskedastic regression can be achieved with gamma regression, for example, so providing a real example of your data might be helpful.
  • If you want to put a linear model on the standard deviation, it should use a log link (e.g., normal(…, exp(linear equation))), since the standard deviation should not cross 0.

Ah, this is perfect. I feel sheepish for not having caught this. Thanks a ton @nhuurre! This is hugely helpful.