# Problem estimating heteroskedastic model

Hello,

I am stumped by a problem that I thought upfront I understood. It is a generic linear modeling problem involving a specific type of heteroskedasticity; the issue isn’t specific to Stan, but I am hoping there is a way of ‘fixing’ estimation via Stan. Consider the following data generating model:
y_i=ax_i+(b_0+b_1x_i)y_i^\star
where
y_i^\star=hG_i+e\epsilon_i.
We assume that y^\star has unit variance via the assumptions that e=\sqrt{1-h^2} and that \epsilon is (unobserved) white noise with unit variance. Expanding, we have
y_i=ax_i + b_0 h G_i + b_1 h x_i G_i + b_0 e \epsilon_i + b_1 e x_i \epsilon_i.

We can replace the products of coefficients and rewrite this as:
y_i=ax_i+\pi_0 G_i+\pi_1 G_i x_i +\lambda_0 \epsilon_i+\lambda_1 x_i \epsilon_i.
Estimation should be fairly straightforward (or so I thought), as
E(y_i)=ax_i+\pi_0 G_i+\pi_1 G_i x_i
since E(\epsilon)=0 and
V(y_i)=V((\lambda_0+\lambda_1 x_i)\epsilon)=(\lambda_0+\lambda_1 x_i)^2
since \epsilon has unit variance.
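As a quick numeric sanity check of these moment formulas (plain R; the values are the generating values used in the simulation further down):

```r
# Check: for fixed x and G, mean(y) should equal a*x + pi0*G + pi1*x*G
# and sd(y) should equal lambda0 + lambda1*x.
set.seed(1)
b0 <- 0.5; b1 <- 0.2; h <- sqrt(0.6); a <- 0.5
e <- sqrt(1 - h^2)
pi0 <- b0 * h; pi1 <- b1 * h
lambda0 <- b0 * e; lambda1 <- b1 * e
x <- 1.3; G <- 0.7
eps <- rnorm(1e6)
y <- a*x + (b0 + b1*x) * (h*G + e*eps)
c(mean(y), a*x + pi0*G + pi1*x*G)   # should agree closely
c(sd(y), lambda0 + lambda1*x)       # should agree closely
```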

We implement this in Stan (via rstan) as:

stanmodelcode <- "
data {
  int<lower=0> N;
  real y[N];
  real g[N];
  real x[N];
}
parameters {
  real a;
  real pi0;
  real pi1;
  real lambda0;
  real lambda1;
}
transformed parameters {
  vector[N] mu;
  vector[N] sigma;
  for (i in 1:N) mu[i] = a*x[i] + pi0*g[i] + pi1*x[i]*g[i];
  for (i in 1:N) sigma[i] = lambda0 + lambda1*x[i];
}
model {
  for (i in 1:N) y[i] ~ normal(mu[i], sigma[i]);
  a ~ normal(0, 10);
  pi0 ~ normal(0, 10);
  pi1 ~ normal(0, 10);
  lambda0 ~ normal(0, 10);
  lambda1 ~ normal(0, 10);
}
"


We take G to be standard normal. The issue has to do with the distribution of x. If x is drawn from a uniform distribution, all is well (focusing on recovery of \pi_1 and \lambda_1 parameters). An illustration:

library(rstan)
sm<-stan_model(model_code=stanmodelcode,model_name='het_weirdness')
simdata <- function(x, b0=.5, b1=.2, h=sqrt(.6), a=.5) {
  N <- length(x)
  e <- sqrt(1 - h^2)
  G <- rnorm(N, 0, 1)
  eps <- rnorm(N, 0, 1)
  ystar <- h*G + e*eps
  y <- a*x + (b0 + b1*x)*ystar
  df <- data.frame(x=x, y=y, g=G)
  list(pi1=b1*h, lambda1=b1*e, df=df)
}

set.seed(8675309)
##
x.unif<-runif(2500,-1.5,1.5)
L<-simdata(x.unif)
pi1 <- L$pi1
lambda1 <- L$lambda1
df <- L$df
dat <- list(N=nrow(df), y=df$y, x=df$x, g=df$g)
fit.u<-sampling(sm,data=dat,pars=c('mu','sigma'),include=FALSE,chains=2,cores=2)
par(mfrow=c(1,2))
hist(extract(fit.u,'pi1')[[1]],xlim=c(0,.2))
abline(v=pi1,col='red',lwd=2)
hist(extract(fit.u,'lambda1')[[1]],xlim=c(0,.2))
abline(v=lambda1,col='red',lwd=2)


So far, so good [note: there are some warnings that I need to resolve and I’m open to suggestions there, but estimation seems to converge to truth]. Now, the trouble: if x is instead sampled from the standard normal, estimation goes awry.

x.norm<-rnorm(2500)
L<-simdata(x.norm)
pi1 <- L$pi1
lambda1 <- L$lambda1
df <- L$df
dat <- list(N=nrow(df), y=df$y, x=df$x, g=df$g)
fit.n<-sampling(sm,data=dat,pars=c('mu','sigma'),include=FALSE,chains=2,cores=2)
par(mfrow=c(1,2))
hist(extract(fit.n,'pi1')[[1]],xlim=c(0,.2))
abline(v=pi1,col='red',lwd=2)
hist(extract(fit.n,'lambda1')[[1]],xlim=c(0,.2))
abline(v=lambda1,col='red',lwd=2)


Note that the estimate of \lambda_1 is especially bad (the posterior is well below the true value).

We’re trying to figure out how to resolve this. It didn’t seem a priori like estimation should be so sensitive to the distribution of x. Yet, here we are. Frequentist MLE has the same problem. A collaborator has experimented with Mplus (wherein \epsilon is treated as a latent variable) and tells us that estimation via that software does not have this problem. Our preference is to not go that route; it seems like this is something we should be able to resolve via some tweak to the above. Any help greatly appreciated!!!

Ben


I looked at this a bit and I do not have a complete explanation but I hope I can make some useful observations. I am a novice with Stan, so treat my input accordingly.

1. The model does not put any limit on the value of sigma; it is simply calculated from other quantities:

sigma[i]=(lambda0+lambda1*x[i]);

and lambda0, lambda1, and x can all be negative. The lack of a constraint on sigma shows up during initialization, when negative values are tried in y[i] ~ normal(mu[i],sigma[i]); and an error is thrown.

2. The uniform distribution you tried does return the correct value of lambda1, but expanding its limits to [-4, 4] leads to a low estimate of lambda1, similar to using a standard normal for x.

x.unif<-runif(2500,-4,4)
#... other code ...
summary(fit.u)$summary[, "mean"]

           a          pi0          pi1      lambda0      lambda1         lp__
5.017469e-01 3.826576e-01 1.496183e-01 3.444115e-01 6.381666e-02 1.722301e+03

3. Using a narrower normal for x than was originally tried similarly leads to a more reasonable estimate of lambda1.

x.norm<-rnorm(2500, 0, 0.3)
#... other code ...
summary(fit.n)$summary[, "mean"]

           a          pi0          pi1      lambda0      lambda1         lp__
   0.5099905    0.3868222    0.1427578    0.3152581    0.1104819 1649.6851628


Stan does seem to have more trouble fitting with x derived from a standard normal than from runif(2500, -4, 4), having hundreds of divergences with adapt_delta = 0.8.

Musings of a novice: I am surprised the model even runs when x can be -4. The true lambda1 is about 0.12 and lambda0 is about 0.3, so sigma would be negative there. I guess rather than throwing an error, the fit process learns to stay away from regions with impossible values and keeps lambda1 small.
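To put numbers on that, using the generating values from simdata() above (b0 = .5, b1 = .2, h^2 = .6):

```r
# Implied sigma at x = -4 under the true parameter values:
e <- sqrt(1 - 0.6)          # e = sqrt(1 - h^2)
lambda0 <- 0.5 * e          # b0 * e, about 0.32
lambda1 <- 0.2 * e          # b1 * e, about 0.13
lambda0 + lambda1 * (-4)    # negative, so normal(mu, sigma) is undefined here
```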

Later addition: lambda1 drops even more if x is drawn from runif(2500, -6, 4)

           a          pi0          pi1      lambda0      lambda1         lp__
0.5016476    0.3818108    0.1506006    0.4067132    0.0289117 1259.5690834


Here’s some simulated data with x drawn from a standard normal distribution. The lines show mean ± sd for the generating process. The noteworthy feature is at x=-\frac{b_0}{b_1}=-2.5, where the variance vanishes and the model prediction is deterministic.
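For reference, with the generating values in simdata() (b0 = .5, b1 = .2), that point is easy to compute:

```r
# The scale factor b0 + b1*x vanishes at x = -b0/b1:
b0 <- 0.5; b1 <- 0.2
x0 <- -b0 / b1              # -2.5
b0 + b1 * x0                # 0: y is deterministic at this x
```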

A typical posterior draw from Stan looks like this. The zero-variance point is always estimated to the left of all the data points; the model cannot deal with it being inside the range of x.

I doubt your real data has any zero-variance region so I think you’ll have to come up with a model that doesn’t predict such nonsense.
The minimal change I’d make is to add a small amount of irreducible noise to all data points. Normal noise adds in quadrature so it’s

parameters {
  ...
  real<lower=0> lambda2;
}
transformed parameters {
  ...
  for (i in 1:N)
    sigma[i] = hypot(lambda0 + lambda1*x[i], lambda2);
}


where hypot(a,b) is just shorthand for sqrt(a^2+b^2). This model is still multimodal because the likelihood is invariant under the change of sign \left(\lambda_0,\lambda_1\right)\leftrightarrow\left(-\lambda_0,-\lambda_1\right). A lower=0 constraint on either lambda0 or lambda1 should fix that.
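A quick check (plain R, with illustrative parameter values) that this keeps sigma floored at lambda2 no matter the sign of lambda0 + lambda1*x:

```r
hypot <- function(a, b) sqrt(a^2 + b^2)           # same shorthand as above
lambda0 <- 0.3; lambda1 <- 0.12; lambda2 <- 0.05  # illustrative values
x <- seq(-6, 6, by = 0.1)
sigma <- hypot(lambda0 + lambda1*x, lambda2)
all(sigma >= lambda2)       # TRUE: the noise floor holds everywhere
min(sigma)                  # attained near x = -lambda0/lambda1
```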


Premise: I did not read the post in full