Bayesian data augmentation with a latent variable: convergence issue

I am trying to fit a simplified latent-variable model using the Bayesian data augmentation approach of Stephen and Toubia (2010).

The basic idea is that the dependent variable y and the focal independent variable z may both be influenced by a latent variable, latent, and the model explicitly accounts for this. The model is as follows:

z = gamma_1 + gamma_2*latent + delta
y = beta_1 + beta_2*x_1 + beta_3*x_2 + z_beta*delta + latent_beta*latent + epsilon

latent enters both equations for z and y. x_1 and x_2 are two control variables. gamma_1 and beta_1 are two intercepts. delta and epsilon are two error terms.

One novelty is that the residual delta from the first equation enters the second equation, rather than z itself.

In my simulated dataset, I set beta=c(1,2,3), gamma=c(4,5), latent_beta=c(10) and z_beta=c(6). As seen from the results, the betas and gamma_1 are recovered successfully; however, z_beta, gamma_2 and latent_beta are not.

Specifically, judging from the traceplots, different chains have converged to different values for z_beta, gamma_2 and latent_beta (although some chains for z_beta did converge to 6, the value I want).

So my question is: how do I deal with this convergence issue? Is there anything wrong with my code? (I know I haven't specified any priors yet, but that's not likely to be the cause, right?)

Here are the code and traceplots.

#simulating data set
n.obs=500

#construct x
x.beta=c(1,2,3) #first is intercept
n.xvar=length(x.beta)

x=matrix(0,ncol=n.xvar,nrow=n.obs)
x[,1]=1
for(i in 2:n.xvar){
  x[,i]=rnorm(n.obs,0,1)
}

#about latent ability
latent=rnorm(n.obs,0,1)
latent.beta=c(10) #latent_beta for y

##about z
#z for latent ability
z.delta=rnorm(n.obs,0,1) #error term for z
z.gamma=c(4,5) #parameter for z of latent variable
z.iv=cbind(1,latent)
n_z_l_var=length(z.gamma)
z=z.iv%*%z.gamma+z.delta
z=as.vector(z)
#z for y
z.beta=c(6)
n_z_y_var=length(z.beta)

#construct y
epsilon=rnorm(n.obs,0,1) #error term for y
y=x %*% x.beta+z.delta*z.beta+latent*latent.beta+epsilon
y=as.vector(y)

data=list(n=n.obs,
          n_xvar=n.xvar,
          n_z_l_var=n_z_l_var,
          n_z_y_var=n_z_y_var,
          y=y,
          x=x,
          z=z)

#model
model_string="
data{
int<lower=0> n;
int<lower=0> n_xvar;
int<lower=0> n_z_l_var;
int<lower=0> n_z_y_var;
vector[n] y;
matrix[n,n_xvar] x;
vector[n] z;
}
parameters{
vector[n_xvar] x_beta;
vector[n_z_l_var] z_gamma;
real z_beta;
real latent_beta;
vector[n] latent;
}
model{
vector[n] z_residual;
z~normal(z_gamma[1]+latent*z_gamma[2],1);
z_residual=z-(z_gamma[1]+latent*z_gamma[2]);
y~normal(x*x_beta+z_residual*z_beta+latent*latent_beta,1); //likelihood
}"

library(rstan)

fit=stan(model_code = model_string,
         data=data,
         pars=c("x_beta","z_beta","z_gamma","latent_beta"),
         iter = 10000, warmup = 100)

Result

Inference for Stan model: 106df335f9ff6a8e24de3c8185ee63b5.
4 chains, each with iter=10000; warmup=100; thin=1;
post-warmup draws per chain=9900, total post-warmup draws=39600.

             mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
x_beta[1]    1.48    0.10  0.20    1.16    1.31    1.46    1.62    1.90     5  1.61
x_beta[2]    2.06    0.12  0.19    1.76    1.93    2.03    2.22    2.42     2  3.22
x_beta[3]    2.90    0.07  0.12    2.72    2.82    2.88    3.00    3.12     3  1.85
z_beta       3.85    2.06  2.91   -1.29    3.64    5.46    5.59    5.80     2 46.26
z_gamma[1]   4.18    0.03  0.06    4.06    4.13    4.19    4.22    4.27     3  1.59
z_gamma[2]   0.18    0.58  0.81   -0.99   -0.48    0.29    1.07    1.11     2 37.89
latent_beta  0.35    1.18  1.66   -2.05   -1.02    0.53    2.11    2.31     2 28.24
lp__      -556.18   15.96 30.02 -612.09 -579.73 -551.85 -536.13 -496.10     4  2.27

Samples were drawn using NUTS(diag_e) at Sun Jul 2 19:52:42 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Traceplot

You should run warmup much longer than 100 iterations. There’s quite a bit that happens in warmup. 100 samples isn’t enough (check the manual for a description – I started typing one out here but I figured I might mess something up).

The default is 1k warmup, 1k samples post-warmup. If you run with those settings, do your chains behave better?
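
For reference, a rerun relying on the defaults would look something like this (assuming the same model_string and data from your post; stan() defaults to 4 chains with iter = 2000 and warmup = iter/2 = 1000 per chain):

library(rstan)

#same call as before, but leaving iter and warmup at their defaults
fit=stan(model_code = model_string,
         data=data,
         pars=c("x_beta","z_beta","z_gamma","latent_beta"))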

Hope that helps!

Hi, thanks for the suggestion about warmup. It never occurred to me that warmup could be the cause.

But unfortunately, after changing to 1k and 2k warmup iterations, the problem remains. Actually, looking at the traceplots, the chains converge very quickly, within 100 iterations.

Hmm, cool beans. Well, the next step is to figure out whether the parameterization of your problem actually has multiple equivalent solutions.

You'll still want to keep a long warmup. Stan will usually move from its initial guess to something close to the solution really quickly, but it still needs a long time to adapt its step size and figure out the mass matrix, so that during the sampling stage you're getting nice, healthy, hopefully-independent samples (edit: again, check the manual on this; I'm probably not describing it very precisely).
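
If you want to check what warmup actually settled on, rstan will show you the adapted step size and diagonal (inverse) mass matrix per chain; something along these lines (using your fit object) does it:

#adaptation info (step size and inverse mass matrix) for each chain
cat(get_adaptation_info(fit), sep = "\n")

#average post-warmup step size per chain
sapply(get_sampler_params(fit, inc_warmup = FALSE),
       function(chain) mean(chain[, "stepsize__"]))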

Another fishy thing about the traceplots is the little flat areas. You don't want those; something is going wrong there. Since you're using R, just pop open shinystan (launch_shinystan(fit)) and look around to see if anything seems weird.

What you're looking for is a pairs plot (which you can do from RStan itself) to see if there are strong correlations or banana-like or funnel-like shapes.
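
For example, something like this (the parameter names come from your model; the particular latent elements are just arbitrary picks) should show whether z_beta, z_gamma[2], and latent_beta are trading off against each other:

#pairs plot of the suspect parameters plus a couple of latent elements
pairs(fit, pars = c("z_beta", "z_gamma[2]", "latent_beta",
                    "latent[1]", "latent[2]"))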

I indented and spaced the program so I could read it (how do people work with their programs not indented and not spaced? they’re impossible to read!):

data{
  int<lower=0> n;
  int<lower=0> n_xvar;
  int<lower=0> n_z_l_var;
  int<lower=0> n_z_y_var;
  vector[n] y;
  matrix[n,n_xvar] x;
  vector[n] z;
}
parameters{
  vector[n_xvar] x_beta;
  vector[n_z_l_var] z_gamma;
  real z_beta;
  real latent_beta;
  vector[n] latent;
}
model{
  vector[n] z_residual = z - (z_gamma[1]
                              + latent * z_gamma[2]);
  z ~ normal(z_gamma[1]
             + latent * z_gamma[2],
             1);
  y ~ normal(x * x_beta
             + z_residual * z_beta
             + latent * latent_beta,
             1);
}

Also, the thing commented as likelihood isn’t all the likelihood—you need to marginalize out the latent to get the likelihood.

I don’t see why you’re using a residual with a coefficient here, so I’m guessing there’s probably a more generative formulation of this model. I wouldn’t use latent as a variable name—too generic.

The real problem you're probably running into is the latent * latent_beta term. This can be bad news if they're not both controlled strongly by a prior, as it introduces nasty multiplicative uncertainty (you should see a banana-like relation between latent_beta and each element of latent).
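
One way to see the problem: with no prior on latent, you can rescale latent by any constant c and divide z_gamma[2] and latent_beta by c (or flip the signs of all three) without changing the likelihood at all, so these parameters are only identified up to scale and sign, which matches different chains settling on different values. Here is a minimal sketch of the kind of priors that pin the scale down, written against the reformatted model block above (the specific prior widths are placeholders I made up, not part of the original model):

model{
  vector[n] z_residual = z - (z_gamma[1]
                              + latent * z_gamma[2]);
  // standard-normal prior fixes the scale of the latent variable
  // (and matches how latent was simulated)
  latent ~ normal(0, 1);
  // weakly informative priors on the coefficients; widths are placeholders
  latent_beta ~ normal(0, 5);
  z_beta ~ normal(0, 5);
  z_gamma ~ normal(0, 5);
  x_beta ~ normal(0, 5);
  z ~ normal(z_gamma[1]
             + latent * z_gamma[2],
             1);
  y ~ normal(x * x_beta
             + z_residual * z_beta
             + latent * latent_beta,
             1);
}

Even with these priors, the reflection (latent, z_gamma[2], latent_beta) -> (-latent, -z_gamma[2], -latent_beta) leaves the posterior unchanged, so different chains can still land in mirror-image modes; if the sign is known, constraining one of them (e.g., declaring latent_beta with <lower=0> in the parameters block) removes that symmetry.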