# Bayesian data augmentation with latent variable: convergence issue

I am trying to fit a simplified latent-variable model using the Bayesian data augmentation approach used by Stephen and Toubia (2010).

The basic idea is that the dependent variable `y` and the focal independent variable `z` may both be influenced by a latent variable, `latent`, and the model explicitly accounts for this. The model is as follows:

```
z = gamma_1 + gamma_2 * latent + delta
y = beta_1 + beta_2 * x_1 + beta_3 * x_2 + z_beta * delta + latent_beta * latent + epsilon
```

`latent` enters both equations for `z` and `y`. `x_1` and `x_2` are two control variables. `gamma_1` and `beta_1` are two intercepts. `delta` and `epsilon` are two error terms.

One novelty is that the residual from the first equation, `delta`, enters the second equation instead of `z` itself.

In my simulated dataset, I set `beta=c(1,2,3)`, `gamma=c(4,5)`, `latent_beta=c(10)`, and `z_beta=c(6)`. As seen from the results, the `beta`s and `gamma_1` have been successfully recovered; however, `z_beta`, `gamma_2`, and `latent_beta` have not.

Specifically, judging from the traceplots, different chains have converged to different values for `z_beta`, `gamma_2`, and `latent_beta` (although some chains of `z_beta` did converge to 6, the value I want).

So my question is: how do I deal with this convergence issue? Is there anything wrong with my code? (I know I haven't specified priors yet; is that likely to be the problem?)

Here is the code, followed by the traceplots.

```r
# simulate the data set
n.obs <- 500

# construct x
x.beta <- c(1, 2, 3)  # first element is the intercept
n.xvar <- length(x.beta)

x <- matrix(0, ncol = n.xvar, nrow = n.obs)
x[, 1] <- 1
for (i in 2:n.xvar) {
  x[, i] <- rnorm(n.obs, 0, 1)
}

latent <- rnorm(n.obs, 0, 1)
latent.beta <- c(10)  # latent_beta for y

# z: indicator of the latent variable
z.delta <- rnorm(n.obs, 0, 1)  # error term for z
z.gamma <- c(4, 5)             # parameters for the z equation
z.iv <- cbind(1, latent)
n_z_l_var <- length(z.gamma)
z <- as.vector(z.iv %*% z.gamma + z.delta)

# z for y
z.beta <- c(6)
n_z_y_var <- length(z.beta)

# construct y
epsilon <- rnorm(n.obs, 0, 1)  # error term for y
y <- as.vector(x %*% x.beta + z.delta * z.beta + latent * latent.beta + epsilon)

data <- list(n = n.obs,
             n_xvar = n.xvar,
             n_z_l_var = n_z_l_var,
             n_z_y_var = n_z_y_var,
             y = y,
             x = x,
             z = z)
```

```r
# model
model_string <- "
data {
  int<lower=0> n;
  int<lower=0> n_xvar;
  int<lower=0> n_z_l_var;
  int<lower=0> n_z_y_var;
  vector[n] y;
  matrix[n, n_xvar] x;
  vector[n] z;
}
parameters {
  vector[n_xvar] x_beta;
  vector[n_z_l_var] z_gamma;
  real z_beta;
  real latent_beta;
  vector[n] latent;
}
model {
  vector[n] z_residual;
  z ~ normal(z_gamma[1] + latent * z_gamma[2], 1);
  z_residual = z - (z_gamma[1] + latent * z_gamma[2]);
  y ~ normal(x * x_beta + z_residual * z_beta + latent * latent_beta, 1); // likelihood
}"
```

```r
fit <- stan(model_code = model_string,
            data = data,
            pars = c("x_beta", "z_beta", "z_gamma", "latent_beta"),
            iter = 10000, warmup = 100)
```

Result

```
Inference for Stan model: 106df335f9ff6a8e24de3c8185ee63b5.
4 chains, each with iter=10000; warmup=100; thin=1;
post-warmup draws per chain=9900, total post-warmup draws=39600.

               mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
x_beta[1]      1.48    0.10  0.20    1.16    1.31    1.46    1.62    1.90     5  1.61
x_beta[2]      2.06    0.12  0.19    1.76    1.93    2.03    2.22    2.42     2  3.22
x_beta[3]      2.90    0.07  0.12    2.72    2.82    2.88    3.00    3.12     3  1.85
z_beta         3.85    2.06  2.91   -1.29    3.64    5.46    5.59    5.80     2 46.26
z_gamma[1]     4.18    0.03  0.06    4.06    4.13    4.19    4.22    4.27     3  1.59
z_gamma[2]     0.18    0.58  0.81   -0.99   -0.48    0.29    1.07    1.11     2 37.89
latent_beta    0.35    1.18  1.66   -2.05   -1.02    0.53    2.11    2.31     2 28.24
lp__        -556.18   15.96 30.02 -612.09 -579.73 -551.85 -536.13 -496.10     4  2.27

Samples were drawn using NUTS(diag_e) at Sun Jul 2 19:52:42 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

Traceplot

You should run warmup much longer than 100 iterations. There’s quite a bit that happens during warmup, and 100 iterations isn’t enough (check the manual for a description; I started typing one out here but figured I might get something wrong).

The default is 1k warmup iterations and 1k post-warmup samples. If you run with those settings, do your chains behave better?
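
For instance, dropping the explicit `iter`/`warmup` arguments falls back to rstan's defaults (a sketch, reusing the `model_string` and `data` objects from the original post):

```r
# rstan defaults: iter = 2000, warmup = floor(iter / 2) = 1000
fit <- stan(model_code = model_string,
            data = data,
            pars = c("x_beta", "z_beta", "z_gamma", "latent_beta"))
```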

Hope that helps!

Hi, thanks for the suggestion about warmup. It never occurred to me that warmup could be the cause.

But unfortunately, after I changed warmup to 1k and then 2k, the problem remains. Actually, looking at the traceplots, the chains converge very quickly, within 100 iterations.

Hmm, cool beans, well the next step is to figure out if the parameterization of your problem actually has multiple equivalent solutions.

You’ll still want to keep a long warmup. Stan will usually move from its initial guess to something close to the solution really quickly, but it still needs a long time to adjust its time step and figure out the mass matrix, so that during the sampling stage you’re getting nice, healthy, hopefully-independent samples (edit: again, check the manual on this; I’m probably not describing it very precisely).

Another fishy thing about the traceplots is the little flat areas. You don’t want those; something strange is going on there. Since you’re using R, just pop open shinystan (`launch_shinystan(fit)`) and look around and see if you spot anything weird.

What you’re looking for is `pairs()` (which you can run from RStan itself) to see if there are strong correlations or banana-like or funnel-like shapes.
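
As a sketch, restricted to the parameters that failed to be recovered (the `pars` selection here is just a suggestion):

```r
# look for banana- or funnel-like shapes among the weakly identified parameters
pairs(fit, pars = c("z_beta", "z_gamma", "latent_beta"))
```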

I indented and spaced the program so I could read it (how do people work with their programs not indented and not spaced? They’re impossible to read!):

```stan
data {
  int n;
  int n_xvar;
  int n_z_l_var;
  int n_z_y_var;
  vector[n] y;
  matrix[n, n_xvar] x;
  vector[n] z;
}
parameters {
  vector[n_xvar] x_beta;
  vector[n_z_l_var] z_gamma;
  real z_beta;
  real latent_beta;
  vector[n] latent;
}
model {
  vector[n] z_residual = z - (z_gamma[1]
                              + latent * z_gamma[2]);
  z ~ normal(z_gamma[1]
             + latent * z_gamma[2],
             1);
  y ~ normal(x * x_beta
             + z_residual * z_beta
             + latent * latent_beta,
             1);
}
```

Also, the line commented as the likelihood isn’t the whole likelihood; you need to marginalize out `latent` to get the likelihood.
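
As a sketch of what that marginalization could look like here (my own derivation, so double-check it): with `latent[i]`, `delta`, and `epsilon` all standard normal, each `(z[i], y[i])` pair is jointly Gaussian given the regression parameters, so `latent` can be integrated out analytically. A model block along these lines (requires a Stan version with `[...]` vector expressions):

```stan
model {
  matrix[2, 2] Sigma;
  Sigma[1, 1] = square(z_gamma[2]) + 1;
  Sigma[1, 2] = z_gamma[2] * latent_beta + z_beta;
  Sigma[2, 1] = Sigma[1, 2];
  Sigma[2, 2] = square(latent_beta) + square(z_beta) + 1;
  // note: Sigma is unchanged under (z_gamma[2], latent_beta) ->
  // (-z_gamma[2], -latent_beta), so the signs are not identified
  // without a prior or constraint, which would explain chains
  // settling at different values
  for (i in 1:n)
    [z[i], y[i]]' ~ multi_normal([z_gamma[1], x[i] * x_beta]', Sigma);
}
```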

I don’t see why you’re using a residual with a coefficient here, so I’m guessing there’s probably a more generative formulation of this model. I wouldn’t use `latent` as a variable name—too generic.

The real problem you’re probably running into is the `latent * latent_beta` term. This can be bad news if both aren’t strongly controlled by priors, as it introduces nasty multiplicative uncertainty (you should see a banana-like relation between each `latent[i]` and `latent_beta`).
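
A minimal sketch of adding such priors to the model block (the scales here are illustrative, not recommendations):

```stan
// illustrative priors; tune the scales to your problem
latent ~ normal(0, 1);        // matches how latent was simulated
latent_beta ~ normal(0, 5);
z_beta ~ normal(0, 5);
z_gamma ~ normal(0, 5);
x_beta ~ normal(0, 5);
// a sign constraint, e.g. declaring real<lower=0> latent_beta,
// may also be needed to pin down the sign of latent_beta and z_gamma[2]
```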