Bivariate logistic model with correlated random effects and missing data

Hi everyone, I am a newbie to Stan and not an expert in Bayesian inference. I am trying to fit a bivariate logistic model with correlated random effects and missing data. For each statistical unit I have the number of positives and the number of people examined with each of two different methods. Sometimes both methods were applied, but other times only one of the two, which is why I also have missing data. So I have two binomial outcomes that I expect to be correlated, and to induce the correlation I have introduced random effects drawn from a bivariate normal. Hence, on the logit scale the outcome is distributed as a bivariate normal as well.

Attached you will find the Stan code, a synthetic data set, and the R code to simulate the data and fit the model. Please note that I have coded all the missing values as 999 instead of NA: even though I don't use those observations in the fitting process, Stan would otherwise return an error. Apparently everything is working (the 95% credible intervals always contain the true parameters).

I would like your help for:

  1. Do you think I could make this code faster?
  2. Are the priors that I am using OK? The effective sample size for the intercepts seems to be low. What could be the problem? Maybe a Student-t would be more suitable in this case.
  3. Any other comment is more than welcome.

Many thanks!!!

simulation.R (1.4 KB)
bivariate-logistic-miss.stan (2.2 KB)
bivariate-logistic-miss.data.R (3.4 KB)

Minor programming stuff:

The terms like:

for (i in 1:N12) {
  y1[id12[i]] ~ binomial_logit(n[id12[i]], beta01 + e_vill[id12[i],1]);
  ...
}

can be vectorized by dropping the i indexes and the loop:

y1[id12] ~ binomial_logit(n[id12], beta01 + e_vill[id12,1]);

You could also precompute the n[id12] (or the y) variables in the transformed data block if you wanted. Something like:

transformed data {
  int nid12[N12] = n[id12];  // same length as id12
}

Might be more mess than it’s worth, not sure.

Regarding the model, I think in a lot of missing-data applications you want to infer the missing values, or model the missingness somehow. Missingness could be indicative of something interesting happening. I'm sure there's a chapter in BDA3 on that. It's gonna be harder if the variables are discrete.

tl;dr: missing values are about more than just getting the programming right, so check into that, but otherwise it looks good :P.

Hi Ben,

Many thanks for your comment.

Actually the first version of the code was vectorized but I received (and I still receive) the following syntax error:

No matches for:  real + real[]

That's because e_vill[id12,1] is real[]. How can I get rid of this? Should I define the error terms separately for each outcome?

Regarding the model, you are right. But this is a particular case: I am not interested in inferring the missing values, and I know exactly why some observations are missing. They were testing people with method 1 in the past, then a more efficient method was proposed, there was one year of overlap, and now they are using method 2 only. So my case is more like the one described in Chapter 11.5 of the manual. I use the marginal specification when I have missing values for one of the two outcomes because it helps to infer the betas and the variance.
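Roughly, what I mean is something along these lines (a simplified sketch; id1 and N1 are just illustrative names for the units observed with method 1 only, and the unobserved component of the random effect never enters a likelihood term, so it gets integrated out):

// units with only method 1 observed: add only that outcome's likelihood
for (i in 1:N1) {
  y1[id1[i]] ~ binomial_logit(n[id1[i]], beta01 + e_vill[id1[i], 1]);
}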

No matches for: real + real[]

Ooooh, my mistake, sorry, wow, not paying attention.

Yeah, that real there messes things up. You've either gotta do a transformed parameter, or turn things into vectors and add them up.

But the transformed parameters solution just moves your loop to another place in the code, so that doesn’t seem great. And then making things vectory:

y1[id12] ~ binomial_logit(n[id12], rep_vector(beta01, N12) + to_vector(e_vill[id12, 1]));

kinda hurts the readability of the code in my opinion.
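For completeness, the transformed-parameters route would look something like this (eta1 and N_vill are names I'm making up here; as I said, the loop just moves):

transformed parameters {
  vector[N_vill] eta1;                  // logit-scale mean for outcome 1
  for (j in 1:N_vill)
    eta1[j] = beta01 + e_vill[j, 1];
}
model {
  y1[id12] ~ binomial_logit(n[id12], eta1[id12]);
}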

Can:

e_vill ~ multi_normal_cholesky(mu, L_Sigma);// PRIOR ON RANDOM EFFECTS

Be:

e_vill ~ multi_normal_cholesky({beta01, beta02}, L_Sigma);

Aren't beta01 and beta02 just the means there? That would change the parameterization of the problem, though, which even if it improves readability could mess sampling up.

tl;dr: I think what you did is the way to go, haha.


Hi Ben,

Thank you for your reply. I am moving forward with the model and I have realised some really weird behaviour.

If I put two intercepts in the model I get a negative correlation between the two outcomes; otherwise, with a common intercept for the two outcomes, I get a positive correlation. I think I am missing something really basic here.

Moreover, if you look at the pairs plot reported below, it is possible to spot a problem: the standard deviation for the second outcome (tau2) is highly correlated with the lp__ and with the energy.


I also tried to reparameterise the half-Cauchy prior that I put on the sd through a uniform, but nothing changed. Any ideas?
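(For clarity, by reparameterising through a uniform I mean the usual tan trick, roughly like this, with 2.5 as a placeholder scale:)

parameters {
  real<lower=0, upper=pi()/2> tau2_unif;       // uniform angle
}
transformed parameters {
  real<lower=0> tau2 = 2.5 * tan(tau2_unif);   // implies tau2 ~ half-Cauchy(0, 2.5)
}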

Many thanks,

Claudio

Yes, you are totally right about this. Actually, if I implement this version, for some reason Stan samples better from the betas: I get a much higher effective sample size.

I’ve never heard of this being a problem, but I keep away from trying to read into energies/lps unless it’s divergences or something computational I’m trying to diagnose. @avehtari, is this an indication of anything?

So you’re saying if instead of keeping a separate beta01 and beta02 that if you just share one value between the outcomes, the off-diagonals of L_Sigma * L_Sigma’ are positive? And if you keep them separate the values are negative?

That doesn't sound impossible. If the scale of the intercepts were large, and that effect had to be absorbed by the e_vill variables, then maybe it would affect those variables in some significant way.

Maybe add the values of beta01 + e_vill[id12[i],1] and beta02 + e_vill[id12[i],2] to your generated quantities, then compare the respective values between models and see if there's a big difference in the numbers getting fed into the binomial_logit.
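Something like this, roughly (lin_pred1/lin_pred2 are names I'm making up here):

generated quantities {
  vector[N12] lin_pred1;   // value fed to binomial_logit for outcome 1
  vector[N12] lin_pred2;   // value fed to binomial_logit for outcome 2
  for (i in 1:N12) {
    lin_pred1[i] = beta01 + e_vill[id12[i], 1];
    lin_pred2[i] = beta02 + e_vill[id12[i], 2];
  }
}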

Just thinking about the model, I’d probably lean towards the two intercept thing unless Y1 and Y2 really are on the same scale and all that. Maybe there’s a model selection thing you could do to help you make a decision? If @avehtari stops by, he can probably point you in the right direction.

When the standard deviation is small, the distribution is narrow and the densities are higher, so it's completely natural.
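(Concretely: the log density of a normal at its mode is -log(sigma) - log(sqrt(2*pi)), which grows as sigma shrinks, so a small tau2 naturally goes together with a higher lp__.)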


Yes, this is typical. Changing most parameters over their marginal posterior has a strong effect on the log density. Usually you get a peak with log density falling off on either side.

We don't sample the log density per se, so it's not an issue. The main problem is when there's a big range of log densities: we can sometimes have trouble exploring because each iteration is bounded (chi-square in N) by the initial momenta, which determine the potential change in log density within that iteration (the Hamiltonian, i.e. the negative log density plus a kinetic energy given by a quadratic form in the momenta, is conserved).
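Spelling that out, the conserved quantity along a trajectory is H(q, p) = -log p(q) + (1/2) * p' * M^{-1} * p, where q are the parameters, p the freshly drawn momenta, and M the mass matrix; the initial draw of p therefore caps how much the log density can change within a single iteration.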

This kind of thing isn’t that unusual—parameters can soak up unexplained variance and presumably what you’re looking at is some kind of residual.

In other words, it’s a funnel alert.
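(If it is a funnel, the usual fix in Stan is a non-centered parameterization of the random effects. A rough sketch, with made-up names since I haven't matched them to your model, and assuming zero-mean random effects:)

parameters {
  matrix[2, N_vill] z;                 // standard normal innovations
  cholesky_factor_corr[2] L_Omega;     // Cholesky factor of the correlation matrix
  vector<lower=0>[2] tau;              // scales (tau1, tau2)
}
transformed parameters {
  // rows of e_vill are bivariate normal with scales tau and
  // correlation L_Omega * L_Omega'; add the mean back in if it isn't zero
  matrix[N_vill, 2] e_vill = (diag_pre_multiply(tau, L_Omega) * z)';
}
model {
  to_vector(z) ~ std_normal();
  // priors on tau and L_Omega and the likelihood terms stay as before
}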


This is an interesting point. I remember Betancourt’s Riemannian-HMC paper talking about how this limits exploration of log-densities of different magnitudes.

I have a stupid question though: why not just make the momentum draw something heavy-tailed like a multivariate student-t?

Betancourt is @betanalpha here.

Stability: we're only using gradients to approximate the function's curvature. If we take very large momenta, the approximation needs to crank down the step size to remain stable, and it doesn't seem to be a win in the end. Same problem with approaches like jittering the step size to get really low step sizes to escape being stuck in high-curvature regions.
