Extracting predicted data from multivariate probit model

Hi,

If I fit a multivariate probit model, such as this one from the manual, how would I obtain a set of predicted binary observations, \hat{y}, from it? My version has a mixture likelihood, so it is a little more complicated than the example in the manual, but I figure doing it for the non-mixture version would be a good start.

Draw from a multivariate normal distribution for each realization of the mean vector and covariance matrix, and set each posterior prediction equal to 1 if the corresponding element of the realization is positive and 0 otherwise.
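In Stan, something like the following in generated quantities (a minimal sketch for the non-mixture case; mu, Sigma, N, and K are placeholder names for the mean vector, covariance matrix, number of observations, and number of outcomes):

    generated quantities {
      array[N, K] int y_hat;  // predicted binary observations
      for (n in 1:N) {
        // one draw of the latent multivariate normal per observation
        vector[K] z = multi_normal_rng(mu, Sigma);
        for (k in 1:K)
          y_hat[n, k] = z[k] > 0;  // 1 if the realization is positive
      }
    }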


Thanks for the response @bgoodri

As well as the set of predicted binary observations, I am then constructing 2x2 tables for the response patterns from these predictions. My attempt at doing so is given below. It runs, but some of the predicted counts (e[s,1], …, e[s,4]) are further off from the observed counts than they should be, so I think there is something wrong with how I'm obtaining the posterior predictions.

My likelihood is a mixture, and I am writing each bivariate normal as a marginal normal_lpdf times a conditional normal_lpdf rather than using multi_normal_cholesky_lpdf (commented out), as it seems to be substantially faster (both seem to give the same results):

 for (s in 1:NS) { // study index s
   // log mixture weights: Pr(non-diseased) and Pr(diseased)
   real lp_bern1 = bernoulli_lpmf(0 | p[s]);
   real lp_bern2 = bernoulli_lpmf(1 | p[s]);
   for (n in 1:ns[s]) { // individual index n
     // bivariate normal factored as marginal * conditional; note the
     // conditional sd is sqrt(1 - rho^2), not 1, for a unit-variance
     // bivariate normal with correlation rho
     real lp1 = lp_bern1 // non-diseased
                + normal_lpdf(z[s,n,1] | a1_m[1], 1)
                + normal_lpdf(z[s,n,2] | a2_m[1] + rho0[s] * (z[s,n,1] - a1_m[1]),
                              sqrt(1 - square(rho0[s])));
             // + multi_normal_cholesky_lpdf(z[s,n,1:2] | nd_m[1:2], L_Omega0[s,1:2]);
     real lp2 = lp_bern2 // diseased
                + normal_lpdf(z[s,n,1] | a1_m[2], 1)
                + normal_lpdf(z[s,n,2] | a2_m[2] + rho1[s] * (z[s,n,1] - a1_m[2]),
                              sqrt(1 - square(rho1[s])));
             // + multi_normal_cholesky_lpdf(z[s,n,1:2] | d_m[1:2], L_Omega1[s,1:2]);
     target += log_sum_exp(lp1, lp2);
   }
 }

In the generated quantities block I have:

 // generate model-predicted z's
 // (note: d_m[1] = a1_m[2], d_m[2] = a2_m[2], nd_m[1] = a1_m[1], nd_m[2] = a2_m[1])
 for (s in 1:NS) {
   for (n in 1:ns[s]) {
     // draw ONE disease indicator per individual; using two independent
     // bernoulli_rng draws here would mix the components incorrectly
     int d = bernoulli_rng(p[s]);
     z_d[s,n,1:2]  = multi_normal_rng(d_m[1:2], sigma_d[s,1:2,1:2]);
     z_nd[s,n,1:2] = multi_normal_rng(nd_m[1:2], sigma_nd[s,1:2,1:2]);
     z_[s,n,1:2]   = d * z_d[s,n,1:2] + (1 - d) * z_nd[s,n,1:2];
   }

   // posterior predictions: threshold the latent draws at zero
   for (n in 1:ns[s]) {
     for (i in 1:2) {
       y_hat[n,i,s] = z_[s,n,i] > 0;
     }

     // 2x2 table for each individual (cells in the order ++, +-, -+, --)
     ei[s,n,1] = y_hat[n,1,s] == 1 && y_hat[n,2,s] == 1;
     ei[s,n,2] = y_hat[n,1,s] == 1 && y_hat[n,2,s] == 0;
     ei[s,n,3] = y_hat[n,1,s] == 0 && y_hat[n,2,s] == 1;
     ei[s,n,4] = y_hat[n,1,s] == 0 && y_hat[n,2,s] == 0;
   }

   // construct model-predicted 2x2 tables for each study s
   e[s,1] = sum(ei[s, 1:ns[s], 1]);
   e[s,2] = sum(ei[s, 1:ns[s], 2]);
   e[s,3] = sum(ei[s, 1:ns[s], 3]);
   e[s,4] = sum(ei[s, 1:ns[s], 4]);
 }

edit:

I am trying to work out how to reconstruct the 2x2 tables (for each study) to explore the model fit. Each cell represents a possible classification from the two diagnostic tests (positive on tests 1 and 2, negative on 1 and positive on 2, etc.). One of these cells (“positive on test 1 and test 2”) is also needed to calculate the expected correlation between the test 1 and test 2 results for each study, which requires this joint probability. For each study s, this correlation is given by:

\frac{Pr(Y_{1,s} = 1, Y_{2,s} = 1) - Pr(Y_{1,s} = 1)*Pr(Y_{2,s} = 1)}{\sqrt{Pr(Y_{1,s} = 1)*Pr(Y_{2,s} = 1)*(1-Pr(Y_{1,s} = 1))*(1-Pr(Y_{2,s} = 1))}}

Where Y_{i,s} represents the result of test i from study s
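As a sketch, this could be computed in generated quantities from the predicted cell counts above; corr_hat and the pr names below are placeholders:

    // hypothetical sketch: expected correlation between the binary test
    // results for study s, from the predicted 2x2 cell counts e[s,1:4]
    real pr11 = e[s,1] * 1.0 / ns[s];            // Pr(Y1 = 1, Y2 = 1)
    real pr1  = (e[s,1] + e[s,2]) * 1.0 / ns[s]; // Pr(Y1 = 1)
    real pr2  = (e[s,1] + e[s,3]) * 1.0 / ns[s]; // Pr(Y2 = 1)
    corr_hat[s] = (pr11 - pr1 * pr2)
                  / sqrt(pr1 * pr2 * (1 - pr1) * (1 - pr2));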

I was thinking that another way to reconstruct these 2x2 tables, as opposed to what I have tried above, is:

Pr(Y_{1,s} = 1, Y_{2,s} = 1) = p_{s} * (Se_{1}*Se_{2} + cov_{s,d=1}) + (1-p_{s}) * ((1-Sp_{1})*(1-Sp_{2}) + cov_{s,d=0})
Pr(Y_{1,s} = 1, Y_{2,s} = 0) = p_{s} * (Se_{1}*(1-Se_{2}) - cov_{s,d=1}) + (1-p_{s}) * ((1-Sp_{1})*Sp_{2} - cov_{s,d=0})
Pr(Y_{1,s} = 0, Y_{2,s} = 1) = p_{s} * ((1-Se_{1})*Se_{2} - cov_{s,d=1}) + (1-p_{s}) * (Sp_{1}*(1-Sp_{2}) - cov_{s,d=0})
Pr(Y_{1,s} = 0, Y_{2,s} = 0) = p_{s} * ((1-Se_{1})*(1-Se_{2}) + cov_{s,d=1}) + (1-p_{s}) * (Sp_{1}*Sp_{2} + cov_{s,d=0})

Where p_{s} is the disease prevalence in study s, cov_{s,d} is the covariance term for disease group d in study s, Se_{i} and Sp_{i} are the sensitivity and specificity for test i.

Since

cov_{s,d=1} = \rho_{s,d=1}* \sqrt{Se_{1}*Se_{2} *(1-Se_{1})*(1-Se_{2})} , and

cov_{s,d=0} = \rho_{s,d=0}* \sqrt{Sp_{1}*Sp_{2} *(1-Sp_{1})*(1-Sp_{2})} ,
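In generated quantities, a sketch of this alternative reconstruction could look like the following; prob and e_alt are placeholder names, and rho_b1 and rho_b0 stand for the binary-scale correlations (which, as noted below, are not the latent correlations):

    for (s in 1:NS) {
      // binary-scale covariances within each disease group
      real cov_d  = rho_b1[s] * sqrt(Se[1] * Se[2] * (1 - Se[1]) * (1 - Se[2]));
      real cov_nd = rho_b0[s] * sqrt(Sp[1] * Sp[2] * (1 - Sp[1]) * (1 - Sp[2]));
      // cell probabilities in the order ++, +-, -+, --
      prob[s,1] = p[s] * (Se[1] * Se[2] + cov_d)
                  + (1 - p[s]) * ((1 - Sp[1]) * (1 - Sp[2]) + cov_nd);
      prob[s,2] = p[s] * (Se[1] * (1 - Se[2]) - cov_d)
                  + (1 - p[s]) * ((1 - Sp[1]) * Sp[2] - cov_nd);
      prob[s,3] = p[s] * ((1 - Se[1]) * Se[2] - cov_d)
                  + (1 - p[s]) * (Sp[1] * (1 - Sp[2]) - cov_nd);
      prob[s,4] = p[s] * ((1 - Se[1]) * (1 - Se[2]) + cov_d)
                  + (1 - p[s]) * (Sp[1] * Sp[2] + cov_nd);
      // expected 2x2 counts for study s
      for (j in 1:4)
        e_alt[s,j] = ns[s] * prob[s,j];
    }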

I could work out the 2x2 tables this way instead. The problem is, \rho_{s,d} are not the same as the correlations between the latent continuous variables from the multivariate probit model, but rather the correlations between the binary test result data. So, I'd work that out first in generated quantities (I don't think Stan has a built-in function to work out the correlation between two vectors?). I will try that out and update if it works.
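As far as I know Stan indeed has no built-in correlation function, but one can be written in the functions block, e.g. (a sketch; corr is a made-up name, and the binary results would need to be converted to vectors with to_vector() first):

    functions {
      // Pearson correlation of two vectors; Stan provides mean(),
      // dot_product(), and dot_self(), but no cor()
      real corr(vector x, vector y) {
        int n = num_elements(x);
        vector[n] xc = x - mean(x);
        vector[n] yc = y - mean(y);
        return dot_product(xc, yc) / sqrt(dot_self(xc) * dot_self(yc));
      }
    }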

@martinmodrak I originally had this edit in my reply to you on a different thread, but moved it here where it is more relevant.

Generally, this might also mean there is a problem elsewhere in the model and the predictions are just fine. I admit I tried to follow what you are doing but was unable to get a good grasp of the model. I think a good way forward would be to simplify the model (e.g. remove some dimensions: what if n = 1?) and find the smallest model that is still problematic. This would be both easier for us on the forums to understand and more likely to give some good insight into what is happening.

Sorry that I can’t help you more.

Sorry, I forgot to update the post: the posterior predictive is fine now. Calculating the correlations in generated quantities, as I said I would try, works. The model is slow for the full dataset of ~11,000 obs using the multivariate probit parameterization from the Stan manual (> 2 hours), so now I'm looking at using @bgoodri's parameterization, which uses the GHK algorithm, to try to speed it up.
