Multivariate probit with missing outcomes

I’ve been working with multivariate probit models recently and one thing that occurred to me is that it would be nice if one could model missing outcomes.
Assuming one enters missing values as -1, and using @bgoodri 's parametrization, would something like this make sense or is this completely wrong?

  real multiprobit_lpmf(int[] y,          // response data - integer array of 1s, 0s and -1s for missing data
                        real[] mu,        // intercept
                        real[] u,         // nuisance that absorbs inequality constraints
                        int K,            // response dimension size
                        matrix L_Omega    // lower Cholesky of correlation matrix
                        ) {
    vector[K] z;       // latent continuous state
    vector[K] logLik;  // log likelihood
    real prev;
    prev = 0;
    for (k in 1:K) {
      if (y[k] == -1) {
        z[k] = 0;
        logLik[k] = 0;
      }
      else {
        real bound; // threshold at which utility = 0
        bound = approx_Phi(-(mu[k] + prev) / L_Omega[k, k]);
        if (y[k] == 1) {
          real t;
          t = bound + (1 - bound) * u[k];
          z[k] = approx_inv_Phi(t);  // implies utility is positive
          logLik[k] = log1m(bound); // Jacobian adjustment
        }
        else {
          real t;
          t = bound * u[k];
          z[k] = approx_inv_Phi(t); // implies utility is negative
          logLik[k] = log(bound);   // Jacobian adjustment
        }
      }
      if (k < K) prev = L_Omega[k + 1, 1:k] * head(z, k);
      // Jacobian adjustments imply z is truncated standard normal
      // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
    }
    return sum(logLik);
  }

I simulated some data and got pretty reasonable results:

library(dplyr)    # mutate(), across()
library(mvtnorm)  # rmvnorm()
library(rstan)    # sampling()

set.seed(1234)
y <- mutate(as.data.frame(rmvnorm(100, c(-1, 1), matrix(c(1, 0.5, 0.5, 1), nrow = 2)) > 0)
         , across(everything(), as.numeric))

y2 <- y
y2[1:3,1] <- -1
y2[20:25,2] <- -1

## y2[5:6,2] <- -1

data_test_1 <- list(N = 100
                 , K = 2
                 , y = y)

data_test_2 <- list(N = 100
                 , K = 2
                 , y = y2)

## model_1 is the compiled Stan program listed below under "The whole model"
samples_1 <- sampling(model_1
                   , data = data_test_1
                   , cores = 4
                   , chains = 4
                   , iter = 1000
                   , warmup = 500)

samples_2 <- sampling(model_1
                   , data = data_test_2
                   , cores = 4
                   , chains = 4
                   , iter = 1000
                   , warmup = 500)



Gets me:

print(samples_1, pars = c("Intercept", "Omega"))
Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Intercept[1] -0.82    0.00 0.15 -1.13 -0.92 -0.82 -0.73 -0.54  2506 1.00
Intercept[2]  0.88    0.00 0.15  0.60  0.78  0.87  0.97  1.18  1798 1.00
Omega[1,1]    1.00     NaN 0.00  1.00  1.00  1.00  1.00  1.00   NaN  NaN
Omega[1,2]    0.59    0.01 0.20  0.16  0.46  0.60  0.74  0.93   471 1.01
Omega[2,1]    0.59    0.01 0.20  0.16  0.46  0.60  0.74  0.93   471 1.01
Omega[2,2]    1.00    0.00 0.00  1.00  1.00  1.00  1.00  1.00   129 1.00

> print(samples_2, pars = c("Intercept", "Omega"))
Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Intercept[1] -0.80    0.00 0.14 -1.08 -0.90 -0.80 -0.71 -0.53  4794    1
Intercept[2]  0.89    0.00 0.16  0.59  0.79  0.89  1.00  1.21  4359    1
Omega[1,1]    1.00     NaN 0.00  1.00  1.00  1.00  1.00  1.00   NaN  NaN
Omega[1,2]    0.52    0.01 0.20  0.10  0.38  0.54  0.66  0.84   951    1
Omega[2,1]    0.52    0.01 0.20  0.10  0.38  0.54  0.66  0.84   951    1
Omega[2,2]    1.00    0.00 0.00  1.00  1.00  1.00  1.00  1.00   550    1

But then again, maybe this is just a fluke?

The whole model:

functions {
  real approx_Phi(real x) {
    return inv_logit(x * 1.702);
  }

  real approx_inv_Phi(real x) {
    return logit(x) / 1.702;
  }

  real multiprobit_lpmf(int[] y,          // response data - integer array of 1s, 0s and -1s for missing data
                        real[] mu,        // intercept
                        real[] u,         // nuisance that absorbs inequality constraints
                        int K,            // response dimension size
                        matrix L_Omega    // lower Cholesky of correlation matrix
                        ) {
    vector[K] z;       // latent continuous state
    vector[K] logLik;  // log likelihood
    real prev;
    /* mu = Intercept + beta * x; */
    prev = 0;
    for (k in 1:K) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
      if (y[k] == -1) {
        z[k] = 0;
        logLik[k] = 0;
      }
      else {
        real bound; // threshold at which utility = 0
        bound = approx_Phi(-(mu[k] + prev) / L_Omega[k, k]);
        if (y[k] == 1) {
          real t;
          t = bound + (1 - bound) * u[k];
          z[k] = approx_inv_Phi(t);  // implies utility is positive
          logLik[k] = log1m(bound); // Jacobian adjustment
        }
        else {
          real t;
          t = bound * u[k];
          z[k] = approx_inv_Phi(t); // implies utility is negative
          logLik[k] = log(bound);   // Jacobian adjustment
        }
      }
      if (k < K) prev = L_Omega[k + 1, 1:k] * head(z, k);
      // Jacobian adjustments imply z is truncated standard normal
      // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
    }
    return sum(logLik);
  }
}

data {
  int<lower=1> K;                    // response dimension size
  int<lower=0> N;                    // sample size
  int<lower=-1,upper=1> y[N, K];     // outcomes (-1 = missing)
}

parameters {
  vector[K] Intercept;
  cholesky_factor_corr[K] L_Omega;
  real<lower=0,upper=1> u[N,K];    // nuisance that absorbs inequality constraints
}

transformed parameters {
  real mu[N, K];
  for (k in 1:K) {
    mu[, k] = to_array_1d(Intercept[k] * rep_vector(1, N)); // 
  }
}

model {
  L_Omega ~ lkj_corr_cholesky(2);
  Intercept ~ std_normal();
  // implicit: u is iid standard uniform a priori
  { // likelihood
    for (n in 1:N) {
      target += multiprobit_lpmf(y[n] | mu[n], u[n], K, L_Omega);
    }
  }
}

generated quantities {
  corr_matrix[K] Omega = multiply_lower_tri_self_transpose(L_Omega);
  vector[K] preds[N];      // predictions
  for (n in 1:N) {
    vector[K] preds_tmp = multi_normal_cholesky_rng(to_vector(mu[n]), L_Omega);
    for (k in 1:K)
      preds[n, k] = preds_tmp[k] > 0 ? 1 : 0;
  }
}

I am not an expert on this (@CerulloE might know more), but I think the z[k] = 0 line is suspicious - since you don’t have any bound on z[k] when the data is missing, I would expect that you want z[k] = approx_inv_Phi(u[k]) as you still want to integrate out this dimension.
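
For concreteness, a minimal sketch of that change to the missing-data branch of multiprobit_lpmf (same variable names as above) would be something like:

      if (y[k] == -1) {
        z[k] = approx_inv_Phi(u[k]);  // no truncation: uniform u[k] implies a standard normal z[k]
        logLik[k] = 0;                // so there is no Jacobian adjustment to add
      }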

I also think it should be possible to completely avoid having a latent variable for the missing data - you should be able to work with just the marginal distribution of the observed outcomes. Or, in other words, if you rearranged the dimensions such that those with missing data come last, you would be able to just terminate the loop over dimensions earlier. I unfortunately don’t understand the underlying GHK algorithm, and generally the relevant linear algebra, well enough to be sure whether reordering the dimensions would force you to make some expensive transform to L_Omega or not.
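
To sketch that second idea (multiprobit_observed_lpmf and K_obs are made-up names, and the sketch assumes L_Omega is the Cholesky factor for the permuted ordering with the observed dimensions first - which may itself be the expensive part):

      real multiprobit_observed_lpmf(int[] y, real[] mu, real[] u, int K_obs, matrix L_Omega) {
        // same recursion as multiprobit_lpmf, but run only over the first
        // K_obs (observed) dimensions, so no missing-data branch is needed
        vector[K_obs] z;
        vector[K_obs] logLik;
        real prev;
        prev = 0;
        for (k in 1:K_obs) {
          real bound;
          real t;
          bound = approx_Phi(-(mu[k] + prev) / L_Omega[k, k]);
          if (y[k] == 1) {
            t = bound + (1 - bound) * u[k];
            logLik[k] = log1m(bound);
          } else {
            t = bound * u[k];
            logLik[k] = log(bound);
          }
          z[k] = approx_inv_Phi(t);
          if (k < K_obs) prev = L_Omega[k + 1, 1:k] * head(z, k);
        }
        return sum(logLik);
      }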

The way to test your implementation fully is via Simulation-based calibration - or at least via multiple different simulated datasets with different true parameter values.
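
As a quick version of the weaker check, something like this entirely hypothetical R sketch could already be informative (model_1 is your compiled model; the true values are drawn from the same priors the model uses, so the 95% intervals should cover the truth roughly 95% of the time if everything is implemented correctly):

library(mvtnorm)
library(rstan)

## fit the model to one simulated dataset and check whether the 95% interval
## for the correlation covers the true value
check_once <- function(model, N = 100) {
  intercepts <- rnorm(2)         # matches Intercept ~ std_normal()
  rho <- 2 * rbeta(1, 2, 2) - 1  # matches lkj_corr_cholesky(2) when K = 2
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  y <- (rmvnorm(N, intercepts, Sigma) > 0) * 1
  fit <- sampling(model, data = list(N = N, K = 2, y = y),
                  chains = 4, cores = 4, iter = 1000, warmup = 500, refresh = 0)
  ci <- summary(fit, pars = "Omega[1,2]")$summary[, c("2.5%", "97.5%")]
  ci[1] <= rho && rho <= ci[2]
}

set.seed(1)
mean(replicate(20, check_once(model_1)))  # should be roughly 0.95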

Best of luck with the model!


Thanks for your answer Martin. I’ll give your suggestion a try!

Apologies for re-hashing an old thread.

Just a quick question: I'm curious whether you ever found a way to handle missing data within @bgoodri 's parameterization of the multivariate probit (essentially a Stan version of the GHK algorithm). I initially coded it similarly to @martinmodrak 's suggestion above, but I'm unsure whether this is correct, given that marginalizing out a variable requires summing over all of its possibilities.

Just thought I’d see if there was any success. I attach a version of my code below. It is part of a larger copula model that will eventually be combined with continuous data (missing data is quite easy there and coded much like the Stan manual suggests). I should note that I also tried:

 if(n_thr[k] == 1){ //binary
          if(y[n,k] == 99){ //99 is missing
            z[n,k] = inv_Phi(u[n,k]);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log(z[n,k]);

thinking that, although this is technically the Jacobian adjustment for the case where the data are observed, this value gets added to the lpdf in the centered_gaussian_copula_cholesky_ function. But I fear this may also be wrong, and I get log(0) initialization errors when I use that approach.

Again, sorry for the re-hash, but I figured it best to keep this in the initial thread instead of starting a redundant one.

functions{
  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
    int N = rows(U);
    int J = cols(U);
    matrix[J, N] G = mdivide_left_tri_low(L, U');
    return -N * sum(log(diagonal(L))) - 0.5 * (sum(square(G)) - sum(square(U)));
  }
  
  real centered_gaussian_copula_cholesky_(matrix[,] marginals, matrix L, int[] order) {
    // Extract dimensions of final outcome matrix
    int N = rows(marginals[1][1]);
    int J = rows(L);
    matrix[N, J] U;

    // Iterate through marginal arrays, concatenating the outcome matrices by column
    // and aggregating the log-likelihoods (from continuous marginals) and jacobian
    // adjustments (from discrete marginals)
    real adj = 0;
    int pos = 1;
    for (m in 1 : size(marginals)) {
      int curr_cols = cols(marginals[m][1]);

      U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];

      adj += sum(marginals[m][2]);
      pos += curr_cols;
    }

    // Return the sum of the log-probability for copula outcomes and jacobian adjustments
    return multi_normal_cholesky_copula_lpdf(U[ : , order] | L) + adj;
  }
  
  matrix[] ordinal_marginal(array [,] int y, matrix mu_glm, real[,] u, array [] int n_thr, array [] vector tau, matrix L){
    int N = rows(mu_glm);
    int K = cols(mu_glm);
    matrix[N,K] rtn[2];
    
    for(n in 1:N){
      vector[K] z[N];
      real prev = 0;
      for(k in 1:K){
        if(n_thr[k] == 1){ //binary
          if(y[n,k] == 99){ //99 is missing
            z[n,k] = inv_Phi(u[n,k]);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = 0;
          } else if(y[n,k] == 1){
            real bound;
            bound = Phi(-(mu_glm[n,k] + prev) / L[k,k]);
            real t;
            t = bound + (1 - bound)*u[n,k];
            z[n,k] = inv_Phi(t);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log1m(bound);
          } else{
            real bound;
            bound = Phi(-(mu_glm[n,k] + prev) / L[k,k]);
            real t;
            t = bound*u[n,k];
            z[n,k] = inv_Phi(t);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log(bound);
          }
        } else{ // polychotomous
          if(y[n,k] == 99){
            z[n,k] = inv_Phi(u[n,k]);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = 0;
          } else if(y[n,k] == 1){
            real t;
            real ub = Phi((tau[1+k-2,1] - (mu_glm[n,k] + prev)) / L[k,k]);
            t = ub*u[n,k];
            z[n,k] = inv_Phi(t);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log(ub);
          } else if (y[n,k] == n_thr[k] + 1){
            real t;
            real lb = Phi((tau[1+k-2,n_thr[k]] - (mu_glm[n,k] + prev)) / L[k,k]);
            t = lb + (1 - lb)*u[n,k];
            z[n,k] = inv_Phi(t);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log1m(lb);
          } else{
            real t;
            real lb = Phi((tau[1+k-2,y[n,k] - 1] - (mu_glm[n,k] + prev)) / L[k,k]);
            real ub = Phi((tau[1+k-2,y[n,k]] - (mu_glm[n,k] + prev)) / L[k,k]);
            t = lb + (ub-lb)*u[n,k];
            z[n,k] = inv_Phi(t);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log(ub-lb);
          }
        }
        if(k < K){
          prev = L[k+1, 1:k]*head(z[n],k);
        }
      }
    }
    
    return rtn;
  }
  
}
data{
  int<lower=1> N; // total sample size
  int<lower=1> K; // # of ordinal variables (ordinal marginal)
  int<lower=1> Thr[K]; // number of thresholds for each ordinal variable
  int Y[N,K]; // ordinal responses
  real X[N]; // predictor (age)
}
transformed data{
  int order[K]; // order of responses for copula -> ordinal (binary) first
  for(i in 1:K){
    order[i] = i;
  }
}
parameters{
  vector<lower=0>[K] o1; // first parameter of ordinal mean function
  ordered[Thr[3]] tau[2]; // cutpoints for ordinal model (polychotomous only)
  real<lower=0, upper=1> U[N,K]; //nuisance that absorbs inequality constraints
  cholesky_factor_corr[K] L; // cholesky factor of correlation matrix 
}
model{
  matrix[N,K] mu;
  for(n in 1:N){
    for(k in 1:K){
      mu[n,k] = X[n]^o1[k];
    }
  }
  
  // Priors
  o1 ~ normal(0,5);
  L ~ lkj_corr_cholesky(4);
  for(i in 1:2){
    for(t in 1:Thr[3]){
      tau[i,t] ~ normal(t+0.5,1); // cutpoint prior
    }
  }
  
  // Likelihood
  target += centered_gaussian_copula_cholesky_({ordinal_marginal(Y,mu,U,Thr,tau,L)},L,order);
  
}
generated quantities{
  corr_matrix[K] Omega = multiply_lower_tri_self_transpose(L);
}


Sorry, this is quite a complex problem and I currently don’t have the bandwidth to provide a good comment.

The code appears superficially like what I would guess is the correct solution.

Some weak additional comments:

Getting a few of those followed by successful sampling may be OK. If, on the other hand, initialization fails completely, that’s a bigger problem.

Are you sure the implementation works as expected for non-missing data? (I would recommend using SBC via the SBC package, “Simulation Based Calibration for rstan/cmdstanr models”, but even successful recovery of a single set of parameters is a step in the right direction.)

Numerically, I’ve found the current Stan implementations (roughly around version 2.25, not sure what happened since) of Phi and inv_Phi problematic. This could account for some initialization failures. Based on suggestions by others, I am instead using:

      real approx_Phi(real x) {
        return inv_logit(x * 1.702);
      }

      real approx_inv_Phi(real x) {
        return logit(x) / 1.702;
      }

Best of luck!

This comment is great, thank you!

I will give SBC a try for diagnosing issues - using the full copula may be tedious, but perhaps I can first run a general multivariate probit regression (like @bgoodri 's initial parameterization, for instance) and see what I can recover.

Weirdly enough, when I first began working on this model (using some of your advice from a previous thread), I used the above approximations to Phi, but the chains had trouble mixing. That has never been an issue with Stan’s out-of-the-box implementations. Even after trying the approximations, I still get the log(0) initialization error.

For clarity this runs and “appears” to recover the parameters of interest:

if(y[n,k] == 99){ //99 is missing
            z[n,k] = inv_Phi(u[n,k]);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = 0;

And this does not:

if(y[n,k] == 99){ //99 is missing
            z[n,k] = inv_Phi(u[n,k]);
            rtn[1][n,k] = z[n,k];
            rtn[2][n,k] = log(z[n,k]);

The initialization errors suggest that, in the second chunk above, z = inv_Phi(u) may be 0 in some instances, and therefore when I take log(0) I get the errors. For example, this occurs when u = 0.5. When the data are present this is avoided because we include the mean and the bounds and such (I think). However, in the missing case the variate is unbounded. Sorry for the rambling…

Please pardon me if this is wrong, but I wonder if the code in that first chunk could be correct-ish? If I understand marginalization correctly, the goal is to sum over all possibilities - is this step not done when the copula function unpacks the adjustments along with the latent random variates? That is, they increment the log probability, so is the log sum over these variates already occurring? Or would I need to somehow include those sums, similar to how the Jacobian adjustments are incremented at the end? Pardon the rambles.

I shall continue to work through!

I know this doesn’t answer your actual question, @cwolfe, given the email you sent me, but I can comment here on this thread that dealing with multivariate probit missing data is trivial if you go to the full covariance matrix form for the missing items. That way, the distribution of the observed items arises from the mean vector and covariance matrix restricted to the observed rows (and columns, for the covariance).
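
In Stan-like pseudocode, that restriction could look roughly like this (obs_idx and n_obs are hypothetical arrays you would precompute in transformed data from the missingness pattern; the probit likelihood over the observed block is left as a stub):

matrix[K, K] Sigma = multiply_lower_tri_self_transpose(L_Omega);
for (n in 1:N) {
  int m = n_obs[n];                         // number of observed outcomes in row n
  vector[K] mu_n = to_vector(mu[n]);
  vector[m] mu_obs = mu_n[obs_idx[n, 1:m]];                          // restricted mean
  matrix[m, m] Sigma_obs = Sigma[obs_idx[n, 1:m], obs_idx[n, 1:m]];  // restricted covariance
  matrix[m, m] L_obs = cholesky_decompose(Sigma_obs);                // Cholesky of the observed block
  // ... evaluate the m-dimensional probit likelihood for y[n] using mu_obs and L_obs
}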


I pinged @bgoodri this question about missing data in relation to the GHK implementation of the multivariate probit described above. I paste his response below:

It seems like they should be elements of the vector formed by post-multiplying the Cholesky factor by a vector of possibly truncated standard normals obtained by pushing uniforms through the quantile function of the standard normal. Or if the missing responses are discrete, discretizing the result with the cutpoints.

I am not yet sure how to code this up given what we have - any suggestions? But it does seem possible to account for the missing data within the current model construction.

I would just like to follow up after talking through the problem with Ben.

If the goal is to incorporate the missing values as parameters, much like the manual’s chapter on missing data, then @Matias_Guzman_Naranjo’s original post is ‘close’. What is necessary is to post-multiply the Cholesky factor by the possibly truncated standard normals.

if (y[k] == -1) { // missing
  z[k] = L_Omega[k,k] * inv_Phi(u[k]);
  logLik[k] = 0;  // the Jacobian adjustment is simply 0 in this case
}

In the words of Ben: “There is no truncation and no Jacobian for them. I don’t think you can integrate them out analytically but you don’t have to do much of anything with them if you don’t want to. They just affect the joint posterior distribution of all the unknown parameters.” Further, treating the missing values as parameters in the model is exactly what the manual calls for - in this case we use u.

The above solution works in my case because imputation is not what I am after - in fact, given the nature of my data (biological growth data), imputation may not make a ton of sense because the missing values are the result of biological processes that have not occurred yet (which could obviously also be modelled, but that is a whole OTHER layer on top…).

BUT. If the goal is to impute the unknown values y_star based on the GHK simulator here, then - based on Ben’s code:

vector[K] z = inv_Phi(u); // standard normal but possibly truncated
vector[K] y_star = L * z; // multivariate normal but possibly truncated

If the first element of the y vector, y[1], is missing, then:

// if the missing value is continuous: keep y_star[1] as-is

// binary: y_star[1] > 0 maps to 1, otherwise 0

// ordinal: y_star[1] maps to the k-th category, where y_star[1] is greater
// than the (k-1)-th cutpoint but not greater than the k-th cutpoint

This answers the second component of the response I posted above, by discretizing the missing responses based on the cutpoints.
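
Putting the pieces together for the binary case, a rough generated quantities sketch (using the variable names and the -1 missing-data coding from the multivariate probit model earlier in the thread; y_imp is a made-up name, the missing z is just inv_Phi(u) as in Ben’s snippet, and the mean term is added since utility is mu + L_Omega * z):

generated quantities {
  int y_imp[N, K];    // observed outcomes copied over, missing ones imputed
  for (n in 1:N) {
    vector[K] z;      // the same (possibly truncated) standard normals as in multiprobit_lpmf
    vector[K] y_star;
    real prev;
    prev = 0;
    for (k in 1:K) {
      if (y[n, k] == -1) {
        z[k] = approx_inv_Phi(u[n, k]);   // untruncated for missing outcomes
      } else {
        real bound;
        real t;
        bound = approx_Phi(-(mu[n, k] + prev) / L_Omega[k, k]);
        t = (y[n, k] == 1) ? bound + (1 - bound) * u[n, k] : bound * u[n, k];
        z[k] = approx_inv_Phi(t);         // truncated to match the observed outcome
      }
      if (k < K) prev = L_Omega[k + 1, 1:k] * head(z, k);
    }
    y_star = to_vector(mu[n]) + L_Omega * z;  // utility = mu + L_Omega * z
    for (k in 1:K)
      y_imp[n, k] = (y[n, k] == -1) ? (y_star[k] > 0 ? 1 : 0) : y[n, k];
  }
}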

This is great, thank you for the follow up!