 # Multivariate probit with missing outcomes

I’ve been working with multivariate probit models recently and one thing that occurred to me is that it would be nice if one could model missing outcomes.
Assuming one enters missing values as -1, and using @bgoodri 's parametrization, would something like this make sense or is this completely wrong?

``````  real multiprobit_lpmf(
int[] y,          // response data - integer array of 1s, 0s and -1s for missing data
real[] mu, // Intercept
real[] u,         // nuisance that absorbs inequality constraints
int K,            // response dimension size
matrix L_Omega    // lower cholesky of correlation matrix
){
vector[K] z;       // latent continuous state
vector[K] logLik;  // log likelihood
real prev;
prev = 0;
for (k in 1:K) {
if (y[k] == -1 ) {
z[k] = 0;
logLik[k] = 0;
}
else {
real bound; // threshold at which utility = 0
bound = approx_Phi( -(mu[k] + prev) / L_Omega[k,k]  );
if (y[k] == 1) {
real t;
t = bound + (1 - bound) * u[k];
z[k] = approx_inv_Phi(t);       // implies utility is positive
logLik[k] = log1m(bound);  // Jacobian adjustment
}
else {
real t;
t = bound * u[k];
z[k] = approx_inv_Phi(t);       // implies utility is negative
}
}
if (k < K) prev = L_Omega[k+1,1:k] * head(z, k);
// Jacobian adjustments imply z is truncated standard normal
// thus utility --- mu + L_Omega * z --- is truncated multivariate normal
}
return sum(logLik);
}
``````

I simulated some data and got pretty reasonable results:

``````set.seed(1234)
y <- mutate(as.data.frame(rmvnorm(100, c(-1,1), matrix(c(1,0.5,0.5,1), nrow = 2))>0)
, across(everything(), as.numeric))

y2 <- y
y2[1:3,1] <- -1
y2[20:25,2] <- -1

## y2[5:6,2] <- -1

data_test_1 <- list(N = 100
, K = 2
, y = y)

data_test_2 <- list(N = 100
, K = 2
, y = y2)

samples_1 <- sampling(model_1
, data = data_test_1
, cores = 4
, chains = 4
, iter = 1000
, warmup = 500)

samples_2 <- sampling(model_1
, data = data_test_2
, cores = 4
, chains = 4
, iter = 1000
, warmup = 500)

``````

Gets me:

``````print(samples_1, pars = c("Intercept", "Omega"))
Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Intercept -0.82    0.00 0.15 -1.13 -0.92 -0.82 -0.73 -0.54  2506 1.00
Intercept  0.88    0.00 0.15  0.60  0.78  0.87  0.97  1.18  1798 1.00
Omega[1,1]    1.00     NaN 0.00  1.00  1.00  1.00  1.00  1.00   NaN  NaN
Omega[1,2]    0.59    0.01 0.20  0.16  0.46  0.60  0.74  0.93   471 1.01
Omega[2,1]    0.59    0.01 0.20  0.16  0.46  0.60  0.74  0.93   471 1.01
Omega[2,2]    1.00    0.00 0.00  1.00  1.00  1.00  1.00  1.00   129 1.00

> print(samples_2, pars = c("Intercept", "Omega"))
Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Intercept -0.80    0.00 0.14 -1.08 -0.90 -0.80 -0.71 -0.53  4794    1
Intercept  0.89    0.00 0.16  0.59  0.79  0.89  1.00  1.21  4359    1
Omega[1,1]    1.00     NaN 0.00  1.00  1.00  1.00  1.00  1.00   NaN  NaN
Omega[1,2]    0.52    0.01 0.20  0.10  0.38  0.54  0.66  0.84   951    1
Omega[2,1]    0.52    0.01 0.20  0.10  0.38  0.54  0.66  0.84   951    1
Omega[2,2]    1.00    0.00 0.00  1.00  1.00  1.00  1.00  1.00   550    1
``````

But then again, maybe this is just a fluke?

The whole model:

``````functions {
real approx_Phi(real x) {
return inv_logit(x * 1.702);
}

real approx_inv_Phi(real x) {
return logit(x) / 1.702;
}

real multiprobit_lpmf(
int[] y,          // response data - integer array of 1s and 0s
real[] mu, // Intercept
real[] u,         // nuisance that absorbs inequality constraints
int K,            // response dimension size
matrix L_Omega    // lower cholesky of correlation matrix
){
vector[K] z;       // latent continuous state
vector[K] logLik;  // log likelihood
real prev;
/* mu = Intercept + beta * x; */
prev = 0;
for (k in 1:K) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
if (y[k] == -1 ) {
z[k] = 0;
logLik[k] = 0;
}
else {
real bound; // threshold at which utility = 0
bound = approx_Phi( -(mu[k] + prev) / L_Omega[k,k]  );
if (y[k] == 1) {
real t;
t = bound + (1 - bound) * u[k];
z[k] = approx_inv_Phi(t);       // implies utility is positive
logLik[k] = log1m(bound);  // Jacobian adjustment
}
else {
real t;
t = bound * u[k];
z[k] = approx_inv_Phi(t);       // implies utility is negative
}
}
if (k < K) prev = L_Omega[k+1,1:k] * head(z, k);
// Jacobian adjustments imply z is truncated standard normal
// thus utility --- mu + L_Omega * z --- is truncated multivariate normal
}
return sum(logLik);
}

}

data {
int<lower=1> K;                       // Kesponse dimension size
int<lower=0> N;                       // sample size
int<lower=-1,upper=1> y[N, K];     // outcomes
}

parameters {
vector[K] Intercept;
cholesky_factor_corr[K] L_Omega;
real<lower=0,upper=1> u[N,K];    // nuisance that absorbs inequality constraints
}

transformed parameters {
real mu[N, K];
for (k in 1:K) {
mu[, k] = to_array_1d(Intercept[k] * rep_vector(1, N)); //
}
}

model {
L_Omega ~ lkj_corr_cholesky(2);
Intercept ~ std_normal();
// implicit: u is iid standard uniform a priori
{ // likelihood
for(n in 1:N){
target +=
multiprobit_lpmf(y[n] | mu[n], u[n], K, L_Omega);
}
}
}

generated quantities {
corr_matrix[K] Omega = multiply_lower_tri_self_transpose(L_Omega);
vector[K] preds[N];      // predictions
for (n in 1:N){
{
vector[K] preds_tmp = multi_normal_cholesky_rng(to_vector(mu[n]), L_Omega);
for (k in 1:K)
if (preds_tmp[k] > 0)
preds[n,k] = 1;
else
preds[n,k] = 0;
}
}
}
``````
1 Like

I am not an expert on this (@CerulloE might know more), but I think the `z[k] = 0` line is suspicious - since you don’t have any bound on `z[k]` when the data is missing, I would expect that you want `z[k] = approx_inv_Phi(u[k])` as you still want to integrate out this dimension.

I also think it should be possible to completely avoid having a latent variable for the missing data - you should be able to work with just the marginal distribution of the observed outcomes. Or, in other words, if you rearranged the dimensions such that those with missing data come last, you would be able to just terminate the cycle over dimensions earlier. I unfortunately don’t understand the underlying GHK algorithm and generally the relevant linear algebra well enough to be sure whether reordeing of the dimensions would force you to make some expensive transform to `L_Omega` or not.

The way to test your implementation fully is via Simulation-based calibration - or at least via multiple different simulated datasets with different true parameter values.

Best of luck with the model!

1 Like