Multivariate probit model, use of sufficient statistics

Hi all, I have a question related to the multivariate probit example at the bottom of this page. That example uses the Albert-Chib method to generate continuous z variables underlying the binary responses.

Say that we have no covariates and that we fix the mean vector to 0 for all n=1,\ldots,N. This might make the model useless, but let’s ignore that for now. My intuition is that we could model the sample mean and sample covariance matrix of the z variables in place of the individual z_n.

That is, in the model block, instead of z_n \sim Normal(0, \Omega) for all n, we would have \bar{z} \sim Normal(0, \frac{1}{N} \Omega) and (N-1)S_z \sim Wishart(N-1, \Omega).

I am having trouble thinking about the Jacobian implications of this, though, because the z_n are parameters in this Albert-Chib model (see the link at the top of the post). So my questions are whether it is indeed possible to use \bar{z} and S_z here, and, if so, what to do with the Jacobian.

1 Like

In general, you can always refactor Stan code so long as it doesn’t change the log density up to a constant (the constant can change, but it still has to be log density plus constant).

In the case of exponential families, you can always replace repeated sampling with sufficient statistics. For example, you can replace

for (n in 1:N) y[n] ~ bernoulli(theta);

with

sum(y) ~ binomial(N, theta);

That’s the case I described in the user’s guide. I’m not very good at exponential family algebra, but it’s the same deal with the normal—just replace the sampling of items with sampling of sufficient stats. That will work by definition of what sufficient stats are.

Edit: I should’ve added that you can try it on a very simple standalone example to make sure it’s working before plugging it in as part of a larger model.

1 Like

Thanks for the input. I created a simple example below (though not completely simple) that has a use_sufficient switch saying whether or not to use sufficient statistics of z. Then I do some short runs and compare posterior means of the estimated correlation matrix with/without sufficient stats. I get very different values in the two cases, with the non-sufficient estimates seeming correct. But interestingly, I can obtain the sufficient stat posterior means by taking the non-sufficient posterior means and multiplying by about .39. It feels to me like a Jacobian sort of issue, but I cannot rule out user error!

Stan model:

data {
  int<lower=1> D;
  int<lower=0> N;
  int<lower=0> N0;
  int<lower=0> N1;
  int<lower=0, upper=1> use_sufficient;
  int<lower=0, upper=1> y[N, D];
}
parameters {
  corr_matrix[D] Omega;
  vector<lower=0>[N1] z_pos;
  vector<upper=0>[N0] z_neg;
}
transformed parameters {
  vector[D] z[N];
  matrix[D,D] Sz;
  vector[D] zbar;
  
  zbar = rep_vector(0, D);

  {
    int run_N0 = 1;
    int run_N1 = 1;  

    for (i in 1:N) {
      for (j in 1:D) {
	if (y[i,j] == 1) {
	  z[i,j] = z_pos[run_N1];
	  run_N1 += 1;
	} else {
	  z[i,j] = z_neg[run_N0];
	  run_N0 += 1;
	}
      }
      zbar += inv(N) * z[i];
    }
  }
  
  Sz = rep_matrix(0, D, D);
  for (i in 1:N) {
    Sz += (z[i] - zbar) * (z[i] - zbar)';
  }
}
model {
  target += lkj_corr_lpdf(Omega | 1);
  
  if (use_sufficient) {
    target += multi_normal_lpdf(zbar | rep_vector(0, D), inv(N) * Omega);
    target += wishart_lpdf(Sz | N - 1, Omega);
  } else {
     target += multi_normal_lpdf(z | rep_vector(0, D), Omega);
  }
}

Code for running an example via rstan:

library(rstan)
library(mvtnorm)

## non-diagonal correlation matrix for generating data
cormat <- matrix(c(1, 0.48, 0.4, 0.06, 0.06, 0.12, 0.16, 0.09, 0.1, 
                   0.48, 1, 0.52, 0.07, 0.12, 0.11, 0.12, 0.03, 0.06, 0.4, 0.52, 
                   1, -0.09, 0.08, 0, 0.12, 0.06, 0.02, 0.06, 0.07, -0.09, 1, 0.42, 
                   0.46, 0.09, 0.07, 0.07, 0.06, 0.12, 0.08, 0.42, 1, 0.45, 0.15, 
                   0.08, 0.1, 0.12, 0.11, 0, 0.46, 0.45, 1, 0.09, 0.1, 0.13, 0.16, 
                   0.12, 0.12, 0.09, 0.15, 0.09, 1, 0.53, 0.56, 0.09, 0.03, 0.06, 
                   0.07, 0.08, 0.1, 0.53, 1, 0.61, 0.1, 0.06, 0.02, 0.07, 0.1, 0.13, 
                   0.56, 0.61, 1), 9, 9)

set.seed(123)
dcont <- rmvnorm(200, sigma = cormat)
dord <- matrix(1L, nrow(dcont), ncol(dcont))
dord[dcont < 0] <- 0L

## stan model without sufficient statistics
standat <- list(D = 9, N = nrow(dord), N0 = sum(dord == 0L), N1 = sum(dord == 1L),
                use_sufficient = 0, y = dord)

## the stan model above is in the file probitsuff.stan
m1 <- stan_model("probitsuff.stan")

m1run <- sampling(m1, data = standat, iter = 200, warmup = 100, pars = "Omega")
m1summ <- summary(m1run)

## stan model with sufficient statistics
standat$use_sufficient <- 1
m1suff <- sampling(m1, data = standat, iter = 200, warmup = 100, pars = "Omega")
m1suffsum <- summary(m1suff)

## correlation parameters are a scaled version of one another
m1omeg <- m1summ[[1]][grepl("Omega", rownames(m1summ[[1]])), "mean"]
m1someg <- m1suffsum[[1]][grepl("Omega", rownames(m1suffsum[[1]])), "mean"]
plot(m1omeg[m1omeg < 1], m1someg[m1someg < 1])
summary(lm(m1someg[m1someg < 1] ~ m1omeg[m1omeg < 1]))

One update on the above model. In the model block, I can replace these two lines:

    target += multi_normal_lpdf(zbar | rep_vector(0, D), inv(N) * Omega);
    target += wishart_lpdf(Sz | N - 1, Omega);

with an evaluation of the multivariate normal in terms in zbar and Sz (leaving out constant terms):

    target += -.5*(N * (zbar' * zbar) + trace(mdivide_left_spd(Omega, Sz)) + N * log_determinant(Omega));

and then the results match up with the use_sufficient = 0 results. I think it suggests that, for this example, the wishart_lpdf approach is not the right way to go.