 # Feedback request: Multivariate Probit Regression with GP

I’m relatively new to writing stan models directly and was hoping I could get some feedback on this.

I have data from about 700 populations. For each population I have a series of binary outcomes (~1000 for each) which I am interested in modelling. Each population belongs to a group (which I am modelling as group effects) and I have longitude and latitude data for each population (which I am modelling with a GP).
The data looks a bit like this:

``````y1 y2 y3 y4 group lon lat
1   1  0  1     1  12  14
0   0  1  0     1  13  20
1   1  0  1     2   5  23
...
``````

I have been experimenting with two ideas. One is to fit multiple logistic models, all sharing the same lscale and sdgp parameters. The other is a multivariate probit model (I am more or less following this implementation here). From a theory perspective, I think that the multivariate probit makes much more sense, because I expect many of these outcomes to be correlated with each other.

``````functions {
/* Spectral density function of a Gaussian process
* with squared exponential covariance kernel
* Args:
*   x: array of numeric values of dimension NB x D
*   sdgp: marginal SD parameter
*   lscale: vector of length-scale parameters
* Returns:
*   numeric values of the function evaluated at 'x'
*/
vector spd_cov_exp_quad(vector[] x, real sdgp, real lscale) {
int NB = dims(x);
int D = dims(x);
vector[NB] out;
real constant = square(sdgp) * (sqrt(2 * pi()) * lscale)^D;
real neg_half_lscale2 = -0.5 * square(lscale);
for (m in 1:NB) {
out[m] = constant * exp(neg_half_lscale2 * dot_self(x[m]));
}
return out;
}
real multiprobit_lpmf(
int[] y,          // response data - integer array of 1s and 0s
vector mu, // Intercept
real[] u,         // nuisance that absorbs inequality constraints
int K,            // response dimension size
matrix L_Omega    // lower cholesky of correlation matrix
){
vector[K] z;       // latent continuous state
vector[K] logLik;  // log likelihood
real prev;
/* mu = Intercept + beta * x; */
prev = 0;
for (k in 1:K) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
real bound; // threshold at which utility = 0
bound = Phi( -(mu[k] + prev) / L_Omega[k,k]  );
if (y[k] == 1) {
real t;
t = bound + (1 - bound) * u[k];
z[k] = inv_Phi(t);       // implies utility is positive
logLik[k] = log1m(bound);  // Jacobian adjustment
}
else {
real t;
t = bound * u[k];
z[k] = inv_Phi(t);     // implies utility is negative
logLik[k]= log(bound);  // Jacobian adjustment
}
if (k < K) prev = L_Omega[k+1,1:k] * head(z, k);
// Jacobian adjustments imply z is truncated standard normal
// thus utility --- mu + L_Omega * z --- is truncated multivariate normal
}
return sum(logLik);
}
}

data {
int<lower=1> K;  // Number of outcomes
int<lower=0> N; // sample size
int<lower=0,upper=N> J;              // number of groups
int<lower=1,upper=J> group_index[N]; // group indices
int<lower=0,upper=1> y[N,K]; // response data - integer array of 1s and 0s
int<lower=1> NBgp;   // number of basis functions of an approximate GP
matrix[N, NBgp] Xgp; // approximate GP basis matrices
vector slambda[NBgp];// approximate GP eigenvalues
real<lower=0> prior_a;
real<lower=0> prior_b;
/* vector[K] x[N]; // covariate data  */
}

parameters {
vector[K] Intercept;
vector[J] eta_re [K];         // Vector of group effects
real<lower=0> sigmaalpha [K]; // Vector of sd of group effects
vector[NBgp] zgp [K];               // latent part
cholesky_factor_corr[K] L_Omega;
real<lower=0,upper=1> u[N,K]; // nuisance that absorbs inequality constraints
real<lower=0> lscale;            // length scale
real<lower=0> sdgp;              // sd of the GP
}

transformed parameters {
real mu[N, K];
for (k in 1:K) {
{
vector[J] alpha = sigmaalpha[k] * eta_re[k];
vector[NBgp] rgp = sqrt(spd_cov_exp_quad(slambda, sdgp, lscale)) .* zgp[k];
vector[N] gp_k = Xgp * rgp;
for (n in 1:N) {
mu[n, k] = Intercept[k] + alpha[group_index[n]] + gp_k[n]; //
}
}
}
}

model {
L_Omega ~ lkj_corr_cholesky(4);
Intercept ~ std_normal();

lscale ~ inv_gamma(prior_a, prior_b);
sdgp ~ std_normal();
for(k in 1:K) {
zgp[k] ~ std_normal();
sigmaalpha[k] ~ std_normal();
eta_re[k] ~ std_normal();
}
// implicit: u is iid standard uniform a priori
{ // likelihood
for(n in 1:N){
target += multiprobit_lpmf(y[n] | to_vector(mu[n]), u[n], K, L_Omega);
}
}
}

generated quantities {
corr_matrix[K] Omega;
vector[N] log_lik;
for (n in 1:N)
log_lik[n] = multiprobit_lpmf(y[n] | to_vector(mu[n]), u[n], K, L_Omega);
Omega = multiply_lower_tri_self_transpose(L_Omega);
}
``````

For the model with multiple logistic regressions the relevant changes are:

``````model {
lscale ~ inv_gamma(prior_a, prior_b);
sdgp ~ std_normal();

// fit all logistic models
for (k in 1:K) {
zgp[k] ~ std_normal();
// priors regression
Intercept[k] ~ std_normal();
sigmaalpha[k] ~ std_normal();
eta_re[k] ~ std_normal();
y[, k] ~ bernoulli_logit(mu[,k]);
}
}

generated quantities {
real log_lik[N, K];
for (k in 1:K) {
for (n in 1:N)
log_lik[n, k] = bernoulli_logit_lpmf(y[n, k] | mu[n,k]);
}
}
``````

I am generating the data for the approximate GP with brms’ make_standata.

Both models work fit in reasonable time for 20 outcomes and I do not get any warnings or issues. Rhat values all look good and ESSs too (except for some cells in the correlation matrix which have a lowish ESS).

One issue I am seeing is that the pareto-k values for the multivarite probit model are terrible:

``````Computed from 4000 by 552 log-likelihood matrix

Estimate    SE
elpd_loo  -3076.2  68.8
p_loo      1001.4  27.5
looic      6152.4 137.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      34    6.2%   262
(0.5, 0.7]   (ok)       211   38.2%   82
(0.7, 1]   (bad)      255   46.2%   12
(1, Inf)   (very bad)  52    9.4%   3
``````

While for the multi-logistic model the results are much less bad:

``````Computed from 4000 by 14840 log-likelihood matrix
Estimate    SE
elpd_loo  -4967.1  73.0
p_loo       340.3   8.3
looic      9934.2 146.1

Pareto k diagnostic values:
Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     14770 99.5%   952
(0.5, 0.7]   (ok)          64  0.4%   784
(0.7, 1]   (bad)          6  0.0%   267
(1, Inf)   (very bad)     0  0.0%   <NA>
``````

I have essentially three questions:

1. Are there any clear mistakes in any of the two models or anything I could improve on?

2. For now, with 20 outcomes, the models take about one or two hours to fit. But adding many more outcomes will make things extremely slow, are there obvious ways of speeding things up?

3. Any ideas of why the pareto diagnostics are so bad for the multivariate model? any suggestions for things to try?

Thanks!

1 Like

At least for #3 please have a look here: Cross-validation FAQ

A GP with that large a sample size will take a long time to fit. Have you tried running the whole dataset with `brms` (since it kinda optimizes some things) and just one chain for 500 iterations, to get a rough estimate on computation time?

1 Like

Thanks for your answer and the link! I am a bit worried that the high pareto values are rather due to a coding error on my part. At least I don’t understand why the multivariate probit is so much worse.
Regarding the GP, I am using the approximation technique, but even with a real GP that doesn’t seem to change things much.

Have you tried running the whole dataset with `brms` (since it kinda optimizes some things) and just one chain for 500 iterations, to get a rough estimate on computation time?

But these models are not really available in `brms`, no? If I fit

``````brm(mvbind(y1, y2) ~ gp(lon, lat) + (1|group), family = bernoulli)
``````

I will get an independent GP for each outcome. But that is not really what I want, I want a shared lscale and sdgp value for all outcomes. If you know of a way fitting either of these models in `brms` that’d be so much easier for me. And as far as I can tell there is no multivariate probit, is there?

1 Like

Heyhey, apologies I haven’t reviewed your code fully but in response to:

1. Any ideas of why the pareto diagnostics are so bad for the multivariate model? any suggestions for things to try?

In my own experience fitting much simpler MV Probits, I can rarely get good Pareto K’s. I think some of the core assumptions are violated by the implicit truncated MVN in the MV Probit. I think a more robust approach generally (especially for tricky models) is to use a specially designed test (for me I was using it as a classifier so used cross-entropy). Hope that’s more reassuring than negative!

1 Like

Ah, you’re completely right, MV probit does not exist in `brms` I believe… Additionally, I don’t think it’s possible to model lscale and sdgp separately for each outcome, but I think we should ask @paul.buerkner here. @avehtari could perhaps provide some input on the GP Stan code, and perhaps add to @fergusjchadwick’s answer?

I’ll add that I have a mostly-working implementation of multivariate cumulative probit (ordinal extension of the binary case) as a custom distribution for `brms` (with big thanks to @bgoodri whose ideas I copied and @CerulloE that kindly answered my inquiries and gave me several crucial hints). I am still testing this and preparing some other outputs, so I don’t think it is ready for completely public release yet (it might be in a few weeks), but I’d be happy to share the code for additional testing and collaboration (ping a private message if interested). Note that the MV probit itself adds a bunch of complexity and a lot of nuisance parameters, so it is quite likely that together with a GP the model will just break very badly.

One thing that helped a lot in my case was to use a simple approximation to `Phi` and `inv_Phi` as

``````     real approx_Phi(real x) {
return inv_logit(x * 1.702);
}

real approx_inv_Phi(real x) {
return logit(x) / 1.702;
}
``````

(this is one of the hints provided by @CerulloE )

It is also an interesting suggestion to investigate how LOO behaves with the parametrization I have…

I think you can make multiple predictors in `brms` share the same GP by using the non-linear syntax. But this is one more step into the unknown territory with a risk of making the model intractable. So something like this could work (code not actually tested):

``````brm(bf(y1 ~ my_smooth + other_terms, y2 ~ my_smooth + other_terms,
my_smooth ~ 0 + gp(lon, lat),  nl = TRUE, family = bernoulli)
``````

Note that non-linear syntax fundamentally changes how the formula are interpreted by `brms` so make sure to check out the docs.

Best of luck with the model!

3 Likes

@fergusjchadwick: I just realized why the Pareto-K is almost guaranteed to be bad if you just add up the contributions to `target` from this parametrization. In this case, you don’t compute the log likelihood of the observed values given the linear predictors. You actually compute the log likelihood given the linear predictors AND the nuisance parameters. Since the observed values have huge influence on the associated nuisance parameters, loo correctly treats them as having large influence on the model and thus having high k-hat.

If you could integrate the nuisance parameters out, you would probably get a reasonably-behaved likelihood that would play nicely with loo. That could probably be done by implementing some Monte-Carlo integrator for multivariate normals in the generated quantities, but that could be hard to do (and expensive to compute)

It could also make some kind of weird sense to compute the multivariate normal log likelihood of the nuisance parameters given the linear predictors and feed that to loo, but I can’t think completely clearly if that would correspond to a meaningful quantity or not.

1 Like

Thank you for your answer Martin

One thing that helped a lot in my case was to use a simple approximation to `Phi` and `inv_Phi` as

I’ll give this a try!

I think you can make multiple predictors in `brms` share the same GP by using the non-linear syntax.

I gave this a try but it produces independent GPs for each outcome. I think reading some comment by Paul explaining that this will only be possible in `brms` 3 but maybe I’m wrong?

Best of luck with the model!

Thanks!

@martinmodrak: That makes a huge amount of sense. It’s remarkable how complex going multivariate makes things. Would be v. interested to hear if you figure out a solution (and about your new parameterisation).

Btw. it’s very rare to get `p_loo` > the number of log-lik matrix elements (here 552).

It @martinmodrak already figured out the reason.

Note also cleaner and faster spectral density functions at Bayesian workflow book - Birthdays (there’s a link to the repo).

1 Like

Thanks Aki. I’m looking at the linked functions, but do these work for 2-dimensional GPs?

Not directly, but looking at those you could make the same improvements for 2D versions. Gabriel Riutort-Mayol is aware of the improvements and has told he would update the repo which has function for 2D case, but he seems to be busy with other things.

I gave this a shot, and it samples much faster, but I think I made a mistake because the estimates of the exact GP are somewhat different and tend to vary a lot as I change L and M. If you have a minute, do you see how to fix these functions/where the error might be?

``````  matrix PHI_EQ_2D(int N, int M, real L, vector x1, vector x2) {
matrix[N, M] fi;
matrix[N, M] fi1;
matrix[N, M] fi2;
fi1 = sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x1+L), M),
linspaced_vector(M, 1, M)))/sqrt(L);
fi2 = sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x2+L), M),
linspaced_vector(M, 1, M)))/sqrt(L);
fi = fi1.*fi2;
return fi;
}

vector diagSPD_EQ_2D(real sdgp, real lscale, real L, int M) {
return sqrt((sdgp^2) * sqrt(2*pi()*lscale)^2 *
exp(-0.5*(lscale*pi()/2/L)^2*linspaced_vector(M, 1, M)^2));
}``````
1 Like

Just letting you know that I’ve seen your question, but I just haven’t had enough time to figure out an answer

3 Likes

I think `sqrt(2*pi()*lscale)^2` should be `2*pi()*lscale^2` and the term inside exp seems to be missing `*2` (compare how basis_functions_approach_to_GP/model_BF_multi.stan at master · gabriuma/basis_functions_approach_to_GP · GitHub has `rho1^2*w1^2 + rho2^2*w2^2`, and you have rho1=rho2 and w1=w2).

1 Like