Feedback request: Multivariate Probit Regression with GP

I’m relatively new to writing Stan models directly and was hoping I could get some feedback on this.

I have data from about 700 populations. For each population I have a series of binary outcomes (~1000 for each) which I am interested in modelling. Each population belongs to a group (which I am modelling as group effects) and I have longitude and latitude data for each population (which I am modelling with a GP).
The data looks a bit like this:

y1 y2 y3 y4 group lon lat
1   1  0  1     1  12  14
0   0  1  0     1  13  20
1   1  0  1     2   5  23
...

I have been experimenting with two ideas. One is to fit multiple logistic models, all sharing the same lscale and sdgp parameters. The other is a multivariate probit model (I am more or less following this implementation here). From a theory perspective, I think that the multivariate probit makes much more sense, because I expect many of these outcomes to be correlated with each other.

functions {
  /* Spectral density function of a Gaussian process
   * with squared exponential covariance kernel
   * Args:
   *   x: array of numeric values of dimension NB x D
   *   sdgp: marginal SD parameter
   *   lscale: length-scale parameter
   * Returns: 
   *   numeric values of the function evaluated at 'x'
   */
  vector spd_cov_exp_quad(vector[] x, real sdgp, real lscale) {
    int NB = dims(x)[1];
    int D = dims(x)[2];
    vector[NB] out;
    real constant = square(sdgp) * (sqrt(2 * pi()) * lscale)^D;
    real neg_half_lscale2 = -0.5 * square(lscale);
    for (m in 1:NB) {
      out[m] = constant * exp(neg_half_lscale2 * dot_self(x[m]));
    }
    return out;
  }
  real multiprobit_lpmf(
			int[] y,          // response data - integer array of 1s and 0s
			vector mu, // Intercept
			real[] u,         // nuisance that absorbs inequality constraints
			int K,            // response dimension size
			matrix L_Omega    // lower cholesky of correlation matrix
			){
    vector[K] z;       // latent continuous state
    vector[K] logLik;  // log likelihood
    real prev;
    /* mu = Intercept + beta * x; */
    prev = 0;
    for (k in 1:K) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
      real bound; // threshold at which utility = 0
      bound = Phi( -(mu[k] + prev) / L_Omega[k,k]  );
      if (y[k] == 1) {
        real t;
        t = bound + (1 - bound) * u[k];
        z[k] = inv_Phi(t);         // implies utility is positive
        logLik[k] = log1m(bound);  // Jacobian adjustment
      }
      else {
        real t;
        t = bound * u[k];
        z[k] = inv_Phi(t);         // implies utility is negative
        logLik[k] = log(bound);    // Jacobian adjustment
      }
      if (k < K) prev = L_Omega[k+1,1:k] * head(z, k);
      // Jacobian adjustments imply z is truncated standard normal
      // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
    }
    return sum(logLik);
  }
}

data {
  int<lower=1> K;  // Number of outcomes
  int<lower=0> N; // sample size
  int<lower=0,upper=N> J;              // number of groups
  int<lower=1,upper=J> group_index[N]; // group indices
  int<lower=0,upper=1> y[N,K]; // response data - integer array of 1s and 0s
  int<lower=1> NBgp;   // number of basis functions of an approximate GP
  matrix[N, NBgp] Xgp; // approximate GP basis matrices
  vector[2] slambda[NBgp]; // approximate GP eigenvalues
  real<lower=0> prior_a;
  real<lower=0> prior_b;
  /* vector[K] x[N]; // covariate data  */
}

parameters {
  vector[K] Intercept;
  vector[J] eta_re [K];         // Vector of group effects
  real<lower=0> sigmaalpha [K]; // Vector of sd of group effects
  vector[NBgp] zgp [K];               // latent part
  cholesky_factor_corr[K] L_Omega;
  real<lower=0,upper=1> u[N,K]; // nuisance that absorbs inequality constraints
  real<lower=0> lscale;            // length scale
  real<lower=0> sdgp;              // sd of the GP
}

transformed parameters {
  real mu[N, K];
  for (k in 1:K) {
    {
      vector[J] alpha = sigmaalpha[k] * eta_re[k];
      vector[NBgp] rgp = sqrt(spd_cov_exp_quad(slambda, sdgp, lscale)) .* zgp[k];
      vector[N] gp_k = Xgp * rgp;
      for (n in 1:N) {
        mu[n, k] = Intercept[k] + alpha[group_index[n]] + gp_k[n];
      }
    }
  }
}

model {
  L_Omega ~ lkj_corr_cholesky(4);
  Intercept ~ std_normal();

  lscale ~ inv_gamma(prior_a, prior_b);
  sdgp ~ std_normal();
  for(k in 1:K) {
    zgp[k] ~ std_normal();
    sigmaalpha[k] ~ std_normal();
    eta_re[k] ~ std_normal();
  }
  // implicit: u is iid standard uniform a priori
  { // likelihood
    for(n in 1:N){
      target += multiprobit_lpmf(y[n] | to_vector(mu[n]), u[n], K, L_Omega);
    }
  }
}

generated quantities {
  corr_matrix[K] Omega;
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = multiprobit_lpmf(y[n] | to_vector(mu[n]), u[n], K, L_Omega);
  Omega = multiply_lower_tri_self_transpose(L_Omega);  
}

For the model with multiple logistic regressions, the relevant changes are:

model {
  lscale ~ inv_gamma(prior_a, prior_b);
  sdgp ~ std_normal();

  // fit all logistic models
  for (k in 1:K) {
    zgp[k] ~ std_normal();
    // priors for the regression
    Intercept[k] ~ std_normal();
    sigmaalpha[k] ~ std_normal();
    eta_re[k] ~ std_normal();
    y[, k] ~ bernoulli_logit(mu[,k]);
  }
}

generated quantities {
  real log_lik[N, K];
  for (k in 1:K) {
    for (n in 1:N)
      log_lik[n, k] = bernoulli_logit_lpmf(y[n, k] | mu[n,k]);
  }
}

I am generating the data for the approximate GP with brms’ make_standata.
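Roughly like this (a sketch; dat is a placeholder for my data frame, and the exact element names such as Xgp_1 and slambda_1 should be checked against your own make_standata output):

library(brms)
# Extract the approximate-GP basis matrix and spectral eigenvalues from brms
# and pass them to the hand-written Stan model as Xgp and slambda.
sdata <- make_standata(
  y1 ~ gp(lon, lat, k = 20, c = 5/4) + (1 | group),
  data = dat, family = bernoulli()
)
str(sdata$Xgp_1)      # N x NBgp basis matrix   -> Xgp
str(sdata$slambda_1)  # NBgp x 2 eigenvalues    -> slambda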

Both models fit in reasonable time for 20 outcomes, and I do not get any warnings or issues. Rhat values all look good, and so do the ESSs (except for some cells in the correlation matrix, which have a lowish ESS).

One issue I am seeing is that the Pareto-k values for the multivariate probit model are terrible:

Computed from 4000 by 552 log-likelihood matrix

         Estimate    SE
 elpd_loo  -3076.2  68.8
 p_loo      1001.4  27.5
 looic      6152.4 137.5
 ------
Monte Carlo SE of elpd_loo is NA.

 Pareto k diagnostic values:
                          Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      34    6.2%   262
 (0.5, 0.7]   (ok)       211   38.2%   82
   (0.7, 1]   (bad)      255   46.2%   12
   (1, Inf)   (very bad)  52    9.4%   3

While for the multi-logistic model the results look much better:

Computed from 4000 by 14840 log-likelihood matrix
         Estimate    SE
elpd_loo  -4967.1  73.0
p_loo       340.3   8.3
looic      9934.2 146.1

Pareto k diagnostic values:
                          Count Pct.    Min. n_eff
 (-Inf, 0.5]   (good)     14770 99.5%   952
  (0.5, 0.7]   (ok)          64  0.4%   784
    (0.7, 1]   (bad)          6  0.0%   267
    (1, Inf)   (very bad)     0  0.0%   <NA>

I have essentially three questions:

  1. Are there any clear mistakes in either of the two models, or anything I could improve on?

  2. For now, with 20 outcomes, the models take about one or two hours to fit, but adding many more outcomes will make things extremely slow. Are there obvious ways of speeding things up?

  3. Any ideas why the Pareto diagnostics are so bad for the multivariate model? Any suggestions for things to try?

Thanks!

1 Like

At least for #3 please have a look here: Cross-validation FAQ

A GP with that large a sample size will take a long time to fit. Have you tried running the whole dataset with brms (since it kinda optimizes some things) and just one chain for 500 iterations, to get a rough estimate on computation time?

1 Like

Thanks for your answer and the link! I am a bit worried that the high Pareto-k values are due to a coding error on my part. At the very least, I don’t understand why the multivariate probit is so much worse.
Regarding the GP, I am using the approximation technique, but even with an exact GP that doesn’t seem to change things much.

Have you tried running the whole dataset with brms (since it kinda optimizes some things) and just one chain for 500 iterations, to get a rough estimate on computation time?

But these models are not really available in brms, no? If I fit

brm(mvbind(y1, y2) ~ gp(lon, lat) + (1|group), family = bernoulli)

I will get an independent GP for each outcome. But that is not really what I want; I want shared lscale and sdgp values for all outcomes. If you know of a way of fitting either of these models in brms, that’d be so much easier for me. And as far as I can tell there is no multivariate probit, is there?

1 Like

Heyhey, apologies, I haven’t reviewed your code fully, but in response to:

  3. Any ideas why the Pareto diagnostics are so bad for the multivariate model? Any suggestions for things to try?

In my own experience fitting much simpler MV probits, I can rarely get good Pareto k values. I think some of the core assumptions are violated by the implicit truncated MVN in the MV probit. A more robust approach in general (especially for tricky models) is to use a test designed for the task at hand; since I was using the model as a classifier, I used cross-entropy. Hope that’s more reassuring than negative!
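To make the cross-entropy idea concrete, something like this (a rough R sketch; p_rep and y_test are placeholder names for your own posterior predictive draws and held-out responses, not objects defined anywhere above):

# Average cross-entropy (negative log predictive density per observation) on
# held-out data. p_rep: S x N_test matrix of posterior predictive Pr(y = 1);
# y_test: vector of held-out 0/1 outcomes.
p_hat <- colMeans(p_rep)  # posterior mean predictive probability per observation
cross_entropy <- -mean(y_test * log(p_hat) + (1 - y_test) * log(1 - p_hat))
cross_entropy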

1 Like

Ah, you’re completely right; MV probit does not exist in brms, I believe… Additionally, I don’t think it’s possible to share lscale and sdgp across outcomes, but I think we should ask @paul.buerkner here. @avehtari could perhaps provide some input on the GP Stan code, and perhaps add to @fergusjchadwick’s answer?

I’ll add that I have a mostly-working implementation of multivariate cumulative probit (ordinal extension of the binary case) as a custom distribution for brms (with big thanks to @bgoodri whose ideas I copied and @CerulloE that kindly answered my inquiries and gave me several crucial hints). I am still testing this and preparing some other outputs, so I don’t think it is ready for completely public release yet (it might be in a few weeks), but I’d be happy to share the code for additional testing and collaboration (ping a private message if interested). Note that the MV probit itself adds a bunch of complexity and a lot of nuisance parameters, so it is quite likely that together with a GP the model will just break very badly.

One thing that helped a lot in my case was to use simple approximations to Phi and inv_Phi:

  real approx_Phi(real x) {
    return inv_logit(x * 1.702);
  }

  real approx_inv_Phi(real x) {
    return logit(x) / 1.702;
  }

(This is one of the hints provided by @CerulloE.)
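In the model above, that just means swapping the two calls inside multiprobit_lpmf, e.g. something like this (an untested sketch; the name multiprobit_approx_lpmf is mine, and it assumes approx_Phi and approx_inv_Phi are defined in the same functions block):

  real multiprobit_approx_lpmf(int[] y, vector mu, real[] u, int K,
                               matrix L_Omega) {
    // identical to multiprobit_lpmf, but with the logistic approximations
    // substituted for the exact normal CDF and inverse CDF
    vector[K] z;
    vector[K] logLik;
    real prev = 0;
    for (k in 1:K) {
      real bound = approx_Phi(-(mu[k] + prev) / L_Omega[k, k]);
      if (y[k] == 1) {
        real t = bound + (1 - bound) * u[k];
        z[k] = approx_inv_Phi(t);
        logLik[k] = log1m(bound);
      } else {
        real t = bound * u[k];
        z[k] = approx_inv_Phi(t);
        logLik[k] = log(bound);
      }
      if (k < K) prev = L_Omega[k + 1, 1:k] * head(z, k);
    }
    return sum(logLik);
  }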

It is also an interesting suggestion to investigate how LOO behaves with the parametrization I have…

I think you can make multiple predictors in brms share the same GP by using the non-linear syntax. But this is one more step into unknown territory, with a risk of making the model intractable. So something like this could work (code not actually tested):

brm(bf(y1 ~ my_smooth + other_terms, y2 ~ my_smooth + other_terms,
       my_smooth ~ 0 + gp(lon, lat), nl = TRUE, family = bernoulli))

Note that the non-linear syntax fundamentally changes how the formulas are interpreted by brms, so make sure to check out the docs.

Best of luck with the model!

3 Likes

@fergusjchadwick: I just realized why the Pareto-K is almost guaranteed to be bad if you just add up the contributions to target from this parametrization. In this case, you don’t compute the log likelihood of the observed values given the linear predictors. You actually compute the log likelihood given the linear predictors AND the nuisance parameters. Since the observed values have huge influence on the associated nuisance parameters, loo correctly treats them as having large influence on the model and thus having high k-hat.

If you could integrate the nuisance parameters out, you would probably get a reasonably-behaved likelihood that would play nicely with loo. That could probably be done by implementing some Monte-Carlo integrator for multivariate normals in generated quantities, but that could be hard to do (and expensive to compute).
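A very rough sketch of what I mean (untested; S and log_lik_marg are names I am making up here) would be to reuse multiprobit_lpmf with freshly simulated uniforms and average on the likelihood scale, GHK-style, e.g. in place of the current log_lik:

generated quantities {
  vector[N] log_lik_marg;
  for (n in 1:N) {
    int S = 100;   // Monte Carlo draws per observation; needs to be large enough
    vector[S] lp;
    for (s in 1:S) {
      real u_s[K];
      for (k in 1:K) {
        u_s[k] = uniform_rng(0, 1);
      }
      // log of the product of conditional interval probabilities for this draw
      lp[s] = multiprobit_lpmf(y[n] | to_vector(mu[n]), u_s, K, L_Omega);
    }
    // log of the average likelihood over draws, i.e. an estimate of
    // log P(y[n] | mu[n], Omega) with the nuisance parameters integrated out
    log_lik_marg[n] = log_sum_exp(lp) - log(S);
  }
}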

It could also make some kind of weird sense to compute the multivariate normal log likelihood of the nuisance parameters given the linear predictors and feed that to loo, but I can’t quite work out whether that would correspond to a meaningful quantity or not.

2 Likes

Thank you for your answer, Martin!

One thing that helped a lot in my case was to use a simple approximation to Phi and inv_Phi as

I’ll give this a try!

I think you can make multiple predictors in brms share the same GP by using the non-linear syntax.

I gave this a try, but it produces independent GPs for each outcome. I think I remember reading some comment by Paul explaining that this will only be possible in brms 3, but maybe I’m wrong?

Best of luck with the model!

Thanks!

@martinmodrak: That makes a huge amount of sense. It’s remarkable how complex going multivariate makes things. I’d be very interested to hear if you figure out a solution (and about your new parameterisation).

Btw, it’s very rare to get p_loo larger than the number of observations, i.e. the number of columns of the log-lik matrix (here 552).

But @martinmodrak already figured out the reason.

Note also the cleaner and faster spectral density functions at https://avehtari.github.io/casestudies/Birthdays/birthdays.html (there’s a link to the repo there).

1 Like

Thanks Aki. I’m looking at the linked functions, but do these work for 2-dimensional GPs?

Not directly, but looking at those you could make the same improvements in the 2D versions. Gabriel Riutort-Mayol is aware of the improvements and has said he would update the repo, which has functions for the 2D case, but he seems to be busy with other things.

I gave this a shot, and it samples much faster, but I think I made a mistake, because the estimates differ somewhat from the exact GP and tend to vary a lot as I change L and M. If you have a minute, do you see how to fix these functions, or where the error might be?

  matrix PHI_EQ_2D(int N, int M, real L, vector x1, vector x2) {
    matrix[N, M] fi;
    matrix[N, M] fi1;
    matrix[N, M] fi2;
    fi1 = sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x1+L), M),
				 linspaced_vector(M, 1, M)))/sqrt(L);
    fi2 = sin(diag_post_multiply(rep_matrix(pi()/(2*L) * (x2+L), M),
				 linspaced_vector(M, 1, M)))/sqrt(L);
    fi = fi1.*fi2;
    return fi;
  }

  vector diagSPD_EQ_2D(real sdgp, real lscale, real L, int M) {
    return sqrt((sdgp^2) * sqrt(2*pi()*lscale)^2 *
		exp(-0.5*(lscale*pi()/2/L)^2*linspaced_vector(M, 1, M)^2));
  }
1 Like

Just letting you know that I’ve seen your question, but I haven’t had enough time to figure out an answer yet.

3 Likes

I think sqrt(2*pi()*lscale)^2 should be 2*pi()*lscale^2, and the term inside exp seems to be missing a *2 (compare how https://github.com/gabriuma/basis_functions_approach_to_GP/blob/master/multi_dimensional/simulated_dataset/model_BF_multi.stan#L23 has rho1^2*w1^2 + rho2^2*w2^2, while you have rho1=rho2 and w1=w2).
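Putting both fixes together, the function would look something like this (untested sketch; square() is used for the elementwise power):

  vector diagSPD_EQ_2D(real sdgp, real lscale, real L, int M) {
    // constant is sigma^2 * 2 * pi * lscale^2 rather than sqrt(2*pi*lscale)^2,
    // and the exponent gets a factor 2 because rho1 = rho2 and w1 = w2
    return sqrt(sdgp^2 * 2 * pi() * lscale^2
                * exp(-0.5 * 2 * (lscale * pi() / (2 * L))^2
                      * square(linspaced_vector(M, 1, M))));
  }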

1 Like