I’m relatively new to writing Stan models directly and was hoping I could get some feedback on this.
I have data from about 700 populations. For each population I have a series of binary outcomes (~1000 per population) which I am interested in modelling. Each population belongs to a group (which I am modelling as group effects), and I have longitude and latitude coordinates for each population (which I am modelling with a GP).
The data looks a bit like this:
y1  y2  y3  y4   group  lon  lat
 1   1   0   1       1   12   14
 0   0   1   0       1   13   20
 1   1   0   1       2    5   23
...
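In case it helps, here is a minimal R sketch that simulates fake data with the same layout as above (the number of groups, the coordinate ranges, and the independent outcome process are all made up, purely to show the shape of the data):

set.seed(1)
N <- 700   # populations
K <- 20    # binary outcomes currently modelled
J <- 30    # number of groups (made up)

d <- data.frame(
  group = sample(seq_len(J), N, replace = TRUE),
  lon   = runif(N, 0, 30),
  lat   = runif(N, 0, 30)
)

# K binary outcomes per population (independent here; in the real data
# I expect many of them to be correlated)
y <- matrix(rbinom(N * K, 1, 0.5), nrow = N,
            dimnames = list(NULL, paste0("y", seq_len(K))))
d <- cbind(as.data.frame(y), d)
head(d)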
I have been experimenting with two ideas. One is to fit multiple logistic models that all share the same lscale and sdgp parameters. The other is a multivariate probit model (I am more or less following the implementation here). From a theory perspective, I think the multivariate probit makes much more sense, because I expect many of these outcomes to be correlated with each other.
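To make explicit what I mean by the multivariate probit (this is what the multiprobit_lpmf function below implements), the model is something like

$$
z_n \sim \mathcal{N}(\mu_n, \Omega), \qquad
y_{nk} = \mathbf{1}\{ z_{nk} > 0 \}, \qquad
\mu_{nk} = \text{Intercept}_k + \alpha_{k,\text{group}(n)} + f_k(\text{lon}_n, \text{lat}_n),
$$

where $\Omega$ is a $K \times K$ correlation matrix shared across populations and each $f_k$ is an (approximate) GP over longitude and latitude, with lscale and sdgp shared across the $K$ outcomes. The multi-logistic version keeps the same linear predictor but uses a logit link and drops $\Omega$.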
functions {
  /* Spectral density function of a Gaussian process
   * with squared exponential covariance kernel
   * Args:
   *   x: array of numeric values of dimension NB x D
   *   sdgp: marginal SD parameter
   *   lscale: length-scale parameter
   * Returns:
   *   numeric values of the function evaluated at 'x'
   */
  vector spd_cov_exp_quad(vector[] x, real sdgp, real lscale) {
    int NB = dims(x)[1];
    int D = dims(x)[2];
    vector[NB] out;
    real constant = square(sdgp) * (sqrt(2 * pi()) * lscale)^D;
    real neg_half_lscale2 = -0.5 * square(lscale);
    for (m in 1:NB) {
      out[m] = constant * exp(neg_half_lscale2 * dot_self(x[m]));
    }
    return out;
  }
  real multiprobit_lpmf(
    int[] y,        // response data - integer array of 1s and 0s
    vector mu,      // linear predictor (intercept + group effect + GP)
    real[] u,       // nuisance that absorbs inequality constraints
    int K,          // response dimension size
    matrix L_Omega  // lower Cholesky factor of the correlation matrix
  ) {
    vector[K] z;      // latent continuous state
    vector[K] logLik; // log likelihood
    real prev;
    /* mu = Intercept + beta * x; */
    prev = 0;
    for (k in 1:K) { // Phi and inv_Phi may overflow and / or be numerically inaccurate
      real bound; // threshold at which utility = 0
      bound = Phi( -(mu[k] + prev) / L_Omega[k, k] );
      if (y[k] == 1) {
        real t;
        t = bound + (1 - bound) * u[k];
        z[k] = inv_Phi(t);        // implies utility is positive
        logLik[k] = log1m(bound); // Jacobian adjustment
      } else {
        real t;
        t = bound * u[k];
        z[k] = inv_Phi(t);        // implies utility is negative
        logLik[k] = log(bound);   // Jacobian adjustment
      }
      if (k < K) prev = L_Omega[k + 1, 1:k] * head(z, k);
      // Jacobian adjustments imply z is truncated standard normal
      // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
    }
    return sum(logLik);
  }
}
data {
  int<lower=1> K;                      // number of outcomes
  int<lower=0> N;                      // sample size
  int<lower=0,upper=N> J;              // number of groups
  int<lower=1,upper=J> group_index[N]; // group indices
  int<lower=0,upper=1> y[N,K];         // response data - integer array of 1s and 0s
  int<lower=1> NBgp;                   // number of basis functions of the approximate GP
  matrix[N, NBgp] Xgp;                 // approximate GP basis matrix
  vector[2] slambda[NBgp];             // approximate GP eigenvalues
  real<lower=0> prior_a;               // shape of the inverse-gamma prior on lscale
  real<lower=0> prior_b;               // scale of the inverse-gamma prior on lscale
  /* vector[K] x[N];                   // covariate data */
}
parameters {
  vector[K] Intercept;
  vector[J] eta_re[K];          // standardized group effects (one vector per outcome)
  real<lower=0> sigmaalpha[K];  // SDs of the group effects
  vector[NBgp] zgp[K];          // standardized GP coefficients (latent part)
  cholesky_factor_corr[K] L_Omega;
  real<lower=0,upper=1> u[N,K]; // nuisance that absorbs inequality constraints
  real<lower=0> lscale;         // GP length scale (shared across outcomes)
  real<lower=0> sdgp;           // GP marginal SD (shared across outcomes)
}
transformed parameters {
  real mu[N, K]; // linear predictor for each population and outcome
  for (k in 1:K) {
    {
      vector[J] alpha = sigmaalpha[k] * eta_re[k];                                // group effects
      vector[NBgp] rgp = sqrt(spd_cov_exp_quad(slambda, sdgp, lscale)) .* zgp[k]; // GP coefficients
      vector[N] gp_k = Xgp * rgp;                                                 // approximate GP values
      for (n in 1:N) {
        mu[n, k] = Intercept[k] + alpha[group_index[n]] + gp_k[n];
      }
    }
  }
}
model {
  L_Omega ~ lkj_corr_cholesky(4);
  Intercept ~ std_normal();
  lscale ~ inv_gamma(prior_a, prior_b);
  sdgp ~ std_normal();
  for (k in 1:K) {
    zgp[k] ~ std_normal();
    sigmaalpha[k] ~ std_normal();
    eta_re[k] ~ std_normal();
  }
  // implicit: u is iid standard uniform a priori
  { // likelihood
    for (n in 1:N) {
      target += multiprobit_lpmf(y[n] | to_vector(mu[n]), u[n], K, L_Omega);
    }
  }
}
generated quantities {
  corr_matrix[K] Omega;
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = multiprobit_lpmf(y[n] | to_vector(mu[n]), u[n], K, L_Omega);
  Omega = multiply_lower_tri_self_transpose(L_Omega);
}
For the model with multiple logistic regressions the relevant changes are:
model {
  lscale ~ inv_gamma(prior_a, prior_b);
  sdgp ~ std_normal();
  // fit all K logistic models
  for (k in 1:K) {
    zgp[k] ~ std_normal();
    // regression priors
    Intercept[k] ~ std_normal();
    sigmaalpha[k] ~ std_normal();
    eta_re[k] ~ std_normal();
    y[, k] ~ bernoulli_logit(mu[, k]);
  }
}
generated quantities {
  real log_lik[N, K];
  for (k in 1:K) {
    for (n in 1:N)
      log_lik[n, k] = bernoulli_logit_lpmf(y[n, k] | mu[n, k]);
  }
}
I am generating the Xgp and slambda inputs for the approximate GP with brms’ make_standata().
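Roughly like this (a sketch rather than my exact code; the gp() arguments, the hyperparameter values, the file name, and the brms-generated names Xgp_1 / slambda_1 are placeholders and may differ between brms versions):

library(brms)
library(rstan)

# use brms only to build the approximate-GP basis matrix and eigenvalues
sdata <- make_standata(
  y1 ~ gp(lon, lat, k = 10, c = 5/4) + (1 | group),  # k and c are placeholders
  data = d, family = bernoulli()
)

# rename the brms pieces to match the data block of the model above
standata <- list(
  K = 20, N = nrow(d), J = length(unique(d$group)),
  group_index = as.integer(factor(d$group)),
  y = as.matrix(d[, paste0("y", 1:20)]),
  NBgp = ncol(sdata$Xgp_1), Xgp = sdata$Xgp_1, slambda = sdata$slambda_1,
  prior_a = 2, prior_b = 1  # placeholder inverse-gamma hyperparameters
)

fit_probit <- stan("multiprobit_gp.stan", data = standata, chains = 4)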
Both models fit in reasonable time for 20 outcomes, and I do not get any warnings or issues. Rhat values all look good and so do the ESSs (except for some cells of the correlation matrix, which have a lowish ESS).
One issue I am seeing is that the Pareto-k values for the multivariate probit model are terrible:
Computed from 4000 by 552 log-likelihood matrix

         Estimate    SE
elpd_loo  -3076.2  68.8
p_loo      1001.4  27.5
looic      6152.4 137.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count  Pct.   Min. n_eff
(-Inf, 0.5]   (good)        34   6.2%   262
 (0.5, 0.7]   (ok)         211  38.2%    82
   (0.7, 1]   (bad)        255  46.2%    12
   (1, Inf)   (very bad)    52   9.4%     3
For the multi-logistic model, the results are much better:
Computed from 4000 by 14840 log-likelihood matrix

         Estimate    SE
elpd_loo  -4967.1  73.0
p_loo       340.3   8.3
looic      9934.2 146.1

Pareto k diagnostic values:
                         Count  Pct.   Min. n_eff
(-Inf, 0.5]   (good)     14770  99.5%   952
 (0.5, 0.7]   (ok)          64   0.4%   784
   (0.7, 1]   (bad)          6   0.0%   267
   (1, Inf)   (very bad)      0   0.0%   <NA>
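For reference, both sets of loo results are computed the same way, roughly like this (a sketch; fit_probit is a placeholder for the fitted stanfit object):

library(loo)

ll    <- extract_log_lik(fit_probit, parameter_name = "log_lik",
                         merge_chains = FALSE)
r_eff <- relative_eff(exp(ll))
loo(ll, r_eff = r_eff)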
I have essentially three questions:

- Are there any clear mistakes in either of the two models, or anything I could improve on?
- With 20 outcomes the models currently take about one or two hours to fit, but adding many more outcomes will make things extremely slow. Are there obvious ways of speeding things up?
- Any ideas why the Pareto diagnostics are so bad for the multivariate probit model? Any suggestions for things to try?
Thanks!