Multivariate normal too slow

performance

#1

Hi there,
I’m very new to Stan and I have been trying a model with multivariate outcomes.
Quick details about the data: I have a set of subjects and groups and for each subject-group pair there is a measure which is 6-dimensional. Because of the way these measures are taken, I assume diagonal covariance matrix.
It takes more than 10 hours to run 500 iterations (250 warmup 250 sampling) with 2 chains on 2 cores.
Below is my code and any helpful look into that to speed up would be really great. Thanks!

data {
int<lower=1> K; //dimension size (6)
int<lower=1> J; // num subjects (183)
int<lower=1> S; // num groups (65)  
int<lower=0> N; // num observations (1000)
vector[J] x[N]; // for subjects indicators
vector[S] u[N]; // for group indicators
vector[K] y[N]; // outcome measure (numeric)
}

parameters {
matrix[K, J] beta; //subject coefficients
matrix[K, S] eta; // group coefficients
vector<lower=0>[K] sigma; // std for dimesnions
}

model {
vector[K] mu[N];
matrix[K, K] Sigma;
Sigma = diag_matrix(square(sigma)); // assuming diagonal covariance matrix

for (n in 1:N)
mu[n] = beta * x[n] + eta * u[n];
y ~ multi_normal(mu , Sigma);
}

#2

Never use multi_normal with a diagonal covariance matrix, use the univariate normal distribution on the stacked observations


#3

Thanks Ben!

I tried that version too (code below) and it took about 2.5 hours for 500 iterations with warning about increasing max_treedepth.

data {
int<lower=1> J; // num subjects (183)
int<lower=1> S; // num groups (65)  
int<lower=0> N; // num observations (6000)
matrix[N,J] x;  // for subjects indicators
matrix[N,S] u; // for group indicators
real y[N]; // outcome measure (numeric)
}

parameters {
vector[J] beta;
vector[S] eta;
real<lower=0> sigma;
}

model {
y ~ normal(x* beta + u * eta, sigma);
}

Is there any settings that I need to do in RStan to speed this up?


#4

You probably need to use the QR reparameterization.


#5

Thanks so much Ben! QR reparameterization reduced the time to 16 mins from 2.5 hours.

One question, is QR reparameterization still possible when I model the multivariate outcomes with covariance parameter (similar to my initial code in the question)? Instead of a matrix for the predictors, I’ll have a set of vectors (vector[J] x[N];)

Any pointers will be very useful.


#6

You are going to need a matrix to be doing the QR decomposition. I think in your case, it would be several QR decompositions.


#7

Following up on this question. I tried a complicated model with a larger dataset as follows:

data{
int<lower=1> J; // num subjects (5,000)
int<lower=1> S; // num groups (90)
int<lower=0> N; // num observations (25,000)
int<lower=1> K; // num dimensions (6)
matrix[N,J] x;  // for subjects indicators
matrix[N,S] u; // for group indicators
vector[K] y[N];  // outcome measure (numeric)
int<lower=1,upper=90> group[N]; // group for subject
}


transformed data {
  // QR reparameterization of predictors
  matrix[N, J] Qx = qr_Q(x)[, 1:J] * N;
  matrix[J, J] R = qr_R(x)[1:J, ] / N;
  matrix[J, J] R_inv = inverse(R);

  matrix[N, S] Qu = qr_Q(u)[, 1:S] * N;
  matrix[S, S] R2 = qr_R(u)[1:S, ] / N;
  matrix[S, S] R2_inv = inverse(R2);
}

parameters {
  matrix[K,J] beta_t; // for subject
  matrix[K,S] eta_t; // for group

  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] L_sigma;
  vector<lower=0,upper=1>[S] alpha;  // random intercept for each subject-group
}

transformed parameters {
  matrix[K,J] beta =  beta_t * R_inv ;
  matrix[K,S] eta =  eta_t * R2_inv;
}


model {
vector[K] mu[N];
matrix[K, K] L_Sigma;

alpha ~ beta(2,2);
L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

to_vector(beta) ~ normal(0, 5);
to_vector(eta) ~ normal(0, 5);
L_Omega ~ lkj_corr_cholesky(4);
L_sigma ~ cauchy(0, 2.5);

for (n in 1:N){
for (k in 1:K){
mu[n][k] = alpha[group[n]] * dot_product(beta_t[k], Qx[n]) + dot_product(eta_t[k], Qu[n]); // to avoid transpose
}
}
y ~ multi_normal_cholesky(mu, L_Sigma);
}

Based on the manual, I tried few speed up trick such as removing transpose and using do_product.
Initial estimation for two chains in two cores is as follows:

Gradient evaluation took 18.4891 seconds
1000 transitions using 10 leapfrog steps per transition would take 184891 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 20.4548 seconds
1000 transitions using 10 leapfrog steps per transition would take 204548 seconds.
Adjust your expectations accordingly!

However, this code is running for more than four days now and it has only finished the warmpup state (250/500 iterations) and started sampling (251/500).

Any suggestions for further improving the speed would be helpful!


#8

The next thing to do is use matrix algebra to define mu. Somethign like

mu[n] = alpha(group[n]) + beta_t[k] * Qx + eta_t[k] * Qu;

Or the other way around to vectorize on K if that’s bigger.

Where do your estimates fall? Those hard interval constraints can be bad news, as I just blogged about:

So the best thing to do at this point might be adding tighter priors. For instance, those Cauchy can also be problematic if there’s not enough data to determine the scale of L_sigma.

Thanks for sharing the detailed example and reporting back.