Multi-chain performance related to Local Setup

goodmattg · December 6, 2017, 2:28am

All,

I’ve been testing a model I found recently with multiple chains on a large dataset. I’m consistently seeing drastically different performance in the chain evaluation times. I understand that there’s randomness involved with certain chains not converging, but is it also possible that my local configuration is wrong?

In the example below: 1 chain completes in 80 seconds, one in ~4500 seconds and one at ~8000 seconds. This is on a dual-core machine.

Looking for any pointers to get consistent performance out of multiple chains, or dealing with large datasets.

Gradient evaluation took 0.046838 seconds
1000 transitions using 10 leapfrog steps per transition would take 468.38 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.046643 seconds
1000 transitions using 10 leapfrog steps per transition would take 466.43 seconds.
Adjust your expectations accordingly!

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: Error in function log1p(double): log1p(x) requires x > -1, but got x = -1.0000000000000002. (in ‘unkown file name’ at line 30)

If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration: 1 / 200 [ 0%] (Warmup)
Iteration: 1 / 200 [ 0%] (Warmup)
Iteration: 20 / 200 [ 10%] (Warmup)
Iteration: 40 / 200 [ 20%] (Warmup)
Iteration: 60 / 200 [ 30%] (Warmup)
Iteration: 20 / 200 [ 10%] (Warmup)
Iteration: 40 / 200 [ 20%] (Warmup)
Iteration: 80 / 200 [ 40%] (Warmup)
Iteration: 60 / 200 [ 30%] (Warmup)
Iteration: 80 / 200 [ 40%] (Warmup)
Iteration: 100 / 200 [ 50%] (Warmup)
Iteration: 101 / 200 [ 50%] (Sampling)
Iteration: 120 / 200 [ 60%] (Sampling)
Iteration: 140 / 200 [ 70%] (Sampling)
Iteration: 160 / 200 [ 80%] (Sampling)
Iteration: 180 / 200 [ 90%] (Sampling)
Iteration: 200 / 200 [100%] (Sampling)

Elapsed Time: 38.914 seconds (Warm-up)
40.9491 seconds (Sampling)
79.8631 seconds (Total)

Gradient evaluation took 0.045198 seconds
1000 transitions using 10 leapfrog steps per transition would take 451.98 seconds.
Adjust your expectations accordingly!

WARNING: There aren’t enough warmup iterations to fit the
three stages of adaptation as currently configured.
Reducing each adaptation stage to 15%/75%/10% of
the given number of warmup iterations:
init_buffer = 15
adapt_window = 75
term_buffer = 10

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: Error in function log1p(double): log1p(x) requires x > -1, but got x = -1.0000000000000002. (in ‘unkown file name’ at line 30)

If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration: 1 / 200 [ 0%] (Warmup)
Iteration: 20 / 200 [ 10%] (Warmup)
Iteration: 100 / 200 [ 50%] (Warmup)
Iteration: 101 / 200 [ 50%] (Sampling)
Iteration: 40 / 200 [ 20%] (Warmup)
Iteration: 120 / 200 [ 60%] (Sampling)
Iteration: 60 / 200 [ 30%] (Warmup)
Iteration: 140 / 200 [ 70%] (Sampling)
Iteration: 80 / 200 [ 40%] (Warmup)
Iteration: 160 / 200 [ 80%] (Sampling)
Iteration: 100 / 200 [ 50%] (Warmup)
Iteration: 101 / 200 [ 50%] (Sampling)
Iteration: 180 / 200 [ 90%] (Sampling)
Iteration: 120 / 200 [ 60%] (Sampling)
Iteration: 200 / 200 [100%] (Sampling)

Elapsed Time: 184.99 seconds (Warm-up)
4579.35 seconds (Sampling)
4764.34 seconds (Total)

Iteration: 140 / 200 [ 70%] (Sampling)
Iteration: 160 / 200 [ 80%] (Sampling)
Iteration: 180 / 200 [ 90%] (Sampling)
Iteration: 200 / 200 [100%] (Sampling)

Elapsed Time: 3056.8 seconds (Warm-up)
4563.45 seconds (Sampling)
7620.25 seconds (Total)

sakrejda · December 6, 2017, 3:15am

This is often an indication of a model that has an issue somewhere in the parameter space but it’s small enough that not all chains hit it.

goodmattg · December 6, 2017, 3:59am

Understood. But am I to understand that that from a performance standpoint, a poorly defined model can lead to some chains falling to 1/10 of the average time to complete a chain?

Is there any general technique to diagnose “issues” in a model?

goodmattg · December 6, 2017, 4:16am

I got this model from a forum I attended. Other than setting up Beta using cholesky factorization, I can’t see anything else wrong.

data {
int<lower=2> C; // # of alternatives (choices) in each scenario
int<lower=1> K; // # of covariates of alternatives
int<lower=1> N; // # of respondents
int<lower=1> S; // # of scenarios per respondent
int<lower=0> G; // # of respondent covariates
int<lower=1,upper=C> Y[N, S]; // observed choices
matrix[C, K] X[N, S]; // matrix of attributes for each obs
matrix[G, N] Z; // vector of covariates for each respondent
}

parameters {
matrix[K, N] Beta;
matrix[K, G] Theta;
corr_matrix[K] Omega;
vector<lower=0>[K] tau;
}
transformed parameters {
cov_matrix[K] Sigma = quad_form_diag(Omega, tau);
}

model {

// Setting up the covariance matrix prior (STAN Reference p.148)
to_vector(Theta) ~ normal(0, 5);
tau ~ cauchy(0, 2.5);
Omega ~ lkj_corr(2);

for (n in 1:N) {
Beta[,n] ~ multi_normal(Theta*Z[,n], Sigma);
for (s in 1:S)
Y[n,s] ~ categorical_logit(X[n,s]*Beta[,n]);
}
}

Bob_Carpenter · January 5, 2018, 10:34pm

Michael Betancourt’s case studies on workflow is a good place to start. You’ll probably get divergence warnings from chains that fall to 1/10 of the average compute time as they typically indicate chains that have found highly curved parts of the posterior and dramatically reduced step size. The other chains will be biase because they’ll miss these highly curved areas.

Other problems may be numerical in nature. You want to use multi_normal_cholesky() and a Cholesky factor of a correlation matrix for a parmeter. You might also want to try lighter tailed priors on tau if the data doesn’t constrain it well. Also, correlation/covariance matrices can be hard to fit without a lot of data, especially in high dimensions.

Sometimes it’s not a model per se that is wrong but the fit to the data. Have ou tried simulating data form the model and fitting that?

goodmattg · January 12, 2018, 6:50pm

Thanks for the response Bob!

Those are all good points - I’ll make sure to look into them. Honestly I haven’t. Will take some time in the near future to simulate data from the model.

Topic		Replies	Views
Is having 4 chains with length of 5000 the same as having 8 chains with length of 2500? Modeling	8	1586	January 28, 2022
Preliminary benchmark: incremental and adaptive parallel warm-up Publicity warm-up	4	906	November 9, 2021
Variation in elapsed time of parallel chains and best practices for computing expectations from multiple chains Algorithms mcmc	18	3119	October 29, 2018
Error in dim(x) <- c(dim(x), 1) Modeling	3	49	December 17, 2024
What is the relationship between transition and iteration General	3	1839	April 12, 2021

Multi-chain performance related to Local Setup

Related topics