# Varying Effects Zero-Inflated Mixture Model runs slowly

This model fits well on smaller datasets, but as the number of clusters and observations grows, the time required to fit grows non-linearly. Is this a fault of my model?

I am trying to fit count data for N subjects, where subjects are broken into C
distinct clusters (C << N). My model is a multi-level, zero-inflated mixture of two negative binomials (i.e., neg_binomial_2_lpmf). Some clusters have very little data, while others have lots, so I want to take advantage of pooling between clusters.

```stan
data {
  int<lower=1> N;         // number of observations
  int<lower=1> C;         // number of groups/clusters
  real<lower=0> meanF;    // mean freq across all
  int<lower=0> Y[N];      // outcome variable
  int<lower=1> G[N];      // cluster index for each observation
}
parameters {
  real<lower=0,upper=1> zinf[C];     // Zero inflation probability
  real<lower=0,upper=1> theta[C];    // Mix weight for 2x Neg. Binom. mixture
  positive_ordered[2] phi[C];        // Size (phi) parameter for Neg. Binom.
  positive_ordered[2] mu[C];         // Mean (mu) parameter for Neg. Binom.
}
model {
  for (i in 1:C) {
    phi[i] ~ exponential(1);        // Phi prior - positive, but close to 0
    mu[i] ~ lognormal(meanF, 100);  // Mu prior - lognormal captures both modes
    zinf[i] ~ beta(2, 2);           // Prob that Y[n] is an inflated 0
    theta[i] ~ beta(2, 2);          // Prob that Y[n] comes from 1st NegBinomial
  }
  for (n in 1:N) {
    int d = G[n];
    if (Y[n] > 0)
      target += log1m(zinf[d]) +
                log_mix(theta[d],
                        neg_binomial_2_lpmf(Y[n] | mu[d, 1], phi[d, 1]),
                        neg_binomial_2_lpmf(Y[n] | mu[d, 2], phi[d, 2]));
    if (Y[n] == 0)
      target += log_mix(zinf[d], 0,
                        log_mix(theta[d],
                                neg_binomial_2_lpmf(0 | mu[d, 1], phi[d, 1]),
                                neg_binomial_2_lpmf(0 | mu[d, 2], phi[d, 2])));
  }
}
```
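As a sanity check on the zero branch: `log_mix(zinf, 0, lp)` equals `log(zinf + (1 - zinf) * exp(lp))`, i.e., a point mass at zero mixed with the two-NB mixture, so the full pmf should still sum to one. A quick numerical check of that structure (Python/scipy sketch; all parameter values here are made up, and Stan's `neg_binomial_2(mu, phi)` is mapped to scipy's size/prob parameterization):

```python
import numpy as np
from scipy.stats import nbinom
from scipy.special import logsumexp

def nb2_logpmf(y, mu, phi):
    # Stan's neg_binomial_2(mu, phi) == scipy's nbinom(n=phi, p=phi/(phi+mu))
    return nbinom.logpmf(y, phi, phi / (phi + mu))

def zinb_mix_logpmf(y, zinf, theta, mu, phi):
    # log-likelihood of one observation under the zero-inflated two-NB mixture
    lp_mix = logsumexp([np.log(theta) + nb2_logpmf(y, mu[0], phi[0]),
                        np.log1p(-theta) + nb2_logpmf(y, mu[1], phi[1])])
    if y == 0:
        # log_mix(zinf, 0, lp_mix): inflated zero or a mixture zero
        return logsumexp([np.log(zinf), np.log1p(-zinf) + lp_mix])
    return np.log1p(-zinf) + lp_mix

# illustrative (made-up) parameter values: low mode near 2, high mode near 300
zinf, theta = 0.3, 0.6
mu, phi = [2.0, 300.0], [0.5, 5.0]

# the pmf should sum to ~1 over the support
total = sum(np.exp(zinb_mix_logpmf(y, zinf, theta, mu, phi))
            for y in range(5000))
print(round(total, 4))  # ≈ 1.0
```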


Called with:

options(mc.cores = parallel::detectCores())

```r
ZInbn2fit <- stan(model_code = ZInbn2Model,
                  data = ZInbn2Data,
                  iter = 3000, warmup = 1000, chains = 2)
```

Both `mu` and `phi` are declared `positive_ordered` so that the model is identifiable.
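The ordering is needed because the mixture likelihood is invariant to swapping the two components (with theta swapped for 1 - theta), so without it the posterior has two symmetric, label-switched modes. A quick check of that symmetry (Python/scipy sketch, made-up values):

```python
import numpy as np
from scipy.stats import nbinom

def nb2_pmf(y, mu, phi):
    # Stan's neg_binomial_2(mu, phi) in scipy's (n, p) parameterization
    return nbinom.pmf(y, phi, phi / (phi + mu))

def mix_pmf(y, theta, mu1, phi1, mu2, phi2):
    # two-component negative binomial mixture
    return theta * nb2_pmf(y, mu1, phi1) + (1 - theta) * nb2_pmf(y, mu2, phi2)

y = 7
a = mix_pmf(y, 0.6, 2.0, 0.5, 300.0, 5.0)
b = mix_pmf(y, 0.4, 300.0, 5.0, 2.0, 0.5)  # same mixture, components swapped
print(np.isclose(a, b))  # True: both labelings give an identical likelihood
```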

Stan identifies all parameters well and without errors, but it takes forever in practice. In tests with 2 clusters and 2000 observations (split in various proportions between clusters) the model fits in under a minute (after compilation). But actual data with 15 clusters and about 15,000 observations takes ~4 hours (e.g., 1000 transitions using 10 leapfrog steps per transition would take 139.58 seconds.).

Some datasets have over 200 clusters and 200K observations. Since I cannot control the number of clusters, and I’d prefer not to lose any data by sampling observations, I’m hoping either my model code or stan(…) parameters may be optimized somehow. Any ideas are appreciated.

MacBook Pro 2019, 2.3GHz 8-Core Intel Core i9, 16GB RAM
RStudio v.1.4.1106
R version 3.6.2 (2019-12-12) – “Dark and Stormy Night”
rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)

Does this model work with the optimizer? You’re just getting the mode then, but maybe it can help you work with these bigger datasets.

chains = 2

The way the model is written it will only use one core for each chain. The easiest way to do within-chain parallelism (more than one core in each chain) is reduce_sum (tutorial here).

You’ll need to switch to cmdstanr to use that though (website here, but I think there is info in that tutorial). rstan 2.21 does not have that feature.
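For intuition, `reduce_sum` works because the log likelihood is a sum of independent per-observation terms, so it can be split into partial sums evaluated on separate cores and then combined. A toy sketch of that property (Python, not Stan; the chunk count stands in for the number of worker threads):

```python
import numpy as np

rng = np.random.default_rng(0)
log_lik = rng.normal(size=15_000)  # stand-in for per-observation log-likelihood terms

# full sequential sum, as in the single-core model block
full = log_lik.sum()

# reduce_sum-style: split into chunks, sum each independently (parallelizable),
# then combine the partial sums
chunks = np.array_split(log_lik, 8)
partial = sum(c.sum() for c in chunks)

print(np.isclose(full, partial))  # True
```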

I have tried optimizing within rstan:

```r
m <- stan_model(model_code = ZInbn2Model)
f <- optimizing(m, data = ZInbn2Data)
```

The mixture distribution has two modes, one at 0 and another at much larger X (200…600). As I interpret this message, the optimizer is unable to find both. Does that make sense?

```
Error in sampler$unconstrain_pars(theta) :
  Exception: Error transforming variable mu: stan::io::positive_ordered_unconstrain: Vector is not a valid ordered vector. The element at 2 is 201.748, but should be greater than the previous element, 201.748 (in 'model7b034cfb14c_b2b1cd814bae3517d88e3d851131e1b3' at line 13)
In .local(object, ...) : non-zero return code in optimizing
```
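The error is consistent with the two components collapsing onto each other: Stan unconstrains a `positive_ordered[2]` vector as `(log(x1), log(x2 - x1))`, so when the two elements become equal the second coordinate is `log(0) = -inf` and the transform fails. A sketch of that transform (Python; follows the transform described in the Stan reference manual):

```python
import numpy as np

def positive_ordered_unconstrain(x):
    # Stan's transform for positive_ordered[2]: y1 = log(x1), y2 = log(x2 - x1)
    x1, x2 = x
    return np.log(x1), np.log(x2 - x1)

print(positive_ordered_unconstrain([1.0, 201.748]))         # finite coordinates
print(positive_ordered_unconstrain([201.748, 201.748])[1])  # -inf: tied elements
```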

Thanks so much for the helpful links on CmdStanR. That is clearly the way to go. However I fear my issues with this model are more fundamental. I ran it last night on another dataset and it failed to converge.

```
1: There were 3344 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
2: There were 8 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
3: Examine the pairs() plot to diagnose sampling problems
4: The largest R-hat is 1.32, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
```

Oddly, re-running with more chains (4), longer warmup (1500) and more iterations (6000) made the problem worse:

- 13288 divergent transitions after warmup
- 3180 transitions that exceeded the maximum treedepth
- largest R-hat is 2.52

I fear I’m going to have to give up on pooling and run each cluster individually. Some clusters have very little data, and I was hoping to benefit by pooling them with clusters that have an overabundance.

Again, many thanks for the insight.

Ooo, yeah, the sampler, VI, or optimizer aren’t going to be able to cleanly handle multiple modes.

Gotta either parameterize that stuff out (get rid of one of the modes), or bring in assumptions or something from the outside.

I think this error is the result of parameters going off to -inf in the unconstrained space. It’s probably related to the divergences and stuff but hard to say anything specific.

Are the parameters of the individual models:

```stan
parameters {
  real<lower=0,upper=1> zinf;     // Zero inflation probability
  real<lower=0,upper=1> theta;    // Mix weight for 2x Neg. Binom. mixture
  positive_ordered[2] phi;        // Size (phi) parameter for Neg. Binom.
  positive_ordered[2] mu;         // Mean (mu) parameter for Neg. Binom.
}
```


Asking because then I’ll go back and think about how the pooling is working.

Yes. For any given cluster, those are the parameters. For example, in one cluster the plot below shows actual raw count data (grey histogram) and the fitted mixture distribution (red).

The mixture distribution is the weighted sum of two component neg. binomials, shown here (weighted):

Not sure it’s relevant, but…

BEGIN SIDEBAR
Generatively, within one cluster, the data are generated by N subjects (i_1, i_2, ..., i_N), each exposed to some number of stimuli S_i, where 0 \leq S_i \leq T. The probability that a subject responds to any single stimulus is a fixed p, the same for all subjects. So the true distribution of response counts is a mixture of poisson(\lambda_i), where each \lambda_i = p S_i.

The distribution ALWAYS has the general shape you see above, with one mode at 0 and another near \bar{S_i}. And because a mixture of poissons may be represented as a negative binomial, it seemed easier to fit that way. I broke it into a mixture of two neg. binom.’s because no single NB distribution could capture both modes. The zero-inflation was added because the zero mode always fitted low: subjects with S_i = 0 were not captured by the other two mixture components.

Across the clusters, p and T are constant, but N and the S_i are different.
END SIDEBAR
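The NB representation mentioned in the sidebar is exact when the per-subject rates \lambda_i are gamma-distributed: a gamma mixture of Poissons is a negative binomial with n = shape and p = rate / (rate + 1). A numerical check of that identity (Python/scipy sketch; the shape/rate values are made up and deliberately small so the quadrature is easy):

```python
import numpy as np
from scipy.stats import gamma, nbinom, poisson
from scipy.integrate import quad

shape, rate = 5.0, 1.0  # made-up gamma parameters for the lambda_i

def poisson_gamma_pmf(y):
    # integrate Poisson(y | lam) against Gamma(lam | shape, rate)
    f = lambda lam: poisson.pmf(y, lam) * gamma.pdf(lam, shape, scale=1 / rate)
    return quad(f, 0, np.inf)[0]

# equivalent negative binomial: n = shape, p = rate / (rate + 1)
y = 3
nb = nbinom.pmf(y, shape, rate / (rate + 1))
print(np.isclose(poisson_gamma_pmf(y), nb))  # True
```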

Again, many thanks for your insight and help.

UPDATE: In the end I gave up on pooling. Even with CmdStanR (which is great, btw), the full multi-level model was just too slow to be practical as a small part of much larger automated analysis.

I was finding that most clusters would fit just fine, so I fit all of these first and used the resulting parameters to generate random draws from each, in proportion to their input data size. I then added this small number of “averaged” data points to the few clusters with too little data to fit properly. The end result produces estimated parameters (with appropriately large standard deviations) for the small clusters, which is all I needed.
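A sketch of that augmentation step, i.e., drawing pseudo-counts from a fitted zero-inflated two-NB mixture (Python/numpy; all parameter names and values here are hypothetical, not from the actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_zinb_mix(n, zinf, theta, mu, phi, rng):
    """Draw n counts from a fitted zero-inflated two-NB mixture.

    Stan's neg_binomial_2(mu, phi) maps to numpy's
    negative_binomial(n=phi, p=phi / (phi + mu)).
    """
    comp = rng.random(n) < theta                  # pick an NB component per draw
    mu_d = np.where(comp, mu[0], mu[1])
    phi_d = np.where(comp, phi[0], phi[1])
    y = rng.negative_binomial(phi_d, phi_d / (phi_d + mu_d))
    y[rng.random(n) < zinf] = 0                   # overwrite with inflated zeros
    return y

# hypothetical fitted parameters from a well-behaved cluster
pseudo = draw_zinb_mix(200, zinf=0.3, theta=0.6, mu=[2.0, 300.0], phi=[0.5, 5.0],
                       rng=rng)
print(pseudo.min() >= 0, pseudo.max() > 100)  # non-negative counts, high mode present
```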

Thanks @bbbales2 for the help.


In the original 8-schools model we only have the means and sds of the original eight school effects, so if you need to do this sort of quick summary-statistic modeling, you can still do it and build a model on the back end. So the prototypical hierarchical model does this :D.

It’s definitely possible to also just have a model that Stan takes forever on cuz lots of data or whatnot – but the folk theorem is just a useful thing to keep in mind (The Folk Theorem, revisited « Statistical Modeling, Causal Inference, and Social Science).