How to scale hierarchical models when the sample count is in the tens or hundreds of thousands

Are there any good tricks to try for scaling hierarchical models, besides using a CPU cluster with PyStan? GPU support is not available yet, so Google ML Engine is not an option. Sample sizes are over 50K users, ranging up to millions.



data {
 int<lower=0> N; // number of cases
 int<lower=0> J; // number of groups eg.users
 int<lower=0, upper=1> iszero[N]; // binary outcome per case (1 = zero/negative outcome)
 int<lower=1> id[N]; // group number for each case
}
parameters {
 vector<lower=0, upper=1>[J] theta; // per-group probability that iszero = 1
 real<lower=0, upper=1> phi; // population-level probability
 real<lower=1> kappa; // population concentration
}

// Stan model

model {
 // Priors and hyperpriors:
 phi ~ beta(2, 2);
 kappa ~ pareto(1, 1.5); // hyperprior (requires kappa > 1, the Pareto lower bound)
 theta ~ beta(kappa * phi, kappa * (1 - phi)); // prior on the per-group probabilities
 // Likelihood:
 for (n in 1:N) {
   iszero[n] ~ bernoulli(theta[id[n]]);
 }
}

// Output data

generated quantities {
 real post_phi_global;
 real post_kappa_global;

 post_phi_global = phi;
 post_kappa_global = kappa;
}

Might not help, but for this model you can make your likelihood easier to evaluate with sufficient statistics.

For each group j, you have:

\prod_i p(y_i | g_j)

Assuming y_i \in \{0, 1\} and p_j is the probability that y_i = 1 in group j, expand the above to:

\prod_i p_j^{y_i} (1 - p_j)^{1 - y_i}

And that’s the same as:

p_j^{\sum_i y_i} (1 - p_j)^{\sum_i (1 - y_i)}

And you can compute those sums of y for each group on the outside and pass them in as data.
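
To make that concrete, here is a rough sketch of what the reformulation could look like with the PyStan 2 interface. It collapses the N Bernoulli terms into J binomial terms, which gives the same posterior up to a constant. The array names y, group, and J, the sufficient_stats helper, and the binomial rewrite of the model above are all illustrative assumptions, not code from this thread:

import numpy as np
import pystan

# Hypothetical raw data: y is a NumPy array of 0/1 outcomes, group holds 1-based
# group ids (same indexing as id[] in the model above), J is the number of groups.
def sufficient_stats(y, group, J):
    # Per-group trial counts and per-group sums of y: the sufficient statistics.
    trials = np.bincount(group - 1, minlength=J).astype(int)
    successes = np.bincount(group - 1, weights=y, minlength=J).astype(int)
    return trials, successes

# Same hierarchical model, but with a binomial likelihood over J groups
# instead of a Bernoulli likelihood over N individual cases.
binomial_code = """
data {
 int<lower=0> J;                    // number of groups
 int<lower=0> trials[J];            // cases per group
 int<lower=0> successes[J];         // sum of outcomes per group
}
parameters {
 vector<lower=0, upper=1>[J] theta;
 real<lower=0, upper=1> phi;
 real<lower=1> kappa;
}
model {
 phi ~ beta(2, 2);
 kappa ~ pareto(1, 1.5);
 theta ~ beta(kappa * phi, kappa * (1 - phi));
 successes ~ binomial(trials, theta); // one vectorized term per group
}
"""

trials, successes = sufficient_stats(y, group, J)
model = pystan.StanModel(model_code=binomial_code)
fit = model.sampling(data={"J": J, "trials": trials, "successes": successes},
                     iter=1000, chains=4)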


Thanks for the tip, I used this as a basis.

One more thing to consider: with such a large sample size for a relatively simple model, it is quite possible that you are deep in asymptopia, where the posterior is close to normal and centered on the MAP estimate (this will likely depend strongly on how big J is compared to N). If, and that's a big IF, you are there, using the optimizing method in Stan could give you very good results. Maybe try running the model on, say, 20K rows of the dataset with both sampling and optimizing; if the results match, you can probably use optimizing safely for your large datasets.
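
For illustration, here is a rough sketch of that check with the PyStan 2 interface, reusing the compiled binomial model and the sufficient_stats helper sketched earlier in the thread; the subset size, seed, and re-indexing scheme are assumptions of mine, not something from the original post:

import numpy as np

# Rough check on a ~20K-row subset: if posterior means from sampling and the point
# estimate from optimizing roughly agree, optimizing is probably safe at full scale.
rng = np.random.default_rng(1234)
idx = rng.choice(len(y), size=20_000, replace=False)
y_sub, group_sub = y[idx], group[idx]

# Re-index the surviving groups so ids are contiguous 1..J_sub.
uniq, group_sub = np.unique(group_sub, return_inverse=True)
group_sub = group_sub + 1
J_sub = len(uniq)

trials, successes = sufficient_stats(y_sub, group_sub, J_sub)
data_sub = {"J": J_sub, "trials": trials, "successes": successes}

fit = model.sampling(data=data_sub, iter=1000, chains=4)  # full MCMC
opt = model.optimizing(data=data_sub)                     # point estimate (L-BFGS)

post = fit.extract()
print("phi:   sampling mean %.4f vs optimizing %.4f" % (post["phi"].mean(), opt["phi"]))
print("kappa: sampling mean %.4f vs optimizing %.4f" % (post["kappa"].mean(), opt["kappa"]))

It is probably worth eyeballing a few theta[j] as well, since the group-level parameters are where a point estimate is most likely to drift away from the full posterior.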
