Are there any good tricks for scaling hierarchical models in PyStan, besides running chains on a CPU cluster?

GPU support is not available yet, so Google ML Engine is not an option.

Sample sizes start at over 50K users and range up to millions.

```
data {
  int<lower=0> N;                      // number of cases
  int<lower=0> J;                      // number of groups, e.g. users
  int<lower=0, upper=1> iszero[N];     // indicates negative outcomes
  int<lower=1, upper=J> id[N];         // group number for each case
}
parameters {
  vector<lower=0, upper=1>[J] theta;   // chance of success per test group
  real<lower=0, upper=1> phi;          // population chance of success
  real<lower=1> kappa;                 // population concentration
}
model {
  // Priors:
  phi ~ beta(2, 2);
  kappa ~ pareto(1, 1.5);              // hyperprior (support requires kappa > 1, the Pareto location)
  theta ~ beta(kappa * phi, kappa * (1 - phi));
  // Likelihood, vectorized (equivalent to looping n in 1:N but faster):
  iszero ~ bernoulli(theta[id]);
}
// Output data
generated quantities {
  real post_phi_global;
  real post_kappa_global;
  post_phi_global = phi;
  post_kappa_global = kappa;
}
```
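One trick that helps a lot at this scale: because the likelihood is i.i.d. Bernoulli within each group, the per-group count of successes is a sufficient statistic. If you aggregate the N rows down to J rows of (trials, successes) on the Python side before calling Stan, and swap the Bernoulli likelihood for a binomial, you get the identical posterior while the likelihood cost scales with the number of groups J instead of the number of cases N. A sketch under that assumption (`trials` and `successes` are hypothetical names for the pre-aggregated counts; everything else matches the model above):

```
data {
  int<lower=0> J;                      // number of groups
  int<lower=0> trials[J];              // cases per group
  int<lower=0> successes[J];           // positive outcomes per group
}
parameters {
  vector<lower=0, upper=1>[J] theta;   // chance of success per test group
  real<lower=0, upper=1> phi;          // population chance of success
  real<lower=1> kappa;                 // population concentration
}
model {
  phi ~ beta(2, 2);
  kappa ~ pareto(1, 1.5);
  theta ~ beta(kappa * phi, kappa * (1 - phi));
  // Sufficient-statistics likelihood: J terms instead of N
  successes ~ binomial(trials, theta);
}
```

With millions of rows collapsing into ~50K group counts, the gradient evaluation drops by the same factor, which usually matters far more than any parallelization of the original per-row model.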