Are there any good tricks for scaling hierarchical models in PyStan besides using a CPU cluster? GPU support is not available yet, so Google ML Engine is not an option. Sample sizes start at over 50K users and range into the millions.

Here is the model:

data {
  int<lower=0> N;                   // number of cases
  int<lower=0> J;                   // number of groups, e.g. users
  int<lower=0, upper=1> iszero[N];  // binary outcome for each case
  int<lower=1, upper=J> id[N];      // group number for each case
}
parameters {
  vector<lower=0, upper=1>[J] theta;  // chance of success per group
  real<lower=0, upper=1> phi;         // population chance of success
  real<lower=1> kappa;                // population concentration
}
model {
  // Hyperpriors:
  phi ~ beta(2, 2);
  kappa ~ pareto(1, 1.5);  // requires kappa > 1, matching the <lower=1> constraint
  // Hierarchical prior on the per-group probabilities:
  theta ~ beta(kappa * phi, kappa * (1 - phi));
  // Likelihood, vectorized instead of looping over 1:N:
  iszero ~ bernoulli(theta[id]);
}
generated quantities {
  real post_phi_global;
  real post_kappa_global;
  post_phi_global = phi;
  post_kappa_global = kappa;
}

One more thing to consider: with such a large sample size for a relatively simple model, it is quite possible that you are deep in asymptopia, where the posterior is close to normal and centered on the MAP estimate (this will likely depend strongly on how large J is relative to N). If, and that's a big if, you are in that regime, Stan's optimizing method could give you very good results at a fraction of the cost of sampling. One way to check: run the model on, say, 20K rows of the dataset with both sampling and optimizing; if the results match, you can probably use optimizing safely on the full dataset.
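To see why the MAP becomes trustworthy at scale, consider the simplest piece of this model: a single Bernoulli probability with a Beta(2, 2) prior. The posterior is Beta(2 + successes, 2 + failures), and the gap between its mode (the MAP) and its mean shrinks as the sample size grows. A toy illustration in plain Python (the numbers are made up, not from the post):

```python
def beta_posterior_summary(successes, trials, a=2.0, b=2.0):
    """Posterior for a Bernoulli probability under a Beta(a, b) prior
    is Beta(a + successes, b + failures). Returns (MAP, posterior mean)."""
    a_post = a + successes
    b_post = b + (trials - successes)
    mode = (a_post - 1) / (a_post + b_post - 2)  # MAP estimate
    mean = a_post / (a_post + b_post)
    return mode, mean

# Hold the success rate at 30% and grow the sample size:
for n in (20, 2000, 200000):
    s = int(0.3 * n)
    mode, mean = beta_posterior_summary(s, n)
    print(n, abs(mode - mean))  # the MAP-vs-mean gap shrinks roughly like 1/n
```

The same logic is what makes the "compare sampling and optimizing on a 20K subset" check above informative: if the two already agree at 20K rows, the agreement only improves at millions.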