So N is over half a million. How large is K?
How committed are you to probit? If a logit link is an acceptable alternative, there's the bernoulli_logit_glm distribution.
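If you're calling Stan through rstanarm rather than writing the model by hand, the link is set through the family argument, so switching between probit and logit is a one-argument change. A minimal sketch; mydata, y, and x are placeholder names, not taken from your post:

library(rstanarm)
# probit link (the current model)
fit_probit <- stan_glm(y ~ x, data = mydata, family = binomial(link = "probit"))
# logit link alternative
fit_logit  <- stan_glm(y ~ x, data = mydata, family = binomial(link = "logit"))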
With that big N and small K, you don't need MCMC. You can use optimizing and draws from the approximate posterior. Here's an example adapted from Regression and Other Stories: Scalability, using the GitHub version of rstanarm, which also provides Pareto-k and n_eff diagnostics:
library(rstanarm)

SEED <- 1655
set.seed(SEED)
n <- 6e5                                 # number of observations
k <- 3                                   # number of predictors
x <- rnorm(n)                            # predictor with a true effect
xn <- matrix(rnorm(n*(k-1)), nrow=n)     # noise predictors
a <- 2
b <- 3
sigma <- 1
y <- ((a + b*x + sigma*rnorm(n)) > 0) + 0   # binary outcome coded as 0/1
fake <- data.frame(x, xn, y)
fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                 algorithm='optimizing', init=0)
On my laptop this takes 22 s.
> summary(fit3)

Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      y ~ .
 algorithm:    optimizing
 priors:       see help('prior_summary')
 observations: 600000
 predictors:   4

Estimates:
              Median MAD_SD 10%  50%  90%
(Intercept)   3.6    0.0    3.6  3.6  3.6
x             5.4    0.0    5.4  5.4  5.4
X1            0.0    0.0    0.0  0.0  0.0
X2            0.0    0.0    0.0  0.0  0.0

Monte Carlo diagnostics
              mcse khat n_eff
(Intercept)   0.0  0.3   800
x             0.0  0.3   809
X1            0.0  0.3   777
X2            0.0  0.3   821
For each parameter, mcse is the Monte Carlo standard error, n_eff is a crude measure of effective sample size, and khat is the Pareto k diagnostic for importance sampling (performance is usually good when khat < 0.7).
The first thing I’d check is to make sure that the problem really is with the size of the data and not something else. See what happens if you fit your model to the following:
- fake data generated using known values for your parameters alpha and beta
- a small random subset of your data points, say around 6,000 points (see the sketch after this post)
If you still have problems, that may be a sign that other things are wrong.
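A minimal sketch of the subset check, reusing stan_glm from the example above; mydata and the y ~ . formula are placeholders for your own data and model:

# fit the same model to a 6000-row random subset of the full data
idx <- sample(nrow(mydata), 6000)
fit_sub <- stan_glm(y ~ ., data = mydata[idx, ],
                    family = binomial(link = "probit"))
summary(fit_sub)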
It appears that I’m using too much data. Even Bayesian linear regression with that much data takes too long.
I just tried to replicate the example fit with MLE using statsmodels:

import statsmodels.api as sm

# X is the predictor matrix and y the binary outcome, defined earlier
X = sm.add_constant(X)            # add an intercept column
probit_model = sm.Probit(y, X)    # probit regression
result = probit_model.fit()       # maximum likelihood fit
print(result.summary2())
The Bayesian regression using the code I posted above, plus the additional option importance_resampling=FALSE, takes 2 s on my laptop. importance_resampling=TRUE, which enables the diagnostics and the importance-sampling correction, takes another 20 s. So without the diagnostics you can get 1000 draws from the approximate normal posterior in 2 s, and in this case that approximation is very close to the true posterior; it's just that the diagnostics take some additional time.
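Concretely, that is the same stan_glm call as above with one extra argument; the as.matrix line is my assumption about the usual stanreg draw extractor:

fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                 algorithm='optimizing', init=0,
                 importance_resampling=FALSE)   # skip the diagnostics, ~2 s
draws <- as.matrix(fit3)   # assumed to return the 1000 approximate-posterior draws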
PyStan has optimizing, but it doesn't yet have the Hessian computation or the importance resampling.