I am trying to fit a probit model.
I took the code from the manual and changed it slightly to allow for more explanatory variables (3):
matrix[N, K] X;

// beta ~ normal(0, 5);
for (n in 1:N)
  y[n] ~ bernoulli(Phi(alpha + X[n] * beta));
The problem: I have ~600000 data-points. The simulation is very very slow.
What can I do here?
So N is over half a million. How large is K?
How committed are you to probit? If a logit link is an acceptable alternative, there's the bernoulli_logit_glm distribution.
That seems to be a bit faster, but still very slow.
Maybe multithreading could help. Here’s a tutorial.
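To make that concrete: a minimal sketch of within-chain parallelisation with reduce_sum (Stan 2.23+), assuming the logit link; the partial-sum function and the grainsize (declared in data) are illustrative:

```stan
functions {
  real partial_sum(array[] int y_slice, int start, int end,
                   matrix X, real alpha, vector beta) {
    // log-likelihood contribution of one slice of the data
    return bernoulli_logit_lpmf(y_slice | alpha + X[start:end] * beta);
  }
}
...
model {
  // grainsize controls how the N terms are chunked across threads
  target += reduce_sum(partial_sum, y, grainsize, X, alpha, beta);
}
```

The model has to be compiled with STAN_THREADS and the number of threads per chain set at run time.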
Have you tried vectorising your sampling statement?
Simply: y ~ bernoulli_logit_glm(X, alpha, beta);
You shouldn’t have to loop, I think.
With that big N and small K, you don't need MCMC. You can use optimizing and draws from the approximate posterior. Here's an example adapted from Regression and Other Stories: Scalability, using the GitHub version of rstanarm, which also provides the Pareto-k diagnostic and importance resampling:
SEED <- 1655
n <- 6e5
k <- 3
x <- rnorm(n)
xn <- matrix(rnorm(n*(k-1)),nrow=n)
a <- 2
b <- 3
sigma <- 1
y <- ((a + b*x + sigma*rnorm(n)) > 0) + 0  # note the parentheses: `>` binds looser than `+`
fake <- data.frame(x, xn, y)
fit3 <- stan_glm(y ~ ., data=fake, family=binomial(), algorithm="optimizing")
On my laptop this takes 22 s.
family: binomial [logit]
formula: y ~ .
priors: see help('prior_summary')
            Median MAD_SD 10% 50% 90%
(Intercept) 3.6    0.0    3.6 3.6 3.6
x           5.4    0.0    5.4 5.4 5.4
X1          0.0    0.0    0.0 0.0 0.0
X2          0.0    0.0    0.0 0.0 0.0

Monte Carlo diagnostics
            mcse khat n_eff
(Intercept) 0.0  0.3  800
x           0.0  0.3  809
X1          0.0  0.3  777
X2          0.0  0.3  821
For each parameter, mcse is the Monte Carlo standard error, n_eff is a crude measure of effective sample size, and khat is the Pareto k diagnostic for importance sampling (performance is usually good when khat < 0.7).
The first thing I’d check is to make sure that the problem really is with the size of the data and not something else. See what happens if you fit your model to the following:
- fake data generated using known values for your parameters
- a small random subset of your data points, say ~6,000 in size
If you still have problems, that may be a sign that other things are wrong.
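The subset check can be sketched in Python (NumPy stand-ins for the real X and y; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1655)
X = rng.standard_normal((600_000, 3))    # stand-in for the real design matrix
y = rng.integers(0, 2, size=600_000)     # stand-in for the real outcomes

# Draw ~6000 rows at random, without replacement, and fit the model to these
idx = rng.choice(len(y), size=6_000, replace=False)
X_sub, y_sub = X[idx], y[idx]
print(X_sub.shape, y_sub.shape)  # (6000, 3) (6000,)
```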
Yes, tried that.
Still, it seems that 600000 is too much.
Thank you, @avehtari!
Are you aware whether there is a Python version of this?
It appears that I’m using too much data. Even Bayesian linear regression with that much data takes too long.
Just tried to replicate an example executed with MLE:
import statsmodels.api as sm
X = sm.add_constant(X) # add a constant
probit_model = sm.Probit(y, X)
result = probit_model.fit()
That takes ~2 seconds.
Bayesian regression using the code I posted above, with the additional option importance_resampling=FALSE, takes 2 s on my laptop. With importance_resampling=TRUE, which enables the diagnostics and the importance-sampling correction, it takes another 20 s. So without the diagnostic you can get 1000 draws from the approximate normal posterior in 2 s; in this case that approximation is very close to the true posterior, it's just that the diagnostic takes additional time.
PyStan has optimizing, but doesn’t yet have computation of Hessian and importance-resampling.
You can call R packages from Python.
@avehtari the execution is indeed much faster.
Can the same procedure of using
optimizing and draws from the approximate posterior
be used for Ordinal Probit Regression?
I’ve tried the stan_polr() function suggested in Estimating Ordinal Regression Models with rstanarm, but that is slow with a lot of data too.
Also, is this: https://arxiv.org/abs/1507.02646 a good reference for this topic?
I guess it might work given appropriate parameterization and enough data in each group, but I haven’t tested with any data.
Optimizing has not been enabled for this. It would need some testing whether it would work reasonably with current priors etc.
See also @betanalpha’s ordinal regression case study
The Regression and Other Stories: Scalability link mentioned above is no longer active. Can you refer me to the section/page in the book where this example is, or to a URL for an example?