A lot of data - slow simulation

I am trying to fit a probit model.
I took the code from the manual and changed it slightly to allow for more explanatory variables (3):

data {
    int<lower=1> N;
    int<lower=1> K;
    matrix[N, K] X;
    int<lower=0,upper=1> y[N];
}
parameters {
    real alpha;
    vector[K] beta;
}
model {
    // beta ~ normal(0, 5);
    for (n in 1:N)
        y[n] ~ bernoulli(Phi(alpha + X[n]*beta));
}

The problem: I have ~600000 data-points. The simulation is very very slow.

What can I do here?

So N is over half a million. How large is K?
How committed are you to probit? If logit link is an acceptable alternative there’s bernoulli_logit_glm distribution.

K=3
That seems to be a bit faster, but still very slow.

Maybe multithreading could help. Here’s a tutorial.

1 Like

Have you tried vectorising your sampling statement?

Simply: y ~bernoulli_logit_glm(alpha + beta * x)

You shouldn’t have to loop, I think.

1 Like

With that big N and small K, you don’t need MCMC. You can use optimizing and draws from the approximate posterior. Here’s an example adapted from https://avehtari.github.io/RAOS-Examples/BigData/bigdata.html and using github version of rstanarm which provides also Pareto-k and
n_eff diagnostics:

SEED <- 1655
set.seed(SEED)
n <- 6e5
k <- 3
x <- rnorm(n)
xn <- matrix(rnorm(n*(k-1)),nrow=n)
a <- 2
b <- 3
sigma <- 1
y <- (a + b*x + sigma*rnorm(n))>0+0
fake <- data.frame(x, xn, y)

fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                algorithm='optimizing', init=0)

In my laptop this takes 22s.

> summary(fit3)

Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      y ~ .
 algorithm:    optimizing
 priors:       see help('prior_summary')
 observations: 600000
 predictors:   4

Estimates:
              Median   MAD_SD   10%   50%   90%
(Intercept) 3.6      0.0      3.6   3.6   3.6  
x           5.4      0.0      5.4   5.4   5.4  
X1          0.0      0.0      0.0   0.0   0.0  
X2          0.0      0.0      0.0   0.0   0.0  

Monte Carlo diagnostics
            mcse khat n_eff
(Intercept) 0.0  0.3  800  
x           0.0  0.3  809  
X1          0.0  0.3  777  
X2          0.0  0.3  821  

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and khat is the Pareto k diagnostic for importance sampling.
 (perfomance is usually good when khat < 0.7).
4 Likes

The first thing I’d check is to make sure that the problem really is with the size of the data and not something else. See what happens if you fit your model to the following:

  • fake data generated using known values for your parameters alpha and beta
  • a small random subset of your data points, about, say, 6000 points in size

If you still have problems, that may be a sign that other things are wrong.

Yes, tried that.
Still, it seems that 600000 is too much.

Thank you, @avehtari!
Are you aware whether there is a Python version of rstanarm.

It appears that I’m using too much data. Even Bayesian linear regression with that much data takes too long.
Just tried to replicate an example executed with MLE:

import statsmodels.api as sm

X = sm.add_constant(X)     # add a constant

probit_model = sm.Probit(y, X)
result = probit_model.fit()

print(result.summary2())

That takes ~2 seconds.

Bayesian linear regression using the code I posted above, plus additional option importance_resampling=FALSE, takes 2s in my laptop takes. importance_resampling which enables the diagnostics and importance sampling correction takes another 20s. Without the diagnostic you can get in 2s 1000 draws from the approximate normal posterior which in this case is very close to the true posterior, it’s just that the diagnostic takes some additional time.

PyStan has optimizing, but doesn’t yet have computation of Hessian and importance-resampling.

You can call R packages from Python.

1 Like

@avehtari the execution is indeed much faster.

Can the same procedure of using

optimizing and draws from the approximate posterior

be used for Ordinal Probit Regression?

I’ve tried the stan_polr() function suggested in Estimating Ordinal Regression Models with rstanarm, but that is slow with a lot of data too.

Also, is this: https://arxiv.org/abs/1507.02646 a good reference for this topic?

I guess it might work given appropriate parameterization and enough data in each group, but I haven’t tested with any data.

Optimizing has not been enabled for this. It would need some testing whether it would work reasonably with current priors etc.

See also @betanalpha’s ordinal regression case study

1 Like