So N is over half a million. How large is K?
How committed are you to probit? If a logit link is an acceptable alternative, there's the bernoulli_logit_glm distribution.
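If you're calling Stan through rstanarm rather than writing the model by hand, the link is set through the family argument, so switching between probit and logit is a one-argument change. A minimal sketch; mydata, y, and x are placeholder names, not taken from your post:

library(rstanarm)
# probit link (the current model)
fit_probit <- stan_glm(y ~ x, data = mydata, family = binomial(link = "probit"))
# logit link alternative
fit_logit  <- stan_glm(y ~ x, data = mydata, family = binomial(link = "logit"))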
With that big N and small K, you don't need MCMC. You can use optimizing and draws from the approximate posterior. Here's an example adapted from Regression and Other Stories: Scalability, using the GitHub version of rstanarm, which also provides Pareto-k and n_eff diagnostics:
library(rstanarm)

SEED <- 1655
set.seed(SEED)
n <- 6e5                                 # number of observations
k <- 3                                   # number of predictors
x <- rnorm(n)                            # predictor with a true effect
xn <- matrix(rnorm(n*(k-1)), nrow=n)     # noise predictors
a <- 2
b <- 3
sigma <- 1
y <- ((a + b*x + sigma*rnorm(n)) > 0) + 0   # binary outcome coded as 0/1
fake <- data.frame(x, xn, y)
fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                 algorithm='optimizing', init=0)
On my laptop this takes 22 s.
> summary(fit3)

Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      y ~ .
 algorithm:    optimizing
 priors:       see help('prior_summary')
 observations: 600000
 predictors:   4

Estimates:
              Median MAD_SD 10%  50%  90%
(Intercept)   3.6    0.0    3.6  3.6  3.6
x             5.4    0.0    5.4  5.4  5.4
X1            0.0    0.0    0.0  0.0  0.0
X2            0.0    0.0    0.0  0.0  0.0

Monte Carlo diagnostics
              mcse khat n_eff
(Intercept)   0.0  0.3   800
x             0.0  0.3   809
X1            0.0  0.3   777
X2            0.0  0.3   821
For each parameter, mcse is the Monte Carlo standard error, n_eff is a crude measure of effective sample size, and khat is the Pareto k diagnostic for importance sampling (performance is usually good when khat < 0.7).
The first thing I’d check is to make sure that the problem really is with the size of the data and not something else. See what happens if you fit your model to the following:
- fake data generated using known values for your parameters alpha and beta
- a small random subset of your data points, say around 6,000 points (see the sketch after this post)
If you still have problems, that may be a sign that other things are wrong.
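A minimal sketch of the subset check, reusing stan_glm from the example above; mydata and the y ~ . formula are placeholders for your own data and model:

# fit the same model to a 6000-row random subset of the full data
idx <- sample(nrow(mydata), 6000)
fit_sub <- stan_glm(y ~ ., data = mydata[idx, ],
                    family = binomial(link = "probit"))
summary(fit_sub)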
It appears that I’m using too much data. Even Bayesian linear regression with that much data takes too long.
I just tried to replicate the example fit with MLE using statsmodels:

import statsmodels.api as sm

# X is the predictor matrix and y the binary outcome, defined earlier
X = sm.add_constant(X)            # add an intercept column
probit_model = sm.Probit(y, X)    # probit regression
result = probit_model.fit()       # maximum likelihood fit
print(result.summary2())
The Bayesian regression using the code I posted above, plus the additional option importance_resampling=FALSE, takes 2 s on my laptop. importance_resampling=TRUE, which enables the diagnostics and the importance-sampling correction, takes another 20 s. So without the diagnostics you can get 1000 draws from the approximate normal posterior in 2 s, and in this case that approximation is very close to the true posterior; it's just that the diagnostics take some additional time.
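Concretely, that is the same stan_glm call as above with one extra argument; the as.matrix line is my assumption about the usual stanreg draw extractor:

fit3 <- stan_glm(y ~ ., data=fake, family=binomial(),
                 algorithm='optimizing', init=0,
                 importance_resampling=FALSE)   # skip the diagnostics, ~2 s
draws <- as.matrix(fit3)   # assumed to return the 1000 approximate-posterior draws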
PyStan has optimizing, but it doesn't yet have the Hessian computation or the importance resampling.