Obvious way to speed up a huge logistic regression?

The following (incredibly simple) logistic regression model is something I’d like to expand into a more complex errors-in-variables model with a misclassified outcome. However, this is only worth doing if I can get the simpler model running fast enough (I’m working with 150k-500k observations). Is there any obvious way to speed up the QR-parameterized logistic model below?

data {
    int<lower=0> N;                     // number of observations
    int<lower=0> P;                     // number of columns of design matrix
    int<lower=0,upper=1> Y[N];          // binary outcome
    matrix[N,P] Q;                      // design matrix Q factor
    matrix[P,P] Rinv;                   // design matrix R factor inverse
}

parameters {
    vector[P] beta_twiddle;             // coefficients alt parameterization
}

transformed parameters {
    vector[P] beta;                     // coefficients standard parameterization
    vector[N] mu;                       // linear predictor
    beta = Rinv * beta_twiddle;
    mu = Q * beta_twiddle;
}

model {
    beta ~ normal(0,10);                // prior on the original-scale coefficients

    Y ~ bernoulli_logit(mu);            // likelihood on the logit scale
}

One immediate thought is to calculate the transformed parameters after the fact (i.e., in generated quantities), which would require putting the prior on beta_twiddle rather than on beta.
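A minimal sketch of that version, assuming a normal(0, 10) prior placed directly on beta_twiddle is acceptable (note this is not equivalent to the same prior on beta):

data {
    int<lower=0> N;                     // number of observations
    int<lower=0> P;                     // number of columns of design matrix
    int<lower=0,upper=1> Y[N];          // binary outcome
    matrix[N,P] Q;                      // design matrix Q factor
    matrix[P,P] Rinv;                   // design matrix R factor inverse
}

parameters {
    vector[P] beta_twiddle;             // coefficients alt parameterization
}

model {
    beta_twiddle ~ normal(0,10);        // prior now on the rotated coefficients
    Y ~ bernoulli_logit(Q * beta_twiddle);
}

generated quantities {
    vector[P] beta;                     // coefficients standard parameterization
    beta = Rinv * beta_twiddle;         // computed once per draw, outside the gradient evaluation
}

This removes the Rinv multiply from the gradient evaluation and avoids saving the N-vector mu with every draw.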

Would the trick in section 28.8 (“Exploiting Sufficient Statistics”) of the manual help here? That is, instead of

Y ~ bernoulli_logit(mu);

you would do

sum(Y) ~ binomial(N, inv_logit(mu));

?

Interesting suggestion, I will have to think about this! The first wrinkle is that for the misclassification model I’ll instead have to consider a distribution over the sufficient statistic, which would slow things down a bit, but it could still be much faster.

Oh, no, my suggestion won’t work; that trick applies when there is a single probability to express as the parameter of the binomial. Is there maybe a smaller set of unique values of mu that you could loop over using this trick?
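To make the idea concrete, a hedged sketch of the grouped version, assuming the N rows collapse into G unique covariate patterns and that the collapsed matrix and counts (Q_unique, n, s; all hypothetical names) are computed outside Stan:

data {
    int<lower=0> P;                     // number of columns of design matrix
    int<lower=1> G;                     // number of unique covariate patterns
    matrix[G,P] Q_unique;               // one row of Q per unique pattern
    int<lower=0> n[G];                  // trials per pattern
    int<lower=0> s[G];                  // successes per pattern
}

parameters {
    vector[P] beta_twiddle;             // coefficients alt parameterization
}

model {
    beta_twiddle ~ normal(0,10);
    s ~ binomial_logit(n, Q_unique * beta_twiddle);   // G binomial terms instead of N Bernoulli terms
}

The savings only materialize when G is much smaller than N, e.g. when all the covariates are discrete.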

Unfortunately the design matrix is quite dense, so I can’t see how to consider only a finite number of values for mu…

In the short term, I’d appreciate a suggestion for an appropriate prior to put on beta_twiddle so I can save the multiplication. Or perhaps there’s a matvec specialized for triangular matrices?

You should try to exploit sparsity of your design matrix if possible.

There is a multiply_lower_tri function in stan-math… so for now you could only use it via C++, and if I recall correctly it is going to run efficiently with the upcoming 2.18 release (maybe I am wrong and it already runs OK in 2.17, but some of these functions were improved).

If Rinv or Q is sparse, you could use the CSR multiply function (csr_matrix_times_vector).
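If Q (or whatever matrix multiplies the parameter vector) really were sparse, a sketch of that route, with the CSR triple (values w, column indices v, row pointers u; hypothetical names) extracted outside Stan or via the csr_extract_* functions:

data {
    int<lower=0> N;                     // number of observations
    int<lower=0> P;                     // number of columns of design matrix
    int<lower=0,upper=1> Y[N];          // binary outcome
    int<lower=0> K;                     // number of nonzeros
    vector[K] w;                        // nonzero values (CSR)
    int<lower=1> v[K];                  // column indices of the nonzeros (CSR)
    int<lower=1> u[N+1];                // row start pointers (CSR)
}

parameters {
    vector[P] beta_twiddle;             // coefficients alt parameterization
}

model {
    beta_twiddle ~ normal(0,10);
    Y ~ bernoulli_logit(csr_matrix_times_vector(N, P, w, v, u, beta_twiddle));
}

Note that the Q factor of a QR decomposition is generally dense even when the original design matrix is sparse, so this helps most if you multiply the sparse design matrix directly.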

You might also want to look into how @bgoodri coded the priors for these kinds of models in rstanarm, which does the QR decomposition behind the scenes.

Good suggestion, I will look into this.

Will definitely look into this, cheers.