Error-in-variables regression with large datasets (500k observations, 200ish covariates)

rborder · April 1, 2018, 10:39pm

I’d like to be able to run the below error-in-variables regression program using data with a 500k x 200 ish design matrix (genetic association studies require humongous samples because individual polymorphisms have tiny, tiny effects; also there are lots of relevant confounds to control for).

Running the model on a subset of 70k observations takes just under a week using 24 cores (Xeon E7-4830 v3 @ 2.10GHz) and a terabyte of RAM, with 24 chains and 2000 iterations per chain. I could quadruple these numbers (hardware-wise) if necessary, but I have a few questions before I get too invested!

1. Is it a terrible idea to try to use Stan at this scale? Is anyone else successfully running models of this size?

2. If it’s not a terrible idea, are their any suggestions for improving performance?

the first thing that comes to mind is using a QR parameterization rather than the below perhaps?
also curious about optimal # iterations : #chains

Thanks.

Stan program

data {
    int<lower=1> J;
    int<lower=0> P; // number of covariates
    vector[J] y;
    int<lower=1,upper=3>  w[J];
    simplex[3] pVec;
    simplex[3] piMat[3];
    row_vector[P] z[J]; // covariates
}  

parameters {
    real<lower=0> sigma_y; // variance of Y
    real alpha; // intercept
    real beta; // slope
    vector[P] bCov;
}

model {
    alpha ~ normal(0,10);
    beta ~ normal(0,10);
    sigma_y ~ normal(0,10);   // truncated to [0,infty)
    for (p in 1:P) {
        bCov[p] ~ normal(0,10);
    }
    for (j in 1:J) {       // for each observation
        real lp[3];        // will contain marginal probabilities
        for (k in 1:3) {  // support of x
            lp[k] = categorical_lpmf(k | pVec)
                + categorical_lpmf(w[j] | piMat[k])
                + normal_lpdf(y[j] | alpha + beta * k + z[j]*bCov, sigma_y);
        }
        target += log_sum_exp(lp);
    }
}

bgoodri · April 2, 2018, 12:53am

200 parameters is no big deal for Stan, but for 500k observations you would want to look into using the sufficient statistic version of the normal likelihood, which is talked about in Lancaster’s Introduction to Modern Bayesian Econometrics and implemented in rstanarm:

github.com

stan-dev/rstanarm/blob/master/src/stan_files/lm.stan#L18


 * likelihood with a scalar standard deviation for all errors
 * Equivalent to normal_lpdf(y | intercept + Q * R * beta, sigma) but faster
 * @param theta vector of coefficients (excluding intercept), equal to R * beta
 * @param b precomputed vector of OLS coefficients (excluding intercept) in Q-space
 * @param intercept scalar (assuming columns of Q have mean zero)
 * @param ybar precomputed sample mean of the outcome
 * @param SSR positive precomputed value of the sum of squared OLS residuals
 * @param sigma positive scalar for the standard deviation of the errors
 * @param N integer equal to the number of observations
 */
real ll_mvn_ols_qr_lp(vector theta, vector b,
                      real intercept, real ybar,
                      real SSR, real sigma, int N) {
  target += -0.5 * (dot_self(theta - b) + 
    N * square(intercept - ybar) + SSR) / 
    square(sigma) -// 0.91... is log(sqrt(2 * pi()))
    N * (log(sigma) + 0.91893853320467267);
  return target();
}
}
data {

Since you are mixing, you will actually need three design matrices, one for each of the three values of k that go in the first column. To do the QR-reparameterization on a matrix that big, you should definitely not use the implementation in Stan Math. Presuming you have a good bit of sparsity, you should use one of the sparse QR algorithms that implements a thin option.

rborder · April 2, 2018, 2:02am

Awesome, thanks for the tip, will check out this alternative likelihood. Do you know if a general approach based on sufficient statistics for GLMs has been developed?

avehtari · April 2, 2018, 9:23am

rborder:

for (k in 1:3) {  // support of x
            lp[k] = categorical_lpmf(k | pVec)
                + categorical_lpmf(w[j] | piMat[k])
                + normal_lpdf(y[j] | alpha + beta * k + z[j]*bCov, sigma_y);
        }

Here k is a loop index, and pVec, w, and piMat are data and thus categorical_lpmf’s are constant, and these could be dropped, unless you tried to do something else, e.g. mixture model which would require log_sum_exp.

funtusov · April 10, 2018, 5:13pm

Trying to gain a better understanding of the mechanic of this, could you please point to the chapter/section that discusses this in the book?

More specifically, I’m trying to understand if something like this is viable for a hierarchical model with the scale on the order of the original poster’s model.

bgoodri · April 10, 2018, 6:36pm

Chapter 3, mostly in the appendix.

Bob_Carpenter · April 14, 2018, 1:23am

That was queued up at some point to go into Stan—don’t know what happened to it in the math lib.

I still haven’t properly understood the last Lancaster trick for multinomials.

Topic		Replies	Views
Optimizing performance of mixed effect model on large data set Modeling performance	0	412	June 24, 2019
Additive random effect models with large data sets rstanarm	25	2925	June 15, 2019
Speeding up multivariate normal model Modeling techniques	7	2970	September 9, 2017
Stan_glmer speed for large sample sizes rstanarm	4	1666	April 19, 2021
Improving Performance on Logistic Regression with Informative Priors Modeling performance , rstanarm	4	1316	May 1, 2020

Error-in-variables regression with large datasets (500k observations, 200ish covariates)

1. Is it a terrible idea to try to use Stan at this scale? Is anyone else successfully running models of this size?

2. If it’s not a terrible idea, are their any suggestions for improving performance?

Stan program

Related Topics