Error-in-variables regression with large datasets (500k observations, 200ish covariates)

I’d like to be able to run the error-in-variables regression program below using data with a roughly 500k x 200 design matrix (genetic association studies require humongous samples because individual polymorphisms have tiny, tiny effects; there are also lots of relevant confounds to control for).

Running the model on a subset of 70k observations takes just under a week using 24 cores (Xeon E7-4830 v3 @ 2.10GHz) and a terabyte of RAM, with 24 chains and 2000 iterations per chain. I could quadruple these numbers (hardware-wise) if necessary, but I have a few questions before I get too invested!

1. Is it a terrible idea to try to use Stan at this scale? Is anyone else successfully running models of this size?

2. If it’s not a terrible idea, are there any suggestions for improving performance?

  • the first thing that comes to mind is switching to a QR reparameterization rather than the parameterization below, perhaps?
  • I’m also curious about the optimal ratio of iterations per chain to number of chains.

Thanks.

Stan program

data {
    int<lower=1> J;              // number of observations
    int<lower=0> P;              // number of covariates
    vector[J] y;                 // outcome
    int<lower=1,upper=3> w[J];   // observed (noisy) measurement of the latent x
    simplex[3] pVec;             // marginal distribution of the latent x
    simplex[3] piMat[3];         // piMat[k] = distribution of w given x = k
    row_vector[P] z[J];          // covariates
}

parameters {
    real<lower=0> sigma_y; // residual SD of y
    real alpha; // intercept
    real beta; // slope for the latent x
    vector[P] bCov; // covariate coefficients
}

model {
    alpha ~ normal(0,10);
    beta ~ normal(0,10);
    sigma_y ~ normal(0,10);   // truncated to [0,infty)
    bCov ~ normal(0, 10);     // vectorized prior over all covariate coefficients
    for (j in 1:J) {       // for each observation
        real lp[3];        // log joint contributions for each value of x
        for (k in 1:3) {  // support of x
            lp[k] = categorical_lpmf(k | pVec)
                + categorical_lpmf(w[j] | piMat[k])
                + normal_lpdf(y[j] | alpha + beta * k + z[j]*bCov, sigma_y);
        }
        target += log_sum_exp(lp);   // marginalize over the unobserved x
    }
}

200 parameters is no big deal for Stan, but for 500k observations you would want to look into using the sufficient-statistic version of the normal likelihood, which is discussed in Lancaster’s Introduction to Modern Bayesian Econometrics and implemented in rstanarm.

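For a plain linear regression, the idea looks roughly like the sketch below (a minimal illustration, not rstanarm’s actual implementation; the names N, XtX, Xty, and yty are illustrative, and the design matrix is assumed to already include an intercept column). The point is that the Gaussian log likelihood depends on the data only through X'X, X'y, y'y, and the sample size, so those can be precomputed once instead of looping over 500k rows at every evaluation.

data {
    int<lower=1> N;           // number of observations (enters only as a scalar)
    int<lower=0> P;           // number of columns of X, including the intercept
    matrix[P, P] XtX;         // X'X, precomputed outside Stan
    vector[P] Xty;            // X'y, precomputed outside Stan
    real<lower=0> yty;        // y'y, precomputed outside Stan
}

parameters {
    vector[P] b;              // regression coefficients
    real<lower=0> sigma;      // residual SD
}

model {
    b ~ normal(0, 10);
    sigma ~ normal(0, 10);
    // Gaussian log likelihood written in terms of the sufficient statistics,
    // dropping the constant -0.5 * N * log(2 * pi())
    target += -N * log(sigma)
              - 0.5 * (yty - 2 * dot_product(b, Xty) + quad_form(XtX, b))
                / square(sigma);
}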

Since you are mixing, you will actually need three design matrices, one for each of the three values of k that go in the first column. To do the QR-reparameterization on a matrix that big, you should definitely not use the implementation in Stan Math. Presuming you have a good bit of sparsity, you should use one of the sparse QR algorithms that implements a thin option.
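
For reference, a minimal sketch of the QR reparameterization along these lines (assuming the thin factors Q_ast and R_ast_inv are computed outside Stan, e.g. with a sparse thin-QR routine, and passed in as data; the names are illustrative, and only a single design matrix is shown rather than the three needed for the mixture):

data {
    int<lower=1> J;           // number of observations
    int<lower=0> P;           // number of covariates
    vector[J] y;
    matrix[J, P] Q_ast;       // thin Q factor of the design matrix (precomputed)
    matrix[P, P] R_ast_inv;   // inverse of the corresponding R factor (precomputed)
}

parameters {
    real alpha;
    vector[P] theta;          // coefficients on the Q_ast scale
    real<lower=0> sigma_y;
}

model {
    alpha ~ normal(0, 10);
    theta ~ normal(0, 10);    // note: not equivalent to a normal(0, 10) prior on bCov itself
    sigma_y ~ normal(0, 10);
    y ~ normal(alpha + Q_ast * theta, sigma_y);
}

generated quantities {
    vector[P] bCov = R_ast_inv * theta;   // coefficients back on the original scale
}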

Awesome, thanks for the tip, will check out this alternative likelihood. Do you know if a general approach based on sufficient statistics for GLMs has been developed?

Here k is a loop index, and pVec, w, and piMat are data, so the categorical_lpmf terms are constants; these could be dropped, unless you are trying to do something else, e.g. a mixture model, which would require log_sum_exp.
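
Since the model above does keep the mixture (and hence the log_sum_exp), another way to use that observation is to precompute the constant categorical terms once in transformed data rather than re-evaluating them at every leapfrog step. A sketch, assuming it is added to the program above with no other changes:

transformed data {
    vector[3] log_wt[J];      // pVec, piMat, and w are data, so these never change
    for (j in 1:J)
        for (k in 1:3)
            log_wt[j, k] = log(pVec[k]) + log(piMat[k, w[j]]);
}

The inner loop in the model block would then be lp[k] = log_wt[j, k] + normal_lpdf(y[j] | alpha + beta * k + z[j] * bCov, sigma_y), with the log_sum_exp left unchanged.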


Trying to gain a better understanding of the mechanics of this: could you please point to the chapter/section of the book that discusses it?

More specifically, I’m trying to understand whether something like this is viable for a hierarchical model at the scale of the original poster’s model.

Chapter 3, mostly in the appendix.


That was queued up at some point to go into Stan; I don’t know what happened to it in the math lib.

I still haven’t properly understood the last Lancaster trick for multinomials.
