Break up a large dataset and take average

bhomass · August 22, 2017, 4:49am

for very large models which takes too long to run, does it make sense to break up the dataset, run them in parallel, then average the results at the end in order to speed up execution?

anon75146577 · August 22, 2017, 4:53am

Yes and no. It depends on what the model you’d run for the whole data set would be.

bhomass · August 22, 2017, 4:59am

it has a logistic regression as model, calculating for beta parameters where

inv_logit(alpha + x * beta)

I plan to apply QR reparameterization to x * beta after this works.

anon75146577 · August 22, 2017, 5:02am

If you’re assuming y|alpha,beta,x is conditionally independent, then this is ok, but it’s probably faster to just subsample your data directly.

eg, find a cheap classifier, and then subsample near the classifier boundary (called local case control subsampling in the literature) and then compute the posterior on that subsample.

avehtari · August 22, 2017, 11:26am

If the model is non-hierarchical and the number of observations n is much larger than the number of covariates, I would expect advi with QR parameterization to work also well.

For hierarchical models, see, https://arxiv.org/abs/1412.4869
This paper also has several references for alternative methods for parallel computation using subsets of data.

Aki

bhomass · August 25, 2017, 1:21am

Apparently case control sampling is only for logistic regression? Very limited use.

anon75146577 · August 25, 2017, 1:24am

Yes, but that’s what you’re doing isn’t it?

Also the principle of weighting your samples based on something like the information is transferable. Basically LCC subsampling is one demonstration of the general principle that you can do better than random subsampling if you think about your data and your model.

bhomass · August 25, 2017, 5:30am

Bob Carpenter mentioned in another thread that none of Gellman’s work in Expectation Propagation has made into the repo. Is there still true? any known road maps? if not, you are suggesting moving away from stan and roll your own?

Another question, AFAIK, ADVI is the default stan implementation. There is nothing I need to spell out in my code. Is that right?

avehtari · August 25, 2017, 7:07am

EP-parallallization is not in the Stan repo. The code for the paper is available at GitHub - gelman/ep-stan
There will be some future changes in Stan, which will make implementation easier. I recommend to read the paper first to check whether your problem is such that it would benefit from EP-parallelization. It’s also possible to you can get useful results with something simpler.

Yes. At least RStan and PyStan have an option for advi.

bhomass · August 25, 2017, 4:00pm

I am using cmdstan. what about that as far as designating ADVI?

My model has dynamic HMM, an array dependent variable, random effect consumer response. It seems complicated enough. However, please suggest a few names that are considered simpler than EP.

avehtari · August 25, 2017, 5:19pm

I’m sorry, but based on this I assumed it’s just non-hierarchical logistic regression. I want to be clear that I was saying is that advi could be good enough for a simple non-hierarchical logistic regression with Gaussian prior on beta (preferably with QR-transformation), and n>>p (where p is the number of covariates).

Oh, you should have told that in the beginning. For this I can’t recommend anything else than NUTS. Depending on the hierarchical structure, EP-parallelization with NUTS on group level might be applicable, but can’t say based on this description.

You can find references to all the methods I know in the EP-life paper mentioned before. If you tell more about which kind of model and data you have, I may be able to give more specific recommendation which kind of algorithm might be useful.

bhomass · August 26, 2017, 12:08am

Sorry for the initial brief description. It was to get the conversation started. I can give a pretty good picture of the model. Its a consumer response study to find the effectiveness of sales promotions. The consumer is assumed to be going thru the usual hidden funnel states, with some observable indicator giving hints which is the current state. The consumer states make transition based on a inverse logit of multivariate stimulus, giving rise to a random effect intercept and fixed effect coefficients. Finally, there is another inverse logit to predict a conversion event, which also generates an intercept and another set of coefficients. By my projection, this model will take a month to run in a single machine. QR reparametrization failed due to non-invertible matrices. So I am looking for a way to break up the computation. Would concensus MC be a good candidate?

bgoodri · August 26, 2017, 12:23am

The QR reparametrization cannot fail; it can only be failed. That is to say: Why is your design matrix rank deficient?

bhomass · August 26, 2017, 12:42am

Aha, interesting, that was definitely the question I should have asked myself. After adding more rows, that problem went away. However, back to parallel processing discussion. I still like to pursue that approach, and leave QR as another option.

savsec · August 26, 2017, 5:57pm

I’ve built a similar model in Stan recently. Simple logistic regression, ~150K samples, ~10 features. I couldn’t get the model to sample/converge, so I tried the QR decomposition. That blew through my memory and crashed. It seems that the QR decomposition in Stan returns two N by N matrices instead of an N by K and K by K (N is number of samples, K is number of features). I ended up computing the QR decomposition in Python and passing the result into Stan as data. (scipy.linalg.qr(x, ‘economic’) gives you the N by K, K by K decomposition). I ran four chains, 2K samples, 500 warmup. The whole thing takes about 10 minutes and the convergence is excellent. I highly recommend this approach.

bgoodri · August 26, 2017, 6:28pm

That is correct and memory intensive. I always do the QR decomposing in R, which produces a thin QR decomposition using parallel computation.

savsec · August 26, 2017, 6:31pm

Is there any plan to fix this? I can’t think of a use the “full” QR decomposition has (which can’t also be accomplished with the “thin”). If there’s a compelling reason to have the “full” QR decomposition as an option in Stan, at least could there be separate functions for the full and thin versions?

bgoodri · August 26, 2017, 6:32pm

Eigen only does fat QR decompositions.

Charles_Driver · August 26, 2017, 7:43pm

I also wanted the ‘other’ sort of QR decomposition at some point, though I only wanted the R matrix out of it, so put this together - perhaps it is helpful:

  matrix qrR(matrix in){
  int n;
  int m;
  real s;
  vector[cols(in)] d;
  real fak;
  matrix[rows(in),cols(in)] a;
  matrix[cols(in),cols(in)] out;
  
  a = in;
  
  n=cols(a);
  m=rows(a);
  
  for(j in 1:n){
    s = 0;
    for( i in j : m)  {
      s = s + a[i,j]^2;
    }
    s = sqrt(s); 
    if(a[j,j] > 0) { d[j] = -s;} else{ d[j] = s;}
    fak = sqrt(s * (s + fabs(a[j,j])));
    a[j,j] = a[j,j]-d[j] ;
    for( k in j:m){
      a[k,j] = a[k,j] / fak;
    }
    for( i in min(j + 1,n)  : n ){
      s = 0;
      for (k in j : m ){
        s= s + a[k,j] * a[k,i];
      }
      for( k in j : m ){ 
        a[k,i] = a[k,i] - a[k,j] * s;
      }
    }
  }
  for( i in 1:n){
    for(j in 1:n){
      if(i>j) out[i,j]=a[j,i];
      if(i == j) out[i,j]=d[i];
      if(i<j) out[i,j]=0;
    }}
  return -out;
}

bhomass · August 27, 2017, 11:53pm

I am going over the ep-stan code. It seems as it is, it’s a single machine demonstration of a parallelizable algorithm. Has any one tried this out over multiple machines? It would appear PySpark would come in handy?

Topic		Replies	Views
Stan on GPU: looking for model+dataset examples for empirical evaluation of speedups General	36	3551	March 5, 2018
Linear, parallell regression CmdStan	89	5509	September 20, 2018
Reduce computation time with highly-correlated covariates Modeling rstan	18	4002	July 3, 2020
Stuck at Warmup iteration with no error : CmdStanR CmdStan techniques , fitting-issues	48	3270	April 21, 2020
Parallel computation Chains over nodes of a cluster. Parallel computation of log_likelihood within a node Modeling	10	1833	October 12, 2018

Break up a large dataset and take average

Related topics