Break up a large dataset and take the average

For very large models that take too long to run, does it make sense to break up the dataset, run the pieces in parallel, and then average the results at the end in order to speed up execution?

Yes and no. It depends on what the model you'd run for the whole data set would be.

The model is a logistic regression, estimating the beta parameters, where

inv_logit(alpha + x * beta)

I plan to apply QR reparameterization to x * beta after this works.
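(For reference, the reparameterization I mean is the standard one from the Stan manual: with the thin decomposition x = Q * R, write x * beta = Q * theta where theta = R * beta, sample theta, and then recover beta = R^-1 * theta.)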

If you're assuming y | alpha, beta, x is conditionally independent, then this is OK, but it's probably faster to just subsample your data directly.

E.g., find a cheap classifier, subsample near the classifier boundary (called local case-control subsampling in the literature), and then compute the posterior on that subsample.
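Roughly, the idea looks something like this (just a sketch of the principle; the sklearn pilot model and the acceptance rule are my illustration, and the full method in the paper also corrects the fit on the subsample for the biased sampling):

import numpy as np
from sklearn.linear_model import LogisticRegression

def lcc_subsample(X, y, rng=None):
    # Fit a cheap pilot classifier on all the data (or a random subset).
    rng = np.random.default_rng(0) if rng is None else rng
    pilot = LogisticRegression(max_iter=1000).fit(X, y)
    p = pilot.predict_proba(X)[:, 1]
    # Keep point i with probability |y_i - p_i|: points the pilot finds
    # surprising (near the boundary or misclassified) are kept more often.
    keep = rng.random(len(y)) < np.abs(y - p)
    return X[keep], y[keep]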

If the model is non-hierarchical and the number of observations n is much larger than the number of covariates, I would expect ADVI with the QR parameterization to also work well.

For hierarchical models, see https://arxiv.org/abs/1412.4869
This paper also has several references for alternative methods for parallel computation using subsets of data.

Aki

Apparently case control sampling is only for logistic regression? Very limited use.

Yes, but that's what you're doing, isn't it?

Also, the principle of weighting your samples based on something like their information content is transferable. Basically, LCC subsampling is one demonstration of the general principle that you can do better than random subsampling if you think about your data and your model.

Bob Carpenter mentioned in another thread that none of Gelman's work on expectation propagation has made it into the repo. Is that still true? Any known road map? If not, are you suggesting moving away from Stan and rolling your own?

Another question: AFAIK, ADVI is part of the standard Stan implementation, and there is nothing I need to spell out in my model code. Is that right?

EP parallelization is not in the Stan repo. The code for the paper is available on GitHub at gelman/ep-stan.
There will be some future changes in Stan which will make the implementation easier. I recommend reading the paper first to check whether your problem is one that would benefit from EP parallelization. It's also possible that you can get useful results with something simpler.

Yes. At least RStan and PyStan have an option for ADVI.
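For example, in PyStan 2 it is just a different method call on the compiled model (the model file name and the stan_data dict are placeholders here):

import pystan

sm = pystan.StanModel(file="logistic.stan")
# stan_data: the usual dict of data for the model
fit = sm.vb(data=stan_data, algorithm="meanfield")  # ADVI; "fullrank" is the other option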

I am using CmdStan. What about that as far as designating ADVI?

My model has a dynamic HMM, an array-valued dependent variable, and a random-effect consumer response. It seems complicated enough. However, please suggest a few methods that are considered simpler than EP.

I'm sorry, but based on this I assumed it was just a non-hierarchical logistic regression. To be clear, what I was saying is that ADVI could be good enough for a simple non-hierarchical logistic regression with a Gaussian prior on beta (preferably with the QR transformation) and n >> p (where p is the number of covariates).

Oh, you should have said that in the beginning. For this I can't recommend anything other than NUTS. Depending on the hierarchical structure, EP parallelization with NUTS at the group level might be applicable, but I can't say based on this description.

You can find references to all the methods I know of in the EP way-of-life paper mentioned before. If you tell me more about what kind of model and data you have, I may be able to give a more specific recommendation about which kind of algorithm might be useful.

Sorry for the initially brief description; it was just to get the conversation started. I can give a pretty good picture of the model. It's a consumer response study to find the effectiveness of sales promotions. The consumer is assumed to be going through the usual hidden funnel states, with some observable indicators giving hints about the current state. The consumer states transition based on an inverse logit of a multivariate stimulus, giving rise to a random-effect intercept and fixed-effect coefficients. Finally, there is another inverse logit to predict a conversion event, which also has an intercept and another set of coefficients.

By my projection, this model will take a month to run on a single machine. QR reparametrization failed due to non-invertible matrices, so I am looking for a way to break up the computation. Would consensus MC be a good candidate?
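To make the combine step concrete, the kind of consensus averaging I have in mind looks roughly like this (a sketch only, not tied to any particular interface; it assumes each machine returns a draws-by-parameters matrix for the same unconstrained parameters, and that each shard was fit with the prior downweighted to 1/S of its full strength):

import numpy as np

def consensus_combine(shard_draws):
    # shard_draws: list of S arrays, each T x D (T draws of D parameters
    # from one data shard). Weight each shard by its empirical precision
    # and average the shards draw-by-draw, consensus-MC style.
    precisions = [np.linalg.inv(np.cov(d, rowvar=False)) for d in shard_draws]
    combined = sum(d @ w for d, w in zip(shard_draws, precisions))
    return combined @ np.linalg.inv(sum(precisions))

As I understand it, this kind of averaging is only exact when the subposteriors are roughly Gaussian, which is part of why I am asking whether it is a good fit here.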

The QR reparametrization cannot fail; it can only be failed. That is to say: Why is your design matrix rank deficient?


Aha, interesting, that was definitely the question I should have asked myself. After adding more rows, that problem went away. However, back to the parallel processing discussion: I'd still like to pursue that approach and leave QR as another option.

I've built a similar model in Stan recently: a simple logistic regression, ~150K samples, ~10 features. I couldn't get the model to sample/converge, so I tried the QR decomposition. That blew through my memory and crashed. It seems that the QR decomposition in Stan returns the full factors (an N by N Q and an N by K R) instead of the thin N by K and K by K ones (N is the number of samples, K is the number of features). I ended up computing the QR decomposition in Python and passing the result into Stan as data (scipy.linalg.qr(x, mode='economic') gives you the N by K, K by K decomposition). I ran four chains, 2K samples, 500 warmup. The whole thing takes about 10 minutes and the convergence is excellent. I highly recommend this approach.
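The preprocessing step looks roughly like this (a sketch with a synthetic design matrix standing in for the real data; the sqrt(N - 1) scaling follows the Stan user's guide convention and is optional):

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
x = rng.normal(size=(150_000, 10))   # stand-in for the real N x K design matrix

N, K = x.shape
Q, R = qr(x, mode="economic")        # thin decomposition: Q is N x K, R is K x K
Q_ast = Q * np.sqrt(N - 1)           # scaling convention from the Stan user's guide
R_ast = R / np.sqrt(N - 1)
R_ast_inv = np.linalg.inv(R_ast)     # the Stan model recovers beta = R_ast_inv * theta

stan_data = {"N": N, "K": K, "Q_ast": Q_ast, "R_ast_inv": R_ast_inv}  # plus y, etc.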


That is correct, and memory intensive. I always do the QR decomposition in R, which produces a thin QR decomposition using parallel computation.

Is there any plan to fix this? I can't think of a use the "full" QR decomposition has (which can't also be accomplished with the "thin"). If there's a compelling reason to have the "full" QR decomposition as an option in Stan, at least could there be separate functions for the full and thin versions?

Eigen only does fat QR decompositions.

I also wanted the 'other' sort of QR decomposition at some point, though I only wanted the R matrix out of it, so I put this together; perhaps it is helpful:

matrix qrR(matrix X) {
  // Householder QR of X (rows >= columns), keeping only the triangular factor.
  // (The argument is renamed from "in", which is a reserved word in Stan.)
  int n;
  int m;
  real s;
  real fak;
  vector[cols(X)] d;
  matrix[rows(X), cols(X)] a;
  matrix[cols(X), cols(X)] out;

  a = X;
  n = cols(a);
  m = rows(a);

  for (j in 1:n) {
    // norm of column j from the diagonal down
    s = 0;
    for (i in j:m) {
      s = s + a[i, j]^2;
    }
    s = sqrt(s);
    // choose the sign of the diagonal entry to avoid cancellation
    if (a[j, j] > 0) { d[j] = -s; } else { d[j] = s; }
    fak = sqrt(s * (s + fabs(a[j, j])));
    a[j, j] = a[j, j] - d[j];
    // scale the Householder vector stored in column j below the diagonal
    for (k in j:m) {
      a[k, j] = a[k, j] / fak;
    }
    // apply the reflector to the remaining columns
    for (i in min(j + 1, n):n) {
      s = 0;
      for (k in j:m) {
        s = s + a[k, j] * a[k, i];
      }
      for (k in j:m) {
        a[k, i] = a[k, i] - a[k, j] * s;
      }
    }
  }
  // Assemble the output: the strict upper triangle of a and the diagonal d
  // hold the R factor; out is filled with its transpose, so the returned
  // matrix is lower triangular (minus R transposed).
  for (i in 1:n) {
    for (j in 1:n) {
      if (i > j) out[i, j] = a[j, i];
      if (i == j) out[i, j] = d[i];
      if (i < j) out[i, j] = 0;
    }
  }
  return -out;
}

I am going over the ep-stan code. It seems that, as it is, it's a single-machine demonstration of a parallelizable algorithm. Has anyone tried this out over multiple machines? It would appear PySpark would come in handy?