Fitting a hierarchical logistic regression to a large dataset

I am a molecular biologist investigating the effect of genetic variants on a disease. My data consist of 2000 patients, 1000 with the disease in question (asthma) and 1000 who are healthy, so the outcome is binary. For each patient I have data on 100000 variants (called SNPs), each coded 0, 1 or 2 depending on the number of copies of that SNP the individual possesses. I want to estimate the effect of each SNP, so I assign a coefficient b to each SNP in a multivariate logistic regression, but each coefficient is drawn from an adaptive prior, making it a hierarchical model. I put a standard normal hyperprior on the mean of the coefficients, since I expect most SNPs to have no effect on the disease. My problem is that I quickly run out of memory while fitting the model to this large dataset. I wonder whether my model is not constructed well enough, or whether my computer is simply not powerful enough (in which case there is nothing I can do).

My computer specifications are: Intel® Core™ i5-7200U CPU @ 2.50GHz, with 8 GB RAM.

I am using 500 iterations and rstan's options(mc.cores = 4), as I only have 4 cores.

Do you have any suggestions for what I could do in order to fit this model to the large dataset described above? My model is shown below:

data{
  int<lower=1> N; // Rows
  int<lower=1> M; // Columns
  
  int<lower=0, upper=1> y[N]; // Outcome variables
  matrix<lower=0, upper=2>[N, M] x; // Predictor variables
}

parameters{
  // Hyper priors
  real mu;
  real<lower=0> sigma;
  
  // Priors
  real a;
  vector[M] b;
}

model{
  // Hyper-priors
  mu ~ std_normal();
  sigma ~ cauchy(0, 5);
  
  // Priors
  a ~ normal(0, 5);
  b ~ normal(mu, sigma);
  
  // Likelihood
  y ~ bernoulli_logit(a + x * b);
}

generated quantities{
  vector[M] OR;
  OR = exp(b); // effect size of each SNP
}

If the problem is running out of memory, the easiest thing to do is to run one chain at a time. This will slow down the overall time to get a solution, but it will give you a better chance of fitting such a large dataset in the memory you have.
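For example, something along these lines, running the chains sequentially and combining them afterwards. This is only a sketch: snp_model.stan and stan_data are placeholder names for a file containing the model above and a list holding N, M, y and x.

library(rstan)

# Run the chains one at a time, so only one chain holds a copy of the data
# and its draws while sampling
fits <- vector("list", 4)
for (i in 1:4) {
  fits[[i]] <- stan(
    file = "snp_model.stan",  # placeholder: the model above
    data = stan_data,         # placeholder: list(N = N, M = M, y = y, x = x)
    chains = 1,
    iter = 500,
    chain_id = i,
    seed = 1234               # same seed, different chain_id, for each run
  )
}

# Combine the single-chain fits into one stanfit object
fit <- sflist2stanfit(fits)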

Another thing is to compute the odds ratios outside of Stan, so you save a bit more memory.
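That is, drop the generated quantities block and compute them afterwards, for example like this (a sketch, using the fit object from above):

b_draws <- rstan::extract(fit, pars = "b")$b  # iterations x M matrix of draws
OR <- exp(b_draws)                            # odds ratio draws for each SNP
OR_mean <- colMeans(OR)                       # posterior mean odds ratio per SNP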

You could then look at using bernoulli_logit_glm, which should speed up the sampling a bit. If I am not mistaken, the likelihood line would then become y ~ bernoulli_logit_glm(x, a, b); in place of y ~ bernoulli_logit(a + x * b);.

The model itself looks to me as simple as it gets, but perhaps somebody else will be able to offer some advice on that.


Thinking a bit more outside the box, have you looked into ways of reducing the number of SNPs, for example by LD clumping? Or by removing the SNPs found to be least associated in large meta-analyses (such as https://www.ncbi.nlm.nih.gov/pubmed/20860503), assuming the data you are using are not part of that?

Hey! So, N is 2000 and M is 100000? Yeah, that’s kinda large. Did you try excluding the generated quantities block? This only eats up memory and you can easily compute the OR afterwards.

Also try a single chain with as few iterations as sensible (not fewer, though), as @mcol suggested. You can also check whether the save_warmup = FALSE option in the stan call helps.
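Combined with the single-chain approach above, that would look something like this (again a sketch with placeholder file and data names):

fit <- stan(
  file = "snp_model.stan",
  data = stan_data,
  chains = 1,
  iter = 500,          # keep this as low as still gives reliable inference
  save_warmup = FALSE  # do not keep the warmup draws in the returned object
)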

Good luck! :)

If memory is the concern, then cmdstan (or one of the interfaces like cmdstanR) could work, since the output is written straight to disk rather than being stored in memory.
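A minimal sketch with cmdstanr (again, snp_model.stan and stan_data are placeholder names):

library(cmdstanr)

mod <- cmdstan_model("snp_model.stan")  # compile the model above
fit <- mod$sample(
  data = stan_data,      # list with N, M, y and x
  chains = 1,
  iter_warmup = 250,
  iter_sampling = 250
)

# Draws are written to CSV files on disk; pull into memory only what you need
draws_b <- fit$draws(variables = "b")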


Would cmdstanr also do the trick?

Have you tried running this with fake data, say 100 patients total and maybe 1000 SNPs, to see how long it takes to run and how much memory it uses? We have the same issue with microbiome data in our lab.
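For example, something like this quick simulation (a sketch; the allele frequency and effect sizes are arbitrary):

set.seed(1)
N <- 100    # patients
M <- 1000   # SNPs
x <- matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N, ncol = M)  # 0/1/2 genotypes
a <- 0                                  # intercept
b <- rnorm(M, 0, 0.1)                   # mostly small effects
y <- rbinom(N, 1, plogis(a + x %*% b))  # binary outcome
stan_data <- list(N = N, M = M, y = y, x = x)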

I would think so.