Fitting a hierarchical logistic regression to a large dataset

I am a molecular biologist investigating the effect of genetic variants on a disease. My data consist of 2000 patients, 1000 with the disease in question (asthma) and 1000 who are healthy, so the outcome is binary. For each patient I have data on 100000 variants (called SNPs), each coded 0, 1 or 2 depending on the number of copies of that SNP the individual possesses. I want to estimate the effect of each SNP, so I assign a coefficient b to each SNP in a multivariate logistic regression, but each coefficient is drawn from an adaptive prior, making it a hierarchical model. I put a standard normal hyperprior on the mean of the coefficients, since I expect most SNPs to have no effect on the disease. My problem is that I quickly run out of memory while fitting the model to this large dataset. I wonder whether my model is not constructed well enough, or whether my computer is simply not powerful enough (in which case there is nothing I can do).

My computer specifications are: Intel® Core™ i5-7200U CPU @ 2.50GHz, with 8 GB RAM.

I am using 500 iterations and rstan's options(mc.cores = 4), as I only have 4 cores.

Do you have any suggestions for what I could do in order to fit this model to the large dataset described above? My model is shown below:

data{
  int<lower=1> N; // Rows
  int<lower=1> M; // Columns
  
  int<lower=0, upper=1> y[N]; // Outcome variables
  matrix<lower=0, upper=2>[N, M] x; // Predictor variables
}

parameters{
  // Hyper priors
  real mu;
  real<lower=0> sigma;
  
  // Priors
  real a;
  vector[M] b;
}

model{
  // Hyper-priors
  mu ~ std_normal();
  sigma ~ cauchy(0, 5);
  
  // Priors
  a ~ normal(0, 5);
  b ~ normal(mu, sigma);
  
  // Likelihood
  y ~ bernoulli_logit(a + x * b);
}

generated quantities{
  vector[M] OR;
  OR = exp(b); // effect size of each SNP
}

If the problem is running out of memory, the easiest thing to do is to run one chain at a time. This will slow down the overall time to get a solution, but it will give you a better chance of fitting such a large dataset in the memory you have.
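For example, something along these lines, running the chains sequentially and combining them afterwards. This is only a sketch: snp_model.stan and stan_data are placeholder names for a file containing the model above and a list holding N, M, y and x.

library(rstan)

# Run the chains one at a time, so only one chain holds a copy of the data
# and its draws while sampling
fits <- vector("list", 4)
for (i in 1:4) {
  fits[[i]] <- stan(
    file = "snp_model.stan",  # placeholder: the model above
    data = stan_data,         # placeholder: list(N = N, M = M, y = y, x = x)
    chains = 1,
    iter = 500,
    chain_id = i,
    seed = 1234               # same seed, different chain_id, for each run
  )
}

# Combine the single-chain fits into one stanfit object
fit <- sflist2stanfit(fits)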

Another thing is to compute the odds ratios outside of Stan, so you save a bit more memory.
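That is, drop the generated quantities block and compute them afterwards, for example like this (a sketch, using the fit object from above):

b_draws <- rstan::extract(fit, pars = "b")$b  # iterations x M matrix of draws
OR <- exp(b_draws)                            # odds ratio draws for each SNP
OR_mean <- colMeans(OR)                       # posterior mean odds ratio per SNP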

You could then look at using bernoulli_logit_glm, which should speed up the sampling a bit. If I am not mistaken, the likelihood line would then become y ~ bernoulli_logit_glm(x, a, b); in place of y ~ bernoulli_logit(a + x * b);.

The model itself looks to me as simple as it gets, but perhaps somebody else will be able to offer some advice on that.


Thinking a bit more outside the box, have you looked into ways of reducing the number of SNPs, for example by LD clumping? Or by removing the SNPs found to be least associated in large meta-analyses (such as https://www.ncbi.nlm.nih.gov/pubmed/20860503), assuming the data you are using are not part of that?

Hey! So, N is 2000 and M is 100000? Yeah, that’s kinda large. Did you try excluding the generated quantities block? This only eats up memory and you can easily compute the OR afterwards.

Also try a single chain with as few iterations as sensible (not fewer, though), as @mcol suggested. You can also check whether the save_warmup = FALSE option in the stan call helps.
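Combined with the single-chain approach above, that would look something like this (again a sketch with placeholder file and data names):

fit <- stan(
  file = "snp_model.stan",
  data = stan_data,
  chains = 1,
  iter = 500,          # keep this as low as still gives reliable inference
  save_warmup = FALSE  # do not keep the warmup draws in the returned object
)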

Good luck! :)

If memory is the concern, then cmdstan (or one of the interfaces like cmdstanR) could work, since the output is written straight to disk rather than being stored in memory.
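A minimal sketch with cmdstanr (again, snp_model.stan and stan_data are placeholder names):

library(cmdstanr)

mod <- cmdstan_model("snp_model.stan")  # compile the model above
fit <- mod$sample(
  data = stan_data,      # list with N, M, y and x
  chains = 1,
  iter_warmup = 250,
  iter_sampling = 250
)

# Draws are written to CSV files on disk; pull into memory only what you need
draws_b <- fit$draws(variables = "b")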


Would cmdstanr also do the trick?

Have you tried running this with fake data, say 100 patients total and maybe 1000 SNPs, to see how long it takes to run and how much memory it uses? We have the same issue with microbiome data in our lab.
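For example, something like this quick simulation (a sketch; the allele frequency and effect sizes are arbitrary):

set.seed(1)
N <- 100    # patients
M <- 1000   # SNPs
x <- matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N, ncol = M)  # 0/1/2 genotypes
a <- 0                                  # intercept
b <- rnorm(M, 0, 0.1)                   # mostly small effects
y <- rbinom(N, 1, plogis(a + x %*% b))  # binary outcome
stan_data <- list(N = N, M = M, y = y, x = x)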

I would think so.