I am a molecular biologist investigating the effect of genetic variants on a disease. My data consist of 2000 patients: 1000 with the disease in question (asthma) and 1000 who are healthy, so the outcome is binary. For each patient I have data on 100000 variants (SNPs), each coded as 0, 1 or 2 depending on how many copies of that SNP the individual possesses. I want to estimate the effect of each SNP, so I assign a coefficient b to each SNP in a multiple logistic regression, where the coefficients are drawn from an adaptive prior, making it a hierarchical model. The prior on b is centred on a mean mu that itself gets a standard normal hyperprior, since I expect most SNPs to have no effect on the disease. My problem is that I quickly run out of memory while fitting the model to this large dataset. I wonder whether this is because my model is not constructed well, or simply because my computer is not powerful enough (in which case there is not much I can do).
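Written out, with x_i denoting the row of genotypes for patient i, the model is:

$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}\!\left(\mathrm{logit}^{-1}(a + x_i b)\right), \quad i = 1, \dots, N,\\
b_j &\sim \mathrm{Normal}(\mu, \sigma), \quad j = 1, \dots, M,\\
a &\sim \mathrm{Normal}(0, 5), \quad \mu \sim \mathrm{Normal}(0, 1), \quad \sigma \sim \mathrm{HalfCauchy}(0, 5).
\end{aligned}
$$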
My computer specifications are: Intel® Core™ i5-7200U CPU @ 2.50GHz, with 8 GB RAM.
I am running 500 iterations and have set rstan's options(mc.cores = 4), as I only have 4 cores.
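For reference, this is roughly how I call the sampler; the file name snp_model.stan and the objects x and y are just placeholders for my actual script and data:

library(rstan)
options(mc.cores = 4)              # one chain per core

stan_data <- list(N = nrow(x), M = ncol(x), y = y, x = x)

fit <- stan(
  file   = "snp_model.stan",       # the model shown below
  data   = stan_data,
  chains = 4,
  iter   = 500
)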
Do you have any suggestions for what I could do in order to fit this model to the large dataset described above? My model is below:
data {
  int<lower=1> N;                    // Number of patients (rows)
  int<lower=1> M;                    // Number of SNPs (columns)
  int<lower=0, upper=1> y[N];        // Disease status (outcome)
  matrix<lower=0, upper=2>[N, M] x;  // SNP genotypes (predictors)
}
parameters {
  // Hyperparameters
  real mu;
  real<lower=0> sigma;
  // Parameters
  real a;
  vector[M] b;
}
model {
  // Hyperpriors
  mu ~ std_normal();
  sigma ~ cauchy(0, 5);
  // Priors
  a ~ normal(0, 5);
  b ~ normal(mu, sigma);
  // Likelihood
  y ~ bernoulli_logit(a + x * b);
}
generated quantities {
  vector[M] OR;
  OR = exp(b);                       // Odds ratio (effect size) of each SNP
}
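After fitting, I extract the odds ratios from the stanfit object like this (fit is the object from the call above):

OR_draws <- rstan::extract(fit, pars = "OR")$OR   # draws x M matrix of posterior samples
OR_mean  <- colMeans(OR_draws)                    # posterior mean odds ratio per SNP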