Hi everybody,
I am currently trying to infer the levels of 5mC and 5hmC in a typical DNA methylation microarray experiment, and I would like to develop a Bayesian hierarchical model that not only estimates those levels but also gives the posteriors of the differences among groups, leading to a ROPE-based Bayesian decision-making process.
I am a complete newbie when it comes to Bayesian modeling, but I have developed a model that seems to work nicely. The problem is that, although the sampling is fast, I need to fit more than 400K of these models (one per probe), so I need to push the performance limits to reach a decent compromise between execution time and the flexibility this kind of approach provides.
Let me define the problem a bit. The input data come as two matrices with the same dimensions: each row represents a probe from the array and each column a sample. The model also receives a vector indicating the group each sample belongs to. My idea was to model the methylation state of each group with a Dirichlet prior over a 3-dimensional simplex of the form (%no_methylation, %5mC, %5hmC); these are the latent methylation values of the problem. The matrices contain values between 0 and 1 indicating the amount of methylation, and they come from different experimental techniques, so we know that one of them (OX in the code below) represents the amount of 5mC and the other (BS) the sum of 5mC and 5hmC.
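In symbols, writing θ_g for the latent simplex of group g (meth_state[g] in the code below), the latent layer is just

$$
\theta_g = (\theta_{g,1}, \theta_{g,2}, \theta_{g,3}) \sim \mathrm{Dirichlet}(\alpha), \qquad \theta_{g,2} = \%\text{5mC}, \quad \theta_{g,3} = \%\text{5hmC}.
$$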
I have modeled the observed matrices as Beta-distributed, with means that depend on the Dirichlet-generated values: one Beta distribution (for OX) has a mean equal to the second component of the simplex, and the other's (for BS) mean is the sum of the second and third components. The Beta distributions are parameterized by their mean and sample size, and I have put a prior on the sample size using a Gamma distribution with shape = 2 and mean = 1000.
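With that mean/sample-size parameterization, Beta(mean μ, sample size ν) means Beta(α = μν, β = (1 − μ)ν), so for probe i and sample j in group g(j) (group_id[j] in the code) the observation model is

$$
\mathrm{OX}_{ij} \sim \mathrm{Beta}\!\left(\theta_{g(j),2}\,\nu,\; \big(1-\theta_{g(j),2}\big)\,\nu\right),
\qquad
\mathrm{BS}_{ij} \sim \mathrm{Beta}\!\left(\big(\theta_{g(j),2}+\theta_{g(j),3}\big)\,\nu,\; \big(1-\theta_{g(j),2}-\theta_{g(j),3}\big)\,\nu\right),
$$

with $\nu - 1 \sim \mathrm{Gamma}(\text{shape}=2,\ \text{rate}=2/1000)$, which has mean 1000.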
This is the model I have developed:
data {
  int <lower = 1> n_probes;
  int <lower = 1> n_samples;
  int <lower = 1> n_groups;
  real <lower = 0, upper = 1> BS[n_probes, n_samples];
  real <lower = 0, upper = 1> OX[n_probes, n_samples];
  int <lower = 1> group_id[n_samples];
  vector <lower = 1>[3] alpha;
  real <lower = 0> nu_shape;
  real <lower = 0> nu_mean;
}
parameters {
  real <lower = 0> nu_minus_one;
  simplex[3] meth_state[n_groups];
}
model {
  vector[n_groups] BS_alpha;
  vector[n_groups] BS_beta;
  vector[n_groups] OX_alpha;
  vector[n_groups] OX_beta;
  vector[n_groups] mc;
  vector[n_groups] hmc;
  real nu;
  nu_minus_one ~ gamma(nu_shape, nu_shape / nu_mean);
  nu = nu_minus_one + 1;
  for (j in 1:n_groups) {
    meth_state[j] ~ dirichlet(alpha);
    mc[j] = meth_state[j][2];
    hmc[j] = meth_state[j][3];
  }
  OX_alpha = mc * nu;
  OX_beta = (1 - mc) * nu;
  BS_alpha = (mc + hmc) * nu;
  BS_beta = (1 - mc - hmc) * nu;
  for (j in 1:n_samples) {
    BS[, j] ~ beta(BS_alpha[group_id[j]], BS_beta[group_id[j]]);
    OX[, j] ~ beta(OX_alpha[group_id[j]], OX_beta[group_id[j]]);
  }
}
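For the ROPE-based decisions mentioned at the top, the quantities of interest would be posterior differences between groups, which could be added as a generated quantities block along these lines (just a sketch, assuming exactly two groups; delta_mc and delta_hmc are illustrative names):

generated quantities {
  // posterior differences in %5mC and %5hmC between groups 1 and 2;
  // these draws are what would be compared against a ROPE
  real delta_mc = meth_state[1][2] - meth_state[2][2];
  real delta_hmc = meth_state[1][3] - meth_state[2][3];
}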
Currently I am seeing an average time of roughly 5 s per probe, with the usual 4000 iterations and 4 chains. Model diagnostics look fine and the simple test cases work, but I would like to improve the model's performance, since any improvement adds up to substantially lower execution times over more than 400K fits. I have tried to vectorize as much as I could, but given my inexperience in this field I suspect I might be missing the point completely.
Any hint or help would be much appreciated.
Regards,
Gustavo