I want to estimate age (the outcome variable) from genomic DNA methylation values (the predictors). The dataset has 95 samples (73 for training) and more than 160,000 methylation predictors, each with values between 0 and 100. In other species, we use penalized regressions (LASSO or Elastic Net) that shrink the coefficients of uninformative predictors, and the final model contains few (i.e., 1-50) variables. In my new species, I have values for age, but we know that they are likely not very accurate because of the method used to obtain them. I therefore thought a Bayesian approach would be more appropriate, but I have some doubts about how to apply it to my dataset.
- Predictor type: All of my predictors are of the same type, with values from 0 to 100.
- Number of predictors: Before modelling, I filtered the predictors down to 300, based on low variance, correlation with each other, very high or very low values, and correlation with the outcome variable.
- Outcome variable: The age values are themselves estimates from another method and contain a lot of noise. Let’s say 1 year of error, meaning that a sample assigned to age class 3 could actually belong to age class 2, 3 or 4. Ages in this dataset range from 1 to 5 years, although in theory they could range from 0 to 20 years.
- Regularization: I used the horseshoe prior, which should let me select variables, with the goal of ending up with fewer than 50.
But I don’t know how to combine the horseshoe prior with what I know about the outcome values being inaccurate and about the distribution of the predictors. Can I get any advice on how to express this in my model?
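```r
library(rstanarm)

# Gaussian regression of age on all filtered methylation predictors,
# with the default horseshoe prior on the coefficients
model_horseshoe <- stan_glm(age ~ ., data = all.cors.filt.df, family = gaussian(), prior = hs())
```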
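One thing I have considered, as a rough sketch only, is replacing the default `hs()` settings with a global scale derived from my guess at how many predictors matter. The rstanarm prior documentation suggests the ratio of expected non-zero to expected zero coefficients, divided by the square root of the number of observations. The helper names `hs_global_scale` and `fit_horseshoe`, and the guess `p0 = 50`, are my own assumptions for illustration, not part of rstanarm. This does not yet address the noise in the age values, which is the part I am unsure about.

```r
# Hypothetical helper: global scale for hs(), following the guidance in the
# rstanarm prior documentation:
#   (expected non-zero coefs / expected zero coefs) / sqrt(n observations)
hs_global_scale <- function(p0, D, n) {
  stopifnot(p0 > 0, p0 < D, n > 0)
  (p0 / (D - p0)) / sqrt(n)
}

# Hypothetical wrapper around the stan_glm() call above. `df` must contain an
# `age` column plus the filtered methylation predictors; `p0` is a prior guess
# at the number of relevant predictors (an assumption, not an estimate).
fit_horseshoe <- function(df, p0 = 50) {
  if (!requireNamespace("rstanarm", quietly = TRUE)) {
    stop("The rstanarm package is required to fit this model.")
  }
  D <- ncol(df) - 1L  # number of predictors (all columns except age)
  n <- nrow(df)       # number of training samples
  rstanarm::stan_glm(
    age ~ .,
    data = df,
    family = gaussian(),
    prior = rstanarm::hs(global_scale = hs_global_scale(p0, D, n))
  )
}

# Global scale for my setup: 300 predictors, 73 training samples,
# and at most about 50 predictors expected to be relevant.
tau0 <- hs_global_scale(p0 = 50, D = 300, n = 73)
print(tau0)
```

With my data this would be called as `fit_horseshoe(all.cors.filt.df, p0 = 50)`; a smaller `p0` would express a stronger belief in sparsity.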
I was delighted to see how helpful this community seems to be and I would greatly appreciate a bit of help, but I apologize in advance if the answer to my question is obvious and I haven’t been able to think of it or find it.