Prior setup: combining a horseshoe prior with knowledge about noise in the response variable

I want to estimate age as the outcome variable, and my predictors are genomic DNA methylation values. The dataset has 95 samples (73 for training) and thousands (>160,000) of genomic DNA methylation values (between 0 and 100). In other species, we use machine learning models that shrink the non-significant coefficients, i.e., penalized regressions such as LASSO or Elastic Net, and the final model contains few (1-50) variables. In my new species I have values for age, but we know they are likely not very accurate because of the method used to obtain them. I therefore thought applying Bayesian statistics would be more appropriate; however, I have some doubts about how to do this with my dataset.

  • Predictor type: All of my predictors are of the same type, with values from 0 to 100.

  • Number of predictors: I applied filters prior to modelling to reduce the number of predictors (excluding predictors with low variance, predictors correlated with each other, predictors with very high or very low values, and filtering based on correlation with my outcome variable), reaching 300 predictors before modelling.

  • Outcome variable: The age values used as the outcome are themselves estimates from another method and contain a lot of noise. Say 1 year of error, meaning that a sample assigned to age class 3 could actually be age class 2, 3, or 4. Values in this dataset range from 1 to 5, while theoretically they could range from 0 to 20 years.

  • Regularization: I used the horseshoe prior, which allows variable selection, with the goal of ending up with <50 variables.

But I don't know how to combine the horseshoe prior with my knowledge that the outcome values are not quite right, and with what I know about the distribution of the predictors. Can I get any advice on how to express this in my model?

library(rstanarm)  # hs() is the rstanarm horseshoe prior
model_horseshoe <- stan_glm(age ~ ., data = all.cors.filt.df, family = gaussian(), prior = hs())

I was delighted to see how helpful this community seems to be, and I would greatly appreciate a bit of help. I apologize in advance if the answer to my question is obvious and I simply haven't been able to work it out or find it.


Based on your data description, I also recommend checking the paper Projective inference in high-dimensional problems: prediction and feature selection. For very high-dimensional but small data sets, it can be better to use all the data and not filter out, e.g., collinear covariates. All collinear covariates can be used (exploiting the information in them) by creating an SPCA-based reference model and then using the fast projection predictive approach for variable selection (the approach has a Bayesian decision-theoretic justification). That paper shows the projpred approach beats LASSO and Elastic Net. The methods in the paper are available in the dimreduce and projpred packages, and projpred is easy to use with stan_glm models. If there are many collinear covariates, this approach is probably better than using the horseshoe prior directly.

Horseshoe is useful if the problem really is sparse, so that only a few covariates are relevant (some collinearity is fine). It might be easier to set the horseshoe prior and the prior on the residual variance in brms; see Regularized horseshoe priors in brms — horseshoe • brms and Prior Definitions for brms Models — set_prior • brms. Instead of the horseshoe prior, you could also consider the R2D2 prior, which allows setting a prior on R^2 and on covariate sparsity; see R2-D2 Priors in brms — R2D2 • brms
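For example, here is a minimal sketch of how the known noise in age and the horseshoe prior could be combined in brms. The data-frame name all.cors.filt.df and the se.age column holding the per-sample age uncertainty are assumptions about your data, not something from your post:

library(brms)

# Build the formula explicitly: all columns except age and the (assumed) se.age column
pred_names <- setdiff(names(all.cors.filt.df), c("age", "se.age"))
bform <- as.formula(paste("age | se(se.age, sigma = TRUE) ~",
                          paste(pred_names, collapse = " + ")))

# Regularized horseshoe; par_ratio is the assumed ratio of non-zero to zero
# coefficients (e.g. roughly 50 relevant out of 300 candidates)
hs_prior <- set_prior(horseshoe(df = 1, par_ratio = 50 / 250), class = "b")
# Alternative: R2D2 prior, e.g. set_prior(R2D2(mean_R2 = 0.3, prec_R2 = 2), class = "b")

fit_hs <- brm(bform, data = all.cors.filt.df, family = gaussian(),
              prior = hs_prior, chains = 4, cores = 4)

With se(se.age, sigma = TRUE) the model includes both the known per-observation measurement error and an estimated residual standard deviation; without sigma = TRUE, the known SEs would be the only noise term.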


Thanks a lot for the help! I read the paper, learnt a lot, and it looks like exactly the case of my data: just as in Figure 3 of the paper, even when I start with p=26 variables after the heavy filtering I described before, the marginals overlap zero. Now I need a bit of time to apply the approach to my data, but I will come back and post the code once I have something more advanced. It already seems computationally feasible, which is great.

I'm trying to build the SPCA-based reference model and use the fast projection predictive approach for variable selection.

I started with all my data without any filtering (p=164906) and split them into training (n=74) and test (n=20). I added the standard error of my response variable (age) to the dataset.

library(rsample)
library(dplyr)

# Stratified 80/20 train/test split on age
set.seed(123)
splits <- initial_split(meth.complete, strata = age, prop = 4/5)
age_other <- training(splits)
age_test <- testing(splits)

# Predictors, response, and the known standard error of the response
x <- age_other %>% select(-c(age, se.age))
y <- age_other$age
se.y <- age_other$se.age

With the training data I performed dimensionality reduction using supervised principal component analysis (SPCA) from the dimreduce package.

library(dimreduce)
dr1 <- spca(x, y, preprocess = TRUE)  # supervised PCA on the training predictors
z1 <- predict(dr1, x)                 # project training data onto the supervised PCs

The plot shows a better separation by age.

library(ggplot2)
p <- ggplot() + geom_point(aes(x = z1[, 1], y = z1[, 2], color = y), show.legend = TRUE)
p + scale_color_viridis_c()  # first two supervised PCs, coloured by age

Now, I understand that I need to construct the SPCA-based reference model using the principal components from spca, so I used brms for modelling, which also allowed me to include the known standard error of my response variable.

library(brms)
# se(se.y) passes the known measurement error of the age estimates to the model
data <- data.frame(z1, y, se.y)
fit <- brm(y | se(se.y) ~ ., data = data, family = gaussian())

In the next step, should I then use cv_varsel() from projpred with this reference model, with the principal components as variables? How is it going to select from the initial 164906 variables?

That is a challenging data set.

You need to use the init_refmodel() function. See Reference model and more general information — refmodel-init-get • projpred, and the full example code using spca (from the paper Using reference models in variable selection | Computational Statistics):
https://github.com/fpavone/ref-approach-paper/blob/a15f821d76a05d6934b672865332a643a53ac8dd/code/minimal_subset.R

With that many variables you want to use either method='L1' or, if using method='forward', limit the search, e.g., with nterms_max=20. Try first with validate_search=FALSE just to test how much time one search path takes (see more in [2306.15581] Robust and efficient projection predictive inference). If that works, then rerun with validate_search=TRUE and possibly with cv_method='kfold', K=10.
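A rough sketch of that workflow (assuming refmodel is the object returned by init_refmodel(), built as in the linked minimal_subset.R; its construction is not shown here):

library(projpred)

# Quick run: a single L1 search path without cross-validating the search,
# just to gauge how long one path takes
vs <- cv_varsel(refmodel, method = "L1", nterms_max = 20,
                validate_search = FALSE)
plot(vs, stats = "mlpd")

# If that is feasible, rerun with the search cross-validated
vs_cv <- cv_varsel(refmodel, method = "L1", nterms_max = 20,
                   validate_search = TRUE, cv_method = "kfold", K = 10)
suggest_size(vs_cv)
solution_terms(vs_cv)  # the selected original covariates

Roughly, the search runs over the original covariates given to init_refmodel(), while the SPCA-based brms fit only supplies the reference model's predictive distribution, so the selected terms are original variables, not principal components.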

You may also try a simpler and faster approach like loc.fdr combined with the reference model (see Section 4.2 in Using reference models in variable selection | Computational Statistics)
