New preprint: Biomedical application + Bayesian approach + Uncertainty estimation

Hello! I wanted to share our new preprint:

CNVscore calculates pathogenicity scores for copy number variants together with uncertainty estimates accounting for learning biases in reference Mendelian disorder datasets

We developed CNVscore, a supervised learning model that combines gradient boosting with a Bayesian logistic regression under a generalized horseshoe prior, to classify pathogenic and benign Copy Number Variants (CNVs). Unlike alternative supervised-learning approaches, CNVscore pairs each pathogenicity score with an estimate of uncertainty, making it possible to evaluate how suitable the training set is for the query variants.
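
For readers less familiar with this family of priors, here is a rough sketch of a logistic regression under the standard horseshoe prior. It is only an illustration: the generalized horseshoe used in the preprint extends this form and may be parameterized differently, and all symbols (x_i for the input vector of CNV i, y_i = 1 for pathogenic, the global scale \tau_0) are notation assumed here, not taken from the paper.

```latex
\begin{aligned}
y_i \mid \alpha, \beta &\sim \mathrm{Bernoulli}\!\left(\operatorname{logit}^{-1}\!\left(\alpha + x_i^\top \beta\right)\right), \\
\beta_j \mid \lambda_j, \tau &\sim \mathcal{N}\!\left(0,\ \tau^2 \lambda_j^2\right), \\
\lambda_j &\sim \mathrm{C}^{+}(0, 1), \\
\tau &\sim \mathrm{C}^{+}(0, \tau_0).
\end{aligned}
```

The local scales \lambda_j let a few coefficients escape shrinkage while the global scale \tau pulls the rest toward zero. The uncertainty estimate then comes from the spread of the posterior of the predicted probability rather than from a single point estimate.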

In comparative benchmark tests across independent sets, CNVscore pathogenicity scores reached classification performance similar to that of state-of-the-art techniques. Furthermore, CNVscore identified low-uncertainty CNV subsets on which supervised-learning approaches achieved higher classification accuracy.

Comments and feedback are more than welcome!

Francisco Requena


I skimmed your Figure 3:

A gradient-boosting model was first trained to classify pathogenic and benign CNVs on 38 genome-wide features. Each of the resulting trees was decoupled into a set of independent decision rules, which were used to annotate CNVs in a binary manner. These binary vectors were used as input to train a Bayesian generalized linear regression model on the same CNV sets. The likelihoods of the model parameters were combined with priors to generate their posterior probability.
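
To check that I am reading the rule-decoupling step correctly, here is a minimal sketch of how I picture it. Everything in it is an assumption for illustration: the toy tree, the feature names `n_genes` and `length_kb`, the thresholds, and the choice of one rule per leaf (the preprint may extract rules differently, for example from internal nodes too).

```python
# Hypothetical toy tree standing in for one tree of the boosted ensemble.
# Feature names and thresholds are made up for illustration.
tree = {
    "feature": "n_genes", "threshold": 3.0,
    "left": {"leaf": True},
    "right": {
        "feature": "length_kb", "threshold": 500.0,
        "left": {"leaf": True},
        "right": {"leaf": True},
    },
}


def extract_rules(node, path=()):
    """Return one rule per leaf: the conjunction of conditions on its path."""
    if node.get("leaf"):
        return [path] if path else []
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], path + ((feature, "<=", threshold),))
    right = extract_rules(node["right"], path + ((feature, ">", threshold),))
    return left + right


def rule_fires(rule, cnv):
    """True if the CNV satisfies every condition in the rule."""
    return all(
        cnv[feature] <= threshold if op == "<=" else cnv[feature] > threshold
        for feature, op, threshold in rule
    )


def annotate(cnv, rules):
    """Binary vector with one entry per rule: 1 if the rule fires, else 0."""
    return [int(rule_fires(rule, cnv)) for rule in rules]


rules = extract_rules(tree)

# Three made-up CNVs described by the same hypothetical features.
cnvs = [
    {"n_genes": 1, "length_kb": 120.0},
    {"n_genes": 8, "length_kb": 300.0},
    {"n_genes": 8, "length_kb": 900.0},
]

# Rows of this matrix would be the inputs to the Bayesian logistic regression.
X_rules = [annotate(cnv, rules) for cnv in cnvs]
print(X_rules)
```

With a full ensemble, the rule lists from all trees would be concatenated, so each CNV gets one long 0/1 vector, and the regression coefficients are then per-rule weights.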

Is Stan only used for the logistic regression?
Also, since the rules that feed your logistic regression were already learned and selected on the same CNV sets, do you have to worry about feature-selection bias in your Bayesian uncertainty estimates?