Encoding factor similarities to predict on unseen data

backes · November 12, 2021, 11:21pm

Hello,

I am trying to predict a value for data that was not in the training set. I have two separate variables that are giving me a headache, but I think both can be handled the same way. To make it clearer I'll present them individually, although in reality both are in the same model.

For the first I am trying to predict a value measured in different organisms under different conditions. The model might look like:

brm(value ~ 1 + temperature + (1 | organism))

Instead of simply fitting organism as a plain factor, I found this vignette:
https://cran.r-project.org/web/packages/brms/vignettes/brms_phylogenetics.html
It lets me use phylogenetic data to encode the similarity between organisms. I want to use this model to make predictions, and it is possible that I will need to predict the value for an organism that has no data in the training set.
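For reference, the approach in that vignette boils down to roughly the following sketch. The names `fit_phylo_model`, `dat` and `tree` are placeholders for illustration; `ape::vcv.phylo()`, `gr(..., cov = )` and the `data2` argument are what the vignette itself uses. The fitting call is wrapped in a function so that nothing is compiled until it is actually called.

```r
# Sketch only: `tree` is assumed to be an ape "phylo" object and `dat` a data
# frame with columns value, temperature and organism (placeholder names).
fit_phylo_model <- function(dat, tree) {
  # Phylogenetic covariance matrix. Its row/column names are the tip labels
  # and have to match the levels of dat$organism.
  A <- ape::vcv.phylo(tree)

  brms::brm(
    value ~ 1 + temperature + (1 | gr(organism, cov = A)),
    data = dat,
    data2 = list(A = A),   # the covariance matrix is passed via data2
    family = gaussian()
  )
}
```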

Is there a way to encode the phylogenetic relation differently, such that for a new organism more weight is given to the more closely related samples? I can assume that I know where the new organism sits in the phylogenetic tree and can therefore precompute this. There is the allow_new_levels argument, which lets me make predictions for unseen factor levels, but I'm not sure how it works, and I don't know how to incorporate more phylogenetic information.
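For context, going by the brms documentation, `allow_new_levels = TRUE` is used together with `sample_new_levels`, which can be "uncertainty" (the default, draws for a new level are taken from randomly chosen existing levels), "gaussian" (draws from the normal distribution implied by the group-level standard deviation) or "old_levels". A minimal sketch of such a call, where `predict_unseen`, `fit` and `new_dat` are placeholder names; as far as I can tell, none of these options uses the position of the new level in a tree.

```r
# Sketch only: `fit` is assumed to be a fitted brmsfit and `new_dat` a data
# frame with the model's predictors plus an organism label that did not
# occur in the training data (placeholder names).
predict_unseen <- function(fit, new_dat, how = "uncertainty") {
  predict(
    fit,
    newdata = new_dat,
    allow_new_levels = TRUE,   # accept group levels not seen during fitting
    sample_new_levels = how    # "uncertainty", "gaussian" or "old_levels"
  )
}
```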

Ideally I would take an existing phylogenetic tree (e.g. from the NCBI Taxonomy database), compute the distance/similarity between all organisms (not only those in the training set), and somehow pass this to the model so that it can be used during prediction.
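To make the idea concrete: if the organism effects are assumed to be jointly Gaussian with a covariance proportional to a similarity matrix over all organisms, then the effect of an unseen organism given the known effects follows from standard multivariate-normal conditioning, and closer relatives automatically get more weight. Below is a minimal base-R sketch under that assumption. The function `condition_new_level`, the matrix `A_all` and the effects `u_old` are made-up toy values, and this is not something brms does out of the box.

```r
# Conditional mean and variance of a new level's effect given known effects,
# assuming u ~ MVN(0, sigma2 * A_all). All inputs below are toy values.
condition_new_level <- function(A_all, u_old, new, sigma2 = 1) {
  old <- names(u_old)
  A_oo <- A_all[old, old, drop = FALSE]   # similarity among known levels
  A_no <- A_all[new, old, drop = FALSE]   # similarity of new level to known ones
  w <- A_no %*% solve(A_oo)               # weights placed on the known levels
  list(
    weights = drop(w),
    mean = drop(w %*% u_old),
    var = sigma2 * drop(A_all[new, new] - w %*% t(A_no))
  )
}

species <- c("sp_new", "sp_a", "sp_b")
A_all <- matrix(
  c(1.0, 0.8, 0.1,
    0.8, 1.0, 0.1,
    0.1, 0.1, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(species, species)
)
u_old <- c(sp_a = 2, sp_b = -1)   # effects of the organisms seen in training

res <- condition_new_level(A_all, u_old, new = "sp_new")
print(res$weights)   # sp_a, the close relative, gets the larger weight
```

In a Bayesian fit one would apply this to each posterior draw of the effects and of the group-level standard deviation rather than to point estimates.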

For the second variable I give each experiment a hierarchical category which encodes the type of experiment. The model (ignoring organism) might look identical to the first one:

brm(value ~ temperature + (1 | category))

The category is composed of three numbers: a.b.c and can be seen as a flattened tree. Each level makes it more specific and we can measure similarity/distance as follows:

d(1.1.1, 1.1.1) = 0
d(1.1.x, 1.1.1) = 1 for x != 1

d(1.2.x, 1.1.1) = 2
d(1.x.y, 1.e.f) = 2 (if e != x)
d(2.x.y, 1.e.f) = 3

I hope you get the idea. There is only a finite number of categories, so, as with the phylogenetic tree, I could precompute the similarity of each pair of categories and pass it to the model as an n × n matrix. That way, if there is a category for which no experiment exists, I can leverage this knowledge and hopefully weight similar categories more heavily in the prediction.
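The distance rule above amounts to "3 minus the number of leading components the two codes share", so the matrix can be precomputed along these lines. This is a base-R sketch; `category_distance` and the example codes are made up for illustration.

```r
# Tree distance between two "a.b.c" category codes:
# 3 minus the number of leading components they share.
category_distance <- function(x, y) {
  px <- strsplit(x, ".", fixed = TRUE)[[1]]
  py <- strsplit(y, ".", fixed = TRUE)[[1]]
  stopifnot(length(px) == 3, length(py) == 3)
  # Position of the first mismatch minus one is the shared prefix length.
  shared <- match(FALSE, px == py, nomatch = 4L) - 1L
  3L - shared
}

# Pairwise distances over every category, seen or unseen (toy codes).
categories <- c("1.1.1", "1.1.2", "1.2.1", "2.1.1")
D <- outer(categories, categories, Vectorize(category_distance))
dimnames(D) <- list(categories, categories)
print(D)
```

Note that a distance matrix is not itself a covariance matrix. To use it as a covariance structure it would first have to be turned into a similarity matrix (for example by some decreasing function of the distance), and that matrix would have to be checked for positive definiteness.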

I am very new to hierarchical and Bayesian models, so if my approach is totally wrong, please let me know.

Thank you
