# Encoding factor similarities to predict on unseen data

Hello,

I am trying to predict a value for data that was not in the training set. Two separate variables are giving me a headache, but I think both can be handled the same way. To keep things clear I’ll present them individually, although in reality both are in the same model.

For the first, I am trying to predict a value measured in different organisms under different conditions. The model might look like:

brm(value ~ 1 + temperature + (1 | organism))


Instead of simply fitting organism as an unstructured factor, I found this vignette:
https://cran.r-project.org/web/packages/brms/vignettes/brms_phylogenetics.html
It shows how to use phylogenetic data to encode the similarity between organisms. I want to use this model to make predictions, and I may need to predict the value for an organism that has no data in the training set.

Is there a way to encode the phylogenetic relation such that, for a new organism, more weight is given to the more closely related organisms in the training data? I can assume that I know where the new organism sits in the phylogenetic tree, so I can precompute this. There is the allow_new_levels argument, which lets me make predictions for unseen factor levels, but I’m not sure how it works, and I don’t know how to incorporate more phylogenetic information.

Ideally I would use an existing phylogenetic tree (e.g. from NCBI Taxonomy), compute the distance/similarity between all organisms (not only those in the training set), and pass this to the model so that it can be used during prediction.
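
One way to make "more weight for closely related organisms" concrete is ordinary Gaussian conditioning. If the organism effects are modelled as multivariate normal with covariance `sd^2 * A`, where `A` is a phylogenetic correlation matrix over all organisms (observed and new), then the effect of a new organism given the observed ones has mean `A_no %*% solve(A_oo) %*% u_old`. The sketch below uses base R only and shows how this could be done by hand. It is an assumption on my part, not brms's own prediction code. The function `predict_new_effect`, its arguments, and the toy numbers are all hypothetical. In practice `A` could come from `ape::vcv.phylo()` as in the vignette, and the function would be applied to each posterior draw of the group effects and their standard deviation.

```r
# Conditional distribution of a new organism's group effect, given the
# effects of the observed organisms, assuming u ~ MVN(0, sd^2 * A).
# All names here are hypothetical and only illustrate the idea.
predict_new_effect <- function(u_old, A, new, sd) {
  old  <- names(u_old)
  A_oo <- A[old, old, drop = FALSE]   # correlations among observed organisms
  A_no <- A[new, old, drop = FALSE]   # correlations new vs. observed
  w    <- A_no %*% solve(A_oo)        # weights given to each observed organism
  list(
    mean    = drop(w %*% u_old),
    var     = sd^2 * drop(A[new, new] - w %*% t(A_no)),
    weights = drop(w)
  )
}

# Toy correlation matrix: "c" is unseen and closely related to "a".
sp <- c("a", "b", "c")
A <- matrix(c(1.0, 0.2, 0.8,
              0.2, 1.0, 0.2,
              0.8, 0.2, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(sp, sp))

u_old <- c(a = 1.0, b = -0.5)  # one (made-up) draw of the fitted effects
res <- predict_new_effect(u_old, A, new = "c", sd = 1)
print(res)
```

In this toy example the weight on "a" is much larger than the weight on "b", which is the behaviour I am after. The conditional variance is also smaller than `sd^2`, so a close relative in the training set reduces the uncertainty for the new organism.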

For the second variable, I give each experiment a hierarchical category that encodes the type of experiment. The model (ignoring organism) might look almost identical to the first one:

brm(value ~ temperature + (1 | category))


The category is composed of three numbers, a.b.c, and can be seen as a flattened tree. Each level makes the category more specific, and we can measure the distance between two categories as follows:

d(1.1.1, 1.1.1) = 0
d(1.1.x, 1.1.1) = 1 for x != 1

d(1.2.x, 1.1.1) = 2
d(1.x.y, 1.e.f) = 2 (if e != x)
d(2.x.y, 1.e.f) = 3


I hope you get the idea. There is only a finite number of categories, so, as with the phylogenetic tree, I could precompute the similarity between all categories and pass it to the model as an $n \times n$ matrix. That way, if there’s a category for which no experiment exists, I can leverage this knowledge and hopefully weight similar categories more heavily in the prediction.
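
To make the distance above concrete, here is a minimal base R sketch that computes it for categories written as "a.b.c" strings and builds the full matrix. The helper names `category_distance` and `category_distance_matrix` and the example categories are made up for illustration. Turning the distance into a similarity via `1 - D / 3` is my own assumption: it treats the categories like leaves of a tree with unit branch lengths, so the similarity is the shared path length from the root, rescaled to have ones on the diagonal.

```r
# Distance between two hierarchical categories such as "1.2.3":
# the number of levels minus the number of leading components they share.
category_distance <- function(x, y) {
  xs <- strsplit(x, ".", fixed = TRUE)[[1]]
  ys <- strsplit(y, ".", fixed = TRUE)[[1]]
  stopifnot(length(xs) == length(ys))
  # cumprod() zeroes out every position after the first mismatch
  shared <- sum(cumprod(xs == ys))
  length(xs) - shared
}

# Pairwise distance matrix for a vector of category labels.
category_distance_matrix <- function(categories) {
  n <- length(categories)
  D <- matrix(0, n, n, dimnames = list(categories, categories))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      D[i, j] <- category_distance(categories[i], categories[j])
    }
  }
  D
}

# Include every category of interest, not only those in the training set.
cats <- c("1.1.1", "1.1.2", "1.2.1", "2.1.1")
D <- category_distance_matrix(cats)

# Assumed similarity: fraction of the path from the root that is shared.
S <- 1 - D / 3
print(D)
print(S)
```

A matrix like `S` has the same shape as the phylogenetic matrix in the vignette, so my hope is that it could be passed to the model in the same way.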

I am very new to hierarchical and Bayesian models, so if my approach is totally wrong, please let me know.

Thank you