How to add customized constraints when making predictions using posterior_predict

Hi, below is a simple hierarchical Bayes model that I want to fit. Here, Y_ij can be thought of as the expression of gene i in cell j measured by technology 1, and x_ij is the expression measured by technology 2. Y_ij is count data, so I model it with a zero-inflated negative binomial (ZINB) distribution. x_ij is a relative proportion, so it lies between 0 and 1. Cell size is used as an offset.

The key part of the model is alpha_i, which represents the distortion between the two technologies. It is simply a group-level effect: all cells from the same gene are distorted in the same way, while different genes are distorted in different ways.

Eventually, what I want is that for a new gene (say we measured its expression in 10 cells), I can predict its expression in technology 1 from its expression in technology 2. This requires proper prediction/sampling of the distortion (alpha_i) for the new gene. I want the sampling to work this way:
(1) When predicting this gene for all 10 cells, all cells should share the same level of distortion, i.e., a single alpha_i for the 10 cells rather than one alpha_i sampled per cell.
(2) If the gene is in the training data, it already has an estimated alpha_i. My second constraint is that when predicting for a gene already seen during training, I want to reuse that estimated alpha_i instead of randomly sampling a new one.
(3) In reality, the distortion of each gene may not be random. For example, longer genes may have higher alpha_i than shorter ones. The correlation between gene length and distortion may not be strong, so I don’t plan to include it in the model. But when making predictions, I want to assign a higher alpha_i to a longer gene instead of sampling alpha_i completely at random from its posterior distribution.

My first question is: can posterior_predict achieve these three constraints when making predictions for new data? More generally, can I control how the group-level effect is sampled when predicting for new data?

My understanding is that for (1) and (2), the default posterior_predict can already do it. But after reading the description of prepare_predictions, I suspect the default posterior_predict samples alpha_i for each cell separately, even for cells from the same gene. And even for a gene seen in the training data, it will sample a new alpha_i at prediction time instead of reusing the alpha_i estimated during training. Is this true? In general, how does posterior_predict sample new group-level effects?

I tried to look into the source code of posterior_predict. I found posterior_predict.R on GitHub, but it doesn’t contain the details, so I am trying to write my own code. Here is a description of the procedure:

If we want to make a prediction for a new gene with 10 cells:
(1) The inputs are x_ij and the cell sizes.
(2) Sample alpha_i from its posterior distribution. Since this is an independent step, I can control the sampling; for example, I can reuse the alpha_i estimated during training if the gene is in the training set.
(3) Randomly sample the other parameters (alpha, theta, and pie) from their posterior distributions.
(4) Plug all these values into the model and draw counts from the ZINB distribution.
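The four steps above can be sketched as follows. This is only an illustration, not brms code: I assume, for the sake of a concrete example, a log-link linear predictor log(mu_ij) = alpha + alpha_i * x_ij + log(size_j); the actual linear predictor should be taken from the fitted model. The parameter names (alpha, alpha_i, theta, pie) follow the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_new_gene(x, size, alpha_i_draws, alpha_draws, theta_draws, pie_draws):
    """Posterior predictive draws for one new gene observed in len(x) cells.

    x              : (n_cells,) expression measured by technology 2
    size           : (n_cells,) cell-size offsets
    alpha_i_draws  : (n_draws,) posterior draws of the gene-level distortion
                     (constraint 1: one value per draw, shared by all cells)
    alpha_draws, theta_draws, pie_draws : (n_draws,) draws of the other parameters

    Assumed illustration model: log(mu_ij) = alpha + alpha_i * x_ij + log(size_j);
    adjust this line to match the actual fitted model.
    """
    n_draws = len(alpha_draws)
    y_rep = np.empty((n_draws, len(x)), dtype=int)
    for s in range(n_draws):
        mu = np.exp(alpha_draws[s] + alpha_i_draws[s] * x + np.log(size))
        # ZINB: structural zero with probability pie, otherwise NB with
        # mean mu and dispersion theta (success prob. p = theta / (theta + mu))
        nb = rng.negative_binomial(theta_draws[s],
                                   theta_draws[s] / (theta_draws[s] + mu))
        zero = rng.random(len(x)) < pie_draws[s]
        y_rep[s] = np.where(zero, 0, nb)
    return y_rep
```

Constraint (2) then amounts to passing, as alpha_i_draws, the posterior draws of alpha_i for that gene from training, instead of draws for a newly sampled level.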

I don’t know much about MCMC sampling, so I am doing the prediction directly from the model. Can anyone tell me if this is correct?

Thank you!

A quick version of the post if you don’t have time:

When making posterior predictive draws for new data using a fitted model with a group-level effect, is there any way to control how the group-level effect is sampled? Can I add constraints to it instead of purely random sampling from the posterior predictive distribution?


Some options are documented in ?prepare_predictions.

Perhaps that helps already.

Thanks for the reply!

Yes, I have checked the documentation of prepare_predictions. As I understand it, the relevant argument is sample_new_levels. If I read it correctly, “uncertainty” means that draws for the same new level are selected at random and can come from different existing levels (also chosen at random), whereas “old_levels” means that draws for the same new level all come from one existing level, and that existing level is chosen at random from all existing levels. Is this correct? If so, neither does what I want. My goal is that draws for the same new level come from a single existing level, but that level is not chosen at random from all existing levels: it is selected based on some constraint. For example, if the new level was already seen in the training data, then the group-level effect estimated during training should be used as the “existing level” in prediction, rather than a randomly sampled one. Is there any way to achieve this?
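To make the distinction concrete, here is my reading of the two sample_new_levels schemes next to the constrained selection I am after, sketched over a matrix of posterior draws (one row per draw, one column per existing gene). This mimics my understanding of ?prepare_predictions, not brms’s actual implementation; the "constrained" branch is the behavior I am asking for, not something brms offers.

```python
import numpy as np

rng = np.random.default_rng(1)

def alpha_for_new_level(alpha_post, scheme, chosen_col=None):
    """Pick gene-level effects for a new level from existing posterior draws.

    alpha_post : (n_draws, n_existing_genes) posterior draws of alpha_i
    scheme     : "uncertainty" - each draw borrows from a freshly chosen
                                 existing gene (my reading of brms)
                 "old_levels"  - one existing gene is chosen at random and
                                 reused for every draw (my reading of brms)
                 "constrained" - the caller picks the existing gene, e.g. the
                                 same gene if it was in the training data, or
                                 a training gene of similar length
    Returns a (n_draws,) vector of alpha_i values for the new level.
    """
    n_draws, n_genes = alpha_post.shape
    if scheme == "uncertainty":
        cols = rng.integers(n_genes, size=n_draws)   # new column per draw
        return alpha_post[np.arange(n_draws), cols]
    if scheme == "old_levels":
        col = rng.integers(n_genes)                  # one random column
        return alpha_post[:, col]
    if scheme == "constrained":
        return alpha_post[:, chosen_col]             # caller-chosen column
    raise ValueError(scheme)
```

With "constrained", constraint (2) is just chosen_col = the index of the gene in the training data, and constraint (3) could pick the training gene closest in length to the new gene.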

Thank you!

I see. This is unfortunately not yet possible in brms.

Sure. I am thinking about implementing it myself. I saw from this post that, for predictions on new data, I can sample values of each parameter (in my case alpha, alpha_i, theta, and pie) from its posterior distribution and then sample yrep from the ZINB distribution using those sampled values. Is this how posterior_predict works? If so, I can add my constraints to the first step to customize the sampling. I tried to find the source code of posterior_predict, but the posterior_predict.R file on GitHub doesn’t include all the details…

This is basically how the posterior_predict function works, yes.