Hi, below is a simple hierarchical Bayesian model that I want to fit. Here, Y_ij is the expression of gene i in cell j measured by technology 1, and x_ij is the expression measured by technology 2. Y_ij is count data, so I model it with a zero-inflated negative binomial (ZINB) distribution. x_ij is a relative proportion, so it lies between 0 and 1. Cell size is used as an offset.
The key part of the model is alpha_i, which represents the distortion between the two technologies. It is simply a group-level effect: for a given gene, all cells are distorted in the same way, while different genes are distorted in different ways.
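For concreteness, a simplified brms version of what I am fitting looks roughly like this (a minimal sketch with priors omitted; the variable names y, x, cell_size, and gene are placeholders, and I am writing alpha_i as a gene-specific slope on x):

```r
library(brms)

# Sketch of the model: ZINB counts, cell size as an offset, and the
# distortion alpha_i as a gene-level slope on x via (0 + x | gene).
fit <- brm(
  y ~ x + offset(log(cell_size)) + (0 + x | gene),
  data   = dat,
  family = zero_inflated_negbinomial()
)
```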
Eventually, what I want is this: for a new gene (say we measured its expression in 10 cells), predict its expression in technology 1 from its expression in technology 2. This requires properly predicting/sampling the distortion alpha_i for the new gene. In my case, I want the sampling to work this way:
(1) When predicting this gene for all 10 cells, I want every cell to have the same level of distortion, i.e., one shared alpha_i for these 10 cells, rather than a separate alpha_i sampled for each cell.
(2) If this gene is in the training data, it already has an estimated alpha_i. My second constraint is that when predicting for a gene seen during training, I want to reuse that alpha_i estimated during training instead of randomly sampling a new one.
(3) In reality, the distortion of each gene may not be random. For example, longer genes may have higher alpha_i than shorter genes. The correlation between gene length and distortion may not be strong, so I don't plan to include it in the model. But when making predictions, I want to assign a higher alpha_i to a longer gene instead of sampling alpha_i completely at random from its posterior distribution (one possible way is sketched right after this list).
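One idea I have for (3), assuming the slope form of the model above (train_gene_lengths and new_gene_length are hypothetical objects I would build myself):

```r
# Hypothetical sketch: pick alpha_i deterministically from gene length by
# matching the gene's length quantile to the same quantile of the estimated
# gene-level distribution (brms names the group SD of the x slope sd_gene__x).
draws   <- as_draws_df(fit)                           # posterior draws
len_q   <- ecdf(train_gene_lengths)(new_gene_length)  # length quantile in [0, 1]
alpha_i <- qnorm(len_q, mean = 0, sd = mean(draws$sd_gene__x))
```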
My first question is: can posterior_predict achieve these three constraints when making predictions for new data? In general, can I control how the group-level effect is sampled when predicting for new data?
My understanding is that the default posterior_predict can already handle (1) and (2). But after reading the description of prepare_predictions, I suspect that the default posterior_predict samples a separate alpha_i for each cell, even for cells of the same gene, and that for a gene seen in the training data it samples a new alpha_i instead of using the one estimated during training. Is this true? In general, how does posterior_predict sample new group-level effects?
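For reference, this is the call I am referring to (newdat is a hypothetical 10-row data frame with columns x, cell_size, and gene):

```r
# Predict the new gene's counts; allow_new_levels and sample_new_levels are
# the prepare_predictions arguments that control how new alpha_i are drawn.
pp <- posterior_predict(
  fit,
  newdata           = newdat,
  allow_new_levels  = TRUE,          # the new gene is not a training level
  sample_new_levels = "uncertainty"  # default; "gaussian" and "old_levels" also exist
)
```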
I tried to look into the source code of posterior_predict. I found posterior_predict.R on GitHub, but it doesn't contain the details I need, so I am trying to write my own code. Here is the procedure (a code sketch follows the list):
Suppose we want to predict a new gene measured in 10 cells:
(1) The inputs are x_ij and the cell size of each cell j.
(2) Sample alpha_i from its posterior distribution. Since this is an independent step, I can control the sampling; for example, I can reuse the alpha_i from training if this gene is in the training set.
(3) Randomly sample the other parameters (e.g., the population-level alpha, the dispersion theta, and the zero-inflation probability pi) from their posterior distribution.
(4) Plug all these values into the model and draw counts from the resulting ZINB distribution.
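Putting the steps together, here is a sketch of what I mean (assuming the slope form of the model above; geneA, x_new, and cell_size are placeholders, and I keep all parameters from the same posterior draw together):

```r
draws   <- as_draws_df(fit)          # one row per posterior draw
n_draws <- nrow(draws)
n_cells <- 10

# Step (2): fix alpha_i once for the gene. For a training gene, reuse its
# per-draw estimates; for a new gene, substitute my own values here.
alpha_i <- draws$b_x + draws[["r_gene[geneA,x]"]]

y_pred <- matrix(NA_integer_, n_draws, n_cells)
for (j in seq_len(n_cells)) {
  # Step (4): linear predictor with the cell-size offset, then a ZINB draw
  mu   <- exp(log(cell_size[j]) + draws$b_Intercept + alpha_i * x_new[j])
  # Step (3): zero-inflation (zi) and dispersion (shape) from the same draws
  zero <- rbinom(n_draws, 1, draws$zi)
  y_pred[, j] <- ifelse(zero == 1L, 0L,
                        rnbinom(n_draws, size = draws$shape, mu = mu))
}
```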
I don't know much about MCMC sampling, so I am doing the prediction directly from the model. Can anyone tell me if this is correct?
Thank you!