Specifying conditional random intercept

I am trying to fit a model to predict gene expression in experiment B based on gene expression in experiment A. The initial model is listed below:

Screen Shot 2021-01-22 at 11.35.19 AM

In this model, y_ij is the expression in experiment B for gene i in sample j. x_ij is the same thing in experiment A. The global intercept (alpha) and gene-specific random intercept (alpha_i) represents the variation between the two experiments.

In reality, there are situations that one gene is highly expressed in experiment A (large value in x_i) but its expression is completely lost in experiment B (0 in y_i). This happens at a reasonably high probability. Therefore, the current model cannot capture this effect. In the current model, the only chance for this to happen is that the random intercept of a gene is very very low (close to -Inf). However, since I assume the random intercept to follow a normal distribution, this is not very likely to happen. I need a model to allow this to happen more frequently.

Therefore, I am thinking of modeling this effect in two different ways:

  1. Instead of assuming the gene-specific random intercept (alpha_i) to follow a normal distribution, can I assume it to follow a bimodal distribution (a mixture of current normal distribution and also another distribution with some density around a very low value)? If yes, how to do it in brms? I have seen examples of using mixed distribution in brms but they are for the response variable, not prior.

  2. Another way I can think of is to modified the current model to the model below:

Basically, I want to add an indicator function through a Bernoulli distribution. For genes that behave normally (beta_i = 1), the model will collapse to the initial model. For genes that the expression is completely lost (beta_i = 0), its random intercept will turn into a very low value. The delta in the Bernoulli distribution basically captures this gene silencing event. However, I don’t know how to specify this model using brms. I tried the code below but it doesn’t work.

m <- brm(    
  bf(response ~ 0 + Intercept + X + gene.silence*(1 | gene) + (1 - gene.silence)*very.low.value),          
  data = data2fit, 
  family = zero_inflated_negbinomial(),
  iter = 2000, chains = 4, cores = 4

response is the y here. X here is x. gene.silence is a vector with 0 and 1. 1 means the gene behaves normally. When gene.silence=0, we will assign a very.low.value (here I use -10^6) to it. Here is the error message I got when running this code:

Normally ‘Intercept’ should be an internal variable that doesn’t need to be in the data. I am not sure what goes wrong here.

Essentially the question boils down to how to model a random intercept that behaves differently under two different conditions. Any body have any idea?

  • Operating System: Linux
  • brms Version: brms_2.14.4

Sorry for not getting to your question earlier. It is relevant and well written.

This is almost certainly not possible with brms and is in general a hard model to implement, as it can imply a bimodal posterior which breaks sampling in most cases. To have such a model you would probably need to write your own model in Stan and marginalize out a discrete indicator variable (i.e. for each gene have a parameter representing the probability of one of the components).

This is probably because in R model formula (1 - gene.silence) is not “one minus the value of gene.silence” but rather “use intercept and do not use gene.silence”, (see e.g. multiple regression - To remove more than predictors in lm() function in r - Cross Validated for usage).

This then creates some weird interactions with brms-specific construct of 0 + Intercept.

Does that make sense?

Best of luck with your model!

Thank you very much for your reply! Your answers address my questions. Now I am trying to model gene-specific silencing by modeling the zero-inflation component of the ZINB model using a regression model with random intercept. This way, the model becomes a distributional model. And genes with gene-specific silencing will have a zero-inflation component equal to 1. Hopefully, this can capture the gene-specific silencing. It is still tricky to specify the prior distribution for the random intercept though because it still doesn’t follow a normal distribution (most genes have small zero inflation rate with only a few have zero inflation rate equal to 1). But I guess this is not a big problem…

1 Like

If you run into trouble, definitely start a new topic to get the most eyes on it.

Also note that there is substantial discussion whether the ZINB model is even a good model for RNA-seq data - I tried to dig to the bottom of this some time ago for scRNA-seq data, but I am not much wiser (rna seq - What is the actual cause of excessive zeroes in single cell RNA-seq data? Is it PCR? - Bioinformatics Stack Exchange).

So a prudent strategy might be to fit plain neg. binomial model and then do a posterior predictive check for the number of zeroes and only if the data you are working with actually have more zeroes than the neg. binomial model can explain, then proceed to ZINB. (you note that most genes have small zero inflation rate)

Best of luck with the model!

Yeah I am aware of this. I didn’t mention this in the post because I don’t want to distract people from the question. What I am analyzing is actually spatially resolved transcriptomic data. The experiment generates lots of zeros and fitting a ZINB model is reasonable in this case. Also I guess this gives me more control of the gene specific silencing. Anyway, thank you for your suggestion!

1 Like