Effect of drug on gene expression using Bayes

Hi all, I’m an applied researcher working on assessing the effect of a drug on the expression level of approximately 90 genes. The study used quantitative PCR to examine the number of copies of each gene transcript. What we’re interested in is whether the drug group has higher/lower expression levels of any of these specified genes at follow-up, relative to those who received placebo.

The typical way to analyse a study like this (probably 90% of what occurs in the literature) is to run an independent univariable regression model for each gene, with the group variable included as a binary predictor. Then, to account for the multiple tests, a p-value correction is applied to control the false discovery rate.
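In R, that standard analysis looks something like this (just a sketch; the column names Ct, gene, and group and the data frame dat are hypothetical placeholders for long-format data):

# one regression per gene, extracting the p-value for the group effect
pvals <- sapply(split(dat, dat$gene), function(d) {
  fit <- lm(Ct ~ group, data = d)
  summary(fit)$coefficients[2, 4]  # Pr(>|t|) for the binary group term
})
p.adjust(pvals, method = "BH")     # Benjamini-Hochberg FDR correction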

This strikes me as potentially inefficient. The expression level of many of these genes will be correlated, so the outcomes are not independent. Furthermore, the drug is not ‘clean’ pharmacologically - it has many targets. I would be surprised if it had a large effect on just one gene and no effect on others. For these reasons, I have been considering a Bayesian hierarchical model which allows the assessment of the effect of the drug on each gene to ‘borrow strength’ from the effect of the drug on other genes (i.e. ‘partial pooling’ of drug effect estimates). However, I’ve never modeled multiple correlated outcomes before using Stan or BRMS.

Is my intuition correct that modeling the hierarchical structure of this data might be more efficient? And is such a model structure easy enough to implement in Stan or BRMS? If anyone has any links or texts to recommend which investigate similar problems, I would greatly appreciate it. It always feels risky going against the grain in the literature, so I need to be able to back these methods with a robust understanding of their relative merits. Any help would be much appreciated. Thanks!


Would I be correct in my understanding that the data consists of a single count observation per combination of person and gene?

In the interim, it strikes me that while there is potential for correlation structure in your experiment that you might benefit from incorporating into your modelling, the specific notion that only some genes’ expression will be affected by the drug is best captured by either a mixture model or something like the horseshoe prior. @avehtari might advise on best use of the horseshoe (and whether it’s recommended over the mixture model below), but here’s how I’d do it with a mixture, considering only the post-test data to highlight the mixture approach (I’ll follow up with a pre/post model that includes some correlation structure):

data{
  int num_obs ;
  array[num_obs] int obs_count ;
  int num_subj ;
  array[num_obs] int subj_for_obs ;
  int num_genes ;
  array[num_obs] int gene_for_obs ;
  // 1 = drug, 0 = placebo (post-test data only, so no pre/post indicator)
  array[num_obs] int<lower=0,upper=1> drug_or_placebo ;
}
parameters{
  // subject parameters:
  real subj_mean ;
  real<lower=0> subj_sd ;
  vector[num_subj] subj ;
  // gene parameters:
  vector[num_genes] gene_baseline ; // baseline expression of each gene
  real affected_genes_mean ; // mean drug effect among affected genes
  real<lower=0> affected_genes_sd ;
  real<lower=0> null_genes_sd ; // (presumably small) spread of null-gene effects
  real<lower=0,upper=1> prop_genes_affected ;
  vector[num_genes] gene ; // drug effect on each gene
}
model{
  subj ~ normal(subj_mean, subj_sd) ;
  prop_genes_affected ~ ... ; // put your prior on the proportion of genes affected here
  // mixture prior: each gene's drug effect is either "affected" or "null"
  for(this_gene in 1:num_genes){ // have to loop for log_mix
    target += log_mix(
      prop_genes_affected
      , normal_lpdf(gene[this_gene] | affected_genes_mean, affected_genes_sd)
      , normal_lpdf(gene[this_gene] | 0, null_genes_sd)
    ) ;
  }
  // the per-gene drug effect applies only to drug-group observations
  obs_count ~ count_distribution( // insert whatever count dist is best for your data here
    subj[subj_for_obs]
    + gene_baseline[gene_for_obs]
    + gene[gene_for_obs] .* to_vector(drug_or_placebo)
  ) ;
}

That’s the “centered parameterization” version, but it’s easy to switch either or both of subj & gene to a non-centered parameterization.
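For example, non-centering subj would look something like this (a sketch; gene can be handled the same way):

parameters{
  // (subj_mean, subj_sd, and the gene parameters as above)
  vector[num_subj] subj_raw ; // standard-normal "raw" subject effects
}
transformed parameters{
  // non-centered: shift & scale the raw effects rather than sampling subj directly
  vector[num_subj] subj = subj_mean + subj_sd * subj_raw ;
}
model{
  subj_raw ~ std_normal() ;
  // ...rest of the model unchanged, with subj used exactly as before
}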


I think your intuition is correct here. I have some (quite limited) experience with RT-PCR data (AFAIK RT-PCR is just a different name for quantitative PCR, but I am not a biologist, so I might be mistaken). The biggest problem IMHO would be the usually quite limited sample size - how many replicates do you have per condition? If the sample size is low, it will likely be hard to reliably learn anything about between-gene correlations (“low” depends on the amount of biological variability you have, but < 5 is IMHO hopeless and much more may be required).

Even with a low sample size, a standard non-correlated partial-pooling model might be sensible. I presume from your post that you know how to build those and don’t need help there, but feel free to ask if something is not clear.
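In brms, such a model could look something like this (a sketch; the names Ct, condition, gene, and long_data are hypothetical placeholders for long-format data):

# partial pooling of per-gene intercepts and per-gene condition effects;
# "||" suppresses estimation of correlations between the varying effects
brm(Ct ~ condition + (1 + condition || gene), data = long_data)
# with repeated measures per participant you might also add (1 | subject)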

If you wanted to explicitly model correlations using brms, things get somewhat tricky. There are IMHO two basic ways (I haven’t really tested either of these, there might be mistakes, please do check that the models make sense to you). Either you use a multivariate model:

brm(bf(mvbind(Ct_gene1, Ct_gene2, ..., Ct_geneN) ~ condition) + set_rescor(TRUE))

This will only work when you can assume a normal response, but AFAIK Ct values are reasonably approximated by a normal distribution. The individual Ct values are then treated as draws from a multivariate normal distribution, i.e. the residuals are correlated. Note, however, that this does not make the effects themselves correlated.

Or you can keep the data in long format and use something like

Ct ~ (0 + gene | gr(subject, by = condition))

which will estimate a between-gene correlation matrix separately for each condition. (Here subject stands for the participant identifier: the per-gene effects vary across participants, and by = condition gives each condition its own correlation matrix.)
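Putting that together, a complete call might look something like this (again a sketch; subject and long_data are hypothetical names, and the fixed-effect part is just one way to fill it in):

brm(
  # each gene gets its own mean and its own (unpooled) condition effect;
  # the varying-effects term captures between-gene correlations, with a
  # separate correlation matrix per condition via gr(..., by = condition)
  Ct ~ 0 + gene + gene:condition + (0 + gene | gr(subject, by = condition)),
  data = long_data
)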

A potentially big issue, however, arises if you are using relative quantification, as differences in relative expression don’t necessarily map easily onto differences in absolute expression. For this to work you need additional assumptions, e.g. that the total amount of mRNA extracted from the cells is the same across conditions OR that some housekeeping genes do not change expression. Both of these might be problematic if an intervention has a strong effect on the cells. (Some good discussion of this in the context of RNA-seq is IMHO https://www.biorxiv.org/content/10.1101/564955v1 - I believe most of the concerns apply to RT-PCR as well.) Note that this issue is AFAIK ignored in most standard analyses too, so you probably don’t need to tackle it immediately, but I wanted to point out that it is there.

In any case, when going against the status quo, I’ve found it useful to include both the “standard” and the “novel” analysis in the manuscript (usually in supplementary material). Quite often the large-scale inferences are the same in both cases, sparing you the need to defend the new approach vigorously. If there are differences, you might be able to point out exactly why they arise, which could then form a strong argument for (or against) your approach.

Best of luck with your model!

This reminds me of a talk I saw (in person!) with a very catchy theme, “Come join the multiple testing party!”

Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation

David Gerard, Matthew Stephens. Biostatistics, Volume 21, Issue 1, January 2020, Pages 15–32.

We combine two important ideas in the analysis of large-scale genomics experiments (e.g. experiments that aim to identify genes that are differentially expressed between two conditions). The first is use of Empirical Bayes (EB) methods to handle the large number of potentially-sparse effects, and estimate false discovery rates and related quantities. The second is use of factor analysis methods to deal with sources of unwanted variation such as batch effects and unmeasured confounders. We describe a simple modular fitting procedure that combines key ideas from both these lines of research. This yields new, powerful EB methods for analyzing genomics experiments that account for both sparse effects and unwanted variation. In realistic simulations, these new methods provide significant gains in power and calibration over competing methods. In real data analysis, we find that different methods, while often conceptually similar, can vary widely in their assessments of statistical significance. This highlights the need for care in both choice of methods and interpretation of results.

Stephens M (2013) A Unified Framework for Association Analysis with Multiple Related Phenotypes. PLoS ONE 8(7): e65245.

and more recently

Bayesian multivariate reanalysis of large genetic studies identifies many new associations


Thank you all for your thoughtful responses. I have been working through the links provided (which have been very helpful) and trying to ground myself in the basics of multivariate regression methods. (Strange that, as someone trained in epidemiology, I have never been formally exposed to modelling multiple related outcomes, given how common this is.)

@mike-lawrence thank you for providing the code example. The mixture model is not immediately intuitive to me, but I am breaking it down line by line. The outcome variables are continuous: they can be converted to counts with a formula, but the outcomes are typically analysed as Ct values, i.e. the number of PCR cycles until a fluorescence threshold is met (with the number of copies of the transcript inversely related to the Ct value). They become continuous rather than counts because they are standardized against ‘reference’ genes whose expression levels are known to be consistent. I’m going to continue to read about mixture models because the logic seems very appropriate for what I’m trying to do.
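(For reference, the standard conversion, assuming amplification efficiency near 100%, i.e. one doubling per cycle, is:

$$\Delta Ct = Ct_{\text{target}} - Ct_{\text{reference}}, \qquad \text{relative expression} \approx 2^{-\Delta Ct}$$

so a lower Ct corresponds to exponentially more transcript.)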

@martinmodrak I have approximately 16 participants per condition, hence I’d really like to capture additional power by modelling things hierarchically. Thanks a lot for the code snippets and the links. The multivariate correlated-residual brms model is very appealing in its simplicity. I may return to you in the future about an uncorrelated partial-pooling model, since I’ve never put one into practice in the context of multiple responses. Thanks a lot for your insights here.

@cmcd Excellent resources. This has sent me back to the drawing board (in a good way).
