Using hierarchical models for High dimensional (protein expression) data

Hey
I will first outline briefly the type of data I deal with below:
I have 6 (biological) samples.
3 samples got treatment A and the other 3 treatment B.
In each sample, we can measure the expression level of many proteins (1000+).
We assume that many proteins will not be influenced by the treatments and will therefore have similar expression levels in both treatments.
Other proteins will be influenced and can have drastically different expression levels between the two treatments.
We assume that the expression levels are log normal distributed.
(The whole setup is actually quite a bit more complex but this minimal example will demonstrate the problem I struggle with)

The scientific question we try to answer with our model is which proteins have different expression levels between the two treatments.
This is also referred to as a differential expression analysis.

I use BRMS to model this data, and in my mind the most simple model is expressed by the following formula:

log(expression) ~ protein_id + treatment:protein_id

This allows me to compare posteriors of treatment A vs B for each protein separately.
However, this model take a long time to sample has a lot of sampling issues (low ESS)
Therefore I tried the following hierarchical model instead:

log(expression) ~ (1| protein_id) + (1: treatment:protein_id)

After setting a very narrow prior on the sd parameters, I am able to get a model with almost no convergence issues and high ESS and rhat’s around 1. (still takes a couple of hours to sample)

However, I was wondering If could have done this better?

One of the issues I can see is that most proteins does not show difference in expression levels, so the shrinkage on the proteins that have different expression levels might be too much.

Also the treatment A and treatment B parameters within each protein are obviously correlated.
If there is no difference in expression between A and B, then both parameters will be closed to zero (and vice versa).

If anybody has any tips and/or has pointers to material that deals with these type of analyses in Bayesian setting, I would be very happy. :) (maybe for related data such as gene expression)

Best regards and thanks in advance!

I work with a lot of high-dimensional data (amplicon community profiling data usually) - I’m not aware of many resources for working within this kind of framework, at least from a Bayesian hierarchical modelling standpoint. Certainly, frequentist approaches seem to me to tend to take the “fit one model per feature” approach instead of going hierarchical.

One thought I have looking at your model is that you might want to explicitly capture the correlation between treated and untreated coefficients, rather than assuming a-priori they’re uncorrelated. Using brms-style syntax:

log(expression) ~ (1 + treatment | protein_id)

By explicitly estimating the covariance between the two treatments, you will allow for pooling of information across treatment types, and it will potentially give you some insight into how the population of protein responses looks (e.g. if you have a positive correlation, then proteins with high levels of expression in untreated samples would on average have higher proportional change in expression in treated samples).

One other question I would put to you are if there are any sample-specific effects - certainly when working with DNA/RNA sequencing data, sequencing depth is not constant between samples, which can lead to variation in overall counts across all measured features in a sample - not accounting for this variation can lead to misleading results. I’m not sure how protein expression levels are measured so this might not be a problem for you, but it would be worth considering including if you think it’s important.

1 Like

Hey

Thanks a lot for your answer (and sorry for my late reply).

You are right, frequentist proteomics and transcriptomics approaches tend to only look at 1 feature ( protein/ gene) at a time and thus missing out on many of the shared information between the thousands of features that can be captured with hierarchical model.
I don’t get why there is no more interest in these approaches. I guess the speed of MCMC on large data is the main bottle neck here.

Indeed, we could expect sample specific “batch” effect in proteomics. I try to deal with these by adding a + (1|sample) in my brms formula.

Thanks for the tip on the correlation between the treatment types! I’ll take a look at it.

I was also thinking about the fact that we expect treatment effect in some of the proteins and none in others . I wonder if I could use the R2D2 prior for this? But I did not figure out how to use it in my case.

Best regards

Hello!

As you mentioned, you could try the R2D2M2 prior. We have tested the R2D2M2 prior in high dimensional real life settings as you see in our paper here.

Here is a small minimal working example that shows how to set the prior and could help you:

library(brms
# Read the data
# I am using the popularity dataset, which I found in this brms tutorial
# https://www.rensvandeschoot.com/tutorials/brms-started/

popular2data <- read_sav(file = "https://github.com/MultiLevelAnalysis/Datasets-third-edition-Multilevel-book/blob/master/chapter%202/popularity/SPSS/popular2.sav?raw=true")

popular2data <- dplyr::select(popular2data, pupil, class, extrav, sex, texp, popular)

# Specifiy the R2D2 prior 
bprior <- prior(R2D2(mean_R2= 0.5, prec_R2=1 ,cons_D2 = 2, main = TRUE), class = sd) + 
  prior(R2D2(), class = b)

# Model with varying slopes

fit1 <- brm(popular ~ 1 + sex + extrav + texp + (1 + extrav | class),  
              data = popular2data, 
              prior = bprior,
              warmup = 1000,
              iter = 2000, 
              cores = 4, 
              chains = 4, 
              seed = 123) 

summary(fit1)
plot(fit1)
3 Likes

Thanks a lot for this answer! I did not had time to try it out in my project yet but I will for sure!