Differential expression with proteomics

Dear stan community,

I’ve recently started out using stan and brms for modelling. The existing and established workflows for differential expression on RNA-seq apparently call for the use negative binomial distributions since such data is based on counts and cannot be continuous. However, I am working with proteomic data obtained with a mass spectrometer. Hence, the relative intensities of each protein are continuous. Since an identified protein always has a positive intensity y, the data can be log-transformed, such that log(y) ~ N (µ,σ2). If we consider a simple experiment, a treatment is imposed on one group of cells but not the other, such that we have a categorical variable drug that either is 0 or 1. We then extract the proteins from each group, run them on the mass spectrometer and obtain intensity values for all detected proteins.

In my naive approach, to assess the proteins that have been differentially expressed, I would then for each protein compare intensity values obtained for each protein as follows with brms:

fit_0 = brm_multiple( bf(y ~ 1), ...)
fit_1 = brm_multiple( bf(y ~ 0 + drug), ...)
loo_compare( loo (fit_0), loo(fit_1) )

However, this resembles to me as something that appears almost as a frequentist approach as I am doing this model evaluation for each individual protein. I am also not sure if this somehow makes me run into a multiple-testing conundrum (which I’ve understood Bayesian statistics not really causes).

Should I rather do this evaluation as a mixed-effects model with another variable p (protein ID)?
Such as

fit_0 = brm( bf(y ~ (1|protein)), ...)
fit_1 = brm( bf(y ~ 0 + drug+(1|protein)), ...)

With this later approach it is not as clear to me how I would observe the proteins that really are differentially expressed. Any comments and feedback are welcome and appreciated!

At least in RNA sequencing, mean and variance are correlated (beyond the negative binomial property), I would expect this would be the case for proteins as well. Building a hierarchical model could help with lowly abundant proteins if any exist.

In the second model you are testing whether your protein pool is overall differentially transcribed. As mentioned you can build a hierarchical version of brm_multiple( bf(y ~ 0 + drug), ...), testing each protein.

github: stemangiola/ppcseq includes such model for RNA sequencing that can be adapted for proteins, as design.

1 Like

You are more on the right track with your second model, but you want to include the drug treatment within your varying effect - as you have it currently specified in fit_1 the drug parameter represents the average change in abundance across each protein, rather than the change for each protein individually. You can formulate the model along the lines of:

brm(y ~ (1 + drug | protein))

This will give you a drug parameter for each protein that should tell you about the differential abundance between drug treatments.The hierarchical model will partially pool information about each protein, which should also help improve differential abundance estimates for rare proteins.

1 Like

Thanks for the input!
@stemangiola I checked out your package but it appears it requires data as counts for input so I did not pursue that option any further

@prototaxites I had been under the impression that drug + (1| protein) was just another way to write (1+drug| entry), but it now makes sense that how the syntax works is that outside of the parenthesis you have effects on the population level, and grouped (nested) effects go inside the parenthesis. This really helped, thanks!

1 Like