Model review: Modelling proteins as a multivariate normal distribution with measurement error

I’m interested in quantifying the proteomic composition of microbes in natural samples. In the model below, I’ve tried to combine two types of proteomic measurements, along with their associated measurement error, and then modelled the proteomic composition as a function of environmental variables. This is conceptually similar to approaches in community ecology, e.g. this. I would love any feedback on the model structure, and on whether I’ve communicated the statistical notation correctly!


data {
  // data dimensions: N observations, K targeted proteins (protein/carbon measurements from Joli et al)
  int<lower=0> N;
  int<lower=0> K;
  
  // fmol per ug protein measurements -- mean estimates
  vector[N] cuznsod_meas;
  vector[N] mnsod_meas;
  vector[N] perox2_meas;

  // measurement error of fmol per ug protein measurement -- standard errors
  vector[N] cuznsod_se;
  vector[N] mnsod_se;
  vector[N] perox2_se;

  // proportion of ug protein in Frag -- mean estimates
  vector[N] frag_proportion_meas;
  
  // measurement error of the Frag protein proportion -- standard errors
  vector[N] frag_proportion_se;
  
  // covariate data values
  vector[N] temp_vals;
  vector[N] fe_vals;
  vector[N] mn_vals;
  vector[N] light_vals;

}

parameters {
  // setting up the covariance matrix
  cholesky_factor_corr[K] chol;
  vector<lower=0>[K] sigma;

  // parameters for true observed values
  vector<lower=0>[N] cuznsod_true;
  vector<lower=0>[N] mnsod_true;
  vector<lower=0>[N] perox2_true;

  vector<lower=0, upper=1>[N] frag_proportion_true; // a proportion, therefore restricted between 0 and 1
  
  // regression model coefficients
  vector<lower=0>[K] beta_0; // intercept, bounded below by zero because all protein abundances are positive
  vector[K] beta_1_temp;
  vector[K] beta_2_fe;
  vector[K] beta_3_mn;
  vector[K] beta_4_light;
}

transformed parameters {
  corr_matrix[K] R; // correlation matrix
  cov_matrix[K] Sigma; // covariance matrix
  matrix[N, K] y_vals; // targeted proteomics values (not normalized by estimated Frag biomass)
  vector[K] y[N]; // estimated fmol protein / ug Fragilariopsis protein. This notation forms an array of N vectors, each of length K.
  
  // Recomposing the Cholesky factorized matrices to make a covariance matrix
  R = multiply_lower_tri_self_transpose(chol); // R = Lcorr * Lcorr'
  Sigma = quad_form_diag(R, sigma); // quad_form_diag: diag_matrix(sig) * R * diag_matrix(sig)
  
  // Designating columns of the y_vals matrix to be the estimated values for the targeted proteomics data
  y_vals[, 1] = cuznsod_true;
  y_vals[, 2] = mnsod_true;
  y_vals[, 3] = perox2_true;

  // Loop through each element of the matrix and assign it to the array
  for (i in 1:N) {
    for (j in 1:K) {
      // These targeted values are divided by the estimated Frag-derived protein
      y[i, j] = y_vals[i, j] / frag_proportion_true[i]; // divide by the true proportion of Frag protein
    }
  }
}

model {
  // This notation forms an array of N vectors, each of length K.
  vector[K] mu[N];
  
  frag_proportion_true ~ normal(0.1, 0.5); // weak prior on the proportion of frag in the sample
  
  // measurement error model to estimate the true proportion of Frag protein
  for (i in 1:N){
    frag_proportion_meas[i] ~ normal(frag_proportion_true[i], frag_proportion_se[i]);
  }
  
  // measurement error model for targeted measurements
  for (i in 1:N){
      cuznsod_meas[i] ~ normal(cuznsod_true[i], cuznsod_se[i]);
      mnsod_meas[i] ~ normal(mnsod_true[i], mnsod_se[i]);
      perox2_meas[i] ~ normal(perox2_true[i], perox2_se[i]);
  }
  
  // the means of the multivariate normal are a function of environmental covariates
  for(n in 1:N){
     mu[n] = beta_0 + beta_1_temp*temp_vals[n] + beta_2_fe*fe_vals[n] + beta_3_mn*mn_vals[n] + beta_4_light*light_vals[n];
  }
  
  // y is an array of estimated, normalized measurement vectors
  y ~ multi_normal(mu, Sigma);

}

A few notes about the data and what I’ve done so far: I started off by simulating various multivariate normal distributions and checking whether I could recover the parameters with this approach. We don’t have a tonne of data overall, so the environmental dependence aspect is really exploratory. The main goal is to make good quantitative estimates, so the \beta_0 is super important.

All environmental variables were z-score transformed.


I would also love feedback on how I’ve described the model in the methods section, included below (in draft form!). Note that we are only using 14 observations in total, and three different proteins. I would love specific feedback on the dimensions!

" Taxon-normalized peptide abundances (P_r) were modelled as a multivariate normal distribution where the mean value is a function of environmental correlates. This approach was borrowed from community ecology, where there is both an environmental and biotic dependence on a species’ abundance. From a molecular perspective, this can be interpreted as the influence of an environmental trigger (e.g., promoter activity dependent on an environmental variable) as well as two proteins having a shared regulator. Specifically, we modelled the estimated protein abundances as follows:

\mathbf{P}_r \sim \textrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})

Where \mathbf{P}_r is a matrix of estimated, taxon-normalized peptide abundances (with 3 columns and 14 rows), which are assumed to follow a multivariate normal distribution (MVN), where \boldsymbol{\mu} is a 3-element vector of mean abundances, and \boldsymbol{\Sigma} is the covariance matrix. We decomposed the covariance matrix using Cholesky factorization, to separately estimate the correlation matrix and the individual protein variances.

Lastly, we wanted to explain variation in the abundance of each protein, so we examined how the abundance varied as a function of several environmental covariates. Specifically, we examined dissolved Fe and Mn concentrations, light, and temperature. We therefore set the mean abundance to:

\boldsymbol{\mu} = \beta_0 + \beta_{Fe}[dFe_z] + \beta_{Mn}[dMn_z] + \beta_{PAR}[PAR_z] + \beta_{T}[Temp_z]

Where \beta_0 is a three-element vector of intercept coefficients, \beta_{Fe} represents the set of coefficients for Fe, etc. Values within the square brackets are each a vector of length 14. Prior to fitting this model, we z-score transformed each environmental covariate, and therefore \beta_0 can be interpreted as an average value."


Hi, @jspmcc. These questions asking to review non-trivial models are hard for us as it’s not clear where to look.

For your model, the main thing I would recommend is staying on the Cholesky factor scale. It’s quadratic rather than cubic complexity and avoids a ton of autodiff implied by trying to put covariance matrices back together. We explain how to do this in the regression chapter of the User’s Guide. Briefly, rather than defining Sigma = quad_form_diag(R, sigma), you want to use L_Sigma = diag_pre_multiply(sigma, chol), which will give you the Cholesky factor of Sigma, i.e., diag_matrix(sigma) * chol. Then replace multi_normal(mu, Sigma) with multi_normal_cholesky(mu, L_Sigma).
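In code, that change looks roughly like this (an untested sketch; L_Sigma is a name I’m introducing, and the rest of your blocks stay as they are):

transformed parameters {
  // Cholesky factor of Sigma: diag_matrix(sigma) * chol
  matrix[K, K] L_Sigma = diag_pre_multiply(sigma, chol);
  // ... y defined as before ...
}

model {
  // ... measurement error statements and mu defined as before ...
  y ~ multi_normal_cholesky(mu, L_Sigma);
}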

You can simplify further by using linear algebra rather than loops. I’m pretty sure mu can be redefined as a matrix and unfolded into matrix multiplies.
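For example, something along these lines (untested; X, B, and Mu are illustrative names, not from your code):

// stack the z-scored covariates into an N x 4 design matrix
// (X depends only on data, so it can live in transformed data)
matrix[N, 4] X = append_col(append_col(temp_vals, fe_vals),
                            append_col(mn_vals, light_vals));
// stack the coefficient vectors into a 4 x K coefficient matrix
matrix[4, K] B = append_row(append_row(beta_1_temp', beta_2_fe'),
                            append_row(beta_3_mn', beta_4_light'));
// one matrix multiply replaces the loop; row n of Mu is mu[n] transposed
matrix[N, K] Mu = rep_matrix(beta_0', N) + X * B;

You’d then use Mu[n]' as the location in the multi-normal statement (or copy the rows back into an array of vectors).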

You declare beta_0 with a lower bound of zero, but it’s really the whole expression defining mu that you want to lower bound at zero. Or you want to work on the log scale and then take an exp() at the end to convert back to positive values.
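One concrete way to do the latter (just a sketch; eta is an illustrative name) is to leave beta_0 unbounded and exponentiate the linear predictor, so every component of mu stays positive no matter what the coefficients are:

for (n in 1:N) {
  vector[K] eta = beta_0 + beta_1_temp * temp_vals[n]
                  + beta_2_fe * fe_vals[n]
                  + beta_3_mn * mn_vals[n]
                  + beta_4_light * light_vals[n];
  mu[n] = exp(eta);  // element-wise exp keeps every component of the mean positive
}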

For the writeup, I’d pick a tense (past/present) and a voice (active/passive) and a person (1st/3rd) and stick to it.

I think it’s OK to code in Cholesky factors and do the write up with dense matrices—the Cholesky factorization is an implementation detail.

When you write math, think of equations as statements like sentences. As such, we’d write “abundances as” with no following colon, display equation followed by a comma, then “where” starting in lower case. When you write math in LaTeX, you want functions to be in Roman, not italics, and you want them in text mode, so it’s P_r \sim \textrm{MVN}(\mu, \Sigma), with the operator coded as \textrm{MVN}. Following Andrew Gelman’s advice in BDA3, I would not write any of these variables in bold; there just isn’t enough capitalization, Greek, and bolding to go around to mark all the different types. I would keep P in italics because it’s a variable (constants are regular Roman). If you’re saying \mu has 3 elements, then maybe say \Sigma is 3 \times 3?
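Concretely, the rendered sentence might look something like this (a sketch of the formatting only, not your exact wording):

... we modelled the estimated protein abundances as
\[
  P_r \sim \textrm{MVN}(\mu, \Sigma),
\]
where $P_r$ is the $14 \times 3$ matrix of estimated, taxon-normalized peptide abundances, $\mu$ is a 3-element vector of mean abundances, and $\Sigma$ is the $3 \times 3$ covariance matrix.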

I think the relevant point is that the prior is factored into a prior on the correlation matrix and a prior on the scales. This does not require a Cholesky factorization and I wouldn’t present it in Cholesky form in the methods section. Just Cholesky factoring doesn’t separately estimate anything. You can always decompose a matrix into a product of scales and a correlation matrix by dividing by the scales (square root of diagonals).
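In Stan, that factored prior is typically written something like this (the specific distributions are illustrative choices on my part; your model currently leaves these priors implicit):

chol ~ lkj_corr_cholesky(2);  // prior on the correlation matrix via its Cholesky factor
sigma ~ normal(0, 1);         // half-normal prior on the scales, since sigma is declared <lower=0>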

What is the z index in the definition of \mu? I think you need to explain all the variables used here and how the indexing works—it’s by far the most confusing part of this description. What are the d suffixes for? If these are constant indexes, then you probably want them to be in roman font rather than italics.

A z-score transform of the covariates will have a predictable effect on \beta_0, but I don’t see why you say it can be interpreted as an average value. I think that would require that the covariates are not correlated, which would require a QR-decomposition, not just a z-transform. I don’t think you need to mention the interpretation of \beta_0 in any case—it’s just a standard regression and this is probably not the place to teach people how regression works.
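If you did want decorrelated predictors, the usual approach is the thin-QR reparameterization from the User’s Guide. Here’s a minimal single-outcome sketch (X, alpha, theta, and beta are illustrative names, not from the model above):

data {
  int<lower=0> N;
  int<lower=0> P;
  matrix[N, P] X;
  vector[N] y;
}
transformed data {
  matrix[N, P] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[P, P] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[P, P] R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;
  vector[P] theta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + Q_ast * theta, sigma);
}
generated quantities {
  vector[P] beta = R_ast_inverse * theta;  // coefficients on the original X scale
}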

Good luck with the paper!

P.S. I would strongly urge you to drop almost all the comments in your code because they’re pretty much all obvious from reading the code if you know the language. The general principle of doc is that you should assume the reader of the code is an expert in the language (this is not just me—it’s at the heart of the Google code style sheets in all of their languages). If a reader doesn’t know how something in the language works, assume they have the doc and can look it up to get the definitive answer on what a function does. Some simple examples:

corr_matrix[K] R; // correlation matrix

If someone knows Stan, they know corr_matrix is a correlation matrix. Even if they don’t know Stan, I think it’s fair to assume they can guess what’s going on here. Similarly, here’s an example where you’re trying to document Stan in your own code.

Sigma = quad_form_diag(R, sigma); // quad_form_diag: diag_matrix(sig) * R * diag_matrix(sig)

Comments like this picking out blocks really don’t help with much:

// setting up the covariance matrix
cholesky_factor_corr[K] chol;
vector<lower=0>[K] sigma;

I also wouldn’t name a variable chol—that’s more of a partial type than a description of a variable—try to use notation that follows the math in your paper. Similarly, you don’t need _vals as a suffix for a variable—everything has a value, so it’s not adding anything. Similarly, we usually reserve _true for the single “true” values, and I wouldn’t use it as a parameter name unless you have a measurement error model where you measure a true value with error.

Okay! I wasn’t sure if this is an appropriate forum, and figured it wouldn’t hurt. I won’t again, apologies! (Thank you for your comments and suggestions on this though!!)

Fair, I honestly just like writing it as a loop because I find it easier to understand.

Yes I agree!

The z index is to indicate that the values were z-score transformed. I had taken this out of the larger methods section, and now I realize that other terms were defined earlier (not shown in my post). For clarity, the d suffix stands for dissolved.

Fascinating! I’ve never heard this argument before. I tend to comment just to make sure everything is in plain language!

Please don’t stop sending things to the forum. I was just apologizing for how long it takes us to get to some of these posts. I wasn’t trying to discourage you from posting!

Doesn’t there need to also be some indexing here?

This is standard operating procedure in professional code. For example, here’s the Google Style guide on commenting Python:

On the other hand, never describe the code. Assume the person reading the code knows Python (though not what you’re trying to do) better than you do.

They do temper this with

The final place to have comments is in tricky parts of the code. If you’re going to have to explain it at the next code review, you should comment it now. Complicated operations get a few lines of comments before the operations commence. Non-obvious ones get comments at the end of the line.

One problem is that comments can lie and they can quickly get stale. Many of the comments I see are just wrong because the programmer didn’t understand the language they were using. It’s far better to write code that’s readable than to try to document unreadable code. The code doesn’t lie and you wind up having to read that anyway because comments are generally unreliable. The guidance is to only write comments in code when you are departing from idiomatic usage of the language or you’re invoking some complicated algorithm and want to point off to where it’s defined in a paper or quickly describe it inline.

Oh oops, I misunderstood!

Thanks for those references on commenting – very interesting!

I should have added that one of my colleagues here at Flatiron, @JSoules, has a really nice presentation on this topic he did for our in-house training last fall. You can find his talk here:

under the title

  • “[NLA + HPC] Coding for Humans: Best Practices for writing software people can read”.

I also gave a talk at FWAM on Bayesian workflow under the title “[Comp Stats+Bayesian] Testing Bayesian models and predictive inference in ML and statistics”.
