I’m interested in quantifying proteomic composition of microbes in natural samples. In the model below, I’ve tried to combine two types of proteomic measurements, along with their associated measurement error, and then modelled the proteomic composition as a function of environmental variables. This is conceptually similar to approaches in community ecology, e.g. this. I would love any feedback on the model structure, and on communicating the correct statistical notation!
data {
// data dimensions (N observations, K proteins, P measurements of protein/carbon from Joli et al)
int<lower=0> N;
int<lower=0> K;
// fmol per ug protein measurement means -- mean estimates
vector[N] cuznsod_meas;
vector[N] mnsod_meas;
vector[N] perox2_meas;
// measurement error of fmol per ug protein measurement -- standard errors
vector[N] cuznsod_se;
vector[N] mnsod_se;
vector[N] perox2_se;
// proportion of ug protein in Frag -- mean estimates
vector[N] frag_proportion_meas;
// measurement error of fmol per ug protein measurement -- standard errors
vector[N] frag_proportion_se;
// covariate data values
vector[N] temp_vals;
vector[N] fe_vals;
vector[N] mn_vals;
vector[N] light_vals;
}
parameters {
// setting up the covariance matrix
cholesky_factor_corr[K] chol;
vector<lower=0>[K] sigma;
// parameters for true observed values
vector<lower=0>[N] cuznsod_true;
vector<lower=0>[N] mnsod_true;
vector<lower=0>[N] perox2_true;
vector<lower=0, upper=1>[N] frag_proportion_true;// This is a proportion that is therefore restricted b/w 0 and 1
// regression model coefficients
vector<lower=0>[K] beta_0; // intercept, bounded by zero because all proteins are non-zero values
vector[K] beta_1_temp;
vector[K] beta_2_fe;
vector[K] beta_3_mn;
vector[K] beta_4_light;
}
transformed parameters {
corr_matrix[K] R; // correlation matrix
cov_matrix[K] Sigma; // covariance matrix
matrix[N, K] y_vals; // targeted proteomics values (not normalized by estimated Frag biomass)
vector[K] y[N]; // collection of values that are estimated fmol protein / ug Fragilariopsis protein. This notation forms an array of K vectors of length N.
// Recomposing the Cholesky factorized matrices to make a covariance matrix
R = multiply_lower_tri_self_transpose(chol); // R = Lcorr * Lcorr'
Sigma = quad_form_diag(R, sigma); // quad_form_diag: diag_matrix(sig) * R * diag_matrix(sig)
// Designating columns of the y_vals matrix to be the estimated values for the targeted proteomics data
y_vals[, 1] = cuznsod_true;
y_vals[, 2] = mnsod_true;
y_vals[, 3] = perox2_true;
// Loop through each element of the matrix and assign it to the array
for (i in 1:N) {
for (j in 1:K) {
// These targeted values are divided by the estimates Frag-derived protein
y[i, j] = y_vals[i, j]/frag_proportion_true[i]; // divide by the true proportion of Frag protein
}
}
}
model {
//This notation forms an array of K vectors of length N.
vector[K] mu[N];
frag_proportion_true ~ normal(0.1, 0.5); // weak prior on the proportion of frag in the sample
// measurement error model to estimate the true proportion of Frag protein
for (i in 1:N){
frag_proportion_meas[i] ~ normal(frag_proportion_true[i], frag_proportion_se[i]);
}
// measurement error model for targeted measurements
for (i in 1:N){
cuznsod_meas[i] ~ normal(cuznsod_true[i], cuznsod_se[i]);
mnsod_meas[i] ~ normal(mnsod_true[i], mnsod_se[i]);
perox2_meas[i] ~ normal(perox2_true[i], perox2_se[i]);
}
// the means of the multivariate normal are a function of environmental covariates
for(n in 1:N){
mu[n] = beta_0 + beta_1_temp*temp_vals[n] + beta_2_fe*fe_vals[n] + beta_3_mn*mn_vals[n] + beta_4_light*light_vals[n];
}
// y is a matrix of estimated, normalized measurements
y ~ multi_normal(mu, Sigma);
}
A few notes about the data and what I’ve done so far: I started off by simulating various multivariate normal distributions, and seeing if I could recover parameters using this approach. We don’t have a tonne of data overall, so the environmental dependence aspect is really exploratory. The main goal is to make good quantitative estimates, so the \beta_0 is super important.
All environmental variables were z-score transformed.
I would also love feedback on how I’ve described the model in the methods section, which is as described below (in draft form!). Note that we only have 14 observations that we are using in total, and three different proteins. I would love specific feedback on the dimensions!
" Taxon-normalized peptide abundances (P_r) were modelled as a multivariate normal distribution where the mean value is a function of environmental correlates. This approach was borrowed from community ecology, where there is both an environmental and biotic dependence on a species’ abundance. From a molecular perspective, this can be interpreted as the influence of an environmental trigger (e.g., promoter activity dependent on an environmental variable) as well as two proteins having a shared regulator. Specifically, we modelled the estimated protein abundances as follows:
𝐏_r ~ MVN(𝝁, 𝜮)
Where 𝐏_r is an matrix of estimated, taxon-normalised peptide abundances (with 3 columns and 14 rows), which are assumed to follow a multivariate normal distribution (MVN), where 𝝁 is a 3-element vector of mean abundances, and 𝜮 is the covariance matrix. We decomposed the covariance matrix using Cholesky factorization, to separately estimate the correlation matrix and the individual protein variances.
Lastly, we wanted to explain variation in the abundance of each protein, so we examined how the abundance varied as a function of several environmental covariates. Specifically, we examined dissolved Fe and Mn concentrations, light, and temperature. We therefore set the mean abundance to:
𝝁 = \beta_0 + \beta_{Fe}[dFe_z] + \beta_{Mn}[dMn_z] + \beta_{PAR}[PAR_z] + \beta_{T}[Temp_z]
Where \beta_0 is a three-element vector of intercept coefficients, \beta_{Fe} represents the set of coefficients for Fe, etc. Values within the square brackets are each a vector of length 14. Prior to fitting this model, we z-score transformed each environmental covariate, and therefore \beta_0 can be interpreted as an average value."