Additive degeneracy issue

I have a problem that I can’t solve with my very basic linear algebra skills - posting here in the hope there is someone more informed out there!

I am trying to model some measurements of N_{rxn} chemical reactions that depend deterministically on properties of their N_{cpd} reactants through the relationship y = S^{T}\theta, where

  • S is an N_{cpd} by N_{rxn} stoichiometric matrix specifying the proportions in which the reactions create and destroy reactants
  • \theta is a length N_{cpd} vector of unknown reactant properties

(the measurements are of Gibbs energy changes of reaction and the reactant properties are their Gibbs energy changes of formation).

There is prior information about \theta and I’m assuming that the measurement error is known in advance.

Here is a naive representation of the model I’d like to fit in the form of a Stan program:

data {
  int<lower=1> N_rxn;
  int<lower=1> N_cpd;
  matrix[N_cpd, N_rxn] S;
  vector[N_rxn] y;
  vector<lower=0>[N_rxn] error_scale;
  vector[N_cpd] prior_loc_theta;
  vector<lower=0>[N_cpd] prior_scale_theta;
}
parameters {
  vector[N_cpd] theta;
}
model {
  target += normal_lpdf(theta | prior_loc_theta, prior_scale_theta);
  target += normal_lpdf(y | S' * theta, error_scale);
}
generated quantities {
  vector[N_rxn] yrep;
  {
    vector[N_rxn] yhat = S' * theta;
    for (n in 1:N_rxn){
      yrep[n] = normal_rng(yhat[n], error_scale[n]);
    }
  }
}

This model doesn’t work very well in cases where the measured reactions don’t identify all the compound properties, which is typical. For example, a stoichiometric matrix might look like this:

[[0, 0, 0, -1],
 [0, 0, 0, -1],
 [0, 0, 0, 2],
 [0, 0, -1, 0],
 [0, -1, -1, 0],
 [1, 1, 2, 0]]

In this setup reaction 1 (the leftmost column) creates compound 6 with no compound being destroyed. Reaction 2 creates compound 6 and destroys compound 5, etc.

The problem is that measuring these reactions only gives information about the absolute \theta values of compounds 4 to 6. For \theta_1, \ldots, \theta_3 the measurements only give information about their relative values. This can be seen by looking at the reduced row echelon form of S^T:

[[1, 1, -2, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 1]]

From this matrix we can see that the measurements only inform the combination \theta_1 + \theta_2 - 2\theta_3, not \theta_1, \theta_2 and \theta_3 individually. I think this is a case of what Michael Betancourt @betanalpha calls additive degeneracy.
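
As a quick check, the reduced row echelon form can be computed with sympy, for example like this (S is the example matrix above; the variable names are just illustrative):

import numpy as np
import sympy

# Example stoichiometric matrix S (compounds x reactions) from above
S = np.array([
    [0,  0,  0, -1],
    [0,  0,  0, -1],
    [0,  0,  0,  2],
    [0,  0, -1,  0],
    [0, -1, -1,  0],
    [1,  1,  2,  0],
])

# Reduced row echelon form of S^T: each nonzero row is a linear
# combination of theta that the measurements can inform
rref, pivots = sympy.Matrix(S.T.tolist()).rref()
print(np.array(rref.tolist(), dtype=float))
print(pivots)  # (0, 3, 4, 5): leading ones at compounds 1, 4, 5 and 6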

I tried using the reduced row echelon form of S^T to make the following alternative Stan model which seems to give the same posterior predictive distribution as the one above, but uses fewer leapfrog steps:

data {
  int<lower=1> N_rxn;
  int<lower=1> N_cpd;
  matrix[N_cpd, N_rxn] S;
  matrix[N_rxn, N_cpd] S_T_rref;
  vector[N_rxn] y;
  vector<lower=0>[N_rxn] error_scale;
  vector[N_cpd] prior_loc_theta;
  vector<lower=0>[N_cpd] prior_scale_theta;
}
transformed data {
  // push the prior for theta through S_T_rref: the location transforms
  // exactly, but summing absolute contributions only approximates the
  // sd of each linear combination
  vector[N_rxn] prior_loc_theta_free = S_T_rref * prior_loc_theta;
  vector[N_rxn] prior_scale_theta_free = fabs(S_T_rref) * prior_scale_theta;
}
parameters {
  vector[N_rxn] theta_free;
}
model {
  vector[N_rxn] yhat = S' * (theta_free' * S_T_rref)';
  target += normal_lpdf(theta_free | prior_loc_theta_free, prior_scale_theta_free);
  target += normal_lpdf(y | yhat, error_scale);
}
generated quantities {
  vector[N_rxn] yrep;
  {
    vector[N_rxn] yhat = S' * (theta_free' * S_T_rref)';
    for (n in 1:N_rxn){
      yrep[n] = normal_rng(yhat[n], error_scale[n]);
    }
  }
}

I’ve made a quick Python script that runs both models with some representative data here.

My questions are:

  • Does what I’ve done so far make sense? Are my models really equivalent mathematically/informationally?

  • If not, is there a standard way of dealing with this kind of problem?

  • If so, how can I recover the vector \theta? I think the required information is there but I’m confused about how to combine the relative information in theta_free with the pre-experimental information about the absolute theta parameters.

I don’t have time to carefully look at the linear algebra but the basic idea is correct. See also The QR Decomposition For Regression Models for a demonstration of a similar problem, including the translation of a prior density in the nominal parameterization to the reduced one.

Ultimately it helps to recognize that you’re reparameterizing the entire model, not just the one likelihood function, so you need to propagate the change everywhere including through the prior density with the requisite Jacobian and what not.
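
Concretely, for any invertible linear reparameterization \gamma = A\theta, a normal prior \theta \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^{2})) (here \mu and \sigma are the prior location and scale) translates to

\gamma \sim \mathcal{N}\left(A\mu,\; A\,\mathrm{diag}(\sigma^{2})\,A^{\top}\right)

with a constant Jacobian determinant that can be dropped from the target, but note that the components of \gamma will in general be correlated even when the components of \theta are independent.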

The reduced form picks out parameters that are more directly informed by the available data, which should facilitate the fit, at least provided that the transformed prior model isn’t awkward. Otherwise the main options are a more informative prior model – even more information on just a few parameters can help fight the degeneracy – or incorporating complementary measurements that identify how some of the reactions work on their own.


Thanks very much for the helpful comments and link! I think I’m now pretty clear about the right general approach and have got an encouraging result in my simple simulation study. I’m still not 100% sure so I thought I’d write out everything I did.

I defined a reparameterisation with a matrix R, which is the N_{cpd} \times N_{cpd} identity matrix but with the rows at the columns where rref(S^T) has leading ones replaced by the corresponding rows of the reduced row echelon form of S^T. In this case the matrix is

R = \left[\begin{matrix}1 & 1 & -2 & 0 & 0 & 0\\0 & 1 & 0 & 0 & 0 & 0\\0 & 0 & 1 & 0 & 0 & 0\\0 & 0 & 0 & 1 & 0 & 0\\0 & 0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 0 & 1\end{matrix}\right]

with rows 1, 4, 5 and 6 taken from rref(S^T) because its leading ones are at these columns. The reparameterisation is

\gamma = R\theta

and theta can be recovered with

\theta = R^{-1}\gamma
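
Here’s a minimal Python sketch of this construction (the function name is just illustrative; sympy does the row reduction):

import numpy as np
import sympy

def reparam_matrix(S):
    """Identity matrix with the rows at the pivot columns of rref(S^T)
    replaced by the corresponding rows of rref(S^T)."""
    rref, pivots = sympy.Matrix(S.T.tolist()).rref()
    rref = np.array(rref.tolist(), dtype=float)
    R = np.eye(S.shape[0])
    for i, piv in enumerate(pivots):
        R[piv] = rref[i]  # row `piv` of R is row i of rref(S^T)
    return R

S = np.array([[0, 0, 0, -1], [0, 0, 0, -1], [0, 0, 0, 2],
              [0, 0, -1, 0], [0, -1, -1, 0], [1, 1, 2, 0]])
R = reparam_matrix(S)      # reproduces the matrix written out above
R_inv = np.linalg.inv(R)   # gamma = R @ theta, theta = R_inv @ gamma

The pivot rows are exactly the combinations the data can pin down, and R is always invertible because the remaining identity rows cover the non-pivot compounds.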

I made the following Stan model to test the reparameterisation:

data {
  int<lower=1> N_rxn;
  int<lower=1> N_cpd;
  matrix[N_cpd, N_rxn] S;
  matrix[N_cpd, N_cpd] R;  // matrix defining a reparameterisation
  vector[N_rxn] y;
  vector<lower=0>[N_rxn] error_scale;
  vector[N_cpd] prior_loc_theta;
  vector<lower=0>[N_cpd] prior_scale_theta;
}
parameters {
  vector[N_cpd] gamma;
}
transformed parameters {
  vector[N_cpd] theta = R \ gamma;  // solves R * theta = gamma, i.e. theta = inverse(R) * gamma
}
model {
  target += normal_lpdf(theta | prior_loc_theta, prior_scale_theta);
  // no Jacobian adjustment needed: gamma -> theta is linear, so the Jacobian is constant
  target += normal_lpdf(y | S' * theta, error_scale);
}
generated quantities {
  vector[N_rxn] yrep;
  vector[N_rxn] log_lik;
  {
    vector[N_rxn] yhat = S' * theta;
    for (n in 1:N_rxn){
      yrep[n] = normal_rng(yhat[n], error_scale[n]);
      log_lik[n] = normal_lpdf(y[n] | yhat[n], error_scale[n]);
    }
  }                             
}

It seems to work OK in the test case I put in. From this graph it looks like the marginal posteriors are about the same with the new parameterisation as with the naive one:

This graph shows that Stan took far fewer leapfrog steps when sampling from the reparameterised model:

I still need to try the method out in a more complicated case, and I also want to include more information about the compounds (two compounds with similar chemical structures likely have similar formation energies), but for now I’m very happy - thanks again!


Here’s a quick update after some more work, and some more questions. I updated the github repository with code implementing a new model, some simulation studies and a draft report and presentation.

To recap, the general problem is that there is a stoichiometric matrix S specifying the substrates and products of many reactions, a vector y of measurements of each reaction’s Gibbs energy change in standard conditions, and a vector \theta of unknown formation energies for each compound, which determine the predicted reaction energy changes according to S^T\theta = \hat{y}. There is also some extra information: the compounds decompose into chemical groups, and each compound’s formation energy is approximately the sum of those of its component groups (encoded below by the matrix G, whose entries count how many of each group occur in each compound). I didn’t include this extra information in the example above.

I’d like to use the following Stan program to represent all this information:

data {
  int<lower=1> N_rxn;
  int<lower=1> N_cpd;
  int<lower=1> N_grp;
  matrix[N_cpd, N_rxn] S;
  matrix[N_cpd, N_grp] G;
  vector[N_rxn] y;
  vector<lower=0>[N_rxn] sigma;
  vector[N_cpd] prior_theta[2];
  vector[N_grp] prior_gamma[2];
  real prior_tau[2];
}
parameters {
  real<lower=0> tau;
  vector[N_cpd] theta;
  vector[N_grp] gamma;
}
model {
  target += normal_lpdf(theta | prior_theta[1], prior_theta[2]);
  target += normal_lpdf(gamma | prior_gamma[1], prior_gamma[2]);
  target += normal_lpdf(tau | prior_tau[1], prior_tau[2]);
  target += normal_lpdf(theta | G * gamma, tau);  // approximate group additivity
  target += normal_lpdf(y | S' * theta, sigma);
}

The difficulty is that the measurements are of linear combinations of unknowns. Since the measurement error is small compared to the prior standard deviations for theta, the marginal posteriors for individual components of theta tend to be wide, whereas the posteriors for the linear combinations of theta components that appear in the measured reactions are narrow.

To mitigate the degeneracy I made a method for deriving the formation energies from auxiliary parameters, borrowing from this 1991 biology paper. To get the auxiliaries from the formation energies I multiply by a matrix consisting of the reduced row echelon form of S^T augmented with rows from the N_{cpd}\times N_{cpd} identity matrix. Formation energies can be obtained from the auxiliaries by multiplying by this matrix’s inverse. The idea is that the posteriors for auxiliary parameters corresponding to rows from rref(S^T) will be determined by the measurements and be narrow, whereas the other auxiliary parameters will be determined by the priors.
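
Here is a rough Python sketch of this construction (the function name is just illustrative, and only the compound-level matrix R_inv is sketched, not RG_inv):

import numpy as np
import sympy

def augmented_rref_inverse(S):
    """Stack the nonzero rows of rref(S^T) on top of identity rows for the
    non-pivot columns, then invert. Multiplying theta by the stacked matrix
    gives the auxiliaries; multiplying by the inverse recovers theta."""
    rref, pivots = sympy.Matrix(S.T.tolist()).rref()
    rref = np.array(rref.tolist(), dtype=float)[:len(pivots)]  # drop zero rows
    n_cpd = S.shape[0]
    non_pivots = [j for j in range(n_cpd) if j not in pivots]
    R = np.vstack([rref, np.eye(n_cpd)[non_pivots]])
    return np.linalg.inv(R)

# R_inv = augmented_rref_inverse(S)  # passed to the Stan program as data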

Here’s a Stan program implementing the reparameterised model:

data {
  int<lower=1> N_rxn;
  int<lower=1> N_cpd;
  int<lower=1> N_grp;
  matrix[N_cpd, N_rxn] S;
  matrix[N_cpd, N_grp] G;
  matrix[N_cpd, N_cpd] R_inv;
  matrix[N_grp, N_grp] RG_inv;
  vector[N_rxn] y;
  vector<lower=0>[N_rxn] sigma;
  vector[N_grp] prior_gamma[2];
  vector[N_cpd] prior_theta[2];
  real prior_tau[2];
}
parameters {
  real<lower=0> tau;  // controls group additivity accuracy
  vector[N_cpd] eta_cpd;
  vector[N_grp] eta_grp;
}
transformed parameters {
  vector[N_cpd] theta = R_inv * eta_cpd;
  vector[N_grp] gamma = RG_inv * eta_grp;
}
model {
  target += normal_lpdf(theta | prior_theta[1], prior_theta[2]);
  target += normal_lpdf(gamma | prior_gamma[1], prior_gamma[2]);
  target += normal_lpdf(theta | G * gamma, tau);
  target += normal_lpdf(tau | prior_tau[1], prior_tau[2]);
  target += normal_lpdf(y | S' * theta, sigma);
}

The reparameterisation seems to reduce the number of leapfrog steps in two simulation studies I made that are in the linked repo, but the results weren’t as dramatic as I expected. Here is a graph of cumulative leapfrog steps in a small artificial simulation study:

Here is a similar plot from a larger simulation study with real stoichiometries from the equilibrator project:

In particular, in the big problem (~650 reactions and compounds) the number of leapfrog steps was about 33% lower for the reparameterised model, but the total sampling time was about the same.

I have some specific questions about the approach I took:

  • Is it plausible that the extra matrix multiplication in the reparameterised
    model makes the cost per leapfrog step so much higher?
  • Is there a better way to express the priors and multilevel structure? I would
    like to express both in a non-centered kind of way to see if this is more
    efficient, but I struggled to do both this and the reparameterisation.

More generally, I’d like to know if anyone knows any other context where this
problem comes up (i.e. Bayesian modelling of precise measurements of linear
combinations of unknowns with relatively imprecise prior information). It seems
like quite a general problem and my solution comes from the same field, so I
think it’s possible I’m missing out on a lot of prior art.
