Problems formulating cross-over model with measurement error and censored data

Hello,

I have a working hierarchical model representing a factorial design for an environmental treatment effect. The model works fine and provides good estimates. However, I made some simplifications when preparing the data that goes into the model, and I believe those calculations could instead be incorporated into the model itself for a more complete inference. This is the part I am currently struggling with.

Situation:
Pre- and post-treatment I measure total (T) and dissolved (D) concentrations of my compound, but what I actually need for my model is a third, calculated quantity: the particulate concentration (P). Because of measurement differences and how my lab reports the data, the quantity P = T - D is not straightforward to calculate. For quantifiable concentrations the lab reports a measurement mean and a measurement error interval of approximately 2\sigma. For unquantifiable concentrations it reports only a limit of reporting (typically the case for some compounds post-treatment). Reading the Stan manual for ideas on how to implement this, my situation looks like a cross-over between a measurement-error model and left-censored data, but neither seems to fit out of the box. I made a naive implementation, but it errors out with a log(0) probability. I think I may be missing something fundamental.

The model code is below; the particulate vector is what I would ideally want to carry into an expanded model that includes the experimental design. Any pointers on what I have misunderstood?

Any advice would be most welcome!

data {
    int<lower = 0> N;            // Number of experiments
    
    // Data for total concentration
    vector[N] total_mea;     // Measured total concentration
    vector[N] total_mea_err; // Uncertainty in total concentrations, 2*sigma
    int total_bql[N];    // Below quantification limit: 0 or 1
    vector[N] total_lor; // if total_bql == 1, this is the limit of reporting
    
    // Data for dissolved concentration
    vector[N] dissolved_mea;     // Measured dissolved concentration
    vector[N] dissolved_mea_err; // Uncertainty in dissolved concentrations, 2*sigma
    int dissolved_bql[N];        // Below quantification limit: 0 or 1
    vector[N] dissolved_lor;     // if dissolved_bql == 1, this is the limit of reporting
    
}

parameters {
  vector[N] total;
  vector[N] dissolved;
  // Constrain particulate to be >=0
  vector<lower = 0>[N] particulate;
}

model {
  for (i in 1:N) {
    real total_err;
    real dissolved_err;
    
    if (total_bql[i]) {
      //  Sample total concentration is below quantification limit
      // Assume it comes from Uniform[0, limit_of_reporting]
      total[i] ~ uniform(0, total_lor[i]);
      total_err = total_lor[i] / sqrt(12);
    } else {
      // Sample concentration was measured to mean with uncertainty 2xsigma
      total[i] ~ normal(total_mea[i], total_mea_err[i]/2);
      total_err = total_mea_err[i]/2;
    }
    
    if (dissolved_bql[i]) {
      //  Sample dissolved concentration is below quantification limit
      // Assume it comes from Uniform[0, limit_of_reporting]
      dissolved[i] ~ uniform(0, dissolved_lor[i]);
      dissolved_err = dissolved_lor[i] / sqrt(12);
    } else {
      dissolved[i] ~ normal(dissolved_mea[i], dissolved_mea_err[i]/2);
      dissolved_err = dissolved_mea_err[i]/2;
    }
    
    // Calculate the difference
    // TODO: currently assumes the difference of two normally distributed random variables,
    // but could be normal - normal, normal - uniform, or uniform - normal,
    // depending on the assumed <LOR distribution;
    // perhaps create a custom lpdf function later
    particulate[i] ~ normal(total[i] - dissolved[i], sqrt(total_err^2 + dissolved_err^2));
  }
}

Sample data:

compoundA <- list(N = 23L, 
                 total_mea = c(1.16, 1.29, 7.17, 8.21, 2.62, 9.95, 1.89, 1.12, 7.28, 0.949, 0.892, 
                                10.5, 12, 1.21, 1.22, 10.8, 1.1, 1.36, 2.95, 3.97, 0.5, 10.7, 1.16), 
                  total_mea_err = c(0.17, 0.18, 0.73, 0.83, 0.29, 1, 0.23, 0.17, 0.74, 0.158, 0.154,
                                    1.1, 1.2, 0.17, 0.18, 1.1, 0.17, 0.19, 0.32, 0.42, -1, 1.1, 0.17), 
                  total_bql = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
                  total_lor = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
                                0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), 
                  dissolved_mea = c(0.244, 0.415, 0.548, 0.5, 0.871, 0.547,  0.73, 1, 0.584, 0.447,
                                    0.598, 1.16, 0.387, 0.767, 0.626, 0.27, 0.666, 0.5, 0.31, 0.305, 
                                    0.306, 0.392, 0.63), 
                  dissolved_mea_err = c(0.027, 0.043, 0.056, -1, 0.088, 0.056, 0.074, 0.1, 0.06, 
                                        0.046, 0.061,  0.12, 0.04, 0.078, 0.064, 0.029, 0.068, -1, 
                                        0.033, 0.033, 0.033, 0.041, 0.064), 
                  dissolved_bql = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), 
                  dissolved_lor = c(0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 
                                    0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))

This looks like a mostly reasonable approach, although there are a few problems and things I would probably change:

  1. When using uniform, you also need to put matching bounds on the support of the variables. The easiest way to do this would probably be to split total (and dissolved) into two arrays: one with bounds (for the “below detection limit” values), the other without bounds. See the sketch below this list.

  2. The lower bound of 0 probably applies just as well when the value is measured with uncertainty, so all total and dissolved parameters should probably be declared with <lower=0>.

  3. If it is necessary that total > dissolved, then this once again should be reflected in bounds on the parameters

  4. I don’t think it makes sense to treat particulate as a parameter: once you have total and dissolved, particulate can be computed directly, right?
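
To make points 1 and 2 a bit more concrete, here is a rough sketch for the total series only (dissolved would be handled the same way). The index arrays obs_idx / bql_idx and the split data vectors are my invention for illustration, and the vector-valued upper bound needs a reasonably recent Stan:

data {
  int<lower=0> N;                                    // number of experiments
  int<lower=0, upper=N> N_bql;                       // how many totals are below the LOR
  array[N - N_bql] int<lower=1, upper=N> obs_idx;    // positions of quantified totals
  array[N_bql] int<lower=1, upper=N> bql_idx;        // positions of censored totals
  vector[N - N_bql] total_mea;                       // measured means, quantified values only
  vector<lower=0>[N - N_bql] total_mea_err;          // 2*sigma, quantified values only
  vector<lower=0>[N_bql] total_lor;                  // limits of reporting, censored values only
}

parameters {
  vector<lower=0>[N - N_bql] total_obs;                 // point 2: lower bound at 0
  vector<lower=0, upper=total_lor>[N_bql] total_cens;   // point 1: bounds match the uniform support
}

transformed parameters {
  vector[N] total;                                   // full series, reassembled in the original order
  total[obs_idx] = total_obs;
  total[bql_idx] = total_cens;
}

model {
  total_obs ~ normal(total_mea, total_mea_err / 2);
  // total_cens needs no sampling statement: the bounds already imply uniform(0, total_lor)
}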

If some of the advice is confusing or you have other trouble applying it, feel free to ask for clarifications.

Hope that helps at least a bit and best of luck with your model!

Thank you so much for replying, and sorry for taking a while to respond. Since I initially posted my question I have gone through many different iterations trying to get this to work. Your answer helped me solve some of my sub-problems, much appreciated! I have not yet managed to get it all to work together, though.

Regarding your points:

  1. This one caused me some headaches until you pointed out the requirement for bounds. I don’t think I can split the series the way you suggest, even though it would make the code cleaner, because I want to use the whole series in further modelling later. I have solved it using the transformed parameters block, masking (is this the right terminology?) the different parameters into a single vector.

  2. Completely true. This is how I run it now in my models.

  3. For my data set this is only true element-wise: total[i] > dissolved[i] holds, but total[i+1] > dissolved[i] may not. (I have multiple chemical compounds in the data set with very different concentrations; some may be in the 10-10 ug/L range, others around 55000 ug/L.)

  4. There is more to the model than what I posted. I have these measurements pre- and post-treatment, so later I wish to calculate the standardized treatment effect size (\bar{P}^{pre}_i - \bar{P}^{post}_i)/P^{pre}_{\sigma, i} and use it in my factorial model (e.g. \text{eff_size} = A + B + C) to see the treatment contribution of each factor, and I do this in a hierarchical model. That part of the model works nicely; I made some simplifications to calculate \text{eff_size} in R and passed it to Stan as data. Now I want to augment the model with the full uncertainty (sketched just below).
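
To make point 4 a bit more concrete, the effect-size step I have in mind looks roughly like the fragment below. Here particulate_pre and particulate_post are placeholder matrices (experiments by replicates) of latent particulate concentrations coming out of the measurement-error part, and the factorial part is only schematic; the real model has more structure:

transformed parameters {
  // matrix[N_exp, N_rep] particulate_pre and particulate_post assumed defined earlier
  vector[N_exp] eff_size;
  for (i in 1:N_exp) {
    eff_size[i] = (mean(particulate_pre[i]) - mean(particulate_post[i]))
                  / sd(particulate_pre[i]);
  }
}

model {
  // factorial part, schematically: eff_size ~ normal(A[fA] + B[fB] + C[fC], sigma_eff);
}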

And now I have scaled things back to a simpler data set to try to understand the measurement error model and how to specify it properly in Stan. This is where I am stuck right now. Setting the censored data points aside, I get many divergences when I incorporate estimating the standard deviation of the difference into the measurement-error model.

With the following toy data set:

data <- list(
  N = 5,
  T_obs_mu = c(55, 1430, 368, 34769, 6438),
  T_obs_sd = c(7, 200, 21, 2050, 350),
  D_obs_mu = c(35, 1480, 5, 29786, 6864),
  D_obs_sd = c(5, 150, 1, 1640, 150)
)

I can estimate P_{\mu} with:


data {
  int<lower=0> N;
  vector[N] T_obs_mu;
  vector[N] T_obs_sd;
  vector[N] D_obs_mu;
  vector[N] D_obs_sd;
}

transformed data {
  vector[N] P_obs_sd;
  
  for (i in 1:N) {
    P_obs_sd[i] = sqrt(square(T_obs_sd[i]) + square(D_obs_sd[i]));
  }
}

parameters {
  vector<lower = 0>[N] T_mu;
  vector<lower = 0>[N] D_mu;
  vector<lower = 0>[N] P_mu;
}

model {
  T_mu ~ normal(T_obs_mu, T_obs_sd);
  D_mu ~ normal(D_obs_mu, D_obs_sd);
  
  P_mu ~ normal(T_mu - D_mu, P_obs_sd);
}


This works nicely and the estimates make sense. However, trying to also estimate P_{\sigma} results in divergences on basically all draws and low effective sample sizes in general.

data {
  int<lower=0> N;
  vector[N] T_obs_mu;
  vector[N] T_obs_sd;
  vector[N] D_obs_mu;
  vector[N] D_obs_sd;
}

transformed data {
  vector[N] P_obs_sd;
  
  for (i in 1:N) {
    P_obs_sd[i] = sqrt(square(T_obs_sd[i]) + square(D_obs_sd[i]));
  }
}

parameters {
  vector<lower = 0>[N] T_mu;
  vector<lower = 0>[N] D_mu;
  vector<lower = 0>[N] P_mu;
  vector<lower = 0>[N] P_sd;
}

transformed parameters {
  vector[N] lambda = inv(P_obs_sd);
}

model {
  T_mu ~ normal(T_obs_mu, T_obs_sd);
  D_mu ~ normal(D_obs_mu, D_obs_sd);
  
  P_sd ~ exponential(lambda);
  P_mu ~ normal(T_mu - D_mu, P_sd);
}

The estimates still seem relatively fine, but the traceplot and pairs plot indicate problems.


Any ideas on how to fix the model?

Sorry, I kind of dropped the ball on this.

Since Stan 2.26 (I think) you can provide a vector of lower/upper bounds, so having

parameters {
  vector[N] total;
  vector<upper=total>[N] dissolved;
}

should work as expected in recent Stan.

Looking at the later model, I think I see what is happening. At least part of the problem is that the range of sensible P_mu values varies a lot with P_sd, which is itself not constrained by the data in any direct way. So you can have P_mu far from T_mu - D_mu with a large P_sd, or P_mu close to T_mu - D_mu with a small P_sd, which creates a difficult funnel-like geometry.

In some cases of funnels, one can use the non-centered parametrization to make the posterior better behaved, but this problem reminds me a bit of the Underdetermined Linear Regression case study, where that cannot help because you simply don't have enough data to learn anything useful. I fear this is also the case here: I don't see how the data could give you any information about P_sd.
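
For reference, a non-centered version of the P_mu part would look roughly like the fragment below, reusing T_mu, D_mu and P_sd from your model. Note that the <lower=0> constraint on P_mu does not carry over directly, which is yet another complication:

parameters {
  vector[N] P_raw;                        // auxiliary standard-normal variable
}

transformed parameters {
  vector[N] P_mu = T_mu - D_mu + P_sd .* P_raw;
}

model {
  P_raw ~ std_normal();                   // implies P_mu ~ normal(T_mu - D_mu, P_sd)
}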

Additionally, the model looks a bit odd in that it corresponds to a process where the parameters are derived from the data, whereas the usual thinking is that some unobserved parameters give rise to the data.

So something like

T_obs_mu ~ normal(T_mu, T_obs_sd);

would be a bit more natural. But maybe there is a good reason for this that I am missing?
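
For concreteness, a minimal sketch of the model written in that direction, using your toy data names and folding in the earlier points about bounds and computing the particulate directly (there are no priors on the latent concentrations here, so you would probably want to add something weakly informative):

data {
  int<lower=0> N;
  vector[N] T_obs_mu;
  vector<lower=0>[N] T_obs_sd;
  vector[N] D_obs_mu;
  vector<lower=0>[N] D_obs_sd;
}

parameters {
  vector<lower=0>[N] T_mu;                // latent true total concentration
  vector<lower=0, upper=T_mu>[N] D_mu;    // latent true dissolved, element-wise <= total
}

transformed parameters {
  vector<lower=0>[N] P_mu = T_mu - D_mu;  // particulate follows deterministically
}

model {
  // observations arise from the latent concentrations
  T_obs_mu ~ normal(T_mu, T_obs_sd);
  D_obs_mu ~ normal(D_mu, D_obs_sd);
}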