Lower BFMI but seems to be a better fit?

I am fitting a nonlinear biochemical/kinetic model with several correlated parameters, which is a task in itself. On top of that I am trying to account for two qualitatively different sources of noise: machine error and experimental error. I discuss it in more detail here, but in essence the machine always injects some additional signal \epsilon into any readout, which leads to a sort of ‘limit-of-detection’ problem, since samples orders of magnitude below \epsilon are hidden in the noise. That is, any measured signal Y_{n}^{measured} is made up of two components:

Y_{n}^{measured} = Z_{n} + \epsilon_{n}

where

Z_{n} \sim \mathcal{N}(Y_{n}^{true},\sigma) is a separate measurement error model for the underlying data, and
\epsilon_{n} \sim \mathcal{N}(L,\tau) is the added machine noise.

(Just in case you’re wondering, machine noise is added in on purpose by the companies that manufacture the detection devices, because they don’t want to report negative values.)
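Since Z_{n} and \epsilon_{n} are independent, the means and the variances of the two components add:

\mathbb{E}[Z_{n} + \epsilon_{n}] = Y_{n}^{true} + L, \qquad \mathrm{Var}(Z_{n} + \epsilon_{n}) = \sigma^{2} + \tau^{2}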

This implies the following marginal model for the measured values:

Y_{n}^{measured} \sim \mathcal{N}( Y_{n}^{true} + L,\ \sqrt{\sigma^{2} + \tau^{2}} )

I initially thought that, because the variances simply add, \sigma and \tau could not be identified separately, which was also @Bob_Carpenter's suggestion. But I had a hunch that a model could do it, because it is even possible to distinguish them by eye if you know the limit of detection sits at the lower end of the measured values. At those low values (and for measurements where you expect 0, for example blank samples that don’t emit any signal), all you are seeing is \epsilon. On the other hand, measurements far above this value should carry only the experimental measurement error, i.e. if Y_{n}^{true} \gg L, then only \sigma matters. In fact, people routinely estimate \epsilon heuristically, by eye, using blank samples (I explain this in the older post I linked).

So I coded two models: one with a single combined variance and one with two separate variances.
The single-variance model looks something like this in Stan:

transformed parameters {
  vector[N] Ytrue = ...; // nonlinear formula generating Ytrue
  // estimating L on the log scale because it is often very small, e.g. 0.0001
  vector[N] Z = Ytrue + exp(L);
}
model {
  target += normal_lpdf(Ymeasured | Z, sigma);
}

In contrast, the model with two variances looks like this

parameters {
  vector[N] Z_raw; // using non-centered parametrization
}
transformed parameters {
  vector[N] Ytrue = ...; // nonlinear formula generating Ytrue
  // again estimating L on the log scale because it is very small
  vector[N] Z = (Ytrue + sigma * Z_raw) + exp(L);
}
model {
  Z_raw ~ std_normal(); // required for the non-centered parametrization
  target += normal_lpdf(Ymeasured | Z, tau);
}

In addition to these two, I have also made lognormal versions of the above, since I have pretty strong reasons to assume the experimental errors are lognormal, whereas I’m not so sure about the machine error.

The problem I’m encountering is:

  • the two-variance model has better posterior predictions, but requires a lot more samples to converge properly and always has a low E-BFMI
  • the one-variance model fits faster and converges more easily, without any E-BFMI issues, but its posterior prediction is much poorer

Here is the two-variance model’s energy plot and posterior prediction:


And in contrast, here is the one-variance model’s energy plot and posterior prediction:


Any ideas what could be happening? I wonder if I have coded the one-variance model wrong (should I have a vector of L values instead of just a single L?). If it’s any help, when I perform LOO-CV on the two models, the two-variance model has a lot of very high Pareto k values while the one-variance model does fine, so I don’t think I can reliably compare the two… but when I do loo_compare the two-variance model has a far higher elpd.

Any advice is very much appreciated.

While it’s often true that slow convergence reflects a misspecified model, this is not universally true. And the opposite certainly isn’t true: fast convergence is not a good indicator that the model fits well. It sounds like you have both a priori and a posteriori reasons for thinking the two-variance model is a better model for your data, and you are encountering the challenge that it is difficult to get a reliable posterior out of this model, due to the E-BFMI issues. In this case, the first thing to ask is just how low the E-BFMI is. If it’s greater than about 0.2, and you can get acceptable R-hats without divergences by running a bunch of iterations, then you can probably proceed with inference from your model without worrying too much. If the E-BFMI is substantially lower than 0.2, you can try reparameterizing the two-variance model to see if you can get a version that yields a reliable posterior.

One trick that can be useful is to parameterize in terms of the total variance (or its square root, a combined standard deviation), plus a parameter on the unit interval that allocates a proportion of that variance to one effect and the remainder to the other. This can break correlations when the two variances are weakly identified individually but their sum is strongly identified.
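For concreteness, here is a minimal sketch of that parameterization in Stan; the names sd_total and prop are placeholders rather than anything from your model:

parameters {
  real<lower=0> sd_total;        // sqrt of the total variance, sigma^2 + tau^2
  real<lower=0, upper=1> prop;   // share of the total variance allocated to sigma
}
transformed parameters {
  real<lower=0> sigma = sd_total * sqrt(prop);      // experimental scale
  real<lower=0> tau = sd_total * sqrt(1 - prop);    // machine-noise scale
}

Partitioning the variance (rather than the standard deviation) and taking square roots keeps sigma^2 + tau^2 = sd_total^2, so the well-identified total scale lives in a single parameter and only the weakly identified allocation lives in prop.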

For what it’s worth, I’m suspicious of the claim that the machine noise is added on purpose so that the device doesn’t report negative values. Adding iid Gaussian noise will encourage negative values, not discourage them! Adding a positive offset L of course will discourage negative values, but adding some additional noise won’t help any further, and will actually make negative values more likely.

Moreover, if your experimental errors really are lognormal, as you assume, then the machine would never accidentally output a negative value absent the addition of the machine noise.

All of this makes me a bit suspicious that your description might not be the best conceptual model for what the machine noise represents and where it comes from. If modeling the machine noise is important to your inference, then it’ll be important to make sure that your model adequately captures the data generating process here.

Your suspicions are justified; please allow me to clarify:

The measurements are spectrophotometric (photons into a diode). The machine can be used for a variety of purposes, so whether the experimental errors themselves are assumed to be lognormally or normally distributed depends on the type of sample. For example, bacterial densities can have multiplicative errors within the range they are measured, because growth happens on the log scale. But the reason the manufacturers inject an offset L is to handle blank samples: samples with no bacteria, or no other chemical/colloid/etc. that would affect transmittance.

The offset is achieved by injecting a bit of extra voltage into every measurement before the analog-to-digital conversion. So there is noise associated with this L (this additional noise is not added on purpose; it is simply unavoidable), and therefore samples do not decrease cleanly all the way down to a fixed L but begin to be affected by it a bit earlier, depending on the small fluctuations around L. In other words, without noise, very low bacterial densities would approach L asymptotically, but in practice samples with low densities hover around the detection limit.

Sorry that I did not give all the details at the outset; in the hope of making the question cleaner to read I may have omitted important details and actually slowed down the discussion. In practice, the two-variance parametrization that works best for me uses lognormal distributions for both Z and \epsilon, partly because Ytrue (bacterial density) is calculated on the log scale:

parameters {
  vector[N] Z_raw; // non-centered parametrization
}
transformed parameters {
  vector[N] Ytrue_log = ...; // nonlinear formula for log bacterial density
  vector[N] Z = exp(Ytrue_log + sigma * Z_raw) + exp(L);
}
model {
  Z_raw ~ std_normal();
  target += lognormal_lpdf(Ymeasured | log(Z), tau);
}

This two-variance parametrization usually has a BFMI below 0.1, so I think it may be pathological (?).

I have tried normal targets for both the one- and two-variance parametrizations and they work very poorly, ranging from sampling without converging to not even sampling properly (because of infinities after exponentiating Ytrue with Gaussian errors). I can give more details about that if this is important information.

I tried to do what you suggested and partition the variances like so. Please forgive me if my implementation is poor and unprincipled; I am a self-taught biologist who wants to get it right but lacks proper theoretical training.

parameters {
  vector[N] Z_raw; // non-centered parametrization
  real<lower=0, upper=1> sigma_par; // share of the scale given to the experimental noise
}
transformed parameters {
  vector[N] Ytrue_log = ...; // nonlinear formula for log bacterial density
  vector[N] Z = exp(Ytrue_log + sigma * sigma_par * Z_raw) + exp(L);
}
model {
  Z_raw ~ std_normal();
  sigma_par ~ beta(3, 2);
  target += lognormal_lpdf(Ymeasured | log(Z), sigma * (1 - sigma_par));
}

My prior for sigma_par reflects that the noise derived from L should be smaller than the experimental noise due to growth variation.

I am using exponential(1) priors for the variances

Just a thought, not exactly related to your question: have you considered the possibility that the machine noise is indeed normally distributed, but that the experimental noise is actually heteroscedastic, i.e. scales with the expectation? Although it’s long ago for me, I remember from analysing experimental data in the lab that measurement or “experimental” error is often log-normally distributed. If you have enough observations to accurately estimate the SD of that log-normal noise, you might even be able to pick up the SD of the machine noise near the detection limit. I imagine you’d need many observations for that, though.

An alternative is to model left-censoring of observations near/under the detection limit, which I’ve found to work well for me.
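As a rough illustration of what I mean, a left-censored likelihood in Stan could look something like the sketch below, where lod is a known detection limit and mu is a stand-in for the model prediction (you would use your nonlinear Ytrue per observation instead):

data {
  int<lower=0> N_obs;    // measurements above the detection limit
  int<lower=0> N_cens;   // measurements at or below the detection limit
  vector[N_obs] y_obs;   // the uncensored measurements
  real lod;              // known detection limit
}
parameters {
  real mu;               // placeholder for the model prediction
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  // each censored observation contributes Pr(Y <= lod) to the likelihood
  target += N_cens * normal_lcdf(lod | mu, sigma);
}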

Update: it seems that I am getting closer to an answer. Of my three different detection limits (for three different types of light source on the machine), only two have samples that approach the apparent detection limit; the third light source can measure all the way down past the most dilute sample, so the model is not able to estimate that detection limit. This makes it difficult to explore the posterior for the limit of detection and for the standard deviation parameters of that particular light source, leading to the low BFMI.

Compare again the one-variance vs. two-variance energy plots for K (a parameter that shouldn’t be affected by the one- vs. two-variance parametrization), for the values L of the detection limits themselves for each of the three light sources, and for their corresponding sigma and sigma_par, the latter being a number between 0 and 1 that indicates how much of the variance is in \sigma vs. \tau (using the notation from the first post). The idea of partitioning the variance using a total variance and a (0,1) partition parameter is based on @jsocolar's reply in this post.

You can clearly see that, in the two-variance parametrization (with both sigma and sigma_par), the pathological parameters are those for the third light source. The one-variance parametrization (without fluctuations in the extra injected voltage) doesn’t seem to have this problem: even though LOD[3] stretches across three orders of magnitude, unlike the other two detection limits, the model has a much easier time, I guess, dealing with that unknown level of variation when there isn’t also a scale parameter for it in the mix.

One variance:

Two variance:

Other possibly interesting points for posterity:

  • I got slightly better BFMI and fits by changing the prior for the variances to std_normal()
  • I tried partitioning the squared scale (the variance) rather than the standard deviation, but I did not see a difference in the fit
  • fitting is worse when there is only a single sigma and three sigma_par (which would imply the lognormal noise on Ytrue is separate from the measurement process), and also when there is only one sigma and one sigma_par