How to handle constrained parameters that can have density close to bounds?

I understand that, due to the way STAN handles parameter constraints, parameters lower-bounded at 0 can never actually take the value 0. What are recommended strategies for dealing with parameters that, conceptually, should be bounded, but may in some cases have values extremely close to the boundary?

My personal context with this problem:

I’ve been using STAN to fit some pathogen decay models to data which take the form Count (Y) vs. Time (t). In a typical setting, we might be interested in relating an exogenous variable, such as temperature (T), to a first order rate of pathogen decay, resulting in a model, over all experiments j in {1,…,J}, with observations n in {1…Nj}, e.g.:

log Yn ~ normal ( log Y0j - Kj tn , σYj)
log Kj ~ normal ( log Kref + λ ( Tj - Tref ), σK)
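For concreteness, here is a minimal Stan sketch of this baseline model, parameterizing log Kj directly; the data names (t, logY, expt, T, Tref) and the priors are placeholders of mine rather than anything from the analysis described above:

data {
  int<lower=1> N;                       // total number of observations
  int<lower=1> J;                       // number of experiments
  array[N] int<lower=1, upper=J> expt;  // experiment index for each observation
  vector<lower=0>[N] t;                 // observation times
  vector[N] logY;                       // log counts
  vector[J] T;                          // experiment temperatures
  real Tref;                            // reference temperature
}
parameters {
  vector[J] logY0;            // log initial count per experiment
  vector[J] logK;             // log decay rate per experiment
  real logKref;               // log decay rate at the reference temperature
  real lambda;                // temperature sensitivity
  real<lower=0> sigmaK;       // between-experiment sd of log K
  vector<lower=0>[J] sigmaY;  // observation noise per experiment
}
model {
  // hierarchical regression of log K on temperature
  logK ~ normal(logKref + lambda * (T - Tref), sigmaK);
  // first-order decay likelihood on the log scale
  logY ~ normal(logY0[expt] - exp(logK[expt]) .* t, sigmaY[expt]);
  // placeholder priors, not from the original post
  logY0 ~ normal(0, 10);
  logKref ~ normal(0, 5);
  lambda ~ normal(0, 1);
  sigmaK ~ normal(0, 1);
  sigmaY ~ normal(0, 1);
}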

A common complication to the above model occurs in datasets where decay begins after some lag period, during which the population is relatively stable. These periods can be controlled for using a slight modification of the above model with the added parameter tDj, the lag time between the start of experiment j and the onset of decay:

if tn ≤ tDj : log Yn ~ normal ( log Y0j , σYj)
if tn > tDj : log Yn ~ normal ( log Y0j - Kj ( tn - tDj ) , σYj)

log Kj ~ normal ( log Kref + λ ( Tj - Tref ), σK)
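A corresponding sketch of just the lag-model likelihood, with the lag duration tD as a lower-bounded parameter and the decay measured from the onset time via fmax(0, tn - tDj); the temperature regression on log Kj and all priors are omitted, and the names are the same placeholders as in the sketch above:

data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> expt;
  vector<lower=0>[N] t;
  vector[N] logY;
}
parameters {
  vector[J] logY0;
  vector<lower=0>[J] K;
  vector<lower=0>[J] tD;      // lag before the onset of decay, bounded below at 0
  vector<lower=0>[J] sigmaY;
}
model {
  vector[N] mu;
  for (n in 1:N) {
    int j = expt[n];
    // flat at logY0 until tD, then first-order decay;
    // fmax(0, t - tD) is a compact way of writing the two branches above
    mu[n] = logY0[j] - K[j] * fmax(0, t[n] - tD[j]);
  }
  logY ~ normal(mu, sigmaY[expt]);
  // hierarchical prior on K (temperature regression) and other priors omitted
}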

Across experimental datasets, some experiments will exhibit a discernible lag phase while others will not. Other experiments might be noisy or have imprecise measurements of Yn, such that there is little evidence in the likelihood to indicate whether or not a lag phase is present.

I’ve tried a few different approaches to dealing with the lag phases:

  • Using a mixture distribution over both models (essentially model 1 is tDj = 0 and model 2 is tDj > 0) for each experiment (a Stan sketch of this marginalized mixture appears after this list)
    - Works well in terms of main parameter recovery, though experiments that are weakly identified as one model or the other often have low effective sample sizes due to slight bimodality for their parameters
    - Relatively slow, and I think it might be contributing to difficulty sampling from more complex models where additional mixture components are present

  • Using the more complex model for all experiments
    - Slightly biases decay rate estimates in cases where some experiments actually do not have a lag phase
    - Hits maximum treedepth, which I guess is due to the very long trajectories needed to reach a U-turn on the unconstrained scale when the optimal value of the constrained parameter is near the boundary tDj = 0

  • Allowing tDj < 0
    - Conceptually problematic, since there is either an unidentifiable pairing of tDj and Y0j (and related treedepth problems when constraining priors are added) or an essentially incorrect model if we still treat the intercept as the first observed count.
    - The incorrect model, where the intercept is treated as the first observed count, is easy to sample but gives biased estimates of Kj as the apparent decay rates flatten with negative values of tDj (though temperature sensitivity λ is fairly well estimated)
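For reference, here is a minimal sketch of how the first (mixture) approach can be marginalized in Stan: each experiment's observations are scored under both the no-lag and the lag model, and the two log likelihoods are combined with log_mix. The names (pi_lag, etc.) are placeholders, and the hierarchical priors are omitted:

data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> expt;
  vector<lower=0>[N] t;
  vector[N] logY;
}
parameters {
  vector[J] logY0;
  vector<lower=0>[J] K;
  vector<lower=0>[J] tD;               // lag duration, used only by the lag component
  vector<lower=0, upper=1>[J] pi_lag;  // probability that experiment j has a lag phase
  vector<lower=0>[J] sigmaY;
}
model {
  vector[J] lp_nolag = rep_vector(0, J);
  vector[J] lp_lag = rep_vector(0, J);
  for (n in 1:N) {
    int j = expt[n];
    lp_nolag[j] += normal_lpdf(logY[n] | logY0[j] - K[j] * t[n], sigmaY[j]);
    lp_lag[j] += normal_lpdf(logY[n] | logY0[j] - K[j] * fmax(0, t[n] - tD[j]), sigmaY[j]);
  }
  for (j in 1:J) {
    // marginalize the per-experiment model indicator
    target += log_mix(pi_lag[j], lp_lag[j], lp_nolag[j]);
  }
  // hierarchical priors on logY0, K (temperature regression), tD, pi_lag,
  // and sigmaY omitted for brevity
}

As in the piecewise specification above, the mean still has a kink at tn = tDj; marginalizing the model indicator just avoids having to sample a discrete parameter.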

Is there any approach that would be superior to the mixture model in ease of sampling while remaining minimally biased for cases when tDj = 0?


My advice is to go with the mixture model and figure out why it’s slow. Slight bimodality wouldn’t do that, so you’re missing the real cause.

Thanks, Krzysztof. The mixture model isn’t really slow in the form presented here - just slightly slower than the model in which all experiments have a lag phase (and without treedepth issues). The model in which the duration of the lag phase may be negative runs about twice as fast, but is biased. The real struggle comes when an additional form of decay (and its lagged version) are added as mixture components. In this case, the decay rate tapers off, suggesting a mixture of two populations decaying at different rates:

log Yn ~ normal( log Y0j + log( αj exp(-Kfastj tn) + (1 - αj) exp(-Kslowj tn) ), σYj )

log Kfastj ~ normal( log Kfastref + λfast (Tj - Tref), σKfast )
log Kslowj ~ normal( log Kslowref + λslow (Tj - Tref), σKslow )
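As an aside, the log of that biphasic mean can be computed stably with Stan's log_mix function, which evaluates log(α exp(a) + (1 - α) exp(b)). A sketch with placeholder names, omitting the temperature regressions and all priors:

data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> expt;
  vector<lower=0>[N] t;
  vector[N] logY;
}
parameters {
  vector[J] logY0;
  vector<lower=0, upper=1>[J] alpha;  // mixing proportion of the fast population
  vector<lower=0>[J] Kfast;
  vector<lower=0>[J] Kslow;
  vector<lower=0>[J] sigmaY;
}
model {
  vector[N] mu;
  for (n in 1:N) {
    int j = expt[n];
    // log( alpha * exp(-Kfast * t) + (1 - alpha) * exp(-Kslow * t) )
    mu[n] = logY0[j] + log_mix(alpha[j], -Kfast[j] * t[n], -Kslow[j] * t[n]);
  }
  logY ~ normal(mu, sigmaY[expt]);
  // temperature regressions on log Kfast and log Kslow and all priors omitted
}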

When all experiments are treated as components of a mixture of the tailing model with the constant rate model, and their respective lagged versions, maximum treedepth is exceeded, some experiments have very poor posterior diagnostics, and n_eff / n_total goes from about 75% to about 12.5%. I think the difficulty might lie with apparently constant rate experiments whose decay rates better align with the conditional distributions of fast or slow rates from tailing experiments, in which case they may have most of their posterior density in the tailing model with mixing proportion parameter αj close to 0 or 1.

One thing that occurred to me is, rather than sampling αj on the logit scale, to sample the apparent offset in intercept for the slow-decaying population, log(1 - αj), on the log scale (upper-bounded at 0). I’m hoping that this approach helps by removing one of the bounds to which the model fit is particularly sensitive (αj often has to be close to 1 for the fast decay phase to be distinguishable from noise, especially if fast decay occurs over the entire observed time series). The remaining bound is less critical (i.e. the average decay rate is close to the slow rate if αj isn’t close to 1), so hopefully the sampler won’t have to spend much time taking many small steps in the region where log(1 - αj) is close to 0.
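A minimal sketch of that reparameterization, assuming α is only needed through this transform; the prior shown is a placeholder, not a recommendation:

data {
  int<lower=1> J;
}
parameters {
  // log(1 - alpha_j), upper-bounded at 0, in place of logit(alpha_j)
  vector<upper=0>[J] log1m_alpha;
}
transformed parameters {
  vector<lower=0, upper=1>[J] alpha = 1 - exp(log1m_alpha);
}
model {
  // placeholder prior stated directly on log(1 - alpha); a prior stated on
  // alpha itself would instead require adding the log Jacobian,
  // sum(log1m_alpha), to target
  log1m_alpha ~ normal(0, 1);
}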

What happens is that lower-bounded parameters are log transformed to the unconstrained scale, so the sampler works with an unconstrained value that gets exponentiated back to the constrained scale. If the unconstrained value is sufficiently negative (below roughly -745 in double precision), the exponential underflows and rounds to zero. Technically, we shouldn’t get zero values, but we do in practice.

@wds15 may be able to answer about the lags when he gets back—I think he knows how to fit them.

Michael Betancourt wrote a useful case study on mixtures—you can find it on the web site under users >> documentation >> case studies.

It’s usually because you need to take very small steps in order to preserve a high acceptance rate. Increasing max tree depth may improve sampling. If there’s true identifiability with priors, then increasing tree depth won’t help.

P.S. It’s “Stan”—it’s not an acronym.

Thanks Bob! I’ve seen the case study that Michael wrote on mixture models. It’s very helpful. I’m in a situation where it’s appealing to apply a mixture model across many experiments, only some of which can properly identify all parameters of the mixture component to which they correspond. I guess the lag issue isn’t really the problem in this case, though I wonder if there’s a way to handle it that’s more efficient than doubling the number of mixture components for other models just to include lagged and unlagged versions. The more difficult problem is for the biphasic decay model I described in the last post.

Theoretically, data can appear to have a constant decay rate if only one part or the other of a biphasic curve is observed. The idea I’m trying to execute is basically to estimate mixture component membership (monophasic or biphasic models) based on clustering of similar decay rate coefficients, i.e. if a constant decay rate experiment has a rate coefficient that is extraordinarily fast compared to other monophasic data, but similar to fast rates from biphasic data, it should be classified as belonging to the biphasic mixture component with parameter α close to 1. In these cases, however, the unconstrained parameter for α only has an identifiable lower bound.

If there’s true identifiability with priors, then increasing tree depth won’t help.

Has anybody developed guidelines for appropriate priors in a situation like this? I’m having difficulty thinking of something that might work without ‘peeking’ at the data, since appropriate upper bounds for logit(α) depend on the duration of each experiment, as well as typical values of the biphasic decay rate coefficients conditional on temperature. It’s also difficult to conceive of priors that add similarly steep curvature to the likelihood around these upper bounds to mirror what’s happening at the well-identified lower bounds.

I assume that meant true non-identifiability.

If you only identify a lower bound on a parameter and it’s otherwise uniform, it remains unbounded above and the posterior can’t be normalized properly.

I’d think that rather than putting hard upper and lower bounds on it, you’d instead use some kind of hierarchical model to pull everything toward a well-behaved population estimate.

I assume that meant true non-identifiability.

Oh, that makes more sense…

Unfortunately, there’s not a good theoretical reason to put hierarchical priors on α. What I’ve been doing recently is putting loose priors, independent for each experiment, on the initial count for each population in the biphasic context:

log Yn ~ normal( log Y0j + log( αj exp(-Kfastj tn) + (1 - αj) exp(-Kslowj tn) ), σYj )
is equivalent to
log Yn ~ normal( log( exp(Y0fastj - Kfastj tn) + exp(Y0slowj - Kslowj tn) ), σYj )
where αj = inv_logit(Y0fastj - Y0slowj)

I give the Y0fast and Y0slow parameters normal priors with mean at the mean concentration for the experiment and scale equal to twice the standard deviation of concentration. That’s sufficient to allow the slow-decaying population to be so minuscule as to be unobservable within the experiment, while preventing the likelihood from being flat out to infinity / -infinity. It doesn’t keep the curvature in the likelihood from being very asymmetrical around the region not identified from the data, so sampling can still be difficult…
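Here is a sketch of that parameterization, assuming the per-experiment prior centers and scales are computed from the observed log counts and passed in as data (logY_bar and logY_sd are placeholder names of mine, as is the rest):

data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> expt;
  vector<lower=0>[N] t;
  vector[N] logY;
  vector[J] logY_bar;   // per-experiment mean of the observed log counts
  vector[J] logY_sd;    // per-experiment sd of the observed log counts
}
parameters {
  vector[J] Y0fast;     // log initial count of the fast-decaying population
  vector[J] Y0slow;     // log initial count of the slow-decaying population
  vector<lower=0>[J] Kfast;
  vector<lower=0>[J] Kslow;
  vector<lower=0>[J] sigmaY;
}
model {
  vector[N] mu;
  for (n in 1:N) {
    int j = expt[n];
    // log( exp(Y0fast - Kfast * t) + exp(Y0slow - Kslow * t) ), computed stably
    mu[n] = log_sum_exp(Y0fast[j] - Kfast[j] * t[n],
                        Y0slow[j] - Kslow[j] * t[n]);
  }
  logY ~ normal(mu, sigmaY[expt]);
  // loose, experiment-specific priors as described above
  Y0fast ~ normal(logY_bar, 2 * logY_sd);
  Y0slow ~ normal(logY_bar, 2 * logY_sd);
  // temperature regressions on log Kfast and log Kslow and remaining priors omitted
}
generated quantities {
  // implied mixing proportion of the fast population
  vector[J] alpha = inv_logit(Y0fast - Y0slow);
}

Nothing here forces Kfast > Kslow, so the fast/slow labels rely on the (omitted) hierarchical priors on the rates for identification.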

Hmm, on second thought maybe a common logistic prior with unknown location and scale wouldn’t be the worst idea… I still expect the unidentified experiments would wind up being difficult to sample, but I’ll give it a shot…
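For what it’s worth, one minimal form of that idea, written in terms of a standalone unconstrained logit_alpha rather than the Y0fast - Y0slow difference used above; the hyperpriors are placeholders:

data {
  int<lower=1> J;
}
parameters {
  vector[J] logit_alpha;     // unconstrained mixing proportions
  real mu_alpha;             // common location
  real<lower=0> s_alpha;     // common scale
}
transformed parameters {
  vector<lower=0, upper=1>[J] alpha = inv_logit(logit_alpha);
}
model {
  // common logistic prior with unknown location and scale;
  // mu_alpha = 0, s_alpha = 1 would recover a uniform prior on alpha
  logit_alpha ~ logistic(mu_alpha, s_alpha);
  mu_alpha ~ normal(0, 2.5);   // placeholder hyperpriors
  s_alpha ~ normal(0, 1);
}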

I recommend changing the structure of your model entirely. The main feature of your model is that the decay rate varies in time, going from say zero at time t=0 to some constant asymptotically for large t. Some situations may involve a rapid transition almost immediately, others may involve staying near zero for a while until later when the transition occurs.

Define an ODE for rate(t) that captures these features; make it a simple enough ODE that you can perhaps get a closed-form solution. Then use Maxima or Wolfram Alpha or the like to get a function that has these features in terms of 2 to 4 parameters, and then use the closed-form solution with its parameters to reformulate your model.

So the common hierarchical prior across experiments actually worked very well! No treedepth issues and good effective sample sizes for almost everything. I had tried a version of it previously that wasn’t successful, but I think working with the unconstrained parameters rather than α itself helped quite a bit. Thanks for the suggestion, Bob.

Daniel, I’m interested in exploring the dynamic modeling approach at some point, but I’m first trying to work within a set of models that are similar to what already exists in this sort of literature. The tailing phenomenon is likely to require multiple population compartments, in which case the simplest model would be equivalent to what I’ve already shown, which is a solution to the set of equations

dYfast/dt = -Kfast Yfast
dYslow/dt = -Kslow Yslow
Y = Yfast + Yslow

Examples of other models that can explain the same phenomenon can be found here and here, but are not identifiable without additional types of biochemical data.

The lag phases have been modeled in more detail in some very interesting recent work as a function of available nutrients for the bacterial population, though the authors still wind up using a piecewise specification.

Two sub-populations of bacteria with different rates are definitely a possibility. Consider the continuous version of this model, though: let

P(K) be a probability density on K.

Let Y(K,t) be the density of the bacterial population over decay rates K at time t, where Y(K,0) = Y0 * P(K) and Y0 is the initial total population of bacteria.

Then in a small time unit the population with decay rate K changes like

Y(K,t+dt) - Y(K,t) = -K Y(K,t) dt

And Ytot(t) = integrate(Y(K,t),dK)

For each K the solution for the differential equation is simply Y(K,t) = Y(K,0) * exp(-Kt)

so Ytot(t) = integrate(Y(K,0)*exp(-K*t), dK) = integrate(Y0*P(K)*exp(-K*t), dK)

If the initial distribution for K is sufficiently nice, you can potentially do this integral using Maxima, Wolfram Alpha, etc. and get yourself a nice simple curve in terms of some parameters that describe the initial distribution of K values. For example, using a gamma distribution for K and doing the calculation in Maxima:

integrate(Y0tot*1/gamma(n)/theta^n * K^(n-1)*exp(-K/theta)*exp(-K*t),K,0,inf);
Is n positive, negative or zero?

pos;
Is theta (t theta + 1) positive, negative or zero?

pos;
Is n an integer?

no;
(%o9)  Y0tot / ((1/theta + t)^n * theta^n)
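That result simplifies to Ytot(t) = Y0tot / (1 + theta t)^n, so the log mean curve is log Y0tot - n log1p(theta t). A sketch of a single-experiment Stan model built on that closed form (names and priors are placeholders of mine):

data {
  int<lower=1> N;
  vector<lower=0>[N] t;
  vector[N] logY;
}
parameters {
  real logY0tot;           // log of the initial total population
  real<lower=0> n_shape;   // gamma shape for the distribution of K
  real<lower=0> theta;     // gamma scale for the distribution of K
  real<lower=0> sigmaY;
}
model {
  // Ytot(t) = Y0tot / (1 + theta * t)^n, the Maxima result simplified,
  // so the log mean curve is logY0tot - n * log1p(theta * t)
  logY ~ normal(logY0tot - n_shape * log1p(theta * t), sigmaY);
  // placeholder priors
  logY0tot ~ normal(0, 10);
  n_shape ~ normal(0, 5);
  theta ~ normal(0, 5);
  sigmaY ~ normal(0, 2);
}

In this parameterization the apparent decay rate at time t is n theta / (1 + theta t), which tapers off over time, so the tailing behavior comes from the heterogeneity in K alone.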

Took me a while to get it, but that’s an interesting model, for sure. My current model is the same, except P(K) is a mixture distribution with two components, each with mean and variance conditional on temperature. It would be interesting to examine the performance of non-mixture distributions for P(K) and what might be inferred about the effect of exogenous variables in those formulations.

Edit: I guess the current model for K at the cell count level is actually a mixture of two delta functions…

What about lag phases? Would that imply some sort of sigmoid function of time on the distribution of decay rates? (I guess another possibility could be Weibull distributed survival times, but I’m not crazy about the Weibull distribution because its rate parameter can’t be interpreted independently from the shape parameter)

Yes, whereas my suggested model is a continuous mixture of various K values.

Yes, that was what I was originally thinking of: an ODE that describes K as a function of time, probably tracking a sigmoid-type curve. I doubt very much you’ll get a nice closed-form model with both this sigmoid K(t) and a continuous mixture of K values… so I think the question is which is more important: the heterogeneity of the population, or the time-varying rate of decay? Or, if you really want, you can go large and start using Stan’s ODE routines… don’t jump right there though; start small and work your way up.