Interval-valued regression with Stan

Hello Stan community,
I’m working on a model where my observed variable, obs*, is generated as obs* = obs + error, with error ~ N(0, sigma). I also have additional information that obs* falls within the interval [obs-, obs+]. I’m wondering how to correctly specify the likelihood function in Stan that incorporates this interval constraint.

Any insights or suggestions on how to model this situation in Stan would be greatly appreciated!

Thank you in advance for your help.

Do you want to model some process error separate from this observation error, or is the entire residual assumed to be this bounded error term?


I’m actually attributing the entire residual to a single bounded error term, in such a way that the likelihood that obs* falls into the interval becomes

\operatorname{Pr}\left(obs^*\in [obs^-, obs^+]\right)=\operatorname{Pr}(obs^* \leq obs^+)-\operatorname{Pr}(obs^*\leq obs^-)

Let me make sure I understand. I think that you’re saying that

\operatorname{Pr}\left(obs^*\in [obs^-, obs^+]\right)=\operatorname{Pr}(obs^* \leq obs^+)-\operatorname{Pr}(obs^*\leq obs^-)

is one by assumption, with the first term on the right-hand side equal to one and the second term equal to zero.

If that’s right, then the next question is whether there is any choice of coefficients in the model that enables you to satisfy all of these constraints. An example where this would be impossible:
x1 = 0, obs1- = -1, obs1+ = 1
x2 = 1, obs2- = 9, obs2+ = 11
x3 = 2, obs3- = -1, obs3+ = 1
where the model is obs* regressed on x. There’s no way to draw a straight line that passes through all three intervals.
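To see why: any line y = a + bx passing through the first and third intervals must satisfy a \in [-1, 1] and a + 2b \in [-1, 1]; since a + b = \tfrac{1}{2}\left(a + (a + 2b)\right), the line's value at x = 1 also lies in [-1, 1] and cannot reach [9, 11].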

If we are on the right track, and this regression definitely has a valid solution, then we can start to think about how to implement it. A final question for now:

Do you mean that the error is drawn from something that looks like a truncated Gaussian, or do you intend to specify that the error is drawn from something approximately Gaussian that is narrow enough to remain approximately Gaussian even as it respects the constraint on obs*?

To address your question, allow me to provide a specific example of the following form:

y = a + bx + e

Here, e is a random variable following a normal distribution N(0, \sigma).

In practice, we do not have direct access to y itself, but rather we observe an interval [y^-, y^+] within which y lies.

The objective is to estimate the parameters a, b, and \sigma.

For maximum likelihood estimation, the likelihood function of a single observation is given by:

\Phi\left(\frac{y^+ - (a+bx)}{\sigma}\right) - \Phi\left(\frac{y^- - (a+bx)}{\sigma}\right)

Here, \Phi denotes the cumulative distribution function of the standard normal distribution.
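Assuming the n observations are independent, the full log-likelihood to be maximized is then the sum of the logs of these interval probabilities:

\ell(a, b, \sigma) = \sum_{i=1}^{n} \log\left[\Phi\left(\frac{y_i^+ - (a+bx_i)}{\sigma}\right) - \Phi\left(\frac{y_i^- - (a+bx_i)}{\sigma}\right)\right]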

Ok. Again repeating to ensure I understand. Now we are no longer saying that the entire residual is attributed to a single bounded error term, but rather the opposite: there’s bounded error in our observation of y which is separate from the residual. To proceed, we need to know more about this bounded error distribution. Is it uniform between the bounds? Gaussian with known sd, truncated between the bounds? Something else?

If you know this is the likelihood function you are interested in, just use it (with appropriate priors). In general, likelihood functions don’t change depending on whether we analyze a model under a frequentist or Bayesian mode of inference.

What I’m particularly interested in is the line of Stan code that formulates the likelihood function while accounting for the interval [y^-, y^+] in which y is situated. For instance, we know that if the observation is a single point y, the likelihood in Stan might look like y \sim \text{normal}(a + bx, \sigma). Now the question arises: what would the likelihood expression be in Stan if y is instead only known to fall within the interval [y^-, y^+]?

I’m still confused. Do you believe there’s something wrong with the likelihood function you gave above?

If so, or just to double-check, we can try to rebuild the model ourselves. To do so, we need to clarify exactly what is observed.

You say "my observed variable, obs*. Do you actually observe obs*? The confusing thing is:

If you observe obs* then this information is not additional. Or did you mean to say that the additional information is that the true value obs falls within a known interval?

So, does any of these capture your scenario:

  1. You observe obs*, and you know that the true value of obs lies in some interval [obs-, obs+]
  2. You observe neither obs* nor obs and know only that the true value of obs lies in some interval [obs-, obs+]
  3. You observe neither obs* nor obs and know only that the true value of obs* lies in some interval [obs-, obs+]
  4. Something else?

The scenario that accurately describes my situation is the third one: we observe neither obs* nor obs, and our knowledge is limited to the fact that the true value of obs* lies within the interval [obs-, obs+]. In this context, our observations consist of interval-valued dependent variables of the form [obs-, obs+].

model {
  // Interval-censored likelihood: each observation contributes
  // log( Phi((y_upper[n] - mu_n) / sigma) - Phi((y_lower[n] - mu_n) / sigma) ),
  // where mu_n = a + b * x[n]. The loop is needed because the vectorized
  // normal_lcdf returns the *sum* of the log CDFs, not a vector of them.
  for (n in 1:N) {
    target += log_diff_exp(
      normal_lcdf(y_upper[n] | a + b * x[n], sigma),
      normal_lcdf(y_lower[n] | a + b * x[n], sigma)
    );
  }
}
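For reference, here is a minimal complete program built around that model block. The data names (N, x, y_lower, y_upper) and the weakly informative priors are illustrative assumptions, not prescriptions:

data {
  int<lower=0> N;        // number of observations
  vector[N] x;           // covariate
  vector[N] y_lower;     // lower bounds of the observed intervals
  vector[N] y_upper;     // upper bounds of the observed intervals
}
parameters {
  real a;                // intercept
  real b;                // slope
  real<lower=0> sigma;   // residual standard deviation
}
model {
  // illustrative priors (an assumption; adapt to your problem)
  a ~ normal(0, 10);
  b ~ normal(0, 10);
  sigma ~ normal(0, 5);  // half-normal, given the lower bound on sigma

  // interval-censored likelihood, as in the answer above
  for (n in 1:N) {
    target += log_diff_exp(
      normal_lcdf(y_upper[n] | a + b * x[n], sigma),
      normal_lcdf(y_lower[n] | a + b * x[n], sigma)
    );
  }
}

Each interval then contributes exactly the probability mass \Phi\left(\frac{y^+ - (a+bx)}{\sigma}\right) - \Phi\left(\frac{y^- - (a+bx)}{\sigma}\right) written earlier in the thread.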