Interval-valued regression with Stan

Hello Stan community,
I’m working on a model where my observed variable, obs*, is generated as obs* = obs + error, with error ~ N(0, sigma). I also have additional information that obs* falls within the interval [obs-, obs+]. I’m wondering how to correctly specify the likelihood function in Stan that incorporates this interval constraint.

Any insights or suggestions on how to model this situation in Stan would be greatly appreciated!

Thank you in advance for your help.

Do you want to model some process error separate from this observation error, or is the entire residual assumed to be this bounded error term?


I’m actually attributing the entire residual to a single bounded error term, in such a way that the likelihood that obs* falls into the interval becomes

\operatorname{Pr}\left(obs^*\in [obs^-, obs^+]\right)=\operatorname{Pr}(obs^* \leq obs^+)-\operatorname{Pr}(obs^*\leq obs^-)

Let me make sure I understand. I think that you’re saying that

\operatorname{Pr}\left(obs^*\in [obs^-, obs^+]\right)=\operatorname{Pr}(obs^* \leq obs^+)-\operatorname{Pr}(obs^*\leq obs^-)

is one by assumption, with the first term on the right-hand side equal to one and the second term equal to zero.

If that’s right, then the next question is whether there is any choice of coefficients in the model that enables you to satisfy all of these constraints. An example where this would be impossible:
x1 = 0, obs1- = -1, obs1+ = 1
x2 = 1, obs2- = 9, obs2+ = 11
x3 = 2, obs3- = -1, obs3+ = 1
where the model is obs* regressed on x. There’s no way to draw a straight line that passes through all three intervals.
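To see why: any line y = a + bx passing through the first and third intervals must satisfy a \in [-1, 1] and a + 2b \in [-1, 1]; since a + b = \tfrac{1}{2}\left(a + (a + 2b)\right), the line's value at x = 1 also lies in [-1, 1] and cannot reach [9, 11].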

If we are on the right track, and this regression definitely has a valid solution, then we can start to think about how to implement it. A final question for now:

Do you mean that the error is drawn from something that looks like a truncated Gaussian, or do you intend to specify that the error is drawn from something approximately Gaussian that is narrow enough to remain approximately Gaussian even as it respects the constraint on obs*?

To address your question, allow me to provide a specific example of the following form:

y = a + bx + e

Here, e is a random variable following a normal distribution N(0, \sigma).

In practice, we do not have direct access to y itself, but rather we observe an interval [y^-, y^+] within which y lies.

The objective is to estimate the parameters a, b, and \sigma.

For maximum likelihood estimation, the likelihood function of a single observation is given by:

\Phi\left(\frac{y^+ - (a+bx)}{\sigma}\right) - \Phi\left(\frac{y^- - (a+bx)}{\sigma}\right)

Here, \Phi denotes the cumulative distribution function of the standard normal distribution.
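Assuming the n observations are independent, the full log-likelihood to be maximized is then the sum of the logs of these interval probabilities:

\ell(a, b, \sigma) = \sum_{i=1}^{n} \log\left[\Phi\left(\frac{y_i^+ - (a+bx_i)}{\sigma}\right) - \Phi\left(\frac{y_i^- - (a+bx_i)}{\sigma}\right)\right]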

Ok. Again repeating to ensure I understand. Now we are no longer saying that the entire residual is attributed to a single bounded error term, but rather the opposite: there’s bounded error in our observation of y which is separate from the residual. To proceed, we need to know more about this bounded error distribution. Is it uniform between the bounds? Gaussian with known sd, truncated between the bounds? Something else?

If you know this is the likelihood function you are interested in, just use it (with appropriate priors). In general, likelihood functions don’t change depending on whether we analyze a model under a frequentist or Bayesian mode of inference.

What I’m particularly interested in is the line of Stan code that formulates the likelihood function while accounting for the interval [y^-, y^+] in which y is situated. For instance, we know that if the observation is a single point y, the likelihood in Stan might look like y \sim \text{normal}(a + bx, \sigma). Now the question arises: what would the likelihood expression be in Stan if y is instead only known to fall within the interval [y^-, y^+]?

I’m still confused. Do you believe there’s something wrong with the likelihood function you gave above?

If so, or just to double-check, we can try to rebuild the model ourselves. To do so, we need to clarify exactly what is observed.

You say "my observed variable, obs*. Do you actually observe obs*? The confusing thing is:

If you observe obs* then this information is not additional. Or did you mean to say that the additional information is that the true value obs falls within a known interval?

So, does any of these capture your scenario:

  1. You observe obs*, and you know that the true value of obs lies in some interval [obs-, obs+]
  2. You observe neither obs* nor obs and know only that the true value of obs lies in some interval [obs-, obs+]
  3. You observe neither obs* nor obs and know only that the true value of obs* lies in some interval [obs-, obs+]
  4. Something else?

The scenario that accurately describes my situation is the third one: we observe neither obs* nor obs, and our knowledge is limited to the fact that the true value of obs* lies within the interval [obs-, obs+]. In this context, our observations consist of interval-valued dependent variables of the form [obs-, obs+].

model {
  // Interval-censored likelihood: each observation contributes
  // log( Phi((y_upper[n] - mu_n) / sigma) - Phi((y_lower[n] - mu_n) / sigma) ),
  // where mu_n = a + b * x[n]. The loop is needed because the vectorized
  // normal_lcdf returns the *sum* of the log CDFs, not a vector of them.
  for (n in 1:N) {
    target += log_diff_exp(
      normal_lcdf(y_upper[n] | a + b * x[n], sigma),
      normal_lcdf(y_lower[n] | a + b * x[n], sigma)
    );
  }
}
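For reference, here is a minimal complete program built around that model block. The data names (N, x, y_lower, y_upper) and the weakly informative priors are illustrative assumptions, not prescriptions:

data {
  int<lower=0> N;        // number of observations
  vector[N] x;           // covariate
  vector[N] y_lower;     // lower bounds of the observed intervals
  vector[N] y_upper;     // upper bounds of the observed intervals
}
parameters {
  real a;                // intercept
  real b;                // slope
  real<lower=0> sigma;   // residual standard deviation
}
model {
  // illustrative priors (an assumption; adapt to your problem)
  a ~ normal(0, 10);
  b ~ normal(0, 10);
  sigma ~ normal(0, 5);  // half-normal, given the lower bound on sigma

  // interval-censored likelihood, as in the answer above
  for (n in 1:N) {
    target += log_diff_exp(
      normal_lcdf(y_upper[n] | a + b * x[n], sigma),
      normal_lcdf(y_lower[n] | a + b * x[n], sigma)
    );
  }
}

Each interval then contributes exactly the probability mass \Phi\left(\frac{y^+ - (a+bx)}{\sigma}\right) - \Phi\left(\frac{y^- - (a+bx)}{\sigma}\right) written earlier in the thread.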