Sanity check: behavior of residual variance posterior with interval-censored observations

My intuition is that if I have interval-censored observations, I should be getting less information from them than if I had the exact points, i.e. the posterior should be closer to the prior.

However, when I run an interval-censored regression in brms, the residual standard deviations are consistently underestimated and end up even farther from the prior mean, which is the exact opposite of what I would have expected.

Example:

library(brms)
N = 100
x = rnorm(N)
y = rnorm(N, 0.5 * x)   # true outcome, residual SD = 1
yy = y                  # copy used as the measurement-error response
y_se = 0.7              # known measurement-error SD for mi()
y1 = y - 1.5            # lower bound of the censoring interval
y2 = y + 1.5            # upper bound of the censoring interval
y_cens = "interval"     # censoring indicator for cens()
d = data.frame(x, y, yy, y_se, y1, y2, y_cens)
fit_pr = brm(
  bf(y ~ 1), data = d,
  sample_prior = "only", cores = 4
)
# Family Specific Parameters: 
#       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma     2.69      2.73     0.09     9.81 1.00     1946     1275

fit = brm(
  bf(y ~ x) +                          # exact observations
    bf(y1 | cens(y_cens, y2) ~ x) +    # interval-censored: y1 = lower, y2 = upper bound
    bf(yy | mi(y_se) ~ x) +            # response measured with known SD y_se
    set_rescor(FALSE),                 # no residual correlations between responses
  data = d, cores = 4
)
# Family Specific Parameters: 
#          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
# sigma_y      1.06      0.08     0.92     1.22 1.00     5451     2483
# sigma_y1     0.68      0.09     0.52     0.88 1.00     7473     3009
# sigma_yy     0.78      0.10     0.59     1.00 1.00     1293     1411

So we see that the posterior mean residual standard deviations for the censored and measurement-error responses are roughly 25-35% smaller than the error-free estimate and only about a quarter of the prior mean.

I suppose that mathematically it makes sense that smaller variances would maximize the probability of my observations being in the given intervals, but am I wrong in thinking this is not what one would usually want to get as a result? What would be the right way of achieving the behavior I’ve described instead?

I think one way to frame the intuition here is that in your interval-censored and measurement-error cases, you are allowing the regression itself to dictate where within their intervals the points are likely to be, rather than forcing the regression to deal with where the points actually are.
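
Concretely, each interval-censored observation with bounds (l, u) contributes

\begin{align}
P(l < Y < u \mid \mu, \sigma) = \Phi\!\left(\frac{u - \mu}{\sigma}\right) - \Phi\!\left(\frac{l - \mu}{\sigma}\right)
\end{align}

to the likelihood, where mu is the regression mean for that observation and Phi is the standard normal CDF. Whenever the interval straddles mu, this term only grows as sigma shrinks, because a smaller sigma concentrates more of the residual probability mass inside the interval.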

One way to get the behavior you describe would be to implement a “cut”, i.e. not let information from the regression flow back to inform the imputation. In Stan, the most straightforward way to do this is via many repeated model fits over multiply imputed data.
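
For example, a rough sketch of that approach with brms (not from the original post; the uniform-within-interval imputation, the y_imp column, and the number of imputations are all just illustrative choices):

# impute each censored value WITHOUT letting the regression inform the draw,
# e.g. uniformly within its interval, then fit the plain Gaussian model to
# each completed dataset and pool the results with brm_multiple()
n_imp = 20
imputed = lapply(seq_len(n_imp), function(i) {
  di = d
  di$y_imp = runif(nrow(di), min = di$y1, max = di$y2)
  di
})
fit_mi = brm_multiple(y_imp ~ x, data = imputed, cores = 4)

Since the imputation step never sees the regression, the fit cannot pull the imputed points toward its own predictions.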


Thank you! I suppose the regression proper is a bit of a red herring here; I initially included it just to check what would happen to the slopes. But you’re right, this is basically “asking” the distribution what parameter values would maximize the probability of it producing observations that fall within the intervals.

I made a visualization that made the intuition fully click, leaving it here for posterity’s sake:

library(dplyr)
library(tidyr)
library(ggplot2)
# probability that a N(0, s) draw lands in the interval (l, u)
intp = function(l, u, s) pnorm(u, sd = s) - pnorm(l, sd = s)
N = 100
y = rnorm(N)
y1 = y - 2.5   # lower interval bound
y2 = y + 2.5   # upper interval bound
data.frame(y1, y2) %>%
  mutate(sigma = list(seq(0.01, 1.5, 0.01))) %>%  # grid of candidate sigmas for every interval
  unnest(sigma) %>%
  mutate(prob = log(intp(y1, y2, sigma))) %>%     # log-probability of each interval
  group_by(sigma) %>%
  summarise(jointprob = exp(sum(prob))) %>%       # joint probability across all intervals
  ggplot(aes(x = sigma, y = jointprob)) +
  geom_point()

(it’s particularly revealing that sigma = 0 will be the “winner” as long as all the intervals happen to cross the mean)

I suppose that a more adequate model for the situation I had in mind would take each observation to be the midpoint of its interval and model it as drawn from a uniform with known width, whose endpoints depend on an unobserved value drawn from the normal:

\begin{align}
y_{mid} \mid \mu &\sim U(\mu - k, \mu + k) \\
\mu &\sim N(\mu^\star, \sigma)
\end{align}
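
A rough rstan sketch of that model (not from the original post; the priors are arbitrary, k is treated as the known common half-width, and the model is reparameterized in terms of the uniform "error" e = y_mid - mu so that only scalar bounds are needed):

library(rstan)
stan_code = "
data {
  int<lower=1> N;
  vector[N] y_mid;                    // interval midpoints
  real<lower=0> k;                    // known half-width of the intervals
}
parameters {
  real mu_star;
  real<lower=0> sigma;
  vector<lower=-k, upper=k>[N] e;     // y_mid - mu, implicitly uniform on (-k, k)
}
transformed parameters {
  vector[N] mu = y_mid - e;           // latent true values
}
model {
  mu ~ normal(mu_star, sigma);        // mu ~ N(mu_star, sigma)
  // y_mid | mu ~ U(mu - k, mu + k) is encoded by the bounds on e
  mu_star ~ normal(0, 5);             // illustrative priors
  sigma ~ student_t(3, 0, 2.5);
}
"
fit_latent = stan(
  model_code = stan_code,
  data = list(N = nrow(d), y_mid = (d$y1 + d$y2) / 2, k = 1.5)
)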