Mixed (right, left and interval) censored log-normal with brms

I’m worried that I misunderstood something about how I should specify a brms model for the situation where I have different types of censoring present. Perhaps the problems I describe below are due to how I specify the limits for different types of censoring.

My assumptions until now: I had assumed that I write y1 | cens(cens_type, y2), where if the data are interval censored I put the lower limit in y1 and upper in y2, if a record is right censored I put the upper limit in y1 an some non-infinite number >0 in y2, and if the record is left censored I put the lower limit in y1 and some non-infinite number >0 in y2.

Data example: I have data, where I assume that the underlying distribution is log-normal (baseline blood eosinophil counts from an asthma study), but I will only use the information on how many patients fall into each of 3 intervals (0 to <150, 150 to <300, >=300 cells/m^3). In this particular study I know more than that (we know the median of the underlying blood eosinophil counts was 255 cells/m^3 [=5.54 on the log scale] and the mean 360 with SD of 366; values ranged from something that got rounded to zero/is below the limit of quantification to 4330), but I wanted to try out these models for situations where I don’t know. I’m interested in what are reasonable guesses about the parameters of the log-normal distribution.

brms model that "does not work"

library(tidyverse)
library(brms)

dat = tibble(lower=c(0, 150, 300), upper=c(150, 300, Inf),  
              n=c(542, 527, 831), cens_type = c("right", "interval", "left"))

dat1 = dat %>%
  mutate(y1 = case_when(
    cens_type=="right" ~ upper,
    cens_type=="interval" ~ lower,
    cens_type=="left" ~ lower,
    TRUE ~ NA_real_),
    y2 = case_when(
      cens_type=="right" ~ upper,
      cens_type=="interval" ~ upper,
      cens_type=="left" ~ lower,
      TRUE ~ NA_real_))

brmfit1 = brm( y1 | cens(cens_type, y2) + weights(n) ~ 1,
               family = lognormal(),
               prior = prior(class="Intercept", 
                             normal(5.521461, 2.0)) +
                 prior(class="sigma", normal(0.6931472, 2.0)),
               data=dat1, backend="cmdstanr",
               control=list(adapt_delta=0.9999))

post_preds1 = posterior_predict(brmfit1, newdata=tibble(patid=1))[,1]
hist(post_preds1, breaks=50)

Here I get “Warning: 491 of 4000 (12.0%) transitions ended with a divergence”. Ouch. Additionally, the predictions based on the posterior don’t make sense to me:
post_pred_plot
Clearly, something has gone wrong! Looking at the posterior, it seems like sigma got massively underestimated. Looking at the pairs plot, I see this (not quite sure why, but just calling pairs(brmfit1) does not color the divergent transitions in red for me):

map(brmfit1$fit@sim$samples,
    function(x){
      tibble(x) %>%
        bind_cols(attributes(x)$sampler_params)
    }) %>%
  bind_rows() %>%
  ggplot(aes(x=b_Intercept, y=sigma, col=factor(divergent__, levels=c(1L, 0L)))) +
  geom_point(alpha=0.3)

Using sigma_link=“softplus” (“squareplus” for some reason resulted in an error) helped a lot (obviously prior need to be somewhat different) in terms of avoiding divergent transitions, but not in terms of getting better posterior predictions.

A brms model that "does work"
Here, I pretend all data are interval censored and just put huge negative lower limit (for right censored) or a huge positive upper limit (for left censored data). This also gave matching results to a from-scratch Stan approach (see below).

# -7 might just as well be -Inf on the log scale, similarly 11.5 is huge on the log-scale in the positive direction
dat0 = dat1 %>%
  mutate(y1 = ifelse(lower==0, -7, log(lower)),
         y2 = ifelse(upper==Inf, 11.5, log(upper)),
         cens_type="interval")

brmfit0 = brm( y1 | cens(cens_type, y2) + weights(n) ~ 1,
               family = gaussian(),
               data=dat0, 
               backend="cmdstanr",
               control=list(adapt_delta=0.99))
post_preds = posterior_predict(brmfit0, newdata=tibble(patid=1))[,1]

hist(post_preds, breaks=50)

tibble(pps=post_preds) %>%
  mutate(ppsc = case_when(
    pps<log(150) ~1L,
    pps<log(300) ~2L,
    T ~3L)) %>%
  group_by(ppsc) %>%
  summarize(n=n()) %>%
  mutate(prop=n/(sum(n)))

No longer any divergent transitions,

predicted proportions in the three intervals make sense and the plotted distribution looks reasonable:
better_pos_preds

Stan approach that works
Additionally, what I thought was the most obvious approach, initially, works nicely without problems and gives identical results to the brms model above: I thought would be logical to use a multinomial likelihood and wrote a little bit of Stan code to do that. I’m a bit perplexed, why everything seems okay from the diagnostics perspective here, because it’s so similar to the code generated by brms for the first brms model.

smodel = "
data {
  int records;
  int n[records];
  real<lower=0> lower[records];
  real<lower=0> upper[records];
}

parameters{
  real mu;
  real logsigma;
}

transformed parameters{
  real<lower=0> sigma;
  vector[records] loggroupprob;

  sigma = exp(logsigma);

  for (record in 1:records){
    if (lower[record]==0) {
      loggroupprob[record] = lognormal_lcdf( upper[record] | mu, sigma);
    } else if (lower[record]==9999) {
      loggroupprob[record] = lognormal_lccdf( lower[record] | mu, sigma);
    } else {
       loggroupprob[record] = log_diff_exp( lognormal_lcdf(upper[record] | mu, sigma),
          lognormal_lcdf(lower[record] | mu, sigma));
    }
  }
}

model {
  mu ~ normal(5.521461, 2.0);
  logsigma ~ normal(0.6931472, 2.0);
  n ~ multinomial( softmax( loggroupprob ) ); 
}
"

library(cmdstanr)

f <- write_stan_file(smodel)
mod <- cmdstan_model(f, compile = F)
mod$compile(force_recompile = T)

fit <- mod$sample(
  data = list(records=3, n=dat0$n, 
              lower = dat0$lower, 
              upper=ifelse(dat0$upper==Inf, 9999, dat0$upper)), 
  seed = 123, 
  chains = 4, 
  parallel_chains = 4,
  refresh = 500, 
  adapt_delta = 0.9999
)

fit$cmdstan_diagnose()

fit$cmdstan_summary()

Stan code generated by brms for model that "did not work"
For comparison purposes, here’s the Stan code generated by brms for the first model that did not work (i.e. had divergent transitions), but so much of that seems incredibly similar to what I wrote down as Stan code above, except for where I explicitly introduce multinomial probabilities.

// generated with brms 2.16.1
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  vector<lower=0>[N] weights;  // model weights
  int<lower=-1,upper=2> cens[N];  // indicates censoring
  vector[N] rcens;  // right censor points for interval censoring
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    for (n in 1:N) {
    // special treatment of censored data
      if (cens[n] == 0) {
        target += weights[n] * lognormal_lpdf(Y[n] | mu[n], sigma);
      } else if (cens[n] == 1) {
        target += weights[n] * lognormal_lccdf(Y[n] | mu[n], sigma);
      } else if (cens[n] == -1) {
        target += weights[n] * lognormal_lcdf(Y[n] | mu[n], sigma);
      } else if (cens[n] == 2) {
        target += weights[n] * log_diff_exp(
          lognormal_lcdf(rcens[n] | mu[n], sigma),
          lognormal_lcdf(Y[n] | mu[n], sigma)
        );
      }
    }
  }
  // priors including constants
  target += normal_lpdf(Intercept | 5.521461, 2);
  target += normal_lpdf(sigma | 0.6931472, 2)
    - 1 * normal_lccdf(0 | 0.6931472, 2);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
}

Additional information

  • Operating System: Windows 10
  • R version: 4.1.1
  • brms Version: 2.16.1

Having looked into this it seems as if the cdf and ccdf functions are not numerically stable enough to make this model work. I took the first model and did set n=1 and then this model runs just fine (but that’s not the fit we’d want here). Converting the model to use a normal distribution and known standard errors seemed to be a promising route. In this case the weighting is not any more needed as it’s done by rescaling this standard deviations from the normal distributions involved. These seem to work better. Have others also experienced issues with cdfs & ccdfs?