Modeling censored geometric data fails with small p

Hi

I encountered an issue when trying to fit a geometric mixture model with censored data, so I set up an
simpler version and encountered the following issue: When setting the probability of my random geometric sample below 0.05 there is no way that Stan finds the initial values, I tried zero initial values,
and only limiting the number of censored values to a maximum of 5 leads to a fit. Also setting a strong prior on the Intercept did not work out. I would be very grateful for any help here!

n <- 10000; c <- 3000
data.uncens <- rgeom(n = n, prob = .01)
censd <- sample(1:n, replace = T, size = c)
cens.ind <- rep(0, n)
cens.ind[censd] <- 1
data.uncens[censd] <- data.uncens[censd] - 5
data.uncens <- ifelse(data.uncens < 0, 1, data.uncens)
data <- data.frame(x = data.uncens, censoring = cens.ind)
standata <- list(x = data$x[-censd], x_cens = data$x[censd], N = length(data$x[-censd]), N_cens = length(data$x[censd]))

censored.geometric <-

data {
int<lower=1> N;
int<lower=1> N_cens;
int x[N];
int x_cens[N_cens];
}
parameters {
real<lower = 0> Intercept;
}
model {
\\target += student_t_lpdf(Intercept | 100, 1, .00001);
x ~ neg_binomial_2(Intercept, 1);
target += neg_binomial_2_lccdf(x_cens | Intercept, 1);
}

stan(model_code = censored.geometric, data = standata,
iter=2000, warmup=1000, chains=2, cores = 2, seed=483892929)

It throws out this error (many times):
Chain 1: Rejecting initial value:
Chain 1: Log probability evaluates to log(0), i.e. negative infinity.
Chain 1: Stan can’t start sampling from this initial value.

Best,
Giuachin

1 Like

So the situation is that you have two sets of data, x which is uncensored and x_cens which is lower censored (low values are ignored)?

Do you know the lower censoring point for x_cens?

1 Like

The data is right censored - for x_cens it is known that the uncensored value is at least as large the censored value. In principle it is the same situation as in the Stan manual just that each x_cens has an individual lower bound.

1 Like

Hmm, seems okay to me then.

There must to be a difficulty evaluating the likelihood.

The question would be figuring out what combination of x_cens and Intercept causes problems.

Do you mind investigating this a bit more and report back?

I think you could use ‘expose_stan_functions’ to expose this function to R and then figure out where it is breaking (expose_stan_functions documented here: https://cran.r-project.org/web/packages/rstan/rstan.pdf – you might have to write a wrapper around neg_binomial_2_lccdf to get it to export, I’m not sure).

1 Like

Thanks to your explanation I was able to solve the issue, although in a presumably cheap way…

Since the algorithm only fails when the data is generated with a fairly small p for the geometric distribution, the model must have had problems with small values for “Intercept” (the neg_binomial distribution is parametrised as 1/p which is a location parameter).

All I had to do to fit the thing was to pass initial values larger than ~ 20 (the true location parameter that generated the data is 100).

So
stan(model_code = censored.geometric, data = standata, init = list(list(Intercept = 20), list(Intercept = 20)), iter=2000, warmup=1000, chains=2, cores = 2, seed=483892929)

1 Like

Oh okay, sounds good.

In terms of providing custom initial conditions, the goal is to make sure they are overdispersed so the Rhat diagnostics work right.

The goal is to have a bunch of chains start in a bunch of different locations and take a bunch of different paths and end up in the same situation.

This is to avoid having a bunch of chains start in a bunch of different locations, but because they are all so far away from the solution in sort of the same way, they take the same paths and you are tricked into thinking they have converged to one thing before you hit the solution.

So initial conditions normal(100, 20) would probably be fine (you’d want that 20 to be larger than what you’d expect the standard deviation of the posterior of the Intercept to be). normal(20, 1) means everything basically starts in the same place.

Very likely doesn’t matter (it’s not like we’re doing anything but guessing with the defaults and are probably not doing it right a lot of the time), just lettin’ yah know.

1 Like

Also I recommend you don’t fix seeds. Even in development work. It leaves you vulnerable to weird seed-specific bugs, and I honestly don’t think it gives you much (unless you’re hunting very specific C++ segfaults or something weird).