Modeling right censored poisson regression

Hi

I am a new user of stan. I am trying to model the distribution of events counts which is right censored. The distribution is modeled as poisson where lamdba = np.exp(alpha+beta*t/100). When event count >= threshold one does not know the true value of count, only that it is >=threshold. The threshold is also variable.
I am using cmdstanpy to call method optimize 1000 times to get a distribution of the maximum posterior probability for alpha and beta.


alpha=np.log(10)
beta=np.log(5)
N_samples = 100

t = np.arange(N_samples)
lamdba = np.exp(alpha+beta*t/100)

alpha_hat=[]
beta_hat=[]

for it in range(0,1000):
    count = np.random.poisson(lam=lamdba)
    threshold = np.random.randint(10,50,N_samples)
    # output s is right censored
    count_observed = np.minimum(count, threshold) 
    sales_data = {
       'N' : N_samples,
       'y' : count_observed,
       'inv' : threshold
    }
    fit = model.optimize(data=s_data,output_dir="optimize",show_console=False)
    alpha_hat.append(fit.stan_variable("alpha"))
    beta_hat.append(fit.stan_variable("beta"))

the stan model is

data {
  int<lower=0> N;
  array[N] int<lower=0> inv;
  array[N] int<lower=0> y;
  }

parameters {
  real <lower=0> alpha;
  real <lower=0> beta;
}


model {
    for (n in 1:N) {
        real lambda = exp(alpha+beta*(n-1)/100);
        if (y[n] < inv[n]) {
          // Uncensored data likelihood
          y[n] ~ poisson(lambda);
        } else {
          // Censored data likelihood
          target += poisson_lccdf(inv[n] | lambda);
        }
    }
}

With the model stated above a get unbiased distribution for alpha_hat-alpha and beta_hat-beta

Questions:

  • A fraction of the estimates for alpha and beta have zero value. Why ?
  • Since the distribution is censored when y>= inv, and the ccdf is defined as probability Y > y, I though that the correct increment for the likelihood would be
    target += poisson_lccdf(inv[n] | lambda) + poisson_lpmf(inv[n] | lambda);
    The last term (poisson_lpmf) would cover the probability Y= inv
    However this causes the distribution for beta to be biased

    Why ?

Not quite. The way you wrote it the first time is correct. An observation is either observed with a value in range (hence the Poisson likelihood) or its censored. If it’s censored at inv[n], then we only know that the value is greater than or equal to inv[n], which means we only want to include the CCDF.

The way you wrote it adds the likelihood for an observation of inv[n] as if there were no censoring, plus the censored effect.

Also, you can leave lambda on the log scale and write:

log_lambda_n = alpha + beta * (n - 1) / 100;
y[n] ~ poisson_log(log_lambda_n);

Bayesian posterior means will be unbiased estimates (that also minimize expected square error) if the data follows the model’s data generating process and the computation is working. You’re using posterior optimization, so that’s not going to give you unbiased estimates in general (for example, the maximum likelihood estimate for covariance is biased, which is why we divide by N - 1 to get an unbiased estimate and divide by N to get the MLE).

I was about to guess it’s zero-centered priors, but you don’t have that because for some reason you’re constraining your parameters to be positive. You probably don’t want the parameters to be constrained to be positive in that you have an exp() in the regression that will make sure that lambda is positive. if you keep alpha and beta positive, then you don’t even necessarily need the exp, though that’s the traditional “link function” to use for Poisson regression.

I doubt they’re actually 0 to arithmetic precision. But the issue is that if you have only positive covariates like you do here, then the slope and intercept of a regression can be highly correlated in that raising either one will give you the right answer for the other. Working through the non-linearity of the exp is also an issue.

1 Like

Thank you very much for the reply.
From a previous post

I understood that CCDF is defined as P [Y>y] . How is the term P(Y=inv) accounted ?

The point is that you only need one or the other. If you observe a value that’s censored, all you know is that it’s greater than some value, so you only want to include the CCDF term. You don’t have the Y = inv case—that’s censored. The biggest uncensored value in your model is inv - 1.

Thanks again. Removing the constraints “alpha and beta should be positive” also removed the bias

You mentioned that I could leave lamdba on the log scale

log_lambda_n = alpha + beta * (n - 1) / 100;
y[n] ~ poisson_log(log_lambda_n);

Is there a stan function for the corresponding lcdf or lccdf of poisson_log ?

Great! Hard boundary constraints can introduce this kind of bias vs. the unconstrained solution when probability mass piles up at the boundary (i.e., the boundary is consistent with the data given the model).

Not yet. We added some alternative parameterizations but didn’t complete them. The CDF algorithms are already a mess as they’re not implemented natively on the log scale, which is terrible for precision. But we don’t know how to write better ones. We wouldn’t have an lcdf implementation that’s more than just applying the exp() to the arg and calling our existing one.

On the other hand, we should probably take a look at cases like the Poisson, where the cdf function looks pretty simple (it’s a ratio of an incomplete gamma function and a gamma function for the factorial): Poisson distribution - Wikipedia

I don’t know if there’s a good implementation of the incomplete gamma function with the argument on the log scale. The water gets deep quickly: Incomplete gamma function - Wikipedia

Given that the biggest uncensored value in the model is inv[n]-1 should the censored likelihood be evaluated at inv[n]-1 ?

// Censored data likelihood
          target += poisson_lccdf(inv[n] - 1  | lambda);

Censoring is a feature of the measurement process. What makes you think censoring is going on here? Are there a bunch of measurements that show up with this maximum value?

d_n is demand of product at given day.
y_n is number of units sold at same day.
inv_n is the inventory at start of each day.
I want to estimate the demand of the product from the sold units.
When y_n < inv_n then d_n = y_n
When y_n = inv_n then d_n>= inv_n ( when product is sold out the demand is equal or greater than number of units sold)
The last “uncensored value” of y_n is inv_n -1 .

OK, if \text{inv}_n is the censoring point so that `\text{inv}_n - 1 is not censored, then for a censored observation, what you want is

\Pr[y_n \geq \text{inv}_n] = 1 - \Pr[y_n < \text{inv}_n],

so that if y_n and \text{inv}_n are integers, then

1 - \Pr[y_n < \text{inv}_n] = 1 - \Pr[y_n + 1 \leq \text{inv}_n].