I am a new user of Stan. I am trying to model the distribution of event counts that are right censored. The counts are modeled as Poisson with lambda = np.exp(alpha + beta*t/100). When a count is >= the threshold, the true count is unknown; all one knows is that it is >= the threshold. The threshold also varies between observations.
I am using cmdstanpy to call the optimize method 1000 times to get a distribution of the maximum a posteriori (MAP) estimates for alpha and beta.

import numpy as np
from cmdstanpy import CmdStanModel

# compile the Stan program shown below (the file name here is just a placeholder)
model = CmdStanModel(stan_file="censored_poisson.stan")

alpha = np.log(10)
beta = np.log(5)
N_samples = 100
t = np.arange(N_samples)
lam = np.exp(alpha + beta * t / 100)
alpha_hat = []
beta_hat = []
for it in range(1000):
    # simulate counts and right censor them at a per-observation threshold
    count = np.random.poisson(lam=lam)
    threshold = np.random.randint(10, 50, N_samples)
    count_observed = np.minimum(count, threshold)
    sales_data = {
        'N': N_samples,
        'y': count_observed,
        'inv': threshold
    }
    fit = model.optimize(data=sales_data, output_dir="optimize", show_console=False)
    alpha_hat.append(fit.stan_variable("alpha"))
    beta_hat.append(fit.stan_variable("beta"))

The Stan model is:

data {
  int<lower=0> N;
  array[N] int<lower=0> inv;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
}
model {
  for (n in 1:N) {
    real lambda = exp(alpha + beta * (n - 1) / 100);
    if (y[n] < inv[n]) {
      // uncensored data likelihood
      y[n] ~ poisson(lambda);
    } else {
      // censored data likelihood
      target += poisson_lccdf(inv[n] | lambda);
    }
  }
}

With the model stated above I get unbiased distributions for alpha_hat - alpha and beta_hat - beta.

However, a fraction of the estimates for alpha and beta come out as zero. Why?

Since the distribution is censored when y >= inv, and the CCDF is defined as the probability that Y > y, I thought the correct increment to the likelihood would be

target += poisson_lccdf(inv[n] | lambda) + poisson_lpmf(inv[n] | lambda);

The last term (poisson_lpmf) would cover the probability that Y = inv. However, this makes the distribution for beta biased.

Not quite. The way you wrote it the first time is correct. An observation is either observed with a value in range (hence the Poisson likelihood) or it's censored. If it's censored at inv[n], then we only know that the value is greater than or equal to inv[n], which means we only want to include the CCDF.

The way you wrote it adds the likelihood for an observation of inv[n] as if there were no censoring, plus the censored effect.
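
Written out, the proposed increment corresponds to

$$
\log\!\left[\Pr(Y = \mathrm{inv}[n] \mid \lambda)\,\Pr(Y > \mathrm{inv}[n] \mid \lambda)\right],
$$

i.e., it multiplies the two probabilities together rather than contributing the probability of a single censored observation.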

Also, you can leave lambda on the log scale and write:
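
(a sketch; poisson_log takes the Poisson rate directly on the log scale)

y[n] ~ poisson_log(alpha + beta * (n - 1) / 100);

As far as I know there's no poisson_log_lccdf, so the censored branch would still need an exp() around the log rate.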

Bayesian posterior means will be unbiased estimates (that also minimize expected square error) if the data follows the model’s data generating process and the computation is working. You’re using posterior optimization, so that’s not going to give you unbiased estimates in general (for example, the maximum likelihood estimate for covariance is biased, which is why we divide by N - 1 to get an unbiased estimate and divide by N to get the MLE).
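
To make that concrete, in the familiar univariate case the MLE of the variance and its expectation are

$$
\hat\sigma^2_{\text{MLE}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})^2,
\qquad
\operatorname{E}\!\left[\hat\sigma^2_{\text{MLE}}\right] = \frac{N-1}{N}\,\sigma^2,
$$

so dividing by N - 1 instead of N removes the bias.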

I was about to guess it's zero-centered priors, but you don't have that, because for some reason you're constraining your parameters to be positive. You probably don't want the parameters constrained to be positive, given that the exp() in the regression already ensures lambda is positive. If you keep alpha and beta positive, then you don't even necessarily need the exp(), though that's the traditional link function for Poisson regression.
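
A minimal sketch with the constraints dropped (and the rate left on the log scale, as above); the only substantive change is removing <lower=0> from alpha and beta:

data {
  int<lower=0> N;
  array[N] int<lower=0> inv;
  array[N] int<lower=0> y;
}
parameters {
  real alpha;  // unconstrained: the exp() in the rate already keeps lambda positive
  real beta;
}
model {
  for (n in 1:N) {
    real log_lambda = alpha + beta * (n - 1) / 100;
    if (y[n] < inv[n]) {
      y[n] ~ poisson_log(log_lambda);                     // uncensored
    } else {
      target += poisson_lccdf(inv[n] | exp(log_lambda));  // right censored
    }
  }
}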

I doubt they're actually 0 to arithmetic precision. But if you have only positive covariates, as you do here, the slope and intercept of a regression can be highly correlated, in that a change in one can largely be absorbed by the other. Working through the non-linearity of the exp() is also an issue.

The point is that you only need one or the other. If you observe a value that’s censored, all you know is that it’s greater than some value, so you only want to include the CCDF term. You don’t have the Y = inv case—that’s censored. The biggest uncensored value in your model is inv - 1.

Great! Hard boundary constraints can introduce this kind of bias vs. the unconstrained solution when probability mass piles up at the boundary (i.e., the boundary is consistent with the data given the model).

Not yet. We added some alternative parameterizations but didn’t complete them. The CDF algorithms are already a mess as they’re not implemented natively on the log scale, which is terrible for precision. But we don’t know how to write better ones. We wouldn’t have an lcdf implementation that’s more than just applying the exp() to the arg and calling our existing one.

On the other hand, we should probably take a look at cases like the Poisson, where the cdf function looks pretty simple (it’s a ratio of an incomplete gamma function and a gamma function for the factorial): Poisson distribution - Wikipedia
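
For reference, that CDF is

$$
\Pr(Y \le k \mid \lambda) = \frac{\Gamma(\lfloor k\rfloor + 1,\ \lambda)}{\lfloor k\rfloor!} = Q(\lfloor k\rfloor + 1,\ \lambda),
$$

where Γ(s, x) is the upper incomplete gamma function and Q is its regularized form.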

I don’t know if there’s a good implementation of the incomplete gamma function with the argument on the log scale. The water gets deep quickly: Incomplete gamma function - Wikipedia