I thought I had finally found a solution to a problem of mine, inferring the parameters of a negative binomial from just the upper half of my data, when I realized that the “missing data” paradigm cannot be applied to discrete variables.
data {
  int<lower=1> S;
  int y[S];
}
parameters {
  real<lower=0> sigma;
  real mu;
  int<lower=0, upper=min(y)> y_miss[S];  // not allowed: Stan has no integer parameters
}
model {
  y ~ neg_binomial_2(mu, sigma);
  y_miss ~ neg_binomial_2(mu, sigma);
}
Is there any workaround you can think of?
I should point out that the negative binomial is probably too over-dispersed to be modeled as a Gamma.
I think you might be able to accomplish what you’re trying to do by using the generated quantities block.
data {
  int<lower=1> S;
  int y[S];
}
parameters {
  real<lower=0> sigma;
  real mu;
}
model {
  y ~ neg_binomial_2(mu, sigma);
}
generated quantities {
  int<lower=0, upper=min(y)> y_miss[S];
  for (s in 1:S)
    y_miss[s] = neg_binomial_2_rng(mu, sigma);
}
This is literally the actual model (which obviously does not work in this form):
y = rnbinom(10000, mu=2000, size=0.5)
y = y[y > 2000]
fit = stan(
  model_code="
    data {
      int<lower=1> S;
      int y[S];
    }
    parameters {
      real<lower=0> sigma;
      real mu;
      int<lower=0, upper=min(y)> y_miss[S];
    }
    model {
      y ~ neg_binomial_2(mu, sigma);
      y_miss ~ neg_binomial_2(mu, sigma);
    }
  ",
  data = list(S=length(y), y=y))
If you’re trying to fit the actual model, you’re not actually after the y_miss. You’re after a truncated model. Something like (not compiled / checked):
data {
  int<lower=1> S;
  int y[S];
}
parameters {
  real<lower=0> sigma;
  real mu;
}
model {
  /*
  for (s in 1:S) {
    target += neg_binomial_2_lpmf(y[s] | mu, sigma)
              - neg_binomial_2_lccdf(2000 | mu, sigma);
  }
  */
  // data are truncated to y > 2000, so normalize by the
  // complementary CDF, i.e. P(Y > 2000)
  target += neg_binomial_2_lpmf(y | mu, sigma)
            - S * neg_binomial_2_lccdf(2000 | mu, sigma);
}
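A quick numeric sanity check of the truncation idea (my sketch, using SciPy as a stand-in for Stan): dividing the pmf by the complementary CDF at the cutoff renormalizes the upper tail so it sums to 1. The mapping from neg_binomial_2's (mu, phi) to SciPy's nbinom(n, p) is assumed to be n = phi, p = phi / (phi + mu).

```python
# Sanity check that pmf(y) / P(Y > L) is a proper distribution on y > L.
import numpy as np
from scipy.stats import nbinom

mu, phi = 10.0, 0.5                 # small mu so the tail sum is cheap
n, p = phi, phi / (phi + mu)        # SciPy parameterization (assumed mapping)
L = 5                               # cutoff: we keep y > L

ys = np.arange(L + 1, 5000)         # effectively the whole upper tail here
trunc_pmf = nbinom.pmf(ys, n, p) / nbinom.sf(L, n, p)
print(trunc_pmf.sum())              # ~ 1.0
```

This is exactly what subtracting `neg_binomial_2_lccdf(2000 | mu, sigma)` per observation does on the log scale.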
Does it fail to converge or does it give a result of the wrong magnitude? There are two parameterizations of the negative binomial distribution in Stan. I’m not sure which one is used in R.
And adding priors would be something reasonable to do.
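For reference, a sketch (mine, with SciPy) of how the parameterizations line up, assuming R's rnbinom(mu=, size=) matches Stan's neg_binomial_2(mu, phi) with phi = size, i.e. mean mu and variance mu + mu^2/phi:

```python
# Check the mean/variance implied by the (mu, size) parameterization,
# assuming SciPy's nbinom(n, p) with n = size, p = size / (size + mu)
# is the same distribution.
from scipy.stats import nbinom

mu, size = 2000.0, 0.5
n, p = size, size / (size + mu)
mean, var = nbinom.stats(n, p, moments="mv")
print(float(mean))   # ~ mu
print(float(var))    # ~ mu + mu**2 / size
```

Stan's other parameterization, neg_binomial(alpha, beta), has mean alpha/beta instead, so mixing the two up gives results of the wrong magnitude.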
It’s in the error message: you didn’t provide the variable L.
Now back to your problem of estimating this thing… there is key information you’re leaving out. If you knew that you started with 1000 values, that would make the estimation problem a lot easier.
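To illustrate why the known total helps (my sketch, not from the thread, with hypothetical numbers): if the original total N_total were known, the number of observations S surviving the y > L cut would itself be informative, since S ~ Binomial(N_total, P(Y > L)), which pins down the tail probability at the fitted parameters.

```python
# Hypothetical sketch: the count S of values above the cutoff L, out of a
# known total N_total, contributes a Binomial(N_total, P(Y > L)) term
# to the log likelihood.
from scipy.stats import nbinom, binom

mu, phi, L = 2000.0, 0.5, 2000
n, p = phi, phi / (phi + mu)        # assumed SciPy mapping of neg_binomial_2

N_total, S = 10000, 3500            # hypothetical counts
p_tail = nbinom.sf(L, n, p)         # P(Y > L) at these parameter values
extra_ll = binom.logpmf(S, N_total, p_tail)
print(p_tail, extra_ll)
```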
Sorry, that was dumb; I kept looking at the model and forgot about the data input.
Here is the model:
fit_NM = stan(
  model_code="
    data {
      int<lower=1> N;
      int L;
      int<lower=L> y[N];
    }
    parameters {
      real<lower=0> sigma;
      real<lower=0> mu;
    }
    model {
      for (n in 1:N) y[n] ~ neg_binomial_2(mu, sigma) T[L,];
    }",
  data = list(N=length(y), y=y, L = 2000))
summary(fit_NM, pars=c("mu", "sigma"))
Rejecting initial value:
Gradient evaluated at the initial value is not finite.
Stan can't start sampling from this initial value.
Although I don’t fully understand your comment: I will know mu, so I can assume there are the same number of values above and below it. So I guess I do know the total.
FYI: the general problem is that the lower part of the NB is mixed with another distribution/noise, while the upper part is safely pure.
data {
  int<lower=1> S;
  int y[S];
}
parameters {
  real<lower = 0> sigma;
  real<lower = 0> mu;
}
model {
  sigma ~ normal(0, 1);
  mu ~ normal(2000, 100);
  target += neg_binomial_2_lpmf(y | mu, sigma)
            - S * neg_binomial_2_lccdf(2000 | mu, sigma);
}
But you have to supply initial values, because Stan really can’t evaluate the negative binomial lccdf term well when mu and sigma are between exp(-2) and exp(2), i.e. 0.13 and 7.38 (the range of Stan’s default random initialization on the unconstrained scale).
Turns out you don’t even need to use the total you started with, although I’m sure that would make the model fit better.