Truncated model for neg_binomial_2

Hello,

I thought I had finally found a solution to a problem of mine, inferring the parameters of a negative binomial from just the upper half of my data, when I realized that the “missing data paradigm” cannot be applied to discrete variables.

	data {
		int<lower=1> S;
		int y[S]; 
	}
	parameters { 
		real<lower=0> sigma; 
		real mu;
		int<lower=0, upper=min(y)> y_miss[S]; 
	}
	model { 
		y ~ neg_binomial_2(mu, sigma); 
		y_miss ~ neg_binomial_2(mu, sigma); 
	}

Is there any workaround you can think of?

I should point out that the negative binomial is probably too over-dispersed to be approximated by a Gamma.

Thanks a lot

I think you might be able to accomplish what you’re trying to do by using the generated quantities block.

data {
  int<lower=1> S;
  int y[S]; 
}
parameters { 
  real<lower=0> sigma; 
  real mu;
}
model { 
  y ~ neg_binomial_2(mu, sigma); 
}
generated quantities {
  int<lower=0, upper=min(y)> y_miss[S]; 
  for (s in 1:S)
    y_miss[s] = neg_binomial_2_rng(mu, sigma); 
}

(I didn’t try to compile it)

Thanks Syclik,

I am actually not interested in y_miss itself.

I just need it in the model so that mu and sigma are correctly estimated given the missing data.

Maybe you could state what the actual model is; otherwise it’s hard to code.

Hello,

This is literally the actual model (which obviously does not work in this form):

y = rnbinom(10000, mu=2000, size=0.5)
y = y[y>2000]

fit = stan(
	model_code="
	data {
		int<lower=1> S;
		int y[S]; 
	}
	parameters { 
		real<lower=0> sigma; 
		real mu;
		int<lower=0, upper=min(y)> y_miss[S]; 
	}
	model { 
		y ~ neg_binomial_2(mu, sigma); 
		y_miss ~ neg_binomial_2(mu, sigma); 
	}
	", 
	data = list(S=length(y), y=y))

If you’re trying to fit the actual model, you’re not actually after the y_miss. You’re after a truncated model. Something like (not compiled / checked):

data {
  int<lower=1> S;
  int y[S]; 
}
parameters { 
  real<lower=0> sigma; 
  real mu;
}
model {
  /*
  for (s in 1:S) {
    target += neg_binomial_2_lpmf(y[s] | mu, sigma) 
              - neg_binomial_2_lcdf(2000 | mu, sigma);
  }
  */
  target += neg_binomial_2_lpmf(y | mu, sigma) 
              - S * neg_binomial_2_lcdf(2000 | mu, sigma);
}

Yes you are right!

However, when I implement it, the model fails to converge, giving results of the wrong order of magnitude:

y = rnbinom(1000, mu=2000, size=0.5)
y = y[y>2000]

fit_NL = stan(
	model_code="
	data {
		int<lower=1> S;
		int y[S]; 
	}
	parameters { 
		real<lower=0> sigma; 
		real<lower=0> mu;
	}
	model {
		target += neg_binomial_2_lpmf(y | mu, sigma) 
			- S * neg_binomial_2_lcdf(2000 | mu, sigma);
	}", 
	data = list(S=length(y), y=y))
summary(fit_NL, pars=c("mu", "sigma"))

In principle it shouldn’t need any hyperpriors, being so data-rich.

Does it fail to converge or does it give a result of the wrong magnitude? There are two parameterizations of the negative binomial distribution in Stan. I’m not sure which one is used in R.

And adding priors would be something reasonable to do.

Both. It fails to converge, and it does so at values of the wrong magnitude:

sigma: 200 instead of 0.5
mu: 15000 instead of 2000

neg_binomial_2 is the correct one; it corresponds to the R function rnbinom(mu, size).
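As a sanity check, here is a quick pure-Python verification (the pmf is written out by hand; this is not Stan’s code) that neg_binomial_2(mu, phi) uses the same parameterization as R’s rnbinom(mu = mu, size = phi): mean mu and variance mu + mu^2/phi.

```python
from math import lgamma, log, exp

def nb2_pmf(y, mu, phi):
    # neg_binomial_2 pmf: Gamma(y + phi) / (Gamma(phi) * y!)
    #   * (phi / (phi + mu))^phi * (mu / (phi + mu))^y
    return exp(lgamma(y + phi) - lgamma(phi) - lgamma(y + 1)
               + phi * log(phi / (phi + mu))
               + y * log(mu / (phi + mu)))

mu, phi = 5.0, 0.5
ys = range(20000)  # truncate the support far beyond the bulk of the mass

mean = sum(y * nb2_pmf(y, mu, phi) for y in ys)
var = sum((y - mean) ** 2 * nb2_pmf(y, mu, phi) for y in ys)

print(mean)  # close to mu = 5
print(var)   # close to mu + mu^2 / phi = 55
```

So the “sigma” in the Stan model plays the role of R’s size parameter, and its true value in the simulation above is 0.5.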

There must be something missing that I cannot figure out…

I also tried the standard truncation syntax, but based on the error I suspect the truncation bound cannot be an integer:

fit_NM = stan(
	model_code="
	data {
		int<lower=1> N;
		int L;
		int<lower=L> y[N]; 
	}
	parameters { 
		real<lower=0> sigma; 
		real<lower=0> mu;
	}
	model {
		for (n in 1:N) y[n] ~ neg_binomial_2(mu, sigma) T[L,];
	}", 
	data = list(N=length(y), y=y))
summary(fit_NM, pars=c("mu", "sigma"))

Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) : 
  variable does not exist; processing stage=data initialization; variable name=L; base type=int

Nope, that’s not true at all.

It’s in the error message: you didn’t provide the variable L.


Now back to your problem of estimating this thing… there is key information you’re leaving off. If you knew that you started with 1000 values, that makes the estimation problem a lot easier.

Sorry, that was dumb; I kept looking at the model and forgot about the data input.

Here is the model:

fit_NM = stan(
	model_code="
	data {
		int<lower=1> N;
		int L;
		int<lower=L> y[N]; 
	}
	parameters { 
		real<lower=0> sigma; 
		real<lower=0> mu;
	}
	model {
		for (n in 1:N) y[n] ~ neg_binomial_2(mu, sigma) T[L,];
	}", 
	data = list(N=length(y), y=y, L = 2000))
summary(fit_NM, pars=c("mu", "sigma"))

Rejecting initial value:
  Gradient evaluated at the initial value is not finite.
  Stan can't start sampling from this initial value.

Although I don’t fully understand your comment: I will be able to know mu, and therefore I assume there are the same number of upper and lower values. So I guess I do know it.

FYI: the general problem is that the lower part of the NB is mixed with another distribution/noise, while the upper part is safely pure.

I might have mixed up the cdf with the ccdf.

This works:

data {
  int<lower=1> S;
  int y[S]; 
}
parameters { 
  real<lower = 0> sigma; 
  real<lower = 0> mu;
}
model {
  sigma ~ normal(0, 1);
  mu ~ normal(2000, 100);
  target += neg_binomial_2_lpmf(y | mu, sigma) 
    - S * neg_binomial_2_lccdf(2000 | mu, sigma);
}

But you have to supply initial values, because Stan really can’t evaluate the negative binomial lccdf term well when mu and sigma are between exp(-2) and exp(2), i.e. roughly 0.14 and 7.39 (the range Stan’s default initialization produces for positive-constrained parameters).

Turns out you don’t even need to use the total you started with, although I’m sure that would make the model fit better.
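For completeness, the algebra the lccdf term implements: keeping only observations with y > L means each retained y contributes the truncated pmf,

```latex
p(y \mid Y > L) = \frac{p(y)}{\Pr(Y > L)},
\qquad
\log p(y \mid Y > L) = \log p(y) - \log \Pr(Y > L).
```

Stan’s neg_binomial_2_lccdf(L | mu, sigma) is exactly log Pr(Y > L), which is why the model subtracts S copies of it, one per observation.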

Sorry, it gives an error for me :) I’m not sure I’ve understood the reason for changing to the ccdf.

y = rnbinom(1000, mu=2000, size=0.5)
y = y[y>2000]

fit_NL = stan(
	model_code="data {
  int<lower=1> S;
  int y[S]; 
}
parameters { 
  real<lower = 0> sigma; 
  real<lower = 0> mu;
}
model {
  sigma ~ normal(0, 1);
  mu ~ normal(2000, 100);
  target += neg_binomial_2_lpmf(y | mu, sigma) 
    - S * neg_binomial_2_lccdf(2000 | mu, sigma);
}", 
	data = list(S=length(y), y=y))
summary(fit_NL, pars=c("mu", "sigma"))

Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.

However, this formulation should be quite safe, if only it worked.

data {
	int<lower=1> N;
	int L;
	int<lower=L> y[N]; 
}
parameters { 
	real<lower=0> sigma; 
	real<lower=0> mu;
}
model {
   for (n in 1:N) y[n] ~ neg_binomial_2(mu, sigma) T[L,];
}

It’s not working because it can’t initialize. You need to provide initial values.

Sorry, I was thinking about priors…

If you print the lccdf term, you’ll find that it’s always -inf with the default initialization.
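A quick illustration of why, in plain Python rather than Stan’s actual implementation: at the default initialization the constrained parameters start near 1 (Stan draws uniformly on (-2, 2) on the unconstrained log scale), and with mu = phi = 1 the neg_binomial_2 reduces to a Geometric(1/2), whose survival probability past 2000 underflows double precision entirely.

```python
from math import log

# With mu = phi = 1, the neg_binomial_2 pmf is 0.5 ** (y + 1), a Geometric(1/2).
L = 2000
cdf = sum(0.5 ** (y + 1) for y in range(L + 1))

sf_naive = 1.0 - cdf                # exactly 0.0: Pr(Y > 2000) has underflowed,
                                    # so its log would be -inf
sf_exact_log = (L + 1) * log(0.5)   # the true log Pr(Y > 2000), about -1387

print(sf_naive)
print(sf_exact_log)
```

Initializing mu in the thousands (or tightening the priors) keeps the lccdf term finite.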

I confirm that this R code works:

y = rnbinom(1000, mu=2000, size=0.5)
y = y[y>2000]
init_fun <- function() { list(mu=runif(1,1000,2200), sigma=runif(1,0.1,1)) }

fit_NL = stan(
	model_code="data {
  int<lower=1> S;
  int y[S]; 
}
parameters { 
  real<lower = 0> sigma; 
  real<lower = 0> mu;
}
model {
  sigma ~ normal(0, 1);
  mu ~ normal(2000, 100);
  target += neg_binomial_2_lpmf(y | mu, sigma) 
    - S * neg_binomial_2_lccdf(2000 | mu, sigma);
}", 
	data = list(S=length(y), y=y), init=init_fun)

summary(fit_NL, pars=c("mu", "sigma"))

I have never been a fan of setting initial values. I always thought that this hides pathologies in the model/data.