Hurdle lognormal distribution

Hi everyone!

I read in the user guide that Stan does not currently support zero-inflated models for continuous distributions. I am wondering if there is a way around this? For example, a simplified problem may look like:

y_i = 0 with probability \theta and y_i \sim \text{lognormal}(\mu_i, \sigma_i^2) with probability 1-\theta, where \mu_i and \sigma_i follow some prior distributions, i = 1, \cdots, n.

I followed Max’s response in another thread and wrote the following Stan model:

model2 = '
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 2);
  for (n in 1:N) {
    if (y[n] == 0)
      target += bernoulli_lpmf(0 | theta);
    else
      target += bernoulli_lpmf(1 | theta) + lognormal_lpdf(y[n] | mu, sigma);
  }
}
'

I generated some data points to test the model above:

set.seed(1)
z = rbinom(200,1,0.8)
generate_lognormal = function(z) ifelse(z==0, 0, exp(rnorm(1,0,1)))
y = sapply(z, generate_lognormal)
dat2 = list(N=200, y=y)
fit2 <- stan(model_code=model2, data=dat2, chains=2, warmup=500, iter=1000, cores=1, refresh=0)

But I got an error:

Error in mod$fit_ptr() :
Exception: int variable contained non-int values; processing stage=data initialization; variable name=y; base type=int (in ‘model10c6d393b3f99_728a875fb87c0036fa51d56fb3a6ee03’ at line 4)
failed to create the sampler; sampling not done

It seems that y must be an integer… Is this a requirement for the hurdle model? Has anyone encountered similar problems? Thanks for your help!

Hey there!

The problem is that you cannot inflate a point in the support of a continuous distribution (the probability of any single point is zero). What you describe is a hurdle model, which you can easily fit in Stan, since the point mass that you are adding (at zero) is not in the support of the continuous distribution. These models are therefore sometimes called two-part models.
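If it helps, here is a quick sketch of that two-part structure in Python (purely illustrative – the function names are made up, not from any package):

```python
import math
import random

def simulate_hurdle_lognormal(n, theta, mu, sigma, rng):
    """Draw n values: exact zero with probability theta,
    otherwise a lognormal(mu, sigma) draw."""
    return [0.0 if rng.random() < theta else math.exp(rng.gauss(mu, sigma))
            for _ in range(n)]

def hurdle_lognormal_lpdf(y, theta, mu, sigma):
    """Log density of the hurdle lognormal: point mass at zero
    with weight theta, lognormal component with weight 1 - theta."""
    if y == 0.0:
        return math.log(theta)
    z = (math.log(y) - mu) / sigma
    return (math.log1p(-theta)
            - math.log(y * sigma * math.sqrt(2.0 * math.pi))
            - 0.5 * z * z)
```

Because the zero has positive probability while the continuous part has zero mass at any point, the two components never overlap, which is exactly why the likelihood factorizes the way your Stan loop writes it.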

And I think you have coded the model correctly. (You could also put a more informative prior on \theta – right now it’s uniform on [0,1].) The error you get is due to the fact that the y's you are inputting into the model are not integers. Changing int<lower=0> y[N]; to something like vector<lower=0>[N] y; or real<lower=0> y[N]; should fix that.

Cheers,
Max

Hi Max,

Thank you so much for your reply! I followed your suggestion and fixed the problem! Can I ask a follow-up question? I would like to add another layer, x_i \sim \text{Poisson}(y_i). I think this is pretty similar to the “eight schools” example. I tried the following code but got an error: [1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."

model6 = '
data {
  int<lower=1> N;
  int<lower=0> x[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
  simplex[2] theta;
  vector[N] y;
}
model {
  target += normal_lpdf(mu | 0, 10);
  target += lognormal_lpdf(sigma | 0, 10);
  for (n in 1:N) {
    if (y[n] == 0)
      target += bernoulli_lpmf(1 | theta[1]);
    else
      target += bernoulli_lpmf(1 | theta[2]) + lognormal_lpdf(y[n] | mu, sigma);
  }
  for (n in 1:N)
    target += poisson_lpmf(x[n] | y[n]);
}
'

My test data:

set.seed(1)
p = 0.7
z = rbinom(200, 1, p)
generate_lognormal = function(z) ifelse(z==0, 0, exp(rnorm(1,0,1)))
y = sapply(z, generate_lognormal)
x = rpois(200, y)
fit6 = stan(model_code=model6, data=list(N=200, x=x), chains=2)

Thanks in advance!

Best,
Bowei

Parameter y needs an explicit lower bound.

vector<lower=0>[N] y; 

@nhuurre, thank you so much! Explicitly specifying the lower bound fixes my issue. So we should always give an explicit domain for parameters when possible, right?

While the code runs successfully now, the estimates are not good. The true values for \mu, \sigma, \theta are 0, 1, (0.3, 0.7), but the posterior means are -0.72, 1.36, (0, 1). Please see the attached output details. I checked the traces of y: they never touch 0 (although at some points they get very close). This means that in the “if else” structure the “else” branch is always chosen, which drives \theta_2 (the probability of a nonzero value) to be estimated as 1.
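As a sanity check outside Stan (a quick Python sketch, not my actual model code), draws from a continuous distribution are essentially never exactly zero, so the y[n] == 0 branch can never fire:

```python
import math
import random

rng = random.Random(1)
# 100,000 draws from lognormal(0, 1): none are exactly 0.0, so an
# `if (y[n] == 0)` branch on a continuous parameter is dead code.
draws = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
print(sum(d == 0.0 for d in draws))  # 0
```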

I noticed that the poisson_lpmf function only accepts a positive value for the parameter \lambda, so I adjusted my Stan model as follows to treat the two cases separately, but I still got the same result: posterior means theta[1] = 0, theta[2] = 1. I would appreciate any suggestions. Thanks a lot!

model6 = '
data {
  int<lower=1> N;
  int<lower=0> x[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
  simplex[2] theta;
  vector<lower=0>[N] y;
}
model {
  target += normal_lpdf(mu | 0, 10);
  target += lognormal_lpdf(sigma | 0, 10);
  for (n in 1:N) {
    if (y[n] == 0) {
      if (x[n] == 0)
        target += bernoulli_lpmf(1 | theta[1]);
      else
        target += 0;
    }
    else {
      target += bernoulli_lpmf(1 | theta[2]) + lognormal_lpdf(y[n] | mu, sigma);
      target += poisson_lpmf(x[n] | y[n]);
    }
  }
}
'

Best,
Bowei

Any value allowed by the parameter constraints must have nonzero probability, so in this case y ~ lognormal() requires <lower=0> for y.

Oh, I see. Yes, you cannot branch on y[n] == 0 when y is a parameter. Instead, you must marginalize out the hurdle part.

model {
  target += normal_lpdf(mu | 0, 10);
  target += lognormal_lpdf(sigma | 0, 10);
  for (n in 1:N) {
    real lp_yzero = bernoulli_lpmf(1 | theta[1])
                    + poisson_lpmf(x[n] | 0.0 );
    real lp_ypos = bernoulli_lpmf(1 | theta[2])
                    + poisson_lpmf(x[n] | y[n] );
    target += log_sum_exp(lp_yzero, lp_ypos);
    target += lognormal_lpdf(y[n] | mu, sigma);
  }
}

Here y is still never zero; it should be interpreted as the estimate of y conditional on y\neq 0.
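In case it helps to see the marginalization numerically, here is a rough Python sketch of the per-observation term (helper names are mine, not Stan's):

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    if m == float("-inf"):
        return float("-inf")
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def poisson_lpmf(x, lam):
    """Log Poisson pmf, handling the degenerate lam == 0 case."""
    if lam == 0.0:
        return 0.0 if x == 0 else float("-inf")
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def obs_lpmf(x, y, theta1):
    """Marginal log likelihood of one count x: a mixture of
    Poisson(x | 0) with weight theta1 and Poisson(x | y) with
    weight 1 - theta1, exactly as in the Stan loop."""
    lp_yzero = math.log(theta1) + poisson_lpmf(x, 0.0)
    lp_ypos = math.log1p(-theta1) + poisson_lpmf(x, y)
    return log_sum_exp(lp_yzero, lp_ypos)
```

For x = 0 both mixture components contribute; for x > 0 the zero component is negative infinity and log_sum_exp just returns the other term.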

Thank you for your quick reply, @nhuurre! I tried your code, and the posterior mean for \theta is (0.01, 0.99), so something still seems wrong. I am wondering what the value of poisson_lpmf(x[n] | 0.0 ) is. In R, the dpois function returns 1 if x[n] = 0 and 0 otherwise. Thanks!

Best,
Bowei

lpmf is the logarithm of the probability, so poisson_lpmf(x | 0.0) should be 0 if x = 0 and negative infinity otherwise.

How many of the xs are zeros?

Yes, that is right – I forgot the log. It seems that none of the x’s visit 0, since sum(extract(fit6, "x")[[1]] == 0) returns 0. This is weird because the posterior mean of theta[1] is 0.01, not 0.

theta[1] is the probability of x and y being zero, right? Even if it never happens in your dataset, you can’t completely rule out it happening in replicated data, so it should be a very small nonzero number.

By the way, the model could be written without the negative infinity, maybe it’s a bit cleaner like this:

model {
  target += normal_lpdf(mu | 0, 10);
  target += lognormal_lpdf(sigma | 0, 10);
  target += lognormal_lpdf(y | mu, sigma);
  for (n in 1:N) {
    if (x[n] == 0) {
      real lp_yzero = bernoulli_lpmf(1 | theta[1])
                      + poisson_lpmf(x[n] | 0.0 );
      real lp_ypos = bernoulli_lpmf(1 | theta[2])
                      + poisson_lpmf(x[n] | y[n] );
      target += log_sum_exp(lp_yzero, lp_ypos);
    } else {
      target += bernoulli_lpmf(1 | theta[2])
                 + poisson_lpmf(x[n] | y[n] );
    }
  }
}

Wait, wait? I thought x was data? How are you extracting it from the fit object? What is the complete model?

That works! Thank you so much for your neat code, @nhuurre!

theta[1] is the probability of y=0. I think you use it correctly in your code.

Regarding sum(extract(fit6, "x")[[1]] == 0) = 0: the variable I call x in my own code corresponds to y here, i.e. the unobserved Poisson mean. Sorry for the confusion.

Best,
Bowei