Zero one inflated beta regression in STAN


#1

Can anyone provide any references to modeling a Zero one inflated beta regression in STAN?

Any packages that can run that model through R?

Thank you


#2

There is the cdfquantreg R package that has likelihoods that put positive mass on 0 and / or 1, but it does not use Stan. I started to incorporate it into RStanArm but didn’t finish it. In the meantime you could transform your y with

(y * (n - 1) + .5) / n

to force it interior, where n is the number of observations.


#3

You’d code it in Stan just like the zero-inflated Poisson, only with two inflated points. Beta regressions are awfully hard to fit. And ones that allow true zero and true one outputs sound dangerous. Where’d the values in (0, 1) come from? If they’re counts, you can (and should) just model those directly.


#4

You may want to try out the zero_one_inflated_beta() family in brms.


#5

Did you ever write code to do this in Stan proper? I’m trying to put together a zero-one inflated beta regression model currently, and I am having trouble taking the various parameterizations across papers and writing it into Stan.


#6

If you just want additional probability for 0 and 1, the generative model will be probability theta[1] of generating 0, probability theta[2] of generating 1, and probability theta[3] of drawing from poisson(lambda) (or whatever the parameter of the Poisson is). In Stan, that’s a mixture that we can simplify because we know some of the mixture probability components are zero in some cases.

data {
  int<lower = 0> N;
  int<lower = 0>[N] y;

parameters {
  simplex[3] theta;  // mixture of 0, 1, poisson (in that order)
  ...
transformed parameters {
  vector<upper = 0>[3] log_theta = log(theta);  
  ...
model {
  for (n in 1:N) {
    real poisson_lp = log_theta[3] + poisson_lpmf(y[n] | lambda);
    if (y[n] == 0)
       target += log_sum_exp(log_theta[1], poisson_lp);
    else if (y[n] == 1)
      target += log_sum_exp(log_theta[2], poisson_lp);
    else
      target += poisson_lp;
  }
  ...

This could be generalized to 0 to K inflation, and written as:

simplex[K + 2] theta;
...
real poisson_lp = log_theta[K + 2] + poisson_lpdf(y[n] | lambda);
...
if (y[n] <= K)
  target += log_sum_exp(log_theta[y[n] + 1], poisson_lp);
else
  target += poisson_lp

#7

Sounds good, thanks! I need to read more of the manual on the simplex and what (I think) appears to be incrementing, as I’ve been adapting a lot of things from JAGS and BUGS. For example, I’ve adapted this JAGS code to a zero-one inflated beta regression model in Stan by:

data {
  int n;
  int x[n];
  vector<lower=0, upper=1>[n] y;
}
transformed data {
  int<lower=0, upper=1> is_discrete[n]; 
  int<lower=-1, upper=1> y_discrete[n];

  for (i in 1:n) { 
    if (y[i] == 0) {
      is_discrete[i] = 1;
      y_discrete[i] = 0;
    } else if (y[i] == 1) {
      is_discrete[i] = 1;
      y_discrete[i] = 1;
    } else {
      is_discrete[i] = 0;
      // throws error if passed to bernoulli
      y_discrete[i] = -1;
    }
  }
}
parameters {
  vector[2] coef_a;
  vector[2] coef_g;
  vector[2] coef_m;
  vector[2] coef_p;
}
transformed parameters {
  vector<lower=0, upper=1>[n] alpha;
  vector<lower=0, upper=1>[n] gamma;
  vector[n] mu;
  vector<lower=0>[n] phi;
  vector<lower=0>[n] p;
  vector<lower=0>[n] q;

  for (i in 1:n) {
    alpha[i] = inv_logit(coef_a[1] + coef_a[2] * x[i]);
    gamma[i] = inv_logit(coef_g[1] + coef_g[2] * x[i]);
    mu[i] = inv_logit(coef_m[1] + coef_m[2] * x[i]);
    phi[i] = exp(coef_p[1] + coef_p[2] * x[i]);
    p[i] = mu[i] * phi[i];
    q[i] = phi[i] - mu[i] * phi[i];
  }
}
model {
  coef_a ~ normal(0, 1);
  coef_g ~ normal(0, 1);
  coef_m ~ normal(0, 1);
  coef_p ~ normal(0, 1);
  
  is_discrete ~ bernoulli(alpha);
  for (i in 1:n) {
    if (is_discrete[i] == 1) {
      y_discrete[i] ~ bernoulli(gamma[i]);
    } else {
      y[i] ~ beta(p[i], q[i]);
    }
  }
}

Unfortunately, I am having a hard time getting accurate estimates using that code from data I generated using the same model specification, so I was wondering if anyone had code for the zero-and-one case.


#8

Hey @markhw - any more details on what seems to be going wrong with parameter recovery, e.g., which parameters are not recoverable, is there consistent bias, how do things scale with N?

Another strategy would be to simplify the model (e.g., remove all covariate effects and add them back to the model one-by-one), so that you can get a better sense for what specifically is problematic.

I tweaked the R script that you had sent earlier, and it seems like the model recovers the true parameters for large N: https://gist.github.com/mbjoseph/a0961d20263ad50dd953c11defa59736

*Note that I’ve changed the data declaration for x from int x[n] to vector[n] x


#9

I’ve attached the R code that generates the data (similar to what you’ve seen). zoib_generation.R (1.7 KB)

It calls a file called zoib.stan, which is what I posted in the above comment. It looks like it is capturing the beta distribution well, but not the Bernoulli, as the coefficients predicting alpha and gamma are off, but mu and phi coefficients are not:

  coefficient estimated  real difference
        <chr>     <dbl> <dbl>      <dbl>
1   coef_a[1]      0.12 -0.13       0.25
2   coef_a[2]     -0.34 -0.05      -0.29
3   coef_g[1]     -0.02  0.00      -0.02
4   coef_g[2]      0.45  0.30       0.15
5   coef_m[1]      0.01  0.00       0.01
6   coef_m[2]      0.32  0.30       0.02
7   coef_p[1]      1.86  1.80       0.06
8   coef_p[2]      0.29  0.30      -0.01

It does look like you’re right about large N, though. When I upped the N to 10,000 instead of 1,000, I got better recovery of the coefficients:

  coefficient estimated  real difference
        <chr>     <dbl> <dbl>      <dbl>
1   coef_a[1]     -0.13 -0.13       0.00
2   coef_a[2]     -0.13 -0.05      -0.08
3   coef_g[1]     -0.09  0.00      -0.09
4   coef_g[2]      0.44  0.30       0.14
5   coef_m[1]     -0.02  0.00      -0.02
6   coef_m[2]      0.32  0.30       0.02
7   coef_p[1]      1.77  1.80      -0.03
8   coef_p[2]      0.32  0.30       0.02

I’m wondering if the method is really that biased for certain data, or if there’s something in how Ospina and Ferrari define the likelihood that we are missing?


#10

I think the likelihood is correct based on my reading of the paper, but if somebody can find a mistake that we’ve missed, that would be great!

In looking at the marginal posterior distributions of each parameter relative to the true values, things don’t look terrible to me:

One thing to consider also is that only the discrete observations provide information on gamma with this current parameterization, so that in this case there are < 1000 independent data points that can contribute to the likelihood for coef_g.


#11

You want to replace this pattern:

gamma = inv_logit(...);
y ~ bernoulli(gamma);

with

y ~ bernoulli_logit(...);

I don’t understand the is-discrete thing here, but can’t you just vectorize these to:

y_discrete ~ bernoulli(gamma);
y ~ beta(p, q);

Otherwise, it seems like you wind up with unmodeled gaps in either y_discrete or y.


#12

Would replacing inv_logit and bernoulli to bernoulli_logit actually decrease the bias in estimation, increase the efficiency of the computing, or is it to make the code more readable?

The is_discrete variable is modeling whether or not the y is {0, 1}. If it is, then y_discrete is the value (0 or 1) that the discrete variable takes. y is then for all other observations (0, 1).

I was worried about unmodeled gaps, too. We are trying to model alpha, gamma, mu, and phi, however, where the values of 0 and 1 have the probabilities:

(taken from 6 here: https://arxiv.org/pdf/0705.0700.pdf)

That might help explain the is_discrete variable. So if we condition on alpha and then to the if-else, we incorporate the alpha * gamma probability of being one.

I felt that the unmodeled gaps would be an issue, so I tried to model it:

is_discrete ~ bernoulli(alpha);
  for (i in 1:n) {
    if (is_discrete[i] == 1) {
      y[i] ~ bernoulli(gamma[i]);
    } else {
      y[i] ~ beta(p[i], q[i]);
    }
  }

But it throws an error for using two different likelihoods, depending on the case?


#13

Trying this:

  for (i in 1:n) {
    if (is_discrete[i] == 1) {
      y[i] ~ bernoulli(gamma[i]);
    } else {
      y[i] ~ beta(p[i], q[i]);
    }
  }

should raise a type error, because bernoulli_lpmf needs an integer y. Declaring y_discrete as an integer array in the transformed data block, and y as a vector in the data block is a workaround. Every element in y is modeled (there are no gaps), but the conditional logic ensures that bernoulli_lpmf and beta_lpdf receive data of the right type.


#14

I suppose if you wanted to have the code better match the definition in the paper, you might consider something like this: zoib.stan (981 Bytes)

Where instead of declaring y_discrete you use the same conditional logic outlined in eq. 6 to increment the log probability:

  for (i in 1:n) {
    if (y[i] == 0) {
      target += log(alpha[i]) + log1m(gamma[i]);
    } else if (y[i] == 1) {
      target += log(alpha[i]) + log(gamma[i]);
    } else {
      target += log1m(alpha[i]) + beta_lpdf(y[i] | p[i], q[i]);
    }
  }

#15

It’ll make the arithmetic more stable and make the program more efficient. It shouldn’t affect the answer if you’re not getting any warnings as is.

Thanks for pasting in the definition. That’s more like the hurdle model. Rather than using two (0,1) parameters alpha and gamma, I just used one simplex

simplex[3] theta = [alpha * (1 - gamma), alpha * gamma, 1 - alpa]';  // 0, 1, (0,1)

It’d make sense to decompose the simplex into two (0, 1) constrained parameters if you had informative priors for alpha or gamma.

Beta regressions are really hard to fit, zero inflated or not. You don’t want to check point estimates, but posterior coverage.

To implement what you pasted in, you need this:

model {
  real log_zero = log(alpha * (1 - gamma));
  real log_one = log(alpha * gamma);
  real log_other = log(1 - alpha);
  ...
  for (i in 1:N) {
    if (y[i] == 0) target += log_zero;
    else if (y[i] == 1) target += log_one;
    else target += log_other + beta_lpdf(y[i] | mu, phi);
  }
...
}

You’ll find this all much much more efficient if in transformed data you create this:

transformed data {
  int<lower = 0> N_zero;  // number of y[i] == 0
  int<lower = 1> N_one;  / number of y[i] == 1
  vector[num_non_zero(y)] y_non_zero;
  ...
}

where y_non_zero holds the non-zero elements of y and num_non_zero(y) computes how many entries there are so that you can declare it. This only gets computed once.

That lets the likelihood be simplified to this much more efficient form:

target += N_zero * log_zero;
target += N_one * log_one;
target += beta_lpdf(y_non_zero | mu, phi);

I think that’s probably because y[i] is continuous, but bernoulli() requires a discrete variate. This won’t be a problem with the definition I coded above.


#16

Exactly! I missed @mbjoseph’s reply before replying.