Can anyone provide any references for modeling a zero-one inflated beta regression in Stan?

Are there any packages that can run that model through R?

Thank you

There is the **cdfquantreg** R package, which has likelihoods that put positive mass on 0 and/or 1, but it does not use Stan. I started to incorporate it into RStanArm but didn’t finish. In the meantime, you could transform your `y` with

```
(y * (n - 1) + 0.5) / n
```

to force it into the interior, where `n` is the number of observations.
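For concreteness, here is a minimal sketch of that transformation in plain Python (the function name `squeeze` is mine, not from any package):

```python
# Squeeze y from [0, 1] into the open interval (0, 1) so a plain beta
# likelihood applies; n is the number of observations, as suggested above.
def squeeze(y, n):
    return [(yi * (n - 1) + 0.5) / n for yi in y]

# exact 0s map to 0.5 / n and exact 1s to (n - 0.5) / n, both interior
y_star = squeeze([0.0, 0.25, 1.0], 3)
```

Nothing lands exactly on the boundary afterwards, so `beta_lpdf` never sees a 0 or 1.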

You’d code it in Stan just like the zero-inflated Poisson, only with two inflated points. Beta regressions are awfully hard to fit. And ones that allow true zero and true one outputs sound dangerous. Where’d the values in (0, 1) come from? If they’re counts, you can (and should) just model those directly.

Did you ever write code to do this in Stan proper? I’m trying to put together a zero-one inflated beta regression model currently, and I am having trouble taking the various parameterizations across papers and writing it into Stan.

If you just want additional probability mass at 0 and 1, the generative model has probability `theta[1]` of generating 0, probability `theta[2]` of generating 1, and probability `theta[3]` of drawing from `poisson(lambda)` (or whatever the parameter of the Poisson is). In Stan, that’s a mixture that we can simplify because we know some of the mixture component probabilities are zero in some cases.

```
data {
  int<lower = 0> N;
  int<lower = 0> y[N];
}
parameters {
  simplex[3] theta;  // mixture of 0, 1, poisson (in that order)
  ...
}
transformed parameters {
  vector<upper = 0>[3] log_theta = log(theta);
  ...
}
model {
  for (n in 1:N) {
    real poisson_lp = log_theta[3] + poisson_lpmf(y[n] | lambda);
    if (y[n] == 0)
      target += log_sum_exp(log_theta[1], poisson_lp);
    else if (y[n] == 1)
      target += log_sum_exp(log_theta[2], poisson_lp);
    else
      target += poisson_lp;
  }
  ...
}
```
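As a quick numeric sanity check (plain Python mirroring the Stan logic, not calling Stan), the `log_sum_exp` branch for `y[n] == 0` agrees with directly taking the log of the mixed probability; the `theta` and `lambda` values here are made up for the example:

```python
import math

def log_sum_exp(a, b):
    # numerically stable log(exp(a) + exp(b))
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def poisson_lpmf(k, lam):
    # log Poisson pmf: -lambda + k * log(lambda) - log(k!)
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

theta = [0.2, 0.1, 0.7]  # P(inflated 0), P(inflated 1), P(Poisson)
lam = 2.5

# the y == 0 branch of the Stan loop above
lp_zero = log_sum_exp(math.log(theta[0]),
                      math.log(theta[2]) + poisson_lpmf(0, lam))

# direct computation of log P(y = 0) under the mixture
direct = math.log(theta[0] + theta[2] * math.exp(-lam))
```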

This could be generalized to 0 to K inflation, and written as:

```
simplex[K + 2] theta;
...
real poisson_lp = log_theta[K + 2] + poisson_lpmf(y[n] | lambda);
...
if (y[n] <= K)
  target += log_sum_exp(log_theta[y[n] + 1], poisson_lp);
else
  target += poisson_lp;
```

Sounds good, thanks! I need to read more of the manual on simplexes and on incrementing the target, as I’ve been adapting a lot of things from JAGS and BUGS. For example, I’ve adapted some JAGS code into a zero-one inflated beta regression model in Stan:

```
data {
  int n;
  int x[n];
  vector<lower=0, upper=1>[n] y;
}
transformed data {
  int<lower=0, upper=1> is_discrete[n];
  int<lower=-1, upper=1> y_discrete[n];
  for (i in 1:n) {
    if (y[i] == 0) {
      is_discrete[i] = 1;
      y_discrete[i] = 0;
    } else if (y[i] == 1) {
      is_discrete[i] = 1;
      y_discrete[i] = 1;
    } else {
      is_discrete[i] = 0;
      // -1 is a sentinel; it throws an error if passed to bernoulli
      y_discrete[i] = -1;
    }
  }
}
parameters {
  vector[2] coef_a;
  vector[2] coef_g;
  vector[2] coef_m;
  vector[2] coef_p;
}
transformed parameters {
  vector<lower=0, upper=1>[n] alpha;
  vector<lower=0, upper=1>[n] gamma;
  vector[n] mu;
  vector<lower=0>[n] phi;
  vector<lower=0>[n] p;
  vector<lower=0>[n] q;
  for (i in 1:n) {
    alpha[i] = inv_logit(coef_a[1] + coef_a[2] * x[i]);
    gamma[i] = inv_logit(coef_g[1] + coef_g[2] * x[i]);
    mu[i] = inv_logit(coef_m[1] + coef_m[2] * x[i]);
    phi[i] = exp(coef_p[1] + coef_p[2] * x[i]);
    p[i] = mu[i] * phi[i];
    q[i] = phi[i] - mu[i] * phi[i];
  }
}
model {
  coef_a ~ normal(0, 1);
  coef_g ~ normal(0, 1);
  coef_m ~ normal(0, 1);
  coef_p ~ normal(0, 1);
  is_discrete ~ bernoulli(alpha);
  for (i in 1:n) {
    if (is_discrete[i] == 1) {
      y_discrete[i] ~ bernoulli(gamma[i]);
    } else {
      y[i] ~ beta(p[i], q[i]);
    }
  }
}
```

Unfortunately, I am having a hard time getting accurate estimates with that code on data I generated *using the same model specification,* so I was wondering if anyone had code for the zero-and-one case.
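For simulation-based checking, here is a quick Python sketch of a generator for this model (the function name and parameter values are mine, and covariate effects are omitted for simplicity):

```python
import random

def rzoib(n, alpha, gamma, mu, phi):
    # Draw n values from a zero-one inflated beta: discrete with
    # probability alpha; one with probability gamma given discrete;
    # otherwise Beta(mu * phi, (1 - mu) * phi) on (0, 1).
    y = []
    for _ in range(n):
        if random.random() < alpha:
            y.append(1.0 if random.random() < gamma else 0.0)
        else:
            y.append(random.betavariate(mu * phi, (1 - mu) * phi))
    return y

random.seed(1)
y = rzoib(10000, 0.3, 0.5, 0.4, 5.0)
frac_discrete = sum(v == 0.0 or v == 1.0 for v in y) / len(y)
```

Fitting the Stan program to draws like these and comparing posteriors against the known `alpha`, `gamma`, `mu`, and `phi` is the recovery check being discussed.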

Hey @markhw - any more details on what seems to be going wrong with parameter recovery? E.g., which parameters are not recoverable, is there consistent bias, and how do things scale with N?

Another strategy would be to simplify the model (e.g., remove all covariate effects and add them back to the model one-by-one), so that you can get a better sense for what specifically is problematic.

I tweaked the R script that you had sent earlier, and it seems like the model recovers the true parameters for large N: https://gist.github.com/mbjoseph/a0961d20263ad50dd953c11defa59736

*Note that I’ve changed the data declaration for `x` from `int x[n]` to `vector[n] x`.*

I’ve attached the R code that generates the data (similar to what you’ve seen). zoib_generation.R (1.7 KB)

It calls a file called `zoib.stan`, which is what I posted in the comment above. It looks like the model captures the beta distribution well but not the Bernoulli: the coefficients predicting alpha and gamma are off, while the mu and phi coefficients are not:

```
  coefficient estimated  real difference
  <chr>           <dbl> <dbl>      <dbl>
1 coef_a[1]        0.12 -0.13       0.25
2 coef_a[2]       -0.34 -0.05      -0.29
3 coef_g[1]       -0.02  0.00      -0.02
4 coef_g[2]        0.45  0.30       0.15
5 coef_m[1]        0.01  0.00       0.01
6 coef_m[2]        0.32  0.30       0.02
7 coef_p[1]        1.86  1.80       0.06
8 coef_p[2]        0.29  0.30      -0.01
```

It does look like you’re right about large N, though. When I upped the N to 10,000 instead of 1,000, I got better recovery of the coefficients:

```
  coefficient estimated  real difference
  <chr>           <dbl> <dbl>      <dbl>
1 coef_a[1]       -0.13 -0.13       0.00
2 coef_a[2]       -0.13 -0.05      -0.08
3 coef_g[1]       -0.09  0.00      -0.09
4 coef_g[2]        0.44  0.30       0.14
5 coef_m[1]       -0.02  0.00      -0.02
6 coef_m[2]        0.32  0.30       0.02
7 coef_p[1]        1.77  1.80      -0.03
8 coef_p[2]        0.32  0.30       0.02
```

I’m wondering if the method is really that biased for certain data, or if there’s something in how Ospina and Ferrari define the likelihood that we are missing?

I think the likelihood is correct based on my reading of the paper, but if somebody can find a mistake that we’ve missed, that would be great!

In looking at the marginal posterior distributions of each parameter relative to the true values, things don’t look terrible to me.

One thing to consider also is that, with the current parameterization, only the discrete observations provide information on `gamma`, so in this case there are < 1000 independent data points that can contribute to the likelihood for `coef_g`.
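A back-of-envelope way to quantify this (my own approximation, not from the thread): the Fisher information for a logit-scale Bernoulli parameter scales with the number of discrete observations, so the standard error grows as `alpha` shrinks:

```python
import math

def approx_se_logit(n, alpha, gamma):
    # Rough standard error for a logit-scale intercept estimated from
    # the discrete observations only; per-observation Fisher information
    # on the logit scale for Bernoulli(gamma) is gamma * (1 - gamma).
    n_eff = n * alpha  # expected number of discrete (0/1) observations
    return 1 / math.sqrt(n_eff * gamma * (1 - gamma))

se_full = approx_se_logit(1000, 1.0, 0.5)      # every point discrete
se_quarter = approx_se_logit(1000, 0.25, 0.5)  # only ~250 discrete points
# quartering the effective sample size doubles the standard error
```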

You want to replace this pattern:

```
gamma = inv_logit(...);
y ~ bernoulli(gamma);
```

with

```
y ~ bernoulli_logit(...);
```

I don’t understand the is-discrete thing here, but can’t you just vectorize these to:

```
y_discrete ~ bernoulli(gamma);
y ~ beta(p, q);
```

Otherwise, it seems like you wind up with unmodeled gaps in either `y_discrete` or `y`.

Would replacing `inv_logit` and `bernoulli` with `bernoulli_logit` actually decrease the bias in estimation, increase computational efficiency, or is it just to make the code more readable?

The `is_discrete` variable models whether or not `y` is in {0, 1}. If it is, then `y_discrete` is the value (0 or 1) that the discrete variable takes. `y` is then for all other observations, in (0, 1).

I was worried about unmodeled gaps, too. We are trying to model alpha, gamma, mu, and phi, however, where the values of 0 and 1 have the probabilities given in eq. 6 here: https://arxiv.org/pdf/0705.0700.pdf
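Transcribed for reference (my rendering of Ospina & Ferrari's eq. 6 using this thread's parameter names, so double-check against the paper):

```
p(y \mid \alpha, \gamma, \mu, \phi) =
  \begin{cases}
    \alpha \, (1 - \gamma)                                        & y = 0 \\
    \alpha \, \gamma                                              & y = 1 \\
    (1 - \alpha) \, \mathrm{Beta}(y \mid \mu\phi,\ (1 - \mu)\phi) & 0 < y < 1
  \end{cases}
```

Here alpha is the probability of a discrete outcome, gamma is the probability of 1 given a discrete outcome, and mu and phi are the beta mean and precision.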

That might help explain the `is_discrete` variable. So if we condition on alpha and then go to the if-else, we incorporate the alpha * gamma probability of being one.

I felt that the unmodeled gaps would be an issue, so I tried to model it:

```
is_discrete ~ bernoulli(alpha);
for (i in 1:n) {
  if (is_discrete[i] == 1) {
    y[i] ~ bernoulli(gamma[i]);
  } else {
    y[i] ~ beta(p[i], q[i]);
  }
}
```

But it throws an error for using two different likelihoods, depending on the case?

Trying this:

```
for (i in 1:n) {
  if (is_discrete[i] == 1) {
    y[i] ~ bernoulli(gamma[i]);
  } else {
    y[i] ~ beta(p[i], q[i]);
  }
}
```

should raise a type error, because `bernoulli_lpmf` needs an integer `y`. Declaring `y_discrete` as an integer array in the transformed data block, and `y` as a vector in the data block, is a workaround. Every element in `y` is modeled (there are no gaps), but the conditional logic ensures that `bernoulli_lpmf` and `beta_lpdf` receive data of the right type.

I suppose if you wanted the code to better match the definition in the paper, you might consider something like this: zoib.stan (981 Bytes). Instead of declaring `y_discrete`, you use the same conditional logic outlined in eq. 6 to increment the log probability:

```
for (i in 1:n) {
  if (y[i] == 0) {
    target += log(alpha[i]) + log1m(gamma[i]);
  } else if (y[i] == 1) {
    target += log(alpha[i]) + log(gamma[i]);
  } else {
    target += log1m(alpha[i]) + beta_lpdf(y[i] | p[i], q[i]);
  }
}
```

It’ll make the arithmetic more stable and make the program more efficient. It shouldn’t affect the answer if you’re not getting any warnings as is.
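The stability point can be illustrated in plain Python (an analogue of what `bernoulli_logit` does internally, not Stan's actual implementation):

```python
import math

def log_inv_logit_naive(x):
    # compute inv_logit first, then log: breaks down for extreme x,
    # where exp(-x) overflows or the ratio underflows to 0
    return math.log(1 / (1 + math.exp(-x)))

def log_inv_logit_stable(x):
    # stay on the log scale throughout:
    # log sigma(x) = x - log1p(exp(x)) for x <= 0, -log1p(exp(-x)) otherwise
    if x <= 0:
        return x - math.log1p(math.exp(x))
    return -math.log1p(math.exp(-x))

# both agree in the moderate range, but only the stable version
# survives x = -745, where exp(-x) overflows a double
```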

Thanks for pasting in the definition. That’s more like the hurdle model. Rather than using two (0, 1) parameters, alpha and gamma, I just used one simplex:

```
simplex[3] theta = [alpha * (1 - gamma), alpha * gamma, 1 - alpha]'; // 0, 1, (0, 1)
```

It’d make sense to decompose the simplex into two (0, 1)-constrained parameters if you had informative priors for `alpha` or `gamma`.

Beta regressions are really hard to fit, zero inflated or not. You don’t want to check point estimates, but posterior coverage.

To implement what you pasted in, you need this:

```
model {
  real log_zero = log(alpha) + log1m(gamma);
  real log_one = log(alpha) + log(gamma);
  real log_other = log1m(alpha);
  ...
  for (i in 1:N) {
    if (y[i] == 0) target += log_zero;
    else if (y[i] == 1) target += log_one;
    else target += log_other + beta_lpdf(y[i] | mu * phi, (1 - mu) * phi);
  }
  ...
}
```

You’ll find this all much, much more efficient if you create this in transformed data:

```
transformed data {
  int<lower = 0> N_zero;  // number of y[i] == 0
  int<lower = 0> N_one;   // number of y[i] == 1
  vector[num_non_zero(y)] y_non_zero;
  ...
}
```

where `y_non_zero` holds the non-zero elements of `y` and `num_non_zero(y)` is a user-defined function that computes how many there are so that you can declare the vector. This only gets computed once.

That lets the likelihood be simplified to this much more efficient form:

```
target += N_zero * log_zero;
target += N_one * log_one;
target += (N - N_zero - N_one) * log_other;
target += beta_lpdf(y_non_zero | mu * phi, (1 - mu) * phi);
```
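To sanity-check the algebra (plain Python standing in for the Stan computation; the mixture weights and beta shapes here are made up for the example), the sufficient-statistic form matches the per-observation loop exactly:

```python
import math

def beta_lpdf(y, p, q):
    # log density of Beta(p, q) at y in (0, 1)
    return (math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
            + (p - 1) * math.log(y) + (q - 1) * math.log(1 - y))

log_zero, log_one = math.log(0.15), math.log(0.05)
log_other = math.log(0.80)
p, q = 2.0, 5.0

y = [0.0, 1.0, 0.3, 0.0, 0.6, 0.9]

# per-observation loop, as in the model block above
lp_loop = 0.0
for yi in y:
    if yi == 0:
        lp_loop += log_zero
    elif yi == 1:
        lp_loop += log_one
    else:
        lp_loop += log_other + beta_lpdf(yi, p, q)

# sufficient-statistic form: count the 0s and 1s once
n_zero = sum(yi == 0 for yi in y)
n_one = sum(yi == 1 for yi in y)
y_interior = [yi for yi in y if 0 < yi < 1]
lp_fast = (n_zero * log_zero + n_one * log_one
           + len(y_interior) * log_other
           + sum(beta_lpdf(yi, p, q) for yi in y_interior))
```

Counting the discrete outcomes turns O(N) log evaluations into three multiplications, which is where the speedup comes from.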

I think that’s probably because `y[i]` is continuous, but `bernoulli()` requires a discrete variate. This won’t be a problem with the definition I coded above.