Hi all!
My name is Jeffrey and I am currently doing a summer research project focusing on bayesian model for annotation in noisy categorical data situations. This has led me to the excellent work of @Bob_Carpenter particularly the noisy categorical rating example in the manual which works very well on our data. I am now trying to code in stan some of the models from Multilevel Bayesian Models of
Categorical Data Annotation 1 but I have got stuck.
I am trying to code up the first and simplest model referred to as the binomial model (described on pages 5 - 7 of the paper.
The stan code which I feel most accurately reproduces the model is the following:
data {
int<lower=1> N; // total number of annotations
int<lower=1> I; // number of items
int<lower=1> ii[N]; // item labels
int<lower=0, upper=1> y[N]; // label for annotation n
}
parameters {
real<lower=0, upper=1> pi;
real<lower=0> theta0;
real<lower=0> theta1;
}
transformed parameters {
vector[2] log_p_z[I];
for (i in 1:I) {
log_p_z[i, 1] += bernoulli_lpmf(1 | pi);
log_p_z[i, 2] += bernoulli_lpmf(0 | pi);
}
for (n in 1:N) {
log_p_z[ii[n], 1] += bernoulli_lpmf(y[ii[n]] | theta1);
log_p_z[ii[n], 2] += bernoulli_lpmf(y[ii[n]] | (1 - theta0));
}
}
model {
pi ~ beta(2,2);
theta0 ~ beta(2, 2);
theta1 ~ beta(2, 2);
for (i in 1:I) {
target += log_sum_exp(log_p_z[i]);
}
}
However this model cannot even sample (!) due to errors like:
Log probability evaluates to log(0), i.e. negative infinity.
and errors with parameters of the Bernoulli distributions being invalid.
There is clearly something major and I would love if anyone could point me in the direction of what that could be.
I have also tried mixture distributions i.e.
...
for (i in 1:I) {
target += log_mix(pi,
bernoulli_lpmf(y[i] | theta1),
bernoulli_lpmf(y[i] | (1 - theta1)));
}
or
...
for (i in 1:I) {
target += log_sum_exp(bernoulli_lpmf(0 | pi) + bernoulli_lpmf(y[i] | theta1),
bernoulli_lpmt(1 | pi) + bernoulli_lpmf(y[i] | (1 - theta1)));
}
Following the Algebra and the Missing Oxen blog post by Richard McElreath but these models have lots of divergences and don’t reproduce simulation parameters.
Some code I wrote (in R) to simulate from the model is:
set.seed(42)
N <- 1000
ii <- rep(1:100, 10)
pi <- 0.2
theta1 <- 0.8
theta0 <- 0.7
z <- rbinom(N, 1, pi)
y <- rbinom(N, 1, z * theta1 + (1 - z) * (1 - theta0))
and to run the model:
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
N <- length(y)
I <- max(ii)
stan_data <- list(N = N, y = y, ii = ii, I = I)
fit <- stan(file = "path to file"), data = stan_data, seed = 42)
I am very new to the art of translating these (at least partly) non-trivial models into Stan so any help would be appreciated!
Thanks!
p.s. Thanks for some great models/papers in this area @Bob_Carpenter !