Bayesian categorical annotation

Hi all!

My name is Jeffrey and I am currently doing a summer research project focusing on Bayesian models for annotation in noisy categorical data situations. This has led me to the excellent work of @Bob_Carpenter, particularly the noisy categorical rating example in the manual, which works very well on our data. I am now trying to code up in Stan some of the models from Multilevel Bayesian Models of Categorical Data Annotation [1], but I have got stuck.

I am trying to code up the first and simplest model, referred to as the binomial model (described on pages 5–7 of the paper).
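To spell out my reading of the model (using the paper's parameter names, with theta1 playing the role of sensitivity and theta0 of specificity):

```
z_i ~ Bernoulli(pi)                          // latent true class of item i
y_n | z_ii[n] = 1 ~ Bernoulli(theta1)        // sensitivity
y_n | z_ii[n] = 0 ~ Bernoulli(1 - theta0)    // specificity theta0
```

so that, marginalising over z_i, Pr(y_n = 1) = pi * theta1 + (1 - pi) * (1 - theta0).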

The Stan code which I feel most accurately reproduces the model is the following:


data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> I;               // number of items
  int<lower=1> ii[N];           // item labels
  int<lower=0, upper=1> y[N];   // label for annotation n
}

parameters {
  real<lower=0, upper=1> pi;
  real<lower=0> theta0;
  real<lower=0> theta1;
}

transformed parameters {
  vector[2] log_p_z[I];
  for (i in 1:I) {
    log_p_z[i, 1] += bernoulli_lpmf(1 | pi);
    log_p_z[i, 2] += bernoulli_lpmf(0 | pi);
  }
  for (n in 1:N) {
    log_p_z[ii[n], 1] += bernoulli_lpmf(y[ii[n]] | theta1);
    log_p_z[ii[n], 2] += bernoulli_lpmf(y[ii[n]] | (1 - theta0));
  }
}

model {
  pi ~ beta(2,2);
  theta0 ~ beta(2, 2);
  theta1 ~ beta(2, 2);

  for (i in 1:I) {
    target += log_sum_exp(log_p_z[i]);
  }
}

However, this model cannot even sample (!) due to errors like:

Log probability evaluates to log(0), i.e. negative infinity.

as well as errors about the parameters of the Bernoulli distributions being invalid.

There is clearly something major wrong here, and I would love it if anyone could point me in the direction of what that could be.

I have also tried mixture distributions, i.e.:

...
for (i in 1:I) {
  target += log_mix(pi,
                    bernoulli_lpmf(y[i] | theta1),
                    bernoulli_lpmf(y[i] | (1 - theta1)));
}

or

...
for (i in 1:I) {
  target += log_sum_exp(bernoulli_lpmf(0 | pi) + bernoulli_lpmf(y[i] | theta1),
                        bernoulli_lpmf(1 | pi) + bernoulli_lpmf(y[i] | (1 - theta1)));
}

These follow the Algebra and the Missing Oxen blog post by Richard McElreath, but both versions have lots of divergences and don’t reproduce the simulation parameters.
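For what it’s worth, expanding what these are computing (a quick Python check, with log_sum_exp, log_mix and bernoulli_lpmf written out by hand rather than Stan’s actual implementations), the two forms should agree exactly, as long as the component weighted by pi is the theta1 (z = 1) component:

```python
import math

def log_sum_exp(a, b):
    # numerically stable log(exp(a) + exp(b))
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_mix(lam, lp1, lp2):
    # Stan-style log_mix: log(lam * exp(lp1) + (1 - lam) * exp(lp2))
    return log_sum_exp(math.log(lam) + lp1, math.log1p(-lam) + lp2)

def bernoulli_lpmf(y, theta):
    # log Bernoulli(y | theta)
    return math.log(theta) if y == 1 else math.log1p(-theta)

pi_, theta0, theta1 = 0.2, 0.7, 0.8

for y in (0, 1):
    via_mix = log_mix(pi_, bernoulli_lpmf(y, theta1),
                      bernoulli_lpmf(y, 1 - theta0))
    via_lse = log_sum_exp(bernoulli_lpmf(1, pi_) + bernoulli_lpmf(y, theta1),
                          bernoulli_lpmf(0, pi_) + bernoulli_lpmf(y, 1 - theta0))
    assert abs(via_mix - via_lse) < 1e-12
```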

Some code I wrote (in R) to simulate from the model is:

set.seed(42)

N <- 1000
ii <- rep(1:100, 10)
pi <- 0.2
theta1 <- 0.8
theta0 <- 0.7
z <- rbinom(N, 1, pi)
y <- rbinom(N, 1, z * theta1 + (1 - z) * (1 - theta0))
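As a sanity check on the simulation, here is the same generative process sketched in Python (hand-rolled, just mirroring the R above): marginally over z, Pr(y = 1) = 0.2 * 0.8 + 0.8 * 0.3 = 0.40, so the simulated mean of y should land near 0.40. (One thing I notice writing it out: here, as in the R code, z is drawn once per annotation, whereas the Stan model treats z as a single latent class per item.)

```python
import random

random.seed(42)

N = 1000
pi_, theta1, theta0 = 0.2, 0.8, 0.7

# latent class per annotation (as in the R code), then the noisy label
z = [1 if random.random() < pi_ else 0 for _ in range(N)]
y = [1 if random.random() < (theta1 if zi == 1 else 1 - theta0) else 0
     for zi in z]

# marginal probability of a positive label: pi*theta1 + (1-pi)*(1-theta0)
p_marginal = pi_ * theta1 + (1 - pi_) * (1 - theta0)  # 0.40

print(round(p_marginal, 2), sum(y) / N)
```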

and to run the model:

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
N <- length(y)
I <- max(ii)

stan_data <- list(N = N, y = y, ii = ii, I = I)

fit <- stan(file = "path to file", data = stan_data, seed = 42)

I am very new to the art of translating these (at least partly) non-trivial models into Stan so any help would be appreciated!

Thanks!

p.s. Thanks for some great models/papers in this area @Bob_Carpenter !

  1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.1374&rep=rep1&type=pdf

theta0 and theta1 also need upper=1 bounds like pi.

Hi hhau,

Thanks for your response!

When I have the bounds you mention in place I get errors like:

Log probability evaluates to log(0), i.e. negative infinity.

And when I do not I get errors like:

Exception: bernoulli_lpmf: Probability parameter is -3.4802, but must be in the interval [0, 1] (in 'model4a3811c4b215_binomial' at line 26)

Honestly I find both these errors pretty mystifying!

All the best!

I think the code assumes log_p_z is initialised to zero, which I don’t think it is. Per section 3.6 in the manual:

For all blocks defining variables (transformed data, transformed parameters, generated quantities), real values are initialized to NaN and integer values are initialized to the smallest legal integer (i.e., a large absolute value negative number).
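The effect is easy to reproduce in miniature (Python here purely for illustration): `+=` on a NaN leaves NaN, so every entry of log_p_z stays NaN no matter what is added, which is consistent with the errors above:

```python
import math

# mimic Stan's NaN initialization of transformed parameters
log_p_z = [float("nan"), float("nan")]

log_p_z[0] += math.log(0.2)  # nan + anything is still nan
log_p_z[1] += math.log(0.8)

print(log_p_z)  # both entries remain nan
```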

Doing something simple in the transformed parameters block like:

data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> I;               // number of items
  int<lower=1> ii[N];           // item labels
  int<lower=0, upper=1> y[N];   // label for annotation n
}

parameters {
  real<lower=0, upper=1> pi_prob;
  real<lower=0, upper=1> theta0;
  real<lower=0, upper=1> theta1;
}

transformed parameters {
  vector[2] log_p_z[I];

  for (i in 1:I) {
    log_p_z[i, 1] = 0.0;
    log_p_z[i, 2] = 0.0;
  }

  for (i in 1:I) {
    log_p_z[i, 1] += bernoulli_lpmf(1 | pi_prob);
    log_p_z[i, 2] += bernoulli_lpmf(0 | pi_prob);
  }
  for (n in 1:N) {
    log_p_z[ii[n], 1] += bernoulli_lpmf(y[ii[n]] | theta1);
    log_p_z[ii[n], 2] += bernoulli_lpmf(y[ii[n]] | (1 - theta0));
  }
}

model {
  pi_prob ~ beta(2, 2);
  theta0 ~ beta(2, 2);
  theta1 ~ beta(2, 2);

  for (i in 1:I) {
    target += log_sum_exp(log_p_z[i]);
  }
}

seems to work (the initialising to zero should be vectorisable but the correct syntax for doing so eludes me right now).


Hi hhau,

Thanks for your response! With the alterations you suggest the model runs! Unfortunately it still doesn’t spit out the correct parameter estimates… I’m not sure whether this is due to the model itself, how I’m simulating from it, or my Stan implementation though!

All the best!