Bayesian categorical annotation

Hi all!

My name is Jeffrey and I am currently doing a summer research project focusing on Bayesian models for annotation in noisy categorical data situations. This has led me to the excellent work of @Bob_Carpenter, particularly the noisy categorical rating example in the manual, which works very well on our data. I am now trying to code up in Stan some of the models from Multilevel Bayesian Models of Categorical Data Annotation [1], but I have got stuck.

I am trying to code up the first and simplest model, referred to as the binomial model (described on pages 5–7 of the paper).
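To spell out my reading of the model (using the paper's parameter names, with theta1 playing the role of sensitivity and theta0 of specificity):

```
z_i ~ Bernoulli(pi)                          // latent true class of item i
y_n | z_ii[n] = 1 ~ Bernoulli(theta1)        // sensitivity
y_n | z_ii[n] = 0 ~ Bernoulli(1 - theta0)    // specificity theta0
```

so that, marginalising over z_i, Pr(y_n = 1) = pi * theta1 + (1 - pi) * (1 - theta0).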

The Stan code which I feel most accurately reproduces the model is the following:


data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> I;               // number of items
  int<lower=1> ii[N];           // item labels
  int<lower=0, upper=1> y[N];   // label for annotation n
}

parameters {
  real<lower=0, upper=1> pi;
  real<lower=0> theta0;
  real<lower=0> theta1;
}

transformed parameters {
  vector[2] log_p_z[I];
  for (i in 1:I) {
    log_p_z[i, 1] += bernoulli_lpmf(1 | pi);
    log_p_z[i, 2] += bernoulli_lpmf(0 | pi);
  }
  for (n in 1:N) {
    log_p_z[ii[n], 1] += bernoulli_lpmf(y[ii[n]] | theta1);
    log_p_z[ii[n], 2] += bernoulli_lpmf(y[ii[n]] | (1 - theta0));
  }
}

model {
  pi ~ beta(2,2);
  theta0 ~ beta(2, 2);
  theta1 ~ beta(2, 2);

  for (i in 1:I) {
    target += log_sum_exp(log_p_z[i]);
  }
}

However, this model cannot even sample (!) due to errors like:

Log probability evaluates to log(0), i.e. negative infinity.

as well as errors about the parameters of the Bernoulli distributions being invalid.

There is clearly something major wrong here, and I would love it if anyone could point me in the direction of what that could be.

I have also tried mixture distributions, i.e.:

...
for (i in 1:I) {
  target += log_mix(pi,
                    bernoulli_lpmf(y[i] | theta1),
                    bernoulli_lpmf(y[i] | (1 - theta1)));
}

or

...
for (i in 1:I) {
  target += log_sum_exp(bernoulli_lpmf(0 | pi) + bernoulli_lpmf(y[i] | theta1),
                        bernoulli_lpmf(1 | pi) + bernoulli_lpmf(y[i] | (1 - theta1)));
}

These follow the Algebra and the Missing Oxen blog post by Richard McElreath, but both versions have lots of divergences and don’t reproduce the simulation parameters.
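For what it’s worth, expanding what these are computing (a quick Python check, with log_sum_exp, log_mix and bernoulli_lpmf written out by hand rather than Stan’s actual implementations), the two forms should agree exactly, as long as the component weighted by pi is the theta1 (z = 1) component:

```python
import math

def log_sum_exp(a, b):
    # numerically stable log(exp(a) + exp(b))
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_mix(lam, lp1, lp2):
    # Stan-style log_mix: log(lam * exp(lp1) + (1 - lam) * exp(lp2))
    return log_sum_exp(math.log(lam) + lp1, math.log1p(-lam) + lp2)

def bernoulli_lpmf(y, theta):
    # log Bernoulli(y | theta)
    return math.log(theta) if y == 1 else math.log1p(-theta)

pi_, theta0, theta1 = 0.2, 0.7, 0.8

for y in (0, 1):
    via_mix = log_mix(pi_, bernoulli_lpmf(y, theta1),
                      bernoulli_lpmf(y, 1 - theta0))
    via_lse = log_sum_exp(bernoulli_lpmf(1, pi_) + bernoulli_lpmf(y, theta1),
                          bernoulli_lpmf(0, pi_) + bernoulli_lpmf(y, 1 - theta0))
    assert abs(via_mix - via_lse) < 1e-12
```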

Some code I wrote (in R) to simulate from the model is:

set.seed(42)

N <- 1000
ii <- rep(1:100, 10)
pi <- 0.2
theta1 <- 0.8
theta0 <- 0.7
z <- rbinom(N, 1, pi)
y <- rbinom(N, 1, z * theta1 + (1 - z) * (1 - theta0))
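As a sanity check on the simulation, here is the same generative process sketched in Python (hand-rolled, just mirroring the R above): marginally over z, Pr(y = 1) = 0.2 * 0.8 + 0.8 * 0.3 = 0.40, so the simulated mean of y should land near 0.40. (One thing I notice writing it out: here, as in the R code, z is drawn once per annotation, whereas the Stan model treats z as a single latent class per item.)

```python
import random

random.seed(42)

N = 1000
pi_, theta1, theta0 = 0.2, 0.8, 0.7

# latent class per annotation (as in the R code), then the noisy label
z = [1 if random.random() < pi_ else 0 for _ in range(N)]
y = [1 if random.random() < (theta1 if zi == 1 else 1 - theta0) else 0
     for zi in z]

# marginal probability of a positive label: pi*theta1 + (1-pi)*(1-theta0)
p_marginal = pi_ * theta1 + (1 - pi_) * (1 - theta0)  # 0.40

print(round(p_marginal, 2), sum(y) / N)
```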

and to run the model:

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
N <- length(y)
I <- max(ii)

stan_data <- list(N = N, y = y, ii = ii, I = I)

fit <- stan(file = "path to file", data = stan_data, seed = 42)

I am very new to the art of translating these (at least partly) non-trivial models into Stan so any help would be appreciated!

Thanks!

p.s. Thanks for some great models/papers in this area @Bob_Carpenter !

  1. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.1374&rep=rep1&type=pdf

theta0 and theta1 also need upper=1 bounds like pi.

Hi hhau,

Thanks for your response!

When I have the bounds you mention in place I get errors like:

Log probability evaluates to log(0), i.e. negative infinity.

And when I do not I get errors like:

Exception: bernoulli_lpmf: Probability parameter is -3.4802, but must be in the interval [0, 1] (in 'model4a3811c4b215_binomial' at line 26)

Honestly I find both these errors pretty mystifying!

All the best!

I think the code assumes log_p_z is initialised to zero, which I don’t think it is. Per section 3.6 in the manual:

For all blocks defining variables (transformed data, transformed parameters, generated quantities), real values are initialized to NaN and integer values are initialized to the smallest legal integer (i.e., a large absolute value negative number).
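The effect is easy to reproduce in miniature (Python here purely for illustration): `+=` on a NaN leaves NaN, so every entry of log_p_z stays NaN no matter what is added, which is consistent with the errors above:

```python
import math

# mimic Stan's NaN initialization of transformed parameters
log_p_z = [float("nan"), float("nan")]

log_p_z[0] += math.log(0.2)  # nan + anything is still nan
log_p_z[1] += math.log(0.8)

print(log_p_z)  # both entries remain nan
```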

Doing something simple in the transformed parameters block like:

data {
  int<lower=1> N;               // total number of annotations
  int<lower=1> I;               // number of items
  int<lower=1> ii[N];           // item labels
  int<lower=0, upper=1> y[N];   // label for annotation n
}

parameters {
  real<lower=0, upper=1> pi_prob;
  real<lower=0, upper=1> theta0;
  real<lower=0, upper=1> theta1;
}

transformed parameters {
  vector[2] log_p_z[I];

  for (i in 1:I) {
    log_p_z[i, 1] = 0.0;
    log_p_z[i, 2] = 0.0;
  }

  for (i in 1:I) {
    log_p_z[i, 1] += bernoulli_lpmf(1 | pi_prob);
    log_p_z[i, 2] += bernoulli_lpmf(0 | pi_prob);
  }
  for (n in 1:N) {
    log_p_z[ii[n], 1] += bernoulli_lpmf(y[ii[n]] | theta1);
    log_p_z[ii[n], 2] += bernoulli_lpmf(y[ii[n]] | (1 - theta0));
  }
}

model {
  pi_prob ~ beta(2, 2);
  theta0 ~ beta(2, 2);
  theta1 ~ beta(2, 2);

  for (i in 1:I) {
    target += log_sum_exp(log_p_z[i]);
  }
}

seems to work (the initialising to zero should be vectorisable but the correct syntax for doing so eludes me right now).


Hi hhau,

Thanks for your response! With the alterations you suggest the model runs! Unfortunately it still doesn’t spit out the correct parameter estimates… I’m not sure whether this is due to the model itself, how I’m simulating from it, or my Stan implementation though!

All the best!