Modeling outliers with multi-level model fails

Hi all!

I’m new to Stan and I’m trying to model data that is generated from binomial distribution, but has outliers. For example, each hour I have 1000 attempts and the number of failures is around 1%, but rarely (every ~40 hours), it jumps to 10%.

Here is the code, from R:

model_string <- "
data{
    int<lower=0> attempts[200];
    int<lower=0> drops[200];
}
parameters{
  
    real<lower=0,upper=1> p;
    real<lower=0,upper=1> spike_chance;
    real<lower=0,upper=1> spike_height;
}
model {
    real real_p;
    int is_spike;
    
    spike_height ~ uniform(0, 1);
    spike_chance ~ uniform(0, 1);
    p ~ uniform(0, 1);
    is_spike ~ binomial(1, spike_chance);

    real_p = p + is_spike * spike_height;
    drops ~ binomial( attempts , real_p );
}
"
my_attempts = rep(1000,200)
my_drops = rbinom(200,1000,0.01)

# Add some outliers
my_drops[seq(0,200,40)] = 100

data_list <- list(attempts = my_attempts, drops = my_drops)

# Compiling and producing posterior samples from the model.
stan_samples <- stan(model_code = model_string, data = data_list, chains = 1)

The approach I’ve taken is to have a drop rate as normal drop rate + is_spike * spike_drop_rate, where is_spike is integer that is 1 approximately 1% of the time. However, this model fails with the message:

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: binomial_lpmf: Successes variable is -2147483648, but must be in the interval [0, 1]  (in 'model1208741e5057_11e3fb205845c7ea7b9ea0b3acfc1cce' at line 19)

Is my approach correct? If so, how can I fix my model? Thanks in advance!

is_spike ~ binomial(1, spike_chance);

The ~ syntax isn’t actually sampling a distribution in Stan, so this isn’t doing what you expected.

The ~ statement translates to:

target += binomial_lpdf(is_spike | 1, spike_change);

and since is_spike has not been set to any particular value, it starts off with something strange.

Stan doesn’t sample discrete parameters, but if you can integrate them out you can still work with the model. I believe what you’re describing can be fit as a mixture model (at every time point, there is a probability of the outcome coming from distribution 1 or distribution 2). There are some examples of mixture models in the manual that you might be able to adapt: https://mc-stan.org/docs/2_24/stan-users-guide/zero-inflated-section.html (and the previous sections).

2 Likes