I have data representing the time it takes for an activity to complete; however this data is noisy and sometimes the stop watch fails so that timings are inflated.
The model I am looking to fit assumes the observed data Y can be expressed as: Y = T + BE, where T and E respectively denote true and excess timing, and B is Bernoulli distributed. That is when (unobserved) B = 0 the timing is accurate, and when B= 1 the timing is inflated.
Since both T,\, E represent timings, it is reasonable to assume they are Gamma distributed: so I am looking to estimate parameters \alpha_T, \beta_T, \alpha_E, \beta_E, \theta.
To implement this in Stan, I was imagining it would be best to specify this as a mixture model: for which I need the distribution of both T, and T + E to pass to log_mix
.
The challenge is that to the best of my knowledge, there is no closed form solution to the pdf of the sum of two Gamma distributions.
I would greatly appreciate any thoughts on how to specify this model in Stan.
Thanks.
The code below sets out the special case where I make the simplifying assumption \alpha_T = \alpha_E = 1, so that the time estimates are assumed to be Exponentially distributed. In this case the PDF of the sum T + E is explicitly known:
\frac{\lambda_T \lambda_E}{|\lambda_T - \lambda_E|} \left( e^{-\lambda_T x} - e^{-\lambda_E x} \right)
data {
int M; // samples
vector[M] y; // observed values
}
parameters {
positive_ordered[2] rate; // rate of true/excess exponentials
real<lower=0,upper=1> theta; // probability of incurring excess
}
transformed parameters {
real exponential_sum_lpdf[M];
// log likelihood for exponential sum
for(m in 1:M)
exponential_sum_lpdf[m] = log(rate[1]) + log(rate[2]) -
log(rate[2] - rate[1]) +
log( exp( -y[m] * rate[1] ) - exp( -y[m] * rate[2] ));
}
model {
// priors
rate[1] ~ exponential(1);
rate[2] ~ exponential(1);
theta ~ beta(1,2);
// likelihood
for (m in 1:M)
target += log_mix(1 - theta,
exponential_lpdf(y[m] | rate[2]),
exponential_sum_lpdf[m]);
}
Code for running in R:
library(rstan)
set.seed(20200421)
df <- data.frame(
true = rexp(100, rate = 1/10),
excess = rexp(100, rate = 1/40),
bin = rbinom(100, size = 1, prob = 0.2)
)
df$y = df$true + df$bin * df$excess
# stan_code argument refers to the above.
fit <- rstan::stan(model_code = stan_code, data = list(M = nrow(df), y = df$y))
summary(fit, pars = c("rate[1]", "rate[2]", "theta"))