# Specifying a model for inflated event timings

I have data representing the time it takes for an activity to complete; however this data is noisy and sometimes the stop watch fails so that timings are inflated.

The model I am looking to fit assumes the observed data Y can be expressed as: Y = T + BE, where T and E respectively denote true and excess timing, and B is Bernoulli distributed. That is when (unobserved) B = 0 the timing is accurate, and when B= 1 the timing is inflated.

Since both T,\, E represent timings, it is reasonable to assume they are Gamma distributed: so I am looking to estimate parameters \alpha_T, \beta_T, \alpha_E, \beta_E, \theta.

To implement this in Stan, I was imagining it would be best to specify this as a mixture model: for which I need the distribution of both T, and T + E to pass to log_mix.

The challenge is that to the best of my knowledge, there is no closed form solution to the pdf of the sum of two Gamma distributions.

I would greatly appreciate any thoughts on how to specify this model in Stan.

Thanks.

The code below sets out the special case where I make the simplifying assumption \alpha_T = \alpha_E = 1, so that the time estimates are assumed to be Exponentially distributed. In this case the PDF of the sum T + E is explicitly known:

\frac{\lambda_T \lambda_E}{|\lambda_T - \lambda_E|} \left( e^{-\lambda_T x} - e^{-\lambda_E x} \right)

data {
int M;                    // samples
vector[M] y;              // observed values
}

parameters {
positive_ordered rate;    // rate of true/excess exponentials
real<lower=0,upper=1> theta; // probability of incurring excess
}

transformed parameters {
real exponential_sum_lpdf[M];

// log likelihood for exponential sum
for(m in 1:M)
exponential_sum_lpdf[m] = log(rate) + log(rate) -
log(rate - rate) +
log( exp( -y[m] * rate ) - exp( -y[m] * rate ));
}

model {
// priors
rate ~ exponential(1);
rate ~ exponential(1);
theta ~ beta(1,2);

// likelihood
for (m in 1:M)
target += log_mix(1 - theta,
exponential_lpdf(y[m] | rate),
exponential_sum_lpdf[m]);
}


Code for running in R:

library(rstan)
set.seed(20200421)

df <- data.frame(
true = rexp(100, rate = 1/10),
excess = rexp(100, rate = 1/40),
bin = rbinom(100, size = 1, prob = 0.2)
)

df$y = df$true + df$bin * df$excess

# stan_code argument refers to the above.
fit <- rstan::stan(model_code = stan_code, data = list(M = nrow(df), y = df\$y))

summary(fit, pars = c("rate", "rate", "theta"))


You are correct that in general, summing Gamma-distributed variables is not possible. There are ways to work around this (I wrote a blog about approximating sums of random variables which has a link for the gamma case in the end). But I don’t think this would suit you that well as it is complex and doesn’t really work great.

Instead I think that as a first choice, you should treat this as a measurement error model (https://mc-stan.org/docs/2_23/stan-users-guide/bayesian-measurement-error-model.html) where you have an explicit Gamma-distributed parameter for the additional time. This tends to be computationally intensive, so it might be problematic for large datasets.

Does that make sense?

Hi @martinmodrak - Thank you for your response, I really appreciate it.

I think you’re right that the saddle point approximation is going to be a big task to implement, (given my moderate knowledge of stan!).

In terms of the measurement error model - thanks for reminding me of that section of the documentation, completely passed me by. Unfortunately, for the application I have I don’t feel comfortable a fixed assumption on the distribution of the excess timings.

What this did jog me into thinking about however is whether I could simplify life by assuming the errors are exponential, but still allowing a generic Gamma for the main timing element. This feels like an okay assumption for me - as I still capture the lag’ in the density (when \alpha > 1) which is what the exponential model missed.

Doing some calculations yesterday I was able to derive the density of \Gamma(\alpha, \beta) + \text{Exp}(\lambda) as

\frac{\lambda \beta^\alpha}{(\beta - \lambda)^\alpha} e^{-\lambda x} \frac{\gamma\big(\alpha, (\beta-\lambda)x\big)}{\Gamma(\alpha)}

which seems valid for \beta > \lambda. Since the last term is implemented in Stan as gamma_p I’ve been able to derive the log PDF, which I can then feed to the second argument in log_mix.

functions {
// Defines the log pdf of the sum of a Gamma and Exponential random variable;
// formula is valid for beta > lambda.
real gamma_exp_sum_lpdf(real x, real alpha, real beta, real lambda){
real lpdf;
lpdf = log(lambda) + alpha * log(beta) -
alpha * log(beta - lambda) +
log( gamma_p(alpha, (beta - lambda) * x)) -
lambda * x;

return lpdf;
}
}
`
1 Like