Specifying a model for inflated event timings

I have data representing the time it takes for an activity to complete; however, this data is noisy and sometimes the stopwatch fails, so that timings are inflated.

The model I am looking to fit assumes the observed data Y can be expressed as Y = T + BE, where T and E respectively denote the true and excess timing, and B is Bernoulli distributed with success probability \theta. That is, when (unobserved) B = 0 the timing is accurate, and when B = 1 the timing is inflated.

Since both T and E represent timings, it is reasonable to assume they are Gamma distributed: T \sim \text{Gamma}(\alpha_T, \beta_T) and E \sim \text{Gamma}(\alpha_E, \beta_E). So I am looking to estimate the parameters \alpha_T, \beta_T, \alpha_E, \beta_E, \theta.

To implement this in Stan, I was imagining it would be best to specify this as a mixture model, for which I need the densities of both T and T + E to pass to log_mix.
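
Writing f_T and f_E for the densities of T and E (and assuming T, E and B are mutually independent), marginalising out B gives the two-component mixture

f_Y(y) = (1 - \theta)\, f_T(y) + \theta\, (f_T \ast f_E)(y),

where \ast denotes convolution, so the second component requires the density of T + E.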

The challenge is that, to the best of my knowledge, there is no closed-form expression for the pdf of the sum of two independent Gamma random variables with different rate parameters.

I would greatly appreciate any thoughts on how to specify this model in Stan.

Thanks.


The code below sets out the special case where I make the simplifying assumption \alpha_T = \alpha_E = 1, so that both timing components are assumed to be Exponentially distributed with rates \lambda_T and \lambda_E. In this case the PDF of the sum T + E is known explicitly:

\frac{\lambda_T \lambda_E}{\lambda_E - \lambda_T} \left( e^{-\lambda_T x} - e^{-\lambda_E x} \right), \qquad \lambda_T \neq \lambda_E

data {
  int<lower=1> M;           // number of observations
  vector<lower=0>[M] y;     // observed timings
}

parameters {
  positive_ordered[2] rate;    // ordered rates: rate[1] for the excess, rate[2] for the true timing
  real<lower=0,upper=1> theta; // probability of incurring excess
}

transformed parameters {
  // log density of y[m] under the sum of the two exponentials;
  // log_diff_exp is used for numerical stability
  vector[M] log_exp_sum;

  for (m in 1:M)
    log_exp_sum[m] = log(rate[1]) + log(rate[2])
      - log(rate[2] - rate[1])
      + log_diff_exp(-y[m] * rate[1], -y[m] * rate[2]);
}

model {
  // priors
  rate[1] ~ exponential(1);
  rate[2] ~ exponential(1);
  theta ~ beta(1,2);
  
  // likelihood: mixture over the unobserved Bernoulli indicator B
  for (m in 1:M)
    target += log_mix(1 - theta,
                      exponential_lpdf(y[m] | rate[2]),  // accurate timing
                      log_exp_sum[m]);                   // inflated timing
}

Code for running in R:

library(rstan)
set.seed(20200421) 

df <- data.frame(
  true = rexp(100, rate = 1/10),
  excess = rexp(100, rate = 1/40),
  bin = rbinom(100, size = 1, prob = 0.2)
)

df$y = df$true + df$bin * df$excess

# stan_code argument refers to the above.
fit <- rstan::stan(model_code = stan_code, data = list(M = nrow(df), y = df$y))

summary(fit, pars = c("rate[1]", "rate[2]", "theta"))

You are correct that, in general, there is no closed-form density for the sum of Gamma-distributed variables. There are ways to work around this (I wrote a blog post about approximating sums of random variables, which links to the Gamma case at the end), but I don’t think that would suit you well: it is complex and doesn’t really work that great.

Instead, I think that as a first choice you should treat this as a measurement error model (https://mc-stan.org/docs/2_23/stan-users-guide/bayesian-measurement-error-model.html), where you have an explicit Gamma-distributed parameter for the additional time. This tends to be computationally intensive, so it might be problematic for large datasets.
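
Roughly, I am imagining something like the sketch below (untested, with placeholder priors; theta is the probability of an inflated timing as in your notation, and the hard cutoff when excess[m] >= y[m] may make sampling awkward):

data {
  int<lower=1> M;
  vector<lower=0>[M] y;
}

parameters {
  real<lower=0> alpha_T;
  real<lower=0> beta_T;
  real<lower=0> alpha_E;
  real<lower=0> beta_E;
  real<lower=0, upper=1> theta;
  vector<lower=0>[M] excess;   // latent excess timing, one per observation
}

model {
  // placeholder priors
  alpha_T ~ gamma(2, 0.5);
  beta_T ~ gamma(2, 0.5);
  alpha_E ~ gamma(2, 0.5);
  beta_E ~ gamma(2, 0.5);
  theta ~ beta(1, 2);

  excess ~ gamma(alpha_E, beta_E);

  for (m in 1:M) {
    // B = 0: the observation is the true timing;
    // B = 1: the true timing is y[m] - excess[m], which must remain positive.
    real lp_inflated = excess[m] < y[m]
                       ? gamma_lpdf(y[m] - excess[m] | alpha_T, beta_T)
                       : negative_infinity();
    target += log_mix(1 - theta,
                      gamma_lpdf(y[m] | alpha_T, beta_T),
                      lp_inflated);
  }
}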

Does that make sense?

Hi @martinmodrak - Thank you for your response, I really appreciate it.

I think you’re right that the saddlepoint approximation is going to be a big task to implement (given my moderate knowledge of Stan!).

In terms of the measurement error model - thanks for reminding me of that section of the documentation, it had completely passed me by. Unfortunately, for the application I have, I don’t feel comfortable making a fixed assumption about the distribution of the excess timings.

What this did jog me into thinking about, however, is whether I could simplify life by assuming the excess timings are Exponentially distributed, while still allowing a generic Gamma for the main timing component. This feels like an okay assumption for me, as I still capture the ‘lag’ in the density (when \alpha > 1), which is what the purely Exponential model missed.

Doing some calculations yesterday I was able to derive the density of \Gamma(\alpha, \beta) + \text{Exp}(\lambda) as

\frac{\lambda \beta^\alpha}{(\beta - \lambda)^\alpha} e^{-\lambda x} \frac{\gamma\big(\alpha, (\beta-\lambda)x\big)}{\Gamma(\alpha)}

which seems valid for \beta > \lambda. Since the last factor is implemented in Stan as gamma_p, I’ve been able to write down the log PDF, which I can then feed to log_mix as the second argument.
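
For reference, the calculation is just the convolution integral

f_{T+E}(x) = \int_0^x \frac{\beta^\alpha}{\Gamma(\alpha)}\, t^{\alpha - 1} e^{-\beta t}\, \lambda e^{-\lambda (x - t)}\, dt = \lambda e^{-\lambda x} \frac{\beta^\alpha}{(\beta - \lambda)^\alpha} \frac{\gamma\big(\alpha, (\beta - \lambda) x\big)}{\Gamma(\alpha)},

where the substitution u = (\beta - \lambda) t turns the remaining integral into the lower incomplete gamma function, and is also where the requirement \beta > \lambda comes from.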

Thanks again for your input.

functions {
  // Defines the log pdf of the sum of a Gamma and Exponential random variable;
  // formula is valid for beta > lambda.
  real gamma_exp_sum_lpdf(real x, real alpha, real beta, real lambda) {
    real lpdf;
    lpdf = log(lambda)
           + alpha * log(beta) - alpha * log(beta - lambda)
           + log(gamma_p(alpha, (beta - lambda) * x))
           - lambda * x;

    return lpdf;
  }
}
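
For completeness, the rest of the model I have in mind would sit below the functions block above and look roughly like the sketch below (untested; the priors are only placeholders, and the upper bound on \lambda enforces the \beta > \lambda condition at the cost of implicitly truncating its prior):

data {
  int<lower=1> M;               // number of observations
  vector<lower=0>[M] y;         // observed timings
}

parameters {
  real<lower=0> alpha_T;                 // shape of the true (Gamma) timing
  real<lower=0> beta_T;                  // rate of the true (Gamma) timing
  real<lower=0, upper=beta_T> lambda;    // rate of the Exponential excess (needs beta_T > lambda)
  real<lower=0, upper=1> theta;          // probability of incurring excess
}

model {
  // placeholder priors
  alpha_T ~ gamma(2, 0.5);
  beta_T ~ exponential(1);
  lambda ~ exponential(1);   // implicitly truncated by the upper bound above
  theta ~ beta(1, 2);

  // mixture over the unobserved Bernoulli indicator
  for (m in 1:M)
    target += log_mix(1 - theta,
                      gamma_lpdf(y[m] | alpha_T, beta_T),
                      gamma_exp_sum_lpdf(y[m] | alpha_T, beta_T, lambda));
}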