Compound Poisson–gamma distribution

Hi,

I have zero-inflated continuous response data that I’d like to model using the compound Poisson–gamma distribution, which is a special case of the Tweedie distribution (https://en.wikipedia.org/wiki/Tweedie_distribution).

There’s an old forum post (https://groups.google.com/forum/#!topic/stan-users/XyFv7ERq0oA) on this topic, as well as a closed GitHub issue (https://github.com/stan-dev/math/issues/417) and linked Gist (https://gist.github.com/MatsuuraKentaro/952b3301686c10adcb13).

I am wondering if someone could clarify if and how one can fit such models with Stan.

Thanks!
Dan.

I don’t know much about the Tweedie distribution and I haven’t read the links you posted, but I recently did a model for zero inflated continuous data from a chemical gas sensor with a detection limit. I’ll post some of that here.

I used the inverse logit function to map partial pressure being measured to the probability of the sensor not detecting any amount of the chemical. The function looked like this:

prob_{non-detect} = logit^{-1}\left(log(2.0/1.0e6) * {\hat P \over P_{det.lim.}} + log(1.0e6)\right)

Where \hat P is the partial pressure being measured and P_{det.lim.} is the partial pressure at which the sensor is twice as likely to measure zero than it is to make a non-zero measurement. The log(1.0e6) is the log-odds of making a zero measurement when \hat P = 0; I set the odds to 1 million to make the logistic curve steep.

Here are some plots for \hat P = 25, P_{det.lim.} = 25, and \sigma = 10, where \sigma is the standard deviation of the measurement error.

Here’s how I implemented the data likelihood in Stan. Notice how the zeroes in the data increment the log-posterior with the log probability of a non-detect, while the non-zeroes in the data increment the log-posterior with the log probability of a detection using log1m_inv_logit(...), as well as the log density of the noisy observation. It’s important to use the log probability in both places.

for (n in 1:N) for(m in 1:3) { // Trial, chemical species
  
  if (P_obs[n, m] == 0) { 
    // Zero measurement process
    target += 
      // Log-probability of zero measurement
      log_inv_logit( 
        log(2.0/1.0e6) * P_hat[n,m]/det_lim + 
        log(1.0e6) // Log-odds of a zero measurement at P_obs == 0
      );
    
  } else {                
    // Non-zero measurement process
    target += 
      // Log-probability of non-zero measurement
      log1m_inv_logit(
        log(2.0/1.0e6) * P_hat[n,m]/det_lim + 
        log(1.0e6) // Log-odds of a zero measurement at P_obs == 0
      );
      
    P_obs[n, m] ~ 
      // Noisy observation of a positive quantity 
      gamma(
        square(P_hat[n, m] / sigma), // Shape
        P_hat[n, m] / square(sigma)  // Inverse scale
      );
       
  }
  
}

And here’s how I implemented the Generated Quantities block:

real P_rep[N, 3];
int <lower=0, upper=1> non_detect;

for (n in 1:N) for (m in 1:3) { // Trial, chemical species
  // Probabilistically decide which measurement process to sample
  non_detect = 
    bernoulli_logit_rng(
      log(2.0/1.0e6) * P_hat[n,m]/det_lim + 
      log(1.0e6) // Log-odds of a zero measurement at P_obs == 0
     );
  
  if (non_detect == 1) {
    // Zero measurement process
    P_rep[n, m] = 0.0;
    
  } else {
    // Non-zero measurement process
    P_rep[n, m] = 
      gamma_rng(
          square(P_hat[n, m] / sigma), // Shape
          P_hat[n, m] / square(sigma)  // Inverse scale
      );
    
  }
  
}
2 Likes

Hi! I was trying to model a Tweedie distribution in Stan and came across your post (and the links you posted). Was this issue solved? There seems to be something in the stan users mailing list but I don’t have access to it. Thank you!

It’s for sure possible.

The gist you’ve posted properly implements the likelihood of the response variable with no additional covariates. What remains to be specified in code is the link function, or relationship between predictors and the likelihood. I took a look at the SAS documentation but it was vague (mentioned the likelihood but no mention of how covariates are related to the likelihood’s parameters).

One could specify a linear link to model the relationship between tweeties distribution’s parameters of interest, for example the mean, dispersion or power parameters.

I think designing the model depends more on what you would like to know. If there’s a more concrete question we have, I could probably code something up.