Error evaluating a 1D integral for Pareto/GGG Buy Till You Die marginal likelihood

I am currently implementing the Pareto/GGG Buy Till You Die mixture model described in Platzer and Reutterer (2016). One of the parameters \tau represents the customer lifetime and follows a tricky distribution conditional on whether or not the customer is “alive”, so I opted to marginalize it out of the likelihood described in Equation 8. Integrating out \tau by splitting the likelihood into a t_x < \tau < T case and a T < \tau < \infty case gives the following marginal likelihood function. Note that K contains all of the terms in the original likelihood function that are independent of \tau, S_{\Gamma} is the Gamma CCDF, T and t_x are observed data values at the individual customer level, and k, \lambda, and \mu are customer-level parameters.

K * [\int_{t_x}^{T} S_{\Gamma}(\tau - t_x | k, k\lambda) \mu e^{-\mu\tau}d\tau + e^{-\mu T}S_{\Gamma}(T - t_x | k, k\lambda)]

The integral \int_{t_x}^{T} S_{\Gamma}(\tau - t_x | k, k\lambda) \mu e^{-\mu\tau}d\tau in the above expression fails to evaluate with the integrate_1d function. The corresponding error is “Chain 1: Exception: integrate: error estimate of integral 2.08696e-07 exceeds the given relative tolerance times norm of integral (in ‘model9d5a44fe393d_ParetoGGG_cleaned_up_for_help’ at line 106)”.

I have tried the following to resolve this:

  • Changing the tolerance. This inevitably results in the same error-estimate exception with a different value.

  • Plotting the integrand and exploring how it changes with different values of k, \lambda, and \mu to understand potential discontinuities. There does not seem to be a predictable way to anticipate them.

  • Playing around with the xc argument for the integrand. I kept getting errors about values very close to 0. I suspect I may not be understanding how to use this correctly.

  • Changing hyperpriors to be more or less strict.

  • Bypassing integrate_1d by using Simpson’s rule to integrate, which is what the BTYDplus package authors did in their Gibbs sampling approach (a generic R sketch of that idea appears after this list). However, that results in “Chain 1: Rejecting initial value:
    Chain 1: Gradient evaluated at the initial value is not finite.
    Chain 1: Stan can’t start sampling from this initial value.”
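
For reference, here is a generic R sketch of the Simpson’s-rule idea (my own toy version with placeholder parameter values, not my Stan code or the BTYDplus code), which can be sanity-checked against R’s integrate():

# Integrand from the marginal likelihood: S_Gamma(tau - tx | k, k*lambda) * mu * exp(-mu*tau).
# The parameter values are placeholders for illustration only.
integrand <- function(tau, tx = 10, k = 1, lambda = 1, mu = 0.05) {
  pgamma(tau - tx, shape = k, rate = k * lambda, lower.tail = FALSE) * mu * exp(-mu * tau)
}

# Composite Simpson's rule on [a, b] with an even number of subintervals.
simpson <- function(f, a, b, n = 200) {
  if (n %% 2 == 1) n <- n + 1
  h <- (b - a) / n
  xs <- seq(a, b, length.out = n + 1)
  w <- c(1, rep(c(4, 2), length.out = n - 1), 1)  # weights 1, 4, 2, ..., 4, 1
  sum(w * f(xs)) * h / 3
}

simpson(integrand, a = 10, b = 100)
integrate(integrand, lower = 10, upper = 100)$value  # should agree closely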

Below is my Stan model, along with a self-contained R script I used to fit it:

// ParetoGGG-cleaned-up-for-help.stan
// Modified from Pareto/NBD in these Colab notebooks
// 
// https://colab.research.google.com/drive/1yxUkWZSBJWIoxBAi7MIxvkWKNgr0NXDH
// https://www.briancallander.com/posts/customer_lifetime_value/pareto-nbd.html

functions {
  
  real pggg_unscaled_palive_base_integrand(real x,
                                  real tx,
                                  real k,
                                  real lambda,
                                  real mu) {
    real unscaled_log_integrand = gamma_lccdf(x - tx | k, k*lambda) - mu*x;
    return exp(unscaled_log_integrand);
 }
 
  real pggg_palive_base_integrand(real x,
                                  real tx,
                                  real k,
                                  real lambda,
                                  real mu) {
    return mu * pggg_unscaled_palive_base_integrand(x, tx, k, lambda, mu);
  }
  
  // Function to integrate MUST have this format. 
  // https://mc-stan.org/docs/2_21/stan-users-guide/integrate-1d.html
  //
  // ...however it still results in a syntax error, so I'm using the solution here
  // https://discourse.mc-stan.org/t/calling-integrate-1d-reports-parsing-error-parser-expected/19611a
  real pggg_palive_integrand(real x, // independent variable to integrate over
                             real xc, // high-precision version of distance
                             real[] theta, // parameters (k, lambda, mu)
                             real[] x_r, // data (real) (tx)
                             int[] x_i) { // data (integer)
    // Integrate this function to get a numerator term in Equation 7
    // in Platzer and Reutterer (2016)
    
    real unscaled_sol;
    real lower_threshold;
    real upper_threshold;

    real k = theta[1];
    real lambda = theta[2];
    real mu = theta[3];

    real tx = x_r[1];
    real Tcal = x_r[2];
    
    unscaled_sol = pggg_unscaled_palive_base_integrand(x, tx, k, lambda, mu);
    return mu * unscaled_sol;
  }
}
data {
  int<lower=1> N; // number of customers
  vector<lower=0>[N] x; // number of recurring events
  vector<lower=0>[N] tx; // time of most recent transaction
  
 // Called T in the paper, but Tcal in BTYDplus.
 // (We avoid using T because T is also the Stan keyword for a truncated distribution)
  vector<lower=0>[N] Tcal;  // total time observed for each customer. 
  vector[N] litt; // Sum of logarithmic transaction times
}
parameters {
  // each customer gets THEIR OWN k, lambda, and mu
  
  vector<lower=0>[N] k;  // regularity
  vector<lower=0>[N] lambda;  // transaction rate
  vector<lower=0>[N] mu; // dropout rate
  
  // Parameters for distributions of rates k, lambda, and mu
  
  // k ~ Gamma(t, gamma)
  real<lower=0> t;  // NOT to be confused with tx or Tcal
  real<lower=0> gamma;
  
  // lambda ~ Gamma(r, alpha)
  real<lower=0> r;
  real<lower=0> alpha;
  
  // mu ~ Gamma(s, beta)
  real<lower=0> s;
  real<lower=0> beta;
}
model {
  
  // Hyperpriors for t, gamma, r, alpha, s, beta
  t ~ normal(2, 0.1);
  gamma ~ normal(0.5, 1);
  r ~ normal(0.5, 0.1);
  alpha ~ normal(10, 1);
  s ~ normal(0.5, 0.1);
  beta ~ normal(10, 1);

  // Draw k, lambda, and mu
  k ~ gamma(t, gamma);
  lambda ~ gamma(r, alpha);
  mu ~ gamma(s, beta);
  
  // Log-Likelihood of Pareto/GGG
  target += (k .* x .* log(k .* lambda)) - (x .* lgamma(k)) - (k .* lambda .* tx) + ((k - 1) .* litt);
  for (i in 1:N) {
     int x_i[0]; // placeholder to be passed into integrate_1d
     real left_term;

     left_term = integrate_1d(
       pggg_palive_integrand,
       tx[i],
       Tcal[i],
       {k[i], lambda[i], mu[i]},
       {tx[i], Tcal[i]},
        x_i,
        1e-8
     );
     
     target += log_sum_exp(
         log(left_term),
         gamma_lccdf(Tcal[i] - tx[i] | k[i], k[i]*lambda[i]) - mu[i] * Tcal[i]
      );
  }
}
library(dplyr)
library(rstan)
library(BTYDplus)

data(groceryElog)
groceryCBS <- elog2cbs(groceryElog, T.cal = "2006-12-31")

N <- nrow(groceryCBS)
x <- groceryCBS$x
tx <- groceryCBS$t.x
Tcal <- groceryCBS$T.cal
litt <- groceryCBS$litt

# Get initial values by fitting a Pareto/NBD model
pnbd_params <- BTYD::pnbd.EstimateParameters(groceryCBS[, c("x", "t.x", "T.cal")])
names(pnbd_params) <- c("r", "alpha", "s", "beta")

generate_init_fun <- function(pnbd_params, cbs) {
  function(chain_id) {
    l <- list(t = 1,
              gamma = 1,
              r = unname(pnbd_params["r"]),
              alpha = unname(pnbd_params["alpha"]),
              s = unname(pnbd_params["s"]),
              beta = unname(pnbd_params["beta"]))
    l$k <- array(rep(1, nrow(cbs)))
    l$lambda <- array(rgamma(nrow(cbs), l$r, l$alpha))
    l$mu <- array(rgamma(nrow(cbs), l$s, l$beta))

    l
  }
}

init_fun <- generate_init_fun(pnbd_params, groceryCBS)
model <- stan(file = "ParetoGGG-cleaned-up-for-help.stan",
              chains = 1,
              cores = 1,
              iter = 1,
              seed = 234234243,
              init = init_fun,
              data = list(
                N = N,
                x = x,
                tx = tx,
                Tcal = Tcal,
                litt = litt
              ))

Just glancing at the referenced paper, specifically equation (11), is it possible that this line in pggg_unscaled_palive_base_integrand:

real unscaled_log_integrand = gamma_lccdf(x - tx | k, k*lambda) - mu*x;

should instead read:

real unscaled_log_integrand = gamma_lccdf(x - tx | k, k*lambda) - mu*(x - tx);

I suspect part of the problem with the integration as written is that exp(unscaled_log_integrand) can be very small for large values of x, causing underflow and numerical precision issues.

I don’t think so, because the BTYDplus implementation uses exp(-mu*x) when the same integral appears in the expression for calculating P(alive) (which I intend to put in generated quantities once I get everything else working). However, I acknowledge that it is totally possible that I messed up the marginal likelihood.

If it helps, here’s the math I did to arrive at that particular integral. Note that \tau \sim Exp(\mu), however there obviously has to be some truncation here because the lifetime \tau must be after the customer’s last transaction time t_x.

L(k, \lambda, \mu) = \int_{t_x}^{T} L(k, \lambda, \mu | \tau)L(\tau) d\tau + \int_{T}^{\infty} L(k, \lambda, \mu | \tau)L(\tau) d\tau
= K * [\int_{t_x}^{T} S_{\Gamma}(\tau - t_x | k, k\lambda) * \mu e^{-\mu\tau}d\tau + \int_{T}^\infty S_{\Gamma}(T - t_x | k, k\lambda) * \mu e^{-\mu\tau}d\tau].

The second term has a closed form, and the first term is what’s above.
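
For completeness, the closed form follows because S_{\Gamma}(T - t_x | k, k\lambda) does not depend on \tau:

\int_{T}^{\infty} S_{\Gamma}(T - t_x | k, k\lambda) \mu e^{-\mu\tau}d\tau = S_{\Gamma}(T - t_x | k, k\lambda)\int_{T}^{\infty} \mu e^{-\mu\tau}d\tau = e^{-\mu T}S_{\Gamma}(T - t_x | k, k\lambda),

which is the second term inside the brackets in my first post.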

Sounds good, just thought I’d ask in case it was something simple like that (I’m not too familiar with these types of models).

Generally I think the issue is with that mu * x term, though. In the example dataset, there are some large values of x which cause the whole integrand to be very small, which I suspect leads to numerical precision issues, especially when mu is not small enough to counteract the size of x.

Perhaps there is some way to constrain the maximum size of mu. I don’t fully understand the model, so I’m not sure how to interpret mu (which would help in figuring out a reasonable upper bound), but say we constrain it to be between 0 and 1 / Tcal:

vector<lower=0, upper=1 ./ Tcal>[N] mu

Running this yields fewer integration errors, at least on my computer, suggesting the issue may indeed be with underflow.
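
As a rough illustration (arbitrary values, not taken from the data), the integrand spans an enormous range over [tx, Tcal] when mu is not small:

# S_Gamma(tau - tx | k, k*lambda) * mu * exp(-mu*tau) with tx = 10, k = lambda = mu = 1
pgamma(0.1, shape = 1, rate = 1, lower.tail = FALSE) * 1 * exp(-1 * 10.1)  # near tau = tx:   ~3.7e-05
pgamma(90,  shape = 1, rate = 1, lower.tail = FALSE) * 1 * exp(-1 * 100)   # near tau = Tcal: ~3.0e-83

With mu capped near 1 / Tcal the decay across the interval is far milder, which lines up with the constrained version producing fewer errors.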

The reason informative priors wouldn’t necessarily help here is that Stan might still initialize mu at too large a value, causing errors before any sampling can take place. You might be able to get around this by supplying Stan with better initial values for mu, in addition to more informative priors that keep the MCMC algorithm from straying into regions of parameter space that cause errors.

(I see now that there is indeed some code in your R example to get better initial values, it looks like they aren’t being fed into Stan though.)


Thanks! I tried that syntax on my machine but it isn’t accepted, so I can’t test it. Is that a cmdstan thing? Also I think you’re confusing the x in the dataset (which is the total number of transactions) with the x in the integral, which is just an integration variable I use instead of writing tau. Large values of x causing underflow would have to come from values passed in via integrate_1d, not from the data.

Also, in a BTYD model, \mu captures how likely a customer is to churn in a given time period. I suspect there’s an interpretation in terms of a hazard function in survival analysis but I haven’t gone that deep into the math.

Thanks for the catch on not passing in init_fun. I’ve edited my original post to include that, but it still leads to the same error. In particular, “Chain 1: Exception: integrate: error estimate of integral 2.326e-09 exceeds the given relative tolerance times norm of integral (in ‘model1645cb765790_ParetoGGG_cleaned_up_for_help’ at line 106)”.

Yes, my mistake about x. Since the integral goes from tx[i] to Tcal[i], the important thing then would be large values of Tcal[i].

And yes it’s possible that the vector<lower=0, upper=1 ./ Tcal>[N] mu syntax is from a newer version of Stan than what you’re working with – I’m using Stan version 2.30.1.

At this point I would recommend experimenting in a more focused way with Stan’s integrator to try to figure out when the error arises. I would do this by having a minimal Stan example that tries to integrate the function using fixed inputs:

functions {
  real pggg_unscaled_palive_base_integrand(real x,
                                  real tx,
                                  real xc,
                                  real k,
                                  real lambda,
                                  real mu) {
      if(xc < 0) return 0; // xc may be negative due to numerical precision errors
      return exp(gamma_lccdf(xc | k, k*lambda) - mu * x);
    }
 
 real pggg_palive_integrand(real x, // independent variable to integrate over
                             real xc, // high-precision version of distance
                             real[] theta, // parameters (k, lambda, mu)
                             real[] x_r, // data (real) (tx)
                             int[] x_i) { // data (integer)
    // Integrate this function to get a numerator term in Equation 7
    // in Platzer and Reutterer (2016)
    
    real unscaled_sol;
    real lower_threshold;
    real upper_threshold;

    real k = theta[1];
    real lambda = theta[2];
    real mu = theta[3];

    real tx = x_r[1];
    real Tcal = x_r[2];
    
    unscaled_sol = pggg_unscaled_palive_base_integrand(x, tx, xc, k, lambda, mu);
    return unscaled_sol;
  }
}
data {
  real<lower=0> tx;
  real<lower=0> Tcal;  // total time observed for each customer. 
  
  real<lower=0> k;
  real<lower=0> lambda;
  real<lower=0> mu;
  
  real<lower=0> right_limit;
  real<lower=0> rel_tol;
}
model {
  
}
generated quantities {
  int x_i[0]; // placeholder to be passed into integrate_1d
  real left_term = integrate_1d(
       pggg_palive_integrand,
       tx,
       right_limit,
       {k, lambda, mu},
       {tx, right_limit},
        x_i,
        rel_tol
     );
  print(left_term);    
}

Then you can call it from R using fixed_param = TRUE (I use cmdstanr, but this should be easily adaptable to rstan):

library(dplyr)
library(rstan)
library(BTYDplus)

data(groceryElog)
groceryCBS <- elog2cbs(groceryElog, T.cal = "2006-12-31")

N <- nrow(groceryCBS)
x <- groceryCBS$x
tx <- groceryCBS$t.x
Tcal <- groceryCBS$T.cal
litt <- groceryCBS$litt

# Get initial values by fitting a Pareto/NBD model
pnbd_params <- BTYD::pnbd.EstimateParameters(groceryCBS[, c("x", "t.x", "T.cal")])
names(pnbd_params) <- c("r", "alpha", "s", "beta")

model <- cmdstanr::cmdstan_model("pareto_test.stan")

model$sample(
  iter_sampling = 5,
  data = list(
    x = x[1],
    tx = tx[1],
    Tcal = Tcal[1],
    right_limit = Tcal[1],
    rel_tol = 0.01,
    mu = 0.05,
    k = 1,
    lambda = 1
  ),
  fixed_param = TRUE,
  chains = 1
)

You could also try simplifying the integrand to get a feel for the Stan integrator. I found for example that integrating exp(-mu * x) can lead to the same error message if the relative tolerance is too small, and the right limit of integration is too large.

Just from fooling around for a few minutes, it seems like there is a relationship between mu and the relative tolerance in terms of the error being produced: the larger the mu, the larger the relative tolerance needs to be. Perhaps bounding mu in some way, and making the relative tolerance much larger, might help solve the problem.

It might be a good idea to run the integrator on many inputs and compare it to the output of R’s integrate function for the same inputs as a way to ensure you end up with the right results.
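
A minimal version of that comparison, using the same fixed values as the script above, might look like this (the reference value comes from R’s integrate() with a tight tolerance):

# Same integrand as in the Stan model, written directly in R
integrand <- function(tau, tx, k, lambda, mu) {
  pgamma(tau - tx, shape = k, rate = k * lambda, lower.tail = FALSE) * mu * exp(-mu * tau)
}

reference <- integrate(integrand, lower = tx[1], upper = Tcal[1],
                       tx = tx[1], k = 1, lambda = 1, mu = 0.05,
                       rel.tol = 1e-10)$value
reference  # compare against the left_term printed by the test model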


Got it, thanks! I will play around with this when I have some time. A few quick questions:

  • What’s the reasoning behind using xc instead of x - tx directly in pggg_unscaled_palive_base_integrand? The documentation seems to imply that you should use it only in situations where x is “close enough” to an integration bound.

  • If this doesn’t work, do you think reparameterizing the integral itself could fix numerical issues? For example, substituting u = tau - tx and du = dtau so that the bounds of the integral are 0 and Tcal - tx, respectively (written out just below). I also thought about subtracting the max value of the integrand and adding it back in, but getting that value requires numerically solving for y in log(mu) + gamma_lccdf(y - tx | k, k*lambda) - gamma_lpdf(y - tx | k, k*lambda) = 0, since it doesn’t have a closed-form solution. I’m not super familiar with numerical integration, so I figure there may be tricks you’re aware of that I wouldn’t have thought of.
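
Written out (assuming I’ve done the algebra right), that substitution gives

\int_{t_x}^{T} S_{\Gamma}(\tau - t_x | k, k\lambda) \mu e^{-\mu\tau}d\tau = e^{-\mu t_x}\int_{0}^{T - t_x} S_{\Gamma}(u | k, k\lambda) \mu e^{-\mu u}du,

so the e^{-\mu t_x} factor could be pulled outside the integral and handled on the log scale.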

You’re welcome, I hope you can figure something out! To respond to your questions:

  • Yes, it’s possible that using x - tx directly would be better, or that it would be better to test whether x is close to tx and only then use xc. I’m not an expert on how this works, so trying a couple of things might be a good idea.
  • It’s possible reparameterizing might help, although I think the issue is just that the integrand gets so small at the rightmost bound. Another idea I had was to truncate the right bound to a more reasonable value (that is, figure out some rule for when the integrand is very close to zero and stops affecting the integral, and then only integrate to that new bound; a rough sketch of that idea is below).
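
Here is a rough sketch of that truncation rule (my own, relying on the gamma CCDF factor being at most 1, so the exponential factor bounds the tail):

# Since S_Gamma(tau - tx | k, k*lambda) <= 1, the integrand at tau = tx + d is at most
# mu * exp(-mu * (tx + d)), i.e. at most exp(-mu * d) times its value at tau = tx.
# Stop integrating once that factor is negligible.
effective_upper <- function(tx, Tcal, mu, decay = 30) {
  min(Tcal, tx + decay / mu)
}
effective_upper(tx = 10, Tcal = 100, mu = 1)     # 40: truncate well before Tcal
effective_upper(tx = 10, Tcal = 100, mu = 0.05)  # 100: no truncation needed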

First of all, I see that you are staying true to the model by using Gamma heterogeneity.

In my own experience with Pareto/NBD in Stan on the famous CDNOW dataset, I cannot get convergence with Gamma heterogeneity no matter how I reparameterize.

I did get great convergence with lognormal instead of Gamma. I posted my code in this forum but I can’t find it now as I am on my phone.

Try to use lognormal and see if that helps.


Thank you! Lognormal for which parameters? Do you have any resources on such an implementation? I’ve seen a joint lognormal distribution on \lambda and \mu in Abe (2009) for Pareto/NBD with covariates, but I’m not aware of any lognormal BTYD models that allow for a customer-level regularity parameter.

With regard to Pareto/NBD, I was able to fit my own implementation just fine on the groceryCBS, but got “Warning: 4 of 4 chains had an E-BFMI less than 0.2.” when fitting the same implementation to CDNOW. Both times I got Exception: gamma_lpdf: Random variable[576] is inf, but must be positive finite!, but only during the first few warmup iterations.

For reference, this is my implementation of Pareto/NBD.

// Thanks to
// https://colab.research.google.com/drive/1yxUkWZSBJWIoxBAi7MIxvkWKNgr0NXDH
// https://www.briancallander.com/posts/customer_lifetime_value/pareto-nbd.html

data {
  int<lower=1> N;
  vector<lower=0>[N] x;
  vector<lower=0>[N] Tcal;
  vector<lower=0>[N] tx;
}
parameters {
  real<lower=0> r;
  real<lower=0> alpha;
  real<lower=0> s;
  real<lower=0> beta;
  vector<lower=0>[N] mu;
  vector<lower=0>[N] lambda;
}
model {

  // set 2nd-stage priors
  r ~ normal(0.5, 0.1);
  alpha ~ normal(10, 1);
  s ~ normal(0.5, 0.1);
  beta ~ normal(10, 1);

  // set 1st-stage priors
  lambda ~ gamma(r, alpha);
  mu ~ gamma(s, beta);

  // likelihood
  target += x .* log(lambda) - log(lambda + mu);
  for (i in 1:N) {
    target += log_sum_exp(
      log(lambda[i]) - (lambda[i] + mu[i]) .* Tcal[i],
      log(mu[i]) - (lambda[i] + mu[i]) .* tx[i]
    );
  }
}
generated quantities {
  // See Equation 9 from Ma and Liu (2007)
  // For some reason, setting <lower=0, upper=1> for p_alive results in an error that returns NA for all p_alive, so we pick an upper bound slightly greater than 1
  vector<lower=0, upper=1.0000001>[N] p_alive;
  for (i in 1:N) {
    // Taking log gives the expression below
    real log_p_alive = log(lambda[i] + mu[i]) - Tcal[i]*(lambda[i] + mu[i]) - log_sum_exp(
      log(mu[i]) - tx[i]*(lambda[i] + mu[i]),
      log(lambda[i]) - Tcal[i]*(lambda[i] + mu[i])
      );
    p_alive[i] = exp(log_p_alive);
  }
}
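
For reference, the quantity whose log is taken inside the loop above is (reading it off the code; double-check it against Equation 9 of Ma and Liu (2007)):

P(alive) = \frac{(\lambda + \mu)e^{-(\lambda + \mu)T_{cal}}}{\mu e^{-(\lambda + \mu)t_x} + \lambda e^{-(\lambda + \mu)T_{cal}}}.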

This is my code for the Pareto/NBD: Problem with initialization - #6 by sonicking

The problem with your Pareto/NBD code above is that you are using pretty informative priors for r, alpha, s, and beta. But they are parameters of interest themselves and should come from less-informative priors. When you do, you will see that Stan is unable to achieve convergence.

To answer your question, I would consider using a lognormal distribution for lambda, mu, and perhaps k, too.

My biggest issue with your approach for the Pareto/GGG is that it does not appear that one needs to evaluate that integral in each iteration. It should be performed only once, when you are trying to compute P(alive).


Thanks! That’s a good point about the priors being too informative.

My biggest issue with your approach for the Pareto/GGG is that it does not appear that one needs to evaluate that integral in each iteration. It should be performed only once, when you are trying to compute P(alive).

The integral is part of the marginal likelihood though. It’s not just used for calculating P(alive). In this comment I go through the math to demonstrate that.

Isn’t the paper’s Equation 8 the marginal likelihood at the individual level? I don’t see the integral there.

Maybe my terminology or notation was confusing, or I did this totally wrong.

Yes, Equation 8 contains the individual likelihood. However, it is a function of \tau. The marginal likelihood I speak of is no longer dependent on \tau. In my previous comment, I linked to where I obtained that \tau-independent marginal likelihood by calculating:

\int_{t_x}^{\infty} P(data | k, \mu, \lambda, \tau)*P(\tau | k, \mu, \lambda) d\tau

= \int_{t_x}^{\infty} Equation8Likelihood * \mu e^{-\mu\tau} d\tau

= K*\int_{t_x}^{\infty}S_{\Gamma}(\min(T, \tau) - t_{x} | k, k\lambda) * \mu e^{-\mu\tau} d\tau,

where K = \frac{(k\lambda)^{kx}}{\Gamma(k)^x}e^{-k\lambda t_x} \left(\prod_{j=1}^x (\Delta t_j)^{k-1}\right) is the part of Equation 8 that is independent of \tau. Then, I split the integral into a t_x < \tau < T case and a T < \tau < \infty case. In the former case, we have \min(T, \tau) = \tau, which results in the integral I’m trying to solve. In the latter case, in which \min(T, \tau) = T, we have a closed-form solution.

I cannot write LaTeX, so let me try my best.

I understand your point that Eqn (8) is conditional on tau, and thus you need to remove the conditioning. There are two scenarios to consider:

  1. tau is between t_x and T, in which case the probability is the integral of mu * exp(-mu*tau), with the integration limits running from t_x to T. I think you can solve this integral in closed form (worked out below).
  2. tau is > T, in which case the probability of that is exp(-mu*T). There is no integral.
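
(Worked out, the scenario-1 integral is exp(-mu*t_x) - exp(-mu*T); the scenario-2 probability exp(-mu*T) is the same exponential integral taken from T to infinity.)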

My logic is based on equations (12)-(14) of the Pareto/NBD derivation by Fader and Hardie: https://brucehardie.com/notes/009/pareto_nbd_derivations_2005-11-05.pdf

Alternatively, you can consider sampling tau explicitly and then writing out the corresponding likelihood function now that tau is “known”.

Ah, got it, thanks. I’ll read through that paper when I have more time.

As for sampling tau explicitly, I tried that but every model diagnostic was throwing enormous red flags.

// Pareto/GGG attempt (including estimating tau)
// Modified from Pareto/NBD in these Colab notebooks
// 
// https://colab.research.google.com/drive/1yxUkWZSBJWIoxBAi7MIxvkWKNgr0NXDH
// https://www.briancallander.com/posts/customer_lifetime_value/pareto-nbd.html

data {
  int<lower=1> N; // number of customers
  vector<lower=0>[N] x; // number of recurring events
  vector<lower=0>[N] tx; // time of most recent transaction
  
 // Called T in the paper, but Tcal in BTYDplus.
 // (We avoid using T because T is also the Stan keyword for a truncated distribution)
  vector<lower=0>[N] Tcal;  // total time observed for each customer. 
  vector[N] litt; // Sum of logarithmic transaction times
}
parameters {
  // each customer gets THEIR OWN k, lambda, and mu

  vector<lower=0>[N] k;  // regularity
  vector<lower=0>[N] lambda;  // transaction rate
  vector<lower=0>[N] mu; // dropout rate
  
  vector<lower=0>[N] tau_raw;  // UNTRUNCATED customer lifetime

  // Parameters for distributions of rates k, lambda, and mu
  
  // k ~ Gamma(t, gamma)
  real<lower=0> t;  // NOT to be confused with tx or Tcal
  real<lower=0> gamma;
  
  // lambda ~ Gamma(r, alpha)
  real<lower=0> r;
  real<lower=0> alpha;
  
  // mu ~ Gamma(s, beta)
  real<lower=0> s;
  real<lower=0> beta;
}
transformed parameters {
  // https://mc-stan.org/docs/stan-users-guide/vectors-with-varying-bounds.html
  // tau must be greater than tx
  vector[N] tau = tx + tau_raw;
}
model {
  // Hyperpriors for t, gamma, r, alpha, s, beta
  t ~ exponential(0.5);
  gamma ~ exponential(0.005);
  r ~ exponential(0.005);
  alpha ~ exponential(0.005);
  s ~ exponential(0.005);
  beta ~ exponential(0.005);

  // Draw k, lambda, and mu
  k ~ gamma(t, gamma);
  lambda ~ gamma(r, alpha);
  mu ~ gamma(s, beta);
  
  tau ~ exponential(mu);
  
  // Log-Likelihood of Pareto/GGG
  target += (k .* x .* log(k .* lambda)) - (x .* lgamma(k)) - (k .* lambda .* tx) + ((k - 1) .* litt);
  for (i in 1:N) {
    if (tau[i] < Tcal[i]) {
      target += gamma_lccdf(tau[i] - tx[i] | k[i], k[i]*lambda[i]);
    } else {
      target += gamma_lccdf(Tcal[i] - tx[i] | k[i], k[i]*lambda[i]);
    }
  }
}

Try tau_raw ~ exponential(mu) instead of tau ~ exponential(mu).

I ran this last night on the groceryElog dataset with the below script. Diagnostics failed yet again. I got Warning: 6000 of 6000 (100.0%) transitions hit the maximum treedepth limit of 10. For all parameters, rhat >> 1, the effective sample size is less than 1% of the total sample size, and the Monte Carlo standard error is 0.6-0.7x the posterior standard deviation. The chains also show very high autocorrelation. Additionally, log posteriors differ quite a bit from chain to chain.

library(dplyr)
library(rstan)
library(BTYDplus)

data(groceryElog)
groceryCBS <- elog2cbs(groceryElog, T.cal = "2006-12-31")
cbs <- groceryCBS

N <- nrow(cbs)
x <- cbs$x
tx <- cbs$t.x
Tcal <- cbs$T.cal
litt <- cbs$litt

# Get initial values by fitting a Pareto/NBD model
pnbd_params <- BTYD::pnbd.EstimateParameters(cbs[, c("x", "t.x", "T.cal")])
names(pnbd_params) <- c("r", "alpha", "s", "beta")

generate_init_fun <- function(pnbd_params, cbs) {
  function(chain_id) {
    l <- list(t = 1,
              gamma = 1,
              r = unname(pnbd_params["r"]),
              alpha = unname(pnbd_params["alpha"]),
              s = unname(pnbd_params["s"]),
              beta = unname(pnbd_params["beta"]))
    
    mean_lambda <- mean(rgamma(10000, l$r, l$alpha))
    mean_mu <- mean(rgamma(10000, l$s, l$beta))
    mean_tau_raw <- mean(rexp(10000, mean_mu))
    
    l$k <- rep(1, nrow(cbs))
    l$lambda <- rep(mean_lambda, nrow(cbs))
    l$mu <- rep(mean_mu, nrow(cbs))
    l$tau_raw <- rep(mean_tau_raw, nrow(cbs))  # possibly change this
    
    l
  }
}

init_fun <- generate_init_fun(pnbd_params, cbs)
mod <- cmdstanr::cmdstan_model("ParetoGGG-keep-tau.stan")
fit <- mod$sample(
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 1500,
  seed = 234234243,
  init = init_fun,
  data = list(
    N = N,
    x = x,
    tx = tx,
    Tcal = Tcal,
    litt = litt
  ))

Hmm. I never figured it out. I understand why you had this line in your code:
vector<lower=0>[N] tau_raw; // UNTRUNCATED customer lifetime

But I am not sure whether this line is correct:

// tau must be greater than tx
  vector[N] tau = tx + tau_raw;

I think what you really want is an exponential distribution for tau that is truncated from below. Namely:

tau ~ exponential(mu) T[tx, ];
// I assume truncation in Stan can handle vector bounds, need to confirm

But this is one of the remaining issues with BTYD models that I am not 100% certain about, especially for a Stan implementation. This is why I would integrate out tau if possible.

Stan’s truncation syntax doesn’t support vector bounds, which is why I took the approach described here.
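
(Incidentally, if I’m doing the algebra right, the tau_raw ~ exponential(mu) suggestion above should be exactly equivalent to the truncated version, since truncating an exponential from below at t_x just shifts it:

p(\tau | \tau > t_x) = \frac{\mu e^{-\mu\tau}}{e^{-\mu t_x}} = \mu e^{-\mu(\tau - t_x)} for \tau > t_x,

which is the same as saying tau_raw = \tau - t_x \sim Exponential(\mu) and then setting tau = tx + tau_raw.)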