Problem with initialization

I am very new to probabilistic modeling in general and was trying to run the Pareto/NBD model from this post (https://github.com/datascienceinc/oreilly-intro-to-predictive-clv/blob/master/oreilly-an-intro-to-predictive-clv-tutorial.ipynb), but I have been running into initialization issues with my data set. Any help is greatly appreciated.

This is the error I am running into; the model is below.

Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.
Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.

Initialization between (-1, 1) failed after 100 attempts.
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
data {
  int<lower=0> n_cust; // number of customers
  vector<lower=0>[n_cust] x;
  vector<lower=0>[n_cust] tx;
  vector<lower=0>[n_cust] T;
}

parameters {
  // vectors of lambda and mu for each customer.
  // Here I apply limits between 0 and 1 for each
  // parameter. A value of lambda or mu > 1.0 is unphysical
  // since you don't have enough time resolution to go below
  // 1 time unit.
  vector<lower=0,upper=1.0>[n_cust] lambda;
  vector<lower=0,upper=1.0>[n_cust] mu;

  // parameters of the prior distributions (r, alpha, s, beta)
  // for lambda and mu
  real<lower=0> r;
  real<lower=0> alpha;
  real<lower=0> s;
  real<lower=0> beta;
}

model {
  // temporary variables:
  vector[n_cust] like1; // likelihood
  vector[n_cust] like2; // likelihood

  // Establishing hyperpriors on parameters r, alpha, s, and beta.
  r ~ normal(0.5, 0.1);
  alpha ~ normal(10, 1);
  s ~ normal(0.5, 0.1);
  beta ~ normal(10, 1);

  // Establishing the prior distributions for lambda and mu:
  lambda ~ gamma(r, alpha);
  mu ~ gamma(s, beta);

  // The likelihood of the Pareto/NBD model:
  like1 = x .* log(lambda) + log(mu) - log(mu + lambda) - tx .* (mu + lambda);
  like2 = (x + 1) .* log(lambda) - log(mu + lambda) - T .* (lambda + mu);

  // Here we increment the log probability density (target) accordingly.
  target += log(exp(like1) + exp(like2));
}

The error you’re getting says that the target log density evaluates to negative infinity, most likely because something is taking the log of zero.

These things can sometimes be annoying to track down, but the best approach is just to print intermediate values and look.

Probably like1 or like2 contains negative infinity. Just add a

print(like1, like2);

after you assign them values and have a look. Give it a go and see if you find something suspicious.
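In this model, that means putting the print right after the two likelihood assignments; a minimal sketch (the label strings are just for readability):

  like1 = x .* log(lambda) + log(mu) - log(mu + lambda) - tx .* (mu + lambda);
  like2 = (x + 1) .* log(lambda) - log(mu + lambda) - T .* (lambda + mu);
  print("like1 = ", like1, ", like2 = ", like2); // printed on every log density evaluation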

The Stan program in the notebook you are copying from is not that great, although it does initialize for me in rstan after I made some changes. Not sure if it would have initialized without the changes, but I didn’t want to find out.

data {
  int<lower=0> n_cust; //number of customers 
  vector<lower=0>[n_cust] x; 
  vector<lower=0>[n_cust] tx; 
  vector<lower=0>[n_cust] T; 
}

parameters {
  // vectors of lambda and mu for each customer.
  // Here I apply limits between 0 and 1 for each
  // parameter. A value of lambda or mu > 1.0 is unphysical
  // since you don't have enough time resolution to go below
  // 1 time unit.
  vector<lower=0,upper=1.0>[n_cust] lambda;
  vector<lower=0,upper=1.0>[n_cust] mu;

  // parameters of the prior distributions (r, alpha, s, beta)
  // for lambda and mu
  real<lower=0> r;
  real<lower=0> alpha;
  real<lower=0> s;
  real<lower=0> beta;
}

model {
  // temporary variables:
  vector[n_cust] log_lambda = log(lambda);
  vector[n_cust] mu_lambda = mu + lambda;
  vector[n_cust] log_mu_lambda = log(mu_lambda);
  vector[n_cust] like1 = x .* log_lambda + log(mu) - log_mu_lambda - tx .* mu_lambda;
  vector[n_cust] like2 = (x + 1) .* log_lambda - log_mu_lambda - T .* mu_lambda;

  // Here we increment the log probability density (target) accordingly.
  for (n in 1:n_cust) target += log_sum_exp(like1[n], like2[n]);

  // Establishing hyperpriors on parameters r, alpha, s, and beta.
  r ~ normal(0.5, 0.1);
  alpha ~ normal(10, 1);
  s ~ normal(0.5, 0.1);
  beta ~ normal(10, 1);

  // Establishing the prior distributions for lambda and mu:
  lambda ~ gamma(r, alpha);
  mu ~ gamma(s, beta);
}

Thanks to both of you for your help. I was originally printing the temporary values and removing the records in the dataset that were causing the problem, and with that I was able to run the original code successfully.

With the change that @bgoodri suggested, it works for all records. Can you help clarify why it works when we increment the log probability one record at a time rather than for the entire dataset, as in the original code?

It is not the record-at-a-time loop that matters; it is log_sum_exp. log_sum_exp(a, b) is numerically stable, whereas log(exp(a) + exp(b)) is numerically inaccurate and prone to underflow and overflow: when a and b are both very negative, exp(a) and exp(b) underflow to zero and you get log(0) = negative infinity. Even with this fix, the model hits the maximum treedepth, does not converge, and is not a great example to learn from.
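To make that concrete, log_sum_exp is built on the identity

log(exp(a) + exp(b)) = max(a, b) + log1p(exp(-|a - b|))

so exp is only ever applied to a non-positive argument. For example, with a = -800 and b = -810, both exp(a) and exp(b) underflow to zero in double precision, so log(exp(a) + exp(b)) evaluates to log(0) = -infinity, while log_sum_exp(a, b) = -800 + log1p(exp(-10)) ≈ -799.99995.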

Hi, I have written a vectorized version of a correlated Pareto/NBD model:

data {
  int<lower=1> N; // number of customers
  vector<lower=0>[N] x; // number of purchases per customer
  vector<lower=0>[N] t_x; // recency per customer
  vector<lower=0>[N] T; // observation period per customer
}

parameters {
  // Individual-level correlated coefficients
  matrix[2, N] eta; // standard normal errors
  cholesky_factor_corr[2] L_Omega; // Cholesky factor of the correlation matrix
  row_vector[2] chi; // mean
  vector<lower=0,upper=pi()/2>[2] tau_unif; // auxiliary variable for the scale
}

transformed parameters {
  vector[N] lambda; // Poisson purchase rate (correlated)
  vector[N] mu; // dropout rate (correlated)
  vector<lower=0>[2] tau; // scale of the individual-level correlated coefficients
  matrix[N, 2] zeta; // individual-level correlated coefficients

  // uniform(0, pi/2) pushed through tan() gives tau an implied half-Cauchy(0, 2.5) prior
  for (r in 1:2)
    tau[r] = 2.5 * tan(tau_unif[r]);

  zeta = rep_matrix(chi, N) + (diag_pre_multiply(tau, L_Omega) * eta)';

  lambda = exp(zeta[, 1]);
  mu = exp(zeta[, 2]);
}


model {
  vector[N] lam_mu = lambda + mu;

  to_vector(eta) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(chi) ~ normal(0, 3);

  // Pareto/NBD individual-level log-likelihood, with the unobserved dropout time integrated out
  target += x .* log(lambda) - log(lam_mu) - lam_mu .* t_x
            + log(mu + lambda .* exp(-(T - t_x) .* lam_mu));
}


generated quantities {
  matrix[2, 2] Omega; // correlation matrix
  matrix[2, 2] Sigma; // covariance matrix
  vector[N] p_alive;
  vector[N] cond_exp_T2;
  int T2 = 78; // forecast horizon

  Omega = multiply_lower_tri_self_transpose(L_Omega);
  Sigma = quad_form_diag(Omega, tau);

  // probability that each customer is still active at the end of the observation period
  p_alive = inv(1 + mu ./ (lambda + mu) .* (exp((lambda + mu) .* (T - t_x)) - 1));
  // expected number of transactions between T and T2, conditional on the customer being alive
  cond_exp_T2 = lambda ./ mu .* (1 - exp(-mu .* (T2 - T))) .* p_alive;
}

This model is based on the note by Dr. Fader and Dr. Hardie. I believe it is also the model from Abe (2009).
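For reference, the target increment in the model block is (as far as I can tell) the log of the standard individual-level Pareto/NBD likelihood from that note, with the unobserved dropout time integrated out:

L(lambda, mu | x, t_x, T) = lambda^x * mu / (lambda + mu) * exp(-(lambda + mu) * t_x)
                          + lambda^(x+1) / (lambda + mu) * exp(-(lambda + mu) * T)

Factoring exp(-(lambda + mu) * t_x) out of both terms and taking the log gives exactly the expression in the target += statement.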

What I noticed is that Stan is unable to converge when using a gamma distribution for the heterogeneity on lambda and mu; I tried that original formulation for over a month. It works really well when the gamma is replaced with a lognormal distribution.
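Concretely, the gamma-heterogeneity variant that would not converge for me is essentially the earlier parameterization in this thread:

  lambda ~ gamma(r, alpha);
  mu ~ gamma(s, beta);

whereas in the model above the heterogeneity is lognormal: zeta is multivariate normal and lambda = exp(zeta[, 1]), mu = exp(zeta[, 2]).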

I hope this helps.
