Absolutely stumped debugging target = -inf even though all log-likelihoods are real

I get persistent:

Log probability evaluates to log(0), i.e. negative infinity

breakdowns which I cannot for the life of me (or at least the last three days of me) debug.

A simple version of the question is: If I print out the log likelihoods for all of the sampling statements in my model block, and they are all real, how can the target be -infinity?

In more detail, my hierarchical model has three sampling statements (one multivariate normal, two univariate normal) that look like:

mu1[i] ~ multi_normal(mu1_mean, mu1_Sigma);
x_res[i] ~ normal(phi_ar^dday[i] * x_res[i-1], 
                        sigma_ar * sqrt(getphilag(phi_ar, dday[i])));
y_res[i] ~ normal(phi_ar^dday[i] * y_res[i-1], 
                        sigma_ar * sqrt(getphilag(phi_ar, dday[i])));

Here mu1 holds the hierarchical mean locations for the k individuals, and x_res and y_res are individual “residuals” at a lower hierarchical level. All the location/scale parameters are perfectly reasonable real numbers, which I know because I print everything while debugging, including the log probabilities, i.e.:

print("multi_normal lpdf: ", multi_normal_lpdf(mu1 | mu1_mean, mu1_Sigma))
print("normal x lpdf: ", normal_lpdf(x_res[i] | phi_ar^dday[i] * x_res[i-1], sigma_ar * sqrt(getphilag(phi_ar, dday[i]))));
print("normal y lpdf: ", normal_lpdf(y_res[i] | phi_ar^dday[i] * y_res[i-1],  sigma_ar * sqrt(getphilag(phi_ar, dday[i]))));

And only ever get nice numbers. But when I print the target, it is (nearly always) -inf.

What the heck!? What else can I do!?

Also I get this: Initialization between (-2, 2) failed after 100 attempts, even though ALL of my parameters are constrained and initialized correctly - as far as I can tell.

Any insights and help appreciated!

Complete code

The complete Stan code and R code to replicate is below (this is - in fact - a much simplified version of the model that I am actually interested in). Here, k individuals, each tracked for j_k locations over different lengths of time, move according to individual Ornstein-Uhlenbeck (continuous-time autocorrelated Gaussian) processes around mean locations \{X_k, Y_k\} \sim {\cal BivarN}(\mu, \Sigma). In the simulated example there are 6 individuals with 5 locations each, zero autocorrelation, and some slight ellipticity to the range of centroids. Occasionally (i.e. for some re-simulations) it works just fine. But mostly it doesn’t.
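
For reference, the standard discrete-time transition implied by an OU process observed at an irregular interval \Delta t (writing \sigma^2 for the stationary variance, so the second argument below is a variance rather than Stan’s scale) is:

X_{t+\Delta t} \mid X_t \sim {\cal N}\left(\mu + \phi^{\Delta t}(X_t - \mu),\ \sigma^2\left(1 - \phi^{2\Delta t}\right)\right), \qquad \phi = e^{-1/\tau}

which is what the x_res / y_res sampling statements above are meant to encode.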

Stan code

functions {
  // this function computes the adjustment to the variance
  // as a dependence of the intrinsic time scale and the 
  // interval between locations 
  
  real getphilag(real phi, real lag) {
    real philag;
    int i;
    philag = 0;
    i = 0; 
    while (i <= lag) { 
      philag = philag + pow(phi, lag * 2.0);
      i = i + 1; 
    } 
    return philag;
  }
}

data {
  int<lower=0> n;               // number of total observations
  int k;                        // n. of individuals
  real y[n];                    // y-coordinate
  real x[n];                    // x-coordinate
  real yday[n];                 // day of year
  real dday[n];                 // yday[i] - yday[i-1] for each individual
  int<lower=1,upper=k> id[n];   // ID
  int<lower=0,upper=1> printme;
} 

parameters {
  //vector<lower = min(x), upper = max(x)>[2] mux_mean;
  //vector<lower = min(y), upper = max(y)>[2] muy_mean;
  
  real<lower = min(x), upper = max(x)> mux_mean; 
  real<lower = min(x), upper = max(x)> muy_mean; 
  
  real<lower = min(x), upper = max(x)> mux1[k]; 
  real<lower = min(y), upper = max(y)> muy1[k]; 

  real<lower = 0> mux1_sigma;
  real<lower = 0> muy1_sigma;
  real<lower=-1, upper=1> rho1;

  real<lower = 0, upper = 100> sigma_ar; 
  real<lower = 0, upper = 100> tau; 
}

transformed parameters {
  cov_matrix[2] mu1_Sigma;
  real phi_ar;
  real sigma2;
  real A;
  
  mu1_Sigma[1,1] = mux1_sigma^2;
  mu1_Sigma[2,2] = muy1_sigma^2;
  mu1_Sigma[1,2] = mux1_sigma * muy1_sigma * rho1;
  mu1_Sigma[2,1] = mux1_sigma * muy1_sigma * rho1;

  phi_ar = exp(-1.0/tau);
  sigma2 = 2.0 * sigma_ar^2 / (tau * (1 - exp(-2.0/tau))^2); 
  A = sigma2 * (-2.0*log(0.05)*pi());
}  

model {
  // ancillary variables
  vector[n] x_hat;
  vector[n] y_hat;
  vector[n] x_res;
  vector[n] y_res;
  vector[2] mu1_mean;
  vector[2] mu1[k];  // mean locations of each individual
  
  //sigma_ar ~ uniform(0.0, 20.0); //exponential(1.0/(max(x) - min(x)));
  //tau ~ lognormal(7.0, 1);
  
  if(printme == 1){
    print("sigma_ar: ", sigma_ar);
    print("phi_ar: ", phi_ar);
  }
  
  mu1_mean[1] = mux_mean;
  mu1_mean[2] = muy_mean;

  for (i in 1:k){
    mu1[i][1] = mux1[i];
    mu1[i][2] = muy1[i];
    mu1[i] ~ multi_normal(mu1_mean, mu1_Sigma);
  }
  
  // hierarchical parameters
  if(printme == 1){
    print("mu1: ", mu1);
    print("mu1_Sigma: ", mu1_Sigma);
    print("multi_normal lpdf: ", multi_normal_lpdf(mu1 | mu1_mean, mu1_Sigma));
  }  
  x_res[1] = 0;
  y_res[1] = 0;
  
  for(i in 2:n){
    x_hat[i]  = mu1[id[i]][1];
    y_hat[i]  = mu1[id[i]][2];

    if(id[i] == id[i-1]){
      x_res[i] = x[i] - x_hat[i];
      y_res[i] = y[i] - y_hat[i];
      x_res[i] ~ normal(phi_ar^dday[i] * x_res[i-1], 
                        sigma_ar * sqrt(getphilag(phi_ar, dday[i])));
      y_res[i] ~ normal(phi_ar^dday[i] * y_res[i-1], 
                        sigma_ar * sqrt(getphilag(phi_ar, dday[i])));
      if(printme == 1){
        print("normal x lpdf: ", normal_lpdf(x_res[i] | phi_ar^dday[i] * x_res[i-1], sigma_ar * sqrt(getphilag(phi_ar, dday[i]))));
        print("normal y lpdf: ", normal_lpdf(y_res[i] | phi_ar^dday[i] * y_res[i-1],  sigma_ar * sqrt(getphilag(phi_ar, dday[i]))));
      }                  
    } else {
      x_res[i] = 0;
      y_res[i] = 0;
    }
  }
  
  if(printme == 1){
    print("target:", target());
  }
}

R code:


makeinits <- function(data){
  with(data, list(list(
    mux_mean = mean(x),
    muy_mean = mean(y), 
    mux1 = x[dday == 0],
    muy1 = y[dday == 0], 
    mux1_sigma = sd(x),
    muy1_sigma = sd(y),
    rho1 = 0, 
    sigma_ar = mean(tapply(x,id,sd)),
    tau = 1)))
}
system.time(model_simple <- stan_model(file = "buildfromscratch_v2.stan"))

k <- 6; j <- 5; n <- k * j; 
x.means <- c(-3,-2,-1,1,2,0)
y.means <- c(-2,-3,0,0,1,2)
sigma <- 1

set.seed(10)
data_simple <- {list(n = n,
                    k = k,
                    x = rep(x.means, each = j) + rnorm(n, sd = sigma),
                    y = rep(y.means, each = j) + rnorm(n, sd = sigma),
                    yday = rep(1:j, k),
                    id = rep(1:k, each = j),
                    dday = rep(c(0,rep(1,j-1)), k))
}

with(data_simple, plot(x,y, col = id, asp = 1, pch = 19))
data_simple$printme <- 1

sink("test.out")
fit_test <- sampling(model_simple, data = data_simple, 
                          init = makeinits(data_simple),
                          chains = 1,
                          iter = 3)
sink(NULL)

There’s an adjustment to target due to parameter constraints even before the model block executes but that’s not the problem here. I think here the reason you don’t see any bad values printed is that if a sampling statement fails it can reject the proposal immediately and preclude the execution of any statements that come after it.

I tried running your model and it seems the culprit is multi_normal(). The problem is that if the correlation rho1 is high then rounding errors can ruin the covariance matrix mu1_Sigma. Problematic values had rho1=1 but also tau=100 and sigma_ar=100.
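
Concretely, the determinant of this 2x2 covariance matrix is \det \Sigma = \sigma_x^2 \sigma_y^2 (1 - \rho_1^2), so as rho1 approaches \pm 1 rounding can make mu1_Sigma numerically singular or even indefinite, at which point multi_normal_lpdf fails or returns nonsense.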

You should use multi_normal_cholesky() instead. It is more stable.
Here’s how to factor mu1_Sigma:

  cholesky_factor_cov[2] mu1_Sigma_chol;
  mu1_Sigma_chol[1,1] = mux1_sigma;
  mu1_Sigma_chol[1,2] = 0;
  mu1_Sigma_chol[2,1] = muy1_sigma*rho1;
  mu1_Sigma_chol[2,2] = muy1_sigma*sqrt(1-square(rho1));
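
The sampling statement then becomes:

  mu1[i] ~ multi_normal_cholesky(mu1_mean, mu1_Sigma_chol);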

Or even better, declare mu1_Sigma_chol as a parameter and mux1_sigma etc. as transformed parameters.

Your commented-out lognormal(7,1) prior for tau is nonsense as it implies a prior median of exp(7), which is about 1100. (Maybe you meant lognormal(log(7),1)?) The uniform prior for sigma_ar is also not ideal.
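
Since lognormal’s first argument is on the log scale, compare:

  tau ~ lognormal(7, 1);       // median exp(7), roughly 1100
  tau ~ lognormal(log(7), 1);  // median 7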

Thanks Nhuurre for the quick response,

The Cholesky decomposition is a great tip and good to know about (esp. since I rely heavily on multivariate normals). Unfortunately, it didn’t solve my problem, the same errors appear. Any more ideas?

I think here the reason you don’t see any bad values printed is that if a sampling statement fails it can reject the proposal immediately and preclude the execution of any statements that come after it.

I don’t really understand this “rejection of proposals immediately”. Isn’t each proposal what’s being printed? How does one debug if the log-likelihoods are all real? How - for example - did you figure out that the problems were happening at large rho and Sigma?

Or even better, declare mu1_Sigma_chol as a parameter and mux1_sigma etc. as transformed parameters.

Curious - in general - about this too. There are parameters that are of actual (biological) interest (like mux1_sigma) and ones that are computationally more useful (like mu1_Sigma_chol). It seems more convenient to return (i.e. - place in the parameters block) the parameters that are more relevant, and to hide under the hood (i.e. - place in transformed parameters) the ones that are more computationally important. Is there a computational advantage to doing things the other way around?

Your commented-out lognormal(7,1) prior for tau is nonsense as it implies a prior median of exp(7), which is about 1100. The uniform prior for sigma_ar is also not ideal.

You’re witnessing (failed) attempts to get priors to work. Somehow, they wouldn’t. I.e., I would have something like sigma_ar ~ exponential(1.0 / (max(x) - min(x))), which was supposed to give average values of ~100, but the proposed values invariably looked like they came from whatever the default distribution is (maybe a standard lognormal?). I was going to follow up on this as well at some point in my debugging (mis)adventures. But, for what it’s worth, a prior for tau with a median around 1100 is not intrinsically nonsense, since it depends very much on the scale and units of the time measurements.

Updated Stan model with Cholesky decomposition:

functions {
  // this function computes the adjustment to the variance
  // as a dependence of the intrinsic time scale and the 
  // interval between locations 
  
  real getphilag(real phi, real lag) {
    real philag;
    int i;
    philag = 0;
    i = 0; 
    while (i <= lag) { 
      philag = philag + pow(phi, lag * 2.0);
      i = i + 1; 
    } 
    return philag;
  }
}

data {
  int<lower=0> n;               // number of total observations
  int k;                        // n. of individuals
  real y[n];                    // y-coordinate
  real x[n];                    // x-coordinate
  real yday[n];                 // day of year
  real dday[n];                 // yday[i] - yday[i-1] for each individual
  int<lower=1,upper=k> id[n];   // ID of caribou
  int<lower=0,upper=1> printme;
} 

parameters {
  //vector<lower = min(x), upper = max(x)>[2] mux_mean;
  //vector<lower = min(y), upper = max(y)>[2] muy_mean;
  
  real<lower = min(x), upper = max(x)> mux_mean; 
  real<lower = min(x), upper = max(x)> muy_mean; 
  
  real<lower = min(x), upper = max(x)> mux1[k]; 
  real<lower = min(y), upper = max(y)> muy1[k]; 

  real<lower = 0> mux1_sigma;
  real<lower = 0> muy1_sigma;
  real<lower=-1, upper=1> rho1;

  real<lower = 0, upper = 100> sigma_ar; 
  real<lower = 0, upper = 100> tau; 
}

transformed parameters {
  cov_matrix[2] mu1_Sigma_chol;
  real phi_ar;
  real sigma2;
  real A;
  
  mu1_Sigma_chol[1,1] = mux1_sigma;
  mu1_Sigma_chol[1,2] = 0;
  mu1_Sigma_chol[2,1] = muy1_sigma * rho1;
  mu1_Sigma_chol[2,2] = muy1_sigma * sqrt(1 - square(rho1));

  phi_ar = exp(-1.0/tau);
  sigma2 = 2.0 * sigma_ar^2 / (tau * (1 - exp(-2.0/tau))^2); 
  A = sigma2 * (-2.0*log(0.05)*pi());
}  

model {
  // ancillary variables
  vector[n] x_hat;
  vector[n] y_hat;
  vector[n] x_res;
  vector[n] y_res;
  vector[2] mu1_mean;
  vector[2] mu1[k];  // mean locations of each individual
  
  //sigma_ar ~ uniform(0.0, 20.0); //exponential(1.0/(max(x) - min(x)));
  //tau ~ lognormal(7.0, 1);
  
  if(printme == 1){
    print("sigma_ar: ", sigma_ar);
    print("phi_ar: ", phi_ar);
  }
  
  mu1_mean[1] = mux_mean;
  mu1_mean[2] = muy_mean;

  for (i in 1:k){
    mu1[i][1] = mux1[i];
    mu1[i][2] = muy1[i];
    mu1[i] ~ multi_normal_cholesky(mu1_mean, mu1_Sigma_chol);
  }
  
  // hierarchical parameters
  if(printme == 1){
    print("mu1: ", mu1);
    print("mu1_Sigma_chol: ", mu1_Sigma_chol);
    print("multi_normal_cholesky lpdf: ", multi_normal_cholesky_lpdf(mu1 | mu1_mean, mu1_Sigma_chol));
  }  
  x_res[1] = 0;
  y_res[1] = 0;
  
  for(i in 2:n){
    x_hat[i]  = mu1[id[i]][1];
    y_hat[i]  = mu1[id[i]][2];

    if(id[i] == id[i-1]){
      x_res[i] = x[i] - x_hat[i];
      y_res[i] = y[i] - y_hat[i];
      x_res[i] ~ normal(phi_ar^dday[i] * x_res[i-1], 
                        sigma_ar * sqrt(getphilag(phi_ar, dday[i])));
      y_res[i] ~ normal(phi_ar^dday[i] * y_res[i-1], 
                        sigma_ar * sqrt(getphilag(phi_ar, dday[i])));
      if(printme == 1){
        //print("x_res: ", x_res[i]);
        //print("y_res: ", y_res[i]);
       // print("x_res mean: ", phi_ar^dday[i] * x_res[i-1]);
        //print("y_res mean: ", phi_ar^dday[i] * y_res[i-1]);
        //print("xy_res sd: ", sigma_ar * sqrt(getphilag(phi_ar, dday[i-1])));
        print("normal x lpdf: ", normal_lpdf(x_res[i] | phi_ar^dday[i] * x_res[i-1], sigma_ar * sqrt(getphilag(phi_ar, dday[i]))));
        print("normal y lpdf: ", normal_lpdf(y_res[i] | phi_ar^dday[i] * y_res[i-1],  sigma_ar * sqrt(getphilag(phi_ar, dday[i]))));
      }                  
    } else {
      x_res[i] = 0;
      y_res[i] = 0;
    }
  }
  
  if(printme == 1){
    print("target:", target());
  }
}

You seem to have this backwards - the parameters block is what Stan’s HMC algorithm uses internally; transformed parameters (and generated quantities) are there just for user convenience.

It probably doesn’t matter which way you do it. If you want to set a prior for rho1 it should be a parameter, not a transformed parameter. (For example (1+rho1)/2 ~ beta(2,2))
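
A sketch of that prior as it appears in the model block (the transform of rho1 is linear, so the Jacobian is constant and no adjustment is needed):

model {
  // shift-and-scale rho1 from (-1, 1) onto (0, 1); linear, constant Jacobian
  (1 + rho1) / 2 ~ beta(2, 2);
}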

Stan generates a joint proposal for all parameter values and then computes the log likelihood of that proposal by running transformed parameters and model blocks. The log likelihood (and its autodiff gradient) is used to decide where to put the next proposal and which proposal will become the next posterior sample.

The code in the model block is executed sequentially, just like in any other imperative programming language. Print statements print their contents if and when execution reaches them. A sampling statement increments target (i.e. the log likelihood of the proposal), but if its inputs are out of bounds it throws a C++ exception and execution stops. Stan interprets that to mean the target is log(0). The fact that no further print statements are executed for that proposal does complicate debugging.
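
A minimal sketch of that control flow (y, mu, and sigma here are hypothetical, just to illustrate):

model {
  print("before");        // printed for every proposal
  y ~ normal(mu, sigma);  // throws if sigma is nan or non-positive;
                          // the proposal is rejected with target = -inf
  print("after");         // never reached for a rejected proposal
}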

Most of the proposals printed lots of log-likelihoods, all of which were finite, including target at the end. However, one printed only the first couple of parameters, and then there was a message saying something like: the current Metropolis proposal is about to be rejected - multi_normal_lpdf: the last variance is nan, but must be positive. Longer runs yielded a few cases where target was finite but ridiculously large in magnitude, around -10^{49} or so. There the full info was printed and everything was small except the multi_normal log likelihood. All of this was a strong hint that the correlation parameter was causing numerical errors. I also added print statements for tau and rho1 at the start of the model block to see what was going on with those. The fact that they hit the upper bound is a red flag.

Yeah, I don’t know what’s reasonable in the context of your model. You had upper bound 100 on tau so I assumed it couldn’t be too large. You’ll need weakly informative priors.

It works now, like a charm!

The problem (go figure) was that the model I coded up was wrong (the error having to do with the details of the continuous-time OU to discrete-time AR(1) equivalence). So the lesson is that a correct model is that much more likely to work than a wrong model!

I have a few more quick (general-ish) questions:

  1. The first time I implemented the Cholesky decomposition, I noticed that the matrix was still declared as an (implicitly symmetric) cov_matrix rather than a cholesky_factor_cov. As far as I could tell, that didn’t raise any error flags - though the matrix was obviously not symmetric. It might, however, be a reason intermediate iterations didn’t work (hard to tell because lots was changing). Shouldn’t that cause an error? Or - if not - what’s the point of declaring the matrix type?
  2. An important piece of this model is that the time intervals between observations can be irregular. In earlier versions I had a separate data vector (dtime) for those intervals. Now - for simplicity - I simply declare a dt object within the model block and compute it inside the individual loop. In R it is usually best to avoid that kind of extra computation inside a loop by doing it once outside. Is that true here too?

Thanks for helping dig out of this hole Nhuurre!

Hierarchical continuous time OU ranging model

data {
  int<lower=0> n;               // number of total observations
  int k;                        // n. of individuals
  real y[n];                    // y-coordinate
  real x[n];                    // x-coordinate
  real yday[n];                 // day of year
  int<lower=1,upper=k> id[n];   // individual ID
} 

parameters {
  //vector<lower = min(x), upper = max(x)>[2] mux_mean;
  //vector<lower = min(y), upper = max(y)>[2] muy_mean;
  
  real<lower = min(x), upper = max(x)> mux_mean; 
  real<lower = min(y), upper = max(y)> muy_mean; 
  
  real<lower = min(x), upper = max(x)> mux1[k]; 
  real<lower = min(y), upper = max(y)> muy1[k]; 

  real<lower = 0> mux1_sigma;
  real<lower = 0> muy1_sigma;
  real<lower=-1, upper=1> rho1;

  real<lower = 0> sigma_X; 
  real<lower = 0, upper = 1> phi_ar;
}

transformed parameters {
  cholesky_factor_cov[2] mu1_Sigma_chol;
  real tau;
  real A;
  
  mu1_Sigma_chol[1,1] = mux1_sigma;
  mu1_Sigma_chol[1,2] = 0;
  mu1_Sigma_chol[2,1] = muy1_sigma * rho1;
  mu1_Sigma_chol[2,2] = muy1_sigma * sqrt(1 - square(rho1));

  tau = -1.0/log(phi_ar); 
  A = -sigma_X^2 * log(0.05)*pi();  
}  

model {
  // ancillary variables
  vector[n] x_hat;
  vector[n] y_hat;
  vector[n] x_res;
  vector[n] y_res;
  vector[2] mu1_mean;
  vector[2] mu1[k];  
  real sigma_res;
  real dt; 
  
  sigma_X ~ exponential(1.0/(max(x) - min(x)));

  mu1_mean[1] = mux_mean;
  mu1_mean[2] = muy_mean;

  for (i in 1:k){
    mu1[i][1] = mux1[i];
    mu1[i][2] = muy1[i];
    mu1[i] ~ multi_normal_cholesky(mu1_mean, mu1_Sigma_chol);
  }
  
  // hierarchical parameters

  x_res[1] = 0;
  y_res[1] = 0;
  
  for(i in 2:n){
    x_hat[i]  = mu1[id[i]][1];
    y_hat[i]  = mu1[id[i]][2];

    if(id[i] == id[i-1]){
      x_res[i] = x[i] - x_hat[i];
      y_res[i] = y[i] - y_hat[i];
      dt = yday[i] - yday[i-1];
      sigma_res = sigma_X * (1 - phi_ar^(2*dt)) / (1 - phi_ar^2);
      x_res[i] ~ normal(phi_ar^dt * x_res[i-1], sigma_res);
      y_res[i] ~ normal(phi_ar^dt * y_res[i-1], sigma_res);
    } else {
      x_res[i] = 0;
      y_res[i] = 0;
    }
  }
}

Yes, I think that should cause an error? Constraints are only checked between blocks, though; see the User’s Guide.

It’s a little more efficient to compute dt in the transformed data block.
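
For example (a sketch reusing n and yday from the data block; dt[1] is a placeholder since the first observation has no predecessor):

transformed data {
  vector[n] dt;
  dt[1] = 0;  // unused: observation 1 has no preceding fix
  for (i in 2:n)
    dt[i] = yday[i] - yday[i-1];
}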