Failed to Initialize Error (Log Probability Evaluates to Negative Infinity log(0))

Hoping to get some help understanding the source of this zero-probability error. My model is relatively straightforward once you understand the purpose of my user-defined function.

findx() takes two arguments: (1) context, an integer 1-16 that selects one of my empirically extracted distributions, implemented as piecewise linear functions (plfs), and (2) theta, a cumulative density threshold between 0 and 1. It outputs the cardinal x-value at which the plf's cumulative density curve reaches that threshold.

I’ve tried to debug with print statements in various places but haven’t been able to find the source of my issue. I have also included a Stan dump file below with the data I am feeding into the model.

Thank you in advance to anyone who is able to help.

gtmproduction_data.Rdata (7.3 MB)


functions {
  real findx (int context, real theta, real[,,] plfs, int rows) {
    real plf[rows, 3] = plfs[context,,];
    real default_x;

    for (row in 1:rows) {
      real x = plf[row, 1];
      real m = plf[row, 2];
      real b = plf[row, 3];
      real y = ((m * x) + b);

      if (theta <= y) {
        return ((theta - b) / m);
      }
    }
    default_x = (theta - plf[rows, 3]) / plf[rows, 2];
    return(default_x);
  }
}
data {
  int<lower=1> numrows;
  int<lower=0> N;
  int<lower=0> contexts[N, 5];
  real<lower=-1, upper=70> plfs[16, numrows, 3];
}
parameters {
  real<lower=0, upper=1> theta_few;
  real<lower=0, upper=1> theta_many;
  vector<lower=0>[16] sigma;

}
model {
  theta_few ~ uniform(0, 1);
  theta_many ~ uniform(0, 1);

  for (c in 1:16) {
    sigma[c] ~ gamma(5,3);
  }

  for (i in 1:N) {
    if (contexts[i, 3] == 1) {
      contexts[i, 5] ~ bernoulli(normal_cdf(contexts[i,4], findx(contexts[i, 2], theta_many, plfs, numrows), sigma[contexts[i, 2]]));
    } else if (contexts[i, 3] == 2) {
      contexts[i, 5] ~ bernoulli(1 - (normal_cdf(contexts[i,4], findx(contexts[i, 2], theta_few, plfs, numrows), sigma[contexts[i, 2]])));
    }
  }
}

This appears to be a numerical accuracy problem. The implicit default initialization (sigma ≈ 2) is so far from the optimum (around sigma ≈ 20-30, I think) that some of the Bernoulli probabilities computed in the model fall outside the range of double-precision floating-point arithmetic and round down to zero. The model at least initializes when rewritten as

  for (i in 1:N) {
    if (contexts[i, 3] == 1) {
      real x = findx(contexts[i, 2], theta_many, plfs, numrows);
      real s = sigma[contexts[i, 2]];
      real logit_p = normal_lcdf(contexts[i,4]| x, s) - normal_lcdf(-contexts[i,4]| -x, s);
      contexts[i, 5] ~ bernoulli_logit(logit_p);
    } else if (contexts[i, 3] == 2) {
      real x = findx(contexts[i, 2], theta_few, plfs, numrows);
      real s = sigma[contexts[i, 2]];
      real logit_p = normal_lcdf(-contexts[i,4]| -x, s) - normal_lcdf(contexts[i,4]| x, s);
      contexts[i, 5] ~ bernoulli_logit(logit_p);
    }
  }
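The underflow mechanism can be reproduced in plain Python (a sketch; `normal_lcdf` below is a hypothetical stand-in for Stan's `normal_lcdf`, using the standard far-tail asymptotic expansion for log Phi):

```python
import math

def normal_lcdf(z):
    # Hypothetical stand-in for Stan's normal_lcdf (standard normal).
    # erfc is accurate until it underflows (roughly z < -27), so switch to
    # the asymptotic expansion log Phi(z) ~ -z^2/2 - log(-z*sqrt(2*pi)) far out.
    if z > -20:
        return math.log(0.5 * math.erfc(-z / math.sqrt(2)))
    return -0.5 * z * z - math.log(-z * math.sqrt(2.0 * math.pi))

z = -40.0                                   # e.g. (c - x) / sigma with a bad init
naive_p = 0.5 * math.erfc(-z / math.sqrt(2))
print(naive_p)                              # 0.0 -- underflows, so log(p) = -inf
logit_p = normal_lcdf(z) - normal_lcdf(-z)  # log p - log(1 - p), stays finite
print(logit_p)
```

This is why the rewrite subtracts two `normal_lcdf` terms to form a logit and hands it to `bernoulli_logit`: the probability itself is never materialized, so it cannot round to zero.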

By the way, your findx function seems to eat a lot of RAM. It’s more efficient to access the array directly without creating intermediate variables.

  real findx(int context, real theta, real[,,] plfs, int rows) {
    real default_x;

    for (row in 1:rows) {

      if (theta <= ((plfs[context, row, 2] * plfs[context, row, 1]) + plfs[context,row, 3])) {
        return ((theta - plfs[context,row, 3]) / plfs[context,row, 2]);
      }
    }
    default_x = (theta - plfs[context,rows, 3]) / plfs[context,rows, 2];
    return(default_x);
  }

Hi Niko,

Thank you so much for your thoughtful response! I share your intuition that a low sigma value might be contributing to the underflow error. Your logit transform seems like a good solution, but when I implement it I still get the same initialization error.

I’ve included an updated version of my code below for reference. I also modified my findx() function according to your suggestions. Much appreciated.

functions {
  real findx(int context, real theta, real[,,] plfs, int rows) {
    real default_x;

    for (row in 1:rows) {
      if (theta <= ((plfs[context, row, 2] * plfs[context, row, 1]) + plfs[context,row, 3])) {
        return ((theta - plfs[context,row, 3]) / plfs[context,row, 2]);
      }
    }
    default_x = (theta - plfs[context,rows, 3]) / plfs[context,rows, 2];
    return(default_x);
  }
  
}
data {
  int<lower=1> numrows;
  int<lower=0> N;
  int<lower=0> contexts[N, 5];
  real<lower=-1, upper=70> plfs[16, numrows, 3];
}
parameters {
  real<lower=0, upper=1> theta_few;
  real<lower=0, upper=1> theta_many;
  vector<lower=0>[16] sigma;

}
model {
  theta_few ~ uniform(0, 1);
  theta_many ~ uniform(0, 1);

  for (c in 1:16) {
    sigma[c] ~ gamma(5,3);
  }

  for (i in 1:N) {
    if (contexts[i, 3] == 1) {
      real x = findx(contexts[i, 2], theta_many, plfs, numrows);
      real s = sigma[contexts[i, 2]];
      real logit_p = normal_lcdf(contexts[i,4]| x, s) - normal_lcdf(-contexts[i,4]| -x, s);
      contexts[i, 5] ~ bernoulli_logit(logit_p);
    } else if (contexts[i, 3] == 2) {
      real x = findx(contexts[i, 2], theta_few, plfs, numrows);
      real s = sigma[contexts[i, 2]];
      real logit_p = normal_lcdf(-contexts[i,4]| -x, s) - normal_lcdf(contexts[i,4]| x, s);
      contexts[i, 5] ~ bernoulli_logit(logit_p);
    }
  }
}

Using CmdStan 2.32.2 and the data file provided in the first post, that last model initializes for me.
Optimization finds the maximum likelihood estimate:

theta_few = 0.503236
theta_many = 0.651295
sigma = [17.4533,19.7441,18.7241,16.2543,10.7599,12.2353,9.96245,9.4284,8.85776,9.39683,11.5336,16.5574,14.2274,13.2863,17.0483,8.26992]

HMC sampling seems a bit more problematic, but when initialized from the maximum likelihood estimate it still runs:

                 Mean      MCSE    StdDev     5%    50%    95%  N_Eff  N_Eff/s  R_hat

lp__            -2754  4.09e-01  2.84e+00  -2759  -2754  -2750   48.2    0.170   1.01
accept_stat__   0.911  5.57e-03  8.36e-02  0.747  0.936   1.00    225    0.794  0.997
stepsize__      0.599       nan  0.00e+00  0.599  0.599  0.599    nan      nan    nan
treedepth__      3.00       nan  2.23e-15   3.00   3.00   3.00    nan      nan    nan
n_leapfrog__     7.04  3.96e-02  5.66e-01   7.00   7.00   7.00    204    0.721    nan
divergent__     0.000       nan  0.00e+00  0.000  0.000  0.000    nan      nan    nan
energy__         2763  6.12e-01  4.22e+00   2757   2762   2771   47.5    0.168   1.00

theta_few       0.502  6.86e-04  1.47e-02  0.477  0.502  0.526    460     1.63  0.996
theta_many      0.650  7.45e-04  1.25e-02  0.629  0.650  0.670    283    0.999  0.999
sigma[1]         17.5  7.00e-02  1.14e+00   15.5   17.4   19.3    267    0.942  0.998
sigma[2]         19.9  5.40e-02  1.07e+00   18.2   19.9   21.7    392     1.38   1.00
sigma[3]         19.0  7.62e-02  1.31e+00   16.9   18.9   21.5    294     1.04  1.000
sigma[4]         16.4  5.43e-02  1.17e+00   14.4   16.4   18.1    460     1.63  0.996
sigma[5]         10.9  4.19e-02  7.17e-01   9.83   10.8   12.1    294     1.04  0.996
sigma[6]         12.4  4.64e-02  8.41e-01   11.0   12.4   13.8    329     1.16  0.997
sigma[7]         10.1  4.54e-02  8.04e-01   8.87   10.1   11.6    314     1.11  0.995
sigma[8]         9.52  3.16e-02  6.77e-01   8.51   9.51   10.6    460     1.63  0.995
sigma[9]         9.16  5.30e-02  7.33e-01   7.92   9.14   10.3    191    0.676  0.995
sigma[10]        9.54  4.17e-02  8.41e-01   8.25   9.59   10.9    407     1.44  0.996
sigma[11]        11.7  4.75e-02  9.76e-01   10.3   11.6   13.2    422     1.49  0.995
sigma[12]        16.9  7.81e-02  1.20e+00   14.7   16.8   18.8    236    0.833  0.997
sigma[13]        14.4  4.46e-02  9.56e-01   12.9   14.2   16.0    460     1.63  0.997
sigma[14]        13.4  5.24e-02  9.34e-01   11.9   13.4   15.1    318     1.12  0.996
sigma[15]        17.2  5.52e-02  1.02e+00   15.6   17.1   19.1    339     1.20  0.995
sigma[16]        8.39  5.56e-02  6.18e-01   7.41   8.33   9.55    124    0.437  0.996

There are no obvious problems with the fit, but it’s only 200 draws from a single chain.

It looks like your findx() is some kind of interpolation scheme. It may be faster to compute the thresholds ahead of time and do a binary search rather than a linear scan.
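A sketch of that idea in Python (hypothetical helper names; assumes each plf's breakpoint CDF values m*x + b are nondecreasing in row order, as the linear scan already requires):

```python
import bisect

def make_findx(plf):
    # plf: list of (x, m, b) rows for one context's piecewise linear CDF.
    # Precompute the breakpoint CDF values y_r = m_r * x_r + b_r once...
    ys = [m * x + b for (x, m, b) in plf]

    def findx(theta):
        # ...then invert with a binary search instead of a linear scan.
        row = bisect.bisect_left(ys, theta)  # first row with theta <= y_r
        if row == len(plf):                  # past the last breakpoint:
            row = len(plf) - 1               # fall back to the final segment
        _, m, b = plf[row]
        return (theta - b) / m

    return findx

# Usage on a toy two-segment plf:
findx = make_findx([(1.0, 0.5, 0.0), (2.0, 0.25, 0.5)])
print(findx(0.3))    # 0.6  (first segment: 0.3 / 0.5)
print(findx(0.75))   # 1.0  (second segment: (0.75 - 0.5) / 0.25)
```

In Stan the thresholds could go in `transformed data`, since plfs is data; the scan then drops from O(rows) to O(log rows) per leapfrog step.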


I can’t get this to initialize using rstan, so I’ve switched over to cmdstanr with CmdStan version 2.34.1, but now my model will not even compile. It shows a function argument declaration error for “real[,,] plfs”, which the Stan Reference Manual suggests is a perfectly legitimate function argument.

> gtm_model <- cmdstan_model('gtm-production.stan')
Compiling Stan program...
Syntax error in '/tmp/RtmpZYNPuq/model-7cb8369eb287.stan', line 2, column 42 to column 43, parsing error:
   -------------------------------------------------
     1:  functions {
     2:    real findx(int context, real theta, real[ , , ] plfs, int rows) {
                                                   ^
     3:      real default_x;
     4:  
   -------------------------------------------------

An identifier is expected as a function argument name.

make: *** [make/program:50: /tmp/RtmpZYNPuq/model-7cb8369eb287.hpp] Error 1

Error: An error occured during compilation! See the message above for more information.

The section on Dimensionality declaration begins

Arguments and return types may be arrays, and these are indicated with optional brackets and commas as would be used for indexing.

which isn’t entirely accurate, as the following sentence reads

For example, int denotes a single integer argument or return, whereas array[] real indicates a one-dimensional array of reals, array[,] real a two-dimensional array and array[,,] real a three-dimensional array

Note that the [,,] now goes before real and there’s a new keyword, array. This change happened in Stan 2.33.
The correct syntax is:

functions {
  real findx(int context, real theta, array[,,] real plfs, int rows) {
    ...
  }
}
data {
  int<lower=1> numrows;
  int<lower=0> N;
  array[N, 5] int<lower=0> contexts;
  array[16, numrows, 3] real<lower=-1, upper=70> plfs;
}
parameters {
  real<lower=0, upper=1> theta_few;
  real<lower=0, upper=1> theta_many;
  vector<lower=0>[16] sigma;
}

I have another version of this model which is categorical rather than binomial. I have attempted to carry the log-space approach over to this model but keep getting a number of different errors depending on how I go about it. Currently I’m using softmax to normalize, but I’m getting a “rejecting initial value” error.

Chain 1 Rejecting initial value:
Chain 1 Gradient evaluated at the initial value is not finite.
Chain 1 Stan can’t start sampling from this initial value.

I’ve attached my code below. I very much appreciate any insights.

functions {
  row_vector comp_post(int context, int quant, real theta, real sigma, array[,,] real plfs, int rows, array[,] real evds) {
    real denom = 0;
    row_vector[70] unnormalized;
    real findx;
    
    findx = (theta - plfs[context,rows, 3]) / plfs[context,rows, 2];
    for (row in 1:rows) {
      if (theta <= ((plfs[context, row, 2] * plfs[context, row, 1]) + plfs[context,row, 3])) {
        findx = ((theta - plfs[context,row, 3]) / plfs[context,row, 2]);
        break;
      }
    }
    
    for (val in 1:70) {
      if (quant == 1) {
        unnormalized[val] = (log(evds[context,val]) + normal_lcdf(val | findx, sigma));
      } else { unnormalized[val] = (log(evds[context,val]) + normal_lcdf(-val | -findx, sigma)); }
    }
    
    return to_row_vector(softmax(to_vector(unnormalized)));
  }
}

data {
  int<lower=1> numrows;
  int<lower=0> N;
  array[N,4] int<lower=0, upper=91> contexts;
  array[16,numrows,3] real<lower=-1> plfs;
  array[16,70] real<lower=0> evds;
}
parameters {
  real<lower=0, upper=1> theta_few;
  real<lower=0, upper=1> theta_many;
  array[16] real<lower=0> sigma;
  
}
transformed parameters {
  matrix[16,70] post_many;
  matrix[16,70] post_few;
  
   for (c in 1:16) {
      post_many[c,] = comp_post(c, 1, theta_many, sigma[c], plfs, numrows, evds);
      post_few[c,] = comp_post(c, 2, theta_few, sigma[c], plfs, numrows, evds);
  }
}
model {
  theta_few ~ uniform(0, 1);
  theta_many ~ uniform(0, 1);
  
  for (c in 1:16) {
    sigma[c] ~ gamma(2,1);
  }
  
  for (i in 1:N) {
    contexts[i,4] ~ categorical(to_vector(post_many[contexts[i,2]]));
  }
}

gtmcomprehension_data.Rdata (7.3 MB)

You should use categorical_logit(..) instead of categorical(softmax(..)), but either way, the data are inconsistent with the model.
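The numerical reason to prefer the logit form can be sketched in Python (`log_softmax` here is a hypothetical helper, not a Stan function):

```python
import math

def log_softmax(logits):
    # Stable log-softmax via the log-sum-exp trick: subtract the max first.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

# categorical_logit(y | alpha) is categorical(y | softmax(alpha)) computed in
# log space; the naive route materializes probabilities that can underflow.
logits = [-800.0, -802.0, -801.0]
probs = [math.exp(l) for l in logits]   # every entry underflows to 0.0
print(probs)                            # [0.0, 0.0, 0.0]
print(log_softmax(logits))              # finite log-probabilities
```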

The model basically says

for (n in 1:N)
  contexts[n,4] ~ categorical(evds[contexts[n,2],:] * (...));

and this implies zero probability for any observation where evds[contexts[n,2],contexts[n,4]] == 0, but the evds array in the provided data file contains many such zeros.
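The failure mode is easy to reproduce in isolation (a minimal sketch; the numbers are made up):

```python
import math

# A categorical likelihood with zero prior mass on an observed category
# contributes log(0) = -inf to the log density, so neither the density nor
# its gradient is finite -- hence the "Rejecting initial value" message.
evds_row = [0.2, 0.0, 0.8]   # hypothetical evds[context, :] with a zero entry
observed = 2                 # 1-based category whose mass is exactly zero
p = evds_row[observed - 1]
log_lik = math.log(p) if p > 0.0 else float("-inf")
print(log_lik)               # -inf
```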
The model initializes successfully if these zeros are bumped to a small positive number

transformed data {
  array[16,70] real<lower=0> evds_ = evds;
  for (i in 1:16) for (j in 1:70) if (evds_[i,j] == 0) {
    evds_[i,j] = 1e-8;
  }
}

but that only papers over the inconsistency.