Rejecting initial value due to infinite gradient in a hidden markov model

Hi all,

I’m working on a hidden markov model to analyze multi-state capture-recapture data when there is uncertainty in the assignment of an individual’s state upon capture. In the ecological literature these are referred to as “multi-event” models. The transition and emission matrices are parametrized in terms of demographic rates and detection probabilities.

More concretely, my animals can be in three states: juveniles, adults or dead (coded 1,2,3 resp.). The observations take place in two periods each year, during the breeding season, and outside the breeding season. Thus there are four possible observable events for an individual in a given year: observed during the breeding season only, observed outside the breeding season only, observed in both periods, or neither (coded 1,2,3,4 resp.). The data consists of a capture history matrix, with each row corresponding to an individual, each column to a year, and each entry giving the individual’s observation code in that year.

Here’s my model, with the forward algorithm to implemented as described in the Stan User Manual’s page on HMMs:

data {
  int<lower=2> T;                          // number of years
  int<lower=1> N;                         // number of individuals
  int<lower=1,upper=4> y[N,T];    // capture histories
}

parameters {
  // survival probabilities
  real<lower=0,upper=1> psi1;    // juveniles 
  real<lower=0,upper=1> psi2;    // adults 
  // recruitment probability
  real<lower=0,upper=1> beta;
  // detection probabilities
  real<lower=0,upper=1> q1;      // juveniles outside breeding season
  real<lower=0,upper=1> q2;      // adults outside br. season
  real<lower=0,upper=1> p;        // adults during br. season
}

transformed parameters {
  simplex[3] theta[3];       // transition probabilities
  simplex[4] phi[3];          // emission probabilities
  // define theta
  theta[1] = [psi1*(1-beta), psi1*beta,   1-psi1]';
  theta[2] = [            0,      psi2,   1-psi2]';
  theta[3] = [            0,         0,        1]';
  //define phi
  phi[1] = [       0,       q1,    0,         1-q1]';
  phi[2] = [p*(1-q2), (1-p)*q2, p*q2, (1-p)*(1-q2)]';
  phi[3] = [       0,        0,    0,            1]';
}

model {
  // default priors
  // likelihood
  for (n in 1:N) { // loop over individuals
    matrix[T, 3] gamma;
    vector[3] acc;
    // initialise forward algorithm at t = 1
    for (k in 1:3) {
      gamma[1, k] = log(phi[k,y[n,1]]);
    }
    // recursion
    for (t in 2:T) {
      for (k in 1:3) {       // k is hidden state at t
        for (j in 1:3) {      // j is hidden state at t-1
          acc[j] = gamma[t-1,j] + log(theta[j, k])+ log(phi[k, y[n,t]]);
        }
        gamma[t, k] = log_sum_exp(acc);
      }
    }
    target += log_sum_exp(gamma[T]);
  }
}

And here is the R code I used to simulate data under the model:

### Set values for parameters and transformed parameters ###
# parameters
psi1 <- 0.77  
psi2 <- 0.9   
beta <- 0.33
q1 <- 0.7  
q2 <- 0.8  
p <-0.8  

# transition matrix
theta <- matrix(c(psi1*(1-beta), psi1*beta, 1-psi1,
                              0,      psi2, 1-psi2,
                              0,         0,      1), byrow=T, nrow=3)

# emission matrix
phi <- matrix(c(       0,       q1,    0,         1-q1,
                p*(1-q2), (1-p)*q2, p*q2, (1-p)*(1-q2),
                       0,        0,    0,           1), byrow=T, nrow=3)

### Simulate data ###
# Assume the individuals are all detected as juveniles at time 1
T <- 10
N <- 100

# Simulate hidden states
z <- matrix(NA,nrow=N,ncol=T)
z[,1] <- rep(1,N)
for (n in 1:N){
  for (t in 1:(T-1)){
    z[n,t+1] <- sample(1:3,1,prob=theta[z[n,t],])
  }
}

# Simulate observations
y <- matrix(4, N,T)
for (n in 1:N){
  for (t in 1:T){
    y[n,t] <- sample(1:4,1,prob=phi[z[n,t],])
  }
}

# Remove individuals that are never observed
observed <- which(apply(y,1,min)!=4)
y <- y[observed,]
z <- z[observed,]

# Fit model
N <- dim(y)[1]
T <- dim(y)[2]
input_data <- list(T=T,N=N,y=y)
fit <- stan("model.stan",data=input_data)

The model fails to initialize: the initial values are repeatedly rejected, with the error “Gradient evaluated at initial value is not finite.” I don’t understand why?

All the parameters are appropriately constrained, and I think all the simplices in the transformed parameters block are valid. Setting “inits=0” in the call to stan() does not help.

Perhaps the issue relates to log(0) values that occur in the forward algorithm, e.g. for log(theta[2,1])? These negative infinity values should behave the right way as far as the calculation of the likelihood goes, but perhaps they are a problem for the gradient?

Any insights or suggestions would be much appreciated!

1 Like

Ideally, Stan’s error message could contain some diagnostic information to help you locate which derivative is not finite.

It looks like one can provide an output file name for an optional diagnostic_file argument when invoking Stan. I haven’t tried using that myself, but that may help indicate which gradients are not finite: Fit a model with Stan — stan • rstan .

That could make sense. The derivative of log(p) is unbounded as p approaches 0 from the right.

If there isn’t a principled way to debug what is going on using Stan diagnostic output, one thing you could try is artificially narrowing the constraints of each probability parameter to prevent them from being zero. E.g. pick an arbitrary small positive finite constant and use that to redefine the lower bound of all survival, recruitment and detection probabilities, e.g. real<lower=0.01,upper=1>.

Another idea could be to sanity check that each individual is contributing a finite term to the log probability target – perhaps for whatever reason one or more individuals are deemed impossible by the model. Perhaps you could add a print statement before the target += log_sum_exp(gamma[T]); line to check this.

Another idea could be to think about priors for the survival, recruitment and detection probabilities. I suspect in theory and in practice none of these probabilities could ever be exactly zero or exactly one.

Instead of // default priors in the model block, you could try explicitly setting a prior for each of the probability parameters. Perhaps a reasonable place to start, in terms of getting the model to run at all, could be to set beta priors that vanish at both zero and one, say psi1 ~ beta(2, 2), and similarly for all the other probability parameters.

2 Likes

FWIW I’ve had good results implementing capture-recapture likelihoods like this on the probability scale (not the log scale), e.g., example-models/js_ms.stan at master · stan-dev/example-models · GitHub

4 Likes

Interesting idea! I believe the main reason for working on log-scale would be to avoid underflow when taking the product of a large number of probability factors.

In this model the posterior probability of each individual’s trajectory is given by something roughly like a product of T transition probabilities with a product of T observation probabilities. If each transition probability t is 0.05 or higher and each observation probability o is 0.04 or higher then it looks like (o \, t)^T would be positive for 64-bit precision floating point for T=115 but becomes zero for T=120 or higher. So that suggests if the number of years of observations T is 100 or less and the factors involved in the products are 0.04 or larger you might get away without working in the log scale.

1 Like

FWIW, we also implement some multi-state capture-recapture models on the probability-scale with a large number of transition and capture probabilities, but a small number of capture occasions. The reason for this in my case is approximately 10x speed-up by taking advantage of matrix multiplication (inspired by Fujiwara and Caswell (2002)) instead of the using the forward-algorithm. We also use an m-array like approach to gather alike capture histories with a large number of individuals.

Details here: biom.13171.pdf (1.4 MB)

As for the original question: yes, try print statements to see if you’re getting NaN. I wonder if you might be running into an issue where Stan is treating log(0) * 0 as NaN even though log(0^0) = 1.

1 Like

Stan correctly evaluates 0*log(0), but incorrectly evaluates the gradient. See

2 Likes

@mbc another way to work around this issue is to redefine the transition and emission matrices to replace all the zeros with very small positive constants:

transformed parameters {
  simplex[3] theta[3];       // transition probabilities
  simplex[4] phi[3];         // emission probabilities

  real delta;
  delta = 1.0e-15; // arbitrary constant s.t. $\delta \approx 0$ and $\log(\delta)$ finite

  // define theta
  theta[1] = [psi1*(1-beta), psi1*beta, 1-psi1]';
  theta[2] = [delta, (1-delta)*psi2, (1-delta)*(1-psi2)]';
  theta[3] = [delta, delta, (1-2*delta)*1]';
  //define phi
  phi[1] = [delta, (1-2*delta)*q1, delta, (1-2*delta)*(1-q1)]';
  phi[2] = [p*(1-q2), (1-p)*q2, p*q2, (1-p)*(1-q2)]';
  phi[3] = [delta, delta, delta, (1-3*delta)*1]';
}

This modification fixes the symptom in cmdstan-2.28.1. It isn’t very elegant!

Arguably a more elegant fix for the transition matrix part could be to rewrite the math and change the representation of the transition matrix to a sparse matrix, and compute the transitions as a log-domain sparse matrix vector product, so that when log-exp-summing to combine a number of log probs, we exploit the known sparsity pattern to omit the terms corresponding to impossible transitions that are log(0) aka -inf . This is exactly a matrix-vector multiply for CSC sparse matrix format, but in the log domain, replacing * with + and sum with log-exp-sum. I don’t think Stan offers that operation out of the box, it’d have to be defined. I have it implemented in some of my python code, when I get a bit more time I’ll port it to Stan user defined function.

1 Like

Problem solved, thanks for your help everyone!

It seems pretty settled that the issue arose due to zeros in the transition and emission matrices, and the resulting log(0) values that arise in the forward algorithm. When I added some print statements the likelihood itself was finite.

The model runs without issue if I work on the probability scale in the forward algorithm, storing marginal probabilities (not log marginal probs) in gamma, and only taking logarithms at the last step to increment target. Thanks @mbjoseph and @Dalton for that suggestion. It also works without issue when replacing zeros in the transition and emission matrices with tiny positive numbers, thanks @rfc.

Working on the log scale is presumably ‘the right way’ to do things, so down the line it would be interesting to know how to do that. @rfc I don’t know much about sparse matrix arithmetic, but that approach sounds promising.

1 Like

If anyone is curious, I’ve got a version of this model working where the transition matrix is encoded as a sparse matrix in the log-domain. It’s roughly a mashup of log-sum-exp with 7.1 Compressed row storage | Stan Functions Reference and 7.3 Sparse matrix arithmetic | Stan Functions Reference . I’m not very familiar with Stan so it probably isn’t idiomatic or fast.

Stan samples without complaining about any infinite gradients, and is able to approximately recover the parameters from the simulated data.

I find the resulting code about one to two orders of magnitude harder to read than @mbc 's original model – the sparse matrix encoding details are much less intuitive, and the Stan language forces different parts of the sparse matrix data to be defined in different blocks – it seems the integer data defining the column indices and the offsets into the packed data are not allowed to live inside transformed parameters block, as they are discrete parameters. Could be helpful for HMMs involving large sparse transition matrices that require the log domain for stability, but perhaps otherwise best avoided if there is a possibility of working with probabilities and small dense matrices.

functions {
    /* Returns log-domain matrix-vector product of
     * sparse matrix A with vector b.
     *
     * The sparse matrix is assumed to be CSR encoded
     * following the convention of csr_matrix_times_vector.
     * All entries in the vector and the matrix are assumed
     * to be log-domain, so entries not explicitly encoded
     * are assumed to be log(0) = -inf.
     *
     * @param m number of rows of A
     * @param n number of columns of A
     * @param w log-domain entries of A
     * @param v column indices of A corresponding to the entries w
     * @param u length n+1 array of indices into w and v
     * @param b source vector to multiply, of length n, with entries in log-domain.
     */
    vector log_csr_matrix_times_vector(int m, int n, vector w, int[] v, int[] u, vector b) {
        vector[m] dst;
        real acc_max, acc;
        int j;
        for (i in 1:m) {
            // log-sum-exp reduction for the sparse inner product
            // giving the i-th entry of the destination vector
            acc_max = negative_infinity();
            for (iota in u[i]:u[i+1]-1) {
                j = v[iota];  // j is column index
                acc_max = fmax(acc_max, w[iota] + b[j]);
            }
            if (acc_max <= negative_infinity()) {
                dst[i] = acc_max;
            } else {
                acc = 0.0;
                for (iota in u[i]:u[i+1]-1) {
                    j = v[iota];  // j is column index
                    acc += exp(w[iota] + b[j] - acc_max);
                }
                dst[i] = log(acc) + acc_max;
            }
        }
        return dst;
    }
}


data {
    int<lower=2> T; // number of years
    int<lower=1> N; // number of individuals
    int<lower=1,upper=4> y[N,T]; // capture histories
}

transformed data {
    int N_STATES = 3;
    int STATE_JUVENILE = 1;
    int STATE_ADULT = 2;
    int STATE_DEAD = 3;

    int N_OBS_TYPES = 4;
    int OBS_DURING_BREEDING_ONLY = 1;
    int OBS_OUTSIDE_BREEDING_ONLY = 2;
    int OBS_BOTH_PERIODS = 3;
    int OBS_NEITHER_PERIOD = 4;

    int N_TRANSITIONS = 6;
    int tr_cols[N_TRANSITIONS];
    int tr_ptrs[N_STATES + 1];

    // the sparse matrix format requires
    // transitions to be grouped by destination state

    tr_ptrs[STATE_JUVENILE] = 1;
    tr_cols[1] = STATE_JUVENILE; // juvenile -> juvenile

    tr_ptrs[STATE_ADULT] = 2;
    tr_cols[2] = STATE_JUVENILE; // juvenile -> adult
    tr_cols[3] = STATE_ADULT; // adult -> adult

    tr_ptrs[STATE_DEAD] = 4;
    tr_cols[4] = STATE_JUVENILE; // juvenile -> dead
    tr_cols[5] = STATE_ADULT; // adult -> dead
    tr_cols[6] = STATE_DEAD; // dead -> dead

    tr_ptrs[STATE_DEAD + 1] = 7;
}


parameters {
    // survival probabilities
    real<lower=0,upper=1> psi1;  // juveniles
    real<lower=0,upper=1> psi2;  // adults
    // recruitment probability
    real<lower=0,upper=1> beta;
    // detection probabilities
    real<lower=0,upper=1> q1;  // juveniles outside breeding season
    real<lower=0,upper=1> q2;  // adults outside br. season
    real<lower=0,upper=1> p;  // adults during br. season
}


transformed parameters {
    // transmission probabilities
    // these are encoded as log-probabilities into a log-domain
    // sparse matrix.

    vector[N_TRANSITIONS] tr_data;

    // the sparse matrix format requires
    // transitions to be grouped by destination state

    tr_data[1] = log(psi1*(1-beta)); // juvenile -> juvenile
    tr_data[2] = log(psi1*beta); // juvenile -> adult
    tr_data[3] = log(psi2); // adult -> adult
    tr_data[4] = log(1-psi1); // juvenile -> dead
    tr_data[5] = log(1-psi2); // adult -> dead
    tr_data[6] = log(1); // dead -> dead

    //define log_phi -- matrix of emission log-probabilities
    matrix[N_OBS_TYPES, N_STATES] log_phi;
    log_phi[1:N_OBS_TYPES, STATE_JUVENILE] = [log(0), log(q1), log(0), log(1-q1) ]';
    log_phi[1:N_OBS_TYPES, STATE_ADULT] = [log(p*(1-q2)), log((1-p)*q2), log(p*q2), log((1-p)*(1-q2))]';
    log_phi[1:N_OBS_TYPES, STATE_DEAD] = [log(0), log(0), log(0), log(1) ]';
}


model {
    // default priors
    // likelihood
    matrix[T, N_STATES] gamma;
    for (n in 1:N) { // loop over individuals
        // initialise forward algorithm at t = 1
        gamma[1, 1:N_STATES] = log_phi[y[n,1], 1:N_STATES];

        // recursion
        for (t in 2:T) {
            // multiply by sparse log-domain transition matrix
            gamma[t, 1:N_STATES] = log_csr_matrix_times_vector(N_STATES, N_STATES, tr_data, tr_cols, tr_ptrs, gamma[t-1, 1:N_STATES]')';
            // multiply by emission probability conditional on hidden state
            gamma[t, 1:N_STATES] += log_phi[y[n, t], 1:N_STATES];
        }
        target += log_sum_exp(gamma[T]);
    }
}
1 Like