Rejecting initial value due to infinite gradient in a hidden markov model

Hi all,

I’m working on a hidden markov model to analyze multi-state capture-recapture data when there is uncertainty in the assignment of an individual’s state upon capture. In the ecological literature these are referred to as “multi-event” models. The transition and emission matrices are parametrized in terms of demographic rates and detection probabilities.

More concretely, my animals can be in three states: juveniles, adults or dead (coded 1,2,3 resp.). The observations take place in two periods each year, during the breeding season, and outside the breeding season. Thus there are four possible observable events for an individual in a given year: observed during the breeding season only, observed outside the breeding season only, observed in both periods, or neither (coded 1,2,3,4 resp.). The data consists of a capture history matrix, with each row corresponding to an individual, each column to a year, and each entry giving the individual’s observation code in that year.

Here’s my model, with the forward algorithm to implemented as described in the Stan User Manual’s page on HMMs:

data {
  int<lower=2> T;                          // number of years
  int<lower=1> N;                         // number of individuals
  int<lower=1,upper=4> y[N,T];    // capture histories
}

parameters {
  // survival probabilities
  real<lower=0,upper=1> psi1;    // juveniles 
  real<lower=0,upper=1> psi2;    // adults 
  // recruitment probability
  real<lower=0,upper=1> beta;
  // detection probabilities
  real<lower=0,upper=1> q1;      // juveniles outside breeding season
  real<lower=0,upper=1> q2;      // adults outside br. season
  real<lower=0,upper=1> p;        // adults during br. season
}

transformed parameters {
  simplex[3] theta[3];       // transition probabilities
  simplex[4] phi[3];          // emission probabilities
  // define theta
  theta[1] = [psi1*(1-beta), psi1*beta,   1-psi1]';
  theta[2] = [            0,      psi2,   1-psi2]';
  theta[3] = [            0,         0,        1]';
  //define phi
  phi[1] = [       0,       q1,    0,         1-q1]';
  phi[2] = [p*(1-q2), (1-p)*q2, p*q2, (1-p)*(1-q2)]';
  phi[3] = [       0,        0,    0,            1]';
}

model {
  // default priors
  // likelihood
  for (n in 1:N) { // loop over individuals
    matrix[T, 3] gamma;
    vector[3] acc;
    // initialise forward algorithm at t = 1
    for (k in 1:3) {
      gamma[1, k] = log(phi[k,y[n,1]]);
    }
    // recursion
    for (t in 2:T) {
      for (k in 1:3) {       // k is hidden state at t
        for (j in 1:3) {      // j is hidden state at t-1
          acc[j] = gamma[t-1,j] + log(theta[j, k])+ log(phi[k, y[n,t]]);
        }
        gamma[t, k] = log_sum_exp(acc);
      }
    }
    target += log_sum_exp(gamma[T]);
  }
}

And here is the R code I used to simulate data under the model:

### Set values for parameters and transformed parameters ###
# parameters
psi1 <- 0.77  
psi2 <- 0.9   
beta <- 0.33
q1 <- 0.7  
q2 <- 0.8  
p <-0.8  

# transition matrix
theta <- matrix(c(psi1*(1-beta), psi1*beta, 1-psi1,
                              0,      psi2, 1-psi2,
                              0,         0,      1), byrow=T, nrow=3)

# emission matrix
phi <- matrix(c(       0,       q1,    0,         1-q1,
                p*(1-q2), (1-p)*q2, p*q2, (1-p)*(1-q2),
                       0,        0,    0,           1), byrow=T, nrow=3)

### Simulate data ###
# Assume the individuals are all detected as juveniles at time 1
T <- 10
N <- 100

# Simulate hidden states
z <- matrix(NA,nrow=N,ncol=T)
z[,1] <- rep(1,N)
for (n in 1:N){
  for (t in 1:(T-1)){
    z[n,t+1] <- sample(1:3,1,prob=theta[z[n,t],])
  }
}

# Simulate observations
y <- matrix(4, N,T)
for (n in 1:N){
  for (t in 1:T){
    y[n,t] <- sample(1:4,1,prob=phi[z[n,t],])
  }
}

# Remove individuals that are never observed
observed <- which(apply(y,1,min)!=4)
y <- y[observed,]
z <- z[observed,]

# Fit model
N <- dim(y)[1]
T <- dim(y)[2]
input_data <- list(T=T,N=N,y=y)
fit <- stan("model.stan",data=input_data)

The model fails to initialize: the initial values are repeatedly rejected, with the error “Gradient evaluated at initial value is not finite.” I don’t understand why?

All the parameters are appropriately constrained, and I think all the simplices in the transformed parameters block are valid. Setting “inits=0” in the call to stan() does not help.

Perhaps the issue relates to log(0) values that occur in the forward algorithm, e.g. for log(theta[2,1])? These negative infinity values should behave the right way as far as the calculation of the likelihood goes, but perhaps they are a problem for the gradient?

Any insights or suggestions would be much appreciated!

1 Like

Ideally, Stan’s error message could contain some diagnostic information to help you locate which derivative is not finite.

It looks like one can provide an output file name for an optional diagnostic_file argument when invoking Stan. I haven’t tried using that myself, but that may help indicate which gradients are not finite: Fit a model with Stan — stan • rstan .

That could make sense. The derivative of log(p) is unbounded as p approaches 0 from the right.

If there isn’t a principled way to debug what is going on using Stan diagnostic output, one thing you could try is artificially narrowing the constraints of each probability parameter to prevent them from being zero. E.g. pick an arbitrary small positive finite constant and use that to redefine the lower bound of all survival, recruitment and detection probabilities, e.g. real<lower=0.01,upper=1>.

Another idea could be to sanity check that each individual is contributing a finite term to the log probability target – perhaps for whatever reason one or more individuals are deemed impossible by the model. Perhaps you could add a print statement before the target += log_sum_exp(gamma[T]); line to check this.

Another idea could be to think about priors for the survival, recruitment and detection probabilities. I suspect in theory and in practice none of these probabilities could ever be exactly zero or exactly one.

Instead of // default priors in the model block, you could try explicitly setting a prior for each of the probability parameters. Perhaps a reasonable place to start, in terms of getting the model to run at all, could be to set beta priors that vanish at both zero and one, say psi1 ~ beta(2, 2), and similarly for all the other probability parameters.

2 Likes

FWIW I’ve had good results implementing capture-recapture likelihoods like this on the probability scale (not the log scale), e.g., example-models/js_ms.stan at master · stan-dev/example-models · GitHub

4 Likes

Interesting idea! I believe the main reason for working on log-scale would be to avoid underflow when taking the product of a large number of probability factors.

In this model the posterior probability of each individual’s trajectory is given by something roughly like a product of T transition probabilities with a product of T observation probabilities. If each transition probability t is 0.05 or higher and each observation probability o is 0.04 or higher then it looks like (o \, t)^T would be positive for 64-bit precision floating point for T=115 but becomes zero for T=120 or higher. So that suggests if the number of years of observations T is 100 or less and the factors involved in the products are 0.04 or larger you might get away without working in the log scale.

1 Like

FWIW, we also implement some multi-state capture-recapture models on the probability-scale with a large number of transition and capture probabilities, but a small number of capture occasions. The reason for this in my case is approximately 10x speed-up by taking advantage of matrix multiplication (inspired by Fujiwara and Caswell (2002)) instead of the using the forward-algorithm. We also use an m-array like approach to gather alike capture histories with a large number of individuals.

Details here: biom.13171.pdf (1.4 MB)

As for the original question: yes, try print statements to see if you’re getting NaN. I wonder if you might be running into an issue where Stan is treating log(0) * 0 as NaN even though log(0^0) = 1.

1 Like

Stan correctly evaluates 0*log(0), but incorrectly evaluates the gradient. See

2 Likes

@mbc another way to work around this issue is to redefine the transition and emission matrices to replace all the zeros with very small positive constants:

transformed parameters {
  simplex[3] theta[3];       // transition probabilities
  simplex[4] phi[3];         // emission probabilities

  real delta;
  delta = 1.0e-15; // arbitrary constant s.t. $\delta \approx 0$ and $\log(\delta)$ finite

  // define theta
  theta[1] = [psi1*(1-beta), psi1*beta, 1-psi1]';
  theta[2] = [delta, (1-delta)*psi2, (1-delta)*(1-psi2)]';
  theta[3] = [delta, delta, (1-2*delta)*1]';
  //define phi
  phi[1] = [delta, (1-2*delta)*q1, delta, (1-2*delta)*(1-q1)]';
  phi[2] = [p*(1-q2), (1-p)*q2, p*q2, (1-p)*(1-q2)]';
  phi[3] = [delta, delta, delta, (1-3*delta)*1]';
}

This modification fixes the symptom in cmdstan-2.28.1. It isn’t very elegant!

Arguably a more elegant fix for the transition matrix part could be to rewrite the math and change the representation of the transition matrix to a sparse matrix, and compute the transitions as a log-domain sparse matrix vector product, so that when log-exp-summing to combine a number of log probs, we exploit the known sparsity pattern to omit the terms corresponding to impossible transitions that are log(0) aka -inf . This is exactly a matrix-vector multiply for CSC sparse matrix format, but in the log domain, replacing * with + and sum with log-exp-sum. I don’t think Stan offers that operation out of the box, it’d have to be defined. I have it implemented in some of my python code, when I get a bit more time I’ll port it to Stan user defined function.

1 Like

Problem solved, thanks for your help everyone!

It seems pretty settled that the issue arose due to zeros in the transition and emission matrices, and the resulting log(0) values that arise in the forward algorithm. When I added some print statements the likelihood itself was finite.

The model runs without issue if I work on the probability scale in the forward algorithm, storing marginal probabilities (not log marginal probs) in gamma, and only taking logarithms at the last step to increment target. Thanks @mbjoseph and @Dalton for that suggestion. It also works without issue when replacing zeros in the transition and emission matrices with tiny positive numbers, thanks @rfc.

Working on the log scale is presumably ‘the right way’ to do things, so down the line it would be interesting to know how to do that. @rfc I don’t know much about sparse matrix arithmetic, but that approach sounds promising.

1 Like

If anyone is curious, I’ve got a version of this model working where the transition matrix is encoded as a sparse matrix in the log-domain. It’s roughly a mashup of log-sum-exp with 7.1 Compressed row storage | Stan Functions Reference and 7.3 Sparse matrix arithmetic | Stan Functions Reference . I’m not very familiar with Stan so it probably isn’t idiomatic or fast.

Stan samples without complaining about any infinite gradients, and is able to approximately recover the parameters from the simulated data.

I find the resulting code about one to two orders of magnitude harder to read than @mbc 's original model – the sparse matrix encoding details are much less intuitive, and the Stan language forces different parts of the sparse matrix data to be defined in different blocks – it seems the integer data defining the column indices and the offsets into the packed data are not allowed to live inside transformed parameters block, as they are discrete parameters. Could be helpful for HMMs involving large sparse transition matrices that require the log domain for stability, but perhaps otherwise best avoided if there is a possibility of working with probabilities and small dense matrices.

functions {
    /* Returns log-domain matrix-vector product of
     * sparse matrix A with vector b.
     *
     * The sparse matrix is assumed to be CSR encoded
     * following the convention of csr_matrix_times_vector.
     * All entries in the vector and the matrix are assumed
     * to be log-domain, so entries not explicitly encoded
     * are assumed to be log(0) = -inf.
     *
     * @param m number of rows of A
     * @param n number of columns of A
     * @param w log-domain entries of A
     * @param v column indices of A corresponding to the entries w
     * @param u length n+1 array of indices into w and v
     * @param b source vector to multiply, of length n, with entries in log-domain.
     */
    vector log_csr_matrix_times_vector(int m, int n, vector w, int[] v, int[] u, vector b) {
        vector[m] dst;
        real acc_max, acc;
        int j;
        for (i in 1:m) {
            // log-sum-exp reduction for the sparse inner product
            // giving the i-th entry of the destination vector
            acc_max = negative_infinity();
            for (iota in u[i]:u[i+1]-1) {
                j = v[iota];  // j is column index
                acc_max = fmax(acc_max, w[iota] + b[j]);
            }
            if (acc_max <= negative_infinity()) {
                dst[i] = acc_max;
            } else {
                acc = 0.0;
                for (iota in u[i]:u[i+1]-1) {
                    j = v[iota];  // j is column index
                    acc += exp(w[iota] + b[j] - acc_max);
                }
                dst[i] = log(acc) + acc_max;
            }
        }
        return dst;
    }
}


data {
    int<lower=2> T; // number of years
    int<lower=1> N; // number of individuals
    int<lower=1,upper=4> y[N,T]; // capture histories
}

transformed data {
    int N_STATES = 3;
    int STATE_JUVENILE = 1;
    int STATE_ADULT = 2;
    int STATE_DEAD = 3;

    int N_OBS_TYPES = 4;
    int OBS_DURING_BREEDING_ONLY = 1;
    int OBS_OUTSIDE_BREEDING_ONLY = 2;
    int OBS_BOTH_PERIODS = 3;
    int OBS_NEITHER_PERIOD = 4;

    int N_TRANSITIONS = 6;
    int tr_cols[N_TRANSITIONS];
    int tr_ptrs[N_STATES + 1];

    // the sparse matrix format requires
    // transitions to be grouped by destination state

    tr_ptrs[STATE_JUVENILE] = 1;
    tr_cols[1] = STATE_JUVENILE; // juvenile -> juvenile

    tr_ptrs[STATE_ADULT] = 2;
    tr_cols[2] = STATE_JUVENILE; // juvenile -> adult
    tr_cols[3] = STATE_ADULT; // adult -> adult

    tr_ptrs[STATE_DEAD] = 4;
    tr_cols[4] = STATE_JUVENILE; // juvenile -> dead
    tr_cols[5] = STATE_ADULT; // adult -> dead
    tr_cols[6] = STATE_DEAD; // dead -> dead

    tr_ptrs[STATE_DEAD + 1] = 7;
}


parameters {
    // survival probabilities
    real<lower=0,upper=1> psi1;  // juveniles
    real<lower=0,upper=1> psi2;  // adults
    // recruitment probability
    real<lower=0,upper=1> beta;
    // detection probabilities
    real<lower=0,upper=1> q1;  // juveniles outside breeding season
    real<lower=0,upper=1> q2;  // adults outside br. season
    real<lower=0,upper=1> p;  // adults during br. season
}


transformed parameters {
    // transmission probabilities
    // these are encoded as log-probabilities into a log-domain
    // sparse matrix.

    vector[N_TRANSITIONS] tr_data;

    // the sparse matrix format requires
    // transitions to be grouped by destination state

    tr_data[1] = log(psi1*(1-beta)); // juvenile -> juvenile
    tr_data[2] = log(psi1*beta); // juvenile -> adult
    tr_data[3] = log(psi2); // adult -> adult
    tr_data[4] = log(1-psi1); // juvenile -> dead
    tr_data[5] = log(1-psi2); // adult -> dead
    tr_data[6] = log(1); // dead -> dead

    //define log_phi -- matrix of emission log-probabilities
    matrix[N_OBS_TYPES, N_STATES] log_phi;
    log_phi[1:N_OBS_TYPES, STATE_JUVENILE] = [log(0), log(q1), log(0), log(1-q1) ]';
    log_phi[1:N_OBS_TYPES, STATE_ADULT] = [log(p*(1-q2)), log((1-p)*q2), log(p*q2), log((1-p)*(1-q2))]';
    log_phi[1:N_OBS_TYPES, STATE_DEAD] = [log(0), log(0), log(0), log(1) ]';
}


model {
    // default priors
    // likelihood
    matrix[T, N_STATES] gamma;
    for (n in 1:N) { // loop over individuals
        // initialise forward algorithm at t = 1
        gamma[1, 1:N_STATES] = log_phi[y[n,1], 1:N_STATES];

        // recursion
        for (t in 2:T) {
            // multiply by sparse log-domain transition matrix
            gamma[t, 1:N_STATES] = log_csr_matrix_times_vector(N_STATES, N_STATES, tr_data, tr_cols, tr_ptrs, gamma[t-1, 1:N_STATES]')';
            // multiply by emission probability conditional on hidden state
            gamma[t, 1:N_STATES] += log_phi[y[n, t], 1:N_STATES];
        }
        target += log_sum_exp(gamma[T]);
    }
}
2 Likes

Hi everyone,

Is there a solution yet for the forward algorithm apparently requiring the product of probabilities instead of the sum of the log probabilities? I am currently running into this issue with a complex HMM and would prefer to sum log probabilities.

Thanks!

Matt

I’m working on a similar problem right now. Although I’ m thinking about it in terms and using the language of multi-state mark-recapture models; it’s largely the same thing as a HMM. Maybe you can give me some details of your problem and I can describe the solution I’ve hit upon. I’m not sure if it will be directly comparable. I have a large state matrix but relatively few capture occasions.

Hey Dalton,

I can share code, but it’s quite long. The key thing is that it works fine if I update the acc and gamma marginal probabilities by simply multiplying the probabilities (prob1 * prob2). But if I try to do it on the log scale ( log(prob1) + log(prob2)) and instead of summing probs (sum(prob1, prob2) I use log_sum_exp(prob1, prob2) I get a Gradient evaluated at the initial value is not finite error.

Does this sound familiar for what you may have developed a solution for?

Thanks,

Matt

Is there a solution yet for the forward algorithm apparently requiring the product of probabilities instead of the sum of the log probabilities? I am currently running into this issue with a complex HMM and would prefer to sum log probabilities.

The solution, which can be more annoying to code but theoretically can be more efficient (but cannot necessarily take advantage of slick matrix multiplication approaches, when available), is to special-case scenarios where the probability is zero in the forward algorithm. You don’t actually need to evaluate or store the forward probabilities associated with these cases; you can avoid them via the control flow and store only forward probabilities associated state sequences with nonzero probability mass. For an example in the context of multiseason occupancy models, see the Stan code output by cat(flocker:::make_forward_colex).

This is something I’m actively working on and I’ve got it working for fairly simple toy model, which I’m now expanding. It’s based off of this model: https://academic.oup.com/biometrics/article/76/3/900/7429143, where the structure of the state transition matrix is upper triangular - but that shouldn’t matter too much. The key thing is that within this model I know where the zeroes are in the transition matrix. That’s fairly simple with it being upper triangular, but if you’ve got a more sparse and square transition matrix, you might need to have a function tracking indices so that you’re avoiding log(0).

I’m doing a kind of matrix multiplication, but with the transition matrices stored as a vector, and some utility functions for selecting indices. In testing this is as fast or faster than my approach of using matrix multiplication on the natural scale.
TSCJS_Simple_Multistate_Single_Release.stan (10.1 KB)

I’ve got to log off for the evening, but dropping my toy model here. I’ll upload some code for simulation tomorrow.

2 Likes

Hi Jacob,

Thanks for your input. I have been doing that… the code below works fine:

    for (t in (first[i] + 1):n_prim) {
      for (s in 1:3) {
        for (k in 1:n_sec) {
          for (o in 1:3) {
            acc.2[k][o] = tpm.2[i, t, k][s, o];
            if (acc.2[k][o] > 0.0) {
              for (l in 1:n_diag) {
                if (acc.2[k][o] > 0.0) {
                  acc.2[k][o] *= tpm.3[i, t, k][o, y[i, t, k, l]];
                }
              } // l
            }
          } // o
        } // k
        for (ss in 1:3) {
          if (gamma[t - 1][ss] == 0.0) {
             acc.1[ss] = 0.0;
          } else {
          acc.1[ss] = gamma[t - 1][ss] * tpm.1[i, t][ss, s];
          if (acc.1[ss] > 0.0) {
            for (k in 1:n_sec) {
              if (acc.1[ss] > 0.0) {
                acc.1[ss] *= sum(acc.2[k]);
              }
            } // k
          }
        } // ss
        gamma[t][s] = sum(acc.1);
      } // s
      target += log(sum(gamma[n_prim]))
    }

And this doesn’t (perhaps I made an error somewhere):

    for (t in (first[i] + 1):n_prim) {
      for (s in 1:3) {
        for (k in 1:n_sec) {
          for (o in 1:3) {
            acc.2[k][o] = log(tpm.2[i, t, k][s, o]);
            if (!is_inf(acc.2[k][o])) {
              for (l in 1:n_diag) {
                if (!is_inf(acc.2[k][o])) {
                  acc.2[k][o] += log(tpm.3[i, t, k][o, y[i, t, k, l]]);
                }
              } // l
            }
          } // o
        } // k
        for (ss in 1:3) {
          if (is_inf(gamma[t - 1][ss])) {
            acc.1[ss] = negative_infinity();
          } else {
            acc.1[ss] = gamma[t - 1][ss] + log(tpm.1[i, t][ss, s]);
            if (!is_inf(acc.1[ss]))) {
              for (k in 1:n_sec) {
                if (!is_inf(acc.1[ss])) {
                  acc.1[ss] += log_sum_exp(acc.2[k]);
                }
              } // k
            }
          }
        } // ss
        gamma[t][s] = log_sum_exp(acc.1);
      } // s
      target += log_sum_exp(gamma[n_prim])
    }

Do you see anything?

What error or misbehavior do you see with the code that doesn’t work?

Hi,

I get the Gradient evaluated at the initial value is not finite error. However, I don’t think I’m actually doing the right via control flow. So I’ll keep working on that.

Hey again,

I was able to compute the likelihood for each individual without zeros. I did this by only computing the marginal likelihood of the alive states until the last primary that individuals were observed. When individuals were not observed in a secondary, the diagnostic (data) process was excluded and only the “not captured” probability (1 - p) was propagated. Only after individuals weren’t captured was the dead state updated. I was also able to use matrix multiplication for lots of it. Here’s the code which works great in a partial sum:

real partial_sum_lpmf(data array[] int seq_ind, int start, int end,
                        data array[,,,] int y, data array[,,] int q, data int n_s, data int n_o, data int n_d, data int n_prim, data int n_sec, data array[] int first, data array[] int last, data array[] int first_sec, data vector tau,
                        real eta_a, vector phi_a, real phi_b, vector psi_a, vector p_a, vector r, vector fr, matrix m, array[] matrix n) {

      // number of individuals in partial sum and initialise partial target
      int n_ind = end - start + 1;
      real ptarget = 0;
      
      // parameters
      real eta;
      array[n_s - 1] vector[n_prim - 1] phi, psi;
      array[n_s - 1] matrix[n_sec, n_prim] p;
      tuple(vector[n_prim], matrix[n_sec, n_prim]) delta;
      
      // transform intercepts
      vector[n_s - 1] log_phi_a = log(phi_a);
      
      // transition rate/probability matrices (trm/tpm)
      tuple(array[n_prim - 1] matrix[n_s, n_s],
            array[n_prim, n_sec] matrix[n_s - 1, n_o],
            array[n_prim, n_sec] matrix[n_o - 1, n_d - 1]) tpm;
      array[n_prim - 1] matrix[n_s, n_s] trm;
      
      // initialise probabilities
      array[n_prim] matrix[n_o - 1, n_sec] acc;
      matrix[n_s, n_prim] gamma;
      
      // for each individual
      for (i in 1:n_ind) {
        
        // index that maps to original n_ind
        int ii = i + start - 1;
        
        // initialise marginal probabilities
        gamma = rep_matrix(1, n_sec, n_prim);
        
        // probability of entering as infected
        eta = eta_a;
        
        // marginal probability of alive states at first capture
        gamma[1:2, first[ii]] = [ 1 - eta, eta ]';
        
        // mortality rates
        phi[1] = rep_vector(phi_a[1], n_prim - 1);
        phi[2] = exp(log_phi_a[2] + phi_b * m[1:(n_prim - 1), ii]);
        
        // infection state transition rates
        psi[1] = rep_vector(psi_a[1], n_prim - 1);
        psi[2] = rep_vector(psi_a[2], n_prim - 1);
        
        // for successive primary occasions
        for (t in first[ii]:(n_prim - 1)) {
          
          // ecological trm
          trm[t][1, 1] = -(psi[1][t] + phi[1][t]);
          trm[t][1, 2] = psi[2][t];
          trm[t][2, 1] = psi[1][t];
          trm[t][2, 2] = -(psi[2][t] + phi[2][t]);
          trm[t][3, 1] = phi[1][t];
          trm[t][3, 2] = phi[2][t];
          trm[t][:, 3] = zeros_vector(3);
          
          // ecological tpm
          tpm.1[t] = matrix_exp(trm[t] * tau[t]);
          
        } // t
        
        // individual and sample infection detection probabilities
        delta.1 = 1 - pow(1 - r[1], m[:, ii]);
        delta.2 = 1 - pow(1 - r[2], n[ii]);
        
        // detection probabilities (fixed at 1 for secondary of first capture)
        for (s in 1:(n_s - 1)) {
          p[s] = rep_matrix(p_a[s], n_sec, n_prim);
          p[s][first_sec[ii], first[ii]] = 1;
        } // s
        
        // for each primary occasion
        for (t in first[ii]:n_prim) {
          
          // for each secondary occasion
          for (k in 1:n_sec) {
            
            // observation tpm
            tpm.2[t, k][1, 1] = p[1][k, t] * (1 - fr[1]);
            tpm.2[t, k][1, 2] = p[1][k, t] * fr[1];
            tpm.2[t, k][1, 3] = 1 - p[1][k, t];
            tpm.2[t, k][2, 1] = p[2][k, t] * (1 - delta.1[t]);
            tpm.2[t, k][2, 2] = p[2][k, t] * delta.1[t];
            tpm.2[t, k][2, 3] = 1 - p[2][k, t];
            
            // if individual was not detected
            if (q[ii, t, k] == 0) {
              
              // marginal detection probabilities
              gamma[1:2, t] .*= tpm.2[t, k][:, 3];
              
            } else {
              
              // diagnostic tpm
              tpm.3[t, k][1, 1] = 1 - fr[2];
              tpm.3[t, k][1, 2] = fr[2];
              tpm.3[t, k][2, 1] = 1 - delta.2[k, t];
              tpm.3[t, k][2, 2] = delta.2[k, t];
              
              // probability of each alive observed state conditioned on data
              for (o in 1:(n_o - 1)) {
                acc[t][o, k] = prod(tpm.3[t, k][o, y[ii, t, k]]);
              } // o
              
              // marginal probability after each secondary for each alive ecological state
              gamma[1:2, t] .*= tpm.2[t, k][1:2, 1:2] * acc[t][1:2, k];
              
            }
            
          } // k
        } // t
        
        // marginal probability of alive states between first and last primary with captures
        if (first[ii] < last[ii]) {
          for (t in (first[ii] + 1):last[ii]) {
            gamma[1:2, t] .*= tpm.1[t - 1][1:2, 1:2] * gamma[1:2, t - 1];
          } // t
        }
        
        // if last captured in the last primary
        if (last[ii] == n_prim) {
          
          // increment target with marginal probabilities of alive states
          ptarget += log(sum(gamma[1:2, n_prim]));
          
        } else {
          
          // marginal probabilities of all ecological states in primary after last capture
          gamma[1:2, last[ii] + 1] .*= tpm.1[last[ii]][1:2, 1:2] * gamma[1:2, last[ii]];
          gamma[3, last[ii] + 1] = tpm.1[last[ii]][3, 1:2] * gamma[1:2, last[ii]];
          
          // if first occasion after last capture is the last primary
          if ((last[ii] + 1) == n_prim) {
            
            // increment target with marginal probabilities of all ecological states
            ptarget += log(sum(gamma[:, n_prim]));
            
          } else {
            
            // marginal probabilities of all ecological states until last primary
            for (t in (last[ii] + 2):n_prim) {
              gamma[:, t] .*= tpm.1[t - 1] * gamma[:, t - 1];
            } // t
            
            // increment target with marginal probabilities of all ecological states
            ptarget += log(sum(gamma[:, n_prim]));
          }
        }
        
      } // i
      
      return(ptarget);
      
    }

I once again tried to translate this to log scale. As far as I can tell, I just need to sum the log probs instead of multiply, log_sum_exp() the log probs instead of sum, and write a function that does matrix multiplication for log matrices (that is, compute log(AB) from log(A) and log(B)), which is here (note I wrote two other ones that take (row_)vectors:

  /**
   * Compute log(A * B) from log(A) and log(B)
   *
   * @param a:  Logarithm of matrix A
   * @param b:  Logarithm of matrix B
   *
   * @return  Logarithm of matrix A * B
   */
  matrix log_mat_prod(matrix a, matrix b) {
    int x = rows(a);
    int y = cols(b);
    matrix[x, y] c;
    for (i in 1:x) {
      for (j in 1:y) {
        c[i, j] = log_sum_exp(row(a, i)' + col(b, j));
      }
    }
    return(c);
  }

Now, it’s easy to translate the first code block to work with logs. Note that I do initialise gamma with 0s (and 1s in the first version) to make easier to iterate gamma in the observation process (i.e., gamma[1:2, t] += log(tpm.2[t, k][:, 3]); however, none of these 0 values ever get passed to (p )target, but perhaps it’s still an issue?

real partial_sum_lpmf(data array[] int seq_ind, int start, int end,
                        data array[,,,] int y, data array[,,] int q, data int n_s, data int n_o, data int n_d, data int n_prim, data int n_sec, data array[] int first, data array[] int last, data array[] int first_sec, data vector tau,
                        real eta_a, vector phi_a, real phi_b, vector psi_a, vector p_a, vector r, vector fr, matrix m, array[] matrix n) {

      // number of individuals in partial sum and initialise partial target
      int n_ind = end - start + 1;
      real ptarget = 0;
      
      // parameters
      real eta;
      array[n_s - 1] vector[n_prim - 1] phi, psi;
      array[n_s - 1] matrix[n_sec, n_prim] p;
      tuple(vector[n_prim], matrix[n_sec, n_prim]) delta;
      
      // transform intercepts
      vector[n_s - 1] log_phi_a = log(phi_a);
      
      // transition rate/probability matrices (trm/tpm)
      tuple(array[n_prim - 1] matrix[n_s, n_s],
            array[n_prim, n_sec] matrix[n_s - 1, n_o],
            array[n_prim, n_sec] matrix[n_o - 1, n_d - 1]) tpm;
      array[n_prim - 1] matrix[n_s, n_s] trm;
      
      // initialise probabilities
      array[n_prim] matrix[n_o - 1, n_sec] acc;
      matrix[n_s, n_prim] gamma;
      
      // for each individual
      for (i in 1:n_ind) {
        
        // index that maps to original n_ind
        int ii = i + start - 1;
        
        // initialise marginal log probabilities
        gamma = rep_matrix(0, n_sec, n_prim);
        
        // probability of entering as infected
        eta = eta_a;
        
        // marginal log probability of alive states at first capture
        gamma[1:2, first[ii]] = log([ 1 - eta, eta ]');
        
        // mortality rates
        phi[1] = rep_vector(phi_a[1], n_prim - 1);
        phi[2] = exp(log_phi_a[2] + phi_b * m[1:(n_prim - 1), ii]);
        
        // infection state transition rates
        psi[1] = rep_vector(psi_a[1], n_prim - 1);
        psi[2] = rep_vector(psi_a[2], n_prim - 1);
        
        // for successive primary occasions
        for (t in first[ii]:(n_prim - 1)) {
          
          // ecological trm
          trm[t][1, 1] = -(psi[1][t] + phi[1][t]);
          trm[t][1, 2] = psi[2][t];
          trm[t][2, 1] = psi[1][t];
          trm[t][2, 2] = -(psi[2][t] + phi[2][t]);
          trm[t][3, 1] = phi[1][t];
          trm[t][3, 2] = phi[2][t];
          trm[t][:, 3] = zeros_vector(3);
          
          // ecological tpm
          tpm.1[t] = matrix_exp(trm[t] * tau[t]);
          
        } // t
        
        // individual and sample infection detection probabilities
        delta.1 = 1 - pow(1 - r[1], m[:, ii]);
        delta.2 = 1 - pow(1 - r[2], n[ii]);
        
        // detection probabilities (fixed at 1 for secondary of first capture)
        for (s in 1:(n_s - 1)) {
          p[s] = rep_matrix(p_a[s], n_sec, n_prim);
          p[s][first_sec[ii], first[ii]] = 1;
        } // s
        
        // for each primary occasion
        for (t in first[ii]:n_prim) {
          
          // for each secondary occasion
          for (k in 1:n_sec) {
            
            // observation tpm
            tpm.2[t, k][1, 1] = p[1][k, t] * (1 - fr[1]);
            tpm.2[t, k][1, 2] = p[1][k, t] * fr[1];
            tpm.2[t, k][1, 3] = 1 - p[1][k, t];
            tpm.2[t, k][2, 1] = p[2][k, t] * (1 - delta.1[t]);
            tpm.2[t, k][2, 2] = p[2][k, t] * delta.1[t];
            tpm.2[t, k][2, 3] = 1 - p[2][k, t];
            
            gamma[1:2, t] = zeros_vector(2);
            
            // if individual was not detected
            if (q[ii, t, k] == 0) {
              
              // marginal detection probabilities
              gamma[1:2, t] += log(tpm.2[t, k][:, 3]);
              
            } else {
              
              // diagnostic tpm
              tpm.3[t, k][1, 1] = 1 - fr[2];
              tpm.3[t, k][1, 2] = fr[2];
              tpm.3[t, k][2, 1] = 1 - delta.2[k, t];
              tpm.3[t, k][2, 2] = delta.2[k, t];
              
              // probability of each alive observed state conditioned on data
              for (o in 1:(n_o - 1)) {
                acc[t][o, k] = sum(log(tpm.3[t, k][o, y[ii, t, k]]));
              } // o
              
              // marginal probability after each secondary for each alive ecological state
              gamma[1:2, t] += log_mat_prod(log(tpm.2[t, k][1:2, 1:2]), acc[t][1:2, k]);
              
            }
            
          } // k
        } // t
        
        // marginal probability of alive states between first and last primary with captures
        if (first[ii] < last[ii]) {
          for (t in (first[ii] + 1):last[ii]) {
            gamma[1:2, t] += log_mat_prod(log(tpm.1[t - 1][1:2, 1:2]), gamma[1:2, t - 1]);
          } // t
        }
        
        // if last captured in the last primary
        if (last[ii] == n_prim) {
          
          // increment target with marginal probabilities of alive states
          ptarget += log_sum_exp(gamma[1:2, n_prim]);
          
        } else {
          
          // marginal probabilities of all ecological states in primary after last capture
          gamma[1:2, last[ii] + 1] += log_mat_prod(log(tpm.1[last[ii]][1:2, 1:2]), gamma[1:2, last[ii]]);
          gamma[3, last[ii] + 1] = log_mat_prod(log(tpm.1[last[ii]][3, 1:2]), gamma[1:2, last[ii]]);
          
          // if first occasion after last capture is the last primary
          if ((last[ii] + 1) == n_prim) {
            
            // increment target with marginal probabilities of all ecological states
            ptarget += log_sum_exp(gamma[:, n_prim]);
            
          } else {
            
            // marginal probabilities of all ecological states until last primary
            for (t in (last[ii] + 2):n_prim) {
              gamma[:, t] += log_mat_prod(log(tpm.1[t - 1]), gamma[:, t - 1]);
            } // t
            
            // increment target with marginal probabilities of all ecological states
            ptarget += log_sum_exp(gamma[:, n_prim]);
          }
        }
        
      } // i
      
      return(ptarget);
      
    }

I guess my issue comes down to this: when can you have a 0 in gamma and when can’t you?

I’m leaving this here in case it helps someone else, but I figured it out. Turns out that at no point there can be a 0 in gamma, which was still happening with my initial implementation. From the moment you start populating gamma for each state and primary occasion, there has to be a non-zero value. The only other change with running the sampler is that not setting inits has at least one chain going a bit crazy, so I set init = 0.1 and that seemed to do the trick. I’m not sure why this was happening. Anyway, here it is. FWIW, in this particular simulation, the probability version had higher bulk ESS summed across parameters but tail ESS was almost the same.

/**
   * Compute partial sum over individuals of forward algorithm for hidden Markov model (HMM)
   *
   * @return  Log probability for subset of individuals
   */
  real partial_sum_lpmf(
    data array[] int seq_ind, int start, int end,
    data array[,,,] int y, data array[,,] int q, data int n_s, data int n_o, data int n_d, data int n_prim, data int n_sec, data array[] int first, data array[] int last, data array[] int first_sec, data vector tau,
    real eta_a, vector phi_a, real phi_b, vector psi_a, vector p_a, vector r, vector fr, matrix m, array[] matrix n) {
      
      // number of individuals in partial sum and initialise partial target
      int n_ind = end - start + 1;
      real ptarget = 0;
      
      // parameters
      real eta;
      array[n_s - 1] vector[n_prim - 1] phi, psi;
      array[n_s - 1] matrix[n_sec, n_prim] p;
      tuple(vector[n_prim], matrix[n_sec, n_prim]) delta;
      
      // transform intercepts
      vector[n_s - 1] log_phi_a = log(phi_a);
      
      // transition rate/probability matrices (trm/tpm)
      tuple(array[n_prim - 1] matrix[n_s, n_s - 1],
      array[n_prim, n_sec] matrix[n_s - 1, n_o],
      array[n_prim, n_sec] matrix[n_o - 1, n_d - 1]) log_tpm;
      array[n_prim - 1] matrix[n_s, n_s] trm;
      
      // for each individual
      for (i in 1:n_ind) {
        
        // initialise probabilities
        array[n_prim] matrix[n_o - 1, n_sec] acc;
        matrix[n_s, n_prim] gamma;
        
        // index that maps to original n_ind
        int ii = i + start - 1;
        
        // probability of entering as infected
        eta = eta_a;
        
        // mortality rates
        phi[1] = rep_vector(phi_a[1], n_prim - 1);
        phi[2] = exp(log_phi_a[2] + phi_b * m[1:(n_prim - 1), ii]);
        
        // infection state transition rates
        psi[1] = rep_vector(psi_a[1], n_prim - 1);
        psi[2] = rep_vector(psi_a[2], n_prim - 1);
        
        // for successive primary occasions
        for (t in first[ii]:(n_prim - 1)) {
          
          // ecological trm
          trm[t][1, 1] = -(psi[1][t] + phi[1][t]);
          trm[t][1, 2] = psi[2][t];
          trm[t][2, 1] = psi[1][t];
          trm[t][2, 2] = -(psi[2][t] + phi[2][t]);
          trm[t][3, 1] = phi[1][t];
          trm[t][3, 2] = phi[2][t];
          trm[t][:, 3] = zeros_vector(3);
          
          // ecological tpm
          log_tpm.1[t] = log(matrix_exp(trm[t] * tau[t])[:, 1:2]);
          
        } // t
        
        // individual and sample infection detection probabilities
        delta.1 = 1 - pow(1 - r[1], m[:, ii]);
        delta.2 = 1 - pow(1 - r[2], n[ii]);
        
        // detection probabilities (fixed at 1 for secondary of first capture)
        for (s in 1:(n_s - 1)) {
          p[s] = rep_matrix(p_a[s], n_sec, n_prim);
          p[s][first_sec[ii], first[ii]] = 1;
        } // s
        
        // for each primary occasion
        for (t in first[ii]:n_prim) {
          
          // for each secondary occasion
          for (k in 1:n_sec) {
            
            // observation tpm
            log_tpm.2[t, k][1, 1] = log(p[1][k, t]) + log1m(fr[1]);
            log_tpm.2[t, k][1, 2] = log(p[1][k, t]) + log(fr[1]);
            log_tpm.2[t, k][1, 3] = log1m(p[1][k, t]);
            log_tpm.2[t, k][2, 1] = log(p[2][k, t]) + log1m(delta.1[t]);
            log_tpm.2[t, k][2, 2] = log(p[2][k, t]) + log(delta.1[t]);
            log_tpm.2[t, k][2, 3] = log1m(p[2][k, t]);
            
            // diagnostic tpm
            log_tpm.3[t, k][1, 1] = log1m(fr[2]);
            log_tpm.3[t, k][1, 2] = log(fr[2]);
            log_tpm.3[t, k][2, 1] = log1m(delta.2[k, t]);
            log_tpm.3[t, k][2, 2] = log(delta.2[k, t]);
            
            // probability of each alive observed state conditioned on data
            if (q[ii, t, k] == 1) {
              for (o in 1:(n_o - 1)) {
                acc[t][o, k] = sum(log_tpm.3[t, k][o, y[ii, t, k]]);
              }
            }
            
          } // k
          
          // initialise marginal probability with first secondary (required to avoid gamma = 0)
          if (q[ii, t, 1] == 0) {
            gamma[1:2, t] = log_tpm.2[t, 1][:, 3];
          } else {
            gamma[1:2, t] = log_mat_prod(log_tpm.2[t, 1][:, 1:2], acc[t][:, 1]);
          }
          
          // for successive primaries
          for (k in 2:n_sec) {
            
            // if individual was not detected
            if (q[ii, t, k] == 0) {
              
              // marginal detection probabilities
              gamma[1:2, t] += log_tpm.2[t, k][:, 3];
              
            } else {
              
              // marginal probability after each secondary for each alive ecological state
              gamma[1:2, t] += log_mat_prod(log_tpm.2[t, k][:, 1:2], acc[t][:, k]);
              
            }
            
          } // k
        } // t
        
        // marginal log probability of alive states at first capture
        gamma[1:2, first[ii]] += [ log1m(eta), log(eta) ]';
        
        // marginal probability of alive states between first and last primary with captures
        if (first[ii] < last[ii]) {
          for (t in (first[ii] + 1):last[ii]) {
            gamma[1:2, t] += log_mat_prod(log_tpm.1[t - 1][1:2, :], gamma[1:2, t - 1]);
          } // t
        }
        
        // if last captured in the last primary
        if (last[ii] == n_prim) {
          
          // increment target with marginal probabilities of alive states
          ptarget += log_sum_exp(gamma[1:2, n_prim]);
          
        } else {
          
          // marginal probabilities of all ecological states in primary after last capture
          gamma[1:2, last[ii] + 1] += log_mat_prod(log_tpm.1[last[ii]][1:2, :], gamma[1:2, last[ii]]);
          gamma[3, last[ii] + 1] = log_mat_prod(log_tpm.1[last[ii]][3, :], gamma[1:2, last[ii]]);
          
          // if first occasion after last capture is the last primary
          if ((last[ii] + 1) == n_prim) {
            
            // increment target with marginal probabilities of all ecological states
            ptarget += log_sum_exp(gamma[:, n_prim]);
            
          }
          else {
            
            // marginal probabilities of all ecological states until last primary
            for (t in (last[ii] + 2):n_prim) {
              gamma[1:2, t] += log_mat_prod(log_tpm.1[t - 1][1:2, :], gamma[1:2, t - 1]);
              gamma[3, t] = log_sum_exp(log_mat_prod(log_tpm.1[t - 1][3, :], gamma[1:2, t - 1]), gamma[3, t - 1]);
            } // t
            
            // increment target with marginal probabilities of all ecological states
            ptarget += log_sum_exp(gamma[:, n_prim]);
          }
        }
        
      } // i
      
      return(ptarget);
      
    }
2 Likes