Avoiding label switching in a capture-recapture model

Reviving this topic to ask a new label-switching question!

I implemented in Stan the latent class model for capture-recapture described in this paper and implemented (without Stan) in the LCMCR package. The model includes a latent class indicator

z \sim \text{Categorical}((1,...,K);(\pi_1,...,\pi_K))

which determines the list recording probabilities \lambda_{jz}.

y_j\ |\ z \sim \text{Bern}(\lambda_{jz}) \text{ independent across lists }j

where y_j = 1 if the individual is recorded in list j and y_j = 0 otherwise. The capture-recapture goal is to estimate how many are never recorded on any list, and therefore the population total, N.

As recommended in the Stan User’s Guide, I marginalize out z.
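
Writing it out here for reference, marginalizing z gives the probability of a capture pattern y = (y_1, \dots, y_J) as

\Pr(y) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \lambda_{jk}^{y_j} (1 - \lambda_{jk})^{1 - y_j}

which is what the log_sum_exp over log_pi[k] + bernoulli_lpmf(list_indicators[c] | lambda[k]) computes on the log scale in the code below.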

But without ordering the list recording probabilities \lambda_{jz}, there is a label-switching issue. For K = 3 latent classes, for example, here are the results, which show good Rhat for my quantity of interest (N, the total population size) and for lp__, but poor Rhat for the mixing proportions pi:

> fit$summary(c("N","pi","alpha","lp__"))
# A tibble: 6 x 10
  variable       mean     median      sd      mad          q5        q95  rhat ess_bulk ess_tail
  <chr>         <dbl>      <dbl>   <dbl>    <dbl>       <dbl>      <dbl> <dbl>    <dbl>    <dbl>
1 N          9079.      8915.    853.    753.       8010.      10710.     1.00  1213.     1771. 
2 pi[1]         0.353      0.331   0.328   0.467       0.0185      0.704  1.74     6.09    118. 
3 pi[2]         0.256      0.137   0.268   0.181       0.0185      0.692  2.10     5.25     28.3
4 pi[3]         0.391      0.303   0.168   0.0272      0.269       0.696  1.53     7.11     30.8
5 alpha         2.05       1.59    1.73    1.18        0.342       5.48   1.14    18.1      36.2
6 lp__     -11271.    -11271.      4.20    4.15   -11278.     -11265.     1.01  2550.     4510. 

I think I would only need to order them for one list (e.g. list 1, without loss of generality), so that the labels are assigned such that the first latent class is the least likely to appear in list 1, the second latent class is the second-least likely, and so on.

Does that make sense? I'm not sure how to accomplish this syntactically. My current (unordered) model is below.

data {
  int<lower=1> J; // number of lists
  int<lower=1> C; // number of observed cells in the dataset, up to 2^J-1
  int<lower=0, upper=1> list_indicators[C, J]; // 0/1 indicators of inclusion in each list
  vector<lower=0>[C] cell_count; // cell count for each capture pattern
  int<lower=1> K; // number of latent classes
}
transformed data {
  real<lower=0> observed = sum(cell_count);
  int zeros[J] = rep_array(0,J);
}
parameters {
  vector<lower=0,upper=1>[J] lambda[K]; // list inclusion probabilities for each latent class
  vector<lower=0,upper=1>[K - 1] breaks; // break proportions for stick-breaking prior on pi
  real<lower=observed> N;
  real<lower=0> alpha; // stick-breaking prior parameter
}
transformed parameters {
  simplex[K] pi; // latent class mixing proportions
  vector[C] log_cell_probability; // log cell probability for each observed capture pattern
  real log_unobserved_cell_probability;
  // https://mc-stan.org/docs/2_26/stan-users-guide/arithmetic-precision.html#underflow-and-the-log-scale
  vector[K] log_pi;  // cache log calculation
  vector[K] lps_unobserved;
  { // https://discourse.mc-stan.org/t/better-way-of-modelling-stick-breaking-process/2530/2
  // see BDA3 p.548
    real sum_sticks_so_far = 0;
    pi[1] = breaks[1];
    sum_sticks_so_far = pi[1];
    for (k in 2:(K - 1)) {
      pi[k] = (1 - sum_sticks_so_far) * breaks[k];
      sum_sticks_so_far += pi[k];
    }
    pi[K] = 1 - sum_sticks_so_far;
  }
  log_pi = log(pi);  // cache log calculation
  lps_unobserved = log_pi;
  for (c in 1:C) {
    vector[K] lps = log_pi;
    for (k in 1:K) {
      lps[k] += bernoulli_lpmf(list_indicators[c] | lambda[k]); // https://mc-stan.org/docs/2_26/functions-reference/vectorization.html#evaluating-vectorized-log-probability-functions
    }
    log_cell_probability[c] = log_sum_exp(lps);
  }
  for (k in 1:K) {
    lps_unobserved[k] += bernoulli_lpmf(zeros | lambda[k]);
  }
  log_unobserved_cell_probability = log_sum_exp(lps_unobserved);
} 
model {
  // observed-data likelihood: choose which of the N individuals are observed,
  // times the never-recorded probability for the N - observed missing individuals,
  // times the cell probabilities for the observed capture patterns
  target += lchoose(N, observed)
            + (N - observed) * log_unobserved_cell_probability
            + cell_count' * log_cell_probability;
  target += -log(N); // scale-invariant prior p(N) proportional to 1/N
  breaks ~ beta(1, alpha); // stick-breaking construction of pi
  alpha ~ gamma(0.25, 0.25);
}

I don't quite understand your model, but do you want the ordering on lambda or on pi?

parameters {
  positive_ordered[J] lambda_raw[K];
  ...
}
transformed parameters {
  simplex[J] lambda[K];
  ...
  for (k in 1:K)
    lambda[k] = lambda_raw[k] / sum(lambda_raw[k]);
}

Thanks so much, @spinkney!

The list recording probabilities \lambda_{jz} are not a simplex: they need not sum to 1. For example, someone of latent type z might have an 80% probability of being recorded in list j = 1 (so \lambda_{1z} = 0.80) and a 90% probability of being recorded in list j = 2 (so \lambda_{2z} = 0.90).

I was indeed thinking of ordering the list recording probabilities \lambda_{jz} (ordering them for one list should be sufficient for identification, right?), not the mixing proportions. See e.g. "Estimating Parameters of a Mixture" in the User's Guide: "The location parameter mu is declared to be an ordered vector in order to identify the model." They do not order the mixing proportions simplex theta.
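
For concreteness, something like this is roughly what I have in mind: only list 1 gets the across-class ordering, declared on the logit scale so that ordered can be used, and the other lists stay as before. This is just an untested sketch (the names lambda_logit_list1 and lambda_rest are mine), and I suppose I would also want an explicit prior on lambda_logit_list1, since a flat prior on the logit scale is no longer uniform on the probability scale.

parameters {
  ordered[K] lambda_logit_list1; // list-1 recording probabilities on the logit scale, ordered across latent classes
  vector<lower=0, upper=1>[J - 1] lambda_rest[K]; // recording probabilities for lists 2..J, unordered
  // ... remaining parameters (breaks, N, alpha) as before ...
}
transformed parameters {
  vector<lower=0, upper=1>[J] lambda[K]; // list inclusion probabilities for each latent class
  for (k in 1:K) {
    lambda[k][1] = inv_logit(lambda_logit_list1[k]); // class k's probability of being recorded in list 1
    for (j in 2:J)
      lambda[k][j] = lambda_rest[k][j - 1];
  }
  // ... rest of the transformed parameters block as before ...
}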

You could try an inverse logit on an ordered lambda. That way each lambda will be between 0 and 1 and have the ordering identification.

parameters {
  ordered[J] lambda_raw[K];
  ...
}
transformed parameters {
  vector[J] lambda[K];
  ...
  for (k in 1:K)
    lambda[k] = inv_logit(lambda_raw[k]);
}

I happened to implement precisely this last week and it worked great.

Glad to see you back, but I took the liberty of moving this into a new topic, as I think the connection to the original is quite loose. Also, revived topics tend to get less attention than new topics :-)

Hope you’ll be able to resolve all your issues!


Unrelated to the label switching: you might run into issues fitting this model when treating the population size N as continuous. To fit the same model that's being fit in LCMCR, you need to also marginalize out N, which is analytically possible (see Table 1 of [2101.09304] Revisiting Identifying Assumptions for Population Size Estimation). Here's code I've used previously that implements the model from LCMCR with N marginalized out:

data {
    // Data
    int<lower=0> J; // Number of conflict data sources
    int<lower=0> twoJ; // 2^J
    int<lower=0> z[twoJ - 1]; // Counts for each observed capture pattern
    int<lower=0, upper=1> x[twoJ, J]; // All 2^J capture patterns; row 1 must be the all-zero (never observed) pattern
    int<lower=1> K_star; // Maximum number of latent classes (stick-breaking truncation)

    // Prior hyperparameters for the gamma prior on alpha
    real<lower=0> a; 
    real<lower=0> b;
}
parameters 
{
    real<lower=0> alpha;
    matrix<lower=0, upper=1>[K_star, J] lambda;
    real<lower=0, upper=1> V[K_star - 1];
}
transformed parameters{
    real<upper=0> log_nu[K_star];
    real<upper=0> log_pis[twoJ];
    simplex[twoJ] pis;
    real<upper=0> log_pis_temp[K_star];
    
    // Cribbed from https://ecosang.github.io/blog/study/dirichlet-process-with-stan/
    log_nu[1] = log(V[1]);
    for(k in 2:(K_star - 1)){
      log_nu[k] = log(V[k]) + log(1 - V[k - 1]) + log_nu[k - 1] - log(V[k - 1]);
    }
    log_nu[K_star] = log(1 - V[K_star - 1]) + log_nu[K_star - 1] - log(V[K_star - 1]);
    
    for(i in 1:twoJ){
        for(k in 1:K_star){
            log_pis_temp[k] =  log_nu[k] + bernoulli_lpmf(x[i, ] | lambda[k, ]);
        }
        log_pis[i] = log_sum_exp(log_pis_temp);
        pis[i] = exp(log_pis[i]);
    }
}
model 
{
    alpha ~ gamma(a, b);
    for(k in 1:K_star){
       lambda[k, ] ~ beta(1, 1); 
    }
    V ~ beta(1, alpha);
    
    for(i in 1:(twoJ - 1)){
        target += z[i] * log_pis[i + 1]; 
    }
    
    target += -1 * sum(z) * log1m_exp(log_pis[1]);
}
generated quantities{
    int<lower=0> z_0;
    z_0 = neg_binomial_rng(sum(z), (1 - pis[1]) / pis[1]);
}

I believe what I've called x corresponds to what you call list_indicators, except that x enumerates all 2^J capture patterns, including the all-zero (never observed) one in its first row.
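
In case it helps connect the two parameterizations, my understanding of the marginalization (this is the Table 1 result from the linked paper, sketched from memory, so worth double-checking there) is that with the scale-invariant prior p(N) \propto 1/N, summing over N leaves a conditional multinomial likelihood for the observed cells together with a negative binomial posterior for the number never observed:

p(z \mid \lambda, \nu) \propto \prod_{c} \left(\frac{p_c}{1 - p_0}\right)^{z_c}, \qquad n_0 \mid z, \lambda, \nu \sim \text{NegBinomial}(n,\ 1 - p_0)

where n = \sum_c z_c is the number of observed individuals, p_c is the probability of observed capture pattern c, p_0 is the probability of never being recorded, and the negative binomial counts failures (unobserved individuals) with success probability 1 - p_0. The two target += lines are the log of the first expression, and the neg_binomial_rng call in generated quantities draws n_0 (what I called z_0) using Stan's shape/inverse-scale parameterization with beta = (1 - p_0)/p_0.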


Are the results sensitive to the choice of K_star?

Very much! Theoretically, if you use too many latent classes (i.e., K_star is too big), the model is non-identifiable and the posterior for the population size N will be multimodal in many cases (see Section 3 of the paper I linked in the last post). Practically, because you're fitting a mixture model to multivariate categorical data, all of the well-known problems with fitting mixture models apply.
