Mark-recapture-like Stan model


I am new to the Stan forum, so bear with me if I’ve posted in the wrong category; the post could just as easily fit into the General or Publicity categories. Apologies too if the post is rather long.

I am interested in employing Stan for a problem of deep interest to me that I have been working on from various angles for my PhD. Below I provide some background, as well as a proposed Stan program. However, I am having difficulty specifying the model correctly.


My work involves the estimation of sample sizes for DNA barcoding, a technique to identify species based on short regions of their DNA that has myriad applications, from monitoring invasive species to combating seafood fraud.

Researchers in this field are eager to know how many individuals of a given species to collect in order to capture a given fraction of genetic diversity that may exist for that species. Naive estimates of necessary sample sizes range from 5-10 individuals per species, but this is far from sufficient.

The problem is, we don’t actually know how much genetic diversity exists for any species of interest, because we cannot sample all specimens of that species. So, Assumption 1 is that the collected sample accurately reflects the population of interest. This is probably far from true.

To assess sampling effort, we plot “accumulation curves” for each species of interest. These curves depict the mean amount of genetic diversity found (H* on the y-axis) for every additional individual collected (N* on the x-axis). Eventually, these curves will level off to an asymptote, which signals that specimen collection should be halted because it is unlikely that any new genetic diversity will be found. However, for many species, these curves are basically linear in shape, which indicates that a lot more sampling needs to be done. I am interested in where an asymptote is likely to occur (N* on the x-axis); that is, how many individuals must be sampled to capture the majority of existing genetic diversity for a species of interest.

So, we have 3 pieces of information on hand to compute an estimate of N*:

  • the number of sampled individuals (i.e., DNA sequences) of a species of interest (N)

  • the number of haplotypes (i.e., unique DNA sequences) in our collected sample (H*)

  • the probability distribution vector of haplotypes, (probs)

As a motivating example, consider the Siamese fighting fish (Betta splendens). A DNA sequence dataset which I downloaded several years ago had the following characteristics:

  • N = 76 individuals (DNA sequences)

  • H* = 4 unique DNA sequences

  • probs = (53/76, 13/76, 9/76, 1/76) (Note: probs sum to 1)

Essentially, the data show that the majority of sampled individuals share the same unique DNA sequence. So, if we were to sample new individuals randomly from the wild, we would find that many would possess the same DNA sequence as the 53 specimens already sampled, while some individuals may have rarer DNA sequences, or even yet-unseen DNA sequences. Assumption 2 here is that unobserved DNA sequences occur at very low frequency in the population (otherwise they would have already been collected). Like Assumption 1, Assumption 2 is probably unrealistic.

In the example, to plot the accumulation curve, we sample one of the H* = 4 DNA sequences at random with replacement, getting a number 1–4 each time. The number 1 is drawn with probability 53/76, the number 2 with probability 13/76, etc. We do this N = 76 times. This scheme is repeated a large number of times (say, perms = 10000 times). The information is stored in a matrix called pop of dimensions perms × N, and the mean number of distinct haplotypes recovered is computed across permutations (H). We then compute N* via N* = (N × H*) / H. The endpoint of the curve is the point of interest, because once it hits the asymptote, sampling can be safely halted.
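For concreteness, the endpoint calculation described above can be sketched in Python with NumPy (this computes only the final point of the accumulation curve, not the whole curve; the seed and perms value are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed data from the Betta splendens example
N = 76                                   # number of sampled individuals
probs = np.array([53, 13, 9, 1]) / 76    # haplotype frequencies
Hstar = len(probs)                       # number of observed haplotypes
perms = 10000                            # number of permutations

# pop[j, i] = haplotype label (0..Hstar-1) of the i-th draw in permutation j
pop = rng.choice(Hstar, size=(perms, N), p=probs)

# H = mean number of distinct haplotypes recovered after N draws
H = np.mean([len(np.unique(row)) for row in pop])

Nstar = N * Hstar / H
print(H, Nstar)
```

With these frequencies, the rarest haplotype (frequency 1/76) is missed in roughly e⁻¹ ≈ 37% of permutations, so H falls somewhat short of H* = 4 and N* comes out somewhat above N = 76.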

My current work has implemented this in a sort of genetic algorithm framework, where the above formula for N* is iterated; that is, we set N = N* and repeat the entire process. In this way, H → H*, and clearly N → N*. I now wish to see whether the problem can be solved easily in Stan.
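The iteration just described can be sketched as follows (a minimal Python/NumPy version; the 99% stopping criterion, perms value, and iteration cap are assumptions for illustration, not the exact settings of the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Observed data from the Betta splendens example
probs = np.array([53, 13, 9, 1]) / 76   # haplotype frequencies
Hstar = len(probs)                      # observed number of haplotypes
N = 76                                  # starting sample size
perms = 1000                            # permutations per iteration
tol = 0.99                              # assumed criterion: stop once mean(H) >= tol * Hstar

for _ in range(200):                    # guard against non-convergence
    pop = rng.choice(Hstar, size=(perms, N), p=probs)
    H = np.mean([len(np.unique(row)) for row in pop])
    if H >= tol * Hstar:
        break
    N = int(np.ceil(N * Hstar / H))     # set N = N* and repeat

print(N)
```

Each pass grows N by the factor H*/H, which shrinks toward 1 as H approaches H*, so the sequence settles once the rare haplotype is reliably recovered.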


Here, I take inspiration from the mark-recapture (Lincoln-Petersen) model in the Stan manual, as the formulae have the same mathematical form. However, there are two differences:

  1. there is only 1 sampling event (as opposed to 2 in mark-recapture)

  2. we are interested in sample size, not population size

I am facing difficulty with finding the right priors for the model parameters.

The model is as follows:

data {
  int<lower=1> N;           // number of sampled individuals
  int<lower=1> Hstar;       // number of observed haplotypes
  simplex[Hstar] probs;     // observed haplotype frequencies; elements sum to 1
  int<lower=2> perms;       // number of resampling permutations
}

transformed data {
  matrix[perms, N] pop;     // haplotype label of each individual, per permutation
  vector[perms] H;          // number of distinct haplotypes seen in each permutation
  for (j in 1:perms) {
    int seen[Hstar] = rep_array(0, Hstar);
    for (i in 1:N) {
      int h = categorical_rng(probs);  // sample a haplotype index according to probs
      pop[j, i] = h;
      seen[h] = 1;
    }
    H[j] = sum(seen);
  }
}

transformed parameters {
  real<lower=N> Nstar = (N * Hstar) / mean(H);
}

model {
  Nstar ~ ???;  // what would a sensible prior be?
}

Any ideas on how to form a good prior?

I would also need an efficient way to fill the pop matrix with random positive integers generated according to probs.

Any advice is greatly appreciated and warmly welcomed.

You might be able to find some useful priors or at least a starting point here: Yackulic CB, Dodrill M, Dzul M, Sanderlin JS, Reid JA. A need for speed in Bayesian population models: a practical guide to marginalizing and recovering discrete latent states. Ecological Applications. 2020 Feb 29:e02112.


But now, upon further reading and continuing to expand my Stan program, I have found that perhaps having an empty model block is sufficient. This is because I generate random variates via categorical_rng. I am trying to debug the program at this point.


@jphill01 you might find this paper useful:

Dorazio, Robert M., et al. “Estimating species richness and accumulation by modeling species occurrence and detectability.” Ecology 87.4 (2006): 842-854.

Replace “species” with “haplotype” for the analogy. There is a slight difference, though, in that (if I follow), individuals can only have one haplotype, but communities can have multiple species.


Hey, I am sorry, I don’t really get what you are trying to do. Please correct me where I am wrong or if I misunderstood something: in your background example you have the probs (not an estimate, but the actual haplotype frequencies), so there is no need to sample pop x times since every row would have approximately the same frequencies. Computing the mean seems pointless to me as these are categories and you could assign basically any label to them (for instance just reverse the labels). If you want sufficient coverage, you’d need at least 76 samples (since one haplotype has a frequency 1/76), but x * 76 would be better.

In your Stan model you have data of N, Hstar and probs and compute Nstar from the last column of the pop matrix. There are no random variables involved at this point? Say, if you wanted to estimate the haplotype frequencies themselves (and number of haplotypes) any clustering algorithm that works on sequences would do (for instance shorah).

As you said later: your model block is therefore empty, you only want to sample categorical variables (?). You can do that with sample(seq(4), N, prob=probs, replace=TRUE) but then again: you already have the true haplotype frequencies …


Hello Simon,

Allow me to clarify.

Yes, I only have N, Hstar and probs, and I want to estimate Nstar = (N * Hstar) / mean(H), where H is the last column of the pop matrix. N = 76 serves as my best guess for the number of individuals that need to be randomly sampled to capture the majority of the haplotypes (i.e., genetic diversity), so I would need at least N = 76 specimens. Actually, each of the Hstar = 4 haplotypes has a different frequency in the sample: Haplotype 1 is represented by 53 specimens in the sample (a frequency of 53/76), Haplotype 2 by 13 (a frequency of 13/76), etc.

I have created an R package called HACSim to automate this process (since the pop matrix would need to be regenerated at each iteration). At each iteration, pop gets bigger since more individuals are sampled. The algorithm can probably best be equated to a genetic algorithm.

Actually, I consider probs to be an estimate: there are likely more haplotypes out there, but they likely occur at very low frequency (otherwise we would have already sampled them).

Right now, I guess you can say that there is really no randomness in the model, except for the categorical RNG, which is employed to randomly sample an index of probs in order to assign a haplotype to one of the 76 individuals in pop.

In the future, I’d like to encode prior information somehow, but it’s not clear to me what a good prior would be, nor which quantities to track in posterior sampling. This was the purpose of my original question.

The mean of the last column in pop changes at each iteration (because it will include more values). Eventually mean(H) will approach Hstar asymptotically, and so N will approach Nstar.

To be clear, the actual Stan program is buggy. I have posted a question with only my data and Stan code in order to tease out the bugs. If interested, feel free to check it out.


If I follow, the following unknowns could be treated as random variables/assigned priors in a Bayesian setting:

  1. The total number of haplotypes in the population: this must be greater than or equal to the observed number of haplotypes in the sample.

  2. The frequencies of each haplotype in the population: the sampled haplotype frequencies contain information about these frequencies.

If you had a posterior distribution for these unknowns, it seems like you could get a posterior distribution for N* (the number of individuals that would need to be sampled to capture some particular fraction of genetic diversity) by simulating the sampling process.

If each individual has exactly one haplotype, could the observation process be modeled as a partly observed multinomial with sample size N, where N is the number of sampled individuals? It’s not a completely observed multinomial random variable because the total number of haplotypes is unknown. In your example where four haplotypes were observed, if there were actually 5 haplotypes in the population the multinomial vector would be [53, 13, 9, 1, 0], but if there were actually 8 haplotypes in the population the multinomial vector would be [53, 13, 9, 1, 0, 0, 0, 0].
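To illustrate, here is a minimal Python/NumPy sketch of this padded-multinomial idea for one fixed hypothetical total K of haplotypes, using a symmetric Dirichlet prior on the frequencies (K and the concentration alpha are assumed values for illustration; a full model would also place a prior on K):

```python
import numpy as np

rng = np.random.default_rng(0)

counts_obs = np.array([53, 13, 9, 1])   # observed haplotype counts (N = 76)
K = 8                                    # hypothetical true number of haplotypes
alpha = 0.5                              # symmetric Dirichlet concentration (assumed)

# Pad the observed counts with zeros for the K - 4 unseen haplotypes
counts = np.concatenate([counts_obs, np.zeros(K - len(counts_obs))])

# Dirichlet-multinomial conjugacy: posterior draws of all K haplotype frequencies
post_freqs = rng.dirichlet(alpha + counts, size=4000)

# Posterior mass assigned to the unseen haplotypes
p_unseen = post_freqs[:, len(counts_obs):].sum(axis=1)
print(p_unseen.mean())
```

Feeding draws like post_freqs into the accumulation-curve simulation would then propagate the uncertainty about the unseen haplotypes into a posterior distribution for N*.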

Assuming that you want to build a probabilistic model for N*, do you have any prior information that could be used to build a model for the true number of haplotypes or the haplotype frequencies? e.g., is there some genetic argument to assume a particular prior distribution of haplotype frequencies?

This approach you suggest seems reasonable, albeit with uncertainties as to the true number of haplotypes and the true haplotype frequency distribution.

Each sampled specimen will have exactly one haplotype associated with it.

As you point out however, this is essentially a latent process. In a frequentist framework, one could possibly employ the EM algorithm through iteratively maximizing the likelihood until convergence is reached, but certainly a Bayesian framework would allow greater flexibility.

I will continue to investigate through reading related posts within the forum and elsewhere, I’ll consider your previous reply to be a solution that can help get me started.


My bad, I actually overlooked the transformed data block.

As I wrote above, a Bayesian infinite mixture model, such as the one implemented in shorah, may be what you are looking for. If you do want to implement it in Stan, however, you’d need to:

  • put a prior on the haplotype frequencies, e.g., a truncated Dirichlet process, for instance using a stick breaking construction (as Stan doesn’t work with discrete parameters),
  • define a likelihood that works on DNA sequences
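The truncated stick-breaking construction mentioned in the first bullet can be sketched as follows (Python/NumPy; the concentration alpha and truncation level K are arbitrary assumed values):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = 1.0   # DP concentration parameter (assumed)
K = 20        # truncation level (assumed)

# Truncated stick-breaking: v_k ~ Beta(1, alpha),
# w_k = v_k * prod_{j<k} (1 - v_j)
v = rng.beta(1.0, alpha, size=K)
v[-1] = 1.0   # absorb the remaining stick so the weights sum to one
w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])

print(w.sum())
```

In Stan itself, the v_k would be declared as constrained parameters and the weights w assembled in the transformed parameters block, since discrete cluster assignments must be marginalized out.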