Hello,
I am new to the Stan forum, so bear with me if I’ve posted in the wrong category; the post could just as easily fit into the General category, or even Publicity. Apologies as well if the post is rather long.
I am interested in employing Stan on a problem of deep interest to me, one I have been working on from various angles for my PhD. Below I provide some background as well as a proposed Stan program. However, I am experiencing difficulty with correctly specifying the model.
Background
My work involves the estimation of sample sizes for DNA barcoding, a technique for identifying species based on short regions of their DNA, with myriad applications ranging from monitoring invasive species to combating seafood fraud.
Researchers in this field are eager to know how many individuals of a given species to collect in order to capture a given fraction of the genetic diversity that may exist for that species. Naive estimates of necessary sample sizes range from 5 to 10 individuals per species, but this is far from sufficient.
The problem is, we don’t know how much genetic diversity actually exists for any species of interest, because we cannot sample all specimens of that species. So, Assumption 1 is that the collected sample accurately reflects the population of interest. This is probably far from true.
To assess sampling effort, we plot “accumulation curves” for each species of interest. These curves depict the mean amount of genetic diversity found (H, on the y-axis) against the number of individuals collected (N, on the x-axis). Eventually, these curves level off to an asymptote, which signals that specimen collection can be halted because it is unlikely that any new genetic diversity will be found. However, for many species, these curves are basically linear in shape, which indicates that a lot more sampling needs to be done. I am interested in where the asymptote is likely to occur on the x-axis (N*); that is, how many individuals must be sampled to capture the majority of existing genetic diversity for a species of interest.
So, we have 3 pieces of information on hand to compute an estimate of N*:
- the number of sampled individuals (i.e., DNA sequences) of a species of interest (N)
- the number of haplotypes (i.e., unique DNA sequences) in our collected sample (H*)
- the probability distribution vector of haplotypes (probs)
As a motivating example, consider the Siamese fighting fish (Betta splendens). A DNA sequence dataset which I downloaded several years ago had the following characteristics:
- N = 76 individuals (DNA sequences)
- H* = 4 unique DNA sequences
- probs = (53/76, 13/76, 9/76, 1/76) (note: the elements of probs sum to 1)
Essentially, the data show that the majority of sampled individuals share the same unique DNA sequence. So, if we were to sample new individuals randomly from the wild, we would find that many would possess the same DNA sequence as the 53 specimens already sampled, while some individuals may have rarer DNA sequences, or even as-yet-unseen DNA sequences. Assumption 2 here is that unobserved DNA sequences occur at very low frequency in the population (otherwise they would have already been collected). Like Assumption 1, Assumption 2 is probably unrealistic.
In the example, to plot the accumulation curve, we sample one of the H* = 4 DNA sequences at random with replacement, obtaining a number from 1 to 4 each time: the number 1 is drawn with probability 53/76, the number 2 with probability 13/76, etc. We do this N = 76 times to fill one row, and the whole scheme is repeated a large number of times (say, perms = 10000). The information is stored in a matrix called pop of dimensions perms x N, and the arithmetic mean is computed down the columns to give the curve (H). We then compute N* via N* = (N H*) / H, with H evaluated at the endpoint of the curve. The endpoint is the point of interest because once it hits the asymptote, sampling can be safely halted.
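As a sanity check on this scheme (separate from the model itself), the endpoint can also be computed analytically: the expected number of distinct haplotypes after N draws with replacement is E[H] = H^* - \sum_h (1 - p_h)^N. For the Betta data this gives E[H] \approx 4 - (1 - 1/76)^{76} \approx 3.63 (the terms for the three common haplotypes are negligible), so the simulation should return roughly N^* = (76 \times 4) / 3.63 \approx 84 individuals.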
My current work implements this in a sort of genetic algorithm framework, where the above formula for N* is iterated; that is, we set N = N* and repeat the entire process. In this way, H \rightarrow H*, and consequently N \rightarrow N*. I’m now wishing to see whether the problem can be solved easily in Stan.
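For concreteness, here is a rough sketch of that iteration written as Stan functions. This is only meant to show the logic I’m after: the function names, the one-individual convergence tolerance, and the real-to-integer ceiling workaround are all my own ad hoc choices.
functions {
  // Mean number of distinct haplotypes seen after n draws from probs,
  // averaged over perms replicate orderings (the endpoint of the curve).
  real endpoint_mean_rng(int n, int hstar, vector probs, int perms) {
    vector[perms] h;
    for (j in 1:perms) {
      int seen[hstar] = rep_array(0, hstar);
      for (i in 1:n)
        seen[categorical_rng(probs)] = 1;
      h[j] = sum(seen);
    }
    return mean(h);
  }
  // Fixed-point iteration N <- N * Hstar / H, stopping once the update
  // would add less than one new individual.
  real nstar_rng(int N0, int hstar, vector probs, int perms) {
    int n = N0;
    real nstar = N0;
    int done = 0;
    while (done == 0) {
      real h = endpoint_mean_rng(n, hstar, probs, perms);
      nstar = n * hstar / h;
      if (nstar < n + 1)
        done = 1;
      else
        while (n < nstar) n += 1; // integer ceiling of nstar (real -> int workaround)
    }
    return nstar;
  }
}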
Model
Here, I take inspiration from the mark-recapture (Lincoln-Petersen) model in the Stan manual, as the formulae have the same mathematical form. However, there are two differences:
- there is only 1 sampling event (as opposed to 2 in mark-recapture)
- we are interested in sample size, not population size
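For reference, the two estimators have the same shape: Lincoln-Petersen estimates population size as \hat{N} = MC / R (M individuals marked in the first sample, C caught in the second, R recaptured), while here N^* = N H^* / H, so H* plays the role of M, N plays the role of C, and the observed mean diversity H plays the role of the recaptures R.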
I am facing difficulty with finding the right priors for the model parameters.
The model is as follows:
data {
  int<lower=1> N;       // number of sampled individuals (DNA sequences)
  int<lower=1> Hstar;   // number of unique haplotypes observed
  simplex[Hstar] probs; // haplotype frequencies; elements sum to 1
  int<lower=2> perms;   // number of replicate orderings (permutations)
}
transformed data {
  matrix[perms, N] pop; // pop[j, i] = distinct haplotypes seen after i draws
  for (j in 1:perms) {
    int seen[Hstar] = rep_array(0, Hstar); // indicator: has haplotype h been observed yet?
    int count = 0;
    for (i in 1:N) {
      int h = categorical_rng(probs); // draw a haplotype label according to probs
      if (seen[h] == 0) {
        seen[h] = 1;
        count += 1;
      }
      pop[j, i] = count;
    }
  }
}
transformed parameters {
  vector[perms] H = col(pop, N); // retrieve last column (endpoint of the curve)
  real<lower=N> Nstar = (N * Hstar) / mean(H);
}
model {
  Nstar ~ ???;
}
Any ideas on how to form a good prior?
I would also need an efficient way to fill the pop matrix with random positive integers generated according to probs.
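For concreteness, the naive fill I have in mind (if pop stored raw haplotype labels rather than the running counts used above) is just a double loop over categorical_rng; I’m hoping there’s something more vectorized:
for (j in 1:perms)
  for (i in 1:N)
    pop[j, i] = categorical_rng(probs); // label of draw i in permutation j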
Any advice is greatly appreciated and warmly welcomed.