Parameterising fossil occurrence data

New Stan user here. I'm working with palaeontological data consisting of L fossil sites with known ages and T species of animals, each of which can be found at two or more sites. The occurrence data is an [L, T] matrix whose [l, t] element is 1 or 0, denoting the presence or absence of species t at site l. I am attempting to estimate appearance and disappearance ages, with uncertainties, for each species.

What would be the best way to express the posterior likelihood? I think IRT models (see 1.11 Item-response theory models in the Stan User's Guide) are somewhat conceptually similar.
The resulting parameter distributions are all flat, so I think the posterior is wrong. The Python script and test data I use are below. I'd be grateful for any guidance!

data {
  int<lower=0> T; // number of taxa
  int<lower=0> L; // number of sites
  real<lower=0,upper=70> loc_ages[L]; // locality ages
  int<lower=0,upper=1> occ_matrix[L, T]; // 0-1 matrix indicating occurrence/absence, L rows, T cols
}

parameters {
  real<lower=0,upper=70> a[T]; // first appearance ages of taxa
  real<lower=0,upper=40> duration[T]; // durations of taxon existence
}

model {
  int theta = 0;
  for (l in 1:L) {
    for (t in 1:T) {
      if (a[t] <= loc_ages[l] <= a[t]+duration[t]) { // locality dates are between a[t] and a[t]+duration[t]
        theta = 1;
      } else {
        theta = 0;
      }
      occ_matrix[l,t] ~ bernoulli_logit(theta);
    }
  }
}



Here's the Python script I use. The occurrence matrix with test data is here: 10test.txt (110 Bytes).

import numpy as np
import pystan

model = pystan.StanModel(file="occ_model.stan")  # compile the Stan program above (file name assumed)

T = 5   # number of taxa
L = 10  # number of sites
ages = [1., 5, 10, 20, 25, 30, 40, 50, 55, 60]   # site ages
durations = [15, 15, 30, 35, 24]                 # durations of taxon existence (unused below)
mat = np.genfromtxt("10test.txt", dtype=int)     # [L, T] occurrence matrix
data = {'T': T, 'L': L, 'loc_ages': ages, 'occ_matrix': mat}
fit = model.sampling(data=data, seed=420, iter=10000, chains=4, warmup=5000, thin=1)


Hi, welcome to discourse!

There are a few things that we need to work through here:

  1. Where you write bernoulli_logit(theta), I think you mean to write bernoulli(theta). theta doesn't appear to be a logit-scale parameter in this model; instead, you want the Bernoulli probability to be zero when the site age is inconsistent with the estimated appearance date and duration.
  2. I'm pretty sure that Stan doesn't understand the condition in your if statement: a < b < c doesn't work, but (a < b) && (b < c) does.
  3. The conditions in your if statement appear to be backwards. Don’t you want theta to be 1 when a[t] is greater (i.e. older) than loc_ages[l] and a[t] - duration is less than loc_ages[l]? (Note that time “runs backwards” in your parameterization. Larger ages occurred before smaller ages). Alternatively, you can leave your conditional intact and interpret a[t] as the age of disappearance, and a[t] + duration as the age of appearance.
  4. Stan doesn’t play nicely with your latent discrete parameter theta. To avoid this, you can reformulate your likelihood as
model {
  for (l in 1:L) {
    for (t in 1:T) {
      if ((a[t] >= loc_ages[l]) && (loc_ages[l] >= (a[t] - duration[t]))) {
        // locality date is between a[t] - duration[t] and a[t]
        target += bernoulli_lpmf(occ_matrix[l,t] | 1);
      } else {
        target += bernoulli_lpmf(occ_matrix[l,t] | 0);
      }
    }
  }
}
  5. By supplying properly formulated inits, Stan can run this model, but it returns all kinds of divergences. However, that's not a big problem because…
  6. I don't think this can possibly be the model you intend. It implies that the probability of detection at a given site during the lifetime of a species is one. Your test data fortuitously doesn't include any cases where a species appears at age A and age C but is undetected at age B. This model sees such data as literally impossible.
  7. If this is the model that you want, then it is equivalent to saying that the age of first appearance is uniformly distributed between the oldest detection and the youngest prior non-detection (or, for the species that was detected at the oldest site, 70, which is the upper bound that you allow for a), and that the age of extinction is uniformly distributed between the youngest detection and the oldest subsequent non-detection. You can fit that model in native Python using a uniform RNG; no Stan needed! (See the sketch just after this list.)
  8. By the way, the number of iterations that you are requesting is much higher than you are likely to need in Stan. Generally 4000 posterior iterations (i.e. 1000 post-warmup iterations per chain) is plenty.
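
To illustrate point 7, here is a minimal Python sketch of that equivalent sampling scheme. It reuses the test ages and 10test.txt from your script; the variable names are mine, and the 70 / 0 fallbacks are just the prior bounds you already use.

import numpy as np

rng = np.random.default_rng(420)
ages = np.array([1., 5, 10, 20, 25, 30, 40, 50, 55, 60])  # site ages
occ = np.genfromtxt("10test.txt", dtype=int)               # [L, T] occurrence matrix
n_draws = 4000

for t in range(occ.shape[1]):
    detected = ages[occ[:, t] == 1]
    missed = ages[occ[:, t] == 0]
    oldest_det, youngest_det = detected.max(), detected.min()

    # youngest non-detection older than the oldest detection (else the prior bound of 70)
    older_misses = missed[missed > oldest_det]
    appear_hi = older_misses.min() if older_misses.size else 70.0

    # oldest non-detection younger than the youngest detection (else 0)
    younger_misses = missed[missed < youngest_det]
    extinct_lo = younger_misses.max() if younger_misses.size else 0.0

    appearance = rng.uniform(oldest_det, appear_hi, n_draws)     # age of first appearance
    extinction = rng.uniform(extinct_lo, youngest_det, n_draws)  # age of extinction
    print(t, appearance.mean(), extinction.mean())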

Edited to add: I don’t think your modeling situation is similar to item-response theory. I suggest that you look into pre-existing literature on estimating appearance and extinction dates from the fossil record to get a sense of what sorts of models are applied to those data. If you find a specific model that seems like a good fit for your problem, there’s a very good chance that this forum can help you implement the model in Stan, particularly if you continue (as you’ve done here) to make a good-faith effort to solve the problem yourself and show your work.


Thank you so much for taking the time to look into this problem! Really appreciated. Indeed, issues 1-3 were oversights on my part; thanks for pointing them out.

I can't fit the model you have provided here, as I keep bumping into a RuntimeError: Initialization failed error. I use PyStan 2.19, and it likely does not accept the list of initial values I am trying to specify (e.g. init=[{"a": [10, 10, 10, 15, 20]}]*4), because the error message says Initialization between (-2, 2) failed after 1 attempts. What inits are you passing to Stan?

This is a prototype model: as you have correctly pointed out, it does not account for missing occurrences (relatively common) or false identifications (rarer), and it assumes that appearance and extinction ages are uniformly distributed, as you described above. I'm going to use priors for appearance ages based on site age estimates and their uncertainties, as well as model the probabilities of false positives/negatives. Later on, I'm going to use this model to estimate the body sizes of predators and prey, geographical distributions, etc. However, I'd like to start with a simple working model and build in complexity later.

An MCMC program doing something very similar is described by Puolamäki et al. 2006, with the source code (written in C) provided here.
Unfortunately, I had two issues with this program. First, the fossil site order is permuted randomly, and the dates of the sites, i.e. the information about the actual distance in time between them, are not included in the posterior calculation (except as possible constraints on the site ordering); this makes it particularly unsuitable for uncertainty estimation. Second, the code has an issue with the estimated age indexes, and I could not get in touch with the author for clarification. It did not work well with my data, producing age estimates inconsistent with the real values.
I'd like to re-implement a similar model (especially in terms of the false negative/positive treatment), and Stan has the added benefit of many diagnostic tools, which a standalone custom MCMC implementation lacks.

Simplifying the model usually helps, but in this case it exaggerates the discreteness, and Stan does not deal well with discrete models.
Initialization often fails if the parameter constraints don't rule out zero-probability configurations. If your model says that a species cannot possibly be observed at a site dated before its appearance, then you must explicitly encode its oldest observation as the lower bound on the date of appearance.
Something like this:

transformed data {
  vector[T] first_seen;  // age of the first site (in index order) where taxon t occurs
  vector[T] last_seen;   // age of the last site (in index order) where taxon t occurs
  for (t in 1:T) {
    first_seen[t] = -1;  // sentinel: not yet seen
    for (l in 1:L) {
      if (occ_matrix[l,t] == 1) {
        last_seen[t] = loc_ages[l];
        if (first_seen[t] < 0)
          first_seen[t] = loc_ages[l];
      }
    }
  }
}
parameters {
  // vector-valued bounds; requires a newer Stan than PyStan 2.19 ships (e.g. via CmdStanPy)
  vector<lower=0,upper=first_seen>[T] appeared;
  vector<lower=last_seen,upper=70>[T] disappeared;
}
generated quantities {
  vector[T] duration = disappeared - appeared;
}
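
For what it's worth, once the blocks above are combined with your data block and a likelihood into a complete program, a minimal CmdStanPy driver might look like this (the file name occ_model.stan and the rest of the setup are placeholders for your own script):

import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="occ_model.stan")  # assumed file name for the full program

mat = np.genfromtxt("10test.txt", dtype=int)
data = {
    "T": int(mat.shape[1]),
    "L": int(mat.shape[0]),
    "loc_ages": [1., 5, 10, 20, 25, 30, 40, 50, 55, 60],
    "occ_matrix": mat.tolist(),
}

fit = model.sample(data=data, seed=420, chains=4,
                   iter_warmup=1000, iter_sampling=1000)
print(fit.summary())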

Totally agree with everything @nhuurre says. Because your prototype model is so strongly constrained, you’ll even need additional constraints on the parameters appeared and disappeared. Your current model says that the species is guaranteed to be detected during its evolutionary lifetime, so you need to constrain appeared to be between the last nondetection and the first detection, and disappeared to be between the last detection and the first subsequent nondetection.

I also agree that it might be time to add some minimal amount of complexity to the model. The super-simple initial model has served its purpose of spotting the issues with the if statements and bernoulli_logit, but at this point, actually getting it up and running would require adding structure (in the form of constraints) that you don't actually want in your final model.

Perhaps a first pass at a more complex model would be to let sites during the evolutionary lifetime of the species sample the species according to some fitted Bernoulli probability p. Then you'd need the bounds that @nhuurre suggests, but no additional bounds. One caution is that the likelihood will then have discontinuities as the appeared and disappeared parameters move past leading or trailing zeros in the detection histories, which might cause problems. There are strategies for ameliorating these problems if necessary, but they might require some modification to the model that you're trying to fit. Happy to help you cross that bridge when you come to it!
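
To make that concrete, here is a rough Python/NumPy sketch of the per-cell likelihood I'm describing (plain Python rather than Stan, just to show the structure; appeared and disappeared are as in @nhuurre's sketch, with appeared <= disappeared on the age axis, and p is a single detection probability):

import numpy as np

def log_lik(appeared, disappeared, p, ages, occ):
    # Within a taxon's inferred lifetime each site detects it with probability p;
    # outside the lifetime a detection is impossible, which is why the bounds above
    # are still needed but no further constraints are.
    ll = 0.0
    L, T = occ.shape
    for t in range(T):
        for l in range(L):
            if appeared[t] <= ages[l] <= disappeared[t]:
                ll += np.log(p) if occ[l, t] == 1 else np.log1p(-p)
            elif occ[l, t] == 1:
                return -np.inf  # detection outside the lifetime has zero probability
    return ll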