Exact, approximate, and minimum count data

Hi all,

I’m working with some data on the abundance of a variety of plants. I’m hoping to work up to fitting some GLMs with these abundance data as the response. However, the abundance data is collected in an unstandardized way, and I’m not totally sure how to best make use of this data and account for the variable data collection methods.

The data is stored in these character strings that sometimes specify the abundance as an exact count (e.g. “none”, “37 plants”) and sometimes specify the abundance as an approximation or an expression that represents an order of magnitude or a minimum count (e.g. “about 40 plants”, “thousands of plants”, “at least 200 plants”). It seems to me that there is plenty of information stored in all of these non-exact count data, but the crux is how to retain as much information as possible while accounting for this sort of variable observation process.

My initial thought was to reduce all non-exact counts to a number that represents the assumed minimum count (“about 40 plants” → 30*, “thousands of plants” → 1000, “at least 200 plants” → 200). From there, I suppose I could draw from N-mixture modeling to estimate some real count along with a detection probability**. Say, specify the likelihood of the exact counts with some Poisson or Negative Binomial distribution, and then specify the approximations as draws from a binomial distribution with a latent total population size and a success probability equal to the probability of detecting any one individual. It is my vague intuition, however, that these distinct types of approximations aren’t adequately represented by a shared detection probability.
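To make that reduction concrete, here is a rough sketch of how those strings could be mapped to bounds on the true count. Everything here is an illustrative guess, not Jeremy's actual scheme: the function name, the ±25% window for "about X", and the choice to keep a (lower, upper) pair (with `upper = None` for "at least"-style records) rather than a single minimum number are all arbitrary decisions for the sketch.

```python
import re

def parse_abundance(s):
    """Map a raw abundance string to (lower, upper) bounds on the true count.

    upper is None when the record only gives a minimum ("at least 200",
    "thousands of"). All parsing rules below are illustrative assumptions.
    """
    s = s.lower().strip()
    if s == "none":
        return (0, 0)
    m = re.match(r"about (\d+)", s)
    if m:
        n = int(m.group(1))
        # Assume "about X" means within +/- 25% of X (an arbitrary choice;
        # this is exactly the "what minimum goes with 'about'" decision).
        return (int(n * 0.75), int(n * 1.25))
    m = re.match(r"at least (\d+)", s)
    if m:
        return (int(m.group(1)), None)   # minimum only
    if s.startswith("thousands of"):
        return (1000, None)              # order-of-magnitude minimum
    m = re.match(r"(\d+) plants?", s)
    if m:
        n = int(m.group(1))
        return (n, n)                    # exact count
    raise ValueError(f"unrecognized abundance string: {s!r}")
```

Keeping both bounds instead of collapsing to a single minimum preserves the distinction between "about 40" (a two-sided approximation) and "at least 200" (a one-sided one), which matters for whatever observation model is layered on top.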

Have others encountered situations like this? I’m eager to make use of this data, but I’m feeling quite unsure of how to handle these variable types of approximations.

Thanks for your thoughts!

Jeremy

* I recognize that for those approximate values, “about X”, I’d need to make some decision about what the associated minimum value is
** At some later time, it may be desirable to define the full dataset as some N-mixture model as there is certainly some probability of non-detection even in what I’m calling exact counts… Perhaps when I get to that point, there could be a separate detection probability parameter for the exact counts from that of the non-exact counts.


Would this be a good use case for interval censoring? E.g. Mixed (right, left and interval) censored log-normal with brms


This seems spot on! I’ve only ever used this approach for survival analysis, and I guess I failed to recognize its generality. Thank you for pointing me in this direction!

Except that the log normal PDF has a continuous domain (i.e. it is defined over the positive real numbers) and you have integer data, I think. You could use Poisson or negative binomial, or you might even want to go to a more flexible shape like the discrete Burr type XII, which is listed here: brms github issues. But I have Stan function block code if you want to implement it in Stan rather than brms.
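For a discrete distribution, the interval-censoring contribution to the likelihood is just a difference of CDFs. A minimal Python sketch with scipy, assuming a Poisson observation model and the same (lower, upper) encoding as above, with `upper = None` for "at least" records (the function name is mine):

```python
import math
from scipy.stats import poisson

def interval_censored_poisson_logprob(lower, upper, lam):
    """Log-probability that a Poisson(lam) count falls in [lower, upper].

    upper=None encodes a right-censored record ("at least `lower`").
    For a discrete distribution, P(lower <= Y <= upper) =
    F(upper) - F(lower - 1), where F is the Poisson CDF.
    """
    if upper is None:
        prob = 1.0 - poisson.cdf(lower - 1, lam)
    else:
        prob = poisson.cdf(upper, lam) - poisson.cdf(lower - 1, lam)
    return math.log(prob)
```

An exact count (lower == upper) reduces to the ordinary Poisson log-PMF, so exact and censored records can share one likelihood function; swapping in a negative binomial CDF would work the same way.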

The general way to treat this is as a missing data problem. There’s a chapter in the Stan User’s Guide outlining the basics.

Specifically, think about what each description implies as a distribution over the unobserved true values. So if they say “about 30”, should that be Poisson(30), or something more or less dispersed like negative_binomial(30, 1.5)? An interval is going to say something like all of the values in [10, 50] are equally likely, and thus isn’t very flexible. On the other hand, if you don’t know anything other than that the value was rounded to the nearest integer n, then a constraint of (n - 0.5, n + 0.5] makes sense.
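To see how much the dispersion choice matters, here is a quick comparison in Python. Stan's neg_binomial_2(mu, phi) maps onto scipy's `nbinom(n=phi, p=phi/(phi + mu))`, which has variance mu + mu²/phi; the numbers below are just for illustration.

```python
from scipy.stats import nbinom, poisson

mu, phi = 30.0, 1.5  # mean and dispersion, as in negative_binomial(30, 1.5)

# Under Poisson(mu), the variance equals the mean.
pois_var = poisson(mu).var()  # 30.0

# neg_binomial_2(mu, phi) in scipy's parameterization:
# variance is mu + mu^2 / phi -- far more dispersed than the Poisson.
nb = nbinom(phi, phi / (phi + mu))
nb_var = nb.var()  # 30 + 900/1.5 = 630.0

# Probability each model puts on the interval [10, 50]:
pois_mass = poisson(mu).cdf(50) - poisson(mu).cdf(9)
nb_mass = nb.cdf(50) - nb.cdf(9)
```

The Poisson concentrates almost all of its mass inside [10, 50], while the negative binomial with phi = 1.5 spreads a substantial fraction outside it, so treating "about 30" as one or the other encodes quite different beliefs about the observer.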
