Hi all,
I’m working with data on the abundance of a variety of plants, and I’m hoping to work up to fitting some GLMs with abundance as the response. However, the abundance data were collected in an unstandardized way, and I’m not sure how best to use them while accounting for the variable data collection methods.
The data are stored as character strings that sometimes give the abundance as an exact count (e.g. “none”, “37 plants”) and sometimes as an approximation, an order of magnitude, or a minimum count (e.g. “about 40 plants”, “thousands of plants”, “at least 200 plants”). It seems to me that there is plenty of information in the non-exact counts; the crux is how to retain as much of it as possible while accounting for this variable observation process.
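To make the string types concrete, here is a rough Python sketch of how each string might be tagged with a record type and the number it states. The function name `parse_abundance`, the category labels, and the “tens” and “hundreds” entries are placeholders assumed for illustration; only “thousands” appears in the examples above, and the real strings may need more patterns.

```python
import re

# Hypothetical lookup for order-of-magnitude phrases. Only "thousands" comes
# from the examples in the post; "tens" and "hundreds" are assumed.
MAGNITUDES = {"tens": 10, "hundreds": 100, "thousands": 1000}


def parse_abundance(text):
    """Return (kind, value) for a free-text abundance string.

    kind is one of "exact", "approximate", "minimum", "magnitude".
    value is the number stated in the string, not yet an assumed minimum.
    """
    s = text.strip().lower()
    if s == "none":
        return ("exact", 0)
    for word, value in MAGNITUDES.items():
        if s.startswith(word):
            return ("magnitude", value)
    match = re.search(r"\d+", s)
    if match is None:
        raise ValueError(f"unrecognised abundance string: {text!r}")
    n = int(match.group())
    if s.startswith("at least"):
        return ("minimum", n)
    if s.startswith("about"):
        return ("approximate", n)
    return ("exact", n)


for example in ["none", "37 plants", "about 40 plants",
                "thousands of plants", "at least 200 plants"]:
    print(example, "->", parse_abundance(example))
```

Keeping the record type alongside the number leaves the choice of an assumed minimum for “about X” as a separate, explicit step.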
My initial thought was to reduce all non-exact counts to a number that represents the assumed minimum count (“about 40 plants” → 30*, “thousands of plants” → 1000, “at least 200 plants” → 200). From there, I suppose I could draw on N-mixture modeling to estimate the true count along with a detection probability**. For example, I could model the exact counts with a Poisson or Negative Binomial distribution and treat the approximations as draws from a binomial distribution with a latent total population size and a success probability equal to the probability of detecting any one individual. My vague intuition, however, is that these distinct types of approximations aren’t adequately represented by a shared detection probability.
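A minimal sketch of that likelihood in plain Python, under loud simplifying assumptions: a single shared Poisson mean `lam` for every record (no covariates), a single shared detection probability `p`, and a finite cap `n_max` on the latent population size so it can be summed out. All function and argument names are hypothetical, and this is only one way to write the idea down, not a recommended model.

```python
import math


def poisson_logpmf(k, lam):
    """log Poisson(k | lam), for lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def binomial_logpmf(k, n, p):
    """log Binomial(k | n, p), for 0 < p < 1 and 0 <= k <= n."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def approx_loglik(y, lam, p, n_max):
    """Marginal log-likelihood of a reduced minimum count y.

    Sums the latent population size N out over y..n_max:
    log sum_N Poisson(N | lam) * Binomial(y | N, p).
    """
    terms = [poisson_logpmf(n, lam) + binomial_logpmf(y, n, p)
             for n in range(y, n_max + 1)]
    return log_sum_exp(terms)


def total_loglik(exact, approx, lam, p, n_max=2000):
    """Exact counts are Poisson(lam); approximations are binomially thinned."""
    ll = sum(poisson_logpmf(k, lam) for k in exact)
    ll += sum(approx_loglik(y, lam, p, n_max) for y in approx)
    return ll


# Toy values for illustration only, not real data.
print(total_loglik(exact=[37, 0, 52], approx=[30, 45], lam=40.0, p=0.8, n_max=400))
```

One thing this makes visible: with a Poisson latent count, summing out N gives the approximations a Poisson distribution with mean `lam * p`, so `p` is informed only by how the approximate records compare with the exact ones. Giving each type of approximation its own `p` would be a small change to `total_loglik`.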
Have others encountered situations like this? I’m eager to make use of this data, but I’m feeling quite unsure of how to handle these variable types of approximations.
Thanks for your thoughts!
Jeremy
* I recognize that for those approximate values, “about X”, I’d need to make some decision about what the associated minimum value is
** At some later time, it may be desirable to define the full dataset as an N-mixture model, as there is certainly some probability of non-detection even in what I’m calling exact counts. Perhaps when I get to that point, the exact counts could have their own detection probability parameter, separate from that of the non-exact counts.