Modelling count-data accounting for detection and uneven sampling

Hi Stan community,
Here is my research question: To what extent do beavers influence the number of insect species?
Sampling design: 16 streams. We define four (4) locations within each stream: Control, Inflow, Pond, Outflow.
Expectation: I expect to count more insect species in the beaver influenced locations jointly (Inflow, Pond, Outflow) than in the Control location.
My inquiries:
a) How do I account for the no detection of insect species even if it possible that these species are there?
b) How do I account for the fact that I compare one (1) Control to three (3) locations?


Maybe a model like that… but I am not sure.

b0 <-
  brm(data = d, family = negbinomial,
      y ~ 0 + location + offset(effort) + (0 + location | stream),
      iter = 2500, warmup = 500, cores = 2, chains = 2,
      seed = 10)

Does the offset relate to how many sample I have per location?
Here’s a sub-sample of the data:
d.csv (55.6 KB)


Thank you very much in advance.
Leonardo

In general, correcting for imperfect detection requires special forms of data, reflecting detection/nondetection histories over repeated sampling events or some kind of time-to-detection data, or an ability to make VERY strong assumptions which covariates influence detection versus occupancy and about the correctness of your model specification.

To compare 1 to 3 locations, just make sure that the control location is the reference category in your categorical predictor. R can handle this for you, but sometimes I find it easier to manually encode the 4-level categorical predictor via 3 binary predictors, which is particularly helpful if I want to have meaningful control over the priors I’m using for different categories.

1 Like

In addition to the other response, what’s the definition of “effort”? The model you have could be fine for addressing the issues you bring up. The main concern is where effort is really an accurate representation–does double the effort double the expected number found?

You’re counting species so beyond a certain effort I would expect saturation and you need a model that accounts for that. If effort is similar across sites it should be fine, just fit the model and look at whether the posterior predicted lines up with observed as a start.

Thanks @sakrejda and @jsocolar.

@sakrejda In addition to the other response, what’s the definition of “effort”?

I suggest to comment using this figure:

Here the control category represent the number of distinct insect species with respect to the other three locations (inflow, pond, outflow). While the control + beaver category represent the number of insect species detected in control, inflow, pond and outflow.
When I refer to effort I am referring to the fact the the control + beaver category is the sum of 4 locations, while the control category represents only one location within each site.
Therefore, if I am using the poisson likelihood, do I need to account for this artifact / unbalance ? I am using as reference, @Solomon ebook , 10th chapter (link).
The data:
pd.csv (946 Bytes)
The model:

bform <-  bf(species_richness ~ 0 + location + offset(log(offset)) + (0 + location | site), 
          family = poisson(link = "log"))
# Then set informative priors
priors_informative <- c(  
  #~~~~~~~~~~~ the number of insect species of the control location ~~~~~~~~~
  set_prior("normal(1.25, 0.2)", class = "b", coef = "locationcontrol"), 
  #~~~~~~~~~~~ the number of insect species of the four (4) locations jointly ~~~~~~~~~
  set_prior("normal(1.5, 0.5)", class = "b", coef = "locationtotal"), 
  #~~~~~~~~~~~ the between site variability of the control location~~~~~~~~~
  set_prior("normal(0, 0.5)", class = "sd", coef = "locationcontrol" , group = "site" ),
  #~~~~~~~~~~~ the  between site  variability of the four location~~~~~~~~~
  set_prior("normal(0, 0.75)", class = "sd", coef = "locationtotal" , group = "site" )
)
# Fit the model 
b1  <-  brm(formula = bform,
          data = pd,
          prior = priors_informative,
          chains = 4,
          iter = 2000,
          warmup = 500,
          cores = parallel::detectCores(),
          control = list(adapt_delta = 0.99, max_treedepth = 15), # facilitate convergence
          seed = 7, # allow exact replication of results 
         sample_prior = T # sample also priors for predictive check
        )