Multilevel model: should I account for lack of independence? If so, how?

Hi,
I just started learning Bayesian Statistics and STAN, therefore it is quite possible the following question is silly. My apologies in advance.

I am working with vegetation surveys that were conducted in several river networks. I am interested in analyzing how the number of species in common between sample sites changes as function of network distance across different river networks.

I calculated 1) the number of species in common between all pairs of points and 2) network distances between pairs of points. As an interface for STAN I am using the “rethinking” package from Richard McElreath’s book “Statistical Rethinking”.

The code below fits a linear multilevel model with varying intercepts and slopes for each river basin/network:

m1<-map2stan(alist(
      species~dnorm(mu,sigma),
      mu<-a_basin[basin_id]+b_basin[basin_id]*net_distance,
      c(a,b)[basin_id]~dmvnorm2(c(a,b),sigma_basin,Rho),
      a~dnorm(0,1),
      b~dnorm(0,1),
      sigma_basin~dcauchy(0,1),
      Rho ~ dlkjcorr(2),
      sigma~dcauchy(0,1)
    ),data=d,warmup=1000,iter=5000,chains=3,cores=5
    )

The code works but I am not sure the model is correct. I calculated metrics for each pair of points, which means the same point shows up in several pairwise comparisons. Does this create some kind of lack of independence that has to be taken into account in the model? If so, how should I include it?

Below is the corresponding STAN code

Thank you.

data{
    int<lower=1> N;
    int<lower=1> N_basin_id;
    real sorenson[N];
    real net_dist_s[N];
    int basin_id[N];
}
parameters{
    vector[N_basin_id] b_dist_net;
    vector[N_basin_id] a_basin;
    real a;
    real b;
    vector<lower=0>[2] sigma_basin;
    corr_matrix[2] Rho;
    real<lower=0> sigma;
}
transformed parameters{
    vector[2] v_a_basinb_dist_net[N_basin_id];
    vector[2] Mu_ab;
    cov_matrix[2] SRS_sigma_basinRho;
    for ( j in 1:N_basin_id ) {
        v_a_basinb_dist_net[j,1] = a_basin[j];
        v_a_basinb_dist_net[j,2] = b_dist_net[j];
    }
    for ( j in 1:2 ) {
        Mu_ab[1] = a;
        Mu_ab[2] = b;
    }
    SRS_sigma_basinRho = quad_form_diag(Rho,sigma_basin);
}
model{
    vector[N] mu;
    sigma ~ cauchy( 0 , 1 );
    Rho ~ lkj_corr( 2 );
    sigma_basin ~ cauchy( 0 , 1 );
    b ~ normal( 0 , 1 );
    a ~ normal( 0 , 10000 );
    v_a_basinb_dist_net ~ multi_normal( Mu_ab , SRS_sigma_basinRho );
    for ( i in 1:N ) {
        mu[i] = a_basin[basin_id[i]] + b_dist_net[basin_id[i]] * net_dist_s[i];
    }
    sorenson ~ normal( mu , sigma );
}
generated quantities{
    vector[N] mu;
    for ( i in 1:N ) {
        mu[i] = a_basin[basin_id[i]] + b_dist_net[basin_id[i]] * net_dist_s[i];
    }
}

So for every pair of sample sites you have a count of the number of species in common, right?

And so in your regression you have a term for each of these things? Something like:

for(i in 1:N) {
  number_of_common_species[i] ~ normal(a * distance[i] + b, sigma);
}

In which case, you’re assume they are independent in the generative model. Whether or not this has to be taken into account is a modeling decision.

Since the model that you wrote down probably isn’t how fish get distributed in rivers, it’s up to you to try to do posterior predictives and such to convince yourself that it is representative of the data you collected. I don’t think there will be hard and fast rules in a modeling problem so open ended.

But probably the place to start is plots. Can you plot number_of_common_species vs. distance for a bunch of different river basins and get something that looks like a line?

Yes, that’s right.

I’m attaching a plot number_of_species_in_common ~ vs network distance. The solid
line is the MAP estimate of the mean number of species in common at each distance. The two shaded regions show different 95% plausible regions. The narrow shaded interval around the line is the
distribution of ÎĽ. The wider shaded region represents the region within which the model expects to find 95% of actual number of species in common at each distance.

Thanks for the help

plots.pdf (154.2 KB)

The plot looks fine to me, but I don’t have domain expertise. And as Ben already pointed out, how you set up your model should reflect your knowledge of the problem.

That being said (and seeing that you use the rethinking package), have you thought about using a Gaussian Process with a distance matrix? There is this example of the tools used by ancient island civilization (or something along those lines) in Richard’s lecture, where he brilliantly demonstrates the GP regression. This approach models the spatial dependence a bit more directly. But I don’t want to talk you into using something… And in the end, you are the expert on this and you can judge whether this is reasonable - just wanted to share my first thoughts on this. :)

1 Like

Thanks for the plots!

Out of curiosity, do you have count data available? Like, we counted at this site 50 red fishies and 25 blue fishies in a few hours at a certain site?

Binary, yes-no data only has so much information in it. Counts have more, if you have it.

I’ve been thinking about it, yes. From the exploratory data analysis I was able to tell that euclidian distance is a much stronger predictor than network distance. But I suspect euclidian distance is just correlated with changes in temperature and precipitation which are (probably) the real causes for the differences I observed.

The model would look like this:

species in common ~ precipitation difference + temperate difference + network distance

with varying slopes and intercepts for 1) each river network , 2) different distances between points (with a Gaussian process).

I haven’t tried yet because I just started learning about Bayesian Statistics and I didn’t want to dive into complicated models (considering my level expertise) just yet.

Thanks a lot.

I don’t have counts for each species of plant by I have % of cover (i.e. area of the soil covered by that species). But yes that would definitely be more interesting that working with 1/0 data. Since I just started diving into Bayesian Statistics I figured I should stick to simpler models.

Now that I know that I’m not going about this all wrong I will start fitting more complex models and improve my analysis.

Thanks a lot for the help.

Have you seen Richard McElreath’s stuff: https://youtu.be/FoaxA7sJi7w ?

It’s fun evolutionary/anthropology/ecology stuff. You might find it interesting. He’s got a book: https://xcelab.net/rm/statistical-rethinking/ and a Youtube channel with a bunch of the lectures up too: https://www.youtube.com/channel/UCNJK6_DZvcMqNSzQdEkzvzA/featured

Anyway my impression of the stuff he works on is lots of observational data collected at lots of different places around the world. He and his collaborators are trying to figure out the whys and whats of big, complicated, evolutionary questions where they don’t get to run their own experiments. He does it with Bayesian stuff and Stan, so maybe you’d find some material there that could help you.