Multiple group-level effects - ordinal regression

I am having trouble with the analysis for my experiment design in animal biology. My goal is to understand the influence of predictors (at country level) on the status (health) of several species. I am not interested in the outcome for specific species or countries, but rather whether covariates (varying at the country level) influence the general outcome. I think it is a cross-level design as there is no specific hierarchy : each species (26) lives in each country (3), and thus has a specific status per country.

The status (response variable) is in reality the IUCN Red List of Threatened Species status estimating the extinction risk of the species in the future. Status is an ordered factor with 5 levels from ‘Least Concern’ (very low extinction risk) to ‘Critically Endangered’ (very high extinction risk). I am doing ordinal regression in brms (to take into account that the response are categories and ordered) but my problem would be the same if the extinction risk was instead a numeric with negative value representing low extinction risk and positive values high extinction risk.

The predictor of interest is a country covariate, let’s say the Gross domestic product (GDP: numeric) (thus 3 points only). I expect additional variation at the country level than just GDP. I have one species level covariate (that is usually the main reason for difference of status between species in my specific situation): the size of the species (a positive numeric) and I expect some additional variation at the species level (that may come from differences in their life history traits, mature late or early, behavior, etc…). Below is a simplified dataset with 3 species and 3 countries, and their specific status per country as numeric between 1 and 5 (in my analysis, they will be treated as ordered factor):

Below is my first writing of the model:
status ~ size_species + GDP_country + (0+ size_species|species_ID) + (0+ GDP_country|country_ID)

The other possibility with the random effect on intercept only leads to slightly higher sd in size_species and slightly lower sd in GDP_country:
status ~ size_species + GDP_country + (1|species_ID) + (1|country_ID)

Any thoughts?

Thank you in advance!

1 Like

I will have a few notes:

I am not sure it is useful to have terms like (0+ size_species|species_ID) or (0+ GDP_country|country_ID) - in both cases the predictor is constant within members of each group, so it is very similar to having a (1 | species_ID) or (1 | country_ID) term, with one difference: in the (0+ size_species|species_ID) formulation, countries/species with larger absolute value of size_species or GDP_country will be allowed to differ more from the overall mean (as the coefficient is multiplied by the predictor) than those with lower values. Would that be desirable?

Additionally, if you had multiple predictors for species/country, the model will become non-identified (e.g. if you also had a land_area_country predictor, then (0 + GDP_country + land_area_country | country_ID) will almost certainly result in convergence issues).

More generally, I think the question as you’ve written it is potentially imprecise, making modelling harder. What is “general outcome” in this setting? How do the outcomes for multiple species aggregate? For example: country A has “Least concern” for one species and “Critically Endangered” for another - are the better or worse than a country B that has “Vulnerable” for both? The model as you’ve written it will answer this question implicitly, but there is no guarantee the answer will make sense for the real world.

Additionally, the model doesn’t leave room for GDP (or other country-level predictors) to have different relationship with different species status, although that seems (to me with no real background in ecology) implausible. E.g. one would not be surprised to see countries that have a lot of money and invest in conservation to focus preferentially on the more “nice” species (like mammals).

One way around this would be to sidestep the issue and instead make predictions at the species level and then show how the “general outcome” changes as you weigh the different species / outcomes differently (maybe the results will be consistent across broad range of criteria, maybe not…)

Does that make sense?

First, thank you very much for taking the time to help me.

What is “general outcome” in this setting?

I realise now I should have given more information in the first post. My hypothesis is that management/conservation will improve the IUCN status of all the species I’m studying in the same way (as all of these species are in the same taxon [group of species], and subject to the same type of management/conservation). I expect that a higher GDP (or other better covariates I will use as proxy of management) will make all species transition from bad categories to better categories (a shift towards Least Concern).
I do have to control for differences between species having in general a better category than the average due to their low size compared to more vulnerable larger species. I believe I take care of it by adding the size covariate and allowing for a common variance of the intercepts linked to the species ID. From your comment and as I do not expect the size relationship to vary more at large size, I think I will stick with (1 | species_ID).
Also, all 26 species occur in the 3 countries. I have one or more country covariates in the model but I expected that there is some constant additional variation inherent to the region not captured in the regional fixed effect, so (1 | region_ID).
So the model I run is: status ~ size_species + GDP_country + (1|species_ID) + (1|country_ID)
After running the model, the coefficients for size and the country covariates have their 95CI not overlapping 0. The random effect for species_ID seems important (nice shift of distribution when looking at the forest plot of species effect) but the random country effect seems weak (all the specific country distributions overlap a lot).
Two questions come to my mind:
Should I drop (1|country_ID)? If yes, what is the evidence that I don’t need it (the overlapping distributions?)?
The relationship of GDP_country and status only rely on 3 points (due to only having 3 countries)… I know that the relationship rely in reality on 26 species per 3 countries but it makes me question the confidence one could have in the robustness of the relationship given the low sample size of this group effect.
Thank you again!

1 Like

I don’t think overlapping distributions would be a good reason to drop the (1 | country_ID) term. A good reason would be if you believed your other covariates explain all the between-country variance a-priori (i.e. without needing to refer to the model fit). This seems implausible as countries just differ a lot between each other. Also usually having an extra varying intercept does not hurt model precision very much, so there is IMHO no strong reason to not keep it.

That’s a good point and a big limitation. I would also expect the effect to be hugely uncertain given you only have 3 countries (unless it is like 1 very poor country with all bad conservation, 1 middle income country with all medium conservation and 1 rich country great conservation overall). If I understand you correctly, a narrative description of the model would be something like:

  1. GDP (and other covariates) predict the overall conservation effort (for this taxon)
  2. The conservation status of individual species in each state fluctuates around the overall conservation effort.

This would be decently well matched by a model you described ( GDP_country + (1|country_ID) representing the overall effort and size_species + (1|species_ID) representing the species- specific fluctuation and the ordinal response representing additional variation for each contry-species pair).

One possible source of the problem might be that the GDP coefficient is strongly influenced by one outlying GDP. You could probably check to what extent this is a problem by doing a posterior predictive check (I would start with pp_check(fit, type = "bars_grouped", group = "species_ID") and then with `group = “country_ID”). Additionally, I’ve seen a lot of people use the log of predictors like GDP, income, etc. as it appears that many real world relationships are better approximated by a linear model depending on the log GDP, so you may also want to inspect that.

The best would however be to include data from additional countries. My impression is that you don’t necessarily need all the countries to have data for all species, unless you expect there is some important bias in what species you have data and what species don’t have data.

(1 | country_ID) […] there is IMHO no strong reason to not keep it.

Got it. I do believe that my covariate will not explain all the between-country variance so I will keep the term.

This would be decently well matched by a model you described ( GDP_country + (1|country_ID) representing the overall effort and size_species + (1|species_ID) representing the species- specific fluctuation and the ordinal response representing additional variation for each contry-species pair).

Thank you very much for confirming the writing of the model.

One possible source of the problem might be that the GDP coefficient is strongly influenced by one outlying GDP.

Fortunately the 3 GDP I have represents three different conservation situation (good, medium and bad conservation). The ppc check for the region seems ok, with a slight underestimate of one of the categories in one region, but the species_ID does not look fine (3 example below):

The best would however be to include data from additional countries.

I unfortunately cannot add additional countries.
The IUCN status are attributed to each species by areas (multiple countries). I have thought of using more countries but because the response variable is at the region level and not anymore at the country level. I struggle to find the correct writing for such a model (which I guess i a cross level effect between species and countries nested in a region), and I am not even sure that this analysis would be relevant.

status ~ 1 + size + GDP_country + (1 |species_ID) + (1|region_ID:country_ID)
1 Like

So you are limiting you datasets to countries that also happen to match to regions (as defined by IUCN)? I think this is potentially biasing the data and a bit weird… If your data are at the region level, there is IMHO very little you can do beyond using reigon-level predictors. I agree that combining country-level data into region-level predictors might be challenging due to difference between countries (and I can imagine that are countries that lie in multiple IUCN regions as well), but I don’t thik it is conceptually such a big leap - for example the GDP of the USA is also ignoring the substantial differences in economy (and conservation efforts) between individual states. The only difference is that for country-level predictors, someone did the aggregation for you, but to get region-level predictors you would need to do the aggregation yourself.

Does that make sense?

We determined the regional status (per broad regions) but determining the country-level status is very challenging (time and effort) and in some case almost impossible due to poor knowledge.
I guess your advice to using region-level predictors, even if that means reducing the robustness given the small number of regions, is the way I will continue my analysis:

status_region ~ 1 + size + GDP_region + (1 |species_ID) + (1|region_ID)

I will focus on what is the best way to aggregate the country-level covariate to a region-level covariate.
I will flag your last reply as the solution, even if all of your posts helped me a lot.
Thank you very much for your help!

1 Like

I would just add that I don’t thing using region-level predictors reduces robustness. If your response is at the region level, that’s just the data you have and there is no way to be more robust without having more data/assumptions. Pairing the region-level response with country-level predictors IMHO cannot improve the situation unless you can justify strong assumptions about the relationship between region-level conservation and country-level conservation. In the extreme, if we assumed all countries within a region to have exactly the same conservation status, than one could treat the region response as the country response. One could slightly soften this - if we believed that countries within a region are in some sense quite homogenous, one could probably treat the country-level response as missing data or even take the region response as per-country response and add a country-specific varying intercept that would represent the difference between the country conservation and regional conservation. But that would IMHO mean we believe the between-country differences are quite “well behaved”.

Hope that makes sense :-)