Question about embedded group level effects

I have a dataset where I am looking at features found in different languages, something like:

Language    Feature.A  Feature.B
language-1  val.a.1    val.b.1
language-2  val.a.2    ...
language-3  ...        ...

Additionally, I have family information for each language. However, the family information is not well structured: it is more like a series of nested groups of different sizes. For example, for Spanish I have:

[Indo-European, Italic, Romance, Western Romance, Ibero-Romance, West-Iberian, Castilian languages]

And for Guarani I have:

[Tupian, Tupi–Guarani, Guarani (I), Guarani]

Here the leftmost label is the most general family group and the rightmost label is the most specific one.
Different languages have different numbers of groupings, and there is no obvious way to determine which subgroup of one family corresponds to which subgroup of another family.

The model I have in mind should go something like this:

 Feature.A ~ Feature.B + (1|Family)

However, I do not see any reasonable way of including all the family information I have. I could of course take only the largest and the smallest grouping and include those, but this seems like an arbitrary choice that would miss part of the structure of the data.

Is there a better way of doing this? Can the group-level effect be defined so that it includes all the hierarchical information available?

Thanks!

Sorry, can’t respond now, but maybe @Max_Mantei is not busy and can answer?

Looking into it a bit more, it seems I could use something like the phylogenetic models in brms described here: https://cran.r-project.org/web/packages/brms/vignettes/brms_phylogenetics.html, provided that I induce a phylogenetic tree from the family information.

(If I may tag you) @paul.buerkner, would this do what I want?

If you can construct a phylogenetic tree and then extract the induced correlation matrix, this could be what you want, yes.
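As a rough illustration of what "induced correlation matrix" could mean for ragged family lists like the ones above, here is a minimal base-R sketch. It rests on assumptions that are not from this thread: every step in a lineage counts as one branch of length 1, each language gets its own terminal branch, and covariance is the number of shared leading labels (a Brownian-motion-style covariance). The helper `shared_depth` and the `toy.1` / `toy.2` entries are made up for the example; only the Spanish and Guarani lineages come from the original post.

```r
# Lineages from most general (left) to most specific (right).
# toy.1 and toy.2 are hypothetical entries added only for illustration.
lineages <- list(
  Spanish = c("Indo-European", "Italic", "Romance", "Western Romance",
              "Ibero-Romance", "West-Iberian", "Castilian languages"),
  Guarani = c("Tupian", "Tupi–Guarani", "Guarani (I)", "Guarani"),
  toy.1   = c("Indo-European", "Italic", "Romance"),
  toy.2   = c("Tupian", "Tupi–Guarani")
)

# Number of leading labels two lineages have in common
# (depth of their most recent common group).
shared_depth <- function(a, b) {
  n <- min(length(a), length(b))
  if (n == 0) return(0)
  same <- a[seq_len(n)] == b[seq_len(n)]
  if (all(same)) n else which(!same)[1] - 1
}

langs <- names(lineages)
idx <- seq_along(langs)

# Covariance: shared path length from the root (assumes unit branch lengths).
V <- outer(idx, idx, Vectorize(function(i, j) {
  shared_depth(lineages[[i]], lineages[[j]])
}))

# Variance: full depth of the lineage plus one terminal branch per language.
diag(V) <- lengths(lineages) + 1
dimnames(V) <- list(langs, langs)

# Correlation matrix that could be passed to the model as A.
A <- cov2cor(V)
print(round(A, 3))
```

Languages that share no top-level family come out uncorrelated under this scheme. With real branch lengths, a proper tree object and a function such as `ape::vcv.phylo` would be the more usual route.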

Thanks! A follow-up question: in my data each observation is a language, and each language belongs to a micro-family. I can build the phylogenetic tree all the way down to each language and then fit:

Feature.A ~ Feature.B + (1|language), cov_ranef = list(language = A)

as in your first example. Or I could build the phylogenetic tree only down to the smallest micro-families and then fit:

Feature.A ~ Feature.B + (1|family.2) + (1|family), cov_ranef = list(family = A)

as in your second example.

Is there any reason to prefer one over the other?
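For reference, the two candidate specifications can be sketched in the `gr(..., cov = A)` / `data2` form used by recent brms versions. This is only a sketch under assumptions: the wrapper `fit_with_cov` is a hypothetical helper, `dat` is assumed to be a data frame with the columns named in the formulas, the level names of the grouping variable are assumed to match `rownames(A)`, and the response family is left at the brms default because the thread does not say what kind of variable Feature.A is.

```r
# Tree built down to each language: one group-level effect per language,
# correlated according to A.
f_language <- Feature.A ~ Feature.B + (1 | gr(language, cov = A))

# Tree built down to the micro-families: correlated effect on family
# plus an uncorrelated effect on family.2.
f_family <- Feature.A ~ Feature.B + (1 | family.2) + (1 | gr(family, cov = A))

# Hypothetical wrapper; brms is only needed when this function is called.
fit_with_cov <- function(formula, dat, A) {
  brms::brm(formula, data = dat, data2 = list(A = A))
}

# Usage (not run here, since it compiles and samples a Stan model):
# fit <- fit_with_cov(f_language, dat, A)
```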

That depends on whether you are interested in the second level, I would say.

Nothing much to add here. I would have suggested something like Feature.A ~ Feature.B + (1|family.2) + (1|family) without the phylogenetic tree (which I know nothing about). My hunch would be that it doesn’t make much of a difference if there’s no variation in Feature.A and Feature.B across “sub”-families. The phylogenetic tree is most likely the better way to extract the structure of the data and incorporate it in the model.
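If someone wants to try the version without the tree, the two grouping columns could be pulled out of the ragged lineages roughly as follows. This is a sketch, and the mapping is an assumption: `family` is taken to be the most general label and `family.2` the most specific one, which is the "largest and smallest grouping" choice mentioned in the first post.

```r
# Lineages as given in the first post (most general first).
lineages <- list(
  Spanish = c("Indo-European", "Italic", "Romance", "Western Romance",
              "Ibero-Romance", "West-Iberian", "Castilian languages"),
  Guarani = c("Tupian", "Tupi–Guarani", "Guarani (I)", "Guarani")
)

# One row per language: first label as family, last label as family.2.
groups <- data.frame(
  language = names(lineages),
  family   = vapply(lineages, function(x) x[1], character(1)),
  family.2 = vapply(lineages, function(x) x[length(x)], character(1)),
  row.names = NULL
)
print(groups)
```

The resulting columns can be merged into the feature table by `language` before fitting.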