Relative importance of covariate groups

I am here to ask for advice on a problem that I am trying to formalize and then translate into Stan.

Let’s say I have Y \in \mathbb{R}^{N} measurements of a phenomenon that I formerly treated as a straightforward regression (no mixed effects) using D covariates (including the intercept):

Y \sim normal(\mu, \sigma) \\
\mu = X \cdot \beta, \quad X \in \mathbb{R}^{N \times D}, \; \beta \in \mathbb{R}^{D} \\
\beta \sim normal(0, 1) \\
\sigma \sim normal^{+}(0, 1)
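
For concreteness, here is a minimal Stan sketch of this baseline model (the data block and variable names are mine, just to fix notation):

```stan
data {
  int<lower=1> N;            // number of observations
  int<lower=1> D;            // number of covariates, including the intercept column
  matrix[N, D] X;            // design matrix
  vector[N] Y;
}
parameters {
  vector[D] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);      // half-normal, because of the lower bound
  Y ~ normal(X * beta, sigma);
}
```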

Now I would like to extend the model, knowing that the values of Y depend on a group of D_1 covariates representing a biotic effect and another group of D_2 covariates representing an abiotic effect, so that D = D_1 + D_2.

I would like to compute the relative magnitude \alpha of the two groups of covariates, i.e. whether the abiotic component is more important than the biotic one.
So, as a first attempt, I proposed to extend the model as:

\mu = \alpha_1 \cdot \mu_{bio} + \alpha_2 \cdot \mu_{abio} \\
\mu_{bio} = X_1 \cdot \beta_1, \quad X_1 \in \mathbb{R}^{N \times D_1}, \; \beta_1 \in \mathbb{R}^{D_1} \\
\mu_{abio} = X_2 \cdot \beta_2, \quad X_2 \in \mathbb{R}^{N \times D_2}, \; \beta_2 \in \mathbb{R}^{D_2} \\
\sum_{k=1}^{2} \alpha_k = 1
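
A minimal Stan sketch of this weighted version, using a simplex for \alpha (again, the names are mine):

```stan
data {
  int<lower=1> N;
  int<lower=1> D1;           // number of biotic covariates
  int<lower=1> D2;           // number of abiotic covariates
  matrix[N, D1] X1;
  matrix[N, D2] X2;
  vector[N] Y;
}
parameters {
  vector[D1] beta1;
  vector[D2] beta2;
  simplex[2] alpha;          // alpha[1] + alpha[2] = 1
  real<lower=0> sigma;
}
model {
  vector[N] mu = alpha[1] * (X1 * beta1) + alpha[2] * (X2 * beta2);
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  sigma ~ normal(0, 1);      // half-normal
  Y ~ normal(mu, sigma);
}
```

One thing that worries me in this parameterisation is that each \alpha_k multiplies the corresponding \beta_k, so the weights and the coefficients are only softly identified through their priors.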

Then, after writing down this formalization, I started to think that a better approach might be a finite mixture of two components, modelling \mu with the covariates.

Do you think I am unnecessarily complicating my life and there are simpler approaches? Am I chasing a red herring?

You could just compare model 1 with D_1 and model 2 with D_2 using loo. The one with the better predictive performance has more informative covariates.
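
On the Stan side this only requires adding the pointwise log-likelihood in generated quantities so that loo can be applied to each fit; a minimal sketch, reusing the names from the baseline model above:

```stan
generated quantities {
  // pointwise log-likelihood, consumed by the loo package
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = normal_lpdf(Y[n] | X[n] * beta, sigma);
}
```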


@avehtari my question is slightly different (or maybe I am missing something “in the middle”).

Here’s my question: given Y ~ f(D1 + D2), what is the contribution of D1 and of D2 in explaining my data? Are the biotic covariates (all together) more important, or the abiotic covariates (all together)? And by how much? I would like estimates of such proportions. Is it possible to answer this with loo? Is loo_model_weights the tool to use?

I guess you mean f(D1, D2)? If f is arbitrary (e.g. including interactions and nonlinearities) this is a difficult question. If you mean f1(D1) + f2(D2), it’s easier, as you can compare it to the cases 0 · f1(D1) + f2(D2) and f1(D1) + 0 · f2(D2).

LOO-R^2 would be useful for you. You can compute LOO-R^2 for three cases f1(D1), f2(D2) and f1(D1)+f2(D2) to see how much of the data variance they are explaining.
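
For reference, a common way to define LOO-R^2 is in terms of the leave-one-out predictive means \hat{y}^{loo}_i (sketching the definition):

R^2_{loo} = 1 - \frac{\operatorname{Var}_i\left(y_i - \hat{y}^{loo}_i\right)}{\operatorname{Var}_i\left(y_i\right)}

where \operatorname{Var}_i denotes the sample variance over the N observations.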


Thanks, @avehtari, for your help and time.
I totally missed your paper on Bayesian R^2 (my fault). Anyway, I tried to follow that path…
Since Y is a latent variable of an ordinal model, the function loo_R2 is not meant for such models. So instead I delegated the calculation of some indices of the predictive distributions to the functions shown here.
I am reporting the three models Y ~ D1 + D2, Y ~ D1 and Y ~ D2, where the first model contains all the covariates, the second only the abiotic group of covariates, and the third only the biotic group. All three are linear models.

               waic_wts pbma_wts pbma_BB_wts stacking_wts
abiotic+biotic      NaN     0.25        0.25         0.00
abiotic             NaN     0.75        0.74         0.97
biotic              NaN     0.00        0.00         0.03

The looic values are 2040.4, 2038.2 and 2074.0, respectively.
Now, are these statistics sufficient to say that the relative weight of the abiotic factor is 97% with respect to the complete model?
Do you think it is worth extracting the latent variable Y from the ordinal model and applying the LOO-R^2 calculation to it? But then how would I calculate the predictive means and the variance of the residuals of a latent variable?

It would be good to also show the elpd_diffs (please no looic, as it is just -2 times elpd, and the factor -2 doesn’t have a good justification for non-Gaussian models and when integrating over the posterior) and the corresponding SEs, which are easier to explain than model weights, which are meant not for model selection but for model combination. These statistics are sufficient to say that the models which include abiotic are clearly better than the model which doesn’t. Furthermore, given that abiotic is included, biotic can improve the predictive performance only a little.
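
For reference, the elpd_diff between two models A and B (as reported, e.g., by loo_compare) is the sum of the pointwise elpd differences, and its SE is computed from those same pointwise differences, roughly:

\mathrm{elpd\_diff} = \sum_{i=1}^{N} \left( \mathrm{elpd}^{A}_{i} - \mathrm{elpd}^{B}_{i} \right), \qquad \mathrm{SE} \approx \sqrt{N \, \operatorname{Var}_i\left( \mathrm{elpd}^{A}_{i} - \mathrm{elpd}^{B}_{i} \right)}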

Probably not. The elpd_diffs would be sufficient to state that the abiotic variables are useful and that, given abiotic, the additional information from biotic is negligible.


Thanks @avehtari: I marked your answer as the solution because it helps enough with my argument. Indeed, I do not have to choose one model over another, but rather show that abiotic factors are fundamental in describing the process. It’s a little less than I would have liked to show at the beginning, but it’s enough for my purposes. Taking it further would, given my current capacity, be a step too far.
