Testing predictions of a hierarchical model against new group-level data


A reviewer of a metaanalysis I helped with asked us to evaluate the model on a new dataset. Luckily we were able to obtain new data, but there were some tricky choices to be made on how to evaluate the model. Below I want to summarize what I came up with the hope that I will get some feedback and that it possibly might help other users with a similar task. I can say I am proud of some of the visualisation, so let’s see what you think :-)

The setup:
We are predicting presence/absence of a set of phenotypes for a rare genetic disease given a specific mutation the patient has. We have patient-level data for a few dozen sources with about 1 - 60 patients each. Since the methodologies to measure/report phenotypes between the sources differ, there is substantial variability between sources. We use logistic regression with a varying intercept per source and phenotype. We use brms (family Bernoulli) the important part of the formula is:

phenotype_value ~  <a bunch of predictors> + ((0 + 
	phenotype) || source) 

Now we have a data from two new sources (~160 and ~60 patients) and i want to check that the model fitted on the original data is useful. How to do that in presence of between-source variability?

Why is it a problem?
Have I said the between-source variability is substantial? It is actually the largest source of variability in the model and it swamps out all the other predictors. This means that checking the predictions of the model for a new source directly is useless, it just tells you the probability a patient in new study will have a phenotype is somewhere between 0 and 1.

My solution
What I do instead is to compute predictions for the difference between the average phenotype prevalence across all patients in the new source and the phenotype prevalence for a particular predictor combination. To be able to compare it to actual data I take the prediction for each patient in the new dataset on the response scale and then group by all predictors (i.e. I take into account how many patients for each predictor combinations there are).

This somewhat works - have a look at a subset of the predictions:

The ranges are 50% and 95% credible intervals and each dot is what we have seen in the new dataset. This difference is better constrained, but since there are generally few patients per predictor combination, the posterior intervals are still very wide. (The maximum theoretical range is -1 to 1)

Can I do better? Can I combine those somehow? Or can I somehow fit the phenotype || source predictor separately to the new data and then use the samples only for the other predictors? Also this obviously would not scale if I increased the number of predictors or had continuous predictors. What would you do then?

Evaluating overall predictions
I had two ideas how to aggregate the above into something sensible. First let’s check the overall prediction trend:

Each dot is a single prediction (the difference in prevalence for a combination of predictors vs. the overall prevalence of the phenotype). Horizontal axis is what was observed, vertical axis is the predicted mean. The red line is the regression line across those and orange dashed line is ideal prediction (observed = predicted mean).

Seems like the model predicts something, but it is not very great (which is kind of to be expected of binary data in a clinical setting).

How to evalute the uncertainty in the model? I borrowed an idea from Simulation-Based Calibration. For each of the predictor combinations in the new data (there are 169 of those) I look how the observed value ranks among posterior samples, the overall distribution should be uniform. Now I do not have so many predictions, but I can look what happens as I divide the data into 13 bins:


looks uniform to me :-)

Technical note: since the predictions are discrete I choose the rank uniformly between all posterior samples that have the exact same value.

Also maybe I could treat each patient x phenotype as a single prediction instead of using predictor combinations? But then I would only get two possible predicted values…

Finally, we evaluate some specific conclusions we made from the original analysis - those are specific statements about the data mostly in the form “certain group of patient has larger prevalence of a phenotype than another one” and can be translated to a statement about a realization of the data. This can neatly be compared with the results of multiverse analysis we have run with the original dataset:

Each column here is a model, except for the validation results (against the single model we report in the main part of the paper) which are on the very right. “Mixed” means we see the pattern in one of the new source but not the other. This turned out really nice - basically the only things not seen in the validation datasets are those we had no evidence for in the first place.

So what do you think? Is there something more I can/should do?

Also note that the statistical analysis is not the only output of the study, there are is also some biological interepretation of why would the mutations differ and the model results are employed to constrain the biohemical relationships of the individual genes that are connected to the disease