Are stacking weights “leaking” information across folds?

Hey everyone,

I have a Bayesian model that can be written as p(y | x, \theta; t); it is basically a logistic regression, where \theta are the Bayesian parameters and t is a hyperparameter. Instead of fixing t, I wanted to follow the Bayesian model stacking approach described in Yao et al. (2018).

I currently make a 10-fold split of my data, and for each fold I train 20 models, each with a distinct value of t. To obtain the stacking weights w_k, I solve the optimization problem of equation (4) from Yao et al. (2018), using the logarithmic score:

\max_{w \in S_1^K} \; \frac{1}{n}\sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, p\!\left(y_i \mid y_{-i}, M_k\right),

where S_1^K is the K-dimensional probability simplex. I adapted the formulas slightly for the 10-fold case instead of LOO (so p(y_i | y_{-i}, M_k) becomes the predictive density of model M_k for observation i with the fold containing i left out during training), since in my specific case the Pareto smoothing approach for LOO was not working (Pareto k > 0.7) and exact LOO was unfeasible.

My question is whether the resulting stacked predictive density can be considered overfitted to the data. Even though each model is always trained with one fold left out, the weights w_k are learned by solving the optimization problem on the left-out folds, so I am afraid this could introduce some level of overfitting. I imagine this depends on the number of models being stacked and the number of data points (in my case on the order of a few thousand, so it should be more or less safe), but I am wondering whether this is an issue in general.
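For concreteness, once the matrix of held-out log predictive densities is assembled from the 10-fold fits, the optimization above can be solved numerically. A minimal sketch (the `lpd` matrix here is a synthetic placeholder standing in for the real cross-validated log densities), using a softmax parametrization so the simplex constraint disappears:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

# lpd[i, k] stands in for log p(y_i | y_{-fold(i)}, M_k) from 10-fold CV.
# Here it is synthetic; in practice it comes from the held-out fits.
rng = np.random.default_rng(0)
n, K = 1000, 20
lpd = rng.normal(-0.7, 0.3, size=(n, K))

def neg_log_score(z):
    # Map unconstrained z to simplex weights w, then compute
    # -(1/n) sum_i log sum_k w_k p(y_i | y_{-i}, M_k), done stably in log space.
    w = softmax(z)
    return -np.mean(logsumexp(lpd + np.log(w), axis=1))

res = minimize(neg_log_score, np.zeros(K))
weights = softmax(res.x)
```

The softmax parametrization has one redundant degree of freedom but keeps the problem unconstrained and smooth; a constrained solver over the simplex (e.g. SLSQP with an equality constraint) would work equally well.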

In particular, once I compute the stacked predictive density, I have to make two model comparisons:

  • One model comparison is against the model with t = 0, so this would be comparing the stacked predictive density against one of the models included in the stack. (I guess there could be a more “Bayesian” way of doing this comparison.)

  • The second comparison is against another state-of-the-art model used in the field. This model is not among the models in the stack, so it is not of the form p(y | x, \theta; t).

Would it be fair to compare the predictive performance of the stacked predictive density against other models’ predictive performance on this data set?

Hi, @Federico_Billeci and welcome to the Stan forums.

I don’t know if @yuling is still hanging out on the forums, as he’d be the best person to answer questions about stacking. Otherwise, I’d try his co-authors, @avehtari and @andrewgelman.


I have not tried to follow every detail in the above message, but, speaking in general terms, I think the idea of predictive model averaging (stacking) should work just fine with 10-fold cross-validation.

But I’m not quite sure why you would try 10 different values for t. Why not just add t as another parameter in the model? If you do this, you just might need to think carefully about your joint prior for theta and t.


You are right

You need another level of cross-validation to get an estimate of the predictive performance of the stacked model: inside each fold of the outer cross-validation, you run your 10-fold-CV stacking on the training portion only. Comparing different models using that outer cross-validation is then valid.
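The nested scheme can be sketched as follows. This is only an illustration of the loop structure, not the actual model: `model_log_density` is a hypothetical stand-in for "refit M_k on the training indices and evaluate its log predictive density on the held-out indices" (here a trivially smoothed Bernoulli rate indexed by k), so that the weights never see the outer test fold:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
n, K = 600, 5
y = rng.integers(0, 2, size=n)  # synthetic binary outcomes

def model_log_density(train_idx, eval_idx, k):
    # Stand-in for "fit model M_k on train_idx, score eval_idx":
    # a Bernoulli rate with a k-dependent amount of smoothing.
    p = (y[train_idx].sum() + 0.5 * (k + 1)) / (len(train_idx) + k + 1)
    return np.where(y[eval_idx] == 1, np.log(p), np.log1p(-p))

def stack_weights(lpd):
    # Log-score stacking weights via a softmax parametrization of the simplex.
    obj = lambda z: -np.mean(logsumexp(lpd + np.log(softmax(z)), axis=1))
    return softmax(minimize(obj, np.zeros(lpd.shape[1])).x)

def folds(idx, n_folds):
    return np.array_split(rng.permutation(idx), n_folds)

outer_scores = []
for test in folds(np.arange(n), 10):
    train = np.setdiff1d(np.arange(n), test)
    # Inner 10-fold CV on the training data only: build the held-out
    # lpd matrix and learn the weights without touching the test fold.
    row = {i: r for r, i in enumerate(train)}
    inner_lpd = np.empty((len(train), K))
    for held in folds(train, 10):
        fit = np.setdiff1d(train, held)
        for k in range(K):
            inner_lpd[[row[i] for i in held], k] = model_log_density(fit, held, k)
    w = stack_weights(inner_lpd)
    # Score the stacked predictive density on the untouched test fold.
    test_lpd = np.column_stack(
        [model_log_density(train, test, k) for k in range(K)])
    outer_scores.append(logsumexp(test_lpd + np.log(w), axis=1))

elpd_stacked = np.concatenate(outer_scores).mean()
```

`elpd_stacked` is then an honest estimate of the stacked model's out-of-sample log score, directly comparable with the same outer-CV estimate computed for the t = 0 model or for the external state-of-the-art model.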


Hey Andrew, I have already tried this, and even though it works in some cases, the predictive performance is still better when I treat t as a hyperparameter.

In that case, why do stacking at all? Why not just do Bayesian inference with t as a hyperparameter?

Can you expand a bit on what you mean by Bayesian inference with t as a hyperparameter? I thought what I was doing already fell under this case.

Thank you very much!!

Hi, yes, I’m just saying you could do full Bayesian inference, with no stacking required at all. You should then look at the posterior for t and make sure it makes sense. When you add a new parameter to a model, an independent prior doesn’t always make sense; the key is that the prior on the other parameters in the model is informative. As Aki reminded me, the cases where stacking outperforms Bayesian model averaging are typically those where the priors on the individual models are too weak.