Hello All,
Context
I am trying to apply the projection approach to variable selection outlined in the wonderful papers by Piironen, Vehtari, and Paasiniemi. However, I am using a Student-t noise model, so the simplifications applicable to the exponential family don't immediately hold (as far as I can tell).
In particular, I am seeking clarification on the KL-minimisation / likelihood-maximisation step for fitting the submodel parameters \theta_\bot:
\begin{aligned}
\theta_\bot &= \arg \underset{\theta \in \Theta}{\min} \, \mathrm{KL} \left[ p(\tilde{y}|\theta_*)\ ||\ p(\tilde{y}|\theta) \right] \\
&= \arg \underset{\theta \in \Theta}{\max} \, \mathrm{E}_{\tilde{y}|\theta_*} \left[ \log p(\tilde{y}|\theta) \right]
\end{aligned}
This is Equation 8 in Piironen et al. (2018).
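To fix notation for the questions below, here is a minimal sketch of how I currently read this step for a Student-t likelihood, approximating the expectation in Equation 8 by Monte Carlo draws of \tilde{y} from the reference model's predictive distribution. All names (`fit_submodel`, `project_one_draw`, the argument names) are my own, and whether \sigma and \nu belong in the optimisation at all is exactly what I ask in question 1:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import t as student_t

def fit_submodel(X_sub, y_targets, sigma0=1.0, nu0=5.0):
    """Maximise the average Student-t log-likelihood of the submodel over
    the rows of y_targets (each row is one realisation of y-tilde).
    Note: this optimises beta *together with* log(sigma) and log(nu - 1);
    fixing sigma and nu at the reference values is the alternative."""
    n, p = X_sub.shape

    def neg_avg_loglik(par):
        beta, log_sigma, log_nu = par[:p], par[p], par[p + 1]
        sigma, nu = np.exp(log_sigma), np.exp(log_nu) + 1.0  # keep nu > 1
        mu = X_sub @ beta  # (n,), broadcasts against each row of y_targets
        return -student_t.logpdf(y_targets, df=nu, loc=mu, scale=sigma).mean()

    x0 = np.concatenate([np.zeros(p), [np.log(sigma0)], [np.log(nu0 - 1.0)]])
    return minimize(neg_avg_loglik, x0, method="L-BFGS-B").x

def project_one_draw(X_sub, mu_ref, sigma_ref, nu_ref, n_tilde=200, seed=0):
    """Draw-by-draw projection of a single reference draw s: Monte Carlo
    samples y-tilde ~ p(.|theta*^(s)) stand in for the expectation in Eq. 8."""
    rng = np.random.default_rng(seed)
    n = mu_ref.shape[0]
    y_tilde = mu_ref + sigma_ref * rng.standard_t(nu_ref, size=(n_tilde, n))
    return fit_submodel(X_sub, y_tilde, sigma0=sigma_ref, nu0=nu_ref)
```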
Questions
- When performing this likelihood maximisation, should I use \nu and \sigma (the Student-t noise model parameters) from the reference model fit \theta_*, or should they be part of the maximisation along with the predictor coefficients?
- For the clustered approach described in Section 3.3 of the above paper, is the intent to perform the likelihood maximisation against the posterior-mean prediction of the reference-model draws within each cluster, or against all individual reference-model draws within each cluster? (Both readings are sketched after this list.)
- From what I can tell, the former (projecting onto the posterior-mean \tilde{y}) is sound for the exponential family but not otherwise, while the latter doesn't seem to offer much of an efficiency benefit over the draw-by-draw approach. Both reduce to the single-point and draw-by-draw approaches when the number of clusters is 1 or S, respectively.
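To make the two readings concrete, here is how I would implement each, reusing `fit_submodel` from the sketch above. Again, all names are hypothetical, and k-means on the predictive means is just one plausible clustering choice:

```python
import numpy as np

# fit_submodel as defined in the sketch above;
# labels could come from, e.g., sklearn.cluster.KMeans on mu_draws.

def project_cluster(X_sub, mu_draws, sigma_draws, nu_draws,
                    labels, k, n_tilde=50, seed=0):
    """Project the reference draws in cluster k under both readings.

    mu_draws:    (S, n) predictive means, one row per posterior draw
    sigma_draws: (S,) Student-t scales; nu_draws: (S,) degrees of freedom
    Returns (theta_a, theta_b) for interpretations (a) and (b).
    """
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(labels == k)
    n = X_sub.shape[0]
    sigma0, nu0 = sigma_draws[idx].mean(), nu_draws[idx].mean()

    # (a) single target: the within-cluster posterior-mean prediction
    y_mean = mu_draws[idx].mean(axis=0)[None, :]
    theta_a = fit_submodel(X_sub, y_mean, sigma0=sigma0, nu0=nu0)

    # (b) pooled Monte Carlo targets from every draw in the cluster
    y_pool = np.concatenate([
        mu_draws[s] + sigma_draws[s] * rng.standard_t(nu_draws[s],
                                                      size=(n_tilde, n))
        for s in idx
    ])
    theta_b = fit_submodel(X_sub, y_pool, sigma0=sigma0, nu0=nu0)
    return theta_a, theta_b
```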
(I am using CmdStan with R, and the models will also be used in a Python environment, so at this stage I do not want a brms/rstanarm/projpred-reliant solution.)
Any advice greatly appreciated!