Predictive projection for linear model with student-t noise: clarifying question(s)

Hello All,

Context
I am trying to apply the projection approach to variable selection outlined in the wonderful papers by Piironen, Vehtari, and Paasiniemi. However, I am using a student-t noise model, so the simplifications applicable to the exponential family don’t immediately hold (as far as I can tell).

In particular, I am seeking clarification for the KL-minimisation / likelihood-maximisation step for fitting the submodel parameters \theta_\bot:

\begin{aligned} \theta_\bot &= \arg \underset{\theta \in \Theta}{\min} KL \left[ p(\tilde{y}|\theta_* )\ ||\ p(\tilde{y}|\theta) \right] \\ &= \arg \underset{\theta \in \Theta}{\max} \mathrm{E}_{\tilde{y}|\theta_*} \left[ p(\tilde{y}|\theta) \right] \end{aligned}\\
This is Equation 8 in Piironen et al. (2018).

Questions

  • When performing this likelihood optimisation, should I be using \nu and \sigma (student-t noise model parameters) from the reference model fit \theta_*, or should they be part of the maximisation along with the predictor coefficient estimates?

  • For the clustered approach described in section 3.3 of the above paper, is the intent to perform likelihood maximisation against the prediction posterior mean of the samples of the reference model within that cluster? Or to perform likelihood maximisation using all individual reference model samples within that cluster?

    • From what I can tell, the former (using posterior mean \tilde{y}) is sound for exponential family, but not otherwise. On the other hand, the latter option doesn’t seem to offer much of an efficiency benefit compared to the draw-by-draw approach. Both of these approaches lead to the single-point or draw-by-draw approaches in the cases where no. clusters = 1 or S respectively.

(I am using cmdstan with R and building models to be used in Python environment too, so at this stage I do not want a brms/rstanarm/projpred -reliant solution)

Any advice greatly appreciated!

Hi, although you don’t want to use R, you could look into projpred R code for the details how Student-t family is handled there (e.g. projpred/families.R at master · stan-dev/projpred · GitHub). Pinging also @fweber144

1 Like

Thank you. Yes, there is some code for the Student-t family in projpred. I think it goes back to issue #3. However, there are a few things to clarify in these lines, see also PR #253. That’s one of the reasons why the Student-t family is considered as “experimental” in projpred (see PR #252 and also PR #233).

1 Like

Thanks!

Thank you!

So, as far as I can tell, nu is not part of the projection optimisation? I.e. student-t projection weights are computed for a given nu ?

I’m not (yet) familiar with the details of the Student-t family in projpred. In issue #3 (linked above), @jpiironen writes:

The student-t is not a log-concave likelihood, and thus the implementation using the same framework as for other likelihoods is likely to cause some problems. However, there is another idea of doing the projection similarly as for the Gaussian case by expressing the t-distribution as a scale mixture of Gaussians, and then using the Gaussian projection with the observations weighted by their precisions. This is not the same as measuring the KL-divergence between the t-distributions but should be easy to implement and might give reasonable results.

When performing the projection as outlined in Piironen et al. (2020) and not using the scale-mixture approach mentioned in issue #3, I think \nu and \sigma from your initial post should be part of the maximization, so they should not simply be used from the reference model’s posterior draws (correct me, @avehtari, if I am wrong here). If you want to take the scale-mixture approach instead of performing the projection as outlined in Piironen et al. (2020), you need to take a look at the implementation in projpred. Sorry for not being able to provide more information on the scale-mixture approach for now. If you do take the scale-mixture approach and find anything in projpred which needs to be corrected, suggestions are very welcome.

1 Like

Concerning your second question

For the clustered approach described in section 3.3 of the above paper, is the intent to perform likelihood maximisation against the prediction posterior mean of the samples of the reference model within that cluster? Or to perform likelihood maximisation using all individual reference model samples within that cluster?

from the initial post: I don’t understand what you mean by

likelihood maximisation using all individual reference model samples within that cluster

In this line, you can see that projpred takes the (weighted) average across the draws, which gives \mu_*^c from underneath equation (18) of Piironen et al. (2020) (but note that in section 3.5 of Piironen et al. (2020), a Gaussian model is considered for simplicity). Does that answer your question?

1 Like

Thank you for following up on this!

The student-t is not a log-concave likelihood, and thus the implementation using the same framework as for other likelihoods is likely to cause some problems.

Is the problem here the stability of finding the global optimum? I have been using a Gaussian (lm()) estimate for the initialisation and the max likelihood optimisation seems to be reasonably tight.

I think \nu and \sigma from your initial post should be part of the maximization, so they should not simply be used from the reference model’s posterior draws.

OK, would you mind explaining the rationale for this? I understand from a pragmatic perspective, since there is no guarantee that the reference model would even have \nu and \sigma parameters to reference. On the other hand, we are trying to minimise KL-divergence with respect to p(\tilde{y}) from the reference model, and the reference \nu, and \sigma seem relevant for this. (or are they accounted for simply by integration over the reference model samples?)

If you want to take the scale-mixture approach…

Thanks for pointing this out. I think I will try this if the MLE approach does not do well in simulation.

Ah ok, so I was trying to ask whether using \mu_*^c (this is what I meant by using the posterior mean within each cluster) is only a simplification that applies to the exponential family.

By this:

likelihood maximisation using all individual reference model samples within that cluster

I meant performing the likelihood maximisation using the cluster samples directly, without first aggregating to \mu_*^c . For example:

Cluster mean approach

yref = n \times S matrix of reference model predictions
x = n \times k matrix of predictors
ind = subset of predictors for subodel
cl_ix= indices of samples belonging to the given cluster
nus = reference model \nu samples (length S)
sigmas = reference model \sigma ssamples (length S)
(just for now assuming using reference model \nu and \sigma is correct).

n = nrow(x)
d = length(ind)

# Taking mean over reference model samples belonging to cluster
y_m = rowMeans(yref[, cl_ix])       # \mu_*^c
nu_m = mean(nus[cl_ix])             # \nu_*^c
sigma_m = mean(sigmas[cl_ix)        # \sigma_*^c

fit_cl <- lmt_proj$optimize(
 data = list(n = n, 
             d = d, 
             s = 1, 
             x = x[, ind], 
             y = y_m, 
             nu = nu_m,  
             sigma = sigma_m)
  )

# log-prob for this projection
lp_cl = fit_cl$lp()
# weight for this projection (used after all clusters)
w_cl = length(cl_ix) / nrow(yref)

Non-mean approach

# Subset reference model samples  belonging to cluster
y_cl = yref[, cl_ix]  
nu_cl = nus[cl_ix]
sigma_cl = sigmas[cl_ix]
n_cl = length(cl_ix)

fit_cl <- lmt_proj$optimize(
 data = list(n = n, 
             d = d, 
             s = n_cl, 
             x = x[, ind], 
             y = y_cl, 
             nu = nu_cl,  
             sigma = sigma_cl)
  )

w_cl = n_cl / nrow(yref)   # model weight for aggregating across clusters
lp_cl = fit_cl$lp() / n_cl # to make comparable to cluster mean approach

In both approaches I would take weighted average across the result from each cluster to get \theta point estimate and log probability estimate.

Stan

// lmt_proj.stan

data {
    int<lower=0> n; // number of observations
    int<lower=0> d; // number of predictors
    int<lower=0> s; // number of reference model samples
    matrix[n, s] y; // reference predictive response samples
    matrix[n, d] X; // predictors
    vector<lower=0>[s] nu; // reference posterior dof samples
    vector<lower=0>[s] sigma; // reference posterior sigma samples
}

parameters {
  vector[d] w;
}

model {
  // wide priors on weights
  w ~ normal(0, 100);

  // observation model - likelihood summed over all samples
  for (i in 1:s) { 
    target += student_t_lpdf(y[, i] | nu[i], X * w, sigma[i]);
  }
}

You would have to ask this to @jpiironen. Sorry that I can’t help out here.

Simply because \nu and \sigma are parts of \theta.

Sorry, I don’t get this. My understanding is that you have a Student-t reference model and you want to project it onto Student-t candidate models. So the reference model as well as the candidate models should have \nu and \sigma as parameters.

1 Like

Thanks for the code, that helps to avoid misunderstandings. But I don’t think it solves the projection problem as outlined in Piironen et al. (2020), most importantly because I think you are skipping the expectations (see, e.g., equation (12) from Piironen et al., 2020).

1 Like

It should be the expectation up to a normalization constant, I think (i.e. it would be the expectation if we divide through by S_c). Averaging \tilde{y}_i across cluster samples first seems to assume that maximising the expectation of the log probability, is the same as maximising the log probability of the expectation, which is what I thought only held for the exponential family. I’ll to review further.

In any case, thanks for the help and explanations.
Much appreciated!

Simply because \nu and \sigma are parts of \theta

OK, got it. Thanks.

My understanding is that you have a Student-t reference model…

Yes, that’s correct in this case, but there will be times when I will want to fit a student-t model to some other reference model. (Just to clarify, this was an argument in favour of your position i.e. fitting \nu and \sigma as part of the submodel optimisation.)

Thanks again!

If you mean the fact that equation (12) from Piironen et al. (2020) can be reduced to equation (16), then you are right: In general, this only holds for an exponential family. The Student-t family is not an exponential family, so you would have to check if you can simplify equation (12) manually.

1 Like

Ok, then we are probably meaning the same thing. Just to clarify: Even if the reference model does not have \nu and \sigma as parameters (e.g., because the reference model is a Gaussian one), \nu and \sigma are still parameters in the Student-t candidate models, so you need to include \nu and \sigma in the maximization.

1 Like

In any case, you have an interesting approach and thanks for sharing your thoughts. Perhaps this gets important in the future.

Thank you for all the help - much appreciated!

This depends on the data. I guess you have large n and not very thick tails, so it can work well for you, but in general it’s not stable enough to be part of generic package.

The ML-approach is a special case, but in other cases you can mix distributions and just minimize KL. It would be trivial to numerically compute KL between two arbitrary continuous distributions with the same support, and optimize that with gradient free algorithm, but that can be slow. We have prioritized cases with fast computational solutions, but it could be useful (and worth experimenting) to allow target models to have overdispersed observation models (e.g. projection Gaussian reference to a student-t). Happy to learn more if you can share more about your project.

Adding that you could probably do the Student-t projection also by using the latent space approach ([2109.04702] Latent space projection predictive inference) to get fast computation in the variable selection phase, and then only for the selected model to find out the projection to \sigma and \nu

1 Like

Ah OK, good to know! I am dealing with financial data (asset return series), so tails can be quite thick but n is indeed reasonable. (I admit that when I said the optimum “seems to be reasonably tight” it definitely wasn’t based on a rigorous analysis!).

And this optimisation would be more stable than the ML case with student-t?
I just tried this (minimising KLD) using optim()'s Nelder Mead with lm() initialisation, and seems OK in a very small example, so might experiment with this further.

Oh, why over-dispersed observation model and not the other way around? I.e. why not a student-t reference model to a Gaussian projection?

I am trying to address a number of tasks that come up in finance quite often, and can all be helped with extension of predictive projection to student-t linear model.

  • Estimation of replicating portfolios (modelling one return series in terms of others, in order to create a proxy) when there are a large number of potentially relevant exposures and we want only the most important subset.
  • Variable selection with large number of predictors and moderate n for straight-forward forecasting with fat tail-tailed response.
  • Shrinkage of correlation matrices through variable selection for robust portfolio optimisation.

Thanks! I only came across this paper yesterday. So if I understand correctly, the link function in my case is just the identity, so it would be the same optimisation for the projection, but without \nu and \sigma? I.e. minimising KLD between X w and Xw_\perp using \nu and \sigma from the reference model?

Thanks for your time!

Thanks for telling this. Do you have or know an open access data that would have similar properties as your data (assuming it’s not public)? We’re interested to look more into thick tailed models.

At least it would be more principled. It probably doesn’t remove the possibility of mutlimodality and possibility of some computational issues.

Cool, please tell if you get more results. If the degrees of freedom are the same there is an analytic form for KL, and if they are different there is an upper bound A Novel Kullback–Leibler Divergence Minimization-Based Adaptive Student's t-Filter | IEEE Journals & Magazine | IEEE Xplore

If minimizing KL is too slow, you may also consider that in general use of reference model in variable selection is helpful even if the approach would not be optimal Using reference models in variable selection | SpringerLink

Cool

Yes. But then after the selection, you need to do the projection to nu and sigma, too, but that computation is then done only for one model.

1 Like