Can projpred be used with a latent dependent variable?

I’d like to use projpred to answer a substantial question about predictor variable importance, in line with The dependent variable in my regression model is a latent variable obtained through an IRT model of responses to a questionnaire, and so I couldn’t figure out how to apply the init_refmodel function of projpred. The y argument of init_refmodel only accepts a vector of values, but I only have posterior draws.

Some minor points about the model in itself: The multilevel structure is due to parents being nested within adolescents. There is some imputation of the outcome variable going on as well.

Loo gives a lot of high pareto-k values, which I think is due to the very small cluster size for the varying intercepts, so I have been using K-fold CV in stead, and planned to use that for the projpred. I previously got some useful feedback on that here K-fold CV in linear regression model with varying intercepts and very small clusters

  int<lower=1> N;
  int n_id;
  int<lower=1> id[N];
  int<lower=1> K;
  matrix[K,N] x;
  int cbq_items;
  //index vector for cases with no missing data on CBQ
  int ind_obs_cbq[N]; 
  int nobs_cbq;
  //index vector for cases with single items missing on CBQ
  int ind_mis_cbq[N];
  int nmis_cbq;
  //index vector for cases with no response to CBQ
  int ind_no_cbq[N];
  int nno_cbq;
  int<lower=0, upper=1> cbq_obs[nobs_cbq,cbq_items]; 
  int<lower=-1, upper=1> cbq_mis[nmis_cbq,cbq_items];

transformed data{
  matrix[N, K] Q = qr_Q(x')[,1:K] * N;
  matrix[K, K] R = qr_R(x')[1:K, ] / N;
  matrix[K, K] R_inv = inverse(R);

  vector[n_id] id_a_var;
  real<lower=0> sigma_a_var;
  vector[K] beta_tilde; // coefficients on Q
  real<lower=1> nu;
  real<lower=0> sigma;
  vector[cbq_items] cbqbeta;
  vector<lower=0>[cbq_items] cbqalpha;
  vector[nobs_cbq] cbqtheta_obs;
  vector[nmis_cbq] cbqtheta_mis;
  vector[nno_cbq] cbqtheta_no;

transformed parameters{
  vector[K] beta = R_inv * beta_tilde;

  //collected parameter vectors for outcome and varying intercept
  vector[N] y;
  vector[N] a_var;
  for(i in 1:N){
    a_var[i] = id_a_var [id][i] * sigma_a_var;
    if (ind_obs_cbq[i]>0) 
      y[i] = cbqtheta_obs[ind_obs_cbq[i]];
    else if (ind_mis_cbq[i]>0) 
      y[i] = cbqtheta_mis[ind_mis_cbq[i]];
      y[i] = cbqtheta_no[ind_no_cbq[i]];}

  id_a_var ~ std_normal();
  sigma_a_var ~ normal(0,5);
  beta ~ normal(0,1);
  sigma ~ normal(0,1);
  cbqtheta_obs ~ std_normal();
  cbqtheta_mis ~ std_normal(); 
  cbqbeta ~ normal(0,3);
  cbqalpha ~ gamma(2,0.5);
  nu ~ gamma(2,0.1);

  //irt model
  for(i in 1:nobs_cbq){
     for (j in 1:cbq_items){
  	cbq_obs[i,j] ~ bernoulli_logit(cbqalpha[j]*(cbqtheta_obs[i] - cbqbeta[j]));
  for(i in 1:nmis_cbq){
      for(j in 1:cbq_items){
           cbq_mis[i,j] ~ bernoulli_logit(cbqalpha[j]*(cbqtheta_mis[i] - cbqbeta[j]));

  y ~ student_t(nu, Q * beta_tilde + a_var, sigma);

generated quantities{
vector[N] log_lik;
vector[N] ppc;
for (n in 1:N){
  log_lik[n] = student_t_lpdf(y[n] | nu, a_var[n] + x'[n,] * beta, sigma);
  ppc[n] = student_t_rng(nu, a_var[n] + x'[n,] * beta, sigma);}

The usual case has also posterior draws. See
" predfun is a function that computes the expected value for the target variable separately for each posterior draw given the new predictor values"
If your posterior draws are already predictions, then just make an appropriate predfun. y in init_refmodel is used to compute log_lik’s for loo. If you can compute log_lik in some other way, then modify init_refmodel function so that you don’t need to give y. If this is not clear, please ask more.

1 Like

Sorry for not replying and acknowledging the helpful response - I’ve been away on summer vacation.

I think I understand now. I can just pass posterior draws for the predicted values from fitting the reference model to the predfun argument of init_refmodel. I hadn’t really taken into consideration that the projection targets being as good as the predictions of the reference model, not predicting the observations. Understanding that made things come together in my mind.

Edit: No, that wasn’t right. It needs to be a function, not the output of such a function. Trying again.

I’m using exact leave-one-cluster-out crossvalidation to compute log_lik (holding out a cluster at a time, as my clusters are of size 1-2), and I modified my code to extract the posterior draws for the dispersion parameter and the predicted values for each fit, and store those in an object conforming to the specification of the cvfits. I think I’ll get it to work.

I’ll post the code when I get it to work properly, in case it should be useful for someone. Thank’s for the help, @avehtari, your generous and friendly attitude is an example of this community at its best!

1 Like

Read more carefully, and understood that I need to look beneath the hood of the init_refmodel function, and simply generate the refmodel with my own code, ensuring that it holds up to requirements and has all the necessary pieces of information (such as the log_likelihood).

I’ll take the liberty of asking more, @avehtari, as I have struggled to get my head around this.

I tried my hand at modifying the init_refmodel function to not need y, and got it to make a refmodel object. I had to find the k_helper function on github, though, as it is not in the namespace of projpred. But it seems the varsel function also expected to find those values (y) in the refmodel object, and I was a bit stumped.

After giving it a bit more thought, I think I might have come to a solution. I recently changed the formatting of my questionnaire responses to long format, giving me a long vector of responses in stead of a matrix, and just indexed the item and person parameters accordingly in the stan model. I did this to vectorise the code for efficiency, but then I realized I could perhaps change my predictor matrix to long format as well for projpred, and then simply make a predfun incorporating the IRT model (using the estimated item parameters and replacing the estimate of person ability with the expected value from the fitted linear model of the predictors), allowing me to use the actual vector of dichotomous responses as y.

I’d then make a cvfits object by running leave-one-cluster out crossvalidation, leaving out all the responses of the one or two cases in each cluster in turn, and calculating the log-likelihood of the individual observations conditional on the expected probability using bernoulli_logit_lpmf.

Could this be a valid approach? I realise that it’s departing from the premise of the thread, but perhaps the solution is to go back to the actual observations?

@paul.buerkner has more recently been working on the projpred internals, maybe he has some idea?

It may probably be best to wait for our new propred version which (a) provides a complete refactor of the projpred internals and (b) supports multilevel models (to which most IRT models belong).

I am afraid, I cannot judge either whether or not your approach will do anything sensible with the current projpred version.

1 Like

Ok, that’s good to know. Looking forward to the next version!

@erognli Thanks for this discussion. I have similar questions like yours. Hope that the new version of projpred could help me too

What’s the state of the current version? I’ve posted a number of patches for it, and some pull requests are still open. I thought it was a matter of summer holidays, but perhaps there’s no more interest in developing the current version?

As for the new version, where is this being developed? Is it possible to contribute?

@fabio Glad to hear it! I found one your threads (Relative importance of covariate groups) very useful as well, so I guess there’s some similarity in what we’re working on.

1 Like

As @paul.buerkner and @avehtari mentioned, we are developing a new version of projpred that affects the internals and models supported, adding broader capabilities. Nonetheless, I took a look at most of your changes and they seem to affect printing format, checking of parameters, etc. I think those can be easily included in the new version, but we don’t have a fixed release date, and as of right now (and probably until the first version is released) we are developing it offline to make sure that everything falls in place, so it is not possible to publicly contribute just yet (I am now on summer holidays as well so I will be unresponsive until 1.9).

I will reread the thread when I come back from my summer holidays in case I can contribute some more insights into the new version and to how these changes would affect the original question regarding the possibility of y being a matrix of posterior draws.

Sorry for the slow reactions. In addition of holidays, it’s the workshop season and Juho Piironen defended his thesis in the end of May and is now in industry. I guess everyone was hoping someone else would answer and describe the situation.

I had hoped we would have new open branch before StanCon, but alas it will happen only after Alejandro comes back from his holiday. We have some open methodological research issues, which eventually can affect the design, and the the current code has some very experimental parts. We’re happy to get more people involved and it’s clear your application is also a great example for testing whether the new design is flexible enough.

Thanks Aki and Alejandro for the update. I have more patches to provide whenever the backlog clears. I’ll be looking forward to trying out the new version whenever it comes.