Variable selection with ordinal model

Not in projpred at the moment. Regarding other approaches to do some sort of variable selection in your case I can’t really tell, as indeed it’s big enough that trying many combinations is infeasible.

Thanks… I sadly marked your answer as a “solution”

The smaller number of observations for C and A+ means some care is needed, but 32 predictors is not that much.

What’s the computation time for the full model?

Can you implement regularized horseshoe or some other sparsity assumption prior on the coefficients? There’s a [ Regression and Other Stories: Student][Regression and Other Stories: Student] example showing an easy case where RHS is able to help to detect the most useful predictors without need for extensive model space search or projpred. Of course it’s not always easy to know when the case is easy, and it also depends on what you want from the variable selection.

1 Like

The full model takes 12 hours to run and 4% of divergent transitions.
I am doing something very similar to what is shown in Regression and Other Stories: Student] and based on LOO-IC. Anyway, it takes a long time to compare the models…
By the way, I just discovered from your post the existence of the book “regression and other stories”. I will definitely buy it!

Hmmm… That’s a lot for both. Is the model code your own or brms generated code?

brms generated. I am working on this model -more or less successfully - for more than a year. see one of my previous posts on setting priors
I think the problem is not the slow computation: the model as it is struggles to distill the signal that is in the response variable. To improve I need to test various models and various combinations of variables. And this is time-consuming …

One approach would be to simulate from the posterior predictive distribution and use that as reference predictions, but project to normal regression model. This would make the projection and variable selection fast, and after finding the path through the model space, compute ordinal regression once for each model size, so there will be max 32 models to run.

1 Like

I do not understand it … but it sounds like a plan! :-)
If it’s not time-consuming for you (i do not went to be pushy) would you mind to help me in understanding the process step-by-step, so what we can sort out together would be publicly available here for a broader audience?
First of all: should I build some generated observation set and a dozen of generated variables (for more control), or start from an available real-case dataset?

Can you copy the brms generated model code here, so I can use the correct variable names

This is the Stan code with default priors (i.e. not horseshoe):

// generated with brms 2.13.5
functions {
  /* cumulative-logit log-PDF for a single response
   * Args:
   *   y: response category
   *   mu: latent mean parameter
   *   disc: discrimination parameter
   *   thres: ordinal thresholds
   * Returns:
   *   a scalar to be added to the log posterior
   */
   real cumulative_logit_lpmf(int y, real mu, real disc, vector thres) {
     int nthres = num_elements(thres);
     real p;
     if (y == 1) {
       p = inv_logit(disc * (thres[1] - mu));
     } else if (y == nthres + 1) {
       p = 1 - inv_logit(disc * (thres[nthres] - mu));
     } else {
       p = inv_logit(disc * (thres[y] - mu)) -
           inv_logit(disc * (thres[y - 1] - mu));
     }
     return log(p);
   }
  /* cumulative-logit log-PDF for a single response and merged thresholds
   * Args:
   *   y: response category
   *   mu: latent mean parameter
   *   disc: discrimination parameter
   *   thres: vector of merged ordinal thresholds
   *   j: start and end index for the applid threshold within 'thres'
   * Returns:
   *   a scalar to be added to the log posterior
   */
   real cumulative_logit_merged_lpmf(int y, real mu, real disc, vector thres, int[] j) {
     return cumulative_logit_lpmf(y | mu, disc, thres[j[1]:j[2]]);
   }
  /* ordered-logistic log-PDF for a single response and merged thresholds
   * Args:
   *   y: response category
   *   mu: latent mean parameter
   *   thres: vector of merged ordinal thresholds
   *   j: start and end index for the applid threshold within 'thres'
   * Returns:
   *   a scalar to be added to the log posterior
   */
   real ordered_logistic_merged_lpmf(int y, real mu, vector thres, int[] j) {
     return ordered_logistic_lpmf(y | mu, thres[j[1]:j[2]]);
   }
}
data {
  int<lower=1> N;  // number of observations
  int Y[N];  // response variable
  int<lower=2> nthres;  // number of thresholds
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K;
  matrix[N, Kc] Xc;  // centered version of X
  vector[Kc] means_X;  // column means of X before centering
  for (i in 1:K) {
    means_X[i] = mean(X[, i]);
    Xc[, i] = X[, i] - means_X[i];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  ordered[nthres] Intercept;  // temporary thresholds for centered predictors
}
transformed parameters {
  real<lower=0> disc = 1;  // discrimination parameters
}
model {
  // initialize linear predictor term
  vector[N] mu = Xc * b;
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 0, 2.5);
  // likelihood including all constants
  if (!prior_only) {
    for (n in 1:N) {
      target += ordered_logistic_lpmf(Y[n] | mu[n], Intercept);
    }
  }
}
generated quantities {
  // compute actual thresholds
  vector[nthres] b_Intercept = Intercept + dot_product(means_X, b);
}

A quick note – the text directly after the first figure on https://avehtari.github.io/ROS-Examples/Student/student.html seems to be wrong.

1 Like

Thanks, fixed (github.io seems to slow to show the updated html, but it’s in the repo)

1 Like

For the cumulative ordinal with logit the latent model is close to normal linear with sd=1.6. Here’s a quick draft what to do, but I didn’t have time to check all details. If you have problems, @paul.buerkner said he is happy to help to get this working.

See “Custom reference models” section in projpred Quick Start. Create the reference model as (something like)

predfun <- function(xt) t( b %*% t(xt) )

where b is the b in the model. And then init the reference model (something like)

sigma <- rep(1.6,S)
Emu <- colMeans(mu)
ref <- init_refmodel(Xc, Emu, gaussian(), x=x, predfun=predfun, dis=sigma)

where S is the number of draws, and mu is the posterior draws of mu. Emu is not used for the projection, but is used for RMSE computation.

3 Likes

Thanks a lot to you and to @paul.buerkner too.
I will try it in few hour (I have field work until sunset…) so later I will post here results and discussion.
stay tuned… :-)

I am looking forward to seeing what happens. Please be aware that this is not a validated approach and we actually need to do research to better understand the implications, but it looks promising at I think.

As @AlejandroCatalina explained previously, I am aware of the limitations of this method For the work I have to do it does not have to be bulletproof from a theoretical point of view. Nonetheless, I’m here to help, and I would be happy if this could open an “approximate” path that could help those who, in the community, face a problem similar to mine.

The solution @avehtari proposes would be interesting to try and should give at least an approximate idea for the selection of the variables. So it indeed looks interesting!

So @avehtari, @paul.buerkner, @AlejandroCatalina … I took some time to play with the model and I will split all the workflow in multiple posts.
The following two models are the Stan code that I used to generate a reproducible dataset (generate_data.stan) and to make inference about the parameters (ordinal_regression.stan).
I wrote the “generative code” looting Michael Betancourt’s code (so, thanks @betanalpha ! )

generate_data.stan generate_data.stan (976 Bytes) :

data {
    int<lower = 0> N;  // number of observation
    int<lower = 0> M;  // number of predictors
    int<lower = 1> K;  // number of ordinal categories
    vector[K-1] c;     // cut points

}

transformed data {
    real<lower = 0, upper = 1> sig_prob = 0.05;
    
}
generated quantities {
    vector[M] beta; // slopes
    matrix[N, M] X; // predictors
    vector[N] mu;
    int y[N];
    
    for (m in 1:M) {
        if (bernoulli_rng(sig_prob)) {
            if (bernoulli_rng(0.5)) {
                beta[m] = normal_rng(10, 1);
            } else {
                beta[m] = normal_rng(-10, 1);
            }
            
        } else {
            beta[m] = normal_rng(0, 0.25);
        }

    }
    
    for (n in 1:N) {
        for (m in 1:M) {
             X[n, m] = normal_rng(0, 1); // all predictors are mean centered at 0 and standardized at 1
        }
    }
    
    mu = X * beta;
    
    for (n in 1:N) {
        y[n] = ordered_logistic_rng(mu[n], c);
    }
}

and ordinal_regression.stan ordinal_regression.stan (502 Bytes)

data {
    int<lower = 0> N;           // number of observation
    int<lower = 0> M;           // number of predictors
    int<lower = 1> K;           // number of ordinal categories
    int<lower=1, upper=K> y[N]; // Observed ordinals
    matrix[N, M] X;             // predictors

}

parameters {
    vector[M] beta; // slopes
    ordered[K-1] c; // cut points
    
}

transformed parameters {
    vector[N] mu = X * beta;
}

model {
    c ~ student_t(3, 0, 2.5);
    y ~ ordered_logistic(mu, c);
}
1 Like

Then, with the following code I got slopes 30 slopes (relevant and irrelevant), 500 ordinal score son a 5-value scale, the predictor matrix, and I let stan do the inference.

remove(list = ls())
options(mc.cores = parallel::detectCores())
set.seed(123)

library(rstan)
rstan_options(auto_write = TRUE)
library(projpred)

N = 500 # number of observations
M =  30 # number of predictors
K =   5 # number of ordinal categories

c = sort(runif(n = K - 1, min = -5, max = 5))

generative = stan_model("./generate_data.stan")

gen = sampling(generative, data = list(N = N, M = M, K = K, c = c), seed = 123, algorithm = "Fixed_param", iter = 1, chains = 1)
beta = as.numeric(extract(gen, pars = "beta")$beta)
X = matrix(data = as.numeric(extract(gen, pars = "X")$X), nrow = N, ncol = M)
y = as.integer(extract(gen, pars = "y")$y)

model = stan_model("./ordinal_regression.stan")
fit = sampling(model, data = list(N = N, M = M, K = K, X = X, y = y), seed = 123, chains = 4)

mu = extract(fit, pars = "mu")
Emu = colMeans(mu$mu)

beta_out = extract(fit, pars = "beta")
beta_out = colMeans(beta_out$beta)

the “true” \beta are the following:

beta
 [1]  -0.233045611  -0.082993293   0.085635290  -0.135351220  -0.469997342
 [6]   9.353118990   0.412084392   0.062155729   0.128925395   0.037598942
[11]   0.301502594   0.002788009  -0.075224657  -0.143522047   0.123930052
[16]   0.288065551   0.029104101  -0.023312894  -0.040097277  -0.174772110
[21]   9.476527099   0.008296223  -0.297025240  -0.221369662  -0.020534234
[26]   8.771793267 -12.259626932   0.456705693   0.185068152  -0.180819679

so, we should infer 4 relevant slopes (-12.259626932, 8.771793267, 9.353118990 , 9.476527099).
The followings are the modelled \beta:

eta_out
 [1]  -0.40676485   0.28088248  -0.07927991  -0.09320171  -0.80894627
 [6]  16.42724226   1.48307589   0.45670540  -0.39804118   0.39808276
[11]   0.74836613   0.24019307   0.30879249  -0.20156108   0.23619630
[16]   0.62722394   0.17567108  -0.24024768   0.02899408  -0.32040050
[21]  17.71848729   0.03945813  -0.32671575   0.17693676   0.00956709
[26]  15.72181533 -21.54576394   0.03640765   0.66576612  -0.59754649

So, \beta are not very precise, neither the cut-points c are. However the relevant slopes are still relevant after the inference. Maybe better priors helps…
Now, with the posterior predictive distribution, the reference model can be set.

1 Like

The following step is to set the reference model, in the way suggested by @avehtari.

mu = extract(fit, pars = "mu")
Emu = colMeans(mu$mu)

beta_out = extract(fit, pars = "beta")
beta_out = colMeans(beta_out$beta)

predfun = function(xt) t( beta %*% t(xt) ) 
sigma = rep(1.6, 4000)

ref = init_refmodel(X, Emu, gaussian(), x = X, predfun = predfun, dis = sigma)

To define the \sigma vector I set the nuber of iterations to 4000 (warmup+sampling), and as far as I understood the predictive projection, I used two times the predictor matrix X (one is the argumento of the predfun function) because I should project the X against the predicted \mu.

vs = varsel(ref, method = 'forward') finds 4 relevant \beta :

and the relevant sole are correctly identified:

selected = as.integer(vs$vind)[1:4]
beta_fitted = beta_out[selected]
[1] -21.54576  17.71849  16.42724  15.72182

So, the approximation works. However some useful commands do not work: for example:

suggest_size(vs)
[1] NA
Warning message:
In suggest_size(vs) :
  Could not suggest model size. Investigate varsel_plot to identify if the search was terminated too early. If this is the case, run variable selection with larger value for nv_max.

suggest_size(vs, stat = "rmse") works and identifies 6 relevant slopes. cv_varsel(vs) does not work either:


Error in importance_sampling_object(unnormalized_log_weights = log_weights,  : 
  is.matrix(unnormalized_log_weights) is not TRUE
In addition: Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

I got so far, so now I would like to hear comment from your side. Thanks to guiding me through all this!

4 Likes