Variable selection with ordinal model

Yes of course. I used the Stan models that @fabio provided earlier in this thread.

library(rstan)
library(projpred) # 2.0.0 devtools::install_github("stan-dev/projpred@develop")
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

set.seed(123)

N <- 500 # number of observations
M <-  30 # number of predictors
K <-   5 # number of ordinal categories

c <- sort(runif(n = K - 1, min = -5, max = 5))

generative <- stan_model("./generate_data.stan")

gen <- sampling(generative,
    data = list(N = N, M = M, K = K, c = c),
    seed = 123, algorithm = "Fixed_param", iter = 1, chains = 1
)
beta <- as.numeric(extract(gen, pars = "beta")$beta)
X <- matrix(
    data = as.numeric(extract(gen, pars = "X")$X),
    nrow = N, ncol = M
)
y <- as.integer(extract(gen, pars = "y")$y)

model <- stan_model("./ordinal_regression.stan")
fit <- sampling(model,
    data = list(N = N, M = M, K = K, X = X, y = y),
    seed = 123, chains = 4
)

mu <- extract(fit, pars = "mu")
Emu <- as.matrix(colMeans(mu$mu))

beta_out <- extract(fit, pars = "beta")$beta
beta <- as.matrix(colMeans(beta_out))

predfun <- function(fit, newdata = NULL) {
    if (is.null(newdata)) {
    newdata <- data
    }
    return(as.matrix(newdata[, -1]) %*% t(beta_out))
}

## compute empirical sd; too large
sigma <- apply(mu$mu, 1, sd)

colnames(X) <- paste0("x", seq_len(NCOL(X)))
data <- as.data.frame(cbind(y = Emu, x = X))
formula <- as.formula(paste("y ~", paste(colnames(X), collapse = " + ")))

.extract_model_data <- function(object, newdata = NULL, wrhs = NULL,
                                orhs = NULL, resp_form = NULL) {
  if (is.null(newdata)) {
    newdata <- data
  }

  if (inherits(wrhs, "formula")) {
    weights <- eval_rhs(wrhs, newdata)
  } else if (is.null(wrhs)) {
    weights <- rep(1, NROW(newdata))
  }

  if (inherits(orhs, "formula")) {
    offset <- eval_rhs(orhs, newdata)
  } else if (is.null(orhs)) {
    offset <- rep(0, NROW(newdata))
  }

  if (inherits(resp_form, "formula")) {
    y <- eval_rhs(resp_form, newdata)
  } else {
    y <- NULL
  }

  return(nlist(y, weights, offset))
}

extract_model_data <- function(object, newdata = NULL, wrhs = NULL,
                               orhs = NULL) {
  resp_form <- lhs(formula)
  args <- nlist(object, newdata, wrhs, orhs, resp_form)
  return(do_call(.extract_model_data, args))
}

ref <- init_refmodel(fit, data, formula, gaussian(),
    ref_predfun = predfun, div_minimizer = linear_mle,
    proj_predfun = linear_proj_predfun,
    extract_model_data = extract_model_data, dis = sigma
)

vs <- varsel(ref, method = "forward")
cvs <- cv_varsel(ref, method = "forward")

## fix sigma
sigma <- rep(1.6, 4000)

ref <- init_refmodel(fit, data, formula, gaussian(),
    ref_predfun = predfun, div_minimizer = linear_mle,
    proj_predfun = linear_proj_predfun,
    extract_model_data = extract_model_data, dis = sigma
)

vs <- varsel(ref, method = "forward")
cvs <- cv_varsel(ref, method = "forward")

Sorry for digging this out, but shouldn’t this line

predfun = function(xt) t( beta %*% t(xt) )

read

predfun = function(xt) t( beta_out %*% t(xt) )

for your simulated example?

In any case, thanks for this very interesting discussion!

Yes, I believe the code I shared just above includes the right predfun!

@AlejandroCatalina @Francesco_De_Vincentis
the extract_model_data() function needs the projpred's internal function lhs() thet seems to be not exposed by the package.
I tried to copy it from the source code but I need to copy-paste a lot of dependent functions too.
I use the version 2.0.2 of projpred.
Could you please help us to expose the relevant functions?
Thanks in advance!

Have you tried projpred:::lhs()?

We have now a preprint out on using projpred for, e.g., ordinal and survival models

The code is in the latent_projection branch of the projpred github repo, if you want to test. We’re working on a vignette and additional testing, before merging to main branch.

Soooo thanks @avehtari @paul.buerkner @AlejandroCatalina !

Thanks for making this available @avehtari ! Unfortunately I can’t get it working with a cumulative link model.

I installed projpred with the latent_projection branch using:
devtools::install_github(ā€˜stan-dev/projpred’, build_vignettes = TRUE, ref=ā€˜latent_projection’)

I then perform an ordinal regression using brms (version 2.16.2) as follows:
fit ← brm(var ~ A+B+C+D+E+F, x, family=cumulative())

I then try to perform variable selection using projpred (version 2.0.5.9000) as follows:
ref ← get_refmodel(fit)

But I get the following error message:
Error in extend_family(family) :
Family ā€˜cumulative’ is not supported by projpred.

Do you have any idea what is going on here?

Many thanks,
John

Yes, we are still working on the branch and somethings are still under heavy development. We will hopefully have it fully ready soon!

Best,
Alex

Cool @AlejandroCatalina ! Will keep an eye out for this, looking forward to this being available!

@AlejandroCatalina In the meantime, I still need to do some variable selection with ordinal models. Currently I am using a simple forward feature selection algorithm to pick variables to add to the model by trying to minimise LOOIC. Could you comment on whether this is a sound approach? In particular with respect to overfitting? Many thanks, John

It sounds like it can potentially overfit. You can still use projpred with a small caveat, but you need to manually build the refmodel object and supply Gaussian as family with latent=TRUE. The caveat si that the reported ELPD will be on the latent Gaussian space, instead of being the response scale. I can share couple lines of code to do this if you’re interested.

Best,
Alex

@AlejandroCatalina Are there any issues with reporting ELPD will be on the latent scale? Am I correct that ELPD on the latent scale will still allow me to pick the ā€œbestā€ model?

If you could share the code on how to build the refmodel object and run projpred then that would be fantastic. I am trying to run ordinal regressions using cumulative link models and stopping ratio models.

Much appreciated,
John

Yes indeed the selection is the same, it’s just that you might be looking at the ELPD values thinking that they are on the response scale.

The idea behind the latent approach is to fit the projections on the latent predictive space, meaning we in practice replace the original response by the latent prediction as computed by t(posterior_linpred(fit, transform=FALSE)). So, with this said, you can follow this snippet (and fix whatever necessary):

## asumming you have defined an object `fit`

y_latent <- t(posterior_linpred(fit, transform=FALSE))

data[[".y"]] <- rowMeans(y_latent)

dis_latent <- rep(1, n_draws_fit) # n_draws_fit needs to match the number of posterior draws in fit

ref_formula <- update(formula(fit), ".y ~ .")

ref <- init_refmodel(fit,
data = data, formula = ref_formula, family = gaussian(),
ref_predfun = ref_predfun, div_minimizer = projpred:::linear_mle,
proj_predfun = projpred:::linear_proj_predfun, dis = dis_latent,
extract_model_data = extract_model_data, cvfun = cvfun
)

The only missing pieces are ref_predfun, cvfun and extract_model_data, that are defined as:

cvfun <- function(folds, ...) {
  cvres <- brms::kfold(
    object,
    K = max(folds), save_fits = TRUE, folds = folds, ...
  )
  fits <- cvres$fits[, "fit"]
  return(fits)
}

extract_model_data <- function(object, newdata = NULL, wrhs = NULL,
                               orhs = NULL, extract_y = TRUE) {
  if (!extract_y) {
    resp_form <- NULL
  } else {
    resp_form <- ~.y
  }

  if (is.null(newdata)) {
    newdata <- data
  }

  if (is.null(wrhs) && !is.null(object) &&
    !is.null(object$weights) && length(object$weights) != 0) {
    wrhs <- ~weights
    newdata <- cbind(newdata, weights = object$weights)
  }

  if (is.null(orhs) && !is.null(object) &&
    !is.null(object$offset) && length(object$offset) != 0) {
    orhs <- ~offset
    newdata <- cbind(newdata, offset = object$offset)
  }

  args <- nlist(object, newdata, wrhs, orhs, resp_form)
  return(do_call(projpred:::.extract_model_data, args))
}

ref_predfun <- function(fit, newdata = NULL) {
  return(t(posterior_linpred(fit,
    newdata = newdata,
    transform = FALSE
  )))
}

I have also been looking at variable selection in ordinal models. One ā€œsub-optimal strategyā€ (OK, I admit hack) is to first build the full ordinal response model in brms, then sample from the linear predictor to convert the ordinal response to a continuous response. Then build a full model on this continuous response and use varsel to select the terms.

Plots to compare the ordinal response values to the linear predictor samples and plots comparing ordinal and linear predictor model coefficients show if the linear predictor model is a reasonable approximation.

I have found that this works in a simulation with 3 or 4 factors, varsel either selects the true model structure in the fake data or misses by one term. Building a few ordinal response models around suggest_size and comparing information criteria has always found the fake data structure.

But these are tiny datasets, I do not know how this would work with your huge dataset.

I would be very pleased to see any opinions on this hack - is it reasonable or too dangerous to use!

That is indeed very related to our latent approach. There is some uncertainty regarding what ā€œbuild a full model on this continuous responseā€ means, but it is anyway very related to our approach and therefore principled as far as our latest developments goes.

Best,
Alex

@AlejandroCatalina Thanks for sharing this code. I managed to build the refmodel with a few edits to the code. But some weird stuff happens when I perform the variable selection.

Firstly, when I run the following:
vs <- varsel(ref, method = "forward", nterms_max = projpred:::count_terms_in_formula(ref$formula))
plot(vs, stats = c('elpd', 'rmse'))

I get the the following plot:
elpd_rmse

Does this look right? It strikes me as odd that the model with 47 terms (all of the terms in my full model) does not reach the same elpd or rmse values as the full model indicated by the dashed lines.

Secondly, the cross-validated variable selection fails. When I run the following:
vs <- cv_varsel(ref, method = "forward", nterms_max = projpred:::count_terms_in_formula(ref$formula))

I get the following error message:

[1] ā€œRepeating forward search for 165 LOO foldsā€¦ā€
| | 0%
warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

warning: solve(): system is singular; attempting approx solution
The search direction is not a descent direction (newton decrement = -0, should be positive), , this is likely a bug. Please report to the developers.

error: matrix multiplication: incompatible matrix dimensions: 0x1 and 2x1
Error in glm_ridge_c(x, pseudo_obs, lambda, FALSE, 0, beta_start, w0, :
matrix multiplication: incompatible matrix dimensions: 0x1 and 2x1
In addition: Warning message:
Some Pareto k diagnostic values are too high. See help(ā€˜pareto-k-diagnostic’) for details.

Any ideas about what is going on here?

Thanks,
John

Many things could be happening, but it seems the reference model may have some issues. You can try forward search my passing method=ā€œforwardā€ to varsel. That should be a bit better but if there are many issues it may not resolve it. As to the first varsel plot it seems to indicate that the evaluation of the reference model may be overoptimistic.

Best,
Alex

Thanks for the comment @AlejandroCatalina

I can get rid of the second error when I do some preliminary feature selection (remove variables which are clearly not important).

As for the varsel plot, I’m still confused why the reference model and the model with all of the features (in my case 47 / 47 features) have different elpd and rmse values. I would have expected them to be the same as they are the same model, right? Could you please explain if there is something which I am not understanding.

Many thanks,
John

You should be able to recover the identical behavior by increasing the amount of draws used for prediction with the ndraws_pred argument.

Best,
Alex