Still confused by search_terms in projpred

Hi @AlejandroCatalina, I have a brms model (Bernoulli family), which is a simpler version of what I was describing in Advice on using search_terms in projpred. The formula is:

Y ~ (X1 + X2 + X3 + X4 ) * F1 * F2

X1 to X4 are four different continuous predictors, F1 and F2 are factors with 2 and 5 levels respectively.

I want to use projpred to help determine which of the X predictors should be included in the model because they improve predictions in any of the 2 x 5 = 10 conditions of the experiment. This means that whenever any X is included, I also want to include all of its interactions with F1 and F2. In an attempt to do this in projpred, I have specified:

search_terms = c(
"1", 
"F1 + F2 + F1:F2 + X1 + X1:F1 + X1:F2 + X1:F1:F2",
"F1 + F2 + F1:F2 + X2 + X2:F1 + X2:F2 + X2:F1:F2",
"F1 + F2 + F1:F2 + X3 + X3:F1 + X3:F2 + X3:F1:F2",
"F1 + F2 + F1:F2 + X4 + X4:F1 + X4:F2 + X4:F1:F2")
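The same vector can also be built programmatically, which may help avoid copy-paste slips if the number of X predictors grows. This is only a sketch in base R: the helper name make_block is made up for illustration and is not part of projpred.

```r
# Hypothetical helper (not a projpred function): builds one composite string
# holding the factor terms plus one X and all its interactions with F1 and F2.
make_block <- function(x) {
  paste0("F1 + F2 + F1:F2 + ", x, " + ", x, ":F1 + ", x, ":F2 + ", x, ":F1:F2")
}

# "1" is the intercept-only entry, followed by one composite string per X.
search_terms <- c(
  "1",
  vapply(paste0("X", 1:4), make_block, character(1), USE.NAMES = FALSE)
)
print(search_terms)
```

The result should be character-for-character identical to the hand-written vector above, so it can be passed to varsel() in the same way.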

However, when I run vs <- varsel(mdl, search_terms = search_terms), vs$solution_terms shows 19 entries, each of which is a single term (e.g., F1, F2, X1:F1), and not the four composite entries provided in search_terms. I have checked that the variable names in search_terms match the names in the model formula, so I don’t understand what’s happening here. Have I made a mistake with the syntax, or maybe I am misunderstanding the output of vs$solution_terms?
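For what it's worth, 19 is exactly the number of distinct individual terms contained in the four composite strings: 3 shared factor terms (F1, F2, F1:F2) plus 4 terms for each of the four X predictors. So the output looks as if each string had been split into its components. Whether projpred actually does this internally is an assumption here. The count itself can be checked with a small base R sketch that does not call projpred:

```r
search_terms <- c(
  "1",
  "F1 + F2 + F1:F2 + X1 + X1:F1 + X1:F2 + X1:F1:F2",
  "F1 + F2 + F1:F2 + X2 + X2:F1 + X2:F2 + X2:F1:F2",
  "F1 + F2 + F1:F2 + X3 + X3:F1 + X3:F2 + X3:F1:F2",
  "F1 + F2 + F1:F2 + X4 + X4:F1 + X4:F2 + X4:F1:F2"
)

# Split each composite string (all but the intercept entry "1") into its
# individual term labels by parsing it as the right-hand side of a formula.
split_terms <- lapply(search_terms[-1], function(s) {
  attr(terms(as.formula(paste("~", s))), "term.labels")
})

n_per_string <- lengths(split_terms)               # terms in each composite string
n_distinct <- length(unique(unlist(split_terms)))  # distinct terms across all strings
print(n_per_string)
print(n_distinct)
```

Note that terms() may write an interaction such as X1:F1 as F1:X1; that only changes the label, not the count.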


Hello! Thanks for the follow-up, I’ll try this locally to identify where the issue might be. As far as I can see on my phone this should work fine, but of course search_terms has not been used much and probably has not been tested against all the edge cases, so it might be buggy :).


The factors are sum-coded (contr.sum) in case that’s relevant.

Can you provide a reproducible example with these conditions so I can test and debug this? Thanks and sorry for any inconvenience.

Hi there -

I have the same problem and am confused by the syntax. I have a model with, say, 20 variables, but if I specify varsel(mod, method = "forward", search_terms = c("1", "X1", "X2")) I get back a vs object with all terms included and not just the ones I specified.

I was trying to create a reproducible example using the rstanarm logistic regression example at Bayesian Logistic Regression with rstanarm. Using the post2 model on this page I attempt to limit the search as following:

varsel2 <- varsel(post2, method = "forward", data = diabetes, search_terms = c("1", "glucose", "bloodpressure"))
However, I get this error:
[1] "10% of terms selected."
[1] "20% of terms selected."
Error in sub["kl", i] : incorrect number of dimensions

Ideally, I want to get to the situation where I can make sure one variable is always entered last. I think this has been done in another post, but I can’t work out the correct syntax from the documentation, and it would be really helpful if I could get a pointer.

Great work on this package by the way.
All the best
Jon

What might be happening in this case is that projpred expects search terms to contain all of these variables in the model formula. Your syntax is correct here. You can pass nterms=2 to let it know that only 2 terms are included. I will automatically set nterms to the number of variables passed in search terms if it’s provided. Thanks for noticing!

Hi @AlejandroCatalina, I thought I would recreate this with a new data frame with the simpler variable names (X1, X2, … F1, F2) as above. But now I can’t even get varsel to complete, so I don’t know what’s going on. I get the error message Error in eval(predvars, data, env) : object 'F2i' not found. This is the code I used (using v.2.0.2 of projpred):

library(brms)

# Reference model: Bernoulli regression with all X-by-factor interactions
mdl_projpred <-
  brm(
    Y ~ (X1 + X2 + X3 + X4) * F1 * F2,
    data = data,
    family = bernoulli(link = "logit"),
    prior = c(set_prior("student_t(3, 0, 1)", class = "Intercept"),
              set_prior("student_t(3, 0, 1)", class = "b")),
    sample_prior = "yes",
    save_pars = save_pars(all = TRUE)
  )
mdl_projpred

library(projpred)
search_terms = c(
  "1",
  "F1 + F2 + F1:F2 + X1 + X1:F1 + X1:F2 + X1:F1:F2",
  "F1 + F2 + F1:F2 + X2 + X2:F1 + X2:F2 + X2:F1:F2",
  "F1 + F2 + F1:F2 + X3 + X3:F1 + X3:F2 + X3:F1:F2",
  "F1 + F2 + F1:F2 + X4 + X4:F1 + X4:F2 + X4:F1:F2"
)
refmodel <- get_refmodel(mdl_projpred)
vs <- varsel(
  refmodel,
  search_terms = search_terms)

I can make the data set available, if useful.

Yes, it would be useful to have the data frame available so I can run and debug the example myself. Thanks and sorry for the delay!

Thanks Alejandro – no worries – I’ll email you shortly.