Error in get_refmodel and cv_varsel

Hello,

I am currently using rstanarm to run a stan_lmer model. I am trying to use the library projpred and when I try to use the function “get_refmodel” and the function “cv_varsel” and I get the error: “Error in terms.formula(formula) : ‘.’ in formula and no ‘data’ argument”

Can anyone please help me resolved this error? The data set is high-dimensional (163 variables of interest) with repeated measures and grouped data (about 98 individual observations, and 25 if they’re grouped by ID)

Thank you.

Hello @csilos, welcome to The Stan Forums!

Could you try the current development version of projpred instead of the CRAN version? You can install it via devtools::install_github("stan-dev/projpred", ref = "develop") (requires the devtools package to be installed).

Thank you @fweber144 ! I downloaded the development version, and now I am getting the following error: “Error in init_refmodel(object = object, data = data, formula = formula, :
formal argument “data” matched by multiple actual arguments”

Here is my code:
formula_ ← Ipsi_LVV ~ . -id + (1 | id)

stanglmer ← stan_glmer(formula_ ,
prior = hs_prior,
family = gaussian(),
data = merged_data2,
seed = 12345,
chains = 4)

ref ← get_refmodel(stanglmer, data = merged_data2)

Any help would be appreciated.

You don’t need the data argument when calling get_refmodel(). And in most cases, you don’t even need to call get_refmodel() yourself. Simply plugging the model fit (stanglmer in your case) into projpred functions like cv_varsel() etc. should work.

Thanks again for your help. I removed the data argument and tried the cv_varsel() function again and I’m still getting the original error (the ‘.’ In the formula and the missing data argument.

Again, I really appreciate your help with this.

Are all of your other R packages up-to-date? If not, could you update them to the most recent CRAN versions?

Yes, all of the packages are up to date, and so is my version of R. Again, please let me know if you have any other suggestions. I really appreciate it.

Hm… In that case, I would need a reproducible example. Can you provide one? You can also PM me (but see this note).

Hello, please see the below code for a reproducible example. Note this is not my actual data, as I cannot share that. But it follows the same format. Thank you.

set.seed(101)

df ← data.frame(replicate(154,runif(90, 1, 2.5)))

set.seed(101)

df2 ← data.frame(replicate(5,runif(90, 300, 20000)))
colnames(df2) ← c(‘C1’,‘C2’,‘C3’, ‘C4’, ‘C5’)

set.seed(101)

df3 ← data.frame(replicate(5,sample(0:1,90,rep=TRUE)))
colnames(df3) ← c(‘D1’,‘D2’,‘D3’, ‘D4’, ‘D5’)

set.seed(101)

df$time ← rep(c(1,6,12), each = 1)

df$ID ← rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30), each = 3)

df_complete ← cbind(df,df2,df3)

library(rstanarm)

n=dim(df_complete)[1]
p=dim(df_complete)[2]

p0 ← 10 # prior guess for the number of relevant variables
tau0 ← p0/(p-p0) * 1/sqrt(n)
hs_prior ← hs(df=1, global_df=1, global_scale=tau0)
t_prior ← student_t(df = 7, location = 0, scale = 2.5)

formula_ ← X43 ~ . -ID + (1 | ID)

stanglmer ← stan_glmer(formula_ ,
prior = hs_prior,
family = gaussian(),
data = df_complete,
seed = 12345,
chains = 4)
library(projpred)

varsel2 ← cv_varsel(stanglmer, method=‘forward’, cv_method=‘loo’, nloo = n)

1 Like

This was indeed a bug in projpred. Thanks for reporting this.

I have submitted a corresponding fix on GitHub (here). Until this is merged by a projpred maintainer, you should be able to work around this either by avoiding the . in the formula or by installing projpred from my PR branch: devtools::install_github("fweber144/projpred", ref = "formula_dot_fix").

The important question is now: Does the variable selection with CV run through now for your actual dataset?

So I installed projpred from your PR branch, and ran it on my data, and got this error now:

“Error in ref_predfun(object) :
length(offs) %in% c(1L, nrow(linpred_out)) is not TRUE”

Again, thank you for your help with this.

Hm, I can’t reproduce this with your reprex from above. (I replaced

varsel2 <- cv_varsel(stanglmer, method="forward", cv_method="loo", nloo = n)

by

varsel2 <- cv_varsel(stanglmer, method="forward", cv_method="loo", nloo = 5,
                     nterms_max = 4, nclusters = 3, nclusters_pred = 5,
                     seed = 3546)

to save time, but I don’t think this has an impact on the error you reported.)

Do you get the error also when running the reprex?

Hi, I have reproduced the error we the following reproducible example. The difference is now I have added some missing values and dropped two observations. Please the code below:

set.seed(101)

df ← data.frame(replicate(155,runif(90, 1, 2.5)))

set.seed(101)

df2 ← data.frame(replicate(5,runif(90, 300, 20000)))
colnames(df2) ← c(‘C1’,‘C2’,‘C3’, ‘C4’, ‘C5’)

set.seed(101)

df3 ← data.frame(replicate(5,sample(0:1,90,rep=TRUE)))
colnames(df3) ← c(‘D1’,‘D2’,‘D3’, ‘D4’, ‘D5’)

n=dim(df3)[1]
df3 ← apply (df3, 2, function(x) {x[sample( c(1:n), floor(n/10))] ← NA; x} )

set.seed(101)

df$time ← rep(c(1,6,12), each = 1)

df_complete ← cbind(df,df2,df3)

set.seed(101)

n=dim(df_complete)[1]

df_complete ← as.data.frame(df_complete)

df_complete$time ← rep(c(1,6,12), each = 1)
df_complete$ID ← rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30), each = 3)

rand_ind ← sample(nrow(df_complete), 2, replace = F)

df_complete ← df_complete[-rand_ind,]

library(rstanarm)

n=dim(df_complete)[1]
p=dim(df_complete)[2]

p0 ← 10 # prior guess for the number of relevant variables
tau0 ← p0/(p-p0) * 1/sqrt(n)
hs_prior ← hs(df=1, global_df=1, global_scale=tau0)
t_prior ← student_t(df = 7, location = 0, scale = 2.5)
#post2 ← stan_glm(reg_formula, data = diabetes,

family = binomial(link = “logit”),

prior = hs_prior, prior_intercept = t_prior,

seed = SEED, adapt_delta = 0.999, refresh=0)

formula_ ← X43 ~ . -ID + (1 | ID)

stanglmer ← stan_glmer(formula_ ,
prior = hs_prior,
family = gaussian(),
data = df_complete,
seed = 12345,
chains = 4)
library(projpred)

varsel2 ← cv_varsel(stanglmer, method=‘forward’, cv_method=‘loo’, nloo = n)

Thank you again for your help!

Objects reg_formula and SEED are not defined.

And please use code formatting as described above.

EDIT: Code formatting was not described above; sorry for that.

My apologies, see the code below.


set.seed(101)

df <- data.frame(replicate(155,runif(90, 1, 2.5)))

set.seed(101)

df2 <- data.frame(replicate(5,runif(90, 300, 20000)))
colnames(df2) <- c('C1','C2','C3', 'C4', 'C5')

set.seed(101)

df3 <- data.frame(replicate(5,sample(0:1,90,rep=TRUE)))
colnames(df3) <- c('D1','D2','D3', 'D4', 'D5')

n=dim(df3)[1]
df3 <- apply (df3, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )


set.seed(101)

df$time <- rep(c(1,6,12), each = 1)

df_complete <- cbind(df,df2,df3)

set.seed(101)



n=dim(df_complete)[1]


df_complete <- as.data.frame(df_complete)

df_complete$time <- rep(c(1,6,12), each = 1)
df_complete$ID <- rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
               21,22,23,24,25,26,27,28,29,30), each = 3)


rand_ind <- sample(nrow(df_complete), 2, replace = F)


df_complete <- df_complete[-rand_ind,]


library(rstanarm)

n=dim(df_complete)[1]
p=dim(df_complete)[2]

p0 <- 10 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(df=1, global_df=1, global_scale=tau0)
t_prior <- student_t(df = 7, location = 0, scale = 2.5)

formula_ <-  X43 ~ . -ID + (1 | ID)

## Begin Modeling

stanglmer <- stan_glmer(formula_ , 
                        prior = hs_prior,
                        family = gaussian(),
                        data = df_complete,
                        seed = 12345,
                        chains = 4)
library(projpred)


varsel2 <- cv_varsel(stanglmer,  method='forward', cv_method='loo', nloo = n)

The problem is caused by the missing values in the dataset. I admit that projpred needs to handle this in a more user-friendly way. For now, using data = na.omit(df_complete) instead of data = df_complete in the stan_glmer() call should work.

Sorry, I’m realizing I did not ask for proper code formatting here. I confounded this with a different thread. My apologies.

Got it! Thank you so much for your help! I really appreciate it.

1 Like

If everything runs through now, you can tag this thread as resolved.

I also want to point out that the cv_varsel() settings I used above (nterms_max = 4, nclusters = 3, nclusters_pred = 5) were only chosen to speed up the “debugging”. They are not recommended for practice.