Error in get_refmodel and cv_varsel

csilos · February 4, 2022, 10:28pm

Hello,

I am currently using rstanarm to run a stan_lmer model. I am trying to use the library projpred and when I try to use the function “get_refmodel” and the function “cv_varsel” and I get the error: “Error in terms.formula(formula) : ‘.’ in formula and no ‘data’ argument”

Can anyone please help me resolved this error? The data set is high-dimensional (163 variables of interest) with repeated measures and grouped data (about 98 individual observations, and 25 if they’re grouped by ID)

Thank you.

fweber144 · February 5, 2022, 3:43pm

Hello @csilos, welcome to The Stan Forums!

Could you try the current development version of projpred instead of the CRAN version? You can install it via devtools::install_github("stan-dev/projpred", ref = "develop") (requires the devtools package to be installed).

csilos · February 6, 2022, 3:25am

Thank you @fweber144 ! I downloaded the development version, and now I am getting the following error: “Error in init_refmodel(object = object, data = data, formula = formula, :
formal argument “data” matched by multiple actual arguments”

Here is my code:
formula_ ← Ipsi_LVV ~ . -id + (1 | id)

stanglmer ← stan_glmer(formula_ ,
prior = hs_prior,
family = gaussian(),
data = merged_data2,
seed = 12345,
chains = 4)

ref ← get_refmodel(stanglmer, data = merged_data2)

Any help would be appreciated.

fweber144 · February 6, 2022, 11:07am

You don’t need the data argument when calling get_refmodel(). And in most cases, you don’t even need to call get_refmodel() yourself. Simply plugging the model fit (stanglmer in your case) into projpred functions like cv_varsel() etc. should work.

csilos · February 6, 2022, 2:27pm

Thanks again for your help. I removed the data argument and tried the cv_varsel() function again and I’m still getting the original error (the ‘.’ In the formula and the missing data argument.

Again, I really appreciate your help with this.

fweber144 · February 6, 2022, 5:18pm

Are all of your other R packages up-to-date? If not, could you update them to the most recent CRAN versions?

csilos · February 6, 2022, 8:12pm

Yes, all of the packages are up to date, and so is my version of R. Again, please let me know if you have any other suggestions. I really appreciate it.

fweber144 · February 7, 2022, 8:12am

Hm… In that case, I would need a reproducible example. Can you provide one? You can also PM me (but see this note).

csilos · February 8, 2022, 8:36pm

Hello, please see the below code for a reproducible example. Note this is not my actual data, as I cannot share that. But it follows the same format. Thank you.

set.seed(101)

df ← data.frame(replicate(154,runif(90, 1, 2.5)))

set.seed(101)

df2 ← data.frame(replicate(5,runif(90, 300, 20000)))
colnames(df2) ← c(‘C1’,‘C2’,‘C3’, ‘C4’, ‘C5’)

set.seed(101)

df3 ← data.frame(replicate(5,sample(0:1,90,rep=TRUE)))
colnames(df3) ← c(‘D1’,‘D2’,‘D3’, ‘D4’, ‘D5’)

set.seed(101)

df$time ← rep(c(1,6,12), each = 1)

df$ID ← rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30), each = 3)

df_complete ← cbind(df,df2,df3)

library(rstanarm)

n=dim(df_complete)[1]
p=dim(df_complete)[2]

p0 ← 10 # prior guess for the number of relevant variables
tau0 ← p0/(p-p0) * 1/sqrt(n)
hs_prior ← hs(df=1, global_df=1, global_scale=tau0)
t_prior ← student_t(df = 7, location = 0, scale = 2.5)

formula_ ← X43 ~ . -ID + (1 | ID)

stanglmer ← stan_glmer(formula_ ,
prior = hs_prior,
family = gaussian(),
data = df_complete,
seed = 12345,
chains = 4)
library(projpred)

varsel2 ← cv_varsel(stanglmer, method=‘forward’, cv_method=‘loo’, nloo = n)

fweber144 · February 9, 2022, 1:38pm

This was indeed a bug in projpred. Thanks for reporting this.

I have submitted a corresponding fix on GitHub (here). Until this is merged by a projpred maintainer, you should be able to work around this either by avoiding the . in the formula or by installing projpred from my PR branch: devtools::install_github("fweber144/projpred", ref = "formula_dot_fix").

The important question is now: Does the variable selection with CV run through now for your actual dataset?

csilos · February 9, 2022, 4:17pm

So I installed projpred from your PR branch, and ran it on my data, and got this error now:

“Error in ref_predfun(object) :
length(offs) %in% c(1L, nrow(linpred_out)) is not TRUE”

Again, thank you for your help with this.

fweber144 · February 9, 2022, 6:59pm

Hm, I can’t reproduce this with your reprex from above. (I replaced

varsel2 <- cv_varsel(stanglmer, method="forward", cv_method="loo", nloo = n)

by

varsel2 <- cv_varsel(stanglmer, method="forward", cv_method="loo", nloo = 5,
                     nterms_max = 4, nclusters = 3, nclusters_pred = 5,
                     seed = 3546)

to save time, but I don’t think this has an impact on the error you reported.)

Do you get the error also when running the reprex?

csilos · February 9, 2022, 7:58pm

Hi, I have reproduced the error we the following reproducible example. The difference is now I have added some missing values and dropped two observations. Please the code below:

set.seed(101)

df ← data.frame(replicate(155,runif(90, 1, 2.5)))

set.seed(101)

df2 ← data.frame(replicate(5,runif(90, 300, 20000)))
colnames(df2) ← c(‘C1’,‘C2’,‘C3’, ‘C4’, ‘C5’)

set.seed(101)

df3 ← data.frame(replicate(5,sample(0:1,90,rep=TRUE)))
colnames(df3) ← c(‘D1’,‘D2’,‘D3’, ‘D4’, ‘D5’)

n=dim(df3)[1]
df3 ← apply (df3, 2, function(x) {x[sample( c(1:n), floor(n/10))] ← NA; x} )

set.seed(101)

df$time ← rep(c(1,6,12), each = 1)

df_complete ← cbind(df,df2,df3)

set.seed(101)

n=dim(df_complete)[1]

df_complete ← as.data.frame(df_complete)

df_complete$time ← rep(c(1,6,12), each = 1)
df_complete$ID ← rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30), each = 3)

rand_ind ← sample(nrow(df_complete), 2, replace = F)

df_complete ← df_complete[-rand_ind,]

library(rstanarm)

n=dim(df_complete)[1]
p=dim(df_complete)[2]

p0 ← 10 # prior guess for the number of relevant variables
tau0 ← p0/(p-p0) * 1/sqrt(n)
hs_prior ← hs(df=1, global_df=1, global_scale=tau0)
t_prior ← student_t(df = 7, location = 0, scale = 2.5)
#post2 ← stan_glm(reg_formula, data = diabetes,

family = binomial(link = “logit”),

prior = hs_prior, prior_intercept = t_prior,

seed = SEED, adapt_delta = 0.999, refresh=0)

formula_ ← X43 ~ . -ID + (1 | ID)

stanglmer ← stan_glmer(formula_ ,
prior = hs_prior,
family = gaussian(),
data = df_complete,
seed = 12345,
chains = 4)
library(projpred)

varsel2 ← cv_varsel(stanglmer, method=‘forward’, cv_method=‘loo’, nloo = n)

Thank you again for your help!

fweber144 · February 9, 2022, 8:21pm

Objects reg_formula and SEED are not defined.

And please use code formatting as described above.

EDIT: Code formatting was not described above; sorry for that.

csilos · February 9, 2022, 8:30pm

My apologies, see the code below.


set.seed(101)

df <- data.frame(replicate(155,runif(90, 1, 2.5)))

set.seed(101)

df2 <- data.frame(replicate(5,runif(90, 300, 20000)))
colnames(df2) <- c('C1','C2','C3', 'C4', 'C5')

set.seed(101)

df3 <- data.frame(replicate(5,sample(0:1,90,rep=TRUE)))
colnames(df3) <- c('D1','D2','D3', 'D4', 'D5')

n=dim(df3)[1]
df3 <- apply (df3, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )


set.seed(101)

df$time <- rep(c(1,6,12), each = 1)

df_complete <- cbind(df,df2,df3)

set.seed(101)



n=dim(df_complete)[1]


df_complete <- as.data.frame(df_complete)

df_complete$time <- rep(c(1,6,12), each = 1)
df_complete$ID <- rep(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
               21,22,23,24,25,26,27,28,29,30), each = 3)


rand_ind <- sample(nrow(df_complete), 2, replace = F)


df_complete <- df_complete[-rand_ind,]


library(rstanarm)

n=dim(df_complete)[1]
p=dim(df_complete)[2]

p0 <- 10 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(df=1, global_df=1, global_scale=tau0)
t_prior <- student_t(df = 7, location = 0, scale = 2.5)

formula_ <-  X43 ~ . -ID + (1 | ID)

## Begin Modeling

stanglmer <- stan_glmer(formula_ , 
                        prior = hs_prior,
                        family = gaussian(),
                        data = df_complete,
                        seed = 12345,
                        chains = 4)
library(projpred)


varsel2 <- cv_varsel(stanglmer,  method='forward', cv_method='loo', nloo = n)

fweber144 · February 9, 2022, 8:57pm

The problem is caused by the missing values in the dataset. I admit that projpred needs to handle this in a more user-friendly way. For now, using data = na.omit(df_complete) instead of data = df_complete in the stan_glmer() call should work.

fweber144 · February 9, 2022, 9:00pm

Sorry, I’m realizing I did not ask for proper code formatting here. I confounded this with a different thread. My apologies.

csilos · February 9, 2022, 9:06pm

Got it! Thank you so much for your help! I really appreciate it.

fweber144 · February 10, 2022, 7:47am

If everything runs through now, you can tag this thread as resolved.

I also want to point out that the cv_varsel() settings I used above (nterms_max = 4, nclusters = 3, nclusters_pred = 5) were only chosen to speed up the “debugging”. They are not recommended for practice.

Topic		Replies	Views
Projpred error Interfaces projpred	7	70	November 28, 2024
Accounting for measurement error during variable selection with projpred (possibly with rstanarm or brms?) Modeling	25	1132	July 31, 2025
Projpred Modeling	18	1076	April 28, 2020
Projpred "inherits(formula, "formula") is not TRUE" Modeling projpred	2	489	January 6, 2023
Logistic model selection in rstan Modeling rstan	8	906	April 6, 2023

Error in get_refmodel and cv_varsel

family = binomial(link = “logit”),

prior = hs_prior, prior_intercept = t_prior,

seed = SEED, adapt_delta = 0.999, refresh=0)

Related topics