Loo with k_threshold error for stan_polr()

I’ve fitted models with stan_polr and want to compare fits by loo. Some models have problematic Pareto k observations, and I’m trying to refit with exact loo for these observations. However, when I call loo(WL,k_threshold = 0.7), where WL is my fitted model, I get the following:

3 problematic observation(s) found.
Model will be refit 3 times.

Fitting model 1 out of 3 (leaving out observation 22)
Error in .subset(x, j) : invalid subscript type ‘list’

Any suggestions for what I’m missing?

1 Like

Hmm, that looks like a bug, thanks for reporting this. I’ll try to make some time today to figure out what’s going on.

@tomli I think I just fixed this in the rstanarm on GitHub and it will be fixed in the next release. In the meantime, if you don’t want to reinstall from GitHub (which will require compiling all of rstanarm unlike installing from CRAN), here’s a short-term hack that I think will get it to work without having to reinstall anything:

If you have a fitted model object fit then do fit$zeta <- NULL before calling loo(fit, k_threshold = 0.7). Can you let me know if that runs without error?

Thanks @jonah for looking into this. I tried the hack solution. Alas, it did not solve the problem, and I get the same error message as before. In case it’s helpful for further detective-pursuits, here is some info about the platform and versions used:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

other attached packages:
[1] loo_2.3.1 bayesplot_1.7.1 rstanarm_2.21.1 Rcpp_1.0.4.6 rstan_2.21.2
[6] StanHeaders_2.21.0-5 readxl_1.3.1 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5
[11] purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.1 ggplot2_3.3.0
[16] tidyverse_1.3.0

Edit: Tried to install rstanarm from Github, but it won’t work, yielding the following error message:

Error: Failed to install 'rstanarm' from GitHub:
  (converted from warning) cannot remove prior installation of package ?ellipsis?
In addition: Warning message:
In utils::untar(tarfile, ...) :
  ?tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\Rtmpykwaqj\file28bc5c57976.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/Rtmpykwaqj/remotes28bc6a526a99"? returned error code 1

I also attempted to shift to a linux cluster (x86_64-pc-linux-gnu (64-bit)), but installation of rstanarm crashed during compilation, both when running R 3.5.3 and R 3.6.3. R 4.0 is not yet available on said cluster.

Grateful for any other advice on how to proceed.

Hmmm, sorry that didn’t fix it! Maybe there’s another issue in addition to the one I found. I’m not actually able to reproduce the error message

Error in .subset(x, j) : invalid subscript type ‘list’

so here are a few things could help narrow things down:

  • Does this only happen for you with stan_polr() or does it also happen with other rstanarm modeling functions, e.g., stan_glm()?

  • Are you able to share your data and code so I can reproduce this? If you can’t share them then is it possible to share a toy version using simulated data that also gives you this error? Anything that could help me be able to reproduce this particular error message.

Yes, I get the same issue when using glm. Below is a code that reproduces the error on my computer both with glm and polr. The mc.core=1 was to used to circumvent the issure raised in this thread: Problems running LOO on >1 core
Here is the required data.
reducedvilt.csv (11.4 KB)

df<-read.csv("reducedvilt.csv")
options(mc.cores = 1)
require(rstanarm)
require(rstan)
rstan_options(auto_write = TRUE)

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"


WL<-stan_glm(moddef , data = df)
loo(WL,k_threshold = 0.7)


df$Inno_art<-ordered(df$Inno_art,levels=1:5,labels=LETTERS[1:5])
WL<-stan_polr(moddef , data = df, method = "logistic",
                      prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
                      algorithm = "sampling")
loo(WL,k_threshold = 0.7)
1 Like

Thanks for sharing the code and data. I was able to reproduce this and I figured out that the problem is that one of the internal functions called by loo when k_threshold is specified assumes the model formula is of class formula as opposed to just a string. I will fix this so in the future it works, but until then you should be able to avoid the error by changing

to be a formula like this

moddef=as.formula("Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan")

If you do that does it avoid the error? My hunch is that it avoids the error when you use loo with k_threshold after stan_glm() but when you use it with stan_polr() you might still also need this

which is fixed on GitHub but not in your installed version of rstanarm.

Actually there may be another problem with loo stan_polr that I’m trying to sort out now, but the advice above should work for stan_glm. Will report back.

Ok so this is the issue that keeps giving. I’ve already fixed two bugs as a result of this and just found a third:

The reason this wasn’t detected sooner is that it requires some particular features in the data to trigger it.

So my post above has a fix for stan_glm() but with this particular dataset it’s not yet going to work with stan_polr() until we fix this new issue. Just need to discuss with @bgoodri to figure out the best way to fix it.

1 Like

Thanks for looking into the issue. I can confirm that implementing as.formula() made it possible to run stan_glm without issues. It did however not solve running stan_polr, even if it seemed to allow the process to proceed a bit further. Without the fit$zeta <- NULL hack, I got

3 problematic observation(s) found.
Model will be refit 3 times.

Fitting model 1 out of 3 (leaving out observation 22)
Error in x$z[omitted, , drop = FALSE] : incorrect number of dimensions

With the fit$zeta <- NULL hack, I got

3 problematic observation(s) found.
Model will be refit 3 times.

Fitting model 1 out of 3 (leaving out observation 22)

Fitting model 2 out of 3 (leaving out observation 25)

Fitting model 3 out of 3 (leaving out observation 173)
Error in draws$zeta[, y_i] : subscript out of bounds

…so it seems the refitting was now done for each flagged observation, but something went wrong after that.

Yeah exactly. I’ve found the problem (the issue described in the github link in my previous post) but I don’t know the exact fix yet. Once we sort it out (hopefully very soon) I’ll post back here.

Maybe a bit of progress. Are you able to install rstanarm from github and see if my potential fix works? Unfortunately because it’s installing from github it will have to compile all of rstanarm at install time (unlike CRAN when everything is pre-compiled):

Sys.setenv(MAKEFLAGS = "-j4") # number of cores to use (you can change)
devtools::install_github("stan-dev/rstanarm", build_vignettes = FALSE)

Unfortunately, I’m having trouble installing from github. This is what I keep getting, and I’ve tried different options for the “Enter one or more numbers…” stage with similar [lack of] progress:

Downloading GitHub repo stan-dev/rstanarm@master
stan-dev-rstanarm-5b9b826/tests/testthat/include: Can't create '\\\\?\\C:\\Users\\tomli96\\AppData\\Local\\Temp\\RtmpwtUaHX\\remotes6c5826649a7\\stan-dev-rstanarm-5b9b826\\tests\\testthat\\include'
stan-dev-rstanarm-5b9b826/tests/testthat/stan_files: Can't create '\\\\?\\C:\\Users\\tomli96\\AppData\\Local\\Temp\\RtmpwtUaHX\\remotes6c5826649a7\\stan-dev-rstanarm-5b9b826\\tests\\testthat\\stan_files'
tar.exe: Error exit delayed from previous errors.
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

 1: All                                  
 2: CRAN packages only                   
 3: None                                 
 4: tibble    (3.0.1   -> 3.0.3  ) [CRAN]
 5: vctrs     (0.3.0   -> 0.3.2  ) [CRAN]
 6: pillar    (1.4.4   -> 1.4.6  ) [CRAN]
 7: isoband   (0.2.1   -> 0.2.2  ) [CRAN]
 8: pkgload   (1.0.2   -> 1.1.0  ) [CRAN]
 9: pkgbuild  (1.0.8   -> 1.1.0  ) [CRAN]
10: backports (1.1.7   -> 1.1.8  ) [CRAN]
11: processx  (3.4.2   -> 3.4.3  ) [CRAN]
12: jsonlite  (1.6.1   -> 1.7.0  ) [CRAN]
13: httpuv    (1.5.2   -> 1.5.4  ) [CRAN]
14: htmltools (0.4.0   -> 0.5.0  ) [CRAN]
15: later     (1.0.0   -> 1.1.0.1) [CRAN]
16: promises  (1.1.0   -> 1.1.1  ) [CRAN]
17: dplyr     (0.8.5   -> 1.0.1  ) [CRAN]
18: shiny     (1.4.0.2 -> 1.5.0  ) [CRAN]
19: xfun      (0.13    -> 0.16   ) [CRAN]
20: openssl   (1.4.1   -> 1.4.2  ) [CRAN]
21: sys       (3.3     -> 3.4    ) [CRAN]
22: DT        (0.13    -> 0.14   ) [CRAN]

Enter one or more numbers, or an empty line to skip updates:

Installing 1 packages: rlang
Installing package into ‘C:/Users/tomli96/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/rlang_0.4.7.zip'
Content type 'application/zip' length 1129581 bytes (1.1 MB)
downloaded 1.1 MB

package ‘rlang’ successfully unpacked and MD5 sums checked
Error: Failed to install 'rstanarm' from GitHub:
  (converted from warning) cannot remove prior installation of package ‘rlang’
In addition: Warning message:
In utils::untar(tarfile, ...) :
  ‘tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\RtmpwtUaHX\file6c58a13c27.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/RtmpwtUaHX/remotes6c5826649a7"’ returned error code 1

Any suggestion for what to try?

Hmm, what does the error message look like when you tell it to install none of them (option 3)? Right now it looks like there’s a message about installing the ‘rlang’ package, but that shouldn’t be necessary.

It says:

Error: Failed to install 'rstanarm' from GitHub:
  (converted from warning) cannot remove prior installation of package ‘rlang’
In addition: Warning message:
In utils::untar(tarfile, ...) :
  ‘tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\RtmpwtUaHX\file6c58d141724.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/RtmpwtUaHX/remotes6c582f7e608b"’ returned error code 1

Weird. I don’t know what’s up with ‘rlang’, that doesn’t really have anything to do with rstanarm. I’m just guessing but maybe try remove.packages("rlang") and then install.packages("rlang") before trying to install rstanarm?

Thanks jonah, that seem to have worked. I got a warning message in the rstanarm installation:

Warning message:
In utils::untar(tarfile, ...) :
  ‘tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\RtmpmcxkkY\fileb9742b1c15e5.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/RtmpmcxkkY/remotesb97427a4815"’ returned error code 1

…but it appears to have installed correctly. I can now run the example code I provided above in its entirety (without previously suggested hacks) without encountering any issues. Thank you very much!

1 Like

Glad that works. I’m not sure about that warning message, but if everything seems to be working I think you can safely ignore it.

Sorry @jonah, but I think there is still some issue. Uncertain if it’s with loo, rstanarm, or something else, but this code…

df<-read.csv("reducedvilt.csv")

options(mc.cores = 1)
require(rstanarm)
require(rstan)
require(loo)
rstan_options(auto_write = TRUE)

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"


df$Inno_art<-ordered(df$Inno_art,levels=1:5,labels=LETTERS[1:5])
WL<-stan_polr(moddef , data = df, method = "logistic",
                      prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
                      algorithm = "sampling")


moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha"


WoL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")
LOOWith<-loo(WL,k_threshold = 0.7)
LOOWithout<-loo(WoL,k_threshold = 0.7)

…gives the error message…

Error in stanmat[, colnames(x), drop = FALSE] : subscript out of bounds

Swapping the order of which model is fitted first, i.e…



moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha"


WoL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"

WL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")

LOOWithout<-loo(WoL,k_threshold = 0.7)
LOOWith<-loo(WL,k_threshold = 0.7)

…instead yields:

Error in eval(predvars, data, env) : object 'Lan' not found

In both cases, the first loo calculation works, but the second fail after refitting for pareto-flagged observations. I get the same results when running in parallel, but wanted to run serially to make sure there isn’t some issue related to previous problems with parallel chains on Windows. However, the following, which moves the loo calculation to immediately after stan_polr, works:

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"



WL<-stan_polr(moddef , data = df, method = "logistic",
                      prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
                      algorithm = "sampling")

LOOWith<-loo(WL,k_threshold = 0.7)
moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha"


WoL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")

LOOWithout<-loo(WoL,k_threshold = 0.7)

Not sure what to make of it. It seems the loo calculation “remembers” the lastly fitted object and expects that object structure.

Hmm that’s weird, thanks for letting me know. I’ll try to find some time later today to look into this.