Loo with k_threshold error for stan_polr()

tomli · July 28, 2020, 2:55pm

I’ve fitted models with stan_polr and want to compare fits by loo. Some models have problematic Pareto k observations, and I’m trying to refit with exact loo for these observations. However, when I call loo(WL,k_threshold = 0.7), where WL is my fitted model, I get the following:

3 problematic observation(s) found.
Model will be refit 3 times.

Fitting model 1 out of 3 (leaving out observation 22)
Error in .subset(x, j) : invalid subscript type ‘list’

Any suggestions for what I’m missing?

jonah · July 28, 2020, 6:11pm

Hmm, that looks like a bug, thanks for reporting this. I’ll try to make some time today to figure out what’s going on.

jonah · July 28, 2020, 9:27pm

@tomli I think I just fixed this in the rstanarm on GitHub and it will be fixed in the next release. In the meantime, if you don’t want to reinstall from GitHub (which will require compiling all of rstanarm unlike installing from CRAN), here’s a short-term hack that I think will get it to work without having to reinstall anything:

If you have a fitted model object fit then do fit$zeta <- NULL before calling loo(fit, k_threshold = 0.7). Can you let me know if that runs without error?

github.com/stan-dev/rstanarm

loo with k_threshold error for stan_polr

opened 09:19PM - 28 Jul 20 UTC

closed 09:26PM - 28 Jul 20 UTC

jgabry

bug

loo with k_threshold errors for stan_polr because at this line https://githu…b.com/stan-dev/rstanarm/blob/bb59f7a9a6677318a498c0658c1cf81f21fced27/R/loo.R#L644 because of partial matching by `$` (`x$z` partially matches the `x$zeta` in the object returned by stan_polr). This was initially reported on discourse (https://discourse.mc-stan.org/t/issue-with-loo-with-k-threshold/17052), albeit with a slightly different error message than I was getting. I'm about to push a fix for this. @bgoodri Did you already submit the rstanarm patch with the stan_aov() fix? If not we should include this, otherwise it can wait until next release.

tomli · July 29, 2020, 7:38am

Thanks @jonah for looking into this. I tried the hack solution. Alas, it did not solve the problem, and I get the same error message as before. In case it’s helpful for further detective-pursuits, here is some info about the platform and versions used:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

other attached packages:
[1] loo_2.3.1 bayesplot_1.7.1 rstanarm_2.21.1 Rcpp_1.0.4.6 rstan_2.21.2
[6] StanHeaders_2.21.0-5 readxl_1.3.1 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5
[11] purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.1 ggplot2_3.3.0
[16] tidyverse_1.3.0

Edit: Tried to install rstanarm from Github, but it won’t work, yielding the following error message:

Error: Failed to install 'rstanarm' from GitHub:
  (converted from warning) cannot remove prior installation of package ?ellipsis?
In addition: Warning message:
In utils::untar(tarfile, ...) :
  ?tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\Rtmpykwaqj\file28bc5c57976.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/Rtmpykwaqj/remotes28bc6a526a99"? returned error code 1

I also attempted to shift to a linux cluster (x86_64-pc-linux-gnu (64-bit)), but installation of rstanarm crashed during compilation, both when running R 3.5.3 and R 3.6.3. R 4.0 is not yet available on said cluster.

Grateful for any other advice on how to proceed.

jonah · July 29, 2020, 7:24pm

Hmmm, sorry that didn’t fix it! Maybe there’s another issue in addition to the one I found. I’m not actually able to reproduce the error message

Error in .subset(x, j) : invalid subscript type ‘list’

so here are a few things could help narrow things down:

Does this only happen for you with stan_polr() or does it also happen with other rstanarm modeling functions, e.g., stan_glm()?
Are you able to share your data and code so I can reproduce this? If you can’t share them then is it possible to share a toy version using simulated data that also gives you this error? Anything that could help me be able to reproduce this particular error message.

tomli · July 29, 2020, 10:27pm

Yes, I get the same issue when using glm. Below is a code that reproduces the error on my computer both with glm and polr. The mc.core=1 was to used to circumvent the issure raised in this thread: Problems running LOO on >1 core
Here is the required data.
reducedvilt.csv (11.4 KB)

df<-read.csv("reducedvilt.csv")
options(mc.cores = 1)
require(rstanarm)
require(rstan)
rstan_options(auto_write = TRUE)

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"


WL<-stan_glm(moddef , data = df)
loo(WL,k_threshold = 0.7)


df$Inno_art<-ordered(df$Inno_art,levels=1:5,labels=LETTERS[1:5])
WL<-stan_polr(moddef , data = df, method = "logistic",
                      prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
                      algorithm = "sampling")
loo(WL,k_threshold = 0.7)

jonah · July 30, 2020, 5:51pm

Thanks for sharing the code and data. I was able to reproduce this and I figured out that the problem is that one of the internal functions called by loo when k_threshold is specified assumes the model formula is of class formula as opposed to just a string. I will fix this so in the future it works, but until then you should be able to avoid the error by changing

to be a formula like this

moddef=as.formula("Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan")

If you do that does it avoid the error? My hunch is that it avoids the error when you use loo with k_threshold after stan_glm() but when you use it with stan_polr() you might still also need this

which is fixed on GitHub but not in your installed version of rstanarm.

jonah · July 30, 2020, 6:03pm

Actually there may be another problem with loo stan_polr that I’m trying to sort out now, but the advice above should work for stan_glm. Will report back.

jonah · July 30, 2020, 6:34pm

Ok so this is the issue that keeps giving. I’ve already fixed two bugs as a result of this and just found a third:

The reason this wasn’t detected sooner is that it requires some particular features in the data to trigger it.

So my post above has a fix for stan_glm() but with this particular dataset it’s not yet going to work with stan_polr() until we fix this new issue. Just need to discuss with @bgoodri to figure out the best way to fix it.

tomli · July 30, 2020, 7:15pm

Thanks for looking into the issue. I can confirm that implementing as.formula() made it possible to run stan_glm without issues. It did however not solve running stan_polr, even if it seemed to allow the process to proceed a bit further. Without the fit$zeta <- NULL hack, I got

3 problematic observation(s) found.
Model will be refit 3 times.

Fitting model 1 out of 3 (leaving out observation 22)
Error in x$z[omitted, , drop = FALSE] : incorrect number of dimensions

With the fit$zeta <- NULL hack, I got

3 problematic observation(s) found.
Model will be refit 3 times.

Fitting model 1 out of 3 (leaving out observation 22)

Fitting model 2 out of 3 (leaving out observation 25)

Fitting model 3 out of 3 (leaving out observation 173)
Error in draws$zeta[, y_i] : subscript out of bounds

…so it seems the refitting was now done for each flagged observation, but something went wrong after that.

jonah · July 30, 2020, 7:24pm

Yeah exactly. I’ve found the problem (the issue described in the github link in my previous post) but I don’t know the exact fix yet. Once we sort it out (hopefully very soon) I’ll post back here.

jonah · July 31, 2020, 5:02pm

Maybe a bit of progress. Are you able to install rstanarm from github and see if my potential fix works? Unfortunately because it’s installing from github it will have to compile all of rstanarm at install time (unlike CRAN when everything is pre-compiled):

Sys.setenv(MAKEFLAGS = "-j4") # number of cores to use (you can change)
devtools::install_github("stan-dev/rstanarm", build_vignettes = FALSE)

tomli · August 2, 2020, 1:48pm

Unfortunately, I’m having trouble installing from github. This is what I keep getting, and I’ve tried different options for the “Enter one or more numbers…” stage with similar [lack of] progress:

Downloading GitHub repo stan-dev/rstanarm@master
stan-dev-rstanarm-5b9b826/tests/testthat/include: Can't create '\\\\?\\C:\\Users\\tomli96\\AppData\\Local\\Temp\\RtmpwtUaHX\\remotes6c5826649a7\\stan-dev-rstanarm-5b9b826\\tests\\testthat\\include'
stan-dev-rstanarm-5b9b826/tests/testthat/stan_files: Can't create '\\\\?\\C:\\Users\\tomli96\\AppData\\Local\\Temp\\RtmpwtUaHX\\remotes6c5826649a7\\stan-dev-rstanarm-5b9b826\\tests\\testthat\\stan_files'
tar.exe: Error exit delayed from previous errors.
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

 1: All                                  
 2: CRAN packages only                   
 3: None                                 
 4: tibble    (3.0.1   -> 3.0.3  ) [CRAN]
 5: vctrs     (0.3.0   -> 0.3.2  ) [CRAN]
 6: pillar    (1.4.4   -> 1.4.6  ) [CRAN]
 7: isoband   (0.2.1   -> 0.2.2  ) [CRAN]
 8: pkgload   (1.0.2   -> 1.1.0  ) [CRAN]
 9: pkgbuild  (1.0.8   -> 1.1.0  ) [CRAN]
10: backports (1.1.7   -> 1.1.8  ) [CRAN]
11: processx  (3.4.2   -> 3.4.3  ) [CRAN]
12: jsonlite  (1.6.1   -> 1.7.0  ) [CRAN]
13: httpuv    (1.5.2   -> 1.5.4  ) [CRAN]
14: htmltools (0.4.0   -> 0.5.0  ) [CRAN]
15: later     (1.0.0   -> 1.1.0.1) [CRAN]
16: promises  (1.1.0   -> 1.1.1  ) [CRAN]
17: dplyr     (0.8.5   -> 1.0.1  ) [CRAN]
18: shiny     (1.4.0.2 -> 1.5.0  ) [CRAN]
19: xfun      (0.13    -> 0.16   ) [CRAN]
20: openssl   (1.4.1   -> 1.4.2  ) [CRAN]
21: sys       (3.3     -> 3.4    ) [CRAN]
22: DT        (0.13    -> 0.14   ) [CRAN]

Enter one or more numbers, or an empty line to skip updates:

Installing 1 packages: rlang
Installing package into ‘C:/Users/tomli96/Documents/R/win-library/4.0’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/rlang_0.4.7.zip'
Content type 'application/zip' length 1129581 bytes (1.1 MB)
downloaded 1.1 MB

package ‘rlang’ successfully unpacked and MD5 sums checked
Error: Failed to install 'rstanarm' from GitHub:
  (converted from warning) cannot remove prior installation of package ‘rlang’
In addition: Warning message:
In utils::untar(tarfile, ...) :
  ‘tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\RtmpwtUaHX\file6c58a13c27.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/RtmpwtUaHX/remotes6c5826649a7"’ returned error code 1

Any suggestion for what to try?

jonah · August 2, 2020, 8:44pm

Hmm, what does the error message look like when you tell it to install none of them (option 3)? Right now it looks like there’s a message about installing the ‘rlang’ package, but that shouldn’t be necessary.

tomli · August 2, 2020, 8:58pm

It says:

Error: Failed to install 'rstanarm' from GitHub:
  (converted from warning) cannot remove prior installation of package ‘rlang’
In addition: Warning message:
In utils::untar(tarfile, ...) :
  ‘tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\RtmpwtUaHX\file6c58d141724.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/RtmpwtUaHX/remotes6c582f7e608b"’ returned error code 1

jonah · August 2, 2020, 9:56pm

Weird. I don’t know what’s up with ‘rlang’, that doesn’t really have anything to do with rstanarm. I’m just guessing but maybe try remove.packages("rlang") and then install.packages("rlang") before trying to install rstanarm?

tomli · August 3, 2020, 10:17am

Thanks jonah, that seem to have worked. I got a warning message in the rstanarm installation:

Warning message:
In utils::untar(tarfile, ...) :
  ‘tar.exe -xf "C:\Users\tomli96\AppData\Local\Temp\RtmpmcxkkY\fileb9742b1c15e5.tar.gz" -C "C:/Users/tomli96/AppData/Local/Temp/RtmpmcxkkY/remotesb97427a4815"’ returned error code 1

…but it appears to have installed correctly. I can now run the example code I provided above in its entirety (without previously suggested hacks) without encountering any issues. Thank you very much!

jonah · August 3, 2020, 6:00pm

Glad that works. I’m not sure about that warning message, but if everything seems to be working I think you can safely ignore it.

tomli · August 3, 2020, 6:00pm

Sorry @jonah, but I think there is still some issue. Uncertain if it’s with loo, rstanarm, or something else, but this code…

df<-read.csv("reducedvilt.csv")

options(mc.cores = 1)
require(rstanarm)
require(rstan)
require(loo)
rstan_options(auto_write = TRUE)

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"


df$Inno_art<-ordered(df$Inno_art,levels=1:5,labels=LETTERS[1:5])
WL<-stan_polr(moddef , data = df, method = "logistic",
                      prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
                      algorithm = "sampling")


moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha"


WoL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")
LOOWith<-loo(WL,k_threshold = 0.7)
LOOWithout<-loo(WoL,k_threshold = 0.7)

…gives the error message…

Error in stanmat[, colnames(x), drop = FALSE] : subscript out of bounds

Swapping the order of which model is fitted first, i.e…



moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha"


WoL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"

WL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")

LOOWithout<-loo(WoL,k_threshold = 0.7)
LOOWith<-loo(WL,k_threshold = 0.7)

…instead yields:

Error in eval(predvars, data, env) : object 'Lan' not found

In both cases, the first loo calculation works, but the second fail after refitting for pareto-flagged observations. I get the same results when running in parallel, but wanted to run serially to make sure there isn’t some issue related to previous problems with parallel chains on Windows. However, the following, which moves the loo calculation to immediately after stan_polr, works:

moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha + Lan"



WL<-stan_polr(moddef , data = df, method = "logistic",
                      prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
                      algorithm = "sampling")

LOOWith<-loo(WL,k_threshold = 0.7)
moddef="Inno_art ~ Vall_ha + Spann_ha + Socker_ha + Olje_ha + Balj_ha + Potatis_ha + Gron_ha + Majs_ha + Annat_ha"


WoL<-stan_polr(moddef , data = df, method = "logistic",
              prior = R2(0.25,what = "median"), init_r = 0.1, seed = 12346,
              algorithm = "sampling")

LOOWithout<-loo(WoL,k_threshold = 0.7)

Not sure what to make of it. It seems the loo calculation “remembers” the lastly fitted object and expects that object structure.

jonah · August 3, 2020, 6:17pm

Hmm that’s weird, thanks for letting me know. I’ll try to find some time later today to look into this.

Topic		Replies	Views
Loo error when specifying 'k_threshold = 0.7' Modeling loo	9	1164	March 14, 2019
LOO error when k_threshold = 0.7 fits model that drops sole observation of factor level rstanarm loo	8	1065	August 30, 2018
Max Farrell's tutorial says refit model without outliers k>0.7 Modeling loo	9	1374	October 22, 2021
Loo_moment_match not working with cmdstanr model General	6	736	January 30, 2021
Looking for examples of exact loo calculation and recombination with approximate loo for non-rstanarm stan models Modeling rstan , loo	1	490	March 14, 2021

Loo with k_threshold error for stan_polr()

Related topics