Dangerous design in rstan 2.21

Stephen_Martin · July 11, 2020, 9:32pm

? The v8-r AUR package works fine on this Arch machine at least. It doesn’t work properly with AUR frontends (e.g., yay) for some reason, so you need to clone and makepkg yourself.

Stephen_Martin · July 11, 2020, 9:35pm

Is there a reason the stan team can’t just ship the .js file in the rstan package itself? Did I miss that somewhere? It does seem frustrating that one would ‘need’ internet access for this, and although I agree with you about the practical lack of security threats here, I also imagine that it could cause some conflicts with IT policy.

bgoodri · July 11, 2020, 9:52pm

We are anticipating that the .js file will change rapidly in the coming weeks as people hit bugs in the parser and / or bugs in the Stan programs that the parser does not explain so well. And it currently runs (without the checks) without internet access. RStan 2.24 will come with a .js file so that it does not need to download one, but we won’t be able to release RStan 2.24 until a lot more parser progress has been made.

mespe · July 12, 2020, 12:43am

Feel like I am talking to a wall here, so I will try to be more clear.

The risk here, specifically, is:

The intended behavior of rstan v2.21 is that it fetches javascript code from a URL and executes it.
A malicious version of this would only need to point to a different URL controlled by someone else with the contents of stanc.js + additional javascript. No PR required, nor any interaction with the source code or the Stan dev. team.
There would be no indication to the end user that this differs from the intended behavior.
Switching the URLs in R code is trivial.

The key differences from the straw-man arguments above:

Just because X, Y, Z is also dangerous does not make this any less so.
In some ways this is more dangerous because Stan is “trusted” software, as opposed to random console commands, sourced R scripts, etc. You are quite right that Stan has a good reputation, which is why I feel this is important. Again, there are other threats out there, but this is a vulnerable piece of code in a trusted package.
The “man-in-the-middle” (not quite the correct term) is occurring in R, by-passing the security of https, github, the Stan Dev. PR system, etc. Again, swapping URLs is trivial, and will by-pass all of the “traditional” security Stan has used for 9+ years.
Building on the previous point: this is not altering Stan C++/OCaml code at the source, but redirecting to a malicious version hosted by someone else.
This is entirely preventable by using some version of code signing on the downloaded javascript, a mechanism present in almost every single software distribution system. This is not new, novel, nor unheard of. This is basic cyber-security 101.
Just because you feel you are not being targeted does not mean other Stan users are not. There are active campaigns targeting scientists and researchers. In particular, I know this is a concern amongst colleagues from China and the Middle East.

Anyway, I’ve said my piece, and belaboring the point will just make me seem more like a loon to ya’all.

As an aside: It would have at least been polite to ask for an opt-in on this. Many people would be happy to test their models against your code. I would have set up a cron job to do this nightly for the 100+ models I have written in my 6 years using Stan if I had been asked. We, your users, do not need to be silently co-opted into helping you. Doing it this way is anti-social and disrespectful to your users.

bgoodri · July 12, 2020, 1:00am

I think we disagree as to how trivial it is to get this line of rstan code

ctx$source("https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js")

to actually load a Javascript file from somewhere else. But there are Man-In-The-Middle-Like scenarios where that is theoretically possible. However, a big part of the security model with V8 is that no code is trusted. I have not heard you or anyone else explain why I should be very worried about executing Javascript code that is not given access to the file system.

mespe · July 12, 2020, 1:39am

Here you go:

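x = function() { print("Hello world!") }

# With no arguments, the list holds only the function body
y = as.list(x)
# y[[1]] is the `{` block, [[2]] the print() call, [[2]] its string argument
y[[1]][[2]][[2]] = "Hello evil world"
x = as.function(y)

> x()
[1] "Hello evil world"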

As I said, R lets you program on the language. Hard coded URLs can be changed. Quite easily in fact.

If I wanted to, I could obscure what is going on here, but I think you see the point.

mespe · July 12, 2020, 1:44am

And here is why you should be concerned about untrusted code in V8, straight from the horse’s mouth:

https://v8.dev/docs/untrusted-code-mitigations

“However, if your embedded V8 instance allows arbitrary or otherwise untrustworthy JavaScript or WebAssembly code to be downloaded and executed, or even generates and subsequently executes JavaScript or WebAssembly code that isn’t fully under your control (e.g. if it uses either as a compilation target), you may need to consider mitigations.”

And with that, I am done - this is no longer productive.

bgoodri · July 12, 2020, 2:28am

I do think I see your point (as did @rok_cesnovar and @Charles_Driver above) but for the benefit of others who may be reading this thread, these concerns have always been present with R or other languages that allow you to operate on the language at runtime. A similar example would be

library(rstan)
stanc <- function(...) print("Hello evil world")
assignInNamespace("stanc", stanc, ns = "rstan")
example(stanc, package = "rstan")

prints “Hello evil world” rather than parsing the example Stan program. But if someone wanted to do something evil like that, they would not need to subvert the line of R code in question (ctx$source("https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js")), because they could just replace stanc, or stan, or any of the many functions it calls with a malicious version.

All that said, presumably no one wants to inflict such harm on themselves (or else they are already capable of that), so the more pertinent question is whether a third party (who is neither me nor the RStan user) can replace stanc, stan, etc. with malicious versions of these functions without the user realizing it while the user is using RStan. Those sorts of attacks would seem to require that the third party has access to the hard disk (in order to change .Rprofile or something like that to silently replace the functions), and if the third party has access to the hard disk, then the third party can do a lot of damage without involving R or RStan.

As I said before, Stan is potentially dangerous because it compiles and executes C++ code that is generated on-the-fly. But I don’t think there should be any people for whom versions of RStan <= 2.19.x are barely safe enough while RStan 2.21.x is barely not safe enough because it also downloads the parser that will be used in RStan 2.24 to check if the Stan program will continue to parse.

bgoodri · July 12, 2020, 3:12am

If you are done writing about this, fine. But there is no way of knowing how many people are still reading this thread in the future, so I am going to have to continue pointing out why I think they should not freak out.

I am more open to hearing concerns like the issues mentioned in

https://v8.dev/docs/untrusted-code-mitigations

than attacks that hinge on modifying R functions, which can be done regardless of whether the RStan code reads a Javascript file from the internet. This link refers to exploits of speculative execution optimizations that have been going on for the past few years. More broadly, Stan users should be somewhat worried about these SPECTRE-style attacks because the C++ code in the Stan Math library and Stan algorithms has all kinds of branching.

On the Javascript parser specifically, I would say the file being downloaded

https://github.com/stan-dev/stanc3/releases/download/nightly/stanc.js

is semi-trusted. Other Stan developers wrote it and I trust them to not do anything malicious. But there are some scenarios where one of their GitHub accounts gets hacked, the URL gets redirected, etc. where the RStan code could end up reading unintended Javascript code into a V8 context.

The above link says that “you may need to consider mitigations” to speculative execution attacks, such as

“Update to the latest V8 to benefit from mitigations and enable mitigations”, specifically to version v6.4.388.18 or later. Since that version was released in 2018, most people should have a V8 with these mitigations, although the V8 CRAN package does mention that it can use version 3.14 of the V8 library, which may be present on older servers and does not have those mitigations.
“If you execute untrusted JavaScript and WebAssembly code in a separate process from any sensitive data, the potential impact of SSCA is greatly reduced.” Although RStan does not execute the next version of the parser in a separate process, my understanding is that when it does ctx <- V8::v8() that R handle to the Javascript context cannot access other R objects, files on the hard disk, etc. If so, then the data in question is just the text of the Stan program, which I don’t think is sensitive in the way it is being defined in that quote.
“If your product offers high-precision timers that can be accessed by untrusted JavaScript or WebAssembly code, consider making these timers more coarse or adding jitter to them.” I don’t think the Javascript parser uses any high-precision timers.

Another mitigation would seem to be to garbage-collect ctx when done with it, which I am pretty sure happens when the stanc_beta() function finishes.

mespe · July 12, 2020, 3:29am

“As I said before, Stan is potentially dangerous because it compiles and executes C++ code that is generated on-the-fly.”

Yes, this. But rstan now also pulls code from the Web.

If I used lm() and it downloaded and compiled code, I would be very suspicious. In fact, other R code to bootstrap malicious javascript would probably need 1) code to download, compile, and execute that javascript code (also installing all the required dependencies) and 2) code to hide this from the user OR social engineering work to make the user trust it.

With v2.21, you have made this intended behavior in rstan. You did all the hard work. Basically, you built the framework for nasty stuff, so the barrier to mischief is significantly lower. Instead of all that work, someone can just hijack yours with a few lines to substitute the URL. Added bonus that all the above will not be suspicious to your users because you have made this intended behavior AND have your users’ trust.

Accordingly, I think you need some guardrails. Basic checks on the download. This does not seem like a lot to ask. Why this discussion has been protracted this long is beyond me…
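
As a rough illustration of what a "basic check on the download" could look like, here is a minimal sketch that refuses to use a file unless its checksum matches a value pinned in advance. It is a plain checksum comparison, not code signing, and it is not how rstan works. The helper name verify_then_use is hypothetical, and MD5 is used only because tools::md5sum ships with base R; a real check would use a stronger hash or a signature, and would need a fresh pinned value published for each nightly build.

```r
# Hypothetical helper (not part of rstan): run `action` on a file only if its
# checksum matches a value pinned in advance through a separate channel.
verify_then_use <- function(path, expected_md5, action) {
  actual <- unname(tools::md5sum(path))  # NA if the file does not exist
  if (is.na(actual) || !identical(actual, expected_md5)) {
    stop("Checksum mismatch for ", path, "; refusing to use it.")
  }
  action(path)
}

# Demonstration with a local temporary file standing in for stanc.js.
js_file <- tempfile(fileext = ".js")
writeLines("var answer = 42;", js_file)
pinned <- unname(tools::md5sum(js_file))  # assumption: recorded at release time

# Matching checksum: the action runs.
ok <- verify_then_use(js_file, pinned, function(p) length(readLines(p)))

# Modified file: the action is never reached.
writeLines("var answer = 'evil';", js_file)
tampered <- tryCatch(
  verify_then_use(js_file, pinned, function(p) "executed"),
  error = function(e) "rejected"
)
```

A check like this only helps against a file that changes in transit or at the host. It does nothing if the R code performing the check is itself replaced.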

Anyway, this horse is thoroughly dead (and I am a fool to keep engaging).

wds15 · July 12, 2020, 6:31pm

What would be a safe design in your eyes?

The constraint we have is that we have to use the latest stanc3 parser to make sure that only unreported bugs trigger a message.

Stephen_Martin · July 13, 2020, 8:45am

I’m not sure what else could be done here. Putting Ben’s excellent points about the safety of V8, due to running in its own context, to the side, you’re concerned about someone injecting a call into a script/package to run something nefarious. However, the manner in which one would need to do that is no different than what one would need for any R code. If someone ran any R code that replaced a user-exposed function with another that did nefarious things, then it’d have the same problem. So the key question is, what is the Stan team doing that increases the attack surface?

I don’t really see an increase in the attack surface here. I could, right now, write a function called sampling that calls rstan::sampling, but also spins up a bitcoin miner in a forked session. I could replace [.data.frame with a function that subsets, but also reads all files in ~/Documents and sends them off to my server. You can’t save users from themselves. Guardrails wouldn’t work either; e.g., having an MD5 hash check wouldn’t help; one could replace the expression graph in R with one that removes the md5 check and downloads another file altogether.
The key is, the user has to explicitly run these things, unless someone gets access to the stan git account, or some fairly sophisticated MITM attack occurs (which could happen regardless of Stan), or if someone manually injected the installed stan package with a modified version (which could happen regardless of Stan).

The ‘exploits’ you’re discussing are those that require manually changing what is run in the R session; I don’t honestly see how pulling a file from a github account would increase the odds of that happening. I think your concerns would go further if you could explain how pulling a .js file from their github would increase the attack surface beyond what is already exposed in any given R session (i.e., how it’s practically different from just running nefarious code, and replacing the expression tree with something nefarious). Otherwise, it’s nothing unique to stan, and really more a statement about how R’s runtime flexibility means that injecting bad code is simply one source() call away.

And after all that, it’s /still/ run in an independent context without much system access anyway.

rok_cesnovar · July 13, 2020, 9:00am

If someone is able to do a MITM on https, then downloading a file could be problematic.

But:

that is very unlikely
if someone can perform a MITM attack on https on you, then you are in big trouble anyways.

Just this Discourse page I currently have open downloads 25 javascript files and runs them in a V8 engine the same way rstan does (I am on Chrome). If you are using some other browser it’s a different javascript engine, but the same thing.

Not to mention if someone can do that, installing any package from CRAN or anywhere is also dangerous and not to be trusted. So if we consider HTTPS MITM is likely, then nothing you do in R would be considered safe.

Stephen_Martin · July 13, 2020, 9:09am

Right. The most probable ways I see of this happening are if 1) Someone got access to the git repo [in which case the URL doesn’t matter, and the threat would be much more if, say, the compiler itself were changed to do worse things in C++] 2) Someone forked the repo and socially engineered people to use it instead [in which case, see point 1 - the JS URL wouldn’t matter, b/c they could just nest the threat in R or C++-compiled code] 3) Someone changed files on your system to replace the URL [in which case, they wouldn’t need to do so, because they already have access to your system and you have bigger problems to worry about].

In sum - The theoretical means by which someone could exploit this code would immediately suggest that they already have a fairly sophisticated exploit, and they wouldn’t need to exploit this code. In fact, if they already have the means to exploit the URL-approach, they would have the means to do something far worse elsewhere, given that the JS context is limited, and it’d be far and away more powerful to just alter the R code or generated C++. It would be a very confusing situation if an attacker were so sophisticated that they could alter the URL, but so unsophisticated that they would choose to do so, when so much of the rest could be more efficiently and powerfully attacked.

wds15 · July 13, 2020, 10:30am

If I recall we currently do download stanc3 no matter what. Is it an improvement to allow for an opt-out solution here?

We really need the feedback from (rstan) users on stanc3, so an opt-in would probably not get us what we need, but a configurable option to avoid the download of the next stanc3 is something we could consider.

josswright · July 13, 2020, 10:40am

From my perspective (and for context I’ll note that cybersecurity is my research area) I’m far more concerned about the effectively silent 1.5MB download every time stanc is run than I am with any theoretical security implications that have been discussed in this thread.

1.5MB isn’t that much these days but I’ve been in various situations, such as using mobile data whilst roaming abroad, where I could have been bitten by that.

A warning message that the download is happening, why it’s happening, and notifying the user of a parameter to opt-out, would seem to me a polite and reasonable way to do this.
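
A minimal sketch of that idea, assuming a hypothetical option name rstan.check_next_parser (rstan 2.21 is not known to define it) and a stand-in function in place of the real download step:

```r
# Hypothetical option name; the default (TRUE) keeps the check opt-out.
should_check_next_parser <- function() {
  isTRUE(getOption("rstan.check_next_parser", default = TRUE))
}

# `download_and_check` stands in for whatever fetches and runs the parser.
check_next_parser <- function(download_and_check) {
  if (!should_check_next_parser()) {
    return(invisible(NULL))  # opted out: no download, no message
  }
  message(
    "Downloading the nightly stanc3 parser (about 1.5MB) to test it against ",
    "your model. Set options(rstan.check_next_parser = FALSE) to skip this."
  )
  download_and_check()
}

# Usage with a placeholder for the download step.
old <- options(rstan.check_next_parser = FALSE)
skipped <- check_next_parser(function() "checked")  # nothing is run
options(rstan.check_next_parser = TRUE)
ran <- check_next_parser(function() "checked")      # notice shown, then run
options(old)                                        # restore previous setting
```

Using an R option rather than a function argument means a user on a metered connection can set it once, for example in .Rprofile, without touching every call.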

rok_cesnovar · July 13, 2020, 11:03am

Yes, but definitely opt-out. If we go opt-in we will get limited feedback. We caught a few bugs now already and I like that. We should keep that up so the move to rstan 2.24 (which I understand will be the next version?) will be as painless as possible.

That is valid yeah. We will only need this for 2.21 as the js will afterwards be included in the package. Opt-out would fix this right now as well.

ssp3nc3r · July 13, 2020, 1:11pm

Or maybe tell users what your goal is in some message … when I first read the message currently used, I had no idea of its origins or reasons. If you tell users how they are helping in the message, maybe something like:

When you compile models, you are also contributing to development of the next Stan compiler. In this version, we compile your model as usual, but also test our next compiler against what is working now. In this case, the new compiler worked great. Thank you for helping.

When you compile models, you are also contributing to development of the next Stan compiler. In this version, we compile your model as usual, but also test our next compiler against what is working now. In this case, the anticipated, future compiler didn’t work like we hoped. By submitting an issue [URL], you’ll be contributing valuable information to the project.
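
The two messages above share their opening and differ only in the outcome, so they could be produced by one small function. This is only a sketch: parser_check_message is a hypothetical name, the wording is abbreviated from the proposal, and the issue URL is left as a placeholder because the proposal does not specify it.

```r
# Hypothetical helper: build the user-facing message from the check result.
parser_check_message <- function(new_parser_ok, issue_url) {
  intro <- paste(
    "When you compile models, you are also contributing to development",
    "of the next Stan compiler."
  )
  outcome <- if (isTRUE(new_parser_ok)) {
    "In this case, the new compiler worked great. Thank you for helping."
  } else {
    paste0(
      "In this case, the anticipated, future compiler didn't work like we ",
      "hoped. By submitting an issue (", issue_url, "), you'll be ",
      "contributing valuable information to the project."
    )
  }
  paste(intro, outcome)
}

# "ISSUE_URL" is a placeholder, not a real address.
msg_ok <- parser_check_message(TRUE, "ISSUE_URL")
msg_fail <- parser_check_message(FALSE, "ISSUE_URL")
```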

stevebronder · July 13, 2020, 10:06pm

love this. Tho’ if a model succeeds I think we can be quiet and pat ourselves on the head that business is running as usual

Also agree with the opt-out param, not from a security standpoint but just from the mobile data standpoint. Could have it as an option.

jonah · July 13, 2020, 10:34pm

Yeah I agree this is a good change to the message. @ssp3nc3r Any interest in submitting a PR to change it to your version?
