Getting environment variables

Sometimes it’d be useful to pass integers or reals to a CmdStan program which are somewhat independent of the data set, e.g. grain size for reduce_sum or rel_tol for the ODE solver. I wonder what others think of the idea of utility functions like

int get_env_int(string key);
real get_env_real(string key);

but the parser doesn’t like string as a type so I resorted to getting just one variable,

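#include <cstdlib>
#include <ostream>

// Reads the grain size from the STAN_GRAINSIZE environment variable.
// Note: std::getenv returns a null pointer if the variable is unset,
// and that case is not handled here.
int get_grainsize(std::ostream* pstream__) {
  const char* str_grainsize = std::getenv("STAN_GRAINSIZE");
  return std::atoi(str_grainsize);
}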

I wondered if strings will be added some day to the parser? Does it seem like a bad idea generally to create another way for variables to enter the model? I still think there’s a case for “dev” tunables like this.
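For illustration, here is a minimal sketch of what the C++ side of get_env_int and get_env_real could look like. This is only an assumption about a possible design, not existing Stan code: the helper get_env_string and the choice to throw when the variable is unset or not fully numeric are mine.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: fetch an environment variable or throw if it is unset.
inline std::string get_env_string(const std::string& key) {
  const char* raw = std::getenv(key.c_str());
  if (raw == nullptr) {
    throw std::runtime_error("environment variable not set: " + key);
  }
  return std::string(raw);
}

// Hypothetical C++ counterpart of the proposed get_env_int(string key).
inline int get_env_int(const std::string& key) {
  const std::string text = get_env_string(key);
  std::size_t pos = 0;
  const int value = std::stoi(text, &pos);  // throws if text is not numeric
  if (pos != text.size()) {
    throw std::invalid_argument("not an integer: " + text);
  }
  return value;
}

// Hypothetical C++ counterpart of the proposed get_env_real(string key).
inline double get_env_real(const std::string& key) {
  const std::string text = get_env_string(key);
  std::size_t pos = 0;
  const double value = std::stod(text, &pos);  // throws if text is not numeric
  if (pos != text.size()) {
    throw std::invalid_argument("not a real: " + text);
  }
  return value;
}
```

With STAN_GRAINSIZE=100 exported in the shell, get_env_int("STAN_GRAINSIZE") would return 100. Throwing on a missing variable is just one option; a default value would work too.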

If this were added in the parser, we could support a constant string, much like we support it for print(). I was just working on this for a profiling prototype, where we require profile section names. It’s not as easy as adding a regular Stan Math function, but it’s not a huge deal either.

One thing that is problematic with environment variables is that they can be modified at runtime. But I guess if someone does that, it’s their problem.

How about

./model sample data file=data.json runtime_settings grainsize=100 b=2 c=3 

There are probably better names than runtime_settings. grainsize, b and c would be added to the var_context and passed to the model constructor as if they were part of the data file. If we limit this to scalars (which I guess is the actual use case), this should be doable.

When you are working on a cluster in the shell and just want to update a single number (like the grainsize) it can be very annoying to open a data file and change it (more so if the data file is big, which is usually the case when you resort to using a cluster or an AWS instance), so I would support any resolution to this problem.

I thought of passing them as command line arguments as well, but I’m not sure how much work that is. Supporting both (with CLI args overriding env vars) would cover all the scenarios I can imagine (HPC, CI pipelines, user scripts), with functions like the above.
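A rough sketch of that precedence rule, command line first and environment second, might look like the following. The function resolve_setting and the idea of holding the parsed command-line values in a std::map are assumptions for illustration only; nothing like this exists in CmdStan as far as this thread shows.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical: resolve one tunable by name.
// A value given on the command line wins; otherwise fall back to the
// environment variable; otherwise report that nothing was found.
inline bool resolve_setting(const std::map<std::string, std::string>& cli_args,
                            const std::string& name,
                            const std::string& env_name,
                            std::string& out) {
  const auto it = cli_args.find(name);
  if (it != cli_args.end()) {
    out = it->second;  // command-line value takes priority
    return true;
  }
  if (const char* raw = std::getenv(env_name.c_str())) {
    out = raw;  // environment variable is the fallback
    return true;
  }
  return false;  // neither source defines the setting
}
```

The caller still has to decide what to do when the function returns false, which is the same "variable not present" question that comes up with plain env vars.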

I hadn’t thought of the environment changing while a process is running, but I can’t imagine a situation where it’s useful, since for such tuning purposes, it’s the final summary stats (neff/s) per tunable value that should be used.


On second thought, I would just drop the idea of env vars for now. Including some of the data { } vars on the command line would be sufficient (env vars can be passed as args without problem), and it avoids problems like what to do if the env var isn’t present.

Lastly, it might be easier to implement as a single csv argument,

data file=foo.json values=grainsize=100,b=2,c=3

since it seems like the argument parser constructs the parse tree prior to doing the parsing, without a varargs mechanism.
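To make the single-argument idea concrete, here is a sketch of splitting a string such as grainsize=100,b=2,c=3 into named scalars. The names scalar_settings and parse_values are made up for this example, and the rule "anything that parses fully as an integer is an int, otherwise try real" is my own assumption rather than anything CmdStan does.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical container for scalars given on the command line.
struct scalar_settings {
  std::map<std::string, int> ints;
  std::map<std::string, double> reals;
};

// Hypothetical parser for a value list like "grainsize=100,b=2,c=3".
inline scalar_settings parse_values(const std::string& csv) {
  scalar_settings out;
  std::istringstream stream(csv);
  std::string item;
  while (std::getline(stream, item, ',')) {
    const std::size_t eq = item.find('=');
    if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
      throw std::invalid_argument("expected name=value, got: " + item);
    }
    const std::string name = item.substr(0, eq);
    const std::string text = item.substr(eq + 1);
    std::size_t pos = 0;
    try {
      // Assumption: text that is consumed entirely by stoi is an int.
      const int as_int = std::stoi(text, &pos);
      if (pos == text.size()) {
        out.ints[name] = as_int;
        continue;
      }
    } catch (const std::exception&) {
      // Not an int (or out of int range); fall through and try real.
    }
    const double as_real = std::stod(text, &pos);  // throws if not numeric
    if (pos != text.size()) {
      throw std::invalid_argument("not a number: " + text);
    }
    out.reals[name] = as_real;
  }
  return out;
}
```

The resulting maps could then be fed to whatever mechanism ends up adding scalars to the var_context.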

Edit: I tried to start implementing this. Reading the argument parser code is easy, but I don’t really understand how the var_context objects figure out what’s real and what’s not before getting passed to the model constructor.

You would have to add functions add_int_scalar() and add_real_scalar() to var_context and then call those in command.hpp of cmdstan.

Like for example (this is hardcoded, just to show the point):

  //                Initialize Model              //

  std::string filename(
      dynamic_cast<string_argument *>(parser.arg("data")->arg("file"))
          ->value());
  std::shared_ptr<stan::io::var_context> var_context
      = get_var_context(filename);

  // new code
  var_context->add_int_scalar("N", 5);
  var_context->add_real_scalar("d", 1.2);
  // end of new code

  stan::model::model_base &model
      = new_model(*var_context, random_seed, &std::cout);

I threw something together quickly for var_context, though it does not yet work:
It does increase the size of the vars maps, but then they disappear. I probably missed something subtle, but I think this is close.

My reading of the var_context and the json and rdump implementations was that var_context provides the generic interface but not the storage. I think the missing piece in your code is to have the methods which return the list of data to first look in the maps you added before looking in the data storage read from the file.

Another way to structure it would be to have a second var context which holds the CLI settings, and a final var context object which knows how to merge the two sources of variables. This seemed a little overly complex though, so I left it on the back burner, and it would be easier to have a working proof of concept based on your changes.
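Here is a self-contained sketch of the "look in the added maps first, then in the file data" idea, written as a wrapper around an existing context. To keep it compilable on its own it uses mini_context, a cut-down stand-in for stan::io::var_context with only contains_i and vals_i; the class names map_context and override_context are hypothetical, and the real interface has more methods (the _r variants, dims and names) that would need the same treatment.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for stan::io::var_context (int methods only).
struct mini_context {
  virtual ~mini_context() {}
  virtual bool contains_i(const std::string& name) const = 0;
  virtual std::vector<int> vals_i(const std::string& name) const = 0;
};

// Stand-in for a context whose values were read from a data file.
struct map_context : mini_context {
  std::map<std::string, std::vector<int> > ints;
  bool contains_i(const std::string& name) const override {
    return ints.count(name) > 0;
  }
  std::vector<int> vals_i(const std::string& name) const override {
    const auto it = ints.find(name);
    return it == ints.end() ? std::vector<int>() : it->second;
  }
};

// Hypothetical wrapper: scalars added here shadow the wrapped context.
class override_context : public mini_context {
 public:
  explicit override_context(const mini_context& base) : base_(base) {}

  void add_int_scalar(const std::string& name, int value) {
    overrides_i_[name] = value;
  }
  bool contains_i(const std::string& name) const override {
    return overrides_i_.count(name) > 0 || base_.contains_i(name);
  }
  std::vector<int> vals_i(const std::string& name) const override {
    const auto it = overrides_i_.find(name);
    if (it != overrides_i_.end()) {
      return std::vector<int>(1, it->second);  // scalar as a length-1 vector
    }
    return base_.vals_i(name);  // not overridden: defer to the file data
  }

 private:
  const mini_context& base_;  // must outlive this wrapper
  std::map<std::string, int> overrides_i_;
};
```

The design choice here is that the wrapper owns the storage for the command-line scalars, so neither the JSON nor the Rdump implementation has to change.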

var_context assumes that variables are either real or int, hence all methods have the suffix _r or _i.

This is part of the implementation: for the Rdump format we have our own ad-hoc parser, and for JSON we’re using the rapidjson parser.

My question/confusion was really about how we should be able to add variables to the var context, since this is just an interface (i.e. the documented methods are all virtual). I wouldn’t implement it for each file type, since we are talking about variables passed not as a file but as command line arguments.