Specifying output file names

I am working on a distributed computing pool and wish to run many independent instances of an analysis. If these jobs are submitted at the same time, I will get file-name clashes and some results will be overwritten. Is there some way to specify the name of the output CSV file in CmdStanPy when it is copied back over from the /tmp directory?

  • Operating System: Ubuntu 18.04.4 LTS
  • CmdStan Version: 2.23.0
  • Compiler/Toolkit: g++

I might be misunderstanding, but does CmdStanPy’s output_dir argument do what you need? (In case that’s not what you need I’m tagging CmdStanPy expert @mitzimorris)

what @jonah said - you should be able to create unique output directory names via the ‘output_dir’ arg to the sample method. alternatively, once you have the CmdStanMCMC object, you can use the “save_csvfiles” method.
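
A minimal sketch of the output_dir route, assuming `model` is an already-compiled CmdStanModel. The helper names `unique_output_dir` and `sample_to_unique_dir` and the `job_tag` argument are made up for illustration; only `sample()` and its `output_dir` argument come from CmdStanPy.

```python
import uuid
from pathlib import Path


def unique_output_dir(base_dir, job_tag):
    """Create and return a fresh directory under base_dir for one job."""
    # A random suffix keeps concurrent jobs with the same tag apart.
    out_dir = Path(base_dir) / f"{job_tag}-{uuid.uuid4().hex[:8]}"
    out_dir.mkdir(parents=True, exist_ok=False)
    return out_dir


def sample_to_unique_dir(model, data, base_dir, job_tag):
    """Run model.sample() so its CSV files land in a per-job directory."""
    out_dir = unique_output_dir(base_dir, job_tag)
    # output_dir is the sample() argument mentioned above.
    fit = model.sample(data=data, output_dir=str(out_dir))
    return fit, out_dir
```

Each call gets its own directory, so two jobs started in the same minute should not overwrite each other's CSV files even though the file names themselves still follow the default pattern.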

Given the nature of the distributed file system I am working with, specifying directories with output_dir or save_csvfiles doesn’t solve my problem. The workflow management system (HTCondor) copies my Stan executable and input data to the local scratch drive for faster IO, so naming a directory via CmdStan means the program looks for that directory name in local scratch, not in my home directory on the submit machine.

Would either the CmdStanModel or the CmdStanMCMC object contain the name of the streamed output CSV file, or at least the components to reconstruct it, so that I don’t have to grep a directory?

Wanted also to say, I love the CmdStan + Py/R wrapper design!

yes, the CmdStanMCMC object’s __repr__ function will print the names and locations of the output csv files.
so print(<obj>) should work, no?

you’ll have to parse the csv_file names from the output…
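
A rough sketch of that parsing, assuming the repr text has a `csv_files:` header line followed by one tab-indented path per line, and then another header such as `output_files:`. The function name `csv_files_from_repr` is made up, and the layout is an assumption that may differ between CmdStanPy versions.

```python
def csv_files_from_repr(text):
    """Return the paths listed under the 'csv_files:' header of a repr string."""
    files = []
    in_csv_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "csv_files:":
            # Start collecting on the line after this header.
            in_csv_section = True
        elif stripped.endswith(":"):
            # Any other header (e.g. 'output_files:') ends the section.
            in_csv_section = False
        elif in_csv_section and stripped:
            files.append(stripped)
    return files


# Hypothetical usage with a fitted object named `fit`:
# csv_paths = csv_files_from_repr(repr(fit))
```

It returns an empty list when no `csv_files:` header is found, so the caller can decide how to handle that case.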

Yeah I see how parsing that is a bit of a pain. Doable but not pleasant.

(Pdb) fit.__repr__()
"CmdStanMCMC: model=schools chains=1['method=sample', 'num_samples=2000', 'num_warmup=1000', 'save_warmup=1', 'thin=1', 'algorithm=hmc', 'adapt', 'engaged=1']\n csv_files:\n\t/m/home/home2/21/westm1/unix/projects/eight_schools/schools-202007022027-1.csv\n output_files:\n\t/m/home/home2/21/westm1/unix/projects/eight_schools/schools-202007022027-1-stdout.txt"

runset.set_csvfiles() is the function that moves the files from /tmp to the given directory. Would it be acceptable, @mitzimorris, for that function to take an additional optional argument, à la new_filename, so that one could change the StanExeName-DateTime-Chain pattern?

short answer no. chain number is necessary.

parsing a little bit of text is what Python excels at - I know it’s a pain, but you should write the custom function that suits your needs.

OK, leaving chain number…
I can write up the PR.

I agree that chain is necessary. Other than chain number I think it should be fine to change the names, so I’m OK with new_filename if the chain number is still tacked on at the end. @mitzimorris ?

I don’t think this is generally a good idea, so no, I don’t want to add this feature.
as I said earlier, this isn’t hard for a user to implement for themselves given their use case.

from the point of view of both usability and maintainability, the fewer features, the better.

furthermore, putting timestamp and chain number in a file name is for the user’s own good. managing output files is a long-term reproducibility issue - we’re trying to help.
output is like a lab notebook. do you take off the bar code and tracking number of your samples? please don’t!

I understand the desire to help, but forcing users to follow the developers’ required organizational framework is problematic.

I can solve this problem with parsing .__repr__ as suggested.

The organizational framework addresses two common modes of working with Stan:

  • development - run the model over and over, outputs are summarized and visualized, but not saved

  • production - the user has a production set-up which is user specific.

A production pipeline should be scripted, so specifying additional arguments isn’t a burden.

We’ve gone with default options that favor learning/teaching/development because this is where most people spend their time working. therefore writing to /tmp is the default.

I could go either way on this.

On the one hand, I agree with @mitzimorris that we’re trying to provide the options that are safest and make sense in the vast majority of cases.

But I also agree with @mtwest that forcing users to follow the developers’ required organizational framework is problematic.

But I guess we’re not technically forcing it here (it’s possible to get around it with a bit of work, as discussed).

So I’m OK with leaving it as is, but I’m also OK with adding the (non-default) option that @mtwest is requesting.

In fact, I’m sure that @mtwest isn’t the only person who will encounter this problem. We would like these CmdStan wrappers to play nicely with distributed file systems such as the one @mtwest is using, so if this will be a regular issue for people it seems reasonable for us to address that. Or is this something more unique to @mtwest’s particular setup than I realize?

@rok_cesnovar what do you think?

@jonah, as I marked in my previous reply, I am following @mitzimorris’s advice and just parsing some of the metadata. This isn’t worth taking up more thread inches. I’d say just drop it.

did you mean to say “issue”? all PRs must have an issue.
hashing this out in the forums is fine too.

perhaps I missed something w/r/t the problem and proposed solution.

  • there’s a cluster run, each node puts files to its /tmp
  • the user wants to have a single output dir with all csv files from all nodes
  • save_csvfiles doesn’t let you change filename

we could add a “node id” prefix to the existing csv file. I object to wholesale renaming, but prefixing is easy.
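
To make the prefixing idea concrete, here is a user-side sketch that uses only the standard library. It is not part of CmdStanPy; `copy_with_prefix` and `node_id` are hypothetical names, and `node_id` could be anything unique per job, for example a string built from the HTCondor job identifiers.

```python
import shutil
from pathlib import Path


def copy_with_prefix(csv_files, dest_dir, node_id):
    """Copy each CSV into dest_dir as '<node_id>-<original file name>'."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, csv_files):
        # The original name (model, timestamp, chain) is kept intact.
        target = dest / f"{node_id}-{src.name}"
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

Because the prefix is only prepended, the model name, timestamp and chain number stay in the file name, and a clash raises an error instead of silently overwriting results.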

@mtwest does this address your use case?

Yes I meant issue. My apologies for being unaware of the history of people producing PRs without consulting the group first about potential changes.

issue filed: https://github.com/stan-dev/cmdstanpy/issues/254

Proposal for argument prefix is to preserve information: model, timestamp, chain - I consider this useful for managing the analysis process, which can be very messy. Also, easy to implement. Discuss?

I like that idea. @mtwest Would that solve your problem? I guess we would default to not having a node-id but allow it to be specified when calling the save_csvfiles() method?

(@mitzimorris On another note, in CmdStanR we use the name save_output_files() to match argument output_dir and to distinguish from save_latent_dynamics_files(), since those are all csv files too. Should I open an issue in CmdStanPy to unify these names or do you want to stick with save_csvfiles()?)