I am working on a distributed computing pool and wish to run many independent instances of an analysis. If these jobs are submitted at the same time, I will get file-name clashes and some results will be overwritten. Is there some way to specify the name of the output CSV file in CmdStanPy when it is copied back over from the /tmp directory?
I might be misunderstanding, but does CmdStanPy's output_dir argument do what you need? (In case that's not what you need I'm tagging CmdStanPy expert @mitzimorris)
what @jonah said - you should be able to create unique output directory names via the "output_dir" arg to the sample method. alternatively, once you have the CmdStanMCMC object, you can use the "save_csvfiles" method.
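For anyone reading along, here is a minimal sketch of the unique-directory idea. The helper name `unique_output_dir`, the `run-` prefix, and the job ID value are assumptions made for illustration, not part of CmdStanPy. The commented lines show where the directory would be handed to `sample(output_dir=...)` or `save_csvfiles(dir=...)`; the model and data file names there are placeholders.

```python
import uuid
from pathlib import Path


def unique_output_dir(base_dir, job_id=None):
    """Create and return a per-job output directory under base_dir.

    job_id is whatever identifier the scheduler provides (hypothetical here).
    """
    # Fall back to a random token when no scheduler job ID is supplied.
    tag = str(job_id) if job_id is not None else uuid.uuid4().hex[:8]
    out = Path(base_dir) / f"run-{tag}"
    out.mkdir(parents=True, exist_ok=True)
    return out


# Possible use with CmdStanPy (file names are placeholders):
# from cmdstanpy import CmdStanModel
# model = CmdStanModel(stan_file="my_model.stan")
# out = unique_output_dir("results", job_id=42)
# fit = model.sample(data="my_data.json", output_dir=str(out))
# Or, after sampling to the default temporary directory:
# fit.save_csvfiles(dir=str(out))
```

Two jobs that receive different IDs write to different directories, so identical file names inside them no longer clash.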
Given the nature of the distributed file system I am working with, specifying directories with output_dir or save_csvfiles doesn't solve my problem. The workflow management system (HTCondor) copies my Stan executable and input data over to the local scratch drive for faster IO, so naming a directory via CmdStan means the program looks for that dirname in local scratch, not in my home directory on the submit machine.
Would either of the CmdStanModel or CmdStanMCMC objects contain the name of the streamed output CSV file, or I guess the components to reconstruct it, without having to grep a directory?
runset.set_csvfiles() is the function that moves the files from /tmp to the given directory. Would it be acceptable, @mitzimorris, for that function to take an additional optional argument, ala new_filename, so that one could change StanExeName-DateTime-Chain?
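As a rough sketch of recovering those components without grepping a directory: the snippet below splits a file name of the StanExeName-DateTime-Chain form into its parts. The exact pattern is an assumption based on the description in this thread and may differ between CmdStanPy versions, so the regex should be checked against real output. The paths themselves could come from the fit object's run set (for example its list of CSV file paths) rather than from a directory listing.

```python
import re
from pathlib import Path

# Assumed pattern: <model>-<timestamp>-<chain>.csv
# Adjust this if your CmdStanPy version names files differently.
CSV_NAME = re.compile(r"^(?P<model>.+)-(?P<timestamp>\d+)-(?P<chain>\d+)\.csv$")


def parse_csv_name(path):
    """Split an output CSV file name into model, timestamp and chain."""
    name = Path(path).name
    match = CSV_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized output file name: {name}")
    return {
        "model": match["model"],
        "timestamp": match["timestamp"],
        "chain": int(match["chain"]),
    }


# Example with a made-up path:
info = parse_csv_name("/tmp/tmpabc/bernoulli-202003011530-2.csv")
```

With the parts in hand, a job script can build whatever name it needs while keeping the chain number.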
I agree that chain is necessary. Other than chain number I think it should be fine to change the names, so I'm OK with new_filename if the chain number is still tacked on at the end. @mitzimorris ?
I don't think this is generally a good idea, so no, I don't want to add this feature.
as I said earlier, this isn't hard for a user to implement for themselves given their use case.
from the point of view of both usability and maintainability, the fewer features, the better.
furthermore, putting timestamp and chain number in a file name is for the user's own good. managing output files is a long-term reproducibility issue - we're trying to help.
output is like a lab notebook. do you take off the bar code and tracking number of your samples? please don't!
The organizational framework addresses two common modes of working with Stan:
development - run the model over and over, outputs are summarized and visualized, but not saved
production - the user has a production set-up which is user specific.
A production pipeline should be scripted, so specifying additional arguments isn't a burden.
We've gone with default options that favor learning/teaching/development because this is where most people spend their time working. therefore writing to /tmp is the default.
But I guess we're not technically forcing it here (it's possible to get around it with a bit of work, as discussed).
So I'm OK with leaving it as is, but I'm also OK with adding the (non-default) option that @mtwest is requesting.
In fact, I'm sure that @mtwest isn't the only person who will encounter this problem. We would like these CmdStan wrappers to play nicely with distributed file systems such as the one @mtwest is using, so if this will be a regular issue for people it seems reasonable for us to address that. Or is this something more unique to @mtwest's particular setup than I realize?
@jonah, as I marked in my previous reply, I am following @mitzimorris's advice and just parsing some of the metadata. This isn't worth taking up more thread inches. I'd say just drop it.
Proposal: add a prefix argument, which preserves the existing information (model, timestamp, chain). I consider this information useful for managing the analysis process, which can be very messy. Also, easy to implement. Discuss?
I like that idea. @mtwest Would that solve your problem? I guess we would default to not having a node-id but allow it to be specified when calling the save_csvfiles() method?
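Until something like that exists, the prefix idea can be approximated on the user side. This is only a sketch: `copy_with_prefix` is a hypothetical helper, not a CmdStanPy function, and the node ID is assumed to be supplied by the job script. It keeps the original file name (model, timestamp, chain) intact and only prepends the prefix.

```python
import shutil
from pathlib import Path


def copy_with_prefix(csv_files, dest_dir, prefix):
    """Copy each CSV into dest_dir as '<prefix>-<original name>'.

    The original name is kept whole, so model, timestamp and chain survive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, csv_files):
        target = dest / f"{prefix}-{src.name}"
        # Refuse to overwrite: a clash means the prefix was not unique.
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.copy2(src, target)
        copied.append(target)
    return copied


# Possible use (fit is a CmdStanMCMC object; "node17" is a made-up node ID):
# copy_with_prefix(fit.runset.csv_files, "results", prefix="node17")
```

Raising on an existing target, rather than silently overwriting, makes a non-unique prefix show up as an error instead of as lost results.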
(@mitzimorris On another note, in CmdStanR we use the name save_output_files() to match argument output_dir and to distinguish from save_latent_dynamics_files(), since those are all csv files too. Should I open an issue in CmdStanPy to unify these names or do you want to stick with save_csvfiles()?)