I am working on a distributed computing pool and wish to run many independent instances of an analysis. If these jobs are submitted at the same time, I will get file-name clashes and some results will be overwritten. Is there some way to specify the name of the output CSV file in CmdStanPy when it is copied back over from the /tmp directory?
What @jonah said: you should be able to create unique output directory names via the ‘output_dir’ arg to the sample method. Alternatively, once you have the CmdStanMCMC object, you can use the ‘save_csvfiles’ method.
Given the nature of the distributed file system I am working with, specifying directories with output_dir or save_csvfiles doesn’t solve my problem. The workflow management system (HTCondor) copies my Stan executable and input data to the local scratch drive for faster IO, so naming a directory via CmdStan means the program looks for that directory name in local scratch, not in my home directory on the submit machine.
Does either the CmdStanModel or the CmdStanMCMC object contain the name of the streamed output CSV file, or at least the components needed to reconstruct it, so that I don’t have to grep a directory?
runset.set_csvfiles() is the function that moves the files from /tmp to the given directory. Would it be acceptable, @mitzimorris, for that function to take an additional optional argument, à la new_filename, so that one could change the StanExeName-DateTime-Chain pattern?
I don’t think this is generally a good idea, so no, I don’t want to add this feature.
as I said earlier, this isn’t hard for a user to implement for themselves given their use case.
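For anyone landing here later, a minimal sketch of what such a user-side workaround could look like. The helper name copy_with_tag and the tag argument are made up for illustration and are not part of CmdStanPy; the list of CSV paths is assumed to come from the fit object (something like fit.runset.csv_files, which may differ between versions). The original file name is kept after the prefix, so the timestamp and chain number survive.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def copy_with_tag(csv_files, dest_dir, tag=None):
    """Copy each CSV into dest_dir, prefixing its name with a unique tag.

    Hypothetical helper, not a CmdStanPy function. The original file name
    (model name, timestamp, chain number) is kept after the prefix.
    """
    tag = tag or uuid.uuid4().hex[:8]  # random 8-character tag if none given
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, csv_files):
        target = dest / f"{tag}-{src.name}"
        shutil.copy2(src, target)
        copied.append(str(target))
    return copied


# Demo with a stand-in file; with CmdStanPy the list would come from the
# fit object (assumed here to be fit.runset.csv_files).
with tempfile.TemporaryDirectory() as scratch:
    fake = Path(scratch) / "bernoulli-202001011200-1.csv"
    fake.write_text("lp__,theta\n")
    out = copy_with_tag([fake], Path(scratch) / "results", tag="job42")
    print([Path(p).name for p in out])
```

Because the files are copied rather than renamed in place, the originals in scratch stay untouched until the job cleans up.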
from the point of view of both usability and maintainability, the fewer features, the better.
furthermore, putting timestamp and chain number in a file name is for the user’s own good. managing output files is a long-term reproducibility issue - we’re trying to help.
output is like a lab notebook. do you take off the bar code and tracking number of your samples? please don’t!
But I guess we’re not technically forcing it here (it’s possible to get around it with a bit of work, as discussed).
So I’m OK with leaving it as is, but I’m also OK with adding the (non-default) option that @mtwest is requesting.
In fact, I’m sure that @mtwest isn’t the only person who will encounter this problem. We would like these CmdStan wrappers to play nicely with distributed file systems such as the one @mtwest is using, so if this will be a regular issue for people it seems reasonable for us to address it. Or is this more specific to @mtwest’s particular setup than I realize?
I like that idea. @mtwest Would that solve your problem? I guess we would default to not having a node-id but allow it to be specified when calling the save_csvfiles() method?
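To make the node-id idea concrete, here is a rough sketch of how an identifier could be built on the user side and appended to the existing file name. Everything in it is an assumption for illustration: make_node_id and tagged_name are hypothetical helpers, and JOB_CLUSTER and JOB_PROCESS are environment variable names the user would have to export from their own submit file, not something HTCondor or CmdStanPy is known to set under those names.

```python
import os
import socket


def make_node_id(env=None):
    """Build a per-job identifier.

    JOB_CLUSTER and JOB_PROCESS are hypothetical variable names that the
    user would export from the job submit file. If they are missing, fall
    back to host name plus process id.
    """
    env = os.environ if env is None else env
    cluster = env.get("JOB_CLUSTER")
    process = env.get("JOB_PROCESS")
    if cluster and process:
        return f"{cluster}.{process}"
    return f"{socket.gethostname()}.{os.getpid()}"


def tagged_name(csv_name, node_id):
    """Append node_id to the file stem, keeping timestamp and chain number."""
    stem, ext = os.path.splitext(csv_name)
    return f"{stem}-{node_id}{ext}"


node_id = make_node_id({"JOB_CLUSTER": "1234", "JOB_PROCESS": "7"})
print(tagged_name("bernoulli-202001011200-1.csv", node_id))
```

Appending rather than replacing keeps the default StanExeName-DateTime-Chain part intact, which seems to fit the reproducibility concern raised above.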
(@mitzimorris On another note, in CmdStanR we use the name save_output_files() to match argument output_dir and to distinguish from save_latent_dynamics_files(), since those are all csv files too. Should I open an issue in CmdStanPy to unify these names or do you want to stick with save_csvfiles()?)