Cmdstanpy: Supplying multiple paths as inits writes to /tmp?

Garren_Hermanus · April 30, 2024, 3:25pm

Hi all

How can I supply multiple inits (paths to JSON files) to the sampling method? I have tried supplying a list of paths as inits, but when I inspect the CSV files I see that a new JSON file is being created. I am running code on a cluster, and according to the cluster guidelines the /tmp directory should be used sparingly, so I store my inits manually in my home directory and want to use those. The problem is therefore not that I do not know how to provide inits; I want to provide them in such a way that cmdstanpy reads the existing JSON files and does not write to /tmp.

Input on how cmdstanpy processes the list of paths given as inits would also suffice, so that I can create this JSON file manually.

I am using cmdstanpy v1.2.0 and cmdstan v2.34.0

WardBrian · April 30, 2024, 3:29pm

Specifying multiple inits in some contexts requires very specific file name structures, which is why we create copies of the files you provide.

I think you could set cmdstanpy._TMPDIR = "some_other_path/you/want", but this isn't really a use case we have considered.

Garren_Hermanus · April 30, 2024, 3:43pm

Thanks @WardBrian, this would solve my problem; I will give it a try. I am, however, interested in the details behind the file processing. What I see in the CSV outputs is that all chains list the same JSON file as input?

WardBrian · April 30, 2024, 3:56pm

It is unfortunately complex, but if you are using STAN_THREADS with your model, the files will be prepared with names like

foo_1.json, foo_2.json, etc

The actual argument supplied is “foo.json” (leaving out the _ID), which will appear in the output file comment
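To illustrate the naming pattern described above, here is a minimal sketch of staging per-chain init files by hand. The helper name `stage_inits` and the default base name are hypothetical, and this is not cmdstanpy's own code; it only shows files being copied to `foo_1.json`, `foo_2.json`, and so on, with the returned path leaving out the `_ID` suffix.

```python
import shutil
from pathlib import Path


def stage_inits(init_paths, dest_dir, base="inits"):
    """Copy one init file per chain to dest_dir as base_1.json, base_2.json, ...

    Returns the path without the chain suffix (base.json). That file is not
    created; it only mirrors the shortened name described in the post above.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Chain ids are assumed to start at 1, matching foo_1.json, foo_2.json.
    for chain_id, src in enumerate(init_paths, start=1):
        shutil.copy(src, dest / f"{base}_{chain_id}.json")
    return dest / f"{base}.json"
```

Pointing `dest_dir` at a directory under your home directory keeps the copies out of /tmp, although whether cmdstanpy would then skip making its own copies is not something this thread confirms.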

Garren_Hermanus · May 1, 2024, 2:02am

Thanks Brian. I did compile my model with STAN_THREADS, but only a single JSON file was referenced across all chains. I would have assumed that the foo_ID naming would apply in this case? Upon investigating, I found that if the model is compiled with STAN_THREADS=False the foo_ID naming applies, but with STAN_THREADS=True all chains list the same init file.

Anyway, I tried your suggestion of simply overwriting cmdstanpy._TMPDIR, but it had no effect: files were still written to the /tmp directory. I delved into the source a bit and found that _TMPDIR is imported (when cmdstanpy itself is imported) into several different modules, including cmdstan_args, utils.filesystem, stanfit.runset, and stanfit.mcmc. Since _TMPDIR is used in stanfit.runset.RunSet and all the submodules are imported at once, each of them holds its own local copy of the _TMPDIR variable as it was initially generated. This makes it nearly impossible to change _TMPDIR after it has been initialized.
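The behaviour described above is ordinary Python import semantics rather than anything specific to cmdstanpy. A small self-contained sketch follows; the module names `demo_pkg` and `demo_sub` and both paths are made up for illustration.

```python
import sys
import types

# A stand-in for a package that defines _TMPDIR at import time.
pkg = types.ModuleType("demo_pkg")
pkg._TMPDIR = "/tmp/original"
sys.modules["demo_pkg"] = pkg

# A stand-in for a submodule that runs `from demo_pkg import _TMPDIR`.
sub = types.ModuleType("demo_sub")
exec("from demo_pkg import _TMPDIR", sub.__dict__)

# Rebinding the attribute on the package afterwards...
pkg._TMPDIR = "/home/user/stan_tmp"

# ...does not touch the name that the submodule already bound.
print(pkg._TMPDIR)
print(sub._TMPDIR)
```

The second value printed is still the original path, which is why assigning to the package attribute after import has no visible effect on modules that copied the name.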

I do have a solution for this, but since cmdstanpy was installed with root privileges I cannot apply it on the cluster. It involves changing the source of __init__.py slightly. We define a new environment variable, STAN_TMPDIR, which holds an existing path where you would like Stan to write by default. That is, we create a new random sub-directory in STAN_TMPDIR by changing the code as follows (keeping the current behaviour if STAN_TMPDIR is unset, incorrect, or non-existent):

...
import tempfile
import os # used to check if STAN_TMPDIR is set and is an existing path

# Use STAN_TMPDIR as the parent directory only if it is set to an
# absolute path that points at an existing directory.
_stan_tmpdir = os.environ.get('STAN_TMPDIR')
if _stan_tmpdir and os.path.isabs(_stan_tmpdir) and os.path.isdir(_stan_tmpdir):
    _tmp_parent = _stan_tmpdir  # create the random sub-directory here
else:
    _tmp_parent = None  # fall back to the system default (usually /tmp)

_TMPDIR = tempfile.mkdtemp(dir=_tmp_parent)
...

This should be a small, contained change to the source that lets users change the tmp directory while keeping everything else the same. We could probably include a warning when STAN_TMPDIR is supplied but is non-existent, incorrect, or not an absolute path?
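The warning mentioned above could look roughly like this. It is only a sketch of the proposal: STAN_TMPDIR is the variable suggested in this post, not an existing cmdstanpy setting, and the function name `resolve_tmp_parent` is hypothetical.

```python
import os
import tempfile
import warnings


def resolve_tmp_parent(environ=os.environ):
    """Return the directory to pass as tempfile.mkdtemp(dir=...), or None."""
    candidate = environ.get("STAN_TMPDIR")
    if candidate is None:
        return None  # variable not set: keep the default behaviour silently
    if os.path.isabs(candidate) and os.path.isdir(candidate):
        return candidate
    # Set but unusable: tell the user instead of silently using /tmp.
    warnings.warn(
        f"STAN_TMPDIR={candidate!r} is not an existing absolute directory; "
        "falling back to the system default temporary directory.",
        RuntimeWarning,
    )
    return None


# How the result would be used when creating the temporary directory:
# _TMPDIR = tempfile.mkdtemp(dir=resolve_tmp_parent())
```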

WardBrian · May 1, 2024, 1:31pm

Are you observing that they actually initialize at the same point, or just that the header comment says the same file for each? The header comment will be identical between different files when STAN_THREADS=true, even though the chains can still have different initializations, ids, etc.

This seems like a reasonable proposal, would you mind opening an issue or PR in the cmdstanpy repository?

Garren_Hermanus · May 1, 2024, 2:01pm

I understand now what you meant. Only the header comment in the CSV files is the same, not the actual JSON init file being sent to cmdstan?

Will do, thanks Brian

WardBrian · May 1, 2024, 3:22pm

Yes, the header comment is a faithful reconstruction of the command line given to the Stan executable, but in the multi-chain multi-threaded case, the command line uses shortcuts like foo.json being shorthand for foo_1.json, foo_2.json, etc. This is definitely confusing, and has a few open issues about it: id for each chain should be unique in multi chain · Issue #1257 · stan-dev/cmdstan · GitHub

WardBrian · May 1, 2024, 9:01pm

This was discussed in a GitHub issue, where we determined that setting os.environ['TMPDIR'] is the best way to control the Python behavior that cmdstanpy relies on.
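For reference, Python's tempfile module consults the TMPDIR, TEMP, and TMP environment variables when it picks a default directory, and it caches the result in `tempfile.tempdir`. A minimal sketch of redirecting it from inside Python follows; the helper name `use_tmp_parent` is hypothetical, and the assumption is that this runs before cmdstanpy is imported, since the thread indicates its temporary directory is created at import time.

```python
import os
import tempfile


def use_tmp_parent(path):
    """Point Python's default temporary directory at `path`."""
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path
    # Drop any cached value so the next lookup re-reads the environment.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Exporting TMPDIR in the job script before starting Python achieves the same thing without touching `tempfile.tempdir`.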
