CmdStanR returns "grep: write error" and "All variables must have the same length"

I apologize in advance for not being able to provide a fully reproducible example, as the model usually takes >24 hours to run and has many thousands of parameters. I hope the information I can provide is sufficient in lieu of that. I’ve read CmdStanR reports error "All variables in all chains must have the same length" after apparently successful sampling - #2 by Robert_Dodier, but it doesn’t solve my issue since I’ve already been redirecting my CSVs to a directory that has 10TB of storage.

I run all of my models on my university’s HPC, which has the following operating system (not sure how little or how much you need, so I included all the info):

  • Operating System: Red Hat Enterprise Linux
  • CPE OS Name: cpe:/o:redhat:enterprise_linux:7.9:GA:server
  • Kernel: Linux 3.10.0-1160.83.1.el7.x86_64
  • Architecture: x86-64

I run CmdStan through the CmdStanR interface and R from a custom conda environment, with the below specifications. (Note: I didn’t install CmdStan into a new conda environment as the cmdstan-guide suggests; I have many other R packages in my current conda environment that I need to continue using, and I didn’t want to re-install them in a new environment.)

  • R version 4.2.2
  • cmdstanr version: 0.5.3
  • CmdStan version: 2.31.0
  • g++ (GCC) : 8.5.0 20210514 (Red Hat 8.5.0-10)
  • GNU Make: 4.3

I recently transitioned to CmdStanR from RStan. When running these same models via RStan, there was always a surge in memory usage once all chains finished and the stanfit object was being built from the three completed chains. This surge required 75-100 GB of available memory (which I understand; there's a lot happening in my model) and produced a stanfit object of 8-14 GB, depending on the model. So I've requested similar resources through Slurm when running these models via CmdStanR (25-30 cores with 3.15 GB each for two days). I'm sharing all this so you know the computing resources I use when encountering the problems below.

My overall problem is two-fold:

  1. Once my model completes all three chains successfully (per the console output; I've been saving the CSVs to a specified directory throughout sampling), I receive two errors:
    • “grep: write error: No space left on device” (x3 - it seems this is for each chain?)
    • “Error: All variables in all chains must have the same length.”
  2. Since I’m unable to perform the planned diagnostics on the model object due to the above, I created another R script to recreate the CmdStanMCMC object from the 3 CSVs (one for each chain of the model) and then transform it into an mcmc.list so I can use it with the MCMCvis package (a rough sketch is below). But I get the same errors as above, except this time the grep error happens six times instead of three.
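For context, the recovery script is roughly along the lines of the sketch below (the path is a placeholder, and splitting the draws array by chain is just one way to get an mcmc.list):

library(cmdstanr)
library(posterior)
library(coda)

# placeholder path - point this at wherever the cluster wrote the CSVs
csv_files <- list.files("/scratch/alpine/YOUR_USERNAME_HERE/csv_output",
                        pattern = "\\.csv$", full.names = TRUE)

# recreate the CmdStanMCMC object from the three per-chain CSVs
fit <- as_cmdstan_fit(csv_files)

# convert to an mcmc.list by splitting the draws array (iterations x chains x variables) by chain
draws <- fit$draws()
mcmc_list <- coda::mcmc.list(lapply(seq_len(nchains(draws)), function(ch) {
  coda::mcmc(as.matrix(as_draws_matrix(subset_draws(draws, chain = ch))))
}))

# diagnostics via MCMCvis, e.g. trace plots written to a PDF
MCMCvis::MCMCtrace(mcmc_list, pdf = TRUE)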

After continuously receiving the “grep: write error” with all of the models I was running (re problem 1), I reached out to my university’s computing help desk. They determined the issue was likely due to tmp file overflow on the compute node and suggested I do the first two of the following three things:

  1. add two lines in your sbatch script: “export TMPDIR=/scratch/alpine/$USER/” and “export TMP=/scratch/alpine/$USER/”
  2. increase the number of cores in your sbatch script: “#SBATCH --ntasks=”
  3. command the program/code to auto-clean /tmp directory regularly.

I already included 1 and 2 in the script I created to read in the CSVs and create diagnostic plots (#2 from the original description of my problem). I increased the requested cores to 50, yet I still got the “grep” error and “all variables must have the same length”. I have no idea how to implement 3, and I’m not sure I’d even want to (unless y’all suggest that).

Currently, my biggest concern is being able to create a CmdStanMCMC object from the CSVs of the already fitted chains (for each of the ten models that have already been run). But I still have many other models I need to run, and I’d like to avoid these “grep” and “all variables must be the same length” errors when running models in the future.

Below, I’ve attached the R scripts and console output for both parts of my problem, so you can see more details (although there isn’t much else to say). I’ve included the stan file for one of these models in case that’s helpful. I’ve also included a link with the three CSVs corresponding to this specific model; please note that the zip is 1.65GB, and each file is 2.5GB. Not sure if any or all of this information will help diagnose my problem.

Model run and output:
g1_sigma-ri_xi-ri.stan (8.9 KB)
g1_NUTS_sampling.R (3.0 KB)
g1_og_sigma-ri_xi-ri_894542.txt (10.4 KB)
zip_of_csvs

Diagnostic plot script and output:
g1_dx_plots.R (2.1 KB)
g1_dxplots_og_sigma-ri_xi-ri_925896.txt (1.2 KB)

Thank you in advance for your help and for taking the time to read this! I really appreciate it.
Liz (she/her)


Are you still receiving the grep: write error: No space left on device error? If so, then you’re likely still running out of space, and this still needs to be resolved.

The next step is to try to reduce the number of parameters that need to be stored. This can be an issue if you have multiple large parameters in the transformed parameters and generated quantities blocks that you don’t actually need for your final inference (just as intermediates in the construction of other variables). If you can move these to be local variables, then they won’t be saved in the CSV output, which can help reduce the space needed.


What is the file size of the output CSVs, and what happens if you try to read them one at a time? Can you successfully read one of them? If yes, does it contain the expected number of post-warmup iterations, or fewer?
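Something along these lines (the path is a placeholder) would report the size and post-warmup draw count of each CSV separately:

library(cmdstanr)
library(posterior)

# placeholder path - substitute the directory the CSVs were written to
csv_files <- list.files("/scratch/alpine/YOUR_USERNAME_HERE/csv_output",
                        pattern = "\\.csv$", full.names = TRUE)

for (f in csv_files) {
  cat(basename(f), "-", round(file.size(f) / 1e9, 2), "GB\n")
  fit_one <- as_cmdstan_fit(f)   # read just this one chain
  cat("  post-warmup iterations:", niterations(fit_one$draws()), "\n")
}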

It does look like you have huge numbers of transformed parameters saved, so @andrjohns’ excellent advice is probably applicable.

Just to set expectations here, I routinely run models with on the order of 150,000 parameters and 4 chains parallelized on 4 cores on a computer with 40 GB of RAM, yielding CSVs on the order of 6-7 GB each. So if you are not too far above such a job size, it’s possible that something unexpected is happening in terms of the resources you’re actually getting.


Thank you both for your responses!

I tried running as_cmdstan_fit(csv_files, format = "draws_list") again on the cluster, with ~160 GB of memory allotted. I again got the grep: write error: No space left on device (x3) about 2 minutes into running, along with Error: All variables in all chains must have the same length. And then the job crashed.

After looking at the individual CSVs and confirming that they all contained the same number of sampling iterations, I ran the same command on my laptop (MacBook Pro, M1 chip, 32 GB memory). And I successfully recreated the CmdStanMCMC object with no errors. Apologies for not doing this as a reference point prior to posting yesterday! I assumed it wouldn’t work on my laptop if it didn’t work on the cluster.

Do you have any suggestions on what to do next? I recognize that I have many transformed parameters, but I need all but one for inference and/or use in the model block. I also can’t use my laptop to run all of these models because it will render my laptop unusable for days at a time, given that one model can take over a day to run.

The ones that you just need for use in the model block can be declared inside the model block itself. Just paste the declaration and definition from the transformed parameters block down into the model block. The only difference will be that they are no longer written out to CSV.

So we know that the CSVs are OK and that they can be successfully read in with 32 GB of RAM. The question is why this isn’t working on the cluster. I’m no genius at troubleshooting this stuff; is there somebody who administers the cluster who might have an idea of what is going wrong? One last Hail Mary is that you could try reading the CSVs using brms:::read_csv_as_stanfit, which (if it works) would produce a stanfit object from rstan rather than a CmdStanMCMC object from cmdstanr, but that might be sufficient for your purposes. When you have a lot of variables (tens of thousands or more parameters, transformed parameters, and generated quantities), do NOT use rstan::read_stan_csv, as it will be prohibitively slow.
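A sketch of that last suggestion, assuming the internal helper takes a vector of CSV paths (it isn’t exported, so the interface may differ between brms versions):

library(brms)

# csv_files is the same vector of per-chain CSV paths as before
fit_stanfit <- brms:::read_csv_as_stanfit(csv_files)   # internal (unexported) brms helper
class(fit_stanfit)   # should be "stanfit", so rstan-based tooling ought to work on it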


Apologies for the delay in response! Installing brms in my environment on the cluster was more of a task than I would’ve anticipated; once it finished, I completely got distracted and didn’t have a chance to come back to this until today 🤦🏻‍♀️

I don’t think I realized this! Thanks for the tip - I’m definitely going to implement this in all of my models now.

I finally tried this today and again got the same error message:
grep: write error: No space left on device
Error: All variables in all chains must have the same length.

I’m not sure if this is a Stan problem, a problem with my conda environment, or a problem with the cluster. Do you have any further suggestions on how to diagnose this?

To clarify the problem, you have CSV files from the fitting that you can read into R using your laptop, but not when using the cluster, is that right?

Yes, that’s correct.

These are exactly the same CSV files being used in both the cluster and your laptop? How big is each file?

One thing to test is whether you can read a single CSV file on the cluster, rather than all of them at once.

Also, can you post the output of running traceback() after the error occurs? That will help narrow down where in the process the issue pops up.

Oh, and another thing to check: you’ll want to make sure that R is correctly picking up the change to the temporary directory recommended by the helpdesk. Try running Sys.getenv(c("TMP","TMPDIR")) and checking that the output matches the directories set by your script.
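Putting those together, something like the following in the cluster R session (reusing the same csv_files vector):

# 1. try reading a single chain's CSV rather than all three at once
fit_one <- as_cmdstan_fit(csv_files[1])

# 2. if/when the error occurs, capture where it happened
traceback()

# 3. confirm the helpdesk's sbatch exports actually reached this R session
Sys.getenv(c("TMP", "TMPDIR"))   # should point at /scratch/alpine/..., not /tmp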

Yes - I downloaded the CSV files from the cluster onto my laptop. They’re each ~2.5 GB.

So doing this returned some interesting results. I still get the grep: write error: No space left on device, but it does create the object. However, it looks like only about 2/3 of each CSV file is being read in on the cluster. For example, for the first CSV file, only 631 posterior samples are read in (consistently, every time I read it in on the cluster) but there should be 1000. I’ve included a screenshot of what this looks like on the cluster (left) vs my laptop (right), with the exact same file being read in to create the CmdStanMCMC object, followed by the exact same commands.

The second and third files read in 633 and 639 samples, respectively. So it seems the Error: All variables in all chains must have the same length is exactly what should happen, given what I’m seeing when reading in the CSVs individually.

I did this prior to reading the files in separately, so this may now be irrelevant. Running traceback() after getting the errors completely stalled R. So I instead set options(error = traceback), ran as_cmdstan_fit(csv_files) on all 3 files together, and no traceback was returned. The output was literally No traceback available.

Thank you for suggesting this! It doesn’t appear to be setting the directories to what I’ve been specifying (it’s returning "" and "/tmp"). I’ve been trying to troubleshoot this on my own, but I’m going to reach out to the help desk because I don’t really understand sbatch well enough to figure out what’s going wrong.

To follow up on this piece - I did the above in an “interactive” session on the cluster, where I’m limited to a max of 4 cores (which I requested), yielding about 15GB of RAM.

I wanted to see if access to more RAM would allow all of the samples to be fully read in. So I just repeated the test within a regular job (whatever the opposite of an interactive job is). I did this twice, requesting 30 cores, yielding 112.5 GB RAM.

Only 14.69 GB of RAM were used the first time, and the number of samples read in (essentially creating three separate CmdStanMCMC objects sequentially, as part of the same R script in the same job) was 794, 797, and 804, respectively, for chains 1 through 3. Only 12.97 GB of RAM were used the second time, and 797, 799, and 807 samples were read in.

I have no idea what’s going on here. I clearly don’t need as much memory as I’m requesting, but the full set of posterior samples is still never being read in completely.

Can you try running:

Sys.setenv("TMPDIR" = "/scratch/alpine/YOUR_USERNAME_HERE/")
Sys.setenv("TMP" = "/scratch/alpine/YOUR_USERNAME_HERE/")

Before reading in the CSV files (in the same R session)?

This did set my tmp directories correctly (thank you for that!). But I’m still getting the same grep: write error: No space left on device, along with partial read-ins of each CSV - 797, 800, and 807 samples read in, respectively, out of 1000 expected.

Unfortunately, this is the point where I’ll have to hand over to your university’s IT department/helpdesk, sorry. I’m not sure what else could be the culprit.

Just wrote a big long post then deleted it - this might help you more:

You said cmdstanr was still writing to /tmp - that’s definitely a template issue. Depending on how you’re running your jobs, you could try adding the export TMP= etc. lines before your final Rscript call in your template, or better (and I think the only option that solved a very similar problem for me):

Use the output_dir & optionally output_basename arguments for cmdstanr::sample
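e.g. something roughly like this in the sampling script (the model/data object names and the directory are placeholders):

fit <- mod$sample(
  data = stan_data,
  chains = 3,
  parallel_chains = 3,
  output_dir = "/scratch/alpine/YOUR_USERNAME_HERE/csv_output",   # keep the output CSVs out of /tmp entirely
  output_basename = "g1_sigma-ri_xi-ri"                           # optional; just makes the files easier to find
)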

Also check that you’re not building up lots of failed job submission files in your /home directory, as these get mounted on the workers, and I’m not sure whether NFS caches syncable file changes to ramdisk or to storage itself. In my case I had hundreds of GB from cancelled jobs in a .future folder, and because I ran many models simultaneously, the temp files were maxing out my home directory and stopping the job, even though my output_dir was pointing somewhere else. Most of the temp files then got deleted and my /home directory looked almost OK when I next logged in, which made it hard to troubleshoot.

Thank you for your suggestions and for helping me troubleshoot this issue! I was hoping to come back here and post that the help desk/IT had helped solve my issue, but they have ignored my emails this week. Regardless, I really appreciate the time you spent helping me!

I appreciate your thoughtful response and suggestions!

So I’m not sure whether cmdstanr was writing to /tmp or not, but the computing help desk instructed me to export the temp directories to my scratch area, which has 10 TB of storage (I’ve only used 118 GB). That unfortunately didn’t solve the issue. It also didn’t seem to actually change anything when added to my sbatch script, so I added it directly to my R script (per @andrjohns’s suggestion); even though the temp directories were then redirected correctly, it still didn’t solve my issue.

Fortunately, I had been doing this. So I at least have my model results, but when the model tries to consolidate (not sure if this is the correct word) the three chains after the last one has completed, the computing cluster can’t fully read in all three CSVs; the cluster maxes out at around 15GB of RAM despite me having ~170 GB of RAM available (I request 30 cores).

I was able to transfer the CSVs to my advisor’s machine (Linux OS, 64 cores, 128 GB of RAM) and successfully read in all three, but it required ~48 GB of RAM. So I think there’s some sort of mismatch between what I’m requesting and what I’m actually getting on the cluster. But I can’t diagnose this further without the computing help desk responding to my emails, so I’m in this frustrating limbo until I hear from them.

We’re actually not allowed to run anything from our /home directories, only our /project or /scratch directories (with the latter being preferred). So I’ve always run jobs from my /scratch directory, and I don’t have any buildup of files there at all.

In order to proceed with running my models, I’m going to run them all on the cluster and then scp the CSVs to my advisor’s machine for diagnostics and subsequent analyses. Not an ideal workflow, but there’s not much else I can do. Per @jsocolar’s suggestion, I will also be moving some parameters from the transformed parameters block to the model block, which will hopefully make the CSVs smaller so that the cluster can potentially read them in fully. And in a last-ditch effort to completely remove R from the equation, I created a new conda environment with CmdStan only, and I’m going to run my models from the command line in that environment. We’ll see what happens!