Read_stan_csv error

Hi,

I am running CmdStan 2.21 on Linux, then scp the output to Windows and try to use read_stan_csv. Pretty often (but not always) I get the error below. Any advice on how to get around this? Maybe scp needs some extra parameter?

Error in row.buffer[buffer.pointer, ] <- scan(con, nlines = 1, sep = ",",  :
  number of items to replace is not a multiple of replacement length
In addition: Warning message:
In scan(con, nlines = 1, sep = ",", quiet = TRUE) :
  embedded nul(s) found in input

Perhaps something to do with line endings?

What happens if you remove all null instances?
tr < file-with-nulls -d '\000' > file-without-nulls
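A rough R equivalent, in case you'd rather stay in R (this reads the whole file into memory, so it is only a sketch for files that fit in RAM, and strip_nulls is just a name I made up):

## Read the file as raw bytes, drop the embedded nulls, and write it back out.
## Only sensible for files that fit comfortably in memory.
strip_nulls <- function(infile, outfile) {
    bytes <- readBin(infile, what = "raw", n = file.size(infile))
    writeBin(bytes[bytes != as.raw(0)], outfile)
}
strip_nulls("file-with-nulls.csv", "file-without-nulls.csv")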

First, I apologize: Windows has nothing to do with it. I scp the original CSV file from the Linux server to a Linux PC. The file is rather large, and with vi on the Linux server I don't see the elapsed-time lines. After I did

tr <1.csv -d '\000' > 1n.csv

the file went from 92557669 to 74799496 bytes. I still don't see the elapsed time. On the server, when I read.csv("1.csv"), I get

embedded nul(s) found in input

However, read.csv("1n.csv") returns no errors, so it looks like there were nulls in the CmdStan-generated file. readLines gives

incomplete final line found

so it seems CmdStan didn't finish saving the file. Any ideas how to get around this?

PS. Another possibility is that I am close to my quota, and then funny things start happening. I will free up space and try again.

Sorry for the trouble. It was a quota problem. Perhaps it would be easy for CmdStan to output a message when a line is not written. Just a suggestion.

I want to follow up on this issue, as I just ran into the same problem after running CmdStan on a cluster. Any reason why CmdStan wouldn't finish writing the CSV file?

The only reason I see is that the execution was killed externally.

Tagging @mitzimorris.

The model seems to run until the end, and there's no indication in the slurm file that the execution was killed. I'm using CmdStan but calling it from R, using system. I realize it might be helpful to switch to CmdStanR, but I'm not sure that would solve the problem.
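Roughly, the launch looks like this (a sketch with placeholder sampler arguments and file names, not my exact call):

## Sketch of launching the chains with system(); num_warmup/num_samples, the data
## file, and the output names below are placeholders.
chains <- 1:4
exit_codes <- lapply(chains, function(i) {
    system(paste0("./", modelName,
                  " sample num_warmup=2000 num_samples=2000",
                  " data file=", modelName, ".data.R",
                  " output file=", modelName, i, ".csv"))
})
exit_codes  # the list of 0s printed below means every chain exited normally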

Here are the last lines output by the job running R:

 Elapsed Time: 2401.39 seconds (Warm-up)
               5852.06 seconds (Sampling)
               8253.45 seconds (Total)

Iteration: 3800 / 4000 [ 95%]  (Sampling)
Iteration: 4000 / 4000 [100%]  (Sampling)

 Elapsed Time: 2595.7 seconds (Warm-up)
               5767.56 seconds (Sampling)
               8363.26 seconds (Total)

Iteration: 3900 / 4000 [ 97%]  (Sampling)
Iteration: 4000 / 4000 [100%]  (Sampling)

 Elapsed Time: 2836.41 seconds (Warm-up)
               5659.64 seconds (Sampling)
               8496.05 seconds (Total)

[[1]]
[1] 0

[[2]]
[1] 0

[[3]]
[1] 0

[[4]]
[1] 0

>
> fit <- read_stan_csv(file.path(modelDir,
+                                modelName,
+                                paste0(modelName, chains, ".csv")))
Error in row.buffer[buffer.pointer, ] <- scan(con, nlines = 1, sep = ",",  :
  number of items to replace is not a multiple of replacement length
Calls: read_stan_csv
Execution halted

and a look at the CSV file suggests it is incomplete.

this seems to be the most reasonable explanation. the timing message written to the console indicates that the run completed. the timing messages are written to stdout, stderr, and the csv file, but once you've hit the disk quota, you can't write to the csv file.
the question is whether or not this problem can be detected and, if so, what should happen?
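on the user side, a crude check before calling read_stan_csv would be to look for the timing block at the end of each csv file, since a run that finished writing should have it there as comment lines. a sketch (the helper name is made up):

## crude completeness check: a finished CmdStan csv should end with the timing
## block written as comment lines ("# Elapsed Time: ...").
csv_looks_complete <- function(file, n_tail = 10) {
    tail_lines <- system(paste0("tail -n ", n_tail, " ", file), intern = TRUE)
    any(grepl("Elapsed Time", tail_lines))
}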

binary I/O would be a good way to speed things up and avoid hitting this problem.

The cluster people have confirmed that this was a storage issue, not a memory problem with the job. That explains why the job ran to completion without writing out the CSV file. My personal disk space on the cluster is limited to 10 GB, and I was using about 9.99 GB.
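For anyone else hitting this on a cluster: the free space on the filesystem holding the output directory can be checked from R before launching the chains (POSIX df assumed):

## POSIX-only: show how much space is left where the csv files will be written.
system(paste("df -h", modelDir))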

There were some junk files, labelled core.*, that took up a lot of space. I'm not sure what created them, but I'm guessing they may be an artefact of my trying to run 10 chains and perhaps not setting this up properly. Hopefully everything will run smoothly now.


Those core.1234 files are often created on Linux when a program crashes, for debugging purposes. It sounds like CmdStan crashed because of the storage issue, in which case you should just delete the core.* files to free up more room. It still might not be enough, though.
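If it helps, the dumps can be sized up and removed from R as well (assuming they sit in the run directory and are named core.<something>):

## Assumes the core dumps sit in the current directory and are named core.<pid>.
cores <- list.files(pattern = "^core\\.", full.names = TRUE)
sum(file.size(cores))  # bytes currently taken up by the dumps
file.remove(cores)     # returns TRUE for each file it deletes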

I had a similar issue, but the CSV file was on disk. I posted a similar R function as a hack for reading large files into R in another question, but googling the error bumps you to this page, so I'm posting it again here.

## file: csv file with Stan output.
## vars: variable names as they appear in the Stan output.
## newfile: optionally save the result to a new file.
my_read_stan_vars <- function(file, vars, newfile = NULL) {
    file <- gsub("\\.csv$", "", file)
    uncommented <- paste0(file, "_uncommented.csv")
    ## drop the comment lines (starting with #) that CmdStan writes into the csv
    system(paste0("sed -e '/^#/d' ", paste0(file, ".csv"), " > ", uncommented))
    ## read only the requested columns
    post <- data.table::fread(uncommented, select = vars)
    file.remove(uncommented)
    if (is.null(newfile)) {
        return(as.data.frame(post))
    } else {
        vroom::vroom_write(post, paste0(newfile, ".csv"), delim = ",")
    }
}
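For example, to pull just a few columns out of a large output file (the file and parameter names here are made up):

## Hypothetical file and parameter names, just to show the call.
draws <- my_read_stan_vars("output1.csv", vars = c("lp__", "beta.1", "beta.2"))
str(draws)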

It did the trick for me.
Also, this hack can be helpful for getting the names of the variables in the Stan output, especially for individual parameters:

colnames_stan_output <- function(file) {
    ufile <- gsub("\\.csv$", "", file)
    ufile <- paste0(ufile, "_uncommented.csv")
    colfile <- paste0(ufile, "_cols.csv")
    ## strip CmdStan's comment lines, then pull out the header row
    system(paste0("sed -e '/^#/d' ", file, " > ", ufile))
    system(paste0("awk 'NR==1{print $1}' ", ufile, " > ", colfile))
    cols <- system(paste0("cat ", colfile), intern = TRUE)
    cols <- unlist(strsplit(cols, ","))
    file.remove(colfile)
    file.remove(ufile)
    return(cols)
}
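Used together, something like this (again with a made-up file name):

## List the column names first, then read only the parameters you actually need.
cols <- colnames_stan_output("output1.csv")
draws <- my_read_stan_vars("output1.csv", vars = grep("^beta", cols, value = TRUE))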