Using ArviZ with saved CmdStanPy CSVs

Hello. Can I use ArviZ with previously saved CSVs, and if so, how? I have run the model, sampled, and generated new quantities, saving the CSVs to my hard drive. I’m not familiar enough with cmdstanpy to know how to use those CSVs for posterior analysis after closing out the workflow and coming back to it later.

Thank you.

If you’re using one of the more recent versions of cmdstanpy (0.9.77, or the 1.0.0 release candidate), you can use from_csv (API Reference — CmdStanPy 1.0.0rc1 documentation).

This will turn the saved CSV files back into the correct object (e.g. a CmdStanMCMC object), which you can then pass to ArviZ.
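
For example, a minimal sketch, assuming the CSVs were all written by one run into an output/ directory (that path is a placeholder; from_csv also accepts a list of files or a glob pattern):

import arviz as az
from cmdstanpy import from_csv

# Rebuild the fit object (e.g. CmdStanMCMC) from the saved Stan CSV files.
fit = from_csv(path='output/')

# Hand the fit to ArviZ for posterior analysis.
idata = az.from_cmdstanpy(posterior=fit)
print(az.summary(idata))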

Thank you. Unfortunately, my computer keeps throwing an error that states

MemoryError: Unable to allocate 322. GiB for an array with shape (1000, 4, 10799537) and data type float64

I’m not sure how I can manage memory better. This may not be the right technique for a large data set. I just like the ease of explaining a Bayesian analysis over a frequentist one.

We don’t have too many options in Python for when your dataset is too large to fit into memory (at least at the moment). You could try using only 100 samples (instead of 1000) if you wanted to do some smaller analysis, but even that would require a computer with more than 32 GB of RAM.

That is quite large data.

I think we should create a workflow for streaming the CSVs to xarray and then to a NetCDF file, which could then be used for post-processing. (With the help of Dask, it should not be impossible.)
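
As a rough, hand-rolled sketch of that idea (this is not an existing cmdstanpy/ArviZ feature; the file names and chunk size below are placeholders):

import pandas as pd
import xarray as xr

csv_file = 'output_1.csv'   # placeholder: one chain's Stan CSV
chunk_files = []

# Stan CSVs mark config/adaptation lines with '#'; stream the draws in chunks.
for i, chunk in enumerate(pd.read_csv(csv_file, comment='#', chunksize=100)):
    fname = f'chain1_chunk{i:04d}.nc'
    xr.Dataset.from_dataframe(chunk).to_netcdf(fname)
    chunk_files.append(fname)

# Open all chunks lazily (backed by Dask) and concatenate along the row index.
draws = xr.open_mfdataset(chunk_files, combine='nested', concat_dim='index')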

Thanks all! I’ve been trying for two years to find a Bayesian solution for the insurance industry, which has large data sets. It may not be possible at this time. I have a computer with 64 GB of RAM. It also has additional memory on the GPU, but I’m not sure if cmdstanpy can utilize that.

Maybe I’ll be able to use it when I get a project with smaller datasets.

Dask would certainly allow that. We should look into this.

To be clear @Jordan_Howell - the issue isn’t actually sampling that data, it’s keeping all the samples in memory at the end. (You have roughly 300 MB per draw across the 4 chains, which isn’t too crazy on its own, but with 1000 draws it adds up to a lot of memory.) Analyzing the data afterwards is possible on your machine, but the way cmdstanpy does it natively is greedy and tries to put everything in memory at once. There are other ways of doing it, but at the moment you’d need to do it semi-manually.
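
For instance, one semi-manual route is to pull only the columns you actually need out of each chain’s CSV (the parameter names and file pattern here are assumptions):

import glob
import pandas as pd

cols = ['mu', 'beta']   # assumed parameters of interest
chains = []
for f in sorted(glob.glob('output_*.csv')):   # placeholder file pattern
    # comment='#' skips the Stan CSV config/adaptation lines.
    chains.append(pd.read_csv(f, comment='#', usecols=cols))

draws = pd.concat(chains, keys=range(len(chains)), names=['chain', 'draw'])
print(draws.describe())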

Your model outputs 10799537 variables - what are the actual parameters of interest?

This is my model. I’m only looking for one variable and the posterior samples.

reinstatement_model =

data {
  int<lower=0> N;                // number policy term years
  int<lower=0> NonCatcvrcnt[N];  // claims
  vector[N] alertflag;           // alert flag
}
parameters {
  real<lower=0> mu;
  real beta;
}
model {
  mu ~ normal(0, 3);
  beta ~ normal(0, 1);
  NonCatcvrcnt ~ poisson_log(mu + alertflag * beta);
}
generated quantities {
  vector[N] eta = mu + alertflag * beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
    for (n in 1:N)
      y_rep[n] = poisson_log_rng(eta[n]);
  }
}

You could use all the data to estimate mu and beta - CmdStan has no problem with that amount of data, and CmdStan 2.28 models compiled with STAN_THREADS=true and run with the num_chains argument will do this multi-threaded for you. (Working on getting this into CmdStanPy ASAP - PR in progress.)

Then you could have a separate model with just the data, parameters, and generated quantities blocks:

reinstatement_model_post_pred =

data {
  int<lower=0> N;                // number policy term years
  int<lower=0> NonCatcvrcnt[N];  // claims
  vector[N] alertflag;           // alert flag
}
parameters {
  real<lower=0> mu;
  real beta;
}
generated quantities {
  vector[N] eta = mu + alertflag * beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
    for (n in 1:N)
      y_rep[n] = poisson_log_rng(eta[n]);
  }
}

Then you can divide your million-row dataset into chunks and run the generate_quantities method.
cf. 7 Generating Quantities of Interest from a Fitted Model | CmdStan User’s Guide
and its CmdStanPy counterpart: Generating new quantities of interest. — CmdStanPy 1.0.0rc1 documentation
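
Sketched in cmdstanpy, the two-step workflow could look roughly like this (file names, chunk size, and data handling are assumptions; the keyword for the previous fit is mcmc_sample in these cmdstanpy versions, so check the generate_quantities docs for yours):

import json
from cmdstanpy import CmdStanModel

# Full data set (placeholder: load it however you normally do).
with open('reinstatement_data.json') as f:
    data = json.load(f)

# Step 1: fit mu and beta on all the data with the model that has
# no generated quantities block.
fit = CmdStanModel(stan_file='reinstatement_model.stan').sample(data=data, chains=4)

# Step 2: run the generated-quantities-only model chunk by chunk.
gq_model = CmdStanModel(stan_file='reinstatement_model_post_pred.stan')
chunk_size = 100_000
for start in range(0, data['N'], chunk_size):
    end = min(start + chunk_size, data['N'])
    chunk = {
        'N': end - start,
        'NonCatcvrcnt': data['NonCatcvrcnt'][start:end],
        'alertflag': data['alertflag'][start:end],
    }
    gq = gq_model.generate_quantities(data=chunk, mcmc_sample=fit)
    # gq.generated_quantities holds eta/y_rep for just this chunk;
    # summarize or save it before moving on to the next chunk.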

Please let me make sure I understand this. I take out the generated quantities block and run the model with the sample method? Then I add the generated quantities back and rerun with the generate_quantities method.

Jordan

Yes - check out this notebook: Generating new quantities of interest. — CmdStanPy 1.0.0rc1 documentation