Using ArviZ with saved CmdStanPy CSVs

Hello. Can I use ArviZ with previously saved CSVs, and if so, how? I have run the model, sampled, and generated new quantities, saving the CSVs to my hard drive. I’m not familiar enough with cmdstanpy to know how to use those CSVs for posterior analysis after closing out the workflow and coming back to it later.

Thank you.

If you’re using one of the more recent versions of cmdstanpy (0.9.77, or the 1.0.0 release candidate), you can use from_csv (API Reference — CmdStanPy 1.0.0rc1 documentation).

This will turn the saved CSV files back into the correct object (e.g. a CmdStanMCMC object), which you can then pass to ArviZ.
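
For example, a minimal sketch, assuming the CSVs were all written by one run into an output/ directory (that path is a placeholder; from_csv also accepts a list of files or a glob pattern):

import arviz as az
from cmdstanpy import from_csv

# Rebuild the fit object (e.g. CmdStanMCMC) from the saved Stan CSV files.
fit = from_csv(path='output/')

# Hand the fit to ArviZ for posterior analysis.
idata = az.from_cmdstanpy(posterior=fit)
print(az.summary(idata))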

Thank you. Unfortunately, my computer keeps throwing an error that states

MemoryError: Unable to allocate 322. GiB for an array with shape (1000, 4, 10799537) and data type float64

I’m not sure how I can manage memory better. This may not be the right technique for a large data set. I just like the ease of explaining a Bayesian analysis over a frequentist one.

We don’t have too many options in Python for when your dataset is too large to fit into memory (at least at the moment). You could try using only 100 samples (instead of 1000) if you wanted to do some smaller analysis, but even that would require a computer with more than 32 GB of RAM.

That is quite large data.

I think we should create a workflow for streaming the CSVs to xarray and then to a NetCDF file, which could then be used for post-processing. (With the help of Dask, it should not be impossible.)
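
As a rough, hand-rolled sketch of that idea (this is not an existing cmdstanpy/ArviZ feature; the file names and chunk size below are placeholders):

import pandas as pd
import xarray as xr

csv_file = 'output_1.csv'   # placeholder: one chain's Stan CSV
chunk_files = []

# Stan CSVs mark config/adaptation lines with '#'; stream the draws in chunks.
for i, chunk in enumerate(pd.read_csv(csv_file, comment='#', chunksize=100)):
    fname = f'chain1_chunk{i:04d}.nc'
    xr.Dataset.from_dataframe(chunk).to_netcdf(fname)
    chunk_files.append(fname)

# Open all chunks lazily (backed by Dask) and concatenate along the row index.
draws = xr.open_mfdataset(chunk_files, combine='nested', concat_dim='index')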

Thanks all! I’ve been trying for two years to find a Bayesian solution for the insurance industry, which has large data sets. It may not be possible at this time. I have a computer with 64 GB of RAM. It also has additional memory on the GPU, but I’m not sure if cmdstanpy can utilize that.

Maybe I’ll be able to use it when I get a project with smaller datasets.

Dask would certainly allow that. We should look into this.

To be clear @Jordan_Howell - the issue isn’t actually sampling that data, it’s keeping all the samples in memory at the end. (You have roughly 300 MB per draw across the 4 chains, which isn’t too crazy on its own, but with 1000 draws it adds up to a lot of memory.) Analyzing the data afterwards is possible on your machine, but the way cmdstanpy does it natively is greedy and tries to put everything in memory at once. There are other ways of doing it, but at the moment you’d need to do it semi-manually.
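
For instance, one semi-manual route is to pull only the columns you actually need out of each chain’s CSV (the parameter names and file pattern here are assumptions):

import glob
import pandas as pd

cols = ['mu', 'beta']   # assumed parameters of interest
chains = []
for f in sorted(glob.glob('output_*.csv')):   # placeholder file pattern
    # comment='#' skips the Stan CSV config/adaptation lines.
    chains.append(pd.read_csv(f, comment='#', usecols=cols))

draws = pd.concat(chains, keys=range(len(chains)), names=['chain', 'draw'])
print(draws.describe())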

Your model outputs 10799537 variables - what are the actual parameters of interest?

This is my model. I’m only looking for one variable and the posterior samples.

reinstatement_model =

data {
  int<lower=0> N;                // number policy term years
  int<lower=0> NonCatcvrcnt[N];  // claims
  vector[N] alertflag;           // alert flag
}
parameters {
  real<lower=0> mu;
  real beta;
}
model {
  mu ~ normal(0, 3);
  beta ~ normal(0, 1);
  NonCatcvrcnt ~ poisson_log(mu + alertflag * beta);
}
generated quantities {
  vector[N] eta = mu + alertflag * beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
    for (n in 1:N)
      y_rep[n] = poisson_log_rng(eta[n]);
  }
}

You could use all the data to estimate mu and beta - CmdStan has no problem with that amount of data, and CmdStan 2.28 models compiled with STAN_THREADS=true and run with the num_chains argument will do this multi-threaded for you. (Working on getting this into CmdStanPy ASAP - PR in progress.)

Then you could have a separate model with just the data, parameters, and generated quantities blocks:

reinstatement_model_post_pred =

data {
  int<lower=0> N;                // number policy term years
  int<lower=0> NonCatcvrcnt[N];  // claims
  vector[N] alertflag;           // alert flag
}
parameters {
  real<lower=0> mu;
  real beta;
}
generated quantities {
  vector[N] eta = mu + alertflag * beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
    for (n in 1:N)
      y_rep[n] = poisson_log_rng(eta[n]);
  }
}

Then you can divide your million-row dataset into chunks and run the generate_quantities method.
cf. 7 Generating Quantities of Interest from a Fitted Model | CmdStan User’s Guide
and its CmdStanPy counterpart: Generating new quantities of interest. — CmdStanPy 1.0.0rc1 documentation
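
Sketched in cmdstanpy, the two-step workflow could look roughly like this (file names, chunk size, and data handling are assumptions; the keyword for the previous fit is mcmc_sample in these cmdstanpy versions, so check the generate_quantities docs for yours):

import json
from cmdstanpy import CmdStanModel

# Full data set (placeholder: load it however you normally do).
with open('reinstatement_data.json') as f:
    data = json.load(f)

# Step 1: fit mu and beta on all the data with the model that has
# no generated quantities block.
fit = CmdStanModel(stan_file='reinstatement_model.stan').sample(data=data, chains=4)

# Step 2: run the generated-quantities-only model chunk by chunk.
gq_model = CmdStanModel(stan_file='reinstatement_model_post_pred.stan')
chunk_size = 100_000
for start in range(0, data['N'], chunk_size):
    end = min(start + chunk_size, data['N'])
    chunk = {
        'N': end - start,
        'NonCatcvrcnt': data['NonCatcvrcnt'][start:end],
        'alertflag': data['alertflag'][start:end],
    }
    gq = gq_model.generate_quantities(data=chunk, mcmc_sample=fit)
    # gq.generated_quantities holds eta/y_rep for just this chunk;
    # summarize or save it before moving on to the next chunk.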

Please let me make sure I understand this. I take out the generated quantities block and run the model with the sample method? Then I add the generated quantities back and rerun with the generate_quantities method.

Jordan

Yes - check out this notebook: Generating new quantities of interest. — CmdStanPy 1.0.0rc1 documentation