Generated Quantities throwing parsing error in Pystan

Are you sure it is the same error?


This should be poisson_log_rng(...) — in the generated quantities block you need the _rng variant to draw values.


If N_new is large, you may run out of memory for reasons similar to those discussed in the topic First Pystan Poisson Model. Generated quantities take up memory in PyStan the same way that parameters do.

Ahhh. Ok, I put this in an actual .txt file and loaded it that way instead of as a string object. I changed y_new[n] = poisson_log(…) to poisson_log_rng(...) and it now parses and runs.

Do you have a suggestion for the memory issues? I’m working to get my workplace to adopt these techniques over our standard frequentist practice because of the inference I can glean, but our data sets are big.

Have you tried running this with CmdStanPy? A big model with many parameters will still take up memory, but there may be less memory overhead with respect to data and outputs.
https://cmdstanpy.readthedocs.io/en/latest/generate_quantities.html
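
For reference, the standalone generated-quantities workflow that link describes looks roughly like this (a minimal sketch, not your code: the file names and data dict are placeholders, and the keyword for passing the previous fit has changed across CmdStanPy versions):

from cmdstanpy import CmdStanModel

# Hypothetical files: the model you fit, plus a copy that adds a
# generated quantities block producing y_new.
model = CmdStanModel(stan_file="poisson_model.stan")
gq_model = CmdStanModel(stan_file="poisson_model_gq.stan")

stan_data = {"N": 100, "y": [0] * 100}  # placeholder data dict

fit = model.sample(data=stan_data)

# Run only the generated quantities block against the existing draws,
# so y_new never has to sit alongside the sampler's memory footprint.
# (Depending on your CmdStanPy version the argument is mcmc_sample or previous_fit.)
new_quantities = gq_model.generate_quantities(data=stan_data, mcmc_sample=fit)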

Instead of using the generated quantities block, just generate those quantities in a post-processing step. You can save the samples from your MCMC runs to a CSV file (or other file format that’s convenient for you), load those saved samples into another Python script, and then iterate over the samples.

For example, at the end of your PyStan run, you save your model fit as follows:

fit.to_dataframe().to_csv("my_samples.csv.gz",
                          compression = "gzip",
                          index = False)

In another script, you can do something like the following:

import gzip

import pandas as pd
import numpy as np

my_samples = pd.read_csv("my_samples.csv.gz")

num_samples = my_samples.shape[1]
BIlmt_coeff = my_samples["BIlmt_coeff"]
unit_value_model2_coeff = my_samples["unit_value_model2_coeff"]
# More parameter vars ...

# Load your data here ...
my_data = ...
BIlmt_model2_new = np.asarray(my_data["BIlmt_model2_new"])
# More data vars ...

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        # y_new is an array of length N_new
        y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + ...))
        
        out_file.write(str(y_new)) # There's probably some smarter way to 
                                   # write this to a file.

The main thing of interest is that you are iterating over the parameter samples so that you only have one instance of y_new in memory at a time, and then you write it to a file.

(I also used gzip to write to files in order to save disk space.)
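
About the “smarter way to write this to a file” comment in the sketch above: one option (assuming y_new is a 1-D integer array) is to write one comma-separated line per draw, for example:

import gzip

import numpy as np

num_samples = 10                        # placeholder; use the real number of draws
rng = np.random.default_rng()

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        rate = np.exp(rng.normal(size=5))   # placeholder for the linear predictor of draw i
        y_new = rng.poisson(rate)
        # one comma-separated line per draw; each line holds the N_new predictions
        out_file.write(",".join(map(str, y_new)) + "\n")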

Thanks. So I got this to run but I’m not exactly sure what I’m looking at. I just put this out to a dataframe, which has shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.

I only have 16 variables so I’m not sure where the 26 comes from. Also, the posterior df is all whole numbers. See below:

[screenshot of the resulting dataframe omitted]

Did I do something wrong? Below is my code:

num_samples = my_samples.shape[1]

#get coefficients
BIlmt_coeff = my_samples['BIlmt_coeff']
unit_value_model2_coeff = my_samples['unit_value_model2_coeff']
yrs_owned_model2_coeff = my_samples['yrs_owned_model2_coeff']
credit_c5_score_coeff = my_samples['credit_c5_score_coeff']
credit_c6_score_coeff = my_samples['credit_c6_score_coeff']
c6_2_coeff = my_samples['c6_2_coeff']
c6_3_coeff = my_samples['c6_3_coeff']
credit_52778_score_coeff = my_samples['credit_52778_score_coeff']
thinfile_52778_coeff = my_samples['thinfile_52778_coeff']
nohit_52778_coeff = my_samples['nohit_52778_coeff']
class_model2_Sport_coeff = my_samples['class_model2_Sport_coeff']
class_model2_Street_coeff = my_samples['class_model2_Street_coeff']
class_model2_other_coeff = my_samples['class_model2_other_coeff']
Single_Driver_Age_model_coeff = my_samples['Single_Driver_Age_model_coeff']
Driver_Age_model_coeff = my_samples['Driver_Age_model_coeff']
intercept = my_samples['intercept']

#get new data
BIlmt_model2_new = np.asarray(x_test['BIlmt_model2'])
multi_policy_count_model_new = np.asarray(x_test['multi_policy_count_model'])
unit_value_model2_new = np.asarray(x_test['unit_value_model2'])
yrs_owned_model2_new = np.asarray(x_test['yrs_owned_model2'])
credit_c5_score_new = np.asarray(x_test['credit_c5_score'])
credit_c6_score_new = np.asarray(x_test['credit_c6_score'])
c6_2_new = np.asarray(x_test['np.maximum(credit_c6_score - 2, 0)'])
c6_3_new = np.asarray(x_test['np.maximum(credit_c6_score - 3, 0)'])
credit_52778_score_new = np.asarray(x_test['credit_52778_score'])
thinfile_52778_new = np.asarray(x_test['np.maximum(credit_52778_score - 2, 0)'])
nohit_52778_coeff_new = np.asarray(x_test['np.maximum(credit_52778_score - 3, 0)'])
class_model2_Sport_new = np.asarray(x_test['class_model2_Sport'])
class_model2_Street_new = np.asarray(x_test['class_model2_Street'])
class_model2_other_new = np.asarray(x_test['class_model2_other'])
Single_Driver_Age_model_new = np.asarray(x_test['marital_status_model_S:Driver_Age_model'])
Driver_Age_model_new = np.asarray(x_test['Driver_Age_model'])

#run to get posterior
posterior = []
for i in range(num_samples):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + 
                                    multi_policy_count_model_new*.9 +
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    credit_c5_score_new*credit_c5_score_coeff[i] + 
                                    credit_c6_score_new*credit_c6_score_coeff[i] + 
                                    c6_2_new*c6_2_coeff[i] + 
                                    c6_3_new*c6_3_coeff[i] + 
                                    credit_52778_score_new*credit_52778_score_coeff[i] + 
                                    thinfile_52778_new*thinfile_52778_coeff[i] + 
                                    nohit_52778_coeff_new*nohit_52778_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
                                    Driver_Age_model_new*Driver_Age_model_coeff[i]))
    posterior.append(y_new)

df1 = pd.DataFrame(posterior)

Put what out to a dataframe? Are you referring to the my_samples dataframe or df1? And how did you create the dataframe my_samples in the first place?

Samples from a Poisson distribution are non-negative integers. Should you not expect that?

I just put this out to a dataframe, which has shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.

I only have 16 variables so I’m not sure where the 26 comes from.

My apologies. I created a data frame instead of writing out to a gzip file. Do you have a good script on posterior predictive checks?

Samples from a Poisson distribution are non-negative integers. Should you not expect that?
Yes you are correct. I understand now.

You’d likely be creating a dataframe anyway, so that’s not a big deal. What I want to know is how you created that dataframe, because that will determine whether the reported shape of the dataframe, (26, 82868), makes any sense for what you want to do.

I created it like this:

posterior = []
for i in range(num_samples):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + 
                                    multi_policy_count_model_new*.9 +
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    credit_c5_score_new*credit_c5_score_coeff[i] + 
                                    credit_c6_score_new*credit_c6_score_coeff[i] + 
                                    c6_2_new*c6_2_coeff[i] + 
                                    c6_3_new*c6_3_coeff[i] + 
                                    credit_52778_score_new*credit_52778_score_coeff[i] + 
                                    thinfile_52778_new*thinfile_52778_coeff[i] + 
                                    nohit_52778_coeff_new*nohit_52778_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
                                    Driver_Age_model_new*Driver_Age_model_coeff[i]))
    posterior.append(y_new)

df1 = pd.DataFrame(posterior)

Show me the results of the following:

print(my_samples.shape)
print(x_test.shape)
print(df1.shape)
my_samples: (18000, 26), x_test: (82868, 16), df1: (26, 82868)

Now that I’m looking, I’m thinking it’s pulling in the sampler columns (chain, draw, warmup, lp__, accept_stat__, etc.). Maybe filter those out?

Ah, now things are making sense. Those extra columns are why my_samples is (18000,26).

Also, I made a mistake by writing num_samples = my_samples.shape[1]. It should be num_samples = my_samples.shape[0]. A pure brain fart on my part.

Of course, with num_samples now being 18000, the posterior dataframe will be huge, 18000 \times 82868, possibly too large to fit in memory. That’s why I’d recommend appending each y_new to a file rather than keeping it all in memory.
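
Putting the two fixes together (dropping the sampler columns and counting draws by rows), a sketch of the setup looks like this; the dropped names are the usual PyStan to_dataframe() bookkeeping columns, so adjust the list to whatever your dataframe actually contains:

import pandas as pd

my_samples = pd.read_csv("my_samples.csv.gz")

# Sampler bookkeeping columns added by fit.to_dataframe(); keep only the parameters.
diagnostic_cols = ["chain", "draw", "warmup", "lp__", "accept_stat__",
                   "stepsize__", "treedepth__", "n_leapfrog__",
                   "divergent__", "energy__"]
param_samples = my_samples.drop(columns=[c for c in diagnostic_cols
                                         if c in my_samples.columns])

num_samples = param_samples.shape[0]  # the number of draws is the number of rows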

Thanks. So how do I interpret this? Is every row 82,868 predicted values? I’m not sure of the best way to show the predicted values as a histogram with the actual mean displayed.

(“advanced” stuff): you could convert posterior to an ArviZ InferenceData object, save it as netCDF, and then use dask to do out-of-core computation (xarray.Dataset / xarray.DataArray)

https://arviz-devs.github.io/arviz/notebooks/InferenceDataCookbook.html

https://xarray.pydata.org/en/latest/dask.html
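
A rough sketch of that route (variable names and sizes are placeholders; ArviZ expects a leading chain dimension on the draws):

import arviz as az
import numpy as np

# Placeholder posterior predictive draws with shape (chains, draws, N_new).
y_new_draws = np.random.poisson(1.0, size=(1, 100, 50))

idata = az.from_dict(posterior_predictive={"y_new": y_new_draws})
idata.to_netcdf("posterior_predictive.nc")

# Reload later (or in another process) and wrap with xarray/dask
# for out-of-core computation.
idata2 = az.from_netcdf("posterior_predictive.nc")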

Each row is a predicted sample, which is a vector of length x_test.shape[0]. When you did MCMC, you drew 18000 samples of your parameters, so you now have 18000 samples from your posterior predictive distribution. You could summarize that posterior predictive distribution this way:

num_test_points = x_test.shape[0]
posterior_moments = []
for i in range(num_test_points):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new[i]*BIlmt_coeff + 
                                    multi_policy_count_model_new[i]*.9 +
                                    unit_value_model2_new[i]*unit_value_model2_coeff +
                                    yrs_owned_model2_new[i]*yrs_owned_model2_coeff +
                                    credit_c5_score_new[i]*credit_c5_score_coeff + 
                                    credit_c6_score_new[i]*credit_c6_score_coeff + 
                                    c6_2_new[i]*c6_2_coeff + 
                                    c6_3_new[i]*c6_3_coeff + 
                                    credit_52778_score_new[i]*credit_52778_score_coeff + 
                                    thinfile_52778_new[i]*thinfile_52778_coeff + 
                                    nohit_52778_coeff_new[i]*nohit_52778_coeff +
                                    class_model2_Sport_new[i]*class_model2_Sport_coeff + 
                                    class_model2_Street_new[i]*class_model2_Street_coeff +
                                    class_model2_other_new[i]*class_model2_other_coeff +
                                    Single_Driver_Age_model_new[i]*Single_Driver_Age_model_coeff +
                                    Driver_Age_model_new[i]*Driver_Age_model_coeff))

    posterior_moments.append((np.mean(y_new), np.std(y_new, ddof=1)))

That will get you the mean and standard deviation of y for each test point.

ETA: Notice that I’m now effectively iterating along the rows of x_test rather than along the rows of my_samples.
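
As for the histogram you mentioned, a minimal matplotlib sketch (assuming y_new holds the pooled predictive draws and y_actual_mean is the observed mean you want to mark; both are placeholders here):

import matplotlib.pyplot as plt
import numpy as np

y_new = np.random.poisson(2.0, size=18000)  # placeholder predictive draws
y_actual_mean = 2.1                         # placeholder observed mean

plt.hist(y_new, bins=np.arange(y_new.max() + 2) - 0.5, density=True)
plt.axvline(y_actual_mean, color="red", linestyle="--", label="observed mean")
plt.xlabel("predicted count")
plt.ylabel("relative frequency")
plt.legend()
plt.show()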

Thanks for sticking with me. So row 1 of 18,000 has 82,868 values. What do those values represent? I was expecting to get 18,000 predicted y values. Is that not the case?

Edit:

Is the value in each column a predicted value from that sample? So each of the 18,000 rows is one set of sampled parameters, and the values I’m seeing are the predicted y for each observation in the test data?

Think of it this way. You have some sort of functional relationship, \mathbf{y} = f(\mathbf{X}, \boldsymbol{\theta}). \mathbf{X} is your matrix of input points (e.g. x_test in your Python script), with row i representing the i^{\rm th} vector of inputs (e.g. BIlmt_model2_new[i], multi_policy_count_model_new[i], etc.). \boldsymbol{\theta} is a vector of model parameters (e.g. BIlmt_coeff, unit_value_model2_coeff, etc.)

For any single value of \boldsymbol{\theta}, inputting the 82868\times16 matrix \mathbf{X} into f will get you a vector \mathbf{Y} of length 82868. But you’ve sampled 18000 random values of \boldsymbol{\theta}, so now you have 18000 predicted values of the vector \mathbf{Y}.
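
In NumPy terms, the shape logic looks like this (sizes shortened and names hypothetical, just to show where 18000 \times 82868 comes from):

import numpy as np

n_draws, n_test, n_params = 5, 7, 3          # stand-ins for 18000, 82868, 16

X = np.random.randn(n_test, n_params)        # design matrix (x_test)
theta = np.random.randn(n_draws, n_params)   # posterior draws of the coefficients

# For one draw theta_i, f(X, theta_i) is a vector of length n_test;
# stacking all draws gives an (n_draws, n_test) array.
rate = np.exp(X @ theta.T).T                 # shape (n_draws, n_test)
y_new = np.random.poisson(rate)
print(y_new.shape)                           # (5, 7)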

Got it. Thank you @jjramsey. I’m going to think through how I want to infer from this and start another thread if needed. Thanks so much!