Generated quantities throwing parsing error in PyStan

Hello,

I’m trying to use the generated quantities block in PyStan, and it’s throwing an error.

liability_code = """
data {
    int<lower=0> N;
    vector[N] BIlmt_model2;
    vector[N] multi_policy_count_model;
    vector[N] unit_value_model2;
    vector[N] yrs_owned_model2;
    vector[N] credit_c5_score;
    vector[N] credit_c6_score;
    vector[N] c6_2;
    vector[N] c6_3;
    vector[N] credit_52778_score;
    vector[N] nohit_52778;
    vector[N] thinfile_52778;
    vector[N] class_model2_Sport;
    vector[N] class_model2_Street;
    vector[N] class_model2_other;
    vector[N] Single_Driver_Age_model;
    vector[N] Driver_Age_model;
    int<lower=0> y1[N];
    
    int<lower=0> N_new;
    
    vector[N_new] BIlmt_model2;
    vector[N_new] multi_policy_count_model;
    vector[N_new] unit_value_model2;
    vector[N_new] yrs_owned_model2;
    vector[N_new] credit_c5_score;
    vector[N_new] credit_c6_score;
    vector[N_new] c6_2;
    vector[N_new] c6_3;
    vector[N_new] credit_52778_score;
    vector[N_new] nohit_52778;
    vector[N_new] thinfile_52778;
    vector[N_new] class_model2_Sport;
    vector[N_new] class_model2_Street;
    vector[N_new] class_model2_other;
    vector[N_new] Single_Driver_Age_model;
    vector[N_new] Driver_Age_model;
    
}
parameters {            
    real BIlmt_coeff;      
    real unit_value_model2_coeff;
    real yrs_owned_model2_coeff;
    real credit_c5_score_coeff;
    real credit_c6_score_coeff;
    real c6_2_coeff;
    real c6_3_coeff;
    real credit_52778_score_coeff;
    real thinfile_52778_coeff;
    real nohit_52778_coeff;
    real class_model2_Sport_coeff;
    real class_model2_Street_coeff;
    real class_model2_other_coeff;
    real Single_Driver_Age_model_coeff;
    real Driver_Age_model_coeff;
    real intercept;        // parameter for mean or intercept   
    

    

}
model {
    BIlmt_coeff ~ normal(0, 1);  
    unit_value_model2_coeff ~ normal(0, 1); 
    yrs_owned_model2_coeff ~ normal(0, 1); 
    credit_c5_score_coeff ~ normal(0, 1); 
    credit_c6_score_coeff ~ normal(0, 1); 
    c6_2_coeff ~ normal(0, 1); 
    c6_3_coeff ~ normal(0, 1); 
    credit_52778_score_coeff ~ normal(0,1);
    thinfile_52778_coeff ~ normal(0,1);
    nohit_52778_coeff ~ normal(0,1);
    class_model2_Sport_coeff ~ normal(0, 1); 
    class_model2_Street_coeff ~ normal(0, 1); 
    class_model2_other_coeff ~ normal(0, 1); 
    Single_Driver_Age_model_coeff ~ normal(0, 1); 
    Driver_Age_model_coeff ~ normal(0, 1); 
    intercept ~ normal(0,5);
    
    y1 ~ poisson_log(BIlmt_model2*BIlmt_coeff + multi_policy_count_model*.9 + \
                  unit_value_model2*unit_value_model2_coeff + yrs_owned_model2*unit_value_model2_coeff +\
                  yrs_owned_model2*yrs_owned_model2_coeff + credit_c5_score*credit_c5_score_coeff +\
                  credit_c6_score*credit_c6_score_coeff + c6_2*c6_2_coeff + c6_3*c6_3_coeff +\
                  class_model2_Sport*class_model2_Sport_coeff + class_model2_Street*class_model2_Street_coeff +\
                  class_model2_other*class_model2_other_coeff + \
                  Single_Driver_Age_model*Single_Driver_Age_model_coeff + \
                  Driver_Age_model*Driver_Age_model_coeff + credit_52778_score*credit_52778_score_coeff + \
                  thinfile_52778*thinfile_52778_coeff + nohit_52778*nohit_52778_coeff + intercept);   
}
generated quantities {
    vector[N_new] y_new;
    for (n in 1:N_new)
        y_new[n] = poisson_log(BIlmt_model2[n]*BIlmt_coeff + multi_policy_count_model[n]*.9 + \
                      unit_value_model2_ne*unit_value_model2_coeff + yrs_owned_model2[n]*unit_value_model2_coeff +\
                      yrs_owned_model2_ne*yrs_owned_model2_coeff + credit_c5_score[n]*credit_c5_score_coeff +\
                      credit_c6_score[n]*credit_c6_score_coeff + c6_2[n]*c6_2_coeff + c6_3[n]*c6_3_coeff +\
                      class_model2_Sport[n]*class_model2_Sport_coeff + class_model2_Street[n]*class_model2_Street_coeff +\
                      class_model2_other[n]*class_model2_other_coeff + \
                      Single_Driver_Age_model[n]*Single_Driver_Age_model_coeff + \
                      Driver_Age_model[n]*Driver_Age_model_coeff + credit_52778_score[n]*credit_52778_score_coeff + \
                      thinfile_52778[n]*thinfile_52778_coeff + nohit_52778[n]*nohit_52778_coeff + intercept);
    }
"""

This is throwing:

ValueError: Failed to parse Stan model 'anon_model_e4b33c3d25ec074c3b0ac0b520ba39ea'. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Duplicate declaration of variable, name=BIlmt_model2; attempt to redeclare as vector in data; previously declared as vector in data
 error in 'unknown file name' at line 24, column 30
  -------------------------------------------------
    22:     int<lower=0> N_new;
    23:     
    24:     vector[N_new] BIlmt_model2;
                                     ^
    25:     vector[N_new] multi_policy_count_model;
  -------------------------------------------------

I’m not sure what I’m doing wrong.


This might be a good example of why it’s best practice to put the Stan code in a separate file rather than in a string variable. It makes it easier to see which line an error message is referring to, and you’d avoid “unknown file name” in the message as well.
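For example, with PyStan 2 you can compile directly from a .stan file (a minimal sketch; liability_model.stan is a placeholder name, and stan_data stands for the data dictionary you already pass in):

import pystan

# Compiling from a file means parse errors report real line numbers in a
# named file instead of 'unknown file name'.
sm = pystan.StanModel(file="liability_model.stan")
fit = sm.sampling(data=stan_data)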

Perhaps that’s why you thought the error was in the generated quantities block, not the data block?

Anyway, it looks like there’s a copy-and-paste error in the data block. Most of the vector variables are declared twice, once with length N and again with length N_new, and the parser complained about the first duplicate it found: BIlmt_model2.

It’s not clear to me why you need N_new.

Thanks @jjramsey. I added _new to the new variables in the data block; these correspond to the test data, which is passed in with dictionary keys such as BIlmt_model2_new. Below is the new code:

liability_code = """
data {
    int<lower=0> N;
    vector[N] BIlmt_model2;
    vector[N] multi_policy_count_model;
    vector[N] unit_value_model2;
    vector[N] yrs_owned_model2;
    vector[N] credit_c5_score;
    vector[N] credit_c6_score;
    vector[N] c6_2;
    vector[N] c6_3;
    vector[N] credit_52778_score;
    vector[N] nohit_52778;
    vector[N] thinfile_52778;
    vector[N] class_model2_Sport;
    vector[N] class_model2_Street;
    vector[N] class_model2_other;
    vector[N] Single_Driver_Age_model;
    vector[N] Driver_Age_model;
    int<lower=0> y1[N];
    
    int<lower=0> N_new;
    
    vector[N_new] BIlmt_model2_new;
    vector[N_new] multi_policy_count_model_new;
    vector[N_new] unit_value_model2_new;
    vector[N_new] yrs_owned_model2_new;
    vector[N_new] credit_c5_score_new;
    vector[N_new] credit_c6_score_new;
    vector[N_new] c6_2_new;
    vector[N_new] c6_3_new;
    vector[N_new] credit_52778_score_new;
    vector[N_new] nohit_52778_new;
    vector[N_new] thinfile_52778_new;
    vector[N_new] class_model2_Sport_new;
    vector[N_new] class_model2_Street_new;
    vector[N_new] class_model2_other_new;
    vector[N_new] Single_Driver_Age_model_new;
    vector[N_new] Driver_Age_model_new;
    
}
parameters {            
    real BIlmt_coeff;      
    real unit_value_model2_coeff;
    real yrs_owned_model2_coeff;
    real credit_c5_score_coeff;
    real credit_c6_score_coeff;
    real c6_2_coeff;
    real c6_3_coeff;
    real credit_52778_score_coeff;
    real thinfile_52778_coeff;
    real nohit_52778_coeff;
    real class_model2_Sport_coeff;
    real class_model2_Street_coeff;
    real class_model2_other_coeff;
    real Single_Driver_Age_model_coeff;
    real Driver_Age_model_coeff;
    real intercept;        // parameter for mean or intercept   
    

    

}
model {
    BIlmt_coeff ~ normal(0, 1);  
    unit_value_model2_coeff ~ normal(0, 1); 
    yrs_owned_model2_coeff ~ normal(0, 1); 
    credit_c5_score_coeff ~ normal(0, 1); 
    credit_c6_score_coeff ~ normal(0, 1); 
    c6_2_coeff ~ normal(0, 1); 
    c6_3_coeff ~ normal(0, 1); 
    credit_52778_score_coeff ~ normal(0,1);
    thinfile_52778_coeff ~ normal(0,1);
    nohit_52778_coeff ~ normal(0,1);
    class_model2_Sport_coeff ~ normal(0, 1); 
    class_model2_Street_coeff ~ normal(0, 1); 
    class_model2_other_coeff ~ normal(0, 1); 
    Single_Driver_Age_model_coeff ~ normal(0, 1); 
    Driver_Age_model_coeff ~ normal(0, 1); 
    intercept ~ normal(0,5);
    
    y1 ~ poisson_log(BIlmt_model2*BIlmt_coeff + multi_policy_count_model*.9 + \
                  unit_value_model2*unit_value_model2_coeff + yrs_owned_model2*unit_value_model2_coeff +\
                  yrs_owned_model2*yrs_owned_model2_coeff + credit_c5_score*credit_c5_score_coeff +\
                  credit_c6_score*credit_c6_score_coeff + c6_2*c6_2_coeff + c6_3*c6_3_coeff +\
                  class_model2_Sport*class_model2_Sport_coeff + class_model2_Street*class_model2_Street_coeff +\
                  class_model2_other*class_model2_other_coeff + \
                  Single_Driver_Age_model*Single_Driver_Age_model_coeff + \
                  Driver_Age_model*Driver_Age_model_coeff + credit_52778_score*credit_52778_score_coeff + \
                  thinfile_52778*thinfile_52778_coeff + nohit_52778*nohit_52778_coeff + intercept);   
}
generated quantities {
    vector[N_new] y_new;
    for (n in 1:N_new)
        y_new[n] = poisson_log(BIlmt_model2_new[n]*BIlmt_coeff + multi_policy_count_model_new[n]*.9 + \
                      unit_value_model2_new[n]*unit_value_model2_coeff + yrs_owned_model2_new[n]*unit_value_model2_coeff +\
                      yrs_owned_model2_new[n]*yrs_owned_model2_coeff + credit_c5_score_new[n]*credit_c5_score_coeff +\
                      credit_c6_score_new[n]*credit_c6_score_coeff + c6_2_new[n]*c6_2_coeff + c6_3_new[n]*c6_3_coeff +\
                      class_model2_Sport_new[n]*class_model2_Sport_coeff + class_model2_Street_new[n]*class_model2_Street_coeff +\
                      class_model2_other_new[n]*class_model2_other_coeff + \
                      Single_Driver_Age_model_new[n]*Single_Driver_Age_model_coeff + \
                      Driver_Age_model_new[n]*Driver_Age_model_coeff + credit_52778_score_new[n]*credit_52778_score_coeff + \
                      thinfile_52778_new[n]*thinfile_52778_coeff + nohit_52778_new[n]*nohit_52778_coeff + intercept);
    }
"""

I’m still getting the same error though.

Are you sure it is the same error?


This should be poisson_log_rng(...).


If N_new is large, you may run out of memory for reasons similar to those discussed in the topic First Pystan Poisson Model. Generated quantities take up memory in PyStan the same way that parameters do.

Ahhh. OK, I put this in an actual .txt file and loaded it that way instead of keeping it in a string object. I changed y_new[n] = poisson_log(…) to poisson_log_rng(...), and now it parses and runs.

Do you have a suggestion for the memory issues? I’m working to get my workplace to adopt these techniques over our standard frequentist practice because of the inference I can glean from them, but our data sets are big.

Have you tried running this using CmdStanPy? A big model with many parameters will take up memory, but there might be less memory overhead with respect to data and outputs.
https://cmdstanpy.readthedocs.io/en/latest/generate_quantities.html
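A rough sketch of that workflow, based on the linked docs (the file names are placeholders, and the argument name mcmc_sample may differ between CmdStanPy versions):

from cmdstanpy import CmdStanModel

# Fit once, then run only the generated quantities block against the
# saved draws; outputs stream to CSV files on disk rather than being
# held in memory all at once.
model = CmdStanModel(stan_file="liability_model.stan")
fit = model.sample(data="liability_data.json")
new_quantities = model.generate_quantities(data="liability_data.json",
                                           mcmc_sample=fit)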

Instead of using the generated quantities block, just generate those quantities in a post-processing step. You can save the samples from your MCMC runs to a CSV file (or other file format that’s convenient for you), load those saved samples into another Python script, and then iterate over the samples.

For example, at the end of your PyStan run, you save your model fit as follows:

fit.to_dataframe().to_csv("my_samples.csv.gz",
                          compression = "gzip",
                          index = False)

In another script, you can do something like the following:

import gzip

import pandas as pd
import numpy as np

my_samples = pd.read_csv("my_samples.csv.gz")

num_samples = my_samples.shape[1]
BIlmt_coeff = my_samples["BIlmt_coeff"]
unit_value_model2_coeff = my_samples["unit_value_model2_coeff"]
# More parameter vars ...

# Load your data here ...
my_data = ...
BIlmt_model2_new = np.asarray(my_data["BIlmt_model2_new"])
# More data vars ...

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        # y_new is an array of length N_new
        y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + ...))
        
        out_file.write(str(y_new)) # There's probably some smarter way to 
                                   # write this to a file.

The main thing of interest is that you are iterating over the parameter samples so that you only have one instance of y_new in memory at a time, and then you write it to a file.

(I also used gzip to write to files in order to save disk space.)
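About the “smarter way” flagged in the comment above: one option is np.savetxt, which can append each y_new to the open gzip handle as a single whitespace-separated row, so each line of the output file is one posterior draw. A sketch, reusing num_samples and eliding the rate expression as ... like above:

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        # Same rate expression as in the loop above (elided here).
        y_new = np.random.poisson(np.exp(...))
        # One row per draw; fmt="%d" keeps the Poisson counts as integers.
        np.savetxt(out_file, y_new[np.newaxis, :], fmt="%d")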

Thanks. So I got this to run, but I’m not exactly sure what I’m looking at. I just put this out to a dataframe, which has shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.

I only have 16 variables so I’m not sure where the 26 comes from. Also, the posterior df is all whole numbers. See below:

[screenshot: dataframe of whole-number values]

Did I do something wrong? Below is my code:

num_samples = my_samples.shape[1]

#get coefficients
BIlmt_coeff = my_samples['BIlmt_coeff']
unit_value_model2_coeff = my_samples['unit_value_model2_coeff']
yrs_owned_model2_coeff = my_samples['yrs_owned_model2_coeff']
credit_c5_score_coeff = my_samples['credit_c5_score_coeff']
credit_c6_score_coeff = my_samples['credit_c6_score_coeff']
c6_2_coeff = my_samples['c6_2_coeff']
c6_3_coeff = my_samples['c6_3_coeff']
credit_52778_score_coeff = my_samples['credit_52778_score_coeff']
thinfile_52778_coeff = my_samples['thinfile_52778_coeff']
nohit_52778_coeff = my_samples['nohit_52778_coeff']
class_model2_Sport_coeff = my_samples['class_model2_Sport_coeff']
class_model2_Street_coeff = my_samples['class_model2_Street_coeff']
class_model2_other_coeff = my_samples['class_model2_other_coeff']
Single_Driver_Age_model_coeff = my_samples['Single_Driver_Age_model_coeff']
Driver_Age_model_coeff = my_samples['Driver_Age_model_coeff']
intercept = my_samples['intercept']

#get new data
BIlmt_model2_new = np.asarray(x_test['BIlmt_model2'])
multi_policy_count_model_new = np.asarray(x_test['multi_policy_count_model'])
unit_value_model2_new = np.asarray(x_test['unit_value_model2'])
yrs_owned_model2_new = np.asarray(x_test['yrs_owned_model2'])
credit_c5_score_new = np.asarray(x_test['credit_c5_score'])
credit_c6_score_new = np.asarray(x_test['credit_c6_score'])
c6_2_new = np.asarray(x_test['np.maximum(credit_c6_score - 2, 0)'])
c6_3_new = np.asarray(x_test['np.maximum(credit_c6_score - 3, 0)'])
credit_52778_score_new = np.asarray(x_test['credit_52778_score'])
thinfile_52778_new = np.asarray(x_test['np.maximum(credit_52778_score - 2, 0)'])
nohit_52778_coeff_new = np.asarray(x_test['np.maximum(credit_52778_score - 3, 0)'])
class_model2_Sport_new = np.asarray(x_test['class_model2_Sport'])
class_model2_Street_new = np.asarray(x_test['class_model2_Street'])
class_model2_other_new = np.asarray(x_test['class_model2_other'])
Single_Driver_Age_model_new = np.asarray(x_test['marital_status_model_S:Driver_Age_model'])
Driver_Age_model_new = np.asarray(x_test['Driver_Age_model'])

#run to get posterior
posterior = []
for i in range(num_samples):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + 
                                    multi_policy_count_model_new*.9 +
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    credit_c5_score_new*credit_c5_score_coeff[i] + 
                                    credit_c6_score_new*credit_c6_score_coeff[i] + 
                                    c6_2_new*c6_2_coeff[i] + 
                                    c6_3_new*c6_3_coeff[i] + 
                                    credit_52778_score_new*credit_52778_score_coeff[i] + 
                                    thinfile_52778_new*thinfile_52778_coeff[i] + 
                                    nohit_52778_coeff_new*nohit_52778_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
                                    Driver_Age_model_new*Driver_Age_model_coeff[i]))
    posterior.append(y_new)

df1 = pd.DataFrame(posterior)

Put what out to a dataframe? Are you referring to the my_samples dataframe or df1? And how did you create the dataframe my_samples in the first place?

Samples from a Poisson distribution are non-negative integers. Should you not expect that?

I just put this out to a dataframe, which has shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.

I only have 16 variables so I’m not sure where the 26 comes from.

My apologies. I created a data frame instead of writing out to a gzip file. Do you have a good script on posterior predictive checks?

Samples from a Poisson distribution are non-negative integers. Should you not expect that?
Yes, you are correct. I understand now.

You’d likely be creating a dataframe anyway, so that’s not a big deal. What I want to know is how you created that dataframe, because that will determine whether the reported shape of the dataframe, (26, 82868), makes any sense for what you want to do.

I created it like this:

posterior = []
for i in range(num_samples):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + 
                                    multi_policy_count_model_new*.9 +
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    credit_c5_score_new*credit_c5_score_coeff[i] + 
                                    credit_c6_score_new*credit_c6_score_coeff[i] + 
                                    c6_2_new*c6_2_coeff[i] + 
                                    c6_3_new*c6_3_coeff[i] + 
                                    credit_52778_score_new*credit_52778_score_coeff[i] + 
                                    thinfile_52778_new*thinfile_52778_coeff[i] + 
                                    nohit_52778_coeff_new*nohit_52778_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
                                    Driver_Age_model_new*Driver_Age_model_coeff[i]))
    posterior.append(y_new)

df1 = pd.DataFrame(posterior)

Show me the results of the following:

print(my_samples.shape)
print(x_test.shape)
print(df1.shape)

my_samples: (18000, 26), x_test: (82868, 16), df1: (26, 82868)

Now that I’m looking, I’m thinking it’s pulling in the sampler columns [chain, draw, warmup, lp__, accept_stat__], etc. Maybe filter those out?
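Something like this, perhaps (a sketch; it assumes my_samples is the dataframe read from the CSV, and that PyStan names its sampler diagnostics with a trailing double underscore):

# Keep only model parameters: drop the bookkeeping columns and any
# column ending in '__' (lp__, accept_stat__, energy__, etc.).
drop_cols = {"chain", "draw", "warmup"}
param_cols = [c for c in my_samples.columns
              if c not in drop_cols and not c.endswith("__")]
my_samples = my_samples[param_cols]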

Ah, now things are making sense. Those extra columns are why my_samples is (18000, 26).

Also, I made a mistake by writing num_samples = my_samples.shape[1]. It should be num_samples = my_samples.shape[0]. A pure brainfart on my part.

Of course, with num_samples now being 18000, the dataframe posterior will be huge, 18000 × 82868, possibly too large to fit in memory. That’s why I’d recommend appending each y_new to a file, rather than keeping it in memory.

Thanks. So how do I interpret this? Is every row 82,868 predicted values? I’m not sure of the best way to show the predicted values as a histogram with the actual mean displayed.

(“Advanced” stuff): you could transform posterior to an ArviZ InferenceData object, save it as netCDF, and then use dask to do out-of-core computation (xarray.Dataset / xarray.DataArray).

https://arviz-devs.github.io/arviz/notebooks/InferenceDataCookbook.html

https://xarray.pydata.org/en/latest/dask.html
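A rough sketch of that route (it assumes posterior is the list of draws built in the loop above; the ArviZ and xarray APIs may vary by version):

import arviz as az
import numpy as np

# from_dict expects a leading chain dimension, so add one.
pp = np.asarray(posterior)  # shape (num_samples, N_new)
idata = az.from_dict(posterior_predictive={"y_new": pp[np.newaxis, ...]})
# Saved as netCDF, xarray/dask can open and process it lazily (out of core).
idata.to_netcdf("y_new_posterior.nc")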

Each row is a predicted sample, which is a vector of length x_test.shape[0]. When you did MCMC, you drew 18000 samples of your parameters, so you now have 18000 samples from your posterior predictive distribution. You could summarize that distribution per test point this way:

num_test_points = x_test.shape[0]
posterior_moments = []
for i in range(num_test_points):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new[i]*BIlmt_coeff + 
                                    multi_policy_count_model_new[i]*.9 +
                                    unit_value_model2_new[i]*unit_value_model2_coeff +
                                    yrs_owned_model2_new[i]*yrs_owned_model2_coeff +
                                    credit_c5_score_new[i]*credit_c5_score_coeff + 
                                    credit_c6_score_new[i]*credit_c6_score_coeff + 
                                    c6_2_new[i]*c6_2_coeff + 
                                    c6_3_new[i]*c6_3_coeff + 
                                    credit_52778_score_new[i]*credit_52778_score_coeff + 
                                    thinfile_52778_new[i]*thinfile_52778_coeff + 
                                    nohit_52778_coeff_new[i]*nohit_52778_coeff +
                                    class_model2_Sport_new[i]*class_model2_Sport_coeff + 
                                    class_model2_Street_new[i]*class_model2_Street_coeff +
                                    class_model2_other_new[i]*class_model2_other_coeff +
                                    Single_Driver_Age_model_new[i]*Single_Driver_Age_model_coeff +
                                    Driver_Age_model_new[i]*Driver_Age_model_coeff))

    posterior_moments.append((np.mean(y_new), np.std(y_new, ddof=1)))

That will get you the mean and standard deviation of y for each test point.

ETA: Notice that I’m now effectively iterating along the rows of x_test rather than along the rows of my_samples.
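And for the histogram question from earlier, a minimal sketch (posterior_moments comes from the loop above; y_test is a hypothetical name for your array of actual outcomes on the test set):

import matplotlib.pyplot as plt
import numpy as np

# Histogram of posterior predictive means across test points, with the
# mean of the actual outcomes marked for comparison.
pred_means = np.array([m for m, s in posterior_moments])
plt.hist(pred_means, bins=50)
plt.axvline(np.mean(y_test), color="red", linestyle="--", label="observed mean")
plt.xlabel("posterior predictive mean")
plt.ylabel("number of test points")
plt.legend()
plt.show()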