Generated quantities throwing parsing error in PyStan

Hello,

I’m trying to use the generated quantities block in PyStan, and it’s throwing an error.

liability_code = """
data {
    int<lower=0> N;
    vector[N] BIlmt_model2;
    vector[N] multi_policy_count_model;
    vector[N] unit_value_model2;
    vector[N] yrs_owned_model2;
    vector[N] credit_c5_score;
    vector[N] credit_c6_score;
    vector[N] c6_2;
    vector[N] c6_3;
    vector[N] credit_52778_score;
    vector[N] nohit_52778;
    vector[N] thinfile_52778;
    vector[N] class_model2_Sport;
    vector[N] class_model2_Street;
    vector[N] class_model2_other;
    vector[N] Single_Driver_Age_model;
    vector[N] Driver_Age_model;
    int<lower=0> y1[N];
    
    int<lower=0> N_new;
    
    vector[N_new] BIlmt_model2;
    vector[N_new] multi_policy_count_model;
    vector[N_new] unit_value_model2;
    vector[N_new] yrs_owned_model2;
    vector[N_new] credit_c5_score;
    vector[N_new] credit_c6_score;
    vector[N_new] c6_2;
    vector[N_new] c6_3;
    vector[N_new] credit_52778_score;
    vector[N_new] nohit_52778;
    vector[N_new] thinfile_52778;
    vector[N_new] class_model2_Sport;
    vector[N_new] class_model2_Street;
    vector[N_new] class_model2_other;
    vector[N_new] Single_Driver_Age_model;
    vector[N_new] Driver_Age_model;
    
}
parameters {            
    real BIlmt_coeff;      
    real unit_value_model2_coeff;
    real yrs_owned_model2_coeff;
    real credit_c5_score_coeff;
    real credit_c6_score_coeff;
    real c6_2_coeff;
    real c6_3_coeff;
    real credit_52778_score_coeff;
    real thinfile_52778_coeff;
    real nohit_52778_coeff;
    real class_model2_Sport_coeff;
    real class_model2_Street_coeff;
    real class_model2_other_coeff;
    real Single_Driver_Age_model_coeff;
    real Driver_Age_model_coeff;
    real intercept;        // parameter for mean or intercept   
    

    

}
model {
    BIlmt_coeff ~ normal(0, 1);  
    unit_value_model2_coeff ~ normal(0, 1); 
    yrs_owned_model2_coeff ~ normal(0, 1); 
    credit_c5_score_coeff ~ normal(0, 1); 
    credit_c6_score_coeff ~ normal(0, 1); 
    c6_2_coeff ~ normal(0, 1); 
    c6_3_coeff ~ normal(0, 1); 
    credit_52778_score_coeff ~ normal(0,1);
    thinfile_52778_coeff ~ normal(0,1);
    nohit_52778_coeff ~ normal(0,1);
    class_model2_Sport_coeff ~ normal(0, 1); 
    class_model2_Street_coeff ~ normal(0, 1); 
    class_model2_other_coeff ~ normal(0, 1); 
    Single_Driver_Age_model_coeff ~ normal(0, 1); 
    Driver_Age_model_coeff ~ normal(0, 1); 
    intercept ~ normal(0,5);
    
    y1 ~ poisson_log(BIlmt_model2*BIlmt_coeff + multi_policy_count_model*.9 + \
                  unit_value_model2*unit_value_model2_coeff + yrs_owned_model2*unit_value_model2_coeff +\
                  yrs_owned_model2*yrs_owned_model2_coeff + credit_c5_score*credit_c5_score_coeff +\
                  credit_c6_score*credit_c6_score_coeff + c6_2*c6_2_coeff + c6_3*c6_3_coeff +\
                  class_model2_Sport*class_model2_Sport_coeff + class_model2_Street*class_model2_Street_coeff +\
                  class_model2_other*class_model2_other_coeff + \
                  Single_Driver_Age_model*Single_Driver_Age_model_coeff + \
                  Driver_Age_model*Driver_Age_model_coeff + credit_52778_score*credit_52778_score_coeff + \
                  thinfile_52778*thinfile_52778_coeff + nohit_52778*nohit_52778_coeff + intercept);   
}
generated quantities {
    vector[N_new] y_new;
    for (n in 1:N_new)
        y_new[n] = poisson_log(BIlmt_model2[n]*BIlmt_coeff + multi_policy_count_model[n]*.9 + \
                      unit_value_model2_ne*unit_value_model2_coeff + yrs_owned_model2[n]*unit_value_model2_coeff +\
                      yrs_owned_model2_ne*yrs_owned_model2_coeff + credit_c5_score[n]*credit_c5_score_coeff +\
                      credit_c6_score[n]*credit_c6_score_coeff + c6_2[n]*c6_2_coeff + c6_3[n]*c6_3_coeff +\
                      class_model2_Sport[n]*class_model2_Sport_coeff + class_model2_Street[n]*class_model2_Street_coeff +\
                      class_model2_other[n]*class_model2_other_coeff + \
                      Single_Driver_Age_model[n]*Single_Driver_Age_model_coeff + \
                      Driver_Age_model[n]*Driver_Age_model_coeff + credit_52778_score[n]*credit_52778_score_coeff + \
                      thinfile_52778[n]*thinfile_52778_coeff + nohit_52778[n]*nohit_52778_coeff + intercept);
    }
"""

This is throwing:

ValueError: Failed to parse Stan model 'anon_model_e4b33c3d25ec074c3b0ac0b520ba39ea'. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Duplicate declaration of variable, name=BIlmt_model2; attempt to redeclare as vector in data; previously declared as vector in data
 error in 'unknown file name' at line 24, column 30
  -------------------------------------------------
    22:     int<lower=0> N_new;
    23:     
    24:     vector[N_new] BIlmt_model2;
                                     ^
    25:     vector[N_new] multi_policy_count_model;
  -------------------------------------------------

I’m not sure what I’m doing wrong.


This might be a good example of why it’s best practice to put the Stan code in a separate file rather than in a string variable. It makes it easier to see which line an error message is referring to, and you’d avoid “unknown file name” in the message as well.
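For example, with PyStan 2 you can compile directly from a .stan file (a minimal sketch; liability_model.stan is a placeholder name, and stan_data stands for the data dictionary you already pass in):

import pystan

# Compiling from a file means parse errors report real line numbers in a
# named file instead of 'unknown file name'.
sm = pystan.StanModel(file="liability_model.stan")
fit = sm.sampling(data=stan_data)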

Perhaps that’s why you thought the error was in the generated quantities block, not the data block?

Anyway, it looks like there’s a copy-and-paste error in the data block. Most of the vector variables are declared twice, once with length N and again with length N_new, and the parser complained about the first duplicate it found: BIlmt_model2.

It’s not clear to me why you need N_new.

Thanks @jjramsey. I added _new to the new variables in the data block; these correspond to the test data, which is passed in with dictionary keys such as BIlmt_model2_new. Below is the new code:

liability_code = """
data {
    int<lower=0> N;
    vector[N] BIlmt_model2;
    vector[N] multi_policy_count_model;
    vector[N] unit_value_model2;
    vector[N] yrs_owned_model2;
    vector[N] credit_c5_score;
    vector[N] credit_c6_score;
    vector[N] c6_2;
    vector[N] c6_3;
    vector[N] credit_52778_score;
    vector[N] nohit_52778;
    vector[N] thinfile_52778;
    vector[N] class_model2_Sport;
    vector[N] class_model2_Street;
    vector[N] class_model2_other;
    vector[N] Single_Driver_Age_model;
    vector[N] Driver_Age_model;
    int<lower=0> y1[N];
    
    int<lower=0> N_new;
    
    vector[N_new] BIlmt_model2_new;
    vector[N_new] multi_policy_count_model_new;
    vector[N_new] unit_value_model2_new;
    vector[N_new] yrs_owned_model2_new;
    vector[N_new] credit_c5_score_new;
    vector[N_new] credit_c6_score_new;
    vector[N_new] c6_2_new;
    vector[N_new] c6_3_new;
    vector[N_new] credit_52778_score_new;
    vector[N_new] nohit_52778_new;
    vector[N_new] thinfile_52778_new;
    vector[N_new] class_model2_Sport_new;
    vector[N_new] class_model2_Street_new;
    vector[N_new] class_model2_other_new;
    vector[N_new] Single_Driver_Age_model_new;
    vector[N_new] Driver_Age_model_new;
    
}
parameters {            
    real BIlmt_coeff;      
    real unit_value_model2_coeff;
    real yrs_owned_model2_coeff;
    real credit_c5_score_coeff;
    real credit_c6_score_coeff;
    real c6_2_coeff;
    real c6_3_coeff;
    real credit_52778_score_coeff;
    real thinfile_52778_coeff;
    real nohit_52778_coeff;
    real class_model2_Sport_coeff;
    real class_model2_Street_coeff;
    real class_model2_other_coeff;
    real Single_Driver_Age_model_coeff;
    real Driver_Age_model_coeff;
    real intercept;        // parameter for mean or intercept   
    

    

}
model {
    BIlmt_coeff ~ normal(0, 1);  
    unit_value_model2_coeff ~ normal(0, 1); 
    yrs_owned_model2_coeff ~ normal(0, 1); 
    credit_c5_score_coeff ~ normal(0, 1); 
    credit_c6_score_coeff ~ normal(0, 1); 
    c6_2_coeff ~ normal(0, 1); 
    c6_3_coeff ~ normal(0, 1); 
    credit_52778_score_coeff ~ normal(0,1);
    thinfile_52778_coeff ~ normal(0,1);
    nohit_52778_coeff ~ normal(0,1);
    class_model2_Sport_coeff ~ normal(0, 1); 
    class_model2_Street_coeff ~ normal(0, 1); 
    class_model2_other_coeff ~ normal(0, 1); 
    Single_Driver_Age_model_coeff ~ normal(0, 1); 
    Driver_Age_model_coeff ~ normal(0, 1); 
    intercept ~ normal(0,5);
    
    y1 ~ poisson_log(BIlmt_model2*BIlmt_coeff + multi_policy_count_model*.9 + \
                  unit_value_model2*unit_value_model2_coeff + yrs_owned_model2*unit_value_model2_coeff +\
                  yrs_owned_model2*yrs_owned_model2_coeff + credit_c5_score*credit_c5_score_coeff +\
                  credit_c6_score*credit_c6_score_coeff + c6_2*c6_2_coeff + c6_3*c6_3_coeff +\
                  class_model2_Sport*class_model2_Sport_coeff + class_model2_Street*class_model2_Street_coeff +\
                  class_model2_other*class_model2_other_coeff + \
                  Single_Driver_Age_model*Single_Driver_Age_model_coeff + \
                  Driver_Age_model*Driver_Age_model_coeff + credit_52778_score*credit_52778_score_coeff + \
                  thinfile_52778*thinfile_52778_coeff + nohit_52778*nohit_52778_coeff + intercept);   
}
generated quantities {
    vector[N_new] y_new;
    for (n in 1:N_new)
        y_new[n] = poisson_log(BIlmt_model2_new[n]*BIlmt_coeff + multi_policy_count_model_new[n]*.9 + \
                      unit_value_model2_new[n]*unit_value_model2_coeff + yrs_owned_model2_new[n]*unit_value_model2_coeff +\
                      yrs_owned_model2_new[n]*yrs_owned_model2_coeff + credit_c5_score_new[n]*credit_c5_score_coeff +\
                      credit_c6_score_new[n]*credit_c6_score_coeff + c6_2_new[n]*c6_2_coeff + c6_3_new[n]*c6_3_coeff +\
                      class_model2_Sport_new[n]*class_model2_Sport_coeff + class_model2_Street_new[n]*class_model2_Street_coeff +\
                      class_model2_other_new[n]*class_model2_other_coeff + \
                      Single_Driver_Age_model_new[n]*Single_Driver_Age_model_coeff + \
                      Driver_Age_model_new[n]*Driver_Age_model_coeff + credit_52778_score_new[n]*credit_52778_score_coeff + \
                      thinfile_52778_new[n]*thinfile_52778_coeff + nohit_52778_new[n]*nohit_52778_coeff + intercept);
    }
"""

I’m still getting the same error though.

Are you sure it is the same error?


This should be poisson_log_rng(...).


If N_new is large, you may run out of memory for reasons similar to those discussed in the topic First Pystan Poisson Model. Generated quantities take up memory in PyStan the same way that parameters do.

Ahhh. OK, I put this in an actual .txt file and loaded it that way instead of keeping it in a string object. I changed y_new[n] = poisson_log(…) to poisson_log_rng(...), and now it parses and runs.

Do you have a suggestion for the memory issues? I’m working to get my workplace to adopt these techniques over our standard frequentist practice because of the inference I can glean from them, but our data sets are big.

Have you tried running this using CmdStanPy? A big model with many parameters will take up memory, but there might be less memory overhead with respect to data and outputs.
https://cmdstanpy.readthedocs.io/en/latest/generate_quantities.html
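A rough sketch of that workflow, based on the linked docs (the file names are placeholders, and the argument name mcmc_sample may differ between CmdStanPy versions):

from cmdstanpy import CmdStanModel

# Fit once, then run only the generated quantities block against the
# saved draws; outputs stream to CSV files on disk rather than being
# held in memory all at once.
model = CmdStanModel(stan_file="liability_model.stan")
fit = model.sample(data="liability_data.json")
new_quantities = model.generate_quantities(data="liability_data.json",
                                           mcmc_sample=fit)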

Instead of using the generated quantities block, just generate those quantities in a post-processing step. You can save the samples from your MCMC runs to a CSV file (or other file format that’s convenient for you), load those saved samples into another Python script, and then iterate over the samples.

For example, at the end of your PyStan run, you save your model fit as follows:

fit.to_dataframe().to_csv("my_samples.csv.gz",
                          compression = "gzip",
                          index = False)

In another script, you can do something like the following:

import gzip

import pandas as pd
import numpy as np

my_samples = pd.read_csv("my_samples.csv.gz")

num_samples = my_samples.shape[1]
BIlmt_coeff = my_samples["BIlmt_coeff"]
unit_value_model2_coeff = my_samples["unit_value_model2_coeff"]
# More parameter vars ...

# Load your data here ...
my_data = ...
BIlmt_model2_new = np.asarray(my_data["BIlmt_model2_new"])
# More data vars ...

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        # y_new is an array of length N_new
        y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + ...))
        
        out_file.write(str(y_new)) # There's probably some smarter way to 
                                   # write this to a file.

The main thing of interest is that you are iterating over the parameter samples so that you only have one instance of y_new in memory at a time, and then you write it to a file.

(I also used gzip to write to files in order to save disk space.)
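About the “smarter way” flagged in the comment above: one option is np.savetxt, which can append each y_new to the open gzip handle as a single whitespace-separated row, so each line of the output file is one posterior draw. A sketch, reusing num_samples and eliding the rate expression as ... like above:

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        # Same rate expression as in the loop above (elided here).
        y_new = np.random.poisson(np.exp(...))
        # One row per draw; fmt="%d" keeps the Poisson counts as integers.
        np.savetxt(out_file, y_new[np.newaxis, :], fmt="%d")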

Thanks. So I got this to run, but I’m not exactly sure what I’m looking at. I just put this out to a dataframe, which has shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.

I only have 16 variables so I’m not sure where the 26 comes from. Also, the posterior df is all whole numbers. See below:

[screenshot: dataframe of whole-number values]

Did I do something wrong? Below is my code:

num_samples = my_samples.shape[1]

#get coefficients
BIlmt_coeff = my_samples['BIlmt_coeff']
unit_value_model2_coeff = my_samples['unit_value_model2_coeff']
yrs_owned_model2_coeff = my_samples['yrs_owned_model2_coeff']
credit_c5_score_coeff = my_samples['credit_c5_score_coeff']
credit_c6_score_coeff = my_samples['credit_c6_score_coeff']
c6_2_coeff = my_samples['c6_2_coeff']
c6_3_coeff = my_samples['c6_3_coeff']
credit_52778_score_coeff = my_samples['credit_52778_score_coeff']
thinfile_52778_coeff = my_samples['thinfile_52778_coeff']
nohit_52778_coeff = my_samples['nohit_52778_coeff']
class_model2_Sport_coeff = my_samples['class_model2_Sport_coeff']
class_model2_Street_coeff = my_samples['class_model2_Street_coeff']
class_model2_other_coeff = my_samples['class_model2_other_coeff']
Single_Driver_Age_model_coeff = my_samples['Single_Driver_Age_model_coeff']
Driver_Age_model_coeff = my_samples['Driver_Age_model_coeff']
intercept = my_samples['intercept']

#get new data
BIlmt_model2_new = np.asarray(x_test['BIlmt_model2'])
multi_policy_count_model_new = np.asarray(x_test['multi_policy_count_model'])
unit_value_model2_new = np.asarray(x_test['unit_value_model2'])
yrs_owned_model2_new = np.asarray(x_test['yrs_owned_model2'])
credit_c5_score_new = np.asarray(x_test['credit_c5_score'])
credit_c6_score_new = np.asarray(x_test['credit_c6_score'])
c6_2_new = np.asarray(x_test['np.maximum(credit_c6_score - 2, 0)'])
c6_3_new = np.asarray(x_test['np.maximum(credit_c6_score - 3, 0)'])
credit_52778_score_new = np.asarray(x_test['credit_52778_score'])
thinfile_52778_new = np.asarray(x_test['np.maximum(credit_52778_score - 2, 0)'])
nohit_52778_coeff_new = np.asarray(x_test['np.maximum(credit_52778_score - 3, 0)'])
class_model2_Sport_new = np.asarray(x_test['class_model2_Sport'])
class_model2_Street_new = np.asarray(x_test['class_model2_Street'])
class_model2_other_new = np.asarray(x_test['class_model2_other'])
Single_Driver_Age_model_new = np.asarray(x_test['marital_status_model_S:Driver_Age_model'])
Driver_Age_model_new = np.asarray(x_test['Driver_Age_model'])

#run to get posterior
posterior = []
for i in range(num_samples):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + 
                                    multi_policy_count_model_new*.9 +
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    credit_c5_score_new*credit_c5_score_coeff[i] + 
                                    credit_c6_score_new*credit_c6_score_coeff[i] + 
                                    c6_2_new*c6_2_coeff[i] + 
                                    c6_3_new*c6_3_coeff[i] + 
                                    credit_52778_score_new*credit_52778_score_coeff[i] + 
                                    thinfile_52778_new*thinfile_52778_coeff[i] + 
                                    nohit_52778_coeff_new*nohit_52778_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
                                    Driver_Age_model_new*Driver_Age_model_coeff[i]))
    posterior.append(y_new)

df1 = pd.DataFrame(posterior)

Put what out to a dataframe? Are you referring to the my_samples dataframe or df1? And how did you create the dataframe my_samples in the first place?

Samples from a Poisson distribution are non-negative integers. Should you not expect that?

I just put this out to a dataframe, which has shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.

I only have 16 variables so I’m not sure where the 26 comes from.

My apologies. I created a data frame instead of writing out to a gzip file. Do you have a good script on posterior predictive checks?

Samples from a Poisson distribution are non-negative integers. Should you not expect that?
Yes, you are correct. I understand now.

You’d likely be creating a dataframe anyway, so that’s not a big deal. What I want to know is how you created that dataframe, because that will determine whether the reported shape of the dataframe, (26, 82868), makes any sense for what you want to do.

I created it like this:

posterior = []
for i in range(num_samples):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + 
                                    multi_policy_count_model_new*.9 +
                                    unit_value_model2_new*unit_value_model2_coeff[i] +
                                    yrs_owned_model2_new*yrs_owned_model2_coeff[i] +
                                    credit_c5_score_new*credit_c5_score_coeff[i] + 
                                    credit_c6_score_new*credit_c6_score_coeff[i] + 
                                    c6_2_new*c6_2_coeff[i] + 
                                    c6_3_new*c6_3_coeff[i] + 
                                    credit_52778_score_new*credit_52778_score_coeff[i] + 
                                    thinfile_52778_new*thinfile_52778_coeff[i] + 
                                    nohit_52778_coeff_new*nohit_52778_coeff[i] +
                                    class_model2_Sport_new*class_model2_Sport_coeff[i] + 
                                    class_model2_Street_new*class_model2_Street_coeff[i] +
                                    class_model2_other_new*class_model2_other_coeff[i] +
                                    Single_Driver_Age_model_new*Single_Driver_Age_model_coeff[i] +
                                    Driver_Age_model_new*Driver_Age_model_coeff[i]))
    posterior.append(y_new)

df1 = pd.DataFrame(posterior)

Show me the results of the following:

print(my_samples.shape)
print(x_test.shape)
print(df1.shape)

my_samples: (18000, 26), x_test: (82868, 16), df1: (26, 82868)

Now that I’m looking, I’m thinking it’s pulling in the sampler columns [chain, draw, warmup, lp__, accept_stat__], etc. Maybe filter those out?
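Something like this, perhaps (a sketch; it assumes my_samples is the dataframe read from the CSV, and that PyStan names its sampler diagnostics with a trailing double underscore):

# Keep only model parameters: drop the bookkeeping columns and any
# column ending in '__' (lp__, accept_stat__, energy__, etc.).
drop_cols = {"chain", "draw", "warmup"}
param_cols = [c for c in my_samples.columns
              if c not in drop_cols and not c.endswith("__")]
my_samples = my_samples[param_cols]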

Ah, now things are making sense. Those extra columns are why my_samples is (18000, 26).

Also, I made a mistake by writing num_samples = my_samples.shape[1]. It should be num_samples = my_samples.shape[0]. A pure brainfart on my part.

Of course, with num_samples now being 18000, the dataframe posterior will be huge, 18000 × 82868, possibly too large to fit in memory. That’s why I’d recommend appending each y_new to a file, rather than keeping it in memory.

Thanks. So how do I interpret this? Is every row 82,868 predicted values? I’m not sure of the best way to show the predicted values as a histogram with the actual mean displayed.

(“Advanced” stuff): you could transform posterior to an ArviZ InferenceData object, save it as netCDF, and then use dask to do out-of-core computation (xarray.Dataset / xarray.DataArray).

https://arviz-devs.github.io/arviz/notebooks/InferenceDataCookbook.html

https://xarray.pydata.org/en/latest/dask.html
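A rough sketch of that route (it assumes posterior is the list of draws built in the loop above; the ArviZ and xarray APIs may vary by version):

import arviz as az
import numpy as np

# from_dict expects a leading chain dimension, so add one.
pp = np.asarray(posterior)  # shape (num_samples, N_new)
idata = az.from_dict(posterior_predictive={"y_new": pp[np.newaxis, ...]})
# Saved as netCDF, xarray/dask can open and process it lazily (out of core).
idata.to_netcdf("y_new_posterior.nc")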

Each row is a predicted sample, which is a vector of length x_test.shape[0]. When you did MCMC, you drew 18000 samples of your parameters, so you now have 18000 samples from your posterior predictive distribution. You could summarize that distribution per test point this way:

num_test_points = x_test.shape[0]
posterior_moments = []
for i in range(num_test_points):
    y_new = np.random.poisson(np.exp(BIlmt_model2_new[i]*BIlmt_coeff + 
                                    multi_policy_count_model_new[i]*.9 +
                                    unit_value_model2_new[i]*unit_value_model2_coeff +
                                    yrs_owned_model2_new[i]*yrs_owned_model2_coeff +
                                    credit_c5_score_new[i]*credit_c5_score_coeff + 
                                    credit_c6_score_new[i]*credit_c6_score_coeff + 
                                    c6_2_new[i]*c6_2_coeff + 
                                    c6_3_new[i]*c6_3_coeff + 
                                    credit_52778_score_new[i]*credit_52778_score_coeff + 
                                    thinfile_52778_new[i]*thinfile_52778_coeff + 
                                    nohit_52778_coeff_new[i]*nohit_52778_coeff +
                                    class_model2_Sport_new[i]*class_model2_Sport_coeff + 
                                    class_model2_Street_new[i]*class_model2_Street_coeff +
                                    class_model2_other_new[i]*class_model2_other_coeff +
                                    Single_Driver_Age_model_new[i]*Single_Driver_Age_model_coeff +
                                    Driver_Age_model_new[i]*Driver_Age_model_coeff))

    posterior_moments.append((np.mean(y_new), np.std(y_new, ddof=1)))

That will get you the mean and standard deviation of y for each test point.

ETA: Notice that I’m now effectively iterating along the rows of x_test rather than along the rows of my_samples.
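And for the histogram question from earlier, a minimal sketch (posterior_moments comes from the loop above; y_test is a hypothetical name for your array of actual outcomes on the test set):

import matplotlib.pyplot as plt
import numpy as np

# Histogram of posterior predictive means across test points, with the
# mean of the actual outcomes marked for comparison.
pred_means = np.array([m for m, s in posterior_moments])
plt.hist(pred_means, bins=50)
plt.axvline(np.mean(y_test), color="red", linestyle="--", label="observed mean")
plt.xlabel("posterior predictive mean")
plt.ylabel("number of test points")
plt.legend()
plt.show()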