ValueError: Failed to parse Stan model 'anon_model_e4b33c3d25ec074c3b0ac0b520ba39ea'. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Duplicate declaration of variable, name=BIlmt_model2; attempt to redeclare as vector in data; previously declared as vector in data
error in 'unknown file name' at line 24, column 30
-------------------------------------------------
22: int<lower=0> N_new;
23:
24: vector[N_new] BIlmt_model2;
^
25: vector[N_new] multi_policy_count_model;
-------------------------------------------------
This might be a good example of why it’s best practice to put the Stan code in a separate file, rather than in a string variable. It makes it easier to see what the line number in the error message is referring to, and you’d avoid “unknown file name” in the error message as well.
Perhaps that’s why you thought the error was in the generated quantities block, not the data block?
Anyway, it looks like there’s a copy-and-paste error in the data block. Most of the variables are declared twice, and the parser has complained about the first such variable: BIlmt_model2. Several of your vector variables are not only declared twice, but with a different length in each declaration, either N or N_new.
Thanks @jjramsey. I added _new to the new variables in the data block; these correspond to the test data passed in under dictionary keys like BIlmt_model_new, for example. Below is the new code:
If N_new is large, you may run out of memory for reasons similar to those discussed in the topic First Pystan Poisson Model. Generated quantities take up memory in PyStan the same way that parameters do.
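As a rough back-of-the-envelope sketch (both numbers below are placeholders, not taken from your model), the extra memory needed for a generated quantities vector scales as draws × N_new × 8 bytes per double:
num_draws = 4000       # total post-warmup draws across all chains (assumed)
N_new = 100_000        # length of y_new (assumed)
bytes_needed = num_draws * N_new * 8   # 8 bytes per double
print(f"y_new alone needs roughly {bytes_needed / 1e9:.1f} GB")   # ~3.2 GB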
Ahhh. Ok, I put this in an actual .txt file and loaded it that way instead of as a string object. I changed y_new[n] = poisson_log(…) to poisson_log_rng(...) and it now parses and runs.
Do you have a suggestion for the memory issues? I’m working to get my workplace to start using these techniques over our standard frequentist practice because of the inference I can glean, but our data sets are big.
Instead of using the generated quantities block, just generate those quantities in a post-processing step. You can save the samples from your MCMC runs to a CSV file (or other file format that’s convenient for you), load those saved samples into another Python script, and then iterate over the samples.
For example, at the end of your PyStan run, you save your model fit as follows:
fit.to_dataframe().to_csv("my_samples.csv.gz",
                          compression = "gzip",
                          index = False)
In another script, you can do something like the following:
import gzip
import pandas as pd
import numpy as np

my_samples = pd.read_csv("my_samples.csv.gz")
num_samples = my_samples.shape[1]
BIlmt_coeff = my_samples["BIlmt_coeff"]
unit_value_model2_coeff = my_samples["unit_value_model2_coeff"]
# More parameter vars ...

# Load your data here ...
my_data = ...
BIlmt_model2_new = np.asarray(my_data["BIlmt_model2_new"])
# More data vars ...

with gzip.open("y_new.out.gz", "wt") as out_file:
    for i in range(num_samples):
        # y_new is an array of length N_new
        y_new = np.random.poisson(np.exp(BIlmt_model2_new*BIlmt_coeff[i] + ...))
        out_file.write(str(y_new))  # There's probably some smarter way to
                                    # write this to a file.
The main point is that you iterate over the parameter samples so that only one instance of y_new is in memory at a time, and then write it out to a file.
(I also used gzip to write to files in order to save disk space.)
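As an aside, one possibly tidier way to handle that write step (just a sketch): np.savetxt accepts the text-mode handle returned by gzip.open, so inside the loop you could replace the out_file.write(str(y_new)) line with:
# Write each draw as one comma-separated line; y_new[None, :] reshapes it
# to a single row so np.savetxt emits exactly one line per draw.
np.savetxt(out_file, y_new[None, :], fmt="%d", delimiter=",")
Reading the draws back later is then just a matter of iterating over the lines, or loading the compressed file with np.loadtxt or pandas.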
Thanks. So I got this to run but I’m not exactly sure what I’m looking at. I just put this out to a dataframe which is shape (26, 82868). I’m guessing the 82,868 is the number of observations from my test data set.
I only have 16 variables, so I’m not sure where the 26 comes from. Also, the posterior df is all whole numbers. See below:
You’d likely be creating a dataframe anyway, so that’s not a big deal. What I want to know is how you created that dataframe, because that will determine whether the reported shape of the dataframe, (26, 82868), makes any sense for what you want to do.
Ah, now things are making sense. Those extra columns are why my_samples is (18000, 26).
Also, I made a mistake by writing num_samples = my_samples.shape[1]. It should be num_samples = my_samples.shape[0]. A pure brain fart on my part.
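For reference, pandas reports shape as (rows, columns), so with 18000 draws and 26 columns:
my_samples.shape                     # (18000, 26): rows are draws, columns are variables
num_samples = my_samples.shape[0]    # 18000, the number of posterior draws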
Of course, with num_samples now being 18000, the dataframe posterior will be huge, 18000 × 82868 (roughly 1.5 billion entries, on the order of 12 GB at 8 bytes each), possibly too large to fit in memory. That’s why I’d recommend appending each y_new to a file, rather than keeping it all in memory.
Thanks. So how do I interpret this? Is every row 82,868 predicted values? I’m not sure of the best way to show the predicted values as a histogram with the actual mean displayed.
(“advanced” stuff): you could transform the posterior to InferenceData with ArviZ, save it as netCDF, and then use dask to do out-of-core computation (xarray.Dataset / xarray.DataArray).
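A minimal sketch of that route, assuming a PyStan fit object named fit and using the BIlmt_coeff parameter from above (the chunk size is arbitrary):
import arviz as az
import xarray as xr

# Convert the PyStan fit to InferenceData and store it as netCDF.
idata = az.from_pystan(posterior=fit)
idata.to_netcdf("fit.nc")

# Later (possibly in another script): open the posterior group lazily with
# dask chunks, so summaries stream through memory instead of loading everything.
post = xr.open_dataset("fit.nc", group="posterior", chunks={"draw": 1000})
print(post["BIlmt_coeff"].mean(dim=("chain", "draw")).compute())
Because the dataset is opened lazily, only the chunks needed for each computation are pulled into memory.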
Each row is a predicted sample, which is a vector of length x_test.shape[0]. When you did MCMC, you drew 18000 samples of your parameters, so you now have 18000 samples from your posterior predictive distribution. Now you could, for example, plot a histogram of a per-draw summary (such as each row's mean) against the observed mean, as in the sketch below.
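Here is a minimal sketch of one way to do that, assuming the predicted counts are in the (18000, 82868) DataFrame posterior (rows = draws) and the observed test counts are in an array named y_test (a name assumed here, not from your code):
import numpy as np
import matplotlib.pyplot as plt

# One summary per draw: the mean predicted count across the 82,868 observations.
pred_means = posterior.mean(axis=1)

plt.hist(pred_means, bins=50)
plt.axvline(np.mean(y_test), color="red", label="observed mean")
plt.xlabel("mean predicted count per draw")
plt.ylabel("number of draws")
plt.legend()
plt.show()
If the full posterior is too large to hold in memory, compute each row's mean inside the sampling loop (or while streaming the saved file) and histogram those means instead.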