Analyzing the posterior prediction samples

Hello.

I have a simple one-variable GLM whose aim is to compare a population with an alert flag to one without. I’ve gotten the model to compile and sample from the data successfully.

  1. How do I pull posterior predictive samples?
  2. How do I show the two predictive distributions, and their means, for the population with the flag and the one without?

Below is my code:


reinstatement_model = """
data {
    int<lower=0> n; // number policy term years
    int<lower=0> NonCatcvrcnt[n]; // claims
    int<lower=0, upper = 1> alertflag[n]; //alert flag
}

parameters {
    real<lower=0> mu;
    real beta;

}

model {
       mu ~ normal(0,3);
       beta ~ normal(0,1);
       NonCatcvrcnt ~ poisson_log(mu + alertflag[n]*beta);
}


"""

fit = pystan.StanModel(model_code=reinstatement_model)

samples = fit.sampling(data=df_group_dict, iter = 5000,
                       chains = 4, warmup = 500, thin = 1,
                       seed = 101)

I don’t recommend warmup=500, iter=5000. If you only need 500 warmup, I doubt you need more than 500 iterations to save.
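For example, something like this (a sketch keeping the rest of your call the same) would save 500 post-warmup iterations per chain:

samples = fit.sampling(data=df_group_dict, iter = 1000,
                       chains = 4, warmup = 500, thin = 1,
                       seed = 101)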


Thank you. I will implement that when I rerun this model or develop a new one. Do you know how to pull posterior predictions?

Sorry, I don’t know Python! I hope one of the cmdstanpy developers can point you to the relevant documentation.

Hey @Jordan_Howell, to get draws from the posterior predictive distribution you’ll need to add a generated quantities block to your Stan program. The relevant chapter is in the Stan user’s guide; in particular, check out sections 26.4 and 26.5.

Once you have a generated quantities block the variables you define there will be included with the posterior draws of all the other variables (parameters) in your model.
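On the Python side, something like this sketch would pull those draws and compare the two groups, assuming PyStan 2, a generated quantities variable named y_hat, and the fit object samples from your code above:

import numpy as np
import matplotlib.pyplot as plt

draws = samples.extract(permuted=True)   # dict of arrays keyed by variable name
y_hat = draws['y_hat']                   # shape (num_draws, n)

# split the predictive draws by the observed flag
flag = np.asarray(df_group_dict['alertflag']) == 1
y_flag = y_hat[:, flag].ravel()
y_noflag = y_hat[:, ~flag].ravel()

print('mean with flag:   ', y_flag.mean())
print('mean without flag:', y_noflag.mean())

plt.hist(y_noflag, bins=30, alpha=0.5, density=True, label='no flag')
plt.hist(y_flag, bins=30, alpha=0.5, density=True, label='flag')
plt.legend()
plt.show()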

One other thing I noticed about your Stan program (unrelated to posterior predictions): in your model block,

NonCatcvrcnt ~ poisson_log(mu + alertflag[n]*beta);

will only use the nth element of alertflag and not the entire array. I think you want something like

NonCatcvrcnt ~ poisson_log(mu + to_vector(alertflag)*beta);

or alternatively you can declare alertflag as a vector type and then you wouldn’t need the to_vector conversion (you need it now because alertflag is an array and not a vector the way you have it).


Thank you, @jonah.

When I added the generated quantities block and compiled the code, I got an error when trying to sample. Below is the updated model.

reinstatement_model = """
data {
    int<lower=0> n; // number policy term years
    int<lower=0>  NonCatcvrcnt[n]; // claims
    vector[n] alertflag; //alert flag
}

parameters {
    real<lower=0> mu;
    real beta;

}

model {
       mu ~ normal(0,3);
       beta ~ normal(0,1);
       NonCatcvrcnt ~ poisson_log(mu + alertflag*beta);
}

generated quantities {
    int y_hat[n];

    for (i in 1:n) {
        y_hat = poisson_log_rng(mu+alertflag*beta);
    }
}

"""

fit = pystan.StanModel(model_code=reinstatement_model)

samples = fit.sampling(data=df_group_dict, iter = 1000,
                       chains = 4, warmup = 200, thin = 1,
                       seed = 101)

The error I get is the following:

RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566.pyx", line 373, in stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566._call_sampler_star
  File "stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566.pyx", line 406, in stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566._call_sampler
RuntimeError: std::bad_alloc
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-13-65ef346ba5cf>", line 3, in <module>
    seed = 101)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 813, in sampling
    ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 85, in _map_parallel
    map_result = pool.map(function, args)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 657, in get
    raise self._value

RuntimeError: std::bad_alloc

That’s a weird error message. I think you should be getting a more informative message than that. @mitzimorris Is there an error message from Stan that’s getting suppressed here for some reason?

One thing I noticed in the code you added that might fix it: since you’re using a loop in generated quantities, you now need to index alertflag there (and assign to y_hat[i]); in the model block the statement is vectorized, so you don’t need the index. So this

y_hat = poisson_log_rng(mu+alertflag*beta);

should be

y_hat[i] = poisson_log_rng(mu+alertflag[i]*beta);

That should work. Or I think you could vectorize (if you’re using a pretty recent version of Stan) with

generated quantities {
    int y_hat[n] = poisson_log_rng(mu+alertflag*beta);
}

poisson_log_rng in the generated quantities block should always be called after checking for overflow, cf: 15.6 Poisson distribution, log parameterization | Stan Functions Reference

R poisson_log_rng (reals alpha)
Generate a Poisson variate with log rate alpha; may only be used in transformed data and generated quantities blocks. alpha must be less than 30 log 2. For a description of argument and return types, see section vectorized function signatures.

e.g.

generated quantities {
  vector[N] eta = mu + alertflag * beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));  
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
      for (n in 1:N)
        y_rep[n] = poisson_log_rng(eta[n]);
  }
}
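For reference, that documented bound works out to about 20.79, so the max(eta) > 20 check above is just a slightly conservative version of it:

import math
print(30 * math.log(2))   # 20.794..., the overflow bound on alpha for poisson_log_rng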

Is this something that could be suggested by the pedantic mode?


Would pedantic mode flag the above code? Can the checker link up the test max(eta) > 20 (which isn’t quite 30 log 2, btw) and the call poisson_log_rng(eta[n])?

Related question: has poisson_log_rng been vectorized?

Thank you. I tried the above and got this:

RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786.pyx", line 373, in stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786._call_sampler_star
  File "stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786.pyx", line 406, in stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786._call_sampler
RuntimeError: std::bad_alloc
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-17-65ef346ba5cf>", line 3, in <module>
    seed = 101)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 813, in sampling
    ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 85, in _map_parallel
    map_result = pool.map(function, args)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 657, in get
    raise self._value

RuntimeError: std::bad_alloc

Could it be the size of my data? 5,399,764 observations

That sure looks like an OOM error.

Altogether you’re allocating 1 double vector plus 2 int arrays, i.e. 16 bytes * 5.4M, which is roughly 86 MB, plus whatever else is needed by the model. Seems doable, if you’ve got enough memory. Do you?

I have 64 GB of memory. I dropped the model to 1 chain and I’m still getting this error. I’m using Python 3.7 in the Spyder IDE on Windows.

Here is the entirety of the code, after putting the data into a Python dictionary.

reinstatement_model = """
data {
    int<lower=0> N; // number policy term years
    int<lower=0>  NonCatcvrcnt[N]; // claims
    vector[N] alertflag; //alert flag
}

parameters {
    real<lower=0> mu;
    real beta;

}

model {
       mu ~ normal(0,3);
       beta ~ normal(0,1);
       NonCatcvrcnt ~ poisson_log(mu + alertflag*beta);
}

generated quantities {
    vector[N] eta = mu + alertflag * beta;
    int y_rep[N];
    if (max(eta) > 20) {
        // avoid overflow in poisson_log_rng
        print("max eta too big: ", max(eta));
        for (n in 1:N)
            y_rep[n] = -1;
    } else {
        for (n in 1:N)
            y_rep[n] = poisson_log_rng(eta[n]);
    }
}

"""

fit = pystan.StanModel(model_code=reinstatement_model)

samples = fit.sampling(data=df_group_dict, iter = 1000,
                       chains = 1)

Working primarily in Python, I’m just not familiar with memory management. Thank you for the help thus far.

Are you using CmdStanPy or PyStan? PyStan uses a whole lot more memory: it keeps everything in memory both in C++ and in Python. CmdStanPy lets you keep all your input data on disk and sends everything to CmdStan that way.
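A rough back-of-the-envelope for the iter=1000, warmup=200, chains=4 run above (a sketch, assuming roughly 8 bytes per stored value and that every generated quantities value is kept for every post-warmup draw):

N = 5_399_764                    # observations
saved_draws = (1000 - 200) * 4   # post-warmup draws across 4 chains
values_per_draw = 2 * N          # eta (vector[N]) plus y_rep (int[N]) from generated quantities
gb = saved_draws * values_per_draw * 8 / 1e9
print(gb)                        # roughly 276 GB of generated-quantities draws, far beyond 64 GB

Even a single chain keeps tens of GB of these draws in memory, which would be consistent with the error persisting at chains=1.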

PyStan. I’d never heard of CmdStanPy. I’ll download it now and try that. Thank you.

Great.

The latest CmdStanPy is the 1.0.0 release candidate, rc1. You can download it via this link: https://anaconda.org/conda-forge/cmdstanpy/1.0.0rc1/download/noarch/cmdstanpy-1.0.0rc1-pyhd036fef_0.tar.bz2

Documentation is here: cmdstanpy – Python interface to CmdStan — CmdStanPy 1.0.0rc1 documentation
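The basic workflow looks roughly like this (a sketch, assuming the Stan program above is saved to reinstatement_model.stan and the data dictionary is df_group_dict):

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file='reinstatement_model.stan')   # compiles via CmdStan
fit = model.sample(data=df_group_dict, chains=4,
                   iter_warmup=200, iter_sampling=500, seed=101)

y_rep = fit.stan_variable('y_rep')   # generated quantities draws, shape (draws, N)
print(fit.summary().head())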


After I ran the conda install, I ran the example Bernoulli model and got the following error:

stan_file = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
Traceback (most recent call last):

  File "<ipython-input-7-aed4b5fa7d1e>", line 1, in <module>
    stan_file = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\cmdstanpy\utils.py", line 157, in cmdstan_path
    validate_cmdstan_path(cmdstan)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\cmdstanpy\utils.py", line 113, in validate_cmdstan_path
    'no CmdStan binaries found, '

ValueError: no CmdStan binaries found, run command line script "install_cmdstan"

So I ran the following:

import cmdstanpy
cmdstanpy.install_cmdstan()

and got the following:

Installing CmdStan version: 2.28.1
Install directory: C:\Users\JORDAN.HOWELL.GITDIR\.cmdstan
Downloading CmdStan version 2.28.1
Download successful, file: C:\Users\JORDAN~1.GIT\AppData\Local\Temp\tmpg10cmmcl
Unpacked download as cmdstan-2.28.1
Building version cmdstan-2.28.1
WARNING:cmdstanpy:CmdStan installation failed.
Command "make build" failed
'cut' is not recognized as an internal or external command,
operable program or batch file.
'cut' is not recognized as an internal or external command,
operable program or batch file.
process_begin: CreateProcess(NULL, expr >= 8, ...) failed.
mingw32-make: makefile:114: pipe: Bad file descriptor
process_begin: CreateProcess(NULL, cp bin/windows-stanc bin/stanc.exe, ...) failed.
make (e=2): The system cannot find the file specified.
mingw32-make: *** [bin/stanc.exe] Error 2
deleting tmpfiles dir: C:\Users\JORDAN~1.GIT\AppData\Local\Temp\tmpnn94nu8u
done

Not sure where to go from here.

Disregard. Putting cmdstanpy in its own environment worked.


does the model run without errors?

Well. It runs and says each chain is done, then throws this error.

Chain 1 -   done: 100%|██████████| 1200/1200 [4:20:00<00:00, 13.60s/it]
Chain 2 - sample:  92%|█████████▏| 1100/1200 [4:32:46<21:11, 12.71s/it]
Chain 2 - sample: 100%|██████████| 1200/1200 [4:49:26<00:00, 11.89s/it]
Chain 1 -   done: 100%|██████████| 1200/1200 [4:49:34<00:00, 14.48s/it]
Chain 2 -   done: 100%|██████████| 1200/1200 [4:49:36<00:00, 14.48s/it]
Chain 3 -   done: 100%|██████████| 1200/1200 [4:49:36<00:00, 14.48s/it]
Chain 4 -   done: 100%|██████████| 1200/1200 [4:49:35<00:00, 14.48s/it]
Traceback (most recent call last):

  File "C:\Users\JORDAN~1.GIT\AppData\Local\Temp/ipykernel_400/1455787158.py", line 1, in <module>
    fit = my_model.sample(data=df_group_dict, iter_warmup=200, iter_sampling =1000,

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\model.py", line 887, in sample
    mcmc = CmdStanMCMC(runset)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\stanfit.py", line 496, in __init__
    config = self._validate_csv_files()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\stanfit.py", line 727, in _validate_csv_files
    dzero = check_sampler_csv(

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\utils.py", line 521, in check_sampler_csv
    meta = scan_sampler_csv(path, is_fixed_param)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\utils.py", line 577, in scan_sampler_csv
    lineno = scan_sampling_iters(fd, dict, lineno)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\utils.py", line 847, in scan_sampling_iters
    raise ValueError(

ValueError: line 51: bad draw, expecting 10799537 items, found 678606

I’ve done this running both a standard model and a parallel model, and about four hours were wasted each time. Have you seen this? The data is fed in via a Python dictionary.
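For what it’s worth, the expected item count lines up with the size of a single draw from this model (a quick check, assuming the standard 7 sampler diagnostic columns plus mu and beta):

N = 5_399_764
expected = 7 + 2 + 2 * N   # sampler diagnostics + (mu, beta) + eta[N] + y_rep[N]
print(expected)            # 10799537, which matches the count in the error message

So it looks like line 51 of one chain’s output CSV was written incompletely.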