Analyzing the posterior prediction samples

Hello.

I have a simple one-variable GLM whose aim is to compare a population with an alert flag to one without. I’ve gotten the model to compile and sample from the data successfully.

  1. How do I pull posterior predictive samples?
  2. How do I show the two predictive distributions, and their means, for the population with the flag and the one without?

Below is my code:


reinstatement_model = """
data {
    int<lower=0> n; // number policy term years
    int<lower=0> NonCatcvrcnt[n]; // claims
    int<lower=0, upper = 1> alertflag[n]; //alert flag
}

parameters {
    real<lower=0> mu;
    real beta;

}

model {
       mu ~ normal(0,3);
       beta ~ normal(0,1);
       NonCatcvrcnt ~ poisson_log(mu + alertflag[n]*beta);
}


"""

fit = pystan.StanModel(model_code=reinstatement_model)

samples = fit.sampling(data=df_group_dict, iter = 5000,
                       chains = 4, warmup = 500, thin = 1,
                       seed = 101)

I don’t recommend warmup=500, iter=5000. If you only need 500 warmup, I doubt you need more than 500 iterations to save.
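For example, something like this (a sketch keeping the rest of your call the same) would save 500 post-warmup iterations per chain:

samples = fit.sampling(data=df_group_dict, iter = 1000,
                       chains = 4, warmup = 500, thin = 1,
                       seed = 101)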


Thank you. I will implement that when I rerun this model or develop a new one. Do you know how to pull posterior predictions?

Sorry, I don’t know Python! I hope one of the cmdstanpy developers can point you to the relevant documentation.

Hey @Jordan_Howell, to get draws from the posterior predictive distribution you’ll need to add a generated quantities block to your Stan program. The relevant chapter is in the Stan user’s guide; in particular, check out sections 26.4 and 26.5.

Once you have a generated quantities block the variables you define there will be included with the posterior draws of all the other variables (parameters) in your model.
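On the Python side, something like this sketch would pull those draws and compare the two groups, assuming PyStan 2, a generated quantities variable named y_hat, and the fit object samples from your code above:

import numpy as np
import matplotlib.pyplot as plt

draws = samples.extract(permuted=True)   # dict of arrays keyed by variable name
y_hat = draws['y_hat']                   # shape (num_draws, n)

# split the predictive draws by the observed flag
flag = np.asarray(df_group_dict['alertflag']) == 1
y_flag = y_hat[:, flag].ravel()
y_noflag = y_hat[:, ~flag].ravel()

print('mean with flag:   ', y_flag.mean())
print('mean without flag:', y_noflag.mean())

plt.hist(y_noflag, bins=30, alpha=0.5, density=True, label='no flag')
plt.hist(y_flag, bins=30, alpha=0.5, density=True, label='flag')
plt.legend()
plt.show()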

One other thing I noticed about your Stan program (unrelated to posterior predictions): in your model block,

NonCatcvrcnt ~ poisson_log(mu + alertflag[n]*beta);

will only use the nth element of alertflag and not the entire array. I think you want something like

NonCatcvrcnt ~ poisson_log(mu + to_vector(alertflag)*beta);

or alternatively you can declare alertflag as a vector type and then you wouldn’t need the to_vector conversion (you need it now because alertflag is an array and not a vector the way you have it).


Thank you, @jonah.

When I added the generated quantities block and compiled the code, I got an error when trying to sample. Below is the updated model.

reinstatement_model = """
data {
    int<lower=0> n; // number policy term years
    int<lower=0>  NonCatcvrcnt[n]; // claims
    vector[n] alertflag; //alert flag
}

parameters {
    real<lower=0> mu;
    real beta;

}

model {
       mu ~ normal(0,3);
       beta ~ normal(0,1);
       NonCatcvrcnt ~ poisson_log(mu + alertflag*beta);
}

generated quantities {
    int y_hat[n];

    for (i in 1:n) {
        y_hat = poisson_log_rng(mu+alertflag*beta);
    }
}

"""

fit = pystan.StanModel(model_code=reinstatement_model)

samples = fit.sampling(data=df_group_dict, iter = 1000,
                       chains = 4, warmup = 200, thin = 1,
                       seed = 101)

The error I get is the following:

RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566.pyx", line 373, in stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566._call_sampler_star
  File "stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566.pyx", line 406, in stanfit4anon_model_f329bc7ad7396a8e372d31ea21f5a0f6_4091579692383103566._call_sampler
RuntimeError: std::bad_alloc
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-13-65ef346ba5cf>", line 3, in <module>
    seed = 101)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 813, in sampling
    ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 85, in _map_parallel
    map_result = pool.map(function, args)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 657, in get
    raise self._value

RuntimeError: std::bad_alloc

That’s a weird error message. I think you should be getting a more informative message than that. @mitzimorris Is there an error message from Stan that’s getting suppressed here for some reason?

One thing I noticed in the code you added that might fix it: since you’re using a loop in generated quantities, you now need to index alertflag there (and assign to y_hat[i]); in the model block the statement is vectorized, so you don’t need the index. So this

y_hat = poisson_log_rng(mu+alertflag*beta);

should be

y_hat[i] = poisson_log_rng(mu+alertflag[i]*beta);

That should work. Or I think you could vectorize (if you’re using a pretty recent version of Stan) with

generated quantities {
    int y_hat[n] = poisson_log_rng(mu+alertflag*beta);
}

poisson_log_rng in the generated quantities block should always be called after checking for overflow, cf: 15.6 Poisson distribution, log parameterization | Stan Functions Reference

R poisson_log_rng (reals alpha)
Generate a Poisson variate with log rate alpha; may only be used in transformed data and generated quantities blocks. alpha must be less than 30 log 2. For a description of argument and return types, see section vectorized function signatures.

e.g.

generated quantities {
  vector[N] eta = mu + alertflag * beta;
  int y_rep[N];
  if (max(eta) > 20) {
    // avoid overflow in poisson_log_rng
    print("max eta too big: ", max(eta));  
    for (n in 1:N)
      y_rep[n] = -1;
  } else {
      for (n in 1:N)
        y_rep[n] = poisson_log_rng(eta[n]);
  }
}
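For reference, that documented bound works out to about 20.79, so the max(eta) > 20 check above is just a slightly conservative version of it:

import math
print(30 * math.log(2))   # 20.794..., the overflow bound on alpha for poisson_log_rng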

Is this something that could be suggested by the pedantic mode?


Would pedantic mode flag the above code? Can the checker link up the test max(eta) > 20 (which isn’t quite 30 log 2, btw) and the call poisson_log_rng(eta[n])?

Related question: has poisson_log_rng been vectorized?

Thank you. I tried the above and got this:

RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786.pyx", line 373, in stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786._call_sampler_star
  File "stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786.pyx", line 406, in stanfit4anon_model_6f5fdd20c1638b73d3630dd4db878822_5143144809530600786._call_sampler
RuntimeError: std::bad_alloc
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-17-65ef346ba5cf>", line 3, in <module>
    seed = 101)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 813, in sampling
    ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Roaming\Python\Python37\site-packages\pystan\model.py", line 85, in _map_parallel
    map_result = pool.map(function, args)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\multiprocessing\pool.py", line 657, in get
    raise self._value

RuntimeError: std::bad_alloc

Could it be the size of my data? 5,399,764 observations

That sure looks like an OOM error.

Altogether you’re allocating 1 double vector plus 2 int arrays, i.e. 16 bytes * 5.4M, which is roughly 86 MB, plus whatever else is needed by the model. Seems doable, if you’ve got enough memory. Do you?

I have 64 GB of memory. I dropped the model to 1 chain and I’m still getting this error. I’m using Python 3.7 in the Spyder IDE on Windows.

Here is the entirety of the code, after putting the data into a Python dictionary.

reinstatement_model = """
data {
    int<lower=0> N; // number policy term years
    int<lower=0>  NonCatcvrcnt[N]; // claims
    vector[N] alertflag; //alert flag
}

parameters {
    real<lower=0> mu;
    real beta;

}

model {
       mu ~ normal(0,3);
       beta ~ normal(0,1);
       NonCatcvrcnt ~ poisson_log(mu + alertflag*beta);
}

generated quantities {
    vector[N] eta = mu + alertflag * beta;
    int y_rep[N];
    if (max(eta) > 20) {
        // avoid overflow in poisson_log_rng
        print("max eta too big: ", max(eta));
        for (n in 1:N)
            y_rep[n] = -1;
    } else {
        for (n in 1:N)
            y_rep[n] = poisson_log_rng(eta[n]);
    }
}

"""

fit = pystan.StanModel(model_code=reinstatement_model)

samples = fit.sampling(data=df_group_dict, iter = 1000,
                       chains = 1)

Working primarily in Python, I’m just not familiar with memory management. Thank you for the help thus far.

Are you using CmdStanPy or PyStan? PyStan uses a whole lot more memory: it keeps everything in memory both in C++ and in Python. CmdStanPy lets you keep all your input data on disk and sends everything to CmdStan that way.
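A rough back-of-the-envelope for the iter=1000, warmup=200, chains=4 run above (a sketch, assuming roughly 8 bytes per stored value and that every generated quantities value is kept for every post-warmup draw):

N = 5_399_764                    # observations
saved_draws = (1000 - 200) * 4   # post-warmup draws across 4 chains
values_per_draw = 2 * N          # eta (vector[N]) plus y_rep (int[N]) from generated quantities
gb = saved_draws * values_per_draw * 8 / 1e9
print(gb)                        # roughly 276 GB of generated-quantities draws, far beyond 64 GB

Even a single chain keeps tens of GB of these draws in memory, which would be consistent with the error persisting at chains=1.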

PyStan. I’d never heard of CmdStanPy. I’ll download it now and try that. Thank you.

Great.

The latest CmdStanPy is the 1.0.0 release candidate, rc1. You can download it via this link: https://anaconda.org/conda-forge/cmdstanpy/1.0.0rc1/download/noarch/cmdstanpy-1.0.0rc1-pyhd036fef_0.tar.bz2

Documentation is here: cmdstanpy – Python interface to CmdStan — CmdStanPy 1.0.0rc1 documentation
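The basic workflow looks roughly like this (a sketch, assuming the Stan program above is saved to reinstatement_model.stan and the data dictionary is df_group_dict):

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file='reinstatement_model.stan')   # compiles via CmdStan
fit = model.sample(data=df_group_dict, chains=4,
                   iter_warmup=200, iter_sampling=500, seed=101)

y_rep = fit.stan_variable('y_rep')   # generated quantities draws, shape (draws, N)
print(fit.summary().head())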


After I ran the conda install, I ran the example Bernoulli model and got the following error:

stan_file = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
Traceback (most recent call last):

  File "<ipython-input-7-aed4b5fa7d1e>", line 1, in <module>
    stan_file = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\cmdstanpy\utils.py", line 157, in cmdstan_path
    validate_cmdstan_path(cmdstan)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\stan_env\lib\site-packages\cmdstanpy\utils.py", line 113, in validate_cmdstan_path
    'no CmdStan binaries found, '

ValueError: no CmdStan binaries found, run command line script "install_cmdstan"

So I ran the following:

import cmdstanpy
cmdstanpy.install_cmdstan()

and got the following:

Installing CmdStan version: 2.28.1
Install directory: C:\Users\JORDAN.HOWELL.GITDIR\.cmdstan
Downloading CmdStan version 2.28.1
Download successful, file: C:\Users\JORDAN~1.GIT\AppData\Local\Temp\tmpg10cmmcl
Unpacked download as cmdstan-2.28.1
Building version cmdstan-2.28.1
WARNING:cmdstanpy:CmdStan installation failed.
Command "make build" failed
'cut' is not recognized as an internal or external command,
operable program or batch file.
'cut' is not recognized as an internal or external command,
operable program or batch file.
process_begin: CreateProcess(NULL, expr >= 8, ...) failed.
mingw32-make: makefile:114: pipe: Bad file descriptor
process_begin: CreateProcess(NULL, cp bin/windows-stanc bin/stanc.exe, ...) failed.
make (e=2): The system cannot find the file specified.
mingw32-make: *** [bin/stanc.exe] Error 2
deleting tmpfiles dir: C:\Users\JORDAN~1.GIT\AppData\Local\Temp\tmpnn94nu8u
done

Not sure where to go from here.

Disregard. Putting cmdstanpy in its own environment worked.


does the model run without errors?

Well. It runs and says each chain is done, then throws this error.

Chain 1 -   done: 100%|██████████| 1200/1200 [4:20:00<00:00, 13.60s/it]
Chain 2 - sample:  92%|█████████▏| 1100/1200 [4:32:46<21:11, 12.71s/it]
Chain 2 - sample: 100%|██████████| 1200/1200 [4:49:26<00:00, 11.89s/it]
Chain 1 -   done: 100%|██████████| 1200/1200 [4:49:34<00:00, 14.48s/it]
Chain 2 -   done: 100%|██████████| 1200/1200 [4:49:36<00:00, 14.48s/it]
Chain 3 -   done: 100%|██████████| 1200/1200 [4:49:36<00:00, 14.48s/it]
Chain 4 -   done: 100%|██████████| 1200/1200 [4:49:35<00:00, 14.48s/it]
Traceback (most recent call last):

  File "C:\Users\JORDAN~1.GIT\AppData\Local\Temp/ipykernel_400/1455787158.py", line 1, in <module>
    fit = my_model.sample(data=df_group_dict, iter_warmup=200, iter_sampling =1000,

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\model.py", line 887, in sample
    mcmc = CmdStanMCMC(runset)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\stanfit.py", line 496, in __init__
    config = self._validate_csv_files()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\stanfit.py", line 727, in _validate_csv_files
    dzero = check_sampler_csv(

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\utils.py", line 521, in check_sampler_csv
    meta = scan_sampler_csv(path, is_fixed_param)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\utils.py", line 577, in scan_sampler_csv
    lineno = scan_sampling_iters(fd, dict, lineno)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\cmdstanpy\utils.py", line 847, in scan_sampling_iters
    raise ValueError(

ValueError: line 51: bad draw, expecting 10799537 items, found 678606

I’ve done this running both a standard model and a parallel model, and about four hours were wasted each time. Have you seen this? The data is fed in via a Python dictionary.
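For what it’s worth, the expected item count lines up with the size of a single draw from this model (a quick check, assuming the standard 7 sampler diagnostic columns plus mu and beta):

N = 5_399_764
expected = 7 + 2 + 2 * N   # sampler diagnostics + (mu, beta) + eta[N] + y_rep[N]
print(expected)            # 10799537, which matches the count in the error message

So it looks like line 51 of one chain’s output CSV was written incompletely.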