CmdStanPy with Arviz PPC plot differs from CmdStanR with bayesplot

nels · July 15, 2021, 8:38am

I’m taking part in the Learn Bayesian Data Analysis and Stan with Jonah Gabry course.

Wanted to duplicate the examples given in CmdStanR & bayesplot using CmdStanPy & ArviZ but running into an issue with the PPC plotting.

Using the same Stan file and the same data in both R & Python. (The provided .RDS file is read into Python using the rpy2 package)

In R

y_rep <- fit_simple_pois$draws("y_rep", format = "matrix")
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])

results in

while in Python

az_data = az.from_cmdstanpy(
    posterior=fit_simple_pois,
    posterior_predictive='y_rep',
    observed_data={'complaints': pest_data.complaints},
)

az.plot_ppc(az_data, data_pairs={'complaints': 'y_rep'}

results in
Screenshot_20210715_093242

Very confused by the shape of the ArviZ plot.
Any suggestions?

The Stan code is

data {
  int<lower=1> N;
  vector<lower=0>[N] traps;  // vector implies real numbers
  array[N] int<lower=0> complaints; 
}
parameters {
  real alpha;
  real beta;
}
model {
  complaints ~ poisson_log(alpha + beta * traps);
  alpha ~ normal(log(7), 1);
  beta ~ normal(-0.25, 0.5);
}
generated quantities {
  array[N] int y_rep = poisson_log_rng(alpha + beta * traps);
}

The data in R

  building_id       date traps floors sq_footage_p_floor live_in_super monthly_average_rent average_tenant_age age_of_building total_sq_foot month complaints log_sq_foot_1e4
1          37 2017-01-15     8      8           5149.008             0             3846.949           53.87742              47      41192.06     1          1        1.415661
2          37 2017-02-14     8      8           5149.008             0             3846.949           53.87742              47      41192.06     2          3        1.415661
3          37 2017-03-16     9      8           5149.008             0             3846.949           53.87742              47      41192.06     3          0        1.415661
4          37 2017-04-15    10      8           5149.008             0             3846.949           53.87742              47      41192.06     4          1        1.415661
5          37 2017-05-15    11      8           5149.008             0             3846.949           53.87742              47      41192.06     5          0        1.415661
6          37 2017-06-14    11      8           5149.008             0             3846.949           53.87742              47      41192.06     6          0        1.415661

and in a Pandas DF in Python

building_id	date	traps	floors	sq_footage_p_floor	live_in_super	monthly_average_rent	average_tenant_age	age_of_building	total_sq_foot	month	complaints	log_sq_foot_1e4
1	37	17181.00	8.00	8.00	5149.01	0.00	3846.95	53.88	47.00	41192.06	1.00	1.00	1.42
2	37	17211.00	8.00	8.00	5149.01	0.00	3846.95	53.88	47.00	41192.06	2.00	3.00	1.42
3	37	17241.00	9.00	8.00	5149.01	0.00	3846.95	53.88	47.00	41192.06	3.00	0.00	1.42
4	37	17271.00	10.00	8.00	5149.01	0.00	3846.95	53.88	47.00	41192.06	4.00	1.00	1.42
5	37	17301.00	11.00	8.00	5149.01	0.00	3846.95	53.88	47.00	41192.06	5.00	0.00	1.42

CmdStanR v0.4.0
bayesplot v1.8.1
CmdStanPy v0.9.76
ArViz v0.11.2

robertgrant · July 15, 2021, 8:56am

It looks like a difference in smoothing, like kernel bandwidth, rather than different numbers underneath. Remember that y and y_rep are natural numbers 0, 1, 2, 3…
Maybe try plotting the draws yourself with a basic kernel density or spline function, and tweak the smoothing.

ahartikainen · July 15, 2021, 11:27am

When creating the inferencedata object add dtype info

import arviz as az
idata = az.from_cmdstanpy(
    posterior=fit, 
    posterior_predictive=["y_rep"], 
    dtypes={"y_rep": int}
)

(Also make sure you use the latest github version, we should update pypi at some point)

pip install git+https://github.com/arviz-devs/arviz

nels · July 15, 2021, 12:50pm

Thanks @robertgrant and @ahartikainen

Using seaborn’s displot function I can plot the draws myself, increasing the smoothing to get something closer to the bayesplot output

With the latest ArviZ installed from GitHub which allows the dtypes argument I get
Screenshot_20210715_134506

New to ArviZ and was probably expecting a drop-in equivalent for the PPC density overlay plot from the bayesplot :-)

robertgrant · July 15, 2021, 1:15pm

It’s a nice example of the difficult choices and unavoidable compromises in showing multiple densities simultaneously. There is a spaghetti problem of too many lines all together – you can’t see which one is going where. You can smooth but that introduces artefacts like negative y/y_rep, and you have a subjective choice of smoothing parameter, I mean what’s to stop someone from smoothing out the feature they don’t like and keeping the one they do? You can show the proportions at discrete values, and I like the look of that arviz plot but it’s hard to see which step curve goes where. Or you can discard the lines and show some summary of them instead (e.g. the last chart in section 1.4.3 in Towards A Principled Bayesian Workflow). It might best serve the analyst to look at all of them!

nels · July 16, 2021, 8:52pm

Thank you @robertgrant!

Topic		Replies	Views
Getting an error I'm not Familiar with Trying to plot the ppc General	5	1277	January 27, 2022
Posterior predictive plot in when using PyStan General pystan , python , arviz	4	1901	June 4, 2021
ArviZ - Exploratory analysis of Bayesian models in Python for PyStan and CmdStan General loo , arviz	1	2977	January 9, 2019
PPCs in ArviZ? Interfaces cmdstanpy , arviz	1	806	April 28, 2021
Trying to plot posterior Predictive plot using az.from_cmdstanpy() PyStan	10	996	September 19, 2022

CmdStanPy with Arviz PPC plot differs from CmdStanR with bayesplot

Related topics