Hi Martin,
Thanks for replying!
You pointed out many mistakes (thanks!); I'll be a bit more careful this time. Also, thanks for the tip about the neg_binomial_2 distribution!
m_code = """
data {
int<lower=0> n_data;
int<lower=0> n_predictions;
real<lower=0> y[n_data];
}
parameters {
real<lower=0, upper=0.015> beta;
}
model {
//beta ~ uniform(0, 0.15);
y ~ exponential_lpdf(beta);
}
generated quantities{
vector[n_predictions] y_hat;
vector[n_data] log_lik;
for(i in 1:n_predictions)
y_hat[i] = exponential_rng(beta);
for(i in 1:n_data)
log_lik[i] = exponential_lpdf(y[i] | beta);
}
"""
For reference, my imports and versions:
import pystan
import matplotlib.pyplot as plt
import numpy as np
import arviz as az
import corner
import time
%matplotlib inline
print(az.__version__)
print(pystan.__version__)
print(corner.__version__)
0.8.3
2.19.1.1
2.0.1
Create an array with my (small) data set. As I said earlier, it's taken from an MCA (multi-channel analyser); this device bins data, so these numbers are bin contents.
I then remove the first few bins in the set, which are full of noise, and make a normalised data set:
data = np.array([
0,6,11,0,1,1076,2142,3324,5786,10408,15185,15592,11576,7722,4812,2549,1385,703,333,210,121,
78,73,59,56,47,37,39,26,38,47,39,30,27,46,38,26,31,36,31,29,43,33,35,33,34,33,32,28,29,29,
30,26,34,33,28,29,34,25,26,31,32,35,15,29,19,21,23,20,31,23,28,18,32,31,30,31,14,29,18,19,
19,23,17,24,24,25,31,28,22,25,17,17,21,16,28,21,28,19,16,22,25,24,11,18,20,19,11,13,13,25,
18,21,16,20,13,16,12,11,13,13,17,19,17,17,10,8,12,11,14,18,12,17,16,11,20,18,10,10,12,15,17,
11,17,9,11,13,10,15,17,16,15,17,12,12,10,12,14,16,5,10,10,13,9,12,10,12,10,7,13,11,8,10,16,
7,5,12,11,7,9,11,15,7,9,12,8,10,10,7,3,10,6,10,5,4,9,6,6,8,8,7,5,13,2,8,7,6,7,11,4,6,6,7,12,
7,7,4,4,4,4,5,9,8,11,3,6,5,5,5,7,14,4,4,6,6,6,8,1,7,4,7,8,8,1,7,3,6,5,7,6,5,1,5,4,5,3,4,6,7,
5,4,8,4,3,2,2,1,9,5,6,3,12,5,7,6,4,5,4,8,3,1,2,6,7,4,5,4,5,0,3,3,3,5,1,4,1,1,1,4,4,4,2,3,2,
5,2,1,1,1,3,1,2,3,4,3,1,3,2,3,3,4,1,2,2,2,1,0,3,4,2,1,1,1,4,0,3,2,1,2,1,6,2,1,1,2,2,5,4,2,3,
1,1,1,2,0,5,0,4,2,2,1,1,1,3,0,2,1,1,1,2,2,1,2,3,0,1,2,1,2,2,1,0,1,3,2,1,2,0,2,2,2,1,1,2,5,3,
2,1,2,1,2,4,0,3,0,2,1,1,0,1,2,3,2,1,0,1,1,2,3,1,1,2,1,3,5,0,0,2,0,1,0,0,1,0,0,0,0,1,0,0,1,1,
0,0,1,0,1,2,0,1,0,1,0,0,0,0,2,0,0,0,1,0,1,1,1,1,0,1,1,1,2,1,1,0,2,0,2,2,2,0,2,1,0,1,0,0,2,0,
1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0])
data_trunc = data[25:]  # truncate set to remove noise
data_trunc_n = data_trunc / np.sum(data_trunc)  # normalize
bins = np.arange(len(data_trunc_n))
Later I find (especially when making the arviz.plot_ppc plot) that it's easier if I deconstruct the binned data and work with sampled events, so I multiply through the bins and produce events:
data_i = []
for i, n_i in enumerate(data_trunc):  # i = bin index, n_i = counts in that bin
    for ii in range(n_i):
        data_i.append(i)  # one event per count, valued at the bin index
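As an aside, I believe np.repeat produces the same event list in one line (worth checking it matches):
data_i_alt = np.repeat(np.arange(len(data_trunc)), data_trunc)  # repeat each bin index by its count
assert list(data_i_alt) == data_i  # sanity check against the loop version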
A quick plot to visualise the data:
plt.rcParams["figure.figsize"] = [10,4]
fig, ax = plt.subplots()
plt.bar(bins, data_trunc, width=1)
plt.xlabel(r"Bins")
plt.ylabel(r"Counts")
plt.xlim([0, len(bins)])
#plt.ylim([1e-4, max(data_trunc_n)])
ax.set_yscale('log')
plt.show()
Compile the model:
starttime = time.time()
sm = pystan.StanModel(model_code=m_code)
print('Took {} seconds'.format(time.time() - starttime))
Took 40.821542263031006 seconds
Now begin fitting:
expo_data = {'n_data': len(data_i),
'y': data_i,
'n_predictions': 2000}
n_chains = 4
starttime = time.time()
fit = sm.sampling(data=expo_data,
iter=10000, warmup=5000,
chains=n_chains,
init=[dict(beta=0.009) for i in range(n_chains)])
print('Took {} seconds'.format(time.time() - starttime))
#print(fit.stansummary()) # comment this out, it takes forever when n_predictions is large
Took 23.4742693901062 seconds
Inference for Stan model: anon_model_c505dc25772a95ae8e4471abf9ed01b1.
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta 9.7e-3 1.8e-6 1.5e-4 9.4e-3 9.6e-3 9.7e-3 9.8e-3 0.01 7357 1.0
y_hat[1] 103.81 0.73 102.67 2.56 30.55 72.48 145.11 376.04 19805 1.0
(output truncated)
Use the ArviZ InferenceData structure for conveniently formatted data, especially when using multiple chains:
az_data = az.from_pystan(
posterior=fit,
observed_data=["y"],
posterior_predictive=["y_hat"],
log_likelihood={"y": "log_lik"}
)
Get the mean beta value, and also the value for beta using the optimizing method:
op = sm.optimizing(data=expo_data)
beta_fit_opt = op['beta']
beta_fit_mean = np.mean(np.concatenate(az_data.posterior['beta'].values)) # combine all chains
print("Best fit from optimize: {:.6f}, mean of betas: {:.6f}".format(beta_fit_opt, beta_fit_mean))
Best fit from optimize: 0.009713, mean of betas: 0.009714
The arviz.plot_trace shows this nicely:
az.plot_trace(fit, var_names=['beta'])
To view the observed data and the predicted posteriors, I use arviz.plot_ppc. I wanted the bins shown as well, so I plotted it together with a histogram of the data. I also wanted a y-axis. I couldn't find any options for these (is this the best way?)
az.style.use("arviz-darkgrid")
plt.rcParams["figure.figsize"] = [10,4]
ax = az.plot_ppc(az_data, kind='kde', data_pairs={"y":"y_hat"}, alpha=0.9)
ax[0].hist(az_data.observed_data["y"].values,
bins=bins,
density=True,
color='k', lw=2, alpha=0.5,
width=1,
zorder=100000)
ax[0].set_xlabel('y')
ax[0].set_ylabel('Density')
ax[0].set_yticks(np.arange(0, 0.25, 0.002))
ax[0].set_xlim([0, 400])  # an earlier set_xlim([0, 20]) was overridden, so only this one is kept
ax[0].set_ylim([0, 0.014])
ax[0].legend(loc='upper left', frameon=False, fontsize='small')
plt.show()
The posterior predictive 'mean' line has an artifact at x=0. Is there any way to fix this? Also, is there a way to edit the legend labels? (I looked into passing my own labels as some form of kwarg but couldn't see one to use.)
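One matplotlib-level workaround I found for the labels (not sure it's the intended route) is to grab the legend handles and relabel them; the replacement labels here are just examples:
handles, labels = ax[0].get_legend_handles_labels()  # artists drawn by az.plot_ppc
ax[0].legend(handles, ['observed counts', 'posterior predictive', 'mean'],  # example labels only
             loc='upper left', frameon=False, fontsize='small')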
Aside number 1 - Binned data
If I want to use the original binned data (data_trunc) in my Stan model, instead of the deconstructed events, I modify the model code:
for(i in 1:n_data)
  target += y[i] * exponential_lpdf(x[i] | beta);
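For completeness, my guess at the corresponding data block (y now holds the counts and x the bin positions):
data {
  int<lower=0> n_data;
  int<lower=0> n_predictions;
  int<lower=0> y[n_data];   // bin counts
  real<lower=0> x[n_data];  // bin positions
}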
and then fit with:
expo_data = {'n_data': len(data_trunc),
'y': data_trunc,
'x': bins,
'n_predictions': 1000}
But I couldn't see any way of getting arviz.plot_ppc to work with binned data. Is there a way?
I could replace the 'observed_data' in the ArviZ inference data with the sampled events generated from the binned data (though then I really may as well use those from the start):
observed_histogram = az_data.observed_data['y']
az_data_old = az.convert_to_inference_data(observed_histogram, group='observed_histogram')  # stash the binned version
az_data = az_data + az_data_old
del az_data.observed_data  # drop the binned observations...
az_data_new = az.convert_to_inference_data({"y": np.array(data_i)}, group='observed_data')
az_data = az_data + az_data_new  # ...and swap in the event-level data
Posterior predictive checks
I wanted to test the fitting, so I first checked the log likelihood:
betas = np.concatenate(az_data.posterior['beta'].values)  # all chains combined
log_lik = np.concatenate(az_data.log_likelihood['y'].values)
lh = [draw.sum() for draw in log_lik]  # total log likelihood per draw
beta_fit_lh_lh = np.amax(lh)
beta_fit_lh = betas[np.argmax(lh)]
print("beta_fit_lh: {:.5f}, LH: {:.5f} (id: {:.0f})".format(beta_fit_lh, beta_fit_lh_lh, np.argmax(lh)))
plt.rcParams["figure.figsize"] = [10,4]
fig, ax = plt.subplots()
plt.scatter(betas, lh, s=1)
plt.xlabel(r"Exponential factor, $\beta$")
plt.ylabel(r"Log Likelihood")
plt.show()
beta_fit_lh: 0.00971, LH: -23494.87768 (id: 3827)
I see the maximum is at the same value returned by the optimizing method (and the mean beta). Is this the mechanism behind 'optimizing'?
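My understanding (worth checking) is that optimizing returns the posterior mode, which with the flat prior here is just the maximum-likelihood estimate; for an exponential the MLE has a closed form, beta_hat = 1/mean(y), so this is quick to verify:
beta_mle = 1.0 / np.mean(data_i)  # analytic MLE of the exponential rate
print("Analytic MLE: {:.6f}".format(beta_mle))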
I then compute the negative log likelihood of the data as something else to check against. Here I'm comparing binned models, so for each draw I'm computing
nLH = sum_i( data_i * log(model_i) )
I first pull all the chains together, create histograms (with the same binning as my data), normalise, and then compute the nLH against my binned, normalised data:
betas = np.concatenate(az_data.posterior['beta'].values)  # get beta parameter to plot against, all chains combined
y_hats = np.concatenate(az_data.posterior_predictive['y_hat'].values)  # get y_hat data, all chains combined
lh_fit = np.zeros_like(betas)  # create empty array
lh_pred = np.zeros_like(betas)  # create empty array
for j in range(len(betas)):
    # negative log likelihood of the data under this draw's model
    f_opt = lambda x: betas[j] * np.exp(-x * betas[j])
    f_opt_n = [f_opt(x) for x in bins]
    lh_fit[j] = np.array([x * np.log(f) for x, f in zip(data_trunc_n, f_opt_n) if f > 0]).sum()
    # negative log likelihood of this draw's predicted posterior
    bincontents, npbins = np.histogram(y_hats[j], bins=bins, density=True)
    lh_pred[j] = np.array([x * np.log(f) for x, f in zip(bincontents, f_opt_n) if f > 0]).sum()
# find the maximum of the nLH of the data
beta_fit_lh_lh_fit = np.amax(lh_fit)
idx_fit = np.argmax(lh_fit)
beta_fit_lh_fit = betas[idx_fit]
# for the predictions, use the mean value (i.e. comparable to the mean of the betas?)
idx_pred = (np.abs(lh_pred - lh_pred.mean())).argmin()  # find the mean, then the closest value to it in the array
beta_fit_lh_pred = betas[idx_pred]
beta_fit_lh_lh_pred = lh_pred[idx_pred]
print("beta_fit_lh_fit: {:.5f}, LH: {:.5f} (id: {:.0f})".format(beta_fit_lh_fit, beta_fit_lh_lh_fit, idx_fit))
print("beta_fit_lh_pred: {:.5f}, LH: {:.5f} (id: {:.0f})".format(beta_fit_lh_pred, beta_fit_lh_lh_pred, idx_pred))
beta_fit_lh_fit: 0.00971, LH: -5.63426 (id: 15887)
beta_fit_lh_pred: 0.00981, LH: -5.58654 (id: 14367)
This also returns a value for beta which is very close to the previous ones found (with optimizing, the mean, and the log_lik maximum).
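As an aside, the per-draw loop above is slow over 20,000 draws; the data part at least can be vectorized with numpy broadcasting (a sketch, assuming the same shapes as above):
f_all = betas[:, None] * np.exp(-bins[None, :] * betas[:, None])  # model density per draw and bin, shape (n_draws, n_bins)
lh_fit_vec = (data_trunc_n[None, :] * np.log(f_all)).sum(axis=1)  # binned log likelihood of the data per draw (f_all > 0 always)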
I look at the nLH of all the predicted posteriors. I couldn't see how to use the ArviZ KDE plot (or any other, such as the pair plot) with specified contours; I want contours which show sigma levels. I found the corner package does this nicely:
az.style.use("default")
postsamples = np.vstack((betas, lh_pred)).T
levels = [0, 1-np.exp(-0.5),1-np.exp(-2)] # 1sigma and 2sigma contours
fig = corner.corner(postsamples,
quantiles=(0.05, 0.16, 0.84, 0.95),
levels=levels,
plot_datapoints=False,
plot_density=False,
fill_contours=True,
contour_kwargs={"colors":['red','blue']},
contourf_kwargs={"colors":['w','red','blue'], "alpha":0.4},
truths=[beta_fit_lh_pred, beta_fit_lh_lh_pred],
truth_color='grey',
show_titles=True, title_fmt=".5f",
)
The nLHs of all the predicted data sets look consistent; this must be a good thing?!
I've never used an HDI before, but I found it in the ArviZ package, so I compute it too:
hdi_lower, hdi_upper = az.hdi(az_data, hdi_prob=0.95, var_names=['beta'])['beta'].values
beta_fit_hdi = hdi_lower + (hdi_upper-hdi_lower)/2
print("HDI lower: {:.5f}, HDI upper: {:.5f}, width: {:.5f}, mid-point: {:.5f}".format(hdi_lower, hdi_upper, hdi_upper-hdi_lower, beta_fit_hdi))
HDI lower: 0.00941, HDI upper: 0.01000, width: 0.00059, mid-point: 0.00971
It also returns a value for beta which is consistent with the others.
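Since I'm new to HDIs: as I understand it, the HDI is the narrowest interval containing 95% of the mass, as opposed to the equal-tailed interval from the percentiles; for a posterior this symmetric the two should nearly coincide, which is easy to check:
eti_lower, eti_upper = np.percentile(betas, [2.5, 97.5])  # equal-tailed 95% interval for comparison
print("ETI lower: {:.5f}, ETI upper: {:.5f}".format(eti_lower, eti_upper))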
I attempt to use arviz.plot_hdi. I first histogram all the predictive data sets and then plot the binned data. (Is there another way to do this?)
It works, but the band looks quite large; I guess it does contain 95% of the values. Is this how it should be interpreted?
betas = np.concatenate(az_data.posterior['beta'].values)  # get beta parameter to plot against
y_hats = np.concatenate(az_data.posterior_predictive['y_hat'].values)  # get y_hat data
bincontents_list = []
for j in range(len(betas)):
    bincontents, npbins = np.histogram(y_hats[j], bins=range(len(bins)+1), density=True)
    bincontents_list.append(bincontents)
plt.rcParams["figure.figsize"] = [10,4]
fig, ax = plt.subplots()
ax = az.plot_hdi(bins, bincontents_list, hdi_prob=0.95, plot_kwargs={"ls": "--"})
plt.hist(data_i, width=1, bins=range(len(bins)+1), alpha=0.7, density=True)
plt.plot(bins, beta_fit_hdi*np.exp(-bins*beta_fit_hdi), color='red', label='hdi_midpoint ({:.5f})'.format(beta_fit_hdi))
plt.plot(bins, hdi_lower*np.exp(-bins*hdi_lower), color='yellow', label='hdi_lower ({:.5f})'.format(hdi_lower))
plt.plot(bins, hdi_upper*np.exp(-bins*hdi_upper), color='yellow', label='hdi_upper ({:.5f})'.format(hdi_upper))
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.xlim([0, len(bins)])
plt.ylim([1e-4, max(data_trunc_n)*2])
ax.set_yscale('log')
plt.legend(loc="upper right", fontsize=9)
plt.show()
I could equally draw all the other lines of best fit for beta, but since they are so close, they would all fall on top of each other on this plot. I like this plot, as long as I'm creating it correctly... I'll have to check it over.
Next:
- I'll compare the two models, exponential and neg_binomial_2_lpmf.
- I'll fit the two parameters of neg_binomial_2_lpmf.
- I'd like to see how loo works; I'll attempt to try this (a first guess at the call below).
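My first guess at the loo call, since the pointwise log_lik is already in the inference data (I haven't verified this is the right usage):
loo_result = az.loo(az_data, pointwise=True)  # leave-one-out CV from the stored log likelihood
print(loo_result)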
Many thanks to anybody who read this all the way through.
I quite like Stan and ArviZ; there are some really nice tools here. In this post I've learnt and tested them against the tools I know (e.g. the nLH), which might not be a standard way of doing things (is this mixing in frequentism?). Please point out any mistakes!
edit: changed 'log-likelihood ratio' to just 'log-likelihood' (not sure the ratio was the correct thing to do).