# What Steps after Posterior Predictive Checks

Hello,

I am seeking assistance regarding the necessary steps after implementing posterior predictive checks. I am replicating Jeffrey Arnold’s polling aggregation model using my own dataset.

The `ppc_dens_overlay` plot indicates that the simulated datasets do not align well with the observed data.

Similarly, the `ppc_stat_2d` plot shows that the simulated datasets tend to cluster at means lower than the observed mean and at standard deviations higher than the observed standard deviation.
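For readers unfamiliar with that plot, here is a minimal sketch of the statistics it compares, written in Python/NumPy rather than the bayesplot R functions; the numbers below are fabricated stand-ins for actual posterior predictive draws, chosen only to reproduce the pattern described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative stand-ins: observed data y and posterior predictive draws y_rep
# (rows = posterior draws, columns = observations). In practice these would
# come from the fitted Stan model.
y = rng.normal(loc=50.0, scale=2.0, size=200)
y_rep = rng.normal(loc=48.0, scale=4.0, size=(1000, 200))

# ppc_stat_2d-style comparison: one (mean, sd) point per simulated dataset,
# set against the single observed (mean, sd) point.
obs_mean, obs_sd = y.mean(), y.std(ddof=1)
rep_means = y_rep.mean(axis=1)
rep_sds = y_rep.std(axis=1, ddof=1)

# The mismatch described above: simulated means sit below the observed mean,
# and simulated standard deviations sit above the observed one.
print(f"observed: mean={obs_mean:.2f}, sd={obs_sd:.2f}")
print(f"simulated: mean of means={rep_means.mean():.2f}, "
      f"mean of sds={rep_sds.mean():.2f}")
```

A well-calibrated model would have the observed point fall inside the cloud of simulated (mean, sd) points rather than off to one corner.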

My question is: what steps can I take to reduce the discrepancies between the observed data and the simulated data?

Here’s the Stan code:

```stan
parameters {
  vector[T] omega_raw;
  real<lower = 0.> tau;
  vector[H] eta_raw;
  real<lower = 0.> zeta;
}
transformed parameters {
  vector[N] mu;
  vector[T] xi;
  vector[H] eta;
  eta = eta_raw * zeta;
  xi[1] = xi_init_loc + omega_raw[1] * xi_init_scale;
  for (t in 2:T) {
    xi[t] = xi[t - 1] + omega_raw[t] * tau;
  }
  for (i in 1:N) {
    mu[i] = xi[time[i]] + eta[house[i]];
  }
}
model {
  eta_raw ~ normal(3.5, 5.0);          // eta_raw ~ normal(0., 1.);
  zeta ~ normal(4.5, zeta_scale);      // zeta ~ normal(0., zeta_scale);
  tau ~ cauchy(0., 2.95 * tau_scale);  // tau ~ cauchy(0., tau_scale);
  omega_raw ~ normal(5.0, 7.0);        // omega_raw ~ normal(0., 1.);
  y ~ normal(mu, s);
}
```
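For reference, the replicated datasets the `ppc_*` plots need can be produced in a `generated quantities` block. A sketch, assuming the same `mu` as above and that `s` is a vector of known standard errors (if `s` is a scalar, use `s` instead of `s[i]`):

```stan
generated quantities {
  vector[N] y_rep;  // posterior predictive draws for the ppc_* plots
  for (i in 1:N) {
    y_rep[i] = normal_rng(mu[i], s[i]);  // one simulated observation per data point
  }
}
```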

I have tried adjusting the priors by increasing their locations and scales, aiming to see how far they would need to move relative to Arnold’s original model. However, despite these adjustments, the posterior predictive checks show minimal change.

What steps could be taken to reduce the discrepancies between the observed data and the simulated data? Any suggestions would be much appreciated.

Some things you might consider:

- Your model may be missing key information or structure (for example, predictors that are needed but absent).
- Your model may not be flexible enough (for example, you fit a linear function for the mean but the relationship is nonlinear).
- Your response family may not adequately describe the distribution that generates the data.

I think the best first step is always to go back to the drawing board and rethink the conceptual part of the analysis, using the discrepancies you see between the posterior predictions and your data to guide your thinking about the model.

When I see something like that small second peak in the density, I start thinking about missing predictors; perhaps the model is averaging over those two peaks. You could also try a more flexible response family, such as a Student-t.
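As a sketch of that last suggestion, assuming the same `mu` and `s` as in the model above and introducing a new degrees-of-freedom parameter `nu` (not in the original), the likelihood would change to something like:

```stan
parameters {
  // ... existing parameters ...
  real<lower = 1> nu;  // degrees of freedom for the Student-t likelihood
}
model {
  // ... existing priors ...
  nu ~ gamma(2, 0.1);         // a common weakly informative prior for nu
  y ~ student_t(nu, mu, s);   // replaces y ~ normal(mu, s);
}
```

Small `nu` gives heavy tails; as `nu` grows large, the Student-t approaches the normal, so this nests your current model.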