Hi,
I’m trying to compute a posterior predictive p-value (PPP) for a simple factor analysis (https://gist.github.com/mike-lawrence/dd2435f290a567bd1fd03370ee669688). While I agree that graphical posterior predictive checks would be a much better option, PPP is widely reported by commercial SEM tools such as MPlus, and I feel some pressure to report it. I have followed the approach proposed by the MPlus authors, described here (p. 28): http://statmodel.com/download/Bayes3.pdf . Instead of their discrepancy function, I use the sum of the log likelihoods computed by Stan (the log predictive density, lpd). As far as I understand, this should be equivalent, except that higher log-likelihood values indicate a better fit, whereas higher discrepancy values indicate a worse fit.
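To make the comparison I have in mind concrete, here is a small Python sketch (with placeholder log-likelihood draws, not my actual output) showing that a PPP computed from a -2·log-likelihood discrepancy and one computed from the log likelihood directly are the same quantity, up to the sign flip:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 4000
# hypothetical per-draw summed log likelihoods (placeholders, not real output)
sum_ll = rng.normal(-3800.0, 15.0, n_draws)      # given the observed data
sum_ll_rep = rng.normal(-3800.0, 15.0, n_draws)  # given the replicated data

# MPlus-style PPP with discrepancy f = -2 * log likelihood:
# PPP = Pr(f(y_rep) >= f(y)) over posterior draws
ppp_discrepancy = np.mean(-2.0 * sum_ll_rep >= -2.0 * sum_ll)

# Same event expressed with the log likelihood directly (sign flipped):
ppp_loglik = np.mean(sum_ll >= sum_ll_rep)
```

Both indicators describe the same event per draw, so the two estimates are identical.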
The problem is that I am getting substantially different results from MPlus: my PPP is around 0.5 (with an SD of about 0.5), while MPlus reports 0.000. The former indicates a perfect fit, the latter a very bad one.
Could you confirm that my approach is correct (with respect to the procedure described by the MPlus authors)?
Below you will find the code and a short example.
Thank you.
// cfa.stan
data {
  int<lower=1> n_subj;
  int<lower=1> n_y;
  matrix[n_subj, n_y] y;
  int<lower=1> n_fac;
  int<lower=1, upper=n_fac> y_fac[n_y];
}
transformed data {
  // standardize each outcome column
  matrix[n_subj, n_y] scaled_y;
  for (i in 1:n_y) {
    scaled_y[, i] = (y[, i] - mean(y[, i])) / sd(y[, i]);
  }
}
parameters {
  matrix[n_fac, n_subj] normal01;
  cholesky_factor_corr[n_fac] fac_cor_helper;
  vector[n_y] scaled_y_means;
  vector<lower=0>[n_y] scaled_y_noise;
  vector<lower=0>[n_y] betas;
}
transformed parameters {
  // correlated subject-level factor scores
  matrix[n_subj, n_fac] subj_facs;
  subj_facs = transpose(fac_cor_helper * normal01);
}
model {
  to_vector(normal01) ~ normal(0, 1);
  fac_cor_helper ~ lkj_corr_cholesky(1);
  scaled_y_means ~ normal(0, 1);
  scaled_y_noise ~ weibull(2, 1);
  betas ~ normal(0, 1);
  for (i in 1:n_y) {
    scaled_y[, i] ~ normal(
      scaled_y_means[i] + subj_facs[, y_fac[i]] * betas[i],
      scaled_y_noise[i]
    );
  }
}
generated quantities {
  vector[n_y] Mu;
  // residual covariance, conditional on the factor scores
  cov_matrix[n_y] Q = diag_matrix(scaled_y_noise .* scaled_y_noise);
  vector[n_y] yrep[n_subj];
  vector[n_subj] log_lik;
  vector[n_subj] log_lik_rep;
  real sum_ll;
  real sum_ll_rep;
  real ppp;
  for (n in 1:n_subj) {
    for (i in 1:n_y) {
      Mu[i] = scaled_y_means[i] + subj_facs[n, y_fac[i]] * betas[i];
    }
    // log likelihood of the current parameter values given the original data
    log_lik[n] = multi_normal_lpdf(scaled_y[n, ] | Mu, Q);
    // generate replicated data from the current parameter values
    yrep[n] = multi_normal_rng(Mu, Q);
    // log likelihood of the current parameter values given the replicated data
    log_lik_rep[n] = multi_normal_lpdf(yrep[n] | Mu, Q);
  }
  // sum the log likelihoods over all observations
  sum_ll = sum(log_lik);
  sum_ll_rep = sum(log_lik_rep);
  // per-draw indicator; its posterior mean is the PPP
  ppp = sum_ll > sum_ll_rep ? 1 : 0;
}
# run.R
library(tidyverse)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

data("HolzingerSwineford1939", package = "lavaan")
HolzingerSwineford1939 %>%
  select(x1:x9) -> dat

data_for_stan = list(
  n_y = ncol(dat),
  n_subj = nrow(dat),
  y = as.matrix(dat),
  n_fac = 3,
  y_fac = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
)

model = stan_model(file = "cfa.stan")

post = sampling(
  model,
  data = data_for_stan,
  seed = 1,
  chains = 4,
  iter = 2e3,
  pars = c('normal01', 'fac_cor_helper', 'Q', 'Mu', 'log_lik_rep', 'yrep'),
  include = FALSE
)

print(post, pars = 'ppp', digits = 2, use_cache = FALSE)
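One note on reading the printed summary: because `ppp` is a 0/1 indicator per posterior draw, its posterior SD is sqrt(p·(1−p)), which is about 0.5 whenever the mean is about 0.5, so the reported SD carries no extra information. A quick sanity check of that identity (in Python, with made-up indicator draws, not my real output):

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical 0/1 ppp indicators, one per posterior draw
ppp_draws = rng.integers(0, 2, size=4000).astype(float)

p = ppp_draws.mean()   # the PPP estimate
sd = ppp_draws.std()   # equals sqrt(p * (1 - p)) exactly for 0/1 draws
```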