Hi,
Recently, I fitted a multivariate autoregressive model to time series data. The effective sample size was large enough, all R-hat values were below 1.01, and there were no divergent transitions. However, all the Pareto k values were above 1 when using loo. I decided to check a simpler version of the model with simulated data, so here is a univariate version of the model: a Poisson-lognormal autoregressive model.
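Written out in equations (my notation), the model I am trying to fit is:

$$
\log \lambda_1 \sim \mathrm{Normal}(0, 5), \qquad
\log \lambda_t \sim \mathrm{Normal}(\alpha + \beta \log \lambda_{t-1}, \sigma) \text{ for } t > 1, \qquad
y_t \sim \mathrm{Poisson}(\exp(\log \lambda_t)).
$$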
data {
  int<lower=0> N;
  int y[N];
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  vector[N] log_lambda;
}
model {
  // priors
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  // latent AR(1) process on the log scale
  log_lambda[1] ~ normal(0, 5);
  for (n in 2:N) {
    log_lambda[n] ~ normal(alpha + beta * log_lambda[n - 1], sigma);
  }
  // observation model
  for (n in 1:N) {
    target += poisson_log_lpmf(y[n] | log_lambda[n]);
  }
}
generated quantities {
  vector[N] log_lik;
  vector[N] log_lambda_pred;
  int y_pred[N];
  // one-step-ahead draws of the latent state
  log_lambda_pred[1] = log_lambda[1];
  for (t in 2:N) {
    log_lambda_pred[t] = normal_rng(alpha + beta * log_lambda[t - 1], sigma);
  }
  for (n in 1:N) {
    // pointwise log-likelihood for loo
    log_lik[n] = poisson_log_lpmf(y[n] | log_lambda[n]);
    // cap the log rate to avoid overflow in poisson_log_rng
    if (log_lambda_pred[n] > 20) {
      y_pred[n] = poisson_log_rng(20);
    } else {
      y_pred[n] = poisson_log_rng(log_lambda_pred[n]);
    }
  }
}
And here is the R code to simulate the data, fit the model, and run loo:
rm(list = ls())
library(rstan)
library(loo)

# simulate a Poisson-lognormal (Gompertz) time series
Gompertz.sim <- function(init.pop, alpha, beta, time, sd) {
  community <- log_lambda <- rep(NA, time)
  log_lambda[1] <- init.pop
  for (t in 2:time) {
    log_lambda[t] <- alpha + beta * log_lambda[t - 1] + rnorm(1, 0, sd)
  }
  for (i in 1:time) {
    community[i] <- rpois(1, lambda = exp(log_lambda[i]))
  }
  return(list(counts = community, log_lambda = log_lambda))
}

set.seed(1234)
P0 <- 2
sp <- 1
t <- 100
alpha <- rnorm(1, 0.25, 0.01)
beta <- rnorm(1, 0.92, 0.001)
sd_noise <- rnorm(1, 0.25, 0.01)

Sim.data <- Gompertz.sim(init.pop = P0, alpha = alpha, beta = beta, time = t, sd = sd_noise)
dat <- list(y = Sim.data$counts, N = length(Sim.data$counts))

PL_AR_model <- stan(file = "PLNAR1_1sp.stan",
                    data = dat,
                    chains = 4,
                    cores = 4,
                    iter = 4000,
                    control = list(adapt_delta = 0.8, max_treedepth = 10))

log_lik_PL <- extract_log_lik(PL_AR_model, merge_chains = FALSE)
r_eff_PL <- relative_eff(exp(log_lik_PL), cores = 4)
loo_PL <- loo(log_lik_PL, r_eff = r_eff_PL, cores = 4, save_psis = TRUE)
The model converges and recovers the parameters used in the simulation. However, even though I simulated the data from the same generating process as the Stan model, the Pareto k values are high:
Computed from 8000 by 100 log-likelihood matrix
Estimate SE
elpd_loo -270.9 8.0
p_loo 34.1 3.2
looic 541.9 16.0
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 52 52.0% 902
(0.5, 0.7] (ok) 36 36.0% 218
(0.7, 1] (bad) 11 11.0% 28
(1, Inf) (very bad) 1 1.0% 9
See help('pareto-k-diagnostic') for details.
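In case it helps, this is how I inspected which observations are problematic, using the loo package's pareto_k_ids() and pareto_k_values() helpers on the loo_PL object from the code above (the dat$y comparison is just my attempt to see whether the bad points correspond to extreme counts):

```r
library(loo)

# indices of observations whose Pareto k exceeds 0.7
bad_idx <- pareto_k_ids(loo_PL, threshold = 0.7)

# the individual k estimates for those observations
k_vals <- pareto_k_values(loo_PL)
k_vals[bad_idx]

# compare against the simulated counts
dat$y[bad_idx]
```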
Also, the posterior predictive check looks fine.
From this thread, I think this is the p_loo << p case, but I am not sure what p is here. I count the 3 parameters (alpha, beta and sigma) plus the 100 log_lambda values, so p = 103, right? If that is correct, the model is likely misspecified, but the Stan model is identical to the process that simulated the data (unless I made a conceptual or coding mistake along the way).
Any idea why this is happening?