I’m leaving this here for posterity in case someone finds it useful.
I ran a simulation study with S = 100 iterations. I tested 4 models, varying in correlation structures and link functions, against each other:
1 - independent - SoftMax;
2 - correlated - SoftMax;
3 - independent - Isometric Log-Ratio;
4 - correlated - Isometric Log-Ratio;
The Isometric Log-Ratio (ILR) is a link function used for compositional data. I wasn’t very familiar with it (see Egozcue, Filzmoser et al.), but I found it appealing for this problem because it generates representations of compositional variables in Euclidean space that preserve the relationships (ratios) implied by the original scale. Unlike other log-ratio transformations, it is not equivalent to SoftMax, so specifying correlations on the latent scale has the potential of learning something different from SoftMax.
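For concreteness, here is a minimal sketch of the two links in R, using ilrBase() from the compositions package (the same basis matrix e that is passed to the Stan models below); the latent values are made up purely for illustration:

library(compositions) # ilrBase()
softmax <- function(m) exp(m) / sum(exp(m))
# SoftMax link: a D-dimensional latent vector (last element fixed at 0 for
# identification, as in the Stan code below) mapped onto the simplex
mu_softmax <- c(0.5, -0.2, 0.1, 0) # hypothetical latent values, D = 4
pi_softmax <- softmax(mu_softmax)
# ILR link: a (D-1)-dimensional latent vector mapped onto the simplex through
# the ILR basis, i.e. the inverse ILR transform softmax(e %*% mu)
D <- 4
e <- ilrBase(D = D, method = 'basic') # D x (D-1) basis matrix
mu_ilr <- c(0.5, -0.2, 0.1) # hypothetical latent values
pi_ilr <- softmax(as.numeric(e %*% mu_ilr))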
In setting this up I simplified the data generating process to what amounts to a simple imputation problem: I have a number n_series of series of multinomial counts, representing the results of opinion polls for different candidates; each series is n_time_points long, and every series has values missing at random days. For all but 2 of the series, the last \frac{1}{3} of the observations is missing. The challenge for the models is to reconstruct these values, along with the other randomly missing ones. The full code used for the DGP and the models can be found at the bottom; here I will focus on the results.
I find evidence that the correlated models outperform the independent ones, but in my view the evidence is surprisingly thin… in principle I would have expected better performance on every simulation round, but this is not the case. ILR did not give major differences in performance, so I will focus on correlated vs. independent SoftMax below (though ILR did produce substantially different predictions compared to SoftMax).
The results of the simulation are attached in this csv file:
simulation_results_stan.forum.csv (463.4 KB). I aggregate by iteration to look at the average difference in performance between models within each simulation, rather than at individual predictions within simulations, and I focus on the average ability to predict the series with the `missing' \frac{1}{3} of observations.
In ~ 70\% of the simulations, the RMSE is greater for the independent model than for the correlated model. Similar results appear for absolute bias, correlation and coverage. The differences appear to be small on average, but become large in instances where the independent model `fails'. So one could be tempted to conclude that the correlated model is more robust under a correlated DGP.
The average gain in the predicted values is close to one percentage point (per candidate), which is small, though not totally negligible.
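For reference, the aggregation looks roughly like this (a sketch, assuming the csv has the same columns as the results data.table built at the bottom of the code: iter, model, link, rmse, empty, etc.):

library(data.table)
res <- fread('simulation_results_stan.forum.csv')
# keep the SoftMax runs and the series with the missing final third,
# then average the per-candidate RMSE within each iteration x model
agg <- res[link == 'SoftMax' & empty == 'yes', .(rmse = mean(rmse)), by = .(iter, model)]
wide <- dcast(agg, iter ~ model, value.var = 'rmse')
mean(wide$independent > wide$correlated) # share of iterations where the correlated model wins on RMSE
mean(wide$independent - wide$correlated) # average RMSE difference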
If I try to look at which characteristics of the simulation are associated with the difference in performance, I don’t get much information (100 simulations are too few to say anything definitive about subsets of observations, but the computational cost is high when the number of candidates considered is > 3 and/or the number of time points is > 30, so I have to come up with a more efficient strategy before I simulate more). Below are the results of a regression tree with default settings, fit roughly as in the sketch that follows:
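(Sketch of the tree fit, continuing from the res and wide objects above; the outcome is the per-iteration RMSE difference between the independent and correlated models, the predictors are the simulation characteristics stored in the results file.)

library(rpart)
# per-iteration simulation characteristics (constant within an iteration, so mean() just recovers them)
chars <- res[link == 'SoftMax' & empty == 'yes',
.(avg_cor_size = mean(avg_cor_size), n_time_points = mean(n_time_points),
n_series = mean(n_series), n_per_day = mean(n_per_day)), by = iter]
tree_dt <- merge(wide[, .(iter, rmse_diff = independent - correlated)], chars, by = 'iter')
# regression tree with default settings
tree <- rpart(rmse_diff ~ avg_cor_size + n_time_points + n_series + n_per_day, data = tree_dt)
plot(tree); text(tree, use.n = TRUE)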
Clearly the magnitude of the correlation in the underlying data – as we would hope – is important when it comes to explaining the difference in performance between independent and correlated models. The sample size per day also plays a role, as does the total number of time points. I don’t put too much stock in any specific split considered by the tree, as the final nodes have very few observations, but the general idea that correlation, sample size and length of the series matter can be safely extrapolated, I think.
Finally, on the point Bob made re: `the model does not perform well because it does not have enough information to properly learn the covariance', I checked this for a couple of the simulations. The uncertainty around the correlation parameters is typically contained; the problem seems rather to be that the point estimates of these correlations are off, sometimes with the opposite sign, compared to the DGP.
I suppose this could still be an issue related to lack of power: if I see samples for 30 time points and extrapolate a correlation from them, especially when some variable with large correlations but highly negative latent values is essentially never observed (because its sampling probability is close to 0 with small n), I’m likely to get the correlation structure wrong.
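For completeness, the check looked roughly like this (a sketch; fit_object and cor_matrix are the objects created in the code below, and since the model’s correlations are defined on the (choice_N - 1)-dimensional latent scale, the comparison with the simulated correlation matrix is only indicative):

# posterior draws of the Cholesky factor of the latent correlation matrix
L_Omega_hat <- rstan::extract(fit_object, pars = 'L_Omega')$L_Omega # S x K x K array
# reconstruct the correlation matrix draw by draw and summarise
Omega_draws <- apply(L_Omega_hat, 1, function(L) tcrossprod(L)) # (K*K) x S matrix
K <- dim(L_Omega_hat)[2]
Omega_mean <- matrix(rowMeans(Omega_draws), K, K)
Omega_lo <- matrix(apply(Omega_draws, 1, quantile, 0.05), K, K)
Omega_hi <- matrix(apply(Omega_draws, 1, quantile, 0.95), K, K)
# compare against (a subset of) the DGP correlation matrix
round(Omega_mean, 2)
round(as.matrix(cor_matrix)[1:K, 1:K], 2)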
So what conclusions can I draw?
I think it makes sense to fit the correlated model, though it does come at a heavy computational cost. I will expect better performance when:
(1) the underlying correlations are strong;
(2) my sampling procedure is powerful enough to also sample candidates with extremely low prevalence;
(3) I can observe a large number of time points.
Further, I can safely say the use of ILR does not guarantee better performance.
As always, any thoughts are welcome!
Code for the simulation study
# packages used below
library(MASS) # mvrnorm()
library(Matrix) # nearPD()
library(data.table)
library(compositions) # ilrBase()
library(rstan)
# bias(), rmse() and cover() are small helper functions defined elsewhere (not shown)
n_iter <- 100 # S = 100 simulation iterations
results_list <- NULL # per-iteration results; NULL items are dropped by rbindlist() below
for(iter in 1:n_iter){
# Define parameters
n_time_points <- sample(30:365,size = 1) # number of time points (days)
n_series <- sample(3:10,size = 1) # number of series (candidates)
# Create initial correlation matrix
cor_matrix <-
matrix(
runif(n = (n_series)*(n_series),min = -1,max = 1),
nrow = n_series,
ncol = n_series
)
diag(cor_matrix) <- 1
# with even moderately large n_series, you're likely to observe a reasonable
# amount of correlation randomly.
# we expect the biggest gains to be when average correlation size is very high,
# and when the number of series is relatively small
# Adjust for positive definiteness
cor_matrix <- nearPD(cor_matrix, corr = TRUE)$mat
# Standard deviations (scale) for each variable
scales <- runif(n = n_series,min = 0,max = 0.25)
# Create a matrix of standard deviation products
sd_product <- outer(scales, scales)
# Convert correlation matrix to covariance matrix
cov_mat <- cor_matrix * sd_product
# first time point (mean-centring to encourage reasonable behaviour is disabled)
x0 = runif(n = n_series,min = -1,max = 1)
simulated_data <- t(x0) # t(x0 - mean(x0)) would centre the start around 0
# subsequent points (random walk)
for(t in 2:n_time_points) {
simulated_data <-
rbind(
simulated_data,
simulated_data[t-1,1:n_series] +
mvrnorm(n = 1, mu = rep(0, n_series), Sigma = cov_mat)
)
}
## ensure sum to 0
#simulated_data <- cbind(simulated_data,-rowSums(simulated_data))
# store average correlation
avg_cor_size <- mean(abs(cor(simulated_data)[lower.tri(cor(simulated_data))]))
# Transform to probability scale (0 to 1) using softmax
probability_data <- exp(simulated_data)/rowSums(exp(simulated_data))
prevalence = apply(probability_data,2,mean)
sd_pi = apply(probability_data,2,sd)
# random sample size per date
n = round(runif(n = n_time_points,min = 0,max = 2500))
n = ifelse(n==0,1,n)
Y =
t(
sapply(
1:n_time_points,
function(x){
rmultinom(
n = 1,
size = n[x],
prob = t(probability_data[x,])
) } ) )
# drop the last third of the observations (the days closest to the election)
# for all but the first two series
empty = 3:n_series # alternative: sort(sample(3:n_series,size = sample(1:(n_series-2),size = 1)))
Y[1:round((0.33*n_time_points)),empty] <- NA
# to evaluate reconstruction power and overall estimation accuracy,
# for each time point drop some of the remaining series at random
for(t in 1:n_time_points) {
if(any(is.na(Y[t,]))){next}
size <- sample( 0:(n_series-2),size = 1)
if(size==0){next}
Y[t,sample(1:n_series,size = size)] <- NA
# dropping at most n_series-2 series keeps at least a 2-way comparison per time point
}
print(Y)
# Check the result
par(mfrow = c(2,2))
plot(
simulated_data[,1],
type = 'l',
ylim = c(min(simulated_data),max(simulated_data)),
col = 1,
xlim = c(n_time_points,1) ,
ylab = "Latent Support",
xlab = "Time")
for(i in 2:n_series) {
lines(simulated_data[,i], type = 'l', col = i)
}
plot(
probability_data[,1],
type = 'l',
col = 1,
ylim = c(0,1),
xlim = c(n_time_points,1) ,
ylab = "Probability",
xlab = "Time")
for(i in 2:n_series) {
lines(probability_data[,i], type = 'l', col = i)
}
plot(
Y[,1],
type = 'l',
col = 1,
ylim = c(0,max(Y,na.rm=TRUE)),
xlim = c(n_time_points,1) ,
ylab = "Observed Counts",
xlab = "Time")
# highlight daily sampling effort with size of circles
points(Y[,1], col = 1,cex = 2*n/max(n))
for(i in 2:n_series) {
points(Y[,i], col = i,cex = 2*n/max(n))
lines(Y[,i], type = 'l', col = i)
}
sims_dt <-
data.table(
Y = as.numeric(Y),
choice_id = rep(1:n_series,each = n_time_points),
dte_id = rep(1:n_time_points,n_series),
n = rep(n,n_series)
)
sims_dt <- sims_dt[complete.cases(sims_dt)]
lkj_prior = 1
data_list =
list(
Y = sims_dt$Y,
n = sims_dt$n,
choice_id = sims_dt$choice_id,
choice_N = n_series,
dte_id = sims_dt$dte_id,
dte_N = n_time_points,
N = dim(sims_dt)[1],
lkj_prior = lkj_prior,
e = ilrBase(D = n_series,method = 'basic')
)
for(link in c('SoftMax','ILR')){
for(type in c('correlated','independent')){
if(link == 'SoftMax'){
link_it <- 'pi[t] = exp(mu[t])/sum(exp(mu[t]));'
}
if(link == 'ILR'){
link_it <- "pi[t] = exp(e*mu[t][1:(choice_N-1)])/sum(exp(e*mu[t][1:(choice_N-1)]));"
}
if(type == 'correlated'){
model = paste0(
"data{
int<lower = 1> N; // total n. candidate measurements
int Y[N]; // poll-level vote counts
int n[N]; // poll-level sample-size
int choice_id[N]; // candidate id
int<lower=1> choice_N; // total number of candidates under consideration
int dte_id[N]; // days-to-election id
int<lower=1> dte_N; // max number of days-to-election
real<lower = 0> lkj_prior; // prior for the strength of candidate correlation
matrix[choice_N,choice_N-1] e; // ILR Basis
}
parameters {
vector[choice_N-1] gamma; // candidate choice effect
vector<lower=0>[choice_N-1] gamma_scale; // candidate choice effect
array[dte_N] vector[choice_N-1] z; // auxiliary variable for multivariate random walk
vector<lower = 0>[choice_N-1] delta_scale; // day-to-day change
cholesky_factor_corr[choice_N-1] L_Omega; // correlation between candidates
}
transformed parameters {
vector[N] lambda; // latent support
array[dte_N] vector[choice_N] mu;
array[dte_N] vector[choice_N] pi;
vector[choice_N-1] gamma_star = gamma.*gamma_scale; // choice-level log-support on election day
array[dte_N] vector[choice_N-1] delta_star; // daily log-support per candidate
matrix[choice_N-1,choice_N-1] L; // covariance matrix
// define covariance matrix on daily changes in support
L = diag_pre_multiply(delta_scale[1:(choice_N-1)], L_Omega);
// random walk on daily support
for(c in 1:(choice_N-1)) {
delta_star[1][c] = 0; // identify via corner constraint
}
for(t in 2:dte_N) {
delta_star[t][1:(choice_N-1)] =
delta_star[t-1][1:(choice_N-1)] +
L * z[t][1:(choice_N-1)];
}
for(t in 1:dte_N){
mu[t][1:(choice_N-1)] =
gamma_star[1:(choice_N-1)] +
delta_star[t][1:(choice_N-1)];
mu[t][choice_N] = 0;
}
for(t in 1:dte_N){
",link_it,"
}
// linear predictor
for (i in 1:N) lambda[i] = n[i]*pi[dte_id[i]][choice_id[i]];
}
model {
to_vector(gamma) ~ std_normal();
to_vector(gamma_scale) ~ std_normal();
for(t in 1:dte_N) to_vector(z[t]) ~ std_normal();
to_vector(delta_scale) ~ std_normal();
L_Omega ~ lkj_corr_cholesky(lkj_prior);
Y ~ poisson( lambda ); // likelihood
}
")
# parameters to monitor
pars <-
c('gamma_star', # choice baseline
'delta_star', # campaign dynamic effect
'delta_scale',
'pi',
'L',
'L_Omega'
)
}
if(type == 'independent'){
model =
paste0(
"data{
int<lower = 1> N; // total n. candidate measurements
int Y[N]; // poll-level vote counts
int n[N]; // poll-level sample-size
int choice_id[N]; // candidate id
int<lower=1> choice_N; // total number of candidates under consideration
int dte_id[N]; // days-to-election id
int<lower=1> dte_N; // max number of days-to-election
real<lower = 0> lkj_prior; // prior for the strength of candidate correlation
matrix[choice_N,choice_N-1] e; // ILR Basis
}
parameters {
vector[choice_N-1] gamma; // candidate choice effect
vector<lower=0>[choice_N-1] gamma_scale; // candidate choice effect
array[dte_N] vector[choice_N-1] z; // auxiliary variable for multivariate random walk
vector<lower = 0>[choice_N-1] delta_scale; // day-to-day change
}
transformed parameters {
vector[N] lambda; // latent support
array[dte_N] vector[choice_N] mu;
array[dte_N] vector[choice_N] pi;
vector[choice_N-1] gamma_star = gamma.*gamma_scale; // choice-level log-support on election day
array[dte_N] vector[choice_N-1] delta_star; // daily log-support per candidate
// random walk on daily support
for(c in 1:(choice_N-1)) {
delta_star[1][c] = 0; // identify via corner constraint
}
for(t in 2:dte_N) {
delta_star[t][1:(choice_N-1)] =
delta_star[t-1][1:(choice_N-1)] +
delta_scale[1:(choice_N-1)] .* z[t][1:(choice_N-1)];
}
for(t in 1:dte_N){
mu[t][1:(choice_N-1)] =
gamma_star[1:(choice_N-1)] +
delta_star[t][1:(choice_N-1)];
mu[t][choice_N] = 0;
}
for(t in 1:dte_N){
",link_it,"
}
// linear predictor
for (i in 1:N) lambda[i] = n[i]*pi[dte_id[i]][choice_id[i]];
}
model {
to_vector(gamma) ~ std_normal();
to_vector(gamma_scale) ~ std_normal();
for(t in 1:dte_N) to_vector(z[t]) ~ std_normal();
to_vector(delta_scale) ~ std_normal();
Y ~ poisson( lambda ); // likelihood
}
")
# parameters to monitor
pars <-
c('gamma_star', # choice baseline
'delta_star', # campaign dynamic effect
'delta_scale',
'pi'
)
}
# fit model
fit_object <-
stan(model_code = model,
pars = pars,
data = data_list,
iter = 1000,
warmup = 750,
refresh = 1,
thin = 4,
cores = 8,
chains = 8,
control = list(max_treedepth = 20,
adapt_delta = 0.8),
verbose = TRUE
)
# check convergence
summary.fit <- summary(fit_object)$summary
pars_hat <- rstan::extract(fit_object, pars = pars, inc_warmup = FALSE)
S = dim(pars_hat$delta_star)[1]
# calculate candidate posterior predictive support
pi <-
lapply(1:S,
function(s){
sapply(1:n_series,
function(c){
pars_hat$pi[s,,c]
} )
} )
# calculate posterior predictive probabilities
preds =
data.table(
choice_id = rep(1:n_series,each = n_time_points),
dte_id = rep(1:n_time_points, n_series),
obs = as.numeric(probability_data),
missing = as.vector(is.na(Y)),
pred_lo = as.numeric(sapply(1:n_series,function(c){
apply(sapply(1:S,function(s){pi[[s]][,c]}),1,quantile,0.05)
})),
pred_mid = as.numeric(sapply(1:n_series,function(c){
apply(sapply(1:S,function(s){pi[[s]][,c]}),1,quantile,0.5)
})),
pred_hi = as.numeric(sapply(1:n_series,function(c){
apply(sapply(1:S,function(s){pi[[s]][,c]}),1,quantile,0.95)
}))
)
par(mfrow = c(3,2))
for(c in 1:n_series){
plot(
preds[choice_id == c]$pred_mid,
col = NA,
ylim = c(0,1),
xlim = c(n_time_points,1)
)
lines(y = preds[choice_id == c]$pred_mid,x = 1:n_time_points,col = 'dodgerblue',lwd = 2)
polygon(c(1:n_time_points, rev(1:n_time_points)), c(preds[choice_id == c]$pred_lo, rev(preds[choice_id == c]$pred_hi)), col = adjustcolor('skyblue',0.5), border = NA)
lines(y = probability_data[,c],x = 1:n_time_points,col = 'black')
}
results <-
data.table(
iter = iter,
model = type,
link = link,
avg_cor_size = avg_cor_size,
n_time_points = n_time_points,
n_series = n_series,
prevalence = prevalence,
sd_pi = sd_pi,
n_per_day = sum(n)/n_time_points,
bias = sapply(1:n_series,function(x){bias(y = preds[choice_id == x & missing]$obs,yhat = preds[choice_id == x & missing]$pred_mid)}),
rmse = sapply(1:n_series,function(x){rmse(y = preds[choice_id == x & missing]$obs,yhat = preds[choice_id == x & missing]$pred_mid)}),
cor = sapply(1:n_series,function(x){cor(x = preds[choice_id == x & missing]$pred_mid,y = preds[choice_id == x & missing]$obs)}),
cover = sapply(1:n_series,function(x){cover(y = preds[choice_id == x & missing]$obs,yhat_lo = preds[choice_id == x & missing]$pred_lo,yhat_hi = preds[choice_id == x & missing]$pred_hi)})
)
results$empty = 'no'
results$empty[empty] = 'yes'
results_list <- rbindlist(list(results_list,results) )
print(results_list)
# save(results_list,file = 'generated_data/offset.model_simulations.results.RData',compress = TRUE)
gc()
} }
}