Model
I’m new to building models directly in Stan, and I’d be grateful for help implementing a state space model that has similarities to a simple two-party election forecasting model:
- there are multiple groups observed over time with binary counts, but not every group is observed at each timestep (i.e., there are missing data)
- the goal is to infer the latent state, Pr(success), for each group at all timesteps of the observation period and also forecast this latent state a modest number of timesteps into the future
- changes in the latent state from one timestep to the next occur via a correlated random walk process, set by a covariance matrix that is inferred from the data
At the core of the model are draws from a multivariate normal distribution, which requires a covariance matrix (\Sigma) that governs the state changes of each group. As I understand it, I can construct \Sigma by first creating a diagonal matrix (\mathbf{S}) from the vector of group standard deviations (\sigma_{n}) that determine the distribution of step sizes for each group. Then, pre- and post-multiplying this to the group correlation matrix (\mathbf{R}). Both distributions of \sigma_{n} and \mathbf{R} can be specified using priors. In other words,
Problem
The model in which I’ve attempted to implement this covariance structure (code+data below) fails to sample due to the same error & exception on all chains. As an example:
Chain 1 Rejecting initial value:
Chain 1 Error evaluating the log probability at the initial value.
Chain 1 Exception: lkj_corr_lpdf: Correlation matrix is not a valid correlation matrix.
Correlation matrix(1,1) is 6.01864, but should be near 1.0
(in '/var/folders/04/jbnbgfsj0y537r7cr7180dx40000gq/T/Rtmp256Pzl/model-1077c528e1a8f.stan',
line 40, column 4 to column 34)
This refers to
rw_corr ~ lkj_corr(corr0_eta);
in the model
block. corr0_eta
is a user-supplied shape parameter for the LKJ correlation distribution.
Questions
- What am I doing wrong to cause the error, and how do I fix it? Is it a simple parameterization error, bad priors, bad initial values, or is the covariance structure improperly implemented? I’ve seen similar models use a Cholesky decomposition for more complex correlation structures (e.g., reconciling local vs. national polls). I don’t understand whether I need to do this and (if so) how to set it up.
- Are there obvious structure/efficiency improvements that should be made? (code+data below) Since I’m encountering errors, I have no idea whether this code will do what I intend and sample efficiently. The key areas I’m uncertain about are:
- handling of missing observations (using an indicator variable in the user-supplied
data
to control which observations update the likelihood inmodel
) - handling of forecast timesteps (would it be better to separate this into a
generated quantities
block that is separated from themodel
?) - is there a more efficient way than
for
loops in themodel
block? I tried to extend the sliced missing data example from the Stan manual to this matrix, but couldn’t figure out how to pass the matrix columns tomulti_normal
in a vectorized way. I’d appreciate help doing this if it would improve efficiency.
reprex
I’m using the cmdstanr
package v.0.4.0.9001 to fit the model with R 4.1.2 on an Apple M1 processor (aarch64).
Stan code: ssm_correlated-random-walk.stan (2.5 KB)
Example data: dat.Rdata (2.3 KB)
–or–
generate example data in R
library(MASS) # for MVNormal distribution functions
library(rethinking) # for LKJ correlation matrix distribution functions
set.seed(23981)
# data indices
n_time <- 50 # number of observation time steps
n_fc <- 5 # number of time steps ahead to forecast
n_group <- 10 # number of groups where state is observed
n <- (n_time + n_fc) * n_group # total number of observations (incl. forecasts)
dat_obs <- as.integer(replicate(n_group, # indicate data presence (~80% obs, 0% forecast)
rbinom(n_time + n_fc, size = 1,
prob = c(rep(0.8, n_time), rep(0, n_fc)))))
# true parameter values (latent)
corr <- rethinking::rlkjcorr(n = 1, K = n_group, eta = 0.5) # correlations
var <- rep(0.3, n_group) # standard deviations
vcov <- diag(var) %*% corr %*% diag(var)
logit_true_p <- as.vector(apply(MASS::mvrnorm(n = n_time + n_fc,
mu = rep(0, n_group),
Sigma = vcov),
MARGIN = 2, cumsum))
true_p <- plogis(logit_true_p)
# observed data
time_group <- rep(1:(n_time + n_fc), times = n_group)
group <- factor(rep(letters[1:n_group], each = n_time + n_fc))
trials <- as.integer(replicate(n_group, c(sample(100:500, size = n_time, replace = TRUE),
rep(9999, n_fc))))
successes <- rbinom(n, size = trials, prob = true_p)
trials[which(dat_obs==0)] <- 9999L # delete missing cases
successes[which(trials==9999L)] <- 9999L
# prior parameters
corr0_eta <- 1 # LKJ(_)
sigma0_shape <- rep(1, n_group) # gamma(_,rate)
sigma0_rate <- rep(1, n_group) # gamma(shape,_)
pr0_alpha <- rep(1, n_group) # beta(_,beta)
pr0_beta <- rep(1, n_group) # beta(alpha,_)
# list data to be passed to Stan
dat <- list(n = n, n_time = n_time, n_group = n_group, n_fc = n_fc, dat_obs = dat_obs,
time_group = time_group, group = group, trials = trials, successes = successes,
sigma0_shape = sigma0_shape, sigma0_rate = sigma0_rate,
corr0_eta = corr0_eta, pr0_alpha = pr0_alpha, pr0_beta = pr0_beta)
R code to compile and sample the model
library(cmdstanr)
# compile the model
mod <- cmdstan_model("ssm_correlated-random-walk.stan")
# sample the model
fit <- mod$sample(data = dat)