# Dynamic panel data models with Stan?

#1

It’s well known that in panel data models (that is, models with observations across individuals and time), if you have individual intercepts/slopes, then including a lagged dependent variable on the right hand side is a big no-no (potentially enormous biases).

To see this, imagine you have

y_{it} = \alpha + u_{i} + \delta y_{it-1} + X_{it}\beta + \epsilon_{it}

with u_{i} \sim \text{Normal}(0, \sigma_{u}) being the random effects. By Gauss-Markov, these must be uncorrelated to X_{it} and y_{it-1} in our statistical model. But u_{i} applies to all observations of y_{i}, and so by construction E[u_{i} y_{it}]\neq 0. This typically inflates our estimate of \delta and shrinks our estimate of \beta.

Non-Bayesians use the Arellano-Bond estimator, which is a GMM approach to solving this problem. The idea is to express it in first-differences (getting rid of the random effect) in which case you’re only dealing with the correlation of the change in the error and change in lagged dependent variable. Then use GMM, instrumenting the differenced lagged dependent variable with its own lags.

Does anyone here have experience fitting these models in Stan? Happy to just throw Bayesian IV at it, but the Frequentist counterpart to this is less efficient than Arellano Bond. Ideas?

#2

Hi Jim,

Can you expand a bit on exactly what the model you want to fit it? Is it the one in the display?

#3

Haven’t done these models personally but I tried a little experiment since it was quick to do, and you were right I can’t recover the parameters.

generate_panel_data <- function(delta, beta, x) {

N <- length(x)
u <- rnorm(1)
y <- rep(NA, N)
y[1] <- u + beta*x[1] + rnorm(1)
for(n in 2:N) {
y[n] <- u + delta*y[n-1] + beta*x[n] + rnorm(1)
}

tibble(y = y, u = u, lag_y = lag(y), x = x)
}

data <- generate_panel_data(1.2, 3.0, rnorm(20))

fit <- stan(file = "panel.stan",
data = list(N = nrow(data),
x_lead = data$x %>% head(nrow(data)-1), y_lead = data$y %>% head(nrow(data)-1),
y_lag = data$y %>% tail(nrow(data)-1)), chains = 1) > fit Inference for Stan model: panel. 1 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=1000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat u -0.73 0.02 0.34 -1.38 -0.96 -0.74 -0.51 -0.08 431 1.00 delta 0.83 0.00 0.00 0.82 0.83 0.83 0.83 0.84 470 1.00 beta 0.45 0.01 0.26 -0.03 0.28 0.45 0.62 0.97 577 1.00 lp__ -48.11 0.07 1.30 -51.53 -48.70 -47.72 -47.18 -46.75 340 1.01  data { int N; vector[N-1] x_lead; vector[N-1] y_lead; vector[N-1] y_lag; } parameters { real u; real delta; real beta; } model { y_lead ~ normal(u + delta*y_lag + beta*x_lead, 1.0); }  #4 Yep @Daniel_Simpson here’s some simulated data etc.  library(rstanarm); library(tidyverse) # Simulate some fake data delta <- 0.5 sigma_u <- 2 sigma_e <- 1 I <- 100 T <- 10 initial_values <- data_frame(individual = 1:I, initial_y = rnorm(I), u = rnorm(I, 0, sigma_u)) simulated_panel <- initial_values %>% group_by(individual) %>% do({ simulate <- function(x) { out <- data.frame(y = rep(NA, T), time = rep(NA, T), u = x$u)
for(t in 1:T) {
out$time[t] <- t if(t == 1) { out$y[t] <- rnorm(1, delta * x$initial_y + x$u, sigma_e)
} else {
out$y[t] <- rnorm(1, delta * out$y[t-1] + x$u, sigma_e) } } return(out) } as.data.frame(simulate(.)) }) %>% group_by(individual) %>% mutate(lagged_y = lag(y)) %>% filter(!is.na(lagged_y)) cor(simulated_panel$lagged_y, simulated_panel$u) test_fit <- stan_lmer(y ~ lagged_y + (1 | individual), data = simulated_panel, iter = 1000, cores = 4) bayesplot::mcmc_dens(as.data.frame(test_fit), pars = "lagged_y") + geom_vline(aes(xintercept = delta)) + annotate("text", y = 15, x = delta + 0.05, label = "true coef") + xlim(0, max(as.data.frame(test_fit)$lagged_y))



#5

Man I wish we had R syntax highlighting on the forums!

#6

Arya – this isn’t repeated observations across multiple individuals it it?

#7

The simulation is recursive but the Stan program isn’t. Is that a problem?

#8

No just a single individual, but without the \alpha term. I figured I’d start simple. The posterior pairs plots don’t actually look bad for that model or show any non-identifiabilities.

#9

Just bumping this in case the right person missed it. Seems too common a problem to not have been discussed on the forum before.

#10

You should look at the R package ctsem, got continuous time SEM. This allows you to do dynamic systems with either frequentist or Bayesian, Bayesian is done with Stan.

The simplest version of the dinámica system is the panel model.

#11

I never took the time to quite wrap my head around the Arellano Bond thing, but the poor performance of the rstanarm spec was eye opening. As Mauricio said, ctsem provides a front end for specifying hierarchical time series / state space models in Stan. I took a quick look and it seems to behave as you want, though performance is much slower than it could be for a perfectly observed first order model like you have (since there is no need to integrate out the latent states).

install.packages("devtools")
library(devtools)
install_github("cdriveraus/ctsem")

library(rstanarm); library(tidyverse); library(ctsem)

# Simulate some fake data
delta <- 0.5
sigma_u <- 2
sigma_e <- 1
I <- 100
T <- 10

initial_values <- data_frame(individual = 1:I,
initial_y = rnorm(I),
u = rnorm(I, 0, sigma_u))

simulated_panel <- initial_values %>%
group_by(individual) %>%
do({
simulate <- function(x) {
out <- data.frame(y = rep(NA, T),
time = rep(NA, T),
u = x$u) for(t in 1:T) { out$time[t] <- t
if(t == 1) {
out$y[t] <- rnorm(1, delta * x$initial_y + x$u, sigma_e) } else { out$y[t] <- rnorm(1, delta * out$y[t-1] + x$u, sigma_e)
}
}
return(out)
}
as.data.frame(simulate(.))
}) %>%
group_by(individual) %>%
mutate(lagged_y = lag(y)) %>%
filter(!is.na(lagged_y))

cor(simulated_panel$lagged_y, simulated_panel$u)

#ctsem model
ctm <- ctModel(n.manifest = 1,n.latent = 1,LAMBDA = diag(1), type = 'standt',
MANIFESTMEANS=matrix(0,1,1),CINT=matrix('u',1,1),
MANIFESTVAR=diag(1e-6,1),
T0VAR=matrix('stationary',1,1),
manifestNames = 'y')

#set all parameters except the initial states and intercept fixed across subjects
ctm$pars$indvarying[!ctm$pars$matrix %in% c('T0MEANS','CINT')] <- FALSE

ctm$subjectIDname <- 'individual' #fit using optimization ctf <- ctStanFit(as.data.frame(simulated_panel),ctstanmodel = ctm, chains=6,optimize=TRUE,verbose=0,iter=300,deoptim = F,isloops = 0,issamples = 100) summary(ctf) #fit with sampling ctf <- ctStanFit(as.data.frame(simulated_panel),ctstanmodel = ctm,chains=6,iter=300) cat(ctf$stanmodeltext) #to view the stan model -- lots of unnecessary stuff for this problem though!