Addition of AR(1) process in brms model results in high pareto-k values from loo

Data and Models

I’m trying to run a series of multilevel time series cross-sectional models. I observe outcomes for political parties nested within countries over a decade. Depending on the variables I use, I have approximately 180 parties in 30 countries, measured over 10 years, yielding roughly 1,360 party-country-year observations (not all parties are observed for the entire period). I have variables measured at both the party and country level. Party variables are a mix of time-varying and time-invariant, while country variables are all time-varying. The main independent variable of interest is time-invariant.

The outcomes are count variables with high levels of overdispersion. For example, one outcome ranges from 0 to 13,404,375 (but has very few zeros), with a mean of 97,553.32 and a variance of 250,760,871,628. The outcomes are long-tailed, with a handful of observations taking extreme values. I use negative binomial regression (Poisson fits extremely poorly, with low Bulk_ESS and Tail_ESS even with thousands of iterations).
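For reference, the overdispersion can be quantified with a quick variance-to-mean check (`y` here is a placeholder for the actual outcome column):

```r
# Rough overdispersion check: for a Poisson-distributed outcome
# the variance-to-mean ratio is about 1; values far above 1
# indicate overdispersion. `y` stands in for the actual outcome.
with(data, var(y) / mean(y))
# for the outcome above: 250760871628 / 97553.32, i.e. a ratio
# in the millions, so massive overdispersion
```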

Issues

I’ve managed to get the models to fit without any warnings, but I run into issues when I include an AR(1) process to account for temporal autocorrelation in the residuals. Specifically, including the ar(time = year, gr = party, p = 1) term results in most observations having extremely high Pareto k values when running loo. Including just the random effects already produces some problematic Pareto k values, and almost all observations are flagged as bad or very bad once the AR(1) process is added. I’ve looked at the observations with high values but cannot see any clear patterns (e.g. extreme or unusual values).
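For completeness, this is the kind of check I’ve been running to identify the flagged observations, using helper functions from the loo package (l.4 is the loo object for the full model shown in the code below):

```r
library(loo)

# List the observations flagged as bad or very bad (k > 0.7)
bad_ids <- pareto_k_ids(l.4, threshold = 0.7)

# Inspect the corresponding rows of the data for patterns
summary(data[bad_ids, ])

# All Pareto k values, and the standard diagnostic plot of
# k against observation index
k_vals <- pareto_k_values(l.4)
plot(l.4)
```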

Moreover, the ar and shape parameters seem to interact strangely: both have estimation issues (high Rhat values), and the estimated shape parameter is much larger when the AR(1) process is included.

I’ve tried different priors, including the defaults for the AR(1) process, but it doesn’t appear to make much difference. The issue also persists in more complex models with random coefficients at the country and year level for the main independent variable of interest.

I have been unable to find any answers to this question on the forums. The literature on TSCS analysis strongly recommends that models include AR(1) terms, lagged dependent variables, or other terms to account for dynamics (e.g. Beck and Katz 2011), but unfortunately the key works using Monte Carlo simulations to evaluate TSCS models seem to simplify their analyses by leaving these terms out (e.g. Shor et al. 2007, Beck and Katz 2007), so they offer little guidance on the issue (references below). Much of the literature in this area focuses on panel data with repeated cross-sections, where it is unnecessary to account for temporal dependencies.

I’m a little unsure how to proceed here. I’ve read that the warnings can indicate misspecification or model flexibility. Is loo even reliable when considering autocorrelation and/or multilevel models? The model specification is theoretically motivated and consistent with prior work in the area, the models converge without issues (sometimes there are a handful of divergent transitions, but these go away when I increase adapt_delta), and the posterior predictive checks look good, so I’m not sure how seriously to take the warnings. I’m not too concerned about model comparison either - while it would be nice to make comparisons using loo, I plan to present a range of estimates across different model specifications.
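One alternative I’m considering, in case loo itself is the problem, is exact K-fold cross-validation, which refits the model on held-out folds and so does not rely on the importance sampling behind the Pareto k diagnostic (a sketch; far more expensive than loo):

```r
# Exact 10-fold cross-validation: refits each model 10 times,
# so this is slow, but it avoids importance sampling entirely
kf3 <- kfold(m3, K = 10)
kf4 <- kfold(m4, K = 10)
loo_compare(kf3, kf4)
```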


Code

The code below shows condensed versions of four models. The first two include only population effects; the last two add all the random intercepts. For each pair I show a specification with and without the ar term. I have simplified the formulas a little: each model includes a factor variable for time and 11 other independent variables, either binary indicators or scaled continuous variables.

I then show the output for loo and loo_compare for all models.

# Simple model with population effects only
m1 <- brm(
        y ~ log(x) +
        as.factor(year) +
        scale(x2) +
        ...,
    prior = c(prior(normal(0,1), class = "b"),
              prior(normal(0,1), class = "Intercept"),
              prior(exponential(1), class = "shape")),
    data = data, family = negbinomial,
    iter = 2000, warmup = 1000, cores = 16, chains = 2)

# + AR(1)
m2 <- brm(
        y ~ log(x) +
        as.factor(year) +
        scale(x2) +
        ... +
        # autoregressive term
        ar(time = year, gr = party, p=1),
    prior = c(prior(normal(0,1), class = "b"),
              prior(normal(0,1), class = "Intercept"),
              prior(exponential(1), class = "shape"),
              prior(normal(0.5, 0.5), class = "ar"),
              prior(exponential(1), class = "sderr")),
    data = data, family = negbinomial,
    iter = 2000, warmup = 1000, cores = 16, chains = 2)

# + random intercepts, no AR(1)
m3 <- brm(
        y ~ log(x) +
        as.factor(year) +
        scale(x2) +
        ... +
        # random part
        (1 |party) + (1 |country) + (1 |year) + (1|country:year),
    prior = c(prior(normal(0,1), class = "b"),
              prior(normal(0,1), class = "Intercept"),
              prior(exponential(1), class = "shape"),
              prior(exponential(1), class = "sd")),
    data = data, family = negbinomial,
    iter = 2000, warmup = 1000, cores = 16, chains = 2)

# Random intercepts + AR(1)
m4 <- brm(
        y ~ log(x) +
        as.factor(year) +
        scale(x2) +
        ... +
        # autoregressive term
        ar(time = year, gr = party, p=1) +
        # random part
        (1 |party) + (1 |country) + (1 |year) + (1|country:year),
    prior = c(prior(normal(0,1), class = "b"),
              prior(normal(0,1), class = "Intercept"),
              prior(exponential(1), class = "shape"),
              prior(normal(0.5, 0.5), class = "ar"),
              prior(exponential(1), class = "sderr"),
              prior(exponential(1), class = "sd")),
    data = data, family = negbinomial,
    iter = 2000, warmup = 1000, cores = 16, chains = 2)

l.1 <- loo(m1)
l.2 <- loo(m2)
l.3 <- loo(m3)
l.4 <- loo(m4)

l.1
l.2
l.3
l.4

loo_compare(l.1, l.2, l.3, l.4)

Output

m1

Computed from 2000 by 1360 log-likelihood matrix

         Estimate    SE
elpd_loo -15231.0  84.4
p_loo        31.4   5.6
looic     30462.1 168.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1359  99.9%   164       
 (0.5, 0.7]   (ok)          0   0.0%   <NA>      
   (0.7, 1]   (bad)         1   0.1%   13        
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

m2

Computed from 2000 by 1360 log-likelihood matrix

         Estimate     SE
elpd_loo -33073.8  931.0
p_loo     18783.9  855.4
looic     66147.6 1862.0
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)        1   0.1%   267       
 (0.5, 0.7]   (ok)         33   2.4%   49        
   (0.7, 1]   (bad)       166  12.2%   10        
   (1, Inf)   (very bad) 1160  85.3%   0         
See help('pareto-k-diagnostic') for details.

m3

Computed from 2000 by 1360 log-likelihood matrix

         Estimate    SE
elpd_loo -14239.7  86.0
p_loo       268.2  12.2
looic     28479.4 171.9
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1219  89.6%   295       
 (0.5, 0.7]   (ok)        104   7.6%   67        
   (0.7, 1]   (bad)        34   2.5%   22        
   (1, Inf)   (very bad)    3   0.2%   10        
See help('pareto-k-diagnostic') for details.

m4

Computed from 2000 by 1360 log-likelihood matrix

         Estimate    SE
elpd_loo -17632.2 198.3
p_loo      4403.3 154.1
looic     35264.4 396.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)        7   0.5%   340       
 (0.5, 0.7]   (ok)         89   6.5%   48        
   (0.7, 1]   (bad)       426  31.3%   10        
   (1, Inf)   (very bad)  838  61.6%   0         
See help('pareto-k-diagnostic') for details.

> loo_compare(l.1, l.2, l.3, l.4)
   elpd_diff se_diff
m3      0.0       0.0
m1   -991.3      37.8
m4  -3392.5     154.5
m2 -18834.1     905.0

Here are the ar and shape parameters for the four models (in order, other output omitted):

Correlation Structures:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS    Model
ar[1]     0.76      0.03     0.71     0.81 1.01      298      641     m2
sderr     0.92      0.02     0.88     0.96 1.02      178      718     m2
ar[1]     0.70      0.07     0.58     0.83 1.00      191      396     m4
sderr     0.44      0.02     0.40     0.49 1.01      144      335     m4


Family Specific Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS   Model   AR(1)
shape     0.82      0.03     0.77     0.88 1.00     3010     1428     m1    FALSE
shape    13.36      2.16     9.57    18.06 1.06       74      168     m2    TRUE
shape     3.34      0.14     3.07     3.62 1.00     1366     1440     m3    FALSE
shape    12.98      1.90     9.88    17.22 1.01      101      301     m4    TRUE

Thanks for taking the time to read this. Any advice would be much appreciated!


References

Beck, Nathaniel, and Jonathan N. Katz. 2007. “Random Coefficient Models for Time-Series—Cross-Section Data: Monte Carlo Experiments.” Political Analysis 15 (2): 182–95.
———. 2011. “Modeling Dynamics in Time-Series–Cross-Section Political Economy Data.” Annual Review of Political Science 14 (1): 331–52. https://doi.org/10.1146/annurev-polisci-071510-103222.
Shor, Boris, Joseph Bafumi, Luke Keele, and David Park. 2007. “A Bayesian Multilevel Modeling Approach to Time-Series Cross-Sectional Data.” Political Analysis 15 (2): 165–81.


Linking to some other relevant posts:

This post discusses potential issues with AR(1) and negative binomial: Brms: fitting negative binomial model with an AR-correlation structure - #2 by martinmodrak

This post discusses issues of high pareto-k values when estimating hierarchical negative binomial models: Hierarchical negative binomial - pareto k > 0.5

Can’t really delve deep into this due to time constraints, but if you inspect the generated code (via make_stancode), I would expect the AR term to be represented by a set of nuisance parameters, one per observation, representing the autoregressive error, because there is no closed form. If that’s the case, those nuisance parameters depend strongly on their individual observations essentially by definition. The Pareto k in loo roughly checks (in my limited understanding) whether the posterior depends strongly on a single observation, and those nuisance parameters do, so you’ll get high Pareto k.
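To check this, you can print the generated Stan code for a stripped-down version of the model (formula abbreviated as in the post above) and look for a per-observation vector of latent error terms:

```r
library(brms)

# Print the Stan code brms generates for the negative binomial
# AR(1) model; if the above is right, the AR structure shows up
# as a latent error vector with one element per observation
make_stancode(
  y ~ log(x) + ar(time = year, gr = party, p = 1),
  data = data, family = negbinomial
)
```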

The short answer therefore is that for this type of model (with per-observation nuisance parameters) you cannot use loo easily. The longer answer is that you can IMHO try to compute loo anyway by integrating the nuisance parameters out, but that’s not straightforward (see Is LOO valid for models with missing outcome when using a complete case dataset that is a subset of the original data? - #7 by martinmodrak for a similar, but simpler case - I think 1D integration might not be enough in your case).

Best of luck with your model!
