Hierarchical Multivariate Ordinal Model Diagnostics (and tuning)

I am modelling data from 28 participants who rated “arousal” and “valence” on a combined pictographic (5 pictures) and Likert-type numeric (9-point) scale for each construct, under 7 conditions: auditory only, audio-visual, low-pass filtered auditory only (half of the subjects, n = 14), low-pass filtered audio-visual (n = 14), high-pass filtered auditory only (the other half, n = 14), high-pass filtered audio-visual (n = 14), and visual only (filtering not applicable). The stimuli consist of 65 “tokens” categorized as “unpleasant”, “neutral”, or “pleasant”; the number of tokens in each category is unbalanced.

So it is a nested design, with the participants split into two halves on one sub-condition (filter type), and a bivariate response.

I am trying to model the data as multivariate ordinal (cumulative(logit)).

I have uploaded the data here:
emo_modality_with_all_tokens.csv (609.3 KB)

My code for creating and running the model:

## List required packages ====
Pkgs <- c("tidyverse", "magrittr", "brms", 
           "rstan", "bayesplot",  
          "parallel",  "tidybayes")

### Load packages ====
lapply(Pkgs, require, character.only = TRUE)

## Set computing options ====
ncores <- detectCores()
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Read in data ----
Modality_Ord <- read.csv("emo_modality_with_all_tokens.csv", stringsAsFactors = FALSE) %>%
        select("Subject" = ss_id, "Group" = filtering_group, "Filter" = filter_type,
               "Condition" = condition, "Category" = category_label,
               "Token" = token, "Arousal" = arousal, "Valence" = valence ) %>%
        mutate( Category = recode_factor(Category,
                                         unpleasant = "Unpleasant",
                                         neutral = "Neutral",
                                         pleasant = "Pleasant")) %>%
        droplevels() %>%
        mutate(Condition = ifelse(Condition == "fAO" | Condition == "fAV", 
                                  paste(Condition, Filter, sep = "_"), Condition),
               Subject = factor(Subject),
               Condition = factor(Condition, 
                                  levels = c("AO", "fAO_LP", "fAO_HP",
                                             "AV", "fAV_LP", "fAV_HP",
                                             "VO")),
               Token = factor(Token),
               Arousal = factor(Arousal, ordered = TRUE),
               Valence = factor(Valence, ordered = TRUE)) %>%
        select(-Group, -Filter) %>%
        ungroup() %>%
        na.omit()
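
## Quick sanity check (illustrative addition, not in the original script):
## the design described above should show 7 Condition levels, 3 Categories,
## and the filtered conditions populated only for their half of the subjects.
str(Modality_Ord)
with(Modality_Ord, table(Condition, Category))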

## Multivariate model formula
Modality_ord_mv_fmla <- bf(mvbind(Arousal, Valence) ~ Category +
                                    Condition +
                                    Category:Condition +
                                    (1|o|Token) +
                                    (1|p|Condition) +
                                    (1|q|Condition:Subject))
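
## Optional cross-check (illustrative addition, not in the original script):
## get_prior() lists every prior this formula/family combination exposes,
## a useful reference before the set_prior() calls below.
get_prior(Modality_ord_mv_fmla, data = Modality_Ord, family = cumulative("logit"))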

## Specify priors
modality_ord_priors_mv <- c(
        set_prior("normal(0,2)",
                  class = "b",
                  resp = "Arousal"),
        
        set_prior("normal(0,6)",  
                  class = "Intercept",
                  coef = "",
                  resp = "Arousal"),
        
        set_prior("normal(0,0.125)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Condition",
                  resp = "Arousal"),
        set_prior("normal(0,0.125)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Condition:Subject",
                  resp = "Arousal"),
        set_prior("normal(0,0.125)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Token",
                  resp = "Arousal"),
        
        set_prior("normal(0,2)",
                  class = "b",
                  resp = "Valence"),
        
        set_prior("normal(0,6)",  
                  class = "Intercept",
                  coef = "",
                  resp = "Valence"),
        
        set_prior("normal(0,0.125)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Condition",
                  resp = "Valence"),
        set_prior("normal(0,0.125)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Condition:Subject",
                  resp = "Valence"),
        set_prior("normal(0,0.125)",
                  class = "sd",
                  coef = "Intercept",
                  group = "Token",
                  resp = "Valence")
)

# Compile and fit model 
Modality_Ord_MV_Mod <-
        brm(Modality_ord_mv_fmla,
            data = Modality_Ord,
            family = cumulative("logit"),
            prior = modality_ord_priors_mv,
            inits = 0,
            iter = 2000,
            warmup = 1000,
            chains = 2,
            cores = ncores,
            sample_prior = TRUE,
            save_all_pars = TRUE,
            control = list(adapt_delta = 0.99,
                           max_treedepth = 12)
        )

The model seems to converge fairly efficiently and the fits to the data for both responses seem good:

Arousal:
pp_check(Modality_Ord_MV_Mod, resp = "Arousal")

Valence:
pp_check(Modality_Ord_MV_Mod, resp = "Valence")

I am seeing some poor PSIS-LOO checks, however:

Looking at Arousal:

loo_mv_arl <- loo(Modality_Ord_MV_Mod, resp = "Arousal", 
                            save_psis = TRUE)

yrep_arl <- posterior_predict(Modality_Ord_MV_Mod, resp = "Arousal")

ppc_loo_pit_overlay(
        y = as.numeric(Modality_Ord$Arousal),
        yrep = as.matrix(yrep_arl),
        lw = weights(loo_mv_arl$psis_object)
)

Looking at Valence:

loo_mv_val <- loo(Modality_Ord_MV_Mod, resp = "Valence", 
                  save_psis = TRUE)

yrep_val <- posterior_predict(Modality_Ord_MV_Mod, resp = "Valence")

ppc_loo_pit_overlay(
        y = as.numeric(Modality_Ord$Valence),
        yrep = as.matrix(yrep_val),
        lw = weights(loo_mv_val$psis_object)
)

My understanding is that the LOO PITs should be uniform for continuous data; does that apply if the model is based on a continuous latent variable rather than a true continuous response?

Does this suggest the model is over-dispersed for both responses?

How might I go about further generalizing the model to account for the abundance of large errors?

Would adding a distribution over the disc parameter be of any help?

Edit: I tried adding disc to the model and saw no change in LOO-PIT.


Sorry, your question fell through despite being relevant and well written.

A small suggestion: for ordinal data, ppc_bars will produce clearer output than ppc_dens.
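
In brms that would be roughly (illustrative, reusing the model object from the post above):

pp_check(Modality_Ord_MV_Mod, resp = "Arousal", type = "bars")
pp_check(Modality_Ord_MV_Mod, resp = "Valence", type = "bars")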

I have very little experience with PSIS-LOO plots, but I’ll take a guess that the plot tells you that your data are more likely to be found above the median of the leave-one-out predictive distribution than below, i.e. there is a skew towards higher values that is not captured by the model.

However, since your overall pp_checks look good, it might be that the model actually overfits the data. But first, I would suggest you do some grouped pp checks (e.g. type = "bars_grouped"), slicing the data by various categories via the group parameter, as this may also reveal some discrepancies.
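
For example (illustrative; group must name a variable in the data used to fit the model):

pp_check(Modality_Ord_MV_Mod, resp = "Arousal", type = "bars_grouped", group = "Condition")
pp_check(Modality_Ord_MV_Mod, resp = "Valence", type = "bars_grouped", group = "Category")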

I know @avehtari is busy, but hope he can check my reasoning here.

Thank you, Martin!

I’ll look a little further with some bar plots and will share my results here.

PPC looks at the marginal distribution, while LOO-PIT looks at the conditional distributions, and thus there can be a big difference in the results.

For discrete data, see Predictive Model Assessment for Count Data. If there are many unique values in the discrete space, the issue is smaller. This is something we should fix for the LOO-PIT plots, but progress has been delayed by thinking about how to present the deviation from uniformity better than the current approach does.

So if I understand you correctly, the current implementation of the PIT in the bayesplot package doesn’t expect ties in the data, and if you want to use the LOO-PIT checks for discrete data, you need to compute the PIT yourself, using the approach in the linked paper. Is that correct?

It’s not about ties, but about 1) whether we compute the ECDF with < or <=, and 2) the fact that in the discrete case the PIT takes only a finite number of discrete values, which are not uniformly distributed. Case 2) is illustrated in the example below:

n <- 1000000; N <- 100; p <- 0.3
r <- rbinom(n, N, p)
hist(pbinom(r, N, p), breaks = 1000)

Note that I intentionally had 1000 bins. The issue is that this can be sensitive to the number and placement of bins, but with 10 bins we get

hist(pbinom(r, N, p), breaks = 10)

but this is skewed to the right.

A quick approximate correction is to use the midpoint between < and <=:

hist(0.5 * pbinom(r - 1, N, p) + 0.5 * pbinom(r, N, p), breaks = 10)

The randomized PIT is

hist(pbinom(r - 1, N, p) + runif(n) * (pbinom(r, N, p) - pbinom(r - 1, N, p)), breaks = 1000)


which looks uniform even with 1000 breaks.

So the current version is doing something wrong for discrete data, but how wrong depends on the case and on the number of bins in the histogram or the kernel width in the kernel density estimate. The above-mentioned paper has a correct non-randomized version which would be nice to have, but it’s been a resource issue to get it done.
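
Applied to the Arousal response from the earlier posts, a rough sketch of the midpoint and randomized versions could look like this (assuming weights() returns normalized log-weights, the loo default, and reusing yrep_arl, loo_mv_arl and Modality_Ord):

y <- as.numeric(Modality_Ord$Arousal)              # integer-coded ordinal response
w <- exp(weights(loo_mv_arl$psis_object))          # normalized PSIS weights, draws x observations

p_lt <- colSums(w * sweep(yrep_arl, 2, y, FUN = "<"))   # Pr(yrep <  y_i | loo)
p_le <- colSums(w * sweep(yrep_arl, 2, y, FUN = "<="))  # Pr(yrep <= y_i | loo)

pit_mid  <- 0.5 * (p_lt + p_le)                         # midpoint approximation
pit_rand <- p_lt + runif(length(y)) * (p_le - p_lt)     # randomized PIT

hist(pit_rand, breaks = 20)   # should look roughly uniform if the model is well calibrated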


My apologies for the delay in getting back to this thread. Thank you all for the valuable lessons and input!

I created some grouped bar pp_checks for each response, grouped first by condition, and then by stimulus category. To me, it looks like the model is doing a fairly good job, but I would very much appreciate any input on how I can further diagnose the model before we submit for publication.

Arousal by Condition:

Arousal by Category:

Valence by Condition:

Valence by Category:


Hi, I kind of missed this, so getting back after a while.

The plots don’t look terrible, but they also don’t look great. It appears that in many cases there is some issue with response category 5 not being predicted very well.

Also, both the “by Category” plots have quite a lot of the posterior intervals missing the data.

I think both should warrant at least some additional investigation. Is there a reason why 5 might be special?

In particular, the most useful pp checks tend to be those that split on a category that is NOT included in your model: say, the order in which the questions were displayed, or any other covariate you recorded but have not used. Using "stat_grouped" may also be useful for some statistic that the model doesn’t directly model, say the sd of the answers, or the maximum and minimum of answers per subject (that last check will likely require some additional processing code on your side; a rough sketch follows below).
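
A sketch of both ideas (illustrative; the object and column names come from the earlier posts, and if pp_check refuses the factor response for the stat check, the manual version below works directly on the predicted draws):

pp_check(Modality_Ord_MV_Mod, resp = "Arousal",
         type = "stat_grouped", group = "Category", stat = "sd")

## Per-subject maximum of the Arousal ratings, observed vs replicated
y    <- as.numeric(Modality_Ord$Arousal)
yrep <- posterior_predict(Modality_Ord_MV_Mod, resp = "Arousal")
subj <- Modality_Ord$Subject

obs_max <- tapply(y, subj, max)
rep_max <- apply(yrep, 1, function(r) tapply(r, subj, max))  # subjects x draws

ppc_intervals(y = as.numeric(obs_max), yrep = t(rep_max))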

But since we see problems already with groupings you have included in the model, one possibility is that your priors for the sd parameters are too narrow… Does the posterior for the sd parameters “fight” the prior? (I.e. the sd priors put most of the mass below roughly 0.125 * 2 = 0.25; if the posterior 95% interval includes 0.25, it is a hint that the prior might be substantially influencing the posterior.)
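
Concretely, something like (illustrative):

VarCorr(Modality_Ord_MV_Mod)        # posterior summaries of the group-level sds
prior_summary(Modality_Ord_MV_Mod)  # the priors they are being compared against
## then check whether the upper 95% limits of the sd estimates approach 0.25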

Does that make sense? Best of luck with the publication!

Thank you @martinmodrak!

I’ll do some further interrogation of the model.

For the priors on the sd parameters, my thinking was that since I am using a logit link, I’d be putting most of the mass below exp(0.125 * 2) = exp(0.25) ≈ 1.28. All of the posterior intervals have an upper limit below 1, so I thought I might be safe there.

Do only the priors on the regression coefficients (class b) go through the logit link, or does the link function apply to the sd parameters as well?

The link function applies to the complete linear predictor (sum of intercept and all effect terms). The sd_XXX parameters (and all others) are reported in the model summary on the log odds scale.

Note that you are putting most of the mass of the sd parameter below 0.125 * 2, which in turn means that most of the varying intercepts will be within +/- ((0.125) * 2) * 2 = +/- 0.5 (this is an overestimate - it is unlikely that both sd AND the varying intercept are at the extreme tail at once, but it works for rough checks), so most of the varying intercepts will end up multiplying/dividing the (non-log) odds by at most exp(0.5).
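
Spelled out as a rough check:

sd_upper   <- 0.125 * 2      # ~95% of the prior mass on each sd is below 0.25
ranef_band <- sd_upper * 2   # varying intercepts mostly within +/- 0.5 on the logit scale
exp(ranef_band)              # ~1.65: the (non-log) odds change by at most this factor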

Does that make sense?