Question about posterior distribution over intercept parameter

Hi there. I have finished fitting a multilevel regression model (binary response, categorical predictors). The issue I am having is whether I should be concerned about the apparent spread of the posterior distribution of the intercept parameter when I plot the posterior model (using mcmc_areas). I have attached the output:
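For reference, the plot was produced with something along these lines (a sketch; the exact bayesplot call I ran may have differed slightly):

library(bayesplot)

# Posterior density areas for the population-level (b_) parameters;
# as.matrix() extracts the posterior draws from the brmsfit object.
mcmc_areas(as.matrix(comp_glmm2), regex_pars = "^b_")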

The posterior distribution over the intercept seems spread far too widely across parameter values (ranging from roughly -10 to 6), but I am not sure whether this is something to be concerned about. As far as I can tell, the posterior parameter estimates do not look abnormal (below), and neither do the tails of the other parameters' posteriors. On the other hand, if this is something I should be concerned about, how might I go about fixing it?

Here is the code I used to specify the model:

comp_glmm2 <- brm(
  values_anna_bernoulli ~ age_group * speaker + order_comp_prod +
    (speaker | participant_id) + (1 | comp_trial),
  data = wepc_comp,
  family = bernoulli,
  chains = 4,
  iter = 10000,
  warmup = 2000,
  control = list(adapt_delta = 0.9999),
  save_pars = save_pars(all = TRUE),
  cores = 4,
  seed = 31
)

Here is the model output:

Family: bernoulli
Links: mu = logit
Formula: values_anna_bernoulli ~ age_group * speaker + order_comp_prod + (speaker | participant_id) + (1 | comp_trial)
Data: wepc_comp (Number of observations: 236)
Samples: 4 chains, each with iter = 10000; warmup = 2000; thin = 1;
total post-warmup samples = 32000

Group-Level Effects:
~comp_trial (Number of levels: 2)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.89 1.08 0.02 3.80 1.00 9377 14010
~participant_id (Number of levels: 61)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.59 0.45 0.80 2.59 1.00 9772 14781
sd(speakerp1p2) 0.94 0.65 0.05 2.44 1.00 4707 10475
cor(Intercept,speakerp1p2) 0.14 0.52 -0.89 0.95 1.00 13363 14528

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.73 0.93 -2.59 1.18 1.00 11518 11729
age_group4 1.19 0.65 -0.01 2.53 1.00 13916 16754
speakerp1p2 0.06 0.56 -1.07 1.15 1.00 19916 20618
order_comp_prodComprehension2nd 0.07 0.57 -1.03 1.20 1.00 13526 18799
age_group4:speakerp1p2 -1.11 0.75 -2.61 0.38 1.00 19241 20712

OS: Mac (Catalina, 10.15.7)
brms: 2.15.0

I apologize if I have excluded pertinent information or have formatted the code or output unconventionally for this post; this is my first time posting on this forum. Thank you.

A few things. That number of iterations seems really high. Can you re-run with 2000 iterations and adjust your warmup accordingly?

Can you also show what priors are being used? You may need to specify a different prior on the intercept.

Hello, thank you for helping with this. I am using the default priors, including the default student_t(3, 0, 2.5) on the intercept.

I have rerun the model with iter = 2000 and warmup = 500.
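For reference, I did this along the following lines (a sketch; I could equally have called brm() again with the new settings, and the object name here is just for illustration):

# Re-run the same model with fewer iterations; all other arguments
# are carried over from the original fit.
comp_glmm2_short <- update(comp_glmm2, iter = 2000, warmup = 500)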

Family: bernoulli
Links: mu = logit
Formula: values_anna_bernoulli ~ age_group * speaker + order_comp_prod + (speaker | participant_id) + (1 | comp_trial)
Data: wepc_comp (Number of observations: 236)
Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
total post-warmup samples = 6000

Group-Level Effects:
~comp_trial (Number of levels: 2)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.89 1.03 0.02 3.75 1.00 2170 3245
~participant_id (Number of levels: 61)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.59 0.44 0.78 2.55 1.00 2446 3333
sd(speakerp1p2) 0.94 0.67 0.04 2.47 1.00 1305 2883
cor(Intercept,speakerp1p2) 0.15 0.52 -0.87 0.95 1.00 4680 3938

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.70 0.91 -2.56 1.09 1.00 2844 2800
age_group4 1.18 0.64 -0.03 2.55 1.00 4929 4302
speakerp1p2 0.06 0.57 -1.12 1.14 1.00 5396 4075
order_comp_prodComprehension2nd 0.07 0.56 -1.01 1.18 1.00 4572 4457
age_group4:speakerp1p2 -1.10 0.75 -2.58 0.39 1.00 5392 4520

My reasoning for using so many iterations and warmup draws was that I thought increasing the number of iterations increases the precision of the posterior estimates (by reducing Monte Carlo error), and that there was not much downside to doing this besides computation time and storing the samples. Is this not correct?

Also, after decreasing the number of iterations I am now seeing that there are three Pareto k values that are high. I know that if I am computing the LOOIC this issue can be resolved by using loo_moment_match. However, am I still able to trust the parameter estimates of the model even if there are high Pareto k values, or are high Pareto k values only a concern when computing the LOOIC?
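For reference, this is roughly the computation I have in mind (a sketch; moment matching requires the model to have been fit with save_pars(all = TRUE), which the original call above used):

# LOO with moment matching applied to the high-Pareto-k observations;
# comp_glmm2_short is the lower-iteration refit from the sketch above.
loo_fit <- loo(comp_glmm2_short, moment_match = TRUE)
print(loo_fit)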

So in general, if I have to use that many iterations, a few things come to mind:

  1. The data has something weird going on in it, so I need to check the raw data plots, like plotting values_anna_bernoulli against all the covariates.

  2. The model is misspecified in some way. Maybe center the data? Are the default priors sane?

The default priors might be too broad for your sample size. How many observations are in the data set?

Ah sure, that seems to be a possibility. There are 236 observations in the dataset. I was thinking that the issue may have something to do with the default prior, but I am uncertain which alternative priors to try and do not want to shoot in the dark too much. Any suggestions would be very greatly appreciated.

Can you dump out all the priors with the get_prior command and show them here?

Sure thing, here it is:
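(The call, for reference, using the formula and data from the model above:)

# get_prior() returns the default priors brms would use for this
# formula, data, and family.
get_prior(values_anna_bernoulli ~ age_group * speaker + order_comp_prod +
            (speaker | participant_id) + (1 | comp_trial),
          data = wepc_comp, family = bernoulli())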

                prior     class                            coef          group resp dpar nlpar bound       source
               (flat)         b                                                                            default
               (flat)         b                      age_group4                                       (vectorized)
               (flat)         b          age_group4:speakerp1p2                                       (vectorized)
               (flat)         b order_comp_prodComprehension2nd                                       (vectorized)
               (flat)         b                     speakerp1p2                                       (vectorized)
 student_t(3, 0, 2.5) Intercept                                                                            default
 lkj_corr_cholesky(1)         L                                                                            default
 lkj_corr_cholesky(1)         L                                 participant_id                        (vectorized)
 student_t(3, 0, 2.5)        sd                                                                            default
 student_t(3, 0, 2.5)        sd                                     comp_trial                        (vectorized)
 student_t(3, 0, 2.5)        sd                       Intercept     comp_trial                        (vectorized)
 student_t(3, 0, 2.5)        sd                                 participant_id                        (vectorized)
 student_t(3, 0, 2.5)        sd                       Intercept participant_id                        (vectorized)
 student_t(3, 0, 2.5)        sd                     speakerp1p2 participant_id                        (vectorized)

I've already tried using a cauchy(0, 1) prior on the intercept in place of the student_t, but the same issue arose. My current best guess is to use the dv ~ 0 + Intercept trick to specify the intercept as a population-level (b) parameter, so that I can place lower and upper bounds on its prior distribution. However, I do not necessarily think this is the best way to treat this issue…
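A minimal sketch of what I mean, in case it is unclear (the bounds of -5 and 5 are placeholders, not a recommendation; note that in brms the lb/ub bounds are set at the class level, so this bounds all of the b coefficients, which is part of why it feels unprincipled):

# With 0 + Intercept, the intercept is estimated as an ordinary
# population-level (b) coefficient, so a bounded prior can be used;
# the prior then applies to the actual intercept rather than the
# centered intercept brms uses by default.
comp_glmm3 <- brm(
  values_anna_bernoulli ~ 0 + Intercept + age_group * speaker +
    order_comp_prod + (speaker | participant_id) + (1 | comp_trial),
  data = wepc_comp,
  family = bernoulli,
  prior = set_prior("uniform(-5, 5)", class = "b", lb = -5, ub = 5),
  chains = 4,
  cores = 4,
  seed = 31
)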

I am not sure if this will help, but I use student_t(3, 0, 1) and I don't recall the reason for that. I might have some notes buried in some old R code on why that is.

Okay, thanks, let me know if you find anything useful from that. I played around with a few different student_t priors, decreasing the scale parameter (from 2.5 to, e.g., 1 and then 0.5). Here is the output for the student_t with scale 0.5:

I think I may end up going the route of using 0 + Intercept and setting a uniform prior on the parameter with lower and upper bounds. None of these solutions feels especially principled, though; this may just be the feeling that comes with the novelty of setting priors, haha.

Ok, for the student_t(3, 0, 1) it's basically this: we expect most values to be centered near zero, but the heavier tails (compared to normal()) allow for the possibility of some extreme values.

Generally for priors I look to what's been published before, field notes, plots of the data set, and defaults like student_t(3, 0, 1).

Can you post the plots of your data? That might help sort out the priors.


Hello, apologies for the delayed response to your message, and thank you for all the help regarding the student_t distribution on the intercept. I have in fact opted for normal priors: normal(0, 1) on the intercept and normal(0, 2) on the beta weights (b). These priors make sense given the typical sorts of response patterns in my field of study, and they also helped to regularize the posterior inferences. Attached are the mcmc_areas plots for the posterior model using the above-stated priors.
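For reference, the priors were specified along these lines (a sketch of the refit; the remaining arguments are as in the original call):

# normal(0, 1) on the intercept, normal(0, 2) on the beta weights.
priors <- c(
  prior(normal(0, 1), class = Intercept),
  prior(normal(0, 2), class = b)
)

comp_glmm2_normal <- brm(
  values_anna_bernoulli ~ age_group * speaker + order_comp_prod +
    (speaker | participant_id) + (1 | comp_trial),
  data = wepc_comp,
  family = bernoulli,
  prior = priors,
  chains = 4,
  cores = 4,
  seed = 31
)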

As for the plots of my data, I'm sorry, but what exactly do you have in mind?


If that looks better and the priors make sense from your domain knowledge, then you are good to go.

As for data plots, I'd plot things like values_anna_bernoulli against all your predictors, just to check that the priors on those make sense.
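Something along these lines, for example (a sketch; the column names are taken from your model formula):

library(ggplot2)
library(dplyr)

# Proportion of 1s in the response within each age_group x speaker
# cell; the same idea applies to order_comp_prod.
wepc_comp %>%
  group_by(age_group, speaker) %>%
  summarise(prop = mean(values_anna_bernoulli), n = n(), .groups = "drop") %>%
  ggplot(aes(x = age_group, y = prop, fill = speaker)) +
  geom_col(position = "dodge") +
  labs(y = "Proportion of 1s in values_anna_bernoulli")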


The priors do make sense within my domain of research, but I will be sure to plot the dependent measure against the predictors just to make sure, as I have not done so yet. Thank you for helping me work through this issue and getting me to think more critically about my intercept prior; it is greatly appreciated.


Of course! Anytime. Lots of folks around here helped me out when I was thinking along the same lines.
