Why would modeling varying intercepts change significance of pop. level effects?

dylanjm · June 18, 2019, 12:53am

Operating System: macOS Sierra 10.12.6
brms Version: 2.9.0

I am trying to determine if adding a new solver to our codebase has significantly changed simulation_alive_times of our regression tests. We have 202 unique regression tests, that vary widely in simulation time. I have run our baseline_case three times and our baseline_noexodus test three times. This results in 1,212 observations equally balanced.

I have decided that the response family should be Gamma(link = 'log') given the positive, skewed nature of the data.

My question is that when I use a multilevel model, accounting for repeated measures for the regression tests, my posterior distribution shows a seemingly promising decrease in simulation_alive_time for the baseline_noexodus case.

I have also noticed that my eff.sample is a lot lower on the MLM model, the diagnostics don’t look as promising but the pp_check fits closer to my data. What would cause this? I have included my model output below.

test <- brm(bf(simulation_alive_time ~ intervention),
            data = compare_dat,
            family = Gamma(link = 'log'),
            prior = prior(normal(0, 1), class = "b"),
            iter = 11000, warmup = 5500, chains = 5, thin = 5,
            cores = 24)

summary(test)

 Family: gamma 
  Links: mu = log; shape = identity 
Formula: simulation_alive_time ~ intervention 
   Data: compare_dat (Number of observations: 1126) 
Samples: 5 chains, each with iter = 11000; warmup = 5500; thin = 5;
         total post-warmup samples = 5500

Population-Level Effects: 
                              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                         8.56      0.05     8.46     8.66       5528 1.00
interventionbaseline_noexodus    -0.02      0.07    -0.16     0.12       5407 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
shape     0.69      0.02     0.65     0.75       5855 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

test1 <- brm(bf(simulation_alive_time ~ intervention + (1|basename)),
            data = compare_dat,
            family = Gamma(link = 'log'),
            prior = prior(normal(0, 1), class = "b"),
            iter = 11000, warmup = 5500, chains = 5, thin = 5,
            cores = 24)

 Family: gamma 
  Links: mu = log; shape = identity 
Formula: simulation_alive_time ~ intervention + (1 | basename) 
   Data: compare_dat (Number of observations: 1126) 
Samples: 5 chains, each with iter = 11000; warmup = 5500; thin = 5;
         total post-warmup samples = 5500

Group-Level Effects: 
~basename (Number of levels: 190) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     1.33      0.07     1.21     1.47        291 1.01

Population-Level Effects: 
                              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                         7.78      0.10     7.61     7.97        129 1.03
interventionbaseline_noexodus    -0.14      0.01    -0.15    -0.12       5204 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
shape    54.21      2.46    49.71    59.19       4857 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

test <- add_criterion(test, "loo")
test1 <- add_criterion(test1, "loo")
Warning message:
Found 5 observations with a pareto_k > 0.7 in model 'test1'. It is recommended to set 'reloo = TRUE' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 5 times to compute the ELPDs for the problematic observations directly. 
loo_compare(test, test1)

elpd_diff se_diff
test1     0.0       0.0
test  -2590.3      39.6

time.hyp <- hypothesis(test,
                       hypothesis = c("exp(Intercept + interventionbaseline_noexodus) - exp(Intercept) = 0"))

mcmc_areas(time.hyp$samples, prob = .95)

time1.hyp <- hypothesis(test1,
                       hypothesis = c("exp(Intercept + interventionbaseline_noexodus) - exp(Intercept) = 0"))

mcmc_areas(time1.hyp$samples, prob = .95)

I’m very new to Bayesian and library(brms) so if I have made a lapse in judgement or it’s obvious I’m not quite getting something, please help me understand.

dylanjm · June 18, 2019, 2:31pm

I never seem to get any responses when I post here. I must not be part of the Bayesian cool kids club. Is there another Bayesian forum online that I might find more success in?

paul.buerkner · June 18, 2019, 2:36pm

Hi, it is mostly me who is replying to brms related questions and I get a lot of them, but eventually I am replying to all questions with the brms tag (I may overlook brms related questions with other tags though).

With regard to your question, I would strongly favor the multilevel model, which you can likely also see when running model comparison for instance via loo.

You may need to set some more informative prior on the sd parameter of the random effect to imporve convergence, but overall the convergence looks ok.

You have asked another question in the title of your post but I don’t see this repeated in your post. Or did I overlook it? Can you perhaps clarify on this.

dylanjm · June 18, 2019, 2:50pm

Hi Paul, thanks for responding. I didn’t realize it was mostly only you responding to brms related questions. Apologies for my impatience.

With regard to your question, I would strongly favor the multilevel model, which you can likely also see when running model comparison for instance via loo .

This was my thoughts as well, to be honest I’m having trouble interpreting the sd parameter of the random effects are in relation to my model. How would I put it into an English statement? I think misunderstanding relates to the question in my post title.

Why would modeling varying intercepts change significance of pop. level effects?

At the bottom of my post you can see I run a hypothesis on the difference in means between my basecase and noexodus case for both models.

The posterior distribution of the model with no random effects, determines there is no likely difference between both cases.

The posterior distribution of the model including random effects, determines there is a likely difference (albeit small) between both cases.

I’m wondering why that is and how I should interpret the random effects model?

Would I say something like:

“We can see that when controlling for repeated measures there is a likely speed increase when using the noexodus solver.”

All in all, I’m having trouble visualizing in my mind what exactly the random intercept looks like in an ANOVA like model which makes it hard to think of a prior for that parameter.

nerutenbeck · June 18, 2019, 3:29pm

This sounds right to me - the random intercept models the random variation in speed between basenames. Within each basename, the population effects then show adjustments from the random intercept.

This was my thoughts as well, to be honest I’m having trouble interpreting the sd parameter of the random effects are in relation to my model.

Then the standard deviation of the random intercepts is the amount of baseline between-basename variation that you see before you account for any treatment effects (basecase vs noexodus).

mike-lawrence · June 18, 2019, 3:42pm

I don’t think your experience has anything to do with social dynamics. Posts and response here are largely voluntary donations of time, and now that Stan is getting more popular and the forum is getting more active, the time of the more expert folks is spread more thin, so it’s inevitable that there will be longer delays in responses or even no responses (as I’ve experienced too). Ideally with forums like this there is a constant flow of folks levelling-up their expertise so that increased traffic is matched by increased numbers of folks with the know-how to help, but it seems that Stan’s rapid popularity growth has thwarted this ideal.

I do wonder if you might try stats.stackexchange.com for more conceptual questions like this?

dylanjm · June 18, 2019, 3:43pm

Thank you for the response. I understand what you’re saying, so if I wanted to see if noexodus decreased run time as well as spread compared to the baseline, I could model that as well?

dylanjm · June 18, 2019, 3:46pm

Hi Mike, you’re right and that is something I can appreciate. I do want to be respectful of everyone’s time. I’m particularly fascinated in Bayesian Stats, but I am mostly self-taught. So often times my questions are more conceptual than programming based. I figured I would post on CrossValidated after a few days of no responses here, but I didn’t want to be too quick to cross-post to a bunch of websites.

nerutenbeck · June 18, 2019, 3:48pm

You could, though that would be a different model specification. I am still fuzzy on the details of how to do that in brms for a gamma model.

paul.buerkner · June 25, 2019, 7:07am

You need to control for repeated measures anyway as this structure is part of your data and should be modelled anyway. From this perspective, I argue that you shouldn’t report this fixed effects model at all as it (given that your data has multilevel structure) doesn’t make much sense to begin with. The SD is indeed hard to interprete as it happens on the log-scale (link = log), but this is just a property of any GLM using a non-identity link function.

dylanjm · June 25, 2019, 8:28pm

Hi Paul, thanks for a follow up. I’ve been able to make swell progress on understanding this a bit more since my post. It has also come to my attention that I might have a more nested structure than just repeated measures.

I have created a fake dataset to illustrate what I mean, but my data is organized the following way:

10 cases where simulation time was observed
3 categories with unbalanced cases nested inside (different # of cases per category)
3 times the cases were observed per intervention
2 interventions

The question is, does intervention have an effect on simulation time?

Here is my fake data to show what I mean:

library(tidyverse)

set.seed(123)

fake_dat <- tibble(intervention = as.factor(rep(c("baseline", "noexo"), each = 15)), 
                   run_num = as.factor(rep(paste0("run", 0:2), each = 5, times = 2)), 
                   category = as.factor(rep(c("cat1", "cat1", "cat2", "cat3", "cat3"), 6)), 
                   case = as.factor(rep(paste0("case", 1:5), 6)), 
                   sim_time = rgamma(30, c(5, 8), c(0.01, 0.02)))

fake_dat
#> # A tibble: 30 x 5
#>    intervention run_num category case  sim_time
#>    <fct>        <fct>   <fct>    <fct>    <dbl>
#>  1 baseline     run0    cat1     case1     339.
#>  2 baseline     run0    cat1     case2     556.
#>  3 baseline     run0    cat2     case3     163.
#>  4 baseline     run0    cat3     case4     393.
#>  5 baseline     run0    cat3     case5     887.
#>  6 baseline     run1    cat1     case1     441.
#>  7 baseline     run1    cat1     case2     222.
#>  8 baseline     run1    cat2     case3     176.
#>  9 baseline     run1    cat3     case4     747.
#> 10 baseline     run1    cat3     case5     426.
#> # … with 20 more rows

Previously, I was modeling repeated measures using (1 | case) and getting decent results as you mentioned, I’m curious if I’m obligated to account for the repeated runs within each intervention?

I suppose it would come down to model diagnostics and fit, but if I was do to so, how would the syntax look?

# Orginal Model
fake.brm <- brm(sim_time ~ intervention + (1 | case), 
                data = fake_dat, family = Gamma(link = 'log'))

# Account for variance between runs
fake.brm2 <- brm(sim_time ~ intervention + (run_num | case), 
                 data = fake_dat, family = Gamma(link = 'log'))

# or would it be this?

fake.brm3 <- brm(sim_time ~ intervention + (1 | case) + (1 | run_num), 
                 data = fake_dat, family = Gamma(link = 'log'))

I’m not really sure variance between category matters that much.

I appreciate your insight and recognize you are very busy. I have read through your papers and documentation explaining MLM and have read through library(lme4) documentation, but am struggling to recognize nested structure and the syntax to go along with it.

There is also the problem of some of these cases failing mid-run resulting an artificially low sim times, which leads me to believe a censored model might be of use, but that’s a whole other question, and not really in my purview at the moment. Baby steps :)

Thanks for your time.

MegBallard · June 27, 2019, 4:50am

I find thinking about WHAT a model is doing helps to figure out what approach you should start with.

So part of the goal/benefit of using a mixed/multilevel model is to take advantage of the ability to partition out the variability that is caused by factors relating to experimental structure. (different sampling sites, different family groups, different years, different individuals).

The way I prefer to think about fixed/population effects (which is certainly influenced by the kinds of questions I tend to ask) is to ask myself if I want an estimate of the VALUE of the response for each level of a group? So if any of your group level effects are actually interesting to you with respect to estimating their value, they should probably be treated as fixed/population effects.

To the extent that you can properly account for the structure of an experiment in the model formula, you should.

Outside of a bayesian framework there are rules of thumb about what factors you can or should model as group level effects. These I think are relaxed in a bayesian framework (or in particular Stan) because of math stuff I don’t understand related to estimating equations performing poorly with fewer than 4-5 levels with in a group creating convergence problems.

So I can’t even begin to know if computer simulations create variation per individual run.

First , My recommendation would be to code/label your factors such that each level has a unique identification. For example in your data above cat3 appears several times in run1 and in run0.
If run0 was coded like this base_run0 and noexo_run0 then they would be uniquely coded within their treatment.

if you coded each factor level in line 5 explicitly it would look like this.

5 baseline     base_run0   base_run0_case5_ cat3     base_run0_case5     887.

In this way your nested structure is explicit and writing a formula with it will be less difficult and also will make it more clear to you the structure within your data. The little tidbit I see above makes it look like your data is structure something like this:

Intervention (2 levels)
- run_number (3 levels?)
  - case (5 levels)
    - category (3 levels)

Now depending on how many observations you have at the lowest level of your structure certain formulas may be more or less appropriate.

Your model could be as complicated as you want it to be. And there are different opinions on how complex a model you should start with. (Some people say as complex as your data can support, and in a case where you are simulating data that could be pretty complex.) I always suggest looking through the GLMM FAQ by Ben Bolker (one of the authors of lme4) which is applicable to many platforms under which you could conduct analysis of mixed models.

You can also specify the model in multiple ways and then use model selection techniques to help decide what way to go (or combine them). That is Well outside my area of knowledge.

You could try including all the grouping factors and if the levels are named explicity you don’t have to worry about how to write the nested model. You can just write the groups as if they are not nested.

fake.brm2 <- brm(sim_time ~ intervention + (run_num | case), 
                 data = fake_dat, family = Gamma(link = 'log'))

is not the correct way to add run_num as a group level effect.

Model 1
sim_time ~ intervention + (1 | run_num) + ( 1 | case)

in brms you can write this more easily

sim_time ~ intervention + ( 1 | run_num + case)

Model 1 has intercepts that can vary for each level and slopes that are the same for each level. If you wanted to model varying slopes you would have to put in something like

Model 2
sim_time ~ intervention + (1 + intervention | run_num) + ( 1 | case)

Model 2 has slopes and intercepts of run_num varying, but slopes of case do not.

I won’t continue but I would start with model 1 or possibly add another group level term (1 | category) especially if there is more than one observation of sim time for each level of category. Or alternatively build your model from simplest to most complicated
sim_time ~ intervention
sim_time ~ intervention + (1 | run_num)

and so on.

dylanjm · June 27, 2019, 8:23pm

Thank you so much for this. You explained it in way that I really was able to understand. I appreciate it :D

torkar · December 22, 2019, 8:07am

And only to add to @MegBallard’s post, once you have a number of candidate models that you feel are sane, then do model comparison using LOO to get indications which model is preferable concerning out-of-sample prediction: https://mc-stan.org/loo/articles/loo2-example.html

Even if LOO would prefer a more complex model, that does not mean that you automatically should pick that one (after all, having a conceptually simpler model has its advantages too!)

You could use WAIC also but I tend to like LOO since it gives you more diagnostics, which has saved me many times :)

Topic		Replies	Views
Best Prior for Varying Intercept in Logistic Regression modeling Penalty Shot Success Rate Modeling	6	777	February 15, 2019
Hierarchical Multinomial Model taking days to run brms performance , multinomial-response , hierarchical-model	14	1674	April 9, 2020
Checking interpretation of sd parameter brms interpret-results	15	2688	July 11, 2019
ANOVA - random effects on intercept and/or slope? Modeling	18	2692	January 2, 2020
Results from monotonic effect versus dummy variables in brms Modeling	27	1621	March 14, 2020

Why would modeling varying intercepts change significance of pop. level effects?

Related topics