Help with Bayesian regression modeling (beginner)

Hi all,
I am a PhD student in cognitive science with a main focus on linguistics. In the past years I have learned to fit some inferential statistical models such as LM, ANOVA, and lmer in RStudio. Recently I decided to start learning some Bayesian approaches so that I can fit a Bayesian regression model to my data. However, even though I have read some documentation, the procedure isn't clear to me.

Firstly, I will share with you an lmer model to explain what I am trying to fit:

model <- lmer(Total_duration_of_whole_fixations ~ TOV + Group + (1 | Media) + (1 | Participant) + (1 | CHR), na.action = "na.omit", data = data1)

TOV is a non-ordered categorical predictor with three levels: IDIOM, COMP, LEX.
Group is a non-ordered categorical predictor with two levels: GSPE and GCONT.
Media, Participant, and CHR are categorical random effects.

What I tried to do is fit a Bayesian mixed-effects regression in the following way:

pr <- prior(normal(0, 1), class = "b")
bayesian_mixed <- brm(Total_duration_of_whole_fixations ~ TOV * Group + (1 | Participant) + (1 | Media) + (1 | CHR), data = data1, prior = pr, cores = 4)
priors <- get_prior(Total_duration_of_whole_fixations ~ TOV * Group + (1 | Participant) + (1 | Media) + (1 | CHR), data = data1, family = gaussian())

priors
                    prior     class               coef       group resp dpar nlpar lb ub       source
                   (flat)         b                                                           default
                   (flat)         b          GroupGSPE                                   (vectorized)
                   (flat)         b           TOVIDIOM                                   (vectorized)
                   (flat)         b TOVIDIOM:GroupGSPE                                   (vectorized)
                   (flat)         b             TOVLEX                                   (vectorized)
                   (flat)         b   TOVLEX:GroupGSPE                                   (vectorized)
 student_t(3, 513, 317.3) Intercept                                                           default
   student_t(3, 0, 317.3)        sd                                                 0         default
   student_t(3, 0, 317.3)        sd                            CHR                  0    (vectorized)
   student_t(3, 0, 317.3)        sd          Intercept         CHR                  0    (vectorized)
   student_t(3, 0, 317.3)        sd                          Media                  0    (vectorized)
   student_t(3, 0, 317.3)        sd          Intercept       Media                  0    (vectorized)
   student_t(3, 0, 317.3)        sd                    Participant                  0    (vectorized)
   student_t(3, 0, 317.3)        sd          Intercept Participant                  0    (vectorized)
   student_t(3, 0, 317.3)     sigma                                                 0         default

However, there are still several points I don't get. First of all, I know that the get_prior function is important for setting the priors in the brm function. However, as my predictors are categorical, get_prior doesn't give any values in the output:

So the first question is: in the case of categorical variables, how do I choose and set the priors for the model? Do I have to keep them flat? (I set prior(normal(0, 1), class = "b") because I read to do so in a paper, but I am not sure about the meaning of this choice.) Also, in the case of continuous values, how do I insert the values into the model? Do I have to transform them with some calculation? And what about setting the Student-t and sigma values?

Second question: how do I understand whether the predictors in my model are "significant" or not? Which values of the summary should I look at, and how? I attach the summary(bayesian_mixed) output here:

summary(bayesian_mixed)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Total_duration_of_whole_fixations ~ TOV * Group + (1 | Participant) + (1 | Media) + (1 | CHR) 
   Data: data1 (Number of observations: 5236) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~CHR (Number of levels: 8) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    27.73     25.42     1.04    90.64 1.01      332      570

~Media (Number of levels: 123) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)   129.15      9.85   111.45   150.08 1.01      786     1371

~Participant (Number of levels: 44) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)   226.87     24.74   185.52   282.56 1.02      322      694

Population-Level Effects: 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            607.64     39.34   528.62   679.88 1.02      185      426
TOVIDIOM               0.05      0.99    -1.84     1.99 1.00     3225     1629
TOVLEX                 0.02      1.00    -1.89     1.94 1.00     3796     2925
GroupGSPE             -0.03      0.99    -1.96     1.89 1.00     3975     2897
TOVIDIOM:GroupGSPE     0.03      0.99    -1.92     1.90 1.00     3745     2831
TOVLEX:GroupGSPE      -0.03      1.00    -1.99     1.93 1.00     3601     2674

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   322.55      3.15   316.43   328.97 1.00     3111     2173

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Thank you for the help, I hope my explanation is clear.

Hi and welcome. Thanks for the detailed post. A couple of things will be helpful here: what OS and what version of Stan are you running? I am guessing you are using brms.

So a general recommendation for starting out:
Plot the data that's going into the model first, just as a quick visual QA/QC.
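For example, a quick first look (a minimal sketch, assuming your data frame is called data1 and the outcome column is named as in your lmer call) might be:

library(ggplot2)

# Quick visual QA/QC of the outcome before modelling:
# look at its overall distribution and check for odd or impossible values.
ggplot(data1, aes(x = Total_duration_of_whole_fixations)) +
  geom_histogram(bins = 50)

# A numeric summary is also worth a glance (range, missing values, etc.).
summary(data1$Total_duration_of_whole_fixations)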

Then start with the simplest model to see if it makes some kind of sense, like
bayesian_mixed <- brm(Total_duration_of_whole_fixations ~ TOV, data = data1, cores = 4, chains = 4)


For additional troubleshooting I really recommend a simulated data set. There are a number of R packages that will do this for you, or you can code up a function in R or Python to do it.

Using a simulated data set helps you check that the model recovers known true values. It also makes it easier to share the data (since it is simulated, not actual) with the broader community here to get additional help.
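For example, a minimal hand-rolled sketch (purely illustrative: the structure mimics your design, but all the "true" values here are made up, not taken from your data):

set.seed(1)

# Simulate a design similar to yours: a 3-level predictor, a 2-level group,
# and participant-level random intercepts.
n_participant <- 44
n_trial <- 30
d <- expand.grid(Participant = factor(1:n_participant), trial = 1:n_trial)
d$TOV <- factor(sample(c("COMP", "IDIOM", "LEX"), nrow(d), replace = TRUE))
d$Group <- factor(ifelse(as.integer(d$Participant) <= n_participant / 2, "GCONT", "GSPE"))

# "True" values we will try to recover (arbitrary choices).
b0 <- 600; b_idiom <- 50; b_lex <- 20; b_gspe <- -15
u_participant <- rnorm(n_participant, 0, 200)   # participant intercepts
sigma <- 300

d$y <- b0 +
  b_idiom * (d$TOV == "IDIOM") +
  b_lex   * (d$TOV == "LEX") +
  b_gspe  * (d$Group == "GSPE") +
  u_participant[as.integer(d$Participant)] +
  rnorm(nrow(d), 0, sigma)

# Fit the same model structure to the simulated data and check whether the
# posterior summaries recover b_idiom, b_lex, b_gspe, the participant SD, and sigma.
# fit_sim <- brm(y ~ TOV + Group + (1 | Participant), data = d, cores = 4, chains = 4)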


Hi, thank you for the reply. Yes, I am using brms (version 2.20.4) and I am running on Windows.
Can you suggest the best way or code to plot the data so that it is informative? I didn't mention that I am still learning RStudio, by the way.
Also, do you need me to share the output of bayesian_mixed <- brm(Total_duration_of_whole_fixations ~ TOV, data = data1, cores = 4, chains = 4) here?

Thank you a lot

Can you suggest some packages for simulating data, please?

Thank you again

Well, I am not a cognitive scientist but an arid-land ecologist (the mathy/staty/field kind), but I can look to see what is out there. For broadly setting up a workflow for brms (or Stan), it's something like this:

library(tidyverse)
library(ggplot2)
library(brms)

# Working directory should be set using Session -> Set Working Directory. Not hard coded. 
# Better practices suggest your file structure (folders) look something like this:
# .
# └── Project name/
#   ├── data/
#   │   ├── external
#   │   ├── interim
#   │   ├── processed
#   │   └── raw
#   ├── docs
#   ├── models
#   └── reports/
#       ├── images
#       └── graphs
nema <- read_csv("data/raw/WENNDEx_nematodes.csv",
                 guess_max = 300000)
nema

# Do the data types make sense? date, dbl, chr, int
print(nema, n=1, width = Inf)

I should be clear this is not some official Stan community workflow :) just what I do as a data/information/model type person.

Feel free to replace nema with something that makes more sense for you, that’s just the tibble (fancy dataframe) that I am working on.

As for plots before brms: is TOV numeric, integer, categorical, or something else?

Thank you for the explanation. Yes, the data types make sense to me (before going Bayesian I used lm and lmer models). The problem is that I cannot really understand how to interpret the values of summary(bayesian_mixed) (i.e. I cannot understand whether what I am looking at is significant or not).

Ok I’d plot the data first using something like

myfancydata %>% ggplot(aes(x = TOV, y = Total_duration_of_whole_fixations)) + geom_point()

For your categories, a boxplot of some sort using geom_boxplot, something like the sketch below.
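A minimal sketch (myfancydata is just a placeholder for your own data frame, and splitting by Group is optional):

myfancydata %>%
  ggplot(aes(x = TOV, y = Total_duration_of_whole_fixations, fill = Group)) +
  geom_boxplot()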

If all of those look good, I'd run a very simple model first. Then we can chat about how to read the output. The plots and the model output are both important for understanding the whole process.

In the Bayesian framework I don't worry about significance but focus on quantifying the uncertainty in my model parameters.

Okay, thank you a lot, it is really appreciated :)


Of course. Once you've gotten this one to run:
bayesian_mixed_1 <- brm(Total_duration_of_whole_fixations ~ TOV, data = data1, cores = 4, chains = 4)

maybe also run
bayesian_mixed_2 <- brm(Total_duration_of_whole_fixations ~ TOV + (1 | Group), data = data1, cores = 4, chains = 4)
if Group is what I think it might be, hopefully :)

then post those outputs.

There are some model diagnostics you will also want to run, but let's tackle these two things first.

Hi, sorry for the delay. Here is what I coded and the output:

library(ggplot2)
ggplot(data1, aes(x = TOV, y = Total_duration_of_whole_fixations, fill = Group, color = Group)) +
  geom_point(size = 1) +
  geom_jitter(alpha = 0.5, width = 0.3, height = 0.3)

[jitter plot of Total_duration_of_whole_fixations by TOV, colored by Group]

###FIRST MODEL###

bayesian_mixed_1 <- brm(data=data1, Total_duration_of_whole_fixations~TOV, cores=4, chains=4)
bayesian_mixed_1
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Total_duration_of_whole_fixations ~ TOV 
   Data: data1 (Number of observations: 5236) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   587.53      9.64   568.42   606.60 1.00     3200     2810
TOVIDIOM     45.42     13.51    18.81    72.26 1.00     3355     3007
TOVLEX       17.84     13.98    -8.81    46.45 1.00     3355     2791

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   410.65      4.10   402.47   418.79 1.00     3634     2872

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

###SECOND MODEL###
Actually, the Group variable is an independent variable in the mixed model that I use. Here I put Group in as a random effect, as you requested:

bayesian_mixed_2 <- brm(data=data1, Total_duration_of_whole_fixations~TOV + (1|Group), cores=4, chains=4)
bayesian_mixed_2
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Total_duration_of_whole_fixations ~ TOV + (1 | Group) 
   Data: data1 (Number of observations: 5236) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000
Group-Level Effects: 
~Group (Number of levels: 2) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)   113.99    125.79     1.61   432.68 1.29       11       21
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   563.02     69.10   428.66   688.88 1.26       12       21
TOVIDIOM     42.12     14.64    17.68    71.03 1.13       21       83
TOVLEX       17.22     13.02    -9.62    44.66 1.12      578     1225
Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   409.98      3.91   403.06   418.57 1.08       35      785

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning messages:
1: Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors.
2: There were 476 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See Runtime warnings and convergence problems

###THIRD MODEL###
In this case, I treat Group as a fixed effect:

bayesian_mixed_3 <- brm(data=data1, Total_duration_of_whole_fixations~TOV + Group, cores=4, chains=4)                                                                                                                                     
bayesian_mixed_3
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Total_duration_of_whole_fixations ~ TOV + Group 
   Data: data1 (Number of observations: 5236) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   594.50     11.28   572.43   616.25 1.00     4106     2642
TOVIDIOM     44.67     13.70    17.53    71.91 1.00     4283     3061
TOVLEX       17.38     13.88    -9.03    44.62 1.00     3859     3131
GroupGSPE   -13.19     11.22   -34.97     9.42 1.00     4814     2979
Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   410.72      3.88   403.20   418.21 1.00     5133     3164

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

I hope this information is useful, thank you!


Ohh, is TOV categorical, numeric, or integer? Sorry, I am not familiar with your data types.

Don't worry, I understand how complex it can be to make sense of someone else's data :)
Yes, TOV is categorical and non-ordered, with three levels. Also, Group is categorical and non-ordered, with two levels.

This is why I am struggling to understand how to visualize the data and how to set priors in the proper way: the get_prior function doesn't give me any values for the categorical variables.

OK, that helps. So you will need some domain knowledge that I don't have here. Do you want something like Group to be treated as a population-level ("fixed") effect, as a varying intercept, or as a varying intercept plus a varying slope? That determines whether you write something like + Group or (1 | Group); see the sketch below. Hopefully I am getting that right. I am sorry if you already worked this out and I missed it.
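A minimal illustration of those formula options in brms (just the syntax, using your variable names; whether each choice makes sense is the domain question):

# Group as a population-level ("fixed") effect:
brm(Total_duration_of_whole_fixations ~ TOV + Group, data = data1, cores = 4, chains = 4)

# Group with only a varying (random) intercept:
brm(Total_duration_of_whole_fixations ~ TOV + (1 | Group), data = data1, cores = 4, chains = 4)

# Group with a varying intercept and a varying slope for TOV:
brm(Total_duration_of_whole_fixations ~ TOV + (1 + TOV | Group), data = data1, cores = 4, chains = 4)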

But let’s look at model one. I added some code to make your output a little easier to read.

In your first model's output you will see the columns
l-95% CI and u-95% CI.
If that range includes 0, then you can say something like: TOVLEX, with a 95% uncertainty interval of (-8.81, 46.45), could have a negative, neutral (i.e. no), or positive effect on Total_duration_of_whole_fixations. Whereas TOVIDIOM, with a 95% uncertainty interval of (18.81, 72.26), has a positive effect on Total_duration_of_whole_fixations (though with a wide spread? whether that spread matters is a domain question).
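If you want to go beyond eyeballing the interval, one option (a small sketch; brms has other tools as well) is to quantify how much of the posterior is above or below zero, either with brms::hypothesis() or from the draws directly:

# Posterior evidence that each effect is greater than zero.
hypothesis(bayesian_mixed_1, "TOVIDIOM > 0")
hypothesis(bayesian_mixed_1, "TOVLEX > 0")

# Or compute it yourself from the draws (population-level columns are prefixed with "b_").
draws <- as_draws_df(bayesian_mixed_1)
mean(draws$b_TOVIDIOM > 0)   # share of posterior mass above zero
mean(draws$b_TOVLEX > 0)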

I tend towards very casual language for model descriptions which may not be great depending on your audience or needs. Finding some adjacent or interesting brms papers can give you a better idea of how to write that part up.

Does that make sense?

Regarding domain knowledge: in my specific case, I would like to understand whether there is an interaction between the TOV and Group variables, and if there isn't any interaction, to look at the main effects afterwards. TOV and Group are treated as fixed effects, without varying slopes. Regarding the random effects, I include random intercepts for the Media, Participant, and CHR variables. The code should be the following:

bayesian_mixed_2 <- brm(data = data1, Total_duration_of_whole_fixations ~ TOV * Group + (1 | Media) + (1 | Participant) + (1 | CHR), cores = 4, chains = 4)

Looking at the first model, let me check if I understood:

  1. If both the l-95% CI and the u-95% CI are < 0, the effect could be negative.
  2. If the range between the l-95% CI and the u-95% CI includes 0, the effect could be negative, neutral (none), or positive.
  3. If both the l-95% CI and the u-95% CI are > 0, the effect could be positive.

So, in the case of this model, the level TOVCOMP is absorbed into the Intercept, right? How do I evaluate it? (The levels of TOV are COMP, IDIOM, LEX.)

And the next question is: how do I test whether TOVLEX has a negative, positive, or no effect? Which functions should I use?

Anyway, this is starting to make a little more sense, thank you :)

Yes, you are on the right track there given your data and your model(s). Let me give you two resources. The first one is a slightly quicker read:
https://www.rensvandeschoot.com/tutorials/brms-started/

and the second one will take more time, but I know folks around here have found it useful:
https://xcelab.net/rm/statistical-rethinking/

Also, if you prefer YouTube, there is [a video embedded in the original post].


Thank you :) at least now I know how to read the summary!
I will take a look at the documentation and I'll let you know if there are any problems!

Hi again, I had a look at the documentation that you sent. Now the procedure is clearer, but I still have a couple of problems.

  1. I still don't get how to understand whether a variable has a null, negative, or positive effect, as in the case of TOVLEX.

  2. I understood that priors can be uninformative or informative. In the second case, we can set "normal" priors or custom priors. In both cases, it is better to include Student-t and sigma values. Custom priors come from previous studies or subjective beliefs about the data. In my case, previous studies show that TOVLEX items are read more slowly than TOVCOMP items, while TOVIDIOM items show longer RTs compared to both TOVCOMP and TOVLEX. Regarding the Group variable, I don't have data which could help me set priors. Also, as I understood it, Bernoulli or Cauchy priors are usually used for categorical variables. Why?

Summing up:

  1. How can I understand whether the TOVLEX effect is negative, positive, or null?
  2. How can I set, in code, normal and custom priors so as to compare two different models? Which values do I have to consider regarding previous information (if you want to give examples, you can use mock values of course)? In the case of categorical variables, do I have to specify these values for every level (i.e. TOVLEX, TOVCOMP (Intercept), TOVIDIOM)? If yes, how?

Thank you again and sorry for all these questions

As for question 1, there is a lot of great discussion here, with folks tackling this question in different ways,

particularly jscolar's comments.

Priors can get difficult quickly. However, I tend to view them as just additional data you are adding to your model. So if you know something about how your parameters should behave, you can code that up.

For example, in my area fire is bad for wetland plants in arid systems. If my model has a parameter for the effect of fire on something, I might give it a prior of normal(-0.5, 0.5).
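In brms terms that means per-coefficient priors. A minimal sketch for your model (the numbers below are purely illustrative placeholders, not recommendations; they should come from your prior knowledge and the scale of your outcome, which looks to be in the hundreds here):

# One prior per population-level coefficient; coef names match the summary output.
# Coefficients not listed (e.g. the interaction terms) keep their default priors.
my_priors <- c(
  prior(normal(50, 100), class = "b", coef = "TOVIDIOM"),   # e.g. expect longer durations for idioms
  prior(normal(20, 100), class = "b", coef = "TOVLEX"),
  prior(normal(0, 100),  class = "b", coef = "GroupGSPE")   # no strong expectation for Group
)

fit_priors <- brm(Total_duration_of_whole_fixations ~ TOV * Group +
                    (1 | Participant) + (1 | Media) + (1 | CHR),
                  data = data1, prior = my_priors, cores = 4, chains = 4)

# prior_summary(fit_priors) shows which priors were actually applied.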

As for default best-practice priors, we would need to get help from some other folks. There is this:
Prior Choice Recommendations · stan-dev/stan Wiki · GitHub, but it was last updated in 2023.