@Bob_Carpenter thank you for engaging in such depth. I think this is really important, so I'm grateful we can discuss it. I've started to realize that, depending on background, different people have different conceptual frameworks of modeling, and sometimes we make incorrect assumptions about what we are communicating.
I'll try my best to answer. First some general thoughts and then specific replies. (PS: This became much longer than I planned. I hope it's not too basic. It helped me clear up how I think about this, and maybe it will be helpful for others who end up reading this discussion. I might write a blog post with more details and illustrations - consider this a rough starting point. I will add specific replies to your comments in a separate comment.)
Contents:
- Model components in distributional regression
  - A core model
  - A regression model layer
- What 'model', 'model family', and 'model comparison' refer to
- Understanding the bottleneck
  - Stan components
    - Core model likelihood
    - Data structure
    - Parameter structure
    - Priors
    - Regression layer
    - Optimizations
  - What a user has and what they want to achieve
  - Bottleneck: How do the goals map to changes in Stan code?
- How do `brms` and `bmm` solve the bottleneck?
Model components in distributional regression
We (and brms more generally) implement a distributional regression approach to modeling. Within this framework, we can consider a model at two levels of abstraction:
- The core model, which specifies a likelihood function of the data given a closed set of parameters and their priors.
In a simple statistical model this is a distributional family - a normal, Poisson or whatever, with its parameters. A more complicated example could be a drift diffusion model, which gives the joint likelihood of response times and binary choices as a function of the parameters drift rate, bias, threshold, and non-decision time. Another example could be a mixture of 3 von Mises distributions, which then has parameters for each component distribution plus the mixture weights. The parameters of this core model are a closed set and each plays a distinct part in the likelihood function (e.g. always 5 scalar parameters). In essence this is the part of the model that looks like this:
$$p(y \mid \theta_a, \theta_b, \ldots)$$
where $y$ is the data (one or more variables), and $\theta_a$, $\theta_b$, etc. are the core parameters
- A regression model layer, which specifies for a particular dataset how the core parameters change over observations.
This is a standard statistical formula of the type:
$$g(\theta) = X\beta$$
where $g()$ is some link function that unconstrains a parameter, $\beta$ are the regression parameters, and $X$ is the model matrix. Such a formula is specified for every core parameter of the model. This can include fixed and random effects at different levels. In contrast to the core model, which has a closed set of scalar parameters, the regression model now has a vector of regression parameters for each core parameter. The core model priors correspond to the intercepts, and the regression layer needs to additionally specify priors for the effects.
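To make the two levels concrete, here is a minimal `brms`-style sketch (the names `rt`, `condition`, `subject`, and `my_data` are placeholders I made up for illustration): the family supplies the core model, and one formula per core parameter supplies the regression layer.

```r
library(brms)

# Core model: a lognormal family with core parameters mu and sigma.
# Regression layer: one formula per core parameter.
f <- bf(
  rt ~ condition + (1 | subject),  # regression layer for mu
  sigma ~ condition                # distributional formula for sigma
)

# Inspect the generated Stan program without fitting anything
# (my_data stands for a long-format data.frame, see below):
# make_stancode(f, data = my_data, family = lognormal())
```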
What do the terms "model", "model comparison", etc. refer to?
This, I think, is where misunderstandings often occur, because different fields use these terms differently, and even the same field uses them differently depending on the context.
In the cognitive modeling literature, by "model" we often refer to the core model (1 above) - a drift diffusion model, a linear ballistic accumulator, unequal-variance signal detection, a 3-parameter mixture model of memory, etc. In the bmm package we use the term "measurement model" in this sense - a core model without the regression layer. In the field, when we talk about "different models", we usually mean different core models.
This corresponds to the idea of a "model family" in the statistical literature, where instead, when you discuss alternative models, you usually mean different regression-level specifications of the same model family (e.g. comparing a model in which a parameter $\theta_a$ is allowed to vary with some predictor against the same core model with $\theta_a$ held fixed).
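To illustrate the two senses of "comparing models" in `brms` terms (again with placeholder names):

```r
library(brms)

# "Different models" in the statistical sense:
# same core model (family), different regression-layer specifications.
m_varying <- bf(rt ~ condition + (1 | subject))  # the parameter (here mu) varies with condition
m_fixed   <- bf(rt ~ 1 + (1 | subject))          # the parameter is constant across conditions

# "Different models" in the cognitive-modeling sense:
# same regression layer, different core models, e.g. fitting m_varying
# once with family = lognormal() and once with family = shifted_lognormal().
```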
Understanding the bottleneck
To understand where the bottleneck and friction points in modeling for many of us come from, and how `brms` and `bmm` help in generating the Stan code for distributional models, we need to understand:
- how the distinct components of such models map to Stan code
- what goals modelers usually want to achieve
- what we start with and what we want to end up with in a typical workflow
- how the modeling goals map to changes in these components
- how we move from what we have to what we want to get in Stan
Stan components
Here I don't mean the syntactic blocks `data {}`, `parameters {}`, `model {}` - although there is some overlap, the components below are sometimes spread across multiple blocks. (A sketch of how these components show up in generated code follows the list.)
1. Core model likelihood. One of:
- a built-in function such as `normal_lpdf(y | mu, sigma)`;
- a complete custom function defined in the `functions {}` block, of the type `my_fun_lpdf(y | mu, sigma)`. This can either specify a series of equations from basic math functions, or combine them with built-in likelihood functions;
- a sequence of code in `model {}` that calls several functions, storing values in intermediate variables. Equivalent to the second option, but messier.
2. Data structure. Consists of:
- Variables that contain:
- observations
- auxiliary information (size of other data vectors, number of observations, number of predictors, etc)
- predictors for the regression layer
3. Parameter structure. Consists of:
- Variables that define which model parameters are sampled (rather than fixed)
- Typically a variable corresponds to a core model parameter, and its structure (a vector, array, whatever) contains the values for the regression components
4. Priors
5. Regression layer
- code in `model {}` which defines how the model matrix is combined with the regression parameter vectors to produce a final parameter vector to be used in the model likelihood - this is needed because the likelihood function needs to know, for each (set of) observations, which parameters to use. Usually trickiest with random effects, because the most efficient ways of coding them are not the cleanest (e.g. just using linear algebra is slower than looping over observations and subsetting the parameter vector based on the random-effect indicator…)
6. Optimizations
- code that plays no conceptual role in the model, but improves sampling or inference (e.g. reparameterization, centering predictors, applying link functions)
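One way to see how these components show up in practice, without writing them by hand, is to inspect code generated by `brms` (a sketch with placeholder names; the listed variable names are what current `brms` versions happen to use, and `my_data` is a long-format data.frame like the toy example in the next section):

```r
library(brms)

# Generate (but don't compile or fit) the Stan program for a simple
# distributional model, to see where each component above ends up.
code <- make_stancode(
  bf(rt ~ condition + (1 | subject), sigma ~ condition),
  data   = my_data,
  family = lognormal()
)
cat(code)
# In current brms versions you will find, roughly:
# - data {}: observations (Y), model matrix (X), sizes (N, K), grouping indices (J_1)
# - parameters {}: regression coefficients (b), group-level SDs (sd_1), standardized effects (z_1)
# - model {}: the regression layer building mu, then lognormal_lpdf(Y | mu, sigma)
# - plus optimization details such as the centered model matrix (Xc)
```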
What a user has and what they want to achieve
This depends on conventions in each field, of course, but in the behavioral sciences the following is very common.
We have:
- Data(sets): typically in a long-format data.frame, where one column stores the observations (or more than one, with one column per distinct observation type, e.g. response times and accuracy), and other columns store potential predictor variables for fixed and random effects. This means that each row represents a single observation (or a sufficient statistic) and all of its predictors (a toy example follows below).
- Potential core model candidates
- Potential predictor candidates
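For concreteness, a toy dataset of this kind might look like the following (made-up columns and values):

```r
# One row per observation: the observed value(s) plus all of its predictors.
my_data <- data.frame(
  rt        = c(0.61, 0.85, 0.47, 0.92),  # observed response times (seconds)
  accuracy  = c(1, 0, 1, 1),              # a second observation type, if present
  condition = c("A", "A", "B", "B"),      # fixed-effect predictor
  subject   = c(1, 1, 2, 2)               # grouping variable for random effects
)
```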
All three of these might vary, and typically the goal is to do some form of inference:
- Inference about the regression layer
Definition: For a given dataset and core model, which set of predictors on which parameters leads to the best model fit.
Use case: Very typical with pure statistical models, such as testing the hypothesis that reaction times in condition A are faster than in condition B, using the default Gaussian or log-normal core model. But also applicable with complicated domain-specific models such as drift diffusion models, in which we test the hypothesis that the drift rate is slower in condition A vs B.
Assumptions: This type of data is described by a particular core model, and this particular dataset includes conditions that affect the parameters of the model.
What varies:
- data: fixed in a particular instance, but we might want to apply the same regression approach to different experiments/data sources.
- regression formula
- Inference about the core model
Definition: For data coming from a particular paradigm (e.g., memory decisions in delayed estimation), what core model is "the best".
Use case: usually the aim is to compare different data-generating assumptions (process models) or to compare which model best describes this type of data (measurement models).
Assumptions: we include regression components to account for variability over datasets (random slopes for participants, fixed effects for conditions), so that we can do a fair comparison of different core models
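As a rough sketch of how the two goals translate into a fitting-and-comparison workflow (placeholder names throughout, and LOO is just one of several comparison tools one might use):

```r
library(brms)

# Goal 1: inference about the regression layer
# (same core model, different predictor structure on mu)
fit_full <- brm(bf(rt ~ condition + (1 | subject)), data = my_data, family = lognormal())
fit_null <- brm(bf(rt ~ 1 + (1 | subject)), data = my_data, family = lognormal())
loo_compare(loo(fit_full), loo(fit_null))

# Goal 2: inference about the core model
# (same regression layer, different core models / families)
fit_ln <- brm(bf(rt ~ condition + (1 | subject)), data = my_data, family = lognormal())
fit_sl <- brm(bf(rt ~ condition + (1 | subject)), data = my_data, family = shifted_lognormal())
loo_compare(loo(fit_ln), loo(fit_sl))
```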
Bottleneck: How do the goals map to changes in Stan code?
The big bottleneck is that much of the Stan code we need to write depends on details we don't care about, which introduces a lot of overhead. The names and sizes of variables and parameters, and the regression-layer syntax, all depend on the specific dataset and regression formula.
When we want to compare several regression formulations of the same core model on the same dataset, we need either 1) as many copies of the Stan code as there are variations, each with a lot of code duplication and idiosyncratic changes, or 2) a very generic script with an abstract structure that somehow maps those choices onto the same code by varying the inputs. The latter is very difficult for the average user to write. On top of that, you need to create a lot of the optimization code and regression plumbing anew for each project. These scripts can run to hundreds of lines for relatively simple cases.
A typical workflow without `brms`, for someone without much training in software engineering (which is true for most statistical users), would be:
- if you've done this before, copy a script you have for a similar situation (e.g. another time I applied a logistic multi-level regression with random effects to a memory experiment - a regression-layer inference goal; or my 3 different custom-likelihood measurement models)
- adapt variable names and sizes for the current data with the basic model
- make multiple copies of the code, one for each regression version, and change relevant code
- debug and test interactively
- write complex R code to translate your tidy long-format data.frame into a Stan-style data structure (a sketch of this step follows below). Repeat as many times as you have alternative models, or write a custom function to do it for you
- and a bunch of other things.
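For instance, the data-translation step alone typically looks something like this (a sketch, not taken from any particular script; names are made up):

```r
# Manually translate the tidy data.frame into the list a hand-written Stan
# program expects: every name and size here must match the data {} block exactly.
stan_data <- list(
  N         = nrow(my_data),
  y         = my_data$rt,
  condition = as.integer(factor(my_data$condition)),
  J         = length(unique(my_data$subject)),
  subject   = as.integer(factor(my_data$subject))
)
# ...and this has to be redone (or generalized) for every alternative model.
```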
How do `brms` and `bmm` solve the bottleneck?
You could say at this point that a solution to all these problems is to write abstracted code that can be reused, and whose outputs will depend on the inputs, rather than having a different script for each input. Well, this is exactly what `brms` does.
Given:
- a model family (core model; built-in or user-defined Stan code)
- a long-format data frame
- a standardized regression formula
- priors specified via user-friendly syntax
It:
- transforms the data into an appropriate format for Stan
- generates boilerplate code that does not depend on the core or regression model
- creates mappings between user-specified variables and standardized Stan code for data and parameter variables
- defines the regression layer syntax to feed parameters to the core model
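Put together, the user-facing side of all this is quite compact (a sketch with the same placeholder names as above):

```r
library(brms)

f      <- bf(rt ~ condition + (1 | subject), sigma ~ condition)
priors <- prior(normal(0, 1), class = "b")  # user-friendly prior syntax

# brms generates both the Stan program and the matching data list...
code  <- make_stancode(f, data = my_data, family = lognormal(), prior = priors)
sdata <- make_standata(f, data = my_data, family = lognormal())

# ...or does everything in one call:
# fit <- brm(f, data = my_data, family = lognormal(), prior = priors)
```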
🔥 Several things make it very flexible:
- the `brmsformula` syntax:
  - via linear and non-linear formulas it can specify nearly arbitrary regression-layer code, custom link functions, etc.
  - it turns out that it can do many things it was not designed for, which is what we take advantage of in `bmm`
- custom families:
  - users can write a function in Stan code and then pass that to `brms` via the `custom_family()` option. Through this you can define any custom core model you can code in Stan.
- Stan code injection:
  - via the `stanvars` argument, you can specify arbitrary Stan code to be inserted in any code section.
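A minimal sketch of the last two mechanisms combined - a hand-written custom family whose likelihood is injected through `stanvars`. The names and the toy likelihood are mine, just to show the mechanics; `brms` already ships `shifted_lognormal()` as a built-in family, and a real model would also bound `ndt` below the minimum response time.

```r
library(brms)

# Define a custom core model for brms: its distributional parameters and links.
shifted_ln <- custom_family(
  "shifted_ln",
  dpars = c("mu", "sigma", "ndt"),
  links = c("identity", "log", "log"),
  lb    = c(NA, 0, 0),
  type  = "real"
)

# The core model likelihood, injected into the functions {} block via stanvars.
stan_fun <- "
  real shifted_ln_lpdf(real y, real mu, real sigma, real ndt) {
    return lognormal_lpdf(y - ndt | mu, sigma);
  }
"

fit <- brm(
  bf(rt ~ condition + (1 | subject)),  # regression layer on mu as usual
  data     = my_data,
  family   = shifted_ln,
  stanvars = stanvar(scode = stan_fun, block = "functions")
)
```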
Summary - role of brms and bmm
Based on the last section, this means that as long as you have a model that can be described in the above framework, you can generate the Stan code for it with `brms`. You still have to provide some Stan code if 1) you are using a core model that is not a typical statistical family, or 2) you want to affect some of the code logic via `stanvars`. Thus, `brms` provides you with three methods for injecting Stan code: custom families (tell `brms` to use that function to increment the target); `stanvars` (inject code into arbitrary blocks, before or after existing code); and non-linear regression formulas (inject code into how parameter vectors are combined to produce a final parameter passed to the core model).
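And a sketch of the third method: a non-linear formula that controls how parameters are combined before reaching the core model (the learning-curve form and the `trial` column are made up for illustration):

```r
library(brms)

# mu follows an exponential learning curve, and each non-linear parameter
# (a = asymptote, r = rate) gets its own regression layer.
f <- bf(
  rt ~ a * exp(-r * trial),   # how the parameters combine
  a ~ 1 + (1 | subject),      # regression layer for a
  r ~ 1 + condition,          # regression layer for r
  nl = TRUE
)

# make_stancode(f, data = my_data, family = lognormal()) shows the injected
# nlp_a / nlp_r computations feeding the likelihood.
```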
All of this gives you, in `brms`, a lot of control over the Stan code that you care about changing manually, while automatically generating the parts of the code that can be treated as boilerplate.
This works, but it still requires more code than just fitting a typical mixed-effects Gaussian regression. What `bmm` then does is provide a framework that smooths this process for custom domain models, so that users don't have to write even those parts themselves. And for me as a model developer, I can use `bmm` to specify the common parts of a model I'm working on and then iterate on the code more easily, supplying only changes to the relevant code parts and letting the rest be generated automatically.