Help Correctly Specifying Multinomial Regression with predictor error in BRMS

I’m working with multivariate response data where each observation is essentially the frequencies of different synonymous codons within a gene.

Here’s some example data for amino acids “C” and “I” for 20 different yeast genes

cc_data_mwe <- structure(list(gene_id = c("YAL003W", "YAL042W", "YAR018C", "YAR035W", 
"YBL015W", "YBL023C", "YBL057C", "YBL089W", "YBL104C", "YBR004C", 
"YBR006W", "YBR012W-A", "YBR024W", "YBR025C", "YBR055C", "YBR062C", 
"YBR074W", "YBR083W", "YBR096W", "YBR115C"), phi = c(8.46893787384033, 
0.380664080381393, 0.313518494367599, 0.163658604025841, 1.69344747066498, 
0.990803837776184, 0.402997583150864, 0.172610253095627, 0.37598243355751, 
0.734753429889679, 1.3231383562088, 0.485677689313889, 0.622240304946899, 
5.22348690032959, 0.190805360674858, 0.265862882137299, 0.297768205404282, 
0.19674488902092, 1.45047247409821, 1.57859694957733), trials_C = c(1L, 
10L, 6L, 11L, 4L, 9L, 4L, 10L, 23L, 8L, 11L, 3L, 7L, 6L, 11L, 
7L, 9L, 3L, 9L, 14L), C = structure(c(0L, 4L, 0L, 8L, 1L, 3L, 
1L, 4L, 10L, 3L, 3L, 1L, 2L, 1L, 6L, 3L, 3L, 1L, 3L, 6L, 1L, 
6L, 6L, 3L, 3L, 6L, 3L, 6L, 13L, 5L, 8L, 2L, 5L, 5L, 5L, 4L, 
6L, 2L, 6L, 8L), dim = c(20L, 2L), dimnames = list(NULL, c("c_1", 
"c_2"))), trials_I = c(12L, 20L, 30L, 32L, 33L, 55L, 13L, 51L, 
72L, 40L, 30L, 23L, 18L, 35L, 50L, 6L, 67L, 34L, 14L, 81L), I = structure(c(0L, 
0L, 9L, 0L, 4L, 10L, 6L, 14L, 25L, 12L, 5L, 3L, 4L, 1L, 18L, 
1L, 19L, 14L, 0L, 16L, 4L, 12L, 8L, 29L, 12L, 11L, 1L, 13L, 17L, 
10L, 8L, 8L, 6L, 15L, 7L, 2L, 17L, 10L, 4L, 22L, 8L, 8L, 13L, 
3L, 17L, 34L, 6L, 24L, 30L, 18L, 17L, 12L, 8L, 19L, 25L, 3L, 
31L, 10L, 10L, 43L), dim = c(20L, 3L), dimnames = list(NULL, 
    c("c_1", "c_2", "c_3")))), class = c("rowwise_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
    .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
        10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame")))

Which looks like this:

View(cc_data_mwe)

The phi column is the predictor, but may be measured with substantial error. Nevertheless, if I ignore the error I can estimate the parameters of interest: the intercept of a codon and how it’s frequency changes with phi under a multinomial regression model using,

multinomial_mwe <- brm(formula =
                            bf(C|trials(trials_C) ~ 1 + phi) +
                            bf(I|trials(trials_I) ~ 1 + phi),
     family = multinomial(link = "logit"),
     data = cc_data_mwe,
     threads = 1, #threads within a chain
     chains = 4,
     cores = 4,
     iter = 2000)

This compiles and runs and gives reasonable results.

> multinomial_mwe
 Family: MV(multinomial, multinomial) 
  Links: muc2 = logit
         muc2 = logit; muc3 = logit 
Formula: C | trials(trials_C) ~ 1 + phi 
         I | trials(trials_I) ~ 1 + phi 
   Data: cc_data_mwe (Number of observations: 20) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                 Estimate Est.Error l-95% CI u-95% CI Rhat
muc2_C_Intercept     0.22      0.21    -0.20     0.64 1.00
muc2_I_Intercept    -0.12      0.14    -0.39     0.15 1.00
muc3_I_Intercept     0.30      0.13     0.05     0.57 1.00
muc2_C_phi           0.35      0.20    -0.01     0.77 1.00
muc2_I_phi           0.54      0.14     0.28     0.83 1.00
muc3_I_phi           0.57      0.14     0.31     0.86 1.00
                 Bulk_ESS Tail_ESS
muc2_C_Intercept     4704     2886
muc2_I_Intercept     3383     3214
muc3_I_Intercept     3480     2794
muc2_C_phi           2238     2229
muc2_I_phi           1538     1546
muc3_I_phi           1612     1500

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

However, I’m still wondering how I can treat phi as a RV where the prior for phi ~ lognormal(0, 1). I have tried numerous formulations, such as

multinomial_mwe_w_error <-
  brm(
    formula = bf(C|trials(trials_C) ~ 1 + phi) +
      bf(I|trials(trials_I) ~ 1 + phi) + 
      bf(phi ~ 0 + gene_ID), 
    prior = prior(lognormal(0, 1), nlpar = "phi", lb = 0),      
     family = multinomial(link = "logit"),
     data = cc_data_mwe,
     threads = 1, 
     chains = 4,
     cores = 4,
    iter = 2000)

but in the above example, it seems like the code is (logically) treating phi as also following a multinomial, because I get the error: Error: At least 2 response categories are required.

I’ve tried using

bf( bf(C|trials(trials_C) ~ 1 + phi,
             I|trials(trials_I) ~ 1 + phi, 
             phi ~ 0 + gene_id,
             nl = TRUE),
     family = multinomial(link = "logit")

But that gives me error saying I can’t use underscores in my variable names.
Can someone help me get the correct syntax

  • Operating System: MATE 22.04
  • brms Version: 2.21.0
  • R: 4.3.3

Hello @mikegee, seems like you might be proposing some type of errors-in-variables model, in which the predictor phi cannot be considered perfectly known?

I’m not sure that the nonlinear structure that you’re attempting will be a helpful model because the variable gene_id doesn’t contain any information from which the value phi could be estimated. Perhaps what you want instead is something like

...bf(C|trials(trials_C) ~ 1 + me(phi, phi_se)...

in which case you’d need to supply information about the measurement error in phi as another variable, and also priors about the distribution of true phi.

Note that me() is soft deprecated, but I haven’t personally used the 'mi()` syntax yet for this type of specification.

Hi AWoodward,

Thanks for the reply. The syntax I’m aiming for is \Phi_g \sim LogN(\phi_g, \sigma_\phi), where \phi_g is gene_id = g’s true specific \phi value and \Phi_g is a noisy measurement.

I do realize this can be viewed as an error in variables model, but I’m not sure if that’s the best approach. This is because sometimes we actually don’t have any direct information on \phi!

As we’ve shown (e.g. Gilchrist et al 2015, GBE) you can estimate \phi quite accurately w/o any external estimates (but, I’m assuming we have them for the moment). Estimating a gene specific \phi in the absence of direct observations is possible, because each gene consists of has multiple variates (19 to be exact) where each variate is a set of synonymous codon counts for a given amino acid. In total, most genes have ~400 trials across these sets of synonyms and we’re only trying to estimate one parameter for that gene.

We’ve got in-house code that does this as the R package AnaCoDa, but it’s cumbersome to extend the models in ways we can write down quickly on the board. That’s why I’m hoping to be able to use brms. The documentation indicates we should be able to fit arbitrarily complex models, if I can only figure out how to code them.