Help Correctly Specifying Multinomial Regression with predictor error in BRMS

mikegee · May 7, 2024, 2:12am

I’m working with multivariate response data where each observation is essentially the frequencies of different synonymous codons within a gene.

Here’s some example data for amino acids “C” and “I” for 20 different yeast genes

cc_data_mwe <- structure(list(gene_id = c("YAL003W", "YAL042W", "YAR018C", "YAR035W", 
"YBL015W", "YBL023C", "YBL057C", "YBL089W", "YBL104C", "YBR004C", 
"YBR006W", "YBR012W-A", "YBR024W", "YBR025C", "YBR055C", "YBR062C", 
"YBR074W", "YBR083W", "YBR096W", "YBR115C"), phi = c(8.46893787384033, 
0.380664080381393, 0.313518494367599, 0.163658604025841, 1.69344747066498, 
0.990803837776184, 0.402997583150864, 0.172610253095627, 0.37598243355751, 
0.734753429889679, 1.3231383562088, 0.485677689313889, 0.622240304946899, 
5.22348690032959, 0.190805360674858, 0.265862882137299, 0.297768205404282, 
0.19674488902092, 1.45047247409821, 1.57859694957733), trials_C = c(1L, 
10L, 6L, 11L, 4L, 9L, 4L, 10L, 23L, 8L, 11L, 3L, 7L, 6L, 11L, 
7L, 9L, 3L, 9L, 14L), C = structure(c(0L, 4L, 0L, 8L, 1L, 3L, 
1L, 4L, 10L, 3L, 3L, 1L, 2L, 1L, 6L, 3L, 3L, 1L, 3L, 6L, 1L, 
6L, 6L, 3L, 3L, 6L, 3L, 6L, 13L, 5L, 8L, 2L, 5L, 5L, 5L, 4L, 
6L, 2L, 6L, 8L), dim = c(20L, 2L), dimnames = list(NULL, c("c_1", 
"c_2"))), trials_I = c(12L, 20L, 30L, 32L, 33L, 55L, 13L, 51L, 
72L, 40L, 30L, 23L, 18L, 35L, 50L, 6L, 67L, 34L, 14L, 81L), I = structure(c(0L, 
0L, 9L, 0L, 4L, 10L, 6L, 14L, 25L, 12L, 5L, 3L, 4L, 1L, 18L, 
1L, 19L, 14L, 0L, 16L, 4L, 12L, 8L, 29L, 12L, 11L, 1L, 13L, 17L, 
10L, 8L, 8L, 6L, 15L, 7L, 2L, 17L, 10L, 4L, 22L, 8L, 8L, 13L, 
3L, 17L, 34L, 6L, 24L, 30L, 18L, 17L, 12L, 8L, 19L, 25L, 3L, 
31L, 10L, 10L, 43L), dim = c(20L, 3L), dimnames = list(NULL, 
    c("c_1", "c_2", "c_3")))), class = c("rowwise_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
    .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
        10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame")))

Which looks like this:

View(cc_data_mwe)

The phi column is the predictor, but may be measured with substantial error. Nevertheless, if I ignore the error I can estimate the parameters of interest: the intercept of a codon and how it’s frequency changes with phi under a multinomial regression model using,

multinomial_mwe <- brm(formula =
                            bf(C|trials(trials_C) ~ 1 + phi) +
                            bf(I|trials(trials_I) ~ 1 + phi),
     family = multinomial(link = "logit"),
     data = cc_data_mwe,
     threads = 1, #threads within a chain
     chains = 4,
     cores = 4,
     iter = 2000)

This compiles and runs and gives reasonable results.

> multinomial_mwe
 Family: MV(multinomial, multinomial) 
  Links: muc2 = logit
         muc2 = logit; muc3 = logit 
Formula: C | trials(trials_C) ~ 1 + phi 
         I | trials(trials_I) ~ 1 + phi 
   Data: cc_data_mwe (Number of observations: 20) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                 Estimate Est.Error l-95% CI u-95% CI Rhat
muc2_C_Intercept     0.22      0.21    -0.20     0.64 1.00
muc2_I_Intercept    -0.12      0.14    -0.39     0.15 1.00
muc3_I_Intercept     0.30      0.13     0.05     0.57 1.00
muc2_C_phi           0.35      0.20    -0.01     0.77 1.00
muc2_I_phi           0.54      0.14     0.28     0.83 1.00
muc3_I_phi           0.57      0.14     0.31     0.86 1.00
                 Bulk_ESS Tail_ESS
muc2_C_Intercept     4704     2886
muc2_I_Intercept     3383     3214
muc3_I_Intercept     3480     2794
muc2_C_phi           2238     2229
muc2_I_phi           1538     1546
muc3_I_phi           1612     1500

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

However, I’m still wondering how I can treat phi as a RV where the prior for phi ~ lognormal(0, 1). I have tried numerous formulations, such as

multinomial_mwe_w_error <-
  brm(
    formula = bf(C|trials(trials_C) ~ 1 + phi) +
      bf(I|trials(trials_I) ~ 1 + phi) + 
      bf(phi ~ 0 + gene_ID), 
    prior = prior(lognormal(0, 1), nlpar = "phi", lb = 0),      
     family = multinomial(link = "logit"),
     data = cc_data_mwe,
     threads = 1, 
     chains = 4,
     cores = 4,
    iter = 2000)

but in the above example, it seems like the code is (logically) treating phi as also following a multinomial, because I get the error: Error: At least 2 response categories are required.

I’ve tried using

bf( bf(C|trials(trials_C) ~ 1 + phi,
             I|trials(trials_I) ~ 1 + phi, 
             phi ~ 0 + gene_id,
             nl = TRUE),
     family = multinomial(link = "logit")

But that gives me error saying I can’t use underscores in my variable names.
Can someone help me get the correct syntax

Operating System: MATE 22.04
brms Version: 2.21.0
R: 4.3.3

AWoodward · May 7, 2024, 3:22am

Hello @mikegee, seems like you might be proposing some type of errors-in-variables model, in which the predictor phi cannot be considered perfectly known?

I’m not sure that the nonlinear structure that you’re attempting will be a helpful model because the variable gene_id doesn’t contain any information from which the value phi could be estimated. Perhaps what you want instead is something like

...bf(C|trials(trials_C) ~ 1 + me(phi, phi_se)...

in which case you’d need to supply information about the measurement error in phi as another variable, and also priors about the distribution of true phi.

Note that me() is soft deprecated, but I haven’t personally used the 'mi()` syntax yet for this type of specification.

mikegee · May 7, 2024, 4:04am

Hi AWoodward,

Thanks for the reply. The syntax I’m aiming for is \Phi_g \sim LogN(\phi_g, \sigma_\phi), where \phi_g is gene_id = g’s true specific \phi value and \Phi_g is a noisy measurement.

I do realize this can be viewed as an error in variables model, but I’m not sure if that’s the best approach. This is because sometimes we actually don’t have any direct information on \phi!

As we’ve shown (e.g. Gilchrist et al 2015, GBE) you can estimate \phi quite accurately w/o any external estimates (but, I’m assuming we have them for the moment). Estimating a gene specific \phi in the absence of direct observations is possible, because each gene consists of has multiple variates (19 to be exact) where each variate is a set of synonymous codon counts for a given amino acid. In total, most genes have ~400 trials across these sets of synonyms and we’re only trying to estimate one parameter for that gene.

We’ve got in-house code that does this as the R package AnaCoDa, but it’s cumbersome to extend the models in ways we can write down quickly on the board. That’s why I’m hoping to be able to use brms. The documentation indicates we should be able to fit arbitrarily complex models, if I can only figure out how to code them.

Topic		Replies	Views
Measurement error models in brms with error on both independent and dependent variables brms specification	2	3288	February 3, 2021
Measurement error in nonlinear mixed effect models brms techniques , specification	13	1507	August 30, 2020
Syntax error brms multivariate model brms	14	1576	July 6, 2019
Help with predictions with brms and including errors General brms	1	378	March 14, 2024
Regression with measurement error in brms brms specification	4	1895	October 6, 2021

Help Correctly Specifying Multinomial Regression with predictor error in BRMS

Related topics