I’m working with multivariate response data where each observation is essentially the frequencies of different synonymous codons within a gene.
Here’s some example data for amino acids “C” and “I” for 20 different yeast genes
cc_data_mwe <- structure(list(gene_id = c("YAL003W", "YAL042W", "YAR018C", "YAR035W",
"YBL015W", "YBL023C", "YBL057C", "YBL089W", "YBL104C", "YBR004C",
"YBR006W", "YBR012W-A", "YBR024W", "YBR025C", "YBR055C", "YBR062C",
"YBR074W", "YBR083W", "YBR096W", "YBR115C"), phi = c(8.46893787384033,
0.380664080381393, 0.313518494367599, 0.163658604025841, 1.69344747066498,
0.990803837776184, 0.402997583150864, 0.172610253095627, 0.37598243355751,
0.734753429889679, 1.3231383562088, 0.485677689313889, 0.622240304946899,
5.22348690032959, 0.190805360674858, 0.265862882137299, 0.297768205404282,
0.19674488902092, 1.45047247409821, 1.57859694957733), trials_C = c(1L,
10L, 6L, 11L, 4L, 9L, 4L, 10L, 23L, 8L, 11L, 3L, 7L, 6L, 11L,
7L, 9L, 3L, 9L, 14L), C = structure(c(0L, 4L, 0L, 8L, 1L, 3L,
1L, 4L, 10L, 3L, 3L, 1L, 2L, 1L, 6L, 3L, 3L, 1L, 3L, 6L, 1L,
6L, 6L, 3L, 3L, 6L, 3L, 6L, 13L, 5L, 8L, 2L, 5L, 5L, 5L, 4L,
6L, 2L, 6L, 8L), dim = c(20L, 2L), dimnames = list(NULL, c("c_1",
"c_2"))), trials_I = c(12L, 20L, 30L, 32L, 33L, 55L, 13L, 51L,
72L, 40L, 30L, 23L, 18L, 35L, 50L, 6L, 67L, 34L, 14L, 81L), I = structure(c(0L,
0L, 9L, 0L, 4L, 10L, 6L, 14L, 25L, 12L, 5L, 3L, 4L, 1L, 18L,
1L, 19L, 14L, 0L, 16L, 4L, 12L, 8L, 29L, 12L, 11L, 1L, 13L, 17L,
10L, 8L, 8L, 6L, 15L, 7L, 2L, 17L, 10L, 4L, 22L, 8L, 8L, 13L,
3L, 17L, 34L, 6L, 24L, 30L, 18L, 17L, 12L, 8L, 19L, 25L, 3L,
31L, 10L, 10L, 43L), dim = c(20L, 3L), dimnames = list(NULL,
c("c_1", "c_2", "c_3")))), class = c("rowwise_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame")))
Which looks like this:
View(cc_data_mwe)
The phi
column is the predictor, but may be measured with substantial error. Nevertheless, if I ignore the error I can estimate the parameters of interest: the intercept of a codon and how it’s frequency changes with phi
under a multinomial regression model using,
multinomial_mwe <- brm(formula =
bf(C|trials(trials_C) ~ 1 + phi) +
bf(I|trials(trials_I) ~ 1 + phi),
family = multinomial(link = "logit"),
data = cc_data_mwe,
threads = 1, #threads within a chain
chains = 4,
cores = 4,
iter = 2000)
This compiles and runs and gives reasonable results.
> multinomial_mwe
Family: MV(multinomial, multinomial)
Links: muc2 = logit
muc2 = logit; muc3 = logit
Formula: C | trials(trials_C) ~ 1 + phi
I | trials(trials_I) ~ 1 + phi
Data: cc_data_mwe (Number of observations: 20)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat
muc2_C_Intercept 0.22 0.21 -0.20 0.64 1.00
muc2_I_Intercept -0.12 0.14 -0.39 0.15 1.00
muc3_I_Intercept 0.30 0.13 0.05 0.57 1.00
muc2_C_phi 0.35 0.20 -0.01 0.77 1.00
muc2_I_phi 0.54 0.14 0.28 0.83 1.00
muc3_I_phi 0.57 0.14 0.31 0.86 1.00
Bulk_ESS Tail_ESS
muc2_C_Intercept 4704 2886
muc2_I_Intercept 3383 3214
muc3_I_Intercept 3480 2794
muc2_C_phi 2238 2229
muc2_I_phi 1538 1546
muc3_I_phi 1612 1500
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
However, I’m still wondering how I can treat phi
as a RV where the prior for phi ~ lognormal(0, 1)
. I have tried numerous formulations, such as
multinomial_mwe_w_error <-
brm(
formula = bf(C|trials(trials_C) ~ 1 + phi) +
bf(I|trials(trials_I) ~ 1 + phi) +
bf(phi ~ 0 + gene_ID),
prior = prior(lognormal(0, 1), nlpar = "phi", lb = 0),
family = multinomial(link = "logit"),
data = cc_data_mwe,
threads = 1,
chains = 4,
cores = 4,
iter = 2000)
but in the above example, it seems like the code is (logically) treating phi as also following a multinomial, because I get the error: Error: At least 2 response categories are required.
I’ve tried using
bf( bf(C|trials(trials_C) ~ 1 + phi,
I|trials(trials_I) ~ 1 + phi,
phi ~ 0 + gene_id,
nl = TRUE),
family = multinomial(link = "logit")
But that gives me error saying I can’t use underscores in my variable names.
Can someone help me get the correct syntax
- Operating System: MATE 22.04
- brms Version: 2.21.0
- R: 4.3.3