Problem setup -
We have 4 vehicles, and 2 oils are applied to 2 vehicles each. Every 15000km, the oil is changed, and the filtration efficiency of a filter is measured, which is the proportion of particulate matter left behind. I have 3 such time points (20,000km, 35,000km and 50,000km). In addition, at each timepoint, each vehicle is measured atleast 3 times (subsampled).
We would like to assess, whether the choice of oil has any effect.
Data setup
library(dplyr)
structure(list(vehicle = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L,
4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"), phase_study = c("20000k",
"20000k", "20000k", "20000k", "20000k", "20000k", "20000k", "20000k",
"20000k", "20000k", "20000k", "20000k", "35k", "35k", "35k",
"35k", "35k", "35k", "35k", "35k", "35k", "35k", "35k", "35k",
"50000k", "50000k", "50000k", "50000k", "50000k", "50000k", "50000k",
"50000k", "50000k", "50000k", "50000k", "50000k", "50000k", "50000k",
"50000k", "50000k"), proportion = c(0.133276922067952, 0.205192487353148,
0.180891132637869, 0.131808859615293, 0.233605003889866, 0.156346934097495,
0.124846693417874, 0.101009106805783, 0.0960592734051121, 0.0747444592529318,
0.0806232848484083, 0.0876103330243045, 0.128560106261169, 0.187112431826725,
0.135358954840921, 0.116530374466731, 0.133765765869908, 0.151286541197553,
0.149539874266327, 0.147483421748724, 0.138464775367404, 0.136179625528026,
0.166062526096805, 0.13736338636871, 0.142649125779378, 0.225733072582342,
0.225752781332766, 0.180084025442341, 0.213935598869078, 0.261067241489387,
0.139824991219242, 0.189865352171248, 0.207996472478743, 0.156223909638647,
0.279743362932101, 0.261630166237954, 0.112091188149989, 0.133759402337049,
0.121998147373204, 0.0849281892292818), subsample_unique = c("1-Total-20000k-used_oil",
"1-Total-20000k-used_oil", "1-Total-20000k-used_oil", "2-Total-20000k-used_oil",
"2-Total-20000k-used_oil", "2-Total-20000k-used_oil", "3-Total-20000k-used_oil",
"3-Total-20000k-used_oil", "3-Total-20000k-used_oil", "4-Total-20000k-used_oil",
"4-Total-20000k-used_oil", "4-Total-20000k-used_oil", "1-Total-35k-used_oil",
"1-Total-35k-used_oil", "1-Total-35k-used_oil", "2-Total-35k-used_oil",
"2-Total-35k-used_oil", "2-Total-35k-used_oil", "3-Total-35k-used_oil",
"3-Total-35k-used_oil", "3-Total-35k-used_oil", "4-Total-35k-used_oil",
"4-Total-35k-used_oil", "4-Total-35k-used_oil", "1-Total-50000k-used_oil",
"1-Total-50000k-used_oil", "1-Total-50000k-used_oil", "1-Total-50000k-used_oil",
"1-Total-50000k-used_oil", "1-Total-50000k-used_oil", "2-Total-50000k-used_oil",
"2-Total-50000k-used_oil", "2-Total-50000k-used_oil", "3-Total-50000k-used_oil",
"3-Total-50000k-used_oil", "3-Total-50000k-used_oil", "3-Total-50000k-used_oil",
"4-Total-50000k-used_oil", "4-Total-50000k-used_oil", "4-Total-50000k-used_oil"
), oil = c("B", "B", "B", "B", "B", "B", "A", "A", "A", "A",
"A", "A", "B", "B", "B", "B", "B", "B", "A", "A", "A", "A", "A",
"A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "A", "A", "A",
"A", "A", "A", "A")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -40L))
Model setup
A varying intercept model for vehicle and for the subsamples with a fixed effect for oil.
library(brms)
full_model_prior <-
brm(proportion ~ ol + (1|vehicle) + (1|subsample_unique) , data = data,
prior = c(prior(student_t(5, 0.25, .1), class = Intercept),
prior(normal(0,.1), class = b),
prior(exponential(2.5), class = sigma),
prior(exponential(2.5), class = sd)), sample_prior = "only")
I played around with prior predictive distribution and constrained the priors such that the plausible response generated from the priors is [0,1]. This is because we are working with proportions and negative proportions don’t make sense, and proportions >1 are implausible because particulate matter cannot increase after filtration.
please critique my priors - still learning
Posterior Model
full_model_posterior <-
brm(proportion ~ ol + (1|vehicle) + (1|subsample_unique) , data = data,
prior = c(prior(student_t(5, 0.25, .1), class = Intercept),
prior(normal(0,.1), class = b),
prior(exponential(2.5), class = sigma),
prior(exponential(2.5), class = sd)), control = list(adapt_delta = 0.99999), chains = 4, cores = 4, iter = 12000)
My posterior checks look OK I think. Given the small sample size, I can’t expect too much.
Now I’m interested in making a comparison in the posterior distribution of Oil A - Oil B. Since I also fit an intercept, I get a posterior distribution of B - A, the difference.
Can I make a statement, such as, the probability that Oil B has a higher proportion of particulate leftover as compared to B is xxx, using the posterior as shown below?
brms::posterior_samples(full_model_posterior, pars = "b_oilB") %>%
dplyr::filter(b_oilB>0) %>%
dplyr::summarise(probability = nrow(.)/24000)
## I take the posterior dsitribution of the difference, and calculate how many are >0 and calculate a proportion (24000 is the sampled values).
The result of the above is 0.82 - which I think is interesting.
However, if you look at the 80% quantile interval, it ranges from [-0.0187,0.0778], which includes zero.
posterior_interval(full_model_posterior, prob = 0.80)
How do I then interpret this result? Is it valid for me to calculate the probability of Oil B > Oil A using the posterior above? Or is the uncertainity interval a more meaningful measure to interpret the treatment effect?