Specifying a time series multilevel model

Happy 2020 Stan Community!!

Problem set up

I have experimental data on three(3) different oils (LAC, SVP, MG) in which these oils are measured for a specific property of theirs (the response variable). This property is measured every second for 4500 seconds while the oil is coated on a metal ball (responses are time dependent).
In addition, this experiment was run in duplicates with the order in which the samples were run was randomized i.e, run 1 of sample 1 was equally likely to be followed by run 2 of sample 1 or run 1 of sample 3.

We have 4500 responses for each run of the sample, and 2 runs of each sample so in total.

Model specification

There’s clearly non-linearity so I’d like to model time as a non-linear effect, but I’m not sure how to model the sample/runs.

  1. Model runs as being nested within the sample? I.e, a random intercept for sample and random intercept for the unique sample_run combination
  2. Simply include a fixed effect for sample

Modeling choice 1
For brevity, I won’t be specifying priors, but focus on the model formula

nested_mod <- brm(
response ~ (1|sample/run) + s(time, bs = "cs))  

Modeling choice 2

fixed_mod <- brm(
response ~ sample + s(time, bs = "cs))

Below is a random fraction of the data set

                     ~run,        ~response, ~sample_fac,          ~new_time,
                  "lac_1", 72.8442980643677,       "lac",    13.678046390098,
                  "lac_1",   72.41406270519,       "lac",   16.6501296541285,
                  "lac_1", 72.8452447618132,       "lac",   29.6175944592435,
                  "lac_1",  68.540513907131,       "lac",    7.7637882003658,
                  "lac_1", 71.8944808768374,       "lac",    28.828116263878,
                  "lac_1", 71.9308825540751,       "lac",   10.9361076503158,
                  "lac_1", 71.5376752310561,       "lac",   29.1871714518968,
                  "lac_1", 69.9779038226358,       "lac",   9.54013693347616,
                  "lac_1", 63.8421474913109,       "lac",   1.03286014605356,
                  "lac_1", 71.7385392924506,       "lac",   12.8925599294346,
                  "lac_1", 73.0084587777277,       "lac",   19.8486050005634,
                  "lac_1", 64.4812000067977,       "lac",   2.78784175255378,
                  "lac_1", 66.0082947375471,       "lac",   4.99397939155852,
                  "lac_2", 64.5907341765694,       "lac",   13.7044344759663,
                  "lac_2", 68.6342970533887,       "lac",   24.0550853718189,
                  "lac_2", 67.7282015400432,       "lac",   20.8624006841448,
                  "lac_2", 64.3935660677285,       "lac",   17.7782761641895,
                  "lac_2",  63.064660646255,       "lac",    19.220369340491,
                  "lac_2", 45.7941966608521,       "lac",  0.517151577968647,
                  "lac_2", 69.9130225629351,       "lac",   29.7351333742147,
                  "lac_2", 67.7026708945785,       "lac",   21.9425217491691,
                  "lac_2", 69.5001879334048,       "lac",   26.6210199553636,
                  "lac_2", 67.9561831547372,       "lac",   21.4493644818315,
                  "lac_2", 70.2920241598646,       "lac",   25.2476974761696,
                  "lac_2", 65.4506704373289,       "lac",   14.7406316652079,
                  "lac_2", 59.9148155584788,       "lac",    10.646390048659,
                   "mg_1", 260.503153387465,        "mg",   29.3485134429766,
                   "mg_1", 256.990726479642,        "mg",   22.8634543108756,
                   "mg_1", 231.523372460514,        "mg",   9.85511561594923,
                   "mg_1", 258.810302560367,        "mg",    26.887701640092,
                   "mg_1", 178.472230412952,        "mg",   2.74191499426981,
                   "mg_1",  207.84821888721,        "mg",   5.03972217645516,
                   "mg_1",  259.23150278601,        "mg",   26.6622660061812,
                   "mg_1", 109.265224850238,        "mg",  0.671548847761205,
                   "mg_1", 182.057836027649,        "mg",   2.94406872913261,
                   "mg_1", 210.412089184462,        "mg",   5.44547846862135,
                   "mg_1", 245.319770264699,        "mg",   16.1546196541285,
                   "mg_1", 251.182405939783,        "mg",   19.8020835881589,
                   "mg_1", 257.975048365293,        "mg",   24.8633570977954,
                   "mg_2", 225.130085396146,        "mg",   19.0380423209136,
                   "mg_2", 161.175762351683,        "mg",   2.85453049423819,
                   "mg_2", 173.724839789085,        "mg",   3.82363149155577,
                   "mg_2", 98.8292953783186,        "mg",  0.785282881275688,
                   "mg_2", 222.676523674486,        "mg",   15.9305291688507,
                   "mg_2", 222.833859459554,        "mg",   15.8397324856342,
                   "mg_2", 221.325160692058,        "mg",   15.5042274699518,
                   "mg_2", 230.524244029661,        "mg",   25.5590971066122,
                   "mg_2", 235.771677980177,        "mg",   29.8000343710155,
                   "mg_2", 211.679460636609,        "mg",   10.8247678280314,
                   "mg_2", 219.361066418684,        "mg",   14.1751858727512,
                   "mg_2", 220.909805561996,        "mg",   14.4909348788643,
                   "mg_2", 174.943764085913,        "mg",   3.89021373660949,
                  "svp_1", 230.242033419717,       "svp",   14.1970042256827,
                  "svp_1", 67.4564310659396,       "svp", 0.0900337974263472,
                  "svp_1", 239.085716326199,       "svp",    19.981160609862,
                  "svp_1", 242.941970072719,       "svp",   22.5472045039015,
                  "svp_1", 234.100667378645,       "svp",   16.3810161444081,
                  "svp_1",  140.87585851739,       "svp",  0.739684151597519,
                  "svp_1", 246.668871677588,       "svp",   29.4163462124953,
                  "svp_1",  183.55987624315,       "svp",   2.47351804875789,
                  "svp_1", 242.941726439951,       "svp",    25.470823572362,
                  "svp_1", 231.781479217227,       "svp",   14.7838310002072,
                  "svp_1", 248.871522769646,       "svp",   29.6645006072896,
                  "svp_1", 245.935330400674,       "svp",   25.9174953704888,
                  "svp_1",  242.81470616125,       "svp",   22.9078243732175,
                  "svp_2",  257.69373518629,       "svp",   20.5235447974847,
                  "svp_2", 246.193204057907,       "svp",   14.6493610002072,
                  "svp_2", 207.263723482967,       "svp",   2.24895021719572,
                  "svp_2", 232.308465434741,       "svp",   7.89871767237497,
                  "svp_2",   261.3222156315,       "svp",   22.3231398484641,
                  "svp_2", 209.263986590766,       "svp",   2.36152060171304,
                  "svp_2", 264.906273073939,       "svp",   24.7742095119757,
                  "svp_2", 245.940441513134,       "svp",   13.7025537803275,
                  "svp_2", 261.904791006638,       "svp",   23.2911638131337,
                  "svp_2", 232.010125007026,       "svp",   7.58396809296397,
                  "svp_2", 236.332222354715,       "svp",   9.20233885366184,
                  "svp_2", 238.382588724674,       "svp",   9.94583395966043,
                  "svp_2", 225.677538975778,       "svp",   5.37818322730896

Please also provide the following information in addition to your question:

  • Operating System: Windows 10
  • brms Version: 2.10.0

At a very basic level you should establish what is of interest to you and why you ran the experiment how you did.

Do you want to know the difference between the types of oils, or the variability of the property you are measuring?

Did you do two runs in random order to account for some possible differences in testing conditions? (e.g., temperature, humidity, phase of the moon?)

There are as many opinions on what you should count as “random” as programs to estimate multilevel models.

The primary reason to think about it in a bayesian context (in other contexts there are additional mathematical limitations ) is the idea of gaining information about the response from the groups.

In your example, you have done what I (and many other people) recommend, which is explicitly naming your lowest levels so the naming is nested within them. However you only have two observations at each time point of these lowest levels.

As a start I would try with sample as group level effect and time as your main effect

response ~ 1 + s(time, bs = "cs")) + (1 | sample)

inferring from your question that you are interested in the effect of time which varies by sample type.

Hi Meg!
Thank you for the response.

My interest is in comparing the 3 different oils. However, we know there is run-run to variation (the same sample ran again will have a slightly different profile) and that is why there were two runs (this was not based on any power calculations or such, but the client ran the experiment in this way).

I would like to use the runs to account for the measurment error and include that uncertainty when comparing the oils. As you can see in the plot below, the runs of each sample are slightly different (I have continuous data for 4500 seconds but this plot below is just for the data I pasted above)

plot of raw data

The reason I’m concerned about ignoring the runs in the model specification is, the time values aren’t consistent i.e, Run 1 of sample 1 may have been measured at 5.5 seconds, but Run 2 of sample 1 may NOT have been measured at 5.5 seconds, but instead at 5.8 seconds. If I specify only the sample term in my model, my model doesn’t “know” that these two responses are in fact from different runs - it would aggregrate the two runs into one time series. This ignores the measurment error.

Posterior Inferences

If I specify

response ~ (1|sample/run) + s(time, bs = "cs")

which according to the brms documentation becomes

(1|sample) + (1|sample:run) + s(time, bs = "cs")  

I can then generate posterior inferences for a NEW run by generating expectations like so

posterior_exp <- posterior_linpred(mod, re_formula = (1|sample), newdata = data) 

This would give me a posterior predictive distribution for a NEW run, and the uncertainty in this would account for the multiple runs as opposed to clubbing them together.

But tbh I’m not so sure about this approach. I can’t say what is wrong with this approach, but I’m having a hard time convincing myself this is appropriate.

I think in brms you can model the non-linear time/run response like

s(time, bs="cs", by=run)

Thanks for the response Ara!

That wouldn’t give me inferences for a new run - it would just estimate each run. What I’m interested in, is to infer what the average run for each sample would look like, while accounting for the run to run variation.

How about: response ~ s(time, bs = "cs", by = sample) + (1 | sample:run).

Whatever you do, it will be hard to get a decent estimate of the variability by run, since you have so few of them.

1 Like

Yes, but I think what I was suggesting before and what this reply suggests

Is that with only two runs you don’t have enough runs to estimate this variation. For that you need more than two points. Unless you have a particular reason to believe the runs would be different from one another (like I suggested, some covariate you didn’t measure but are attempting to account for by two runs separate in time from one another) I don’t think you need to include them.

Especially considering the number of samples you have (essentially 9000 samples for each type of oil (calling your factor sample is somewhat confusing in this context since sample is both an experimental design term and an MCMC term, using it as a variable name is rough).

You can predict values for a hypothetical and include the variation of your data run without grouping by runs. And you can make a plot similar to the one above with uncertainty ribbons around them.

I can come back later and give explicit code on it, but it involves calling something related to:

m %>% expand_grid(time, sample) %>% posterior_predict

But this is the sort of thing I have to look up EVERY TIME because of my workflow, but that should get you started in looking in to that method or someone who knows it off the top of their head could stumble along and save us all the poorly coded headache I would create.

Because your figure it looks pretty clear that one type of oil is much different from the other two.
I think the specification above:

response ~ s(time, bs = "cs", by = sample)

Is probably most reasonable, it allows the relationship between time and response to vary by type of oil. I would start with this as a base model, or even more simply
response ~ s(time, bs = "cs") and then add variation by type to see if it improves the fit/performance etc.

I don’t know how long it takes to run that many samples with a spline (god would I love to have that many observations of any one measurement, alas ecology) but building models up from most basic set is often informational and helps you figure out where your problems are introduced .

I’m sure some people here could give advice on how you could split these data into sets for building and validating/testing but sadly I"m not that person (again, rarely have enough data before dividing it into sets)

1 Like

One alternative model might be the model mentioned in here:
y=b_{1}\left(1-\exp \left(-\left(x / b_{2}\right)^{b_{3}}\right)\right)

brm(bf(response ~ exp(b1) * (1 - exp(-(new_time / exp(b2)) ^ exp(b3))),
               b1 ~ 1+(1|run)+sample_fac,
               b2 ~ 1+(1|run)+sample_fac,
               b3 ~ 1+(1|run)+sample_fac,
          prior = c(prior(student_t(10,0,1), nlpar = "b1"),
                    prior(student_t(10,0,1), nlpar = "b2"),
                    prior(student_t(10,0,1), nlpar = "b3")),
    control=list(adapt_delta=0.99, max_treedepth=15))

This seem to fit the data pretty well: