Modelling questions for medical follow-up combined with genomic information

Medical doctors are still reluctant to use Bayesian approaches, but I think that with scarce data, limited samples, missing values, and values collected over time (epidemiology), it is a good way to proceed, if only to quantify the uncertainty around certain outcomes. I would like to ask this community a few questions about a particular design that recurs in many medical studies, and about how it could be modelled. The questions are: a) how should the data be structured in the case I present, b) how should time be implemented, and c) what possible design(s) could be constructed from such data?

It is one of those very small and messy patient follow-ups. I am aware of many such designs out there, so I believe the modelling community would benefit from some hints and a bit of discussion of such scenarios.

In this hypothetical design, there are only 60 ladies who received a drug against one type of cancer (all survived), but the drug is also known to cause long-term side effects. One of the side effects is the development of TS syndrome (a binary factor variable, 0 or 1, constructed from the presence of 3 out of 5 sub-variables describing this syndrome). It is also possible to recover from the syndrome, although this is rare. Also, some ladies already had the syndrome at baseline, as it is age-dependent. One could wonder whether to censor this subgroup (10% of the data!) from the 60 ladies or to use this information actively in the modelling strategy. There are some missing values for TS status, even at baseline.

Follow-up time and interesting variables measured
These 60 ladies (variable: ID) were followed up for TS status before treatment (time 0, baseline), and after 10 days, 100 days, 365 days, and 5 years; a few were followed even after 10 years (variable: Days_After_Treatment). In addition, blood samples were collected (at 0, 10, 100, and 365 days), and the cumulative concentration of the drug was measured at 0, 10, and 100 days. Finally, 10 locations in their DNA were measured at 0, 10, 100, and 365 days (variable: MethylationValue, a continuous value between 0% and 100% for each of the 10 locations). These locations were previously shown to be affected by the drug and to be associated with TS status, so at least some of them were expected to be informative for TS status. The responses of the MethylationValue(s) over time are not linear.

Other covariates possibly modulating the outcome variable
There is also information about their age at the start of treatment (variable: Age), smoking habits (categorical variable: Smoking; never, former, current), and body mass index (variable: BMI) at many of the follow-up time points. There are, of course, some missing data throughout this design.

Possible questions and designs:
From this design the interesting questions could be:

  1. Can the change in MethylationValue(s) (before versus after chemotherapy) predict TS syndrome status after 5 years, given all other available variables?

(In other words, does the pattern of change in the measurements over time have something to do with an increased probability of developing TS syndrome during the follow-up, or even 5 years after?)

  2. Is the dose of the drug itself associated with the change in MethylationValue(s), increasing the probability of developing TS syndrome, and if so, by how much?

  3. Other possible designs …

I tried my best to explain this messy design. I am not yet worried about the priors; rather, I would like to hear the Stan community's thoughts on it. I have attached a dummy file with dummy numbers for 2 ladies, which I could expand with further dummy data if I get any feedback here.

Thank you

dummy_medical.csv (901 Bytes)

Hi, @bazylisek and welcome to the Stan forums. Was there a Stan-related question here? I’m afraid we don’t have a lot of time to try to answer general Bayesian modeling questions.

Can’t help you there, but I can attest to having worked with a lot of epidemiologists, and they’re one of the largest groups of consumers of Bayesian methods. Bayesian methods are also very popular in Phase I clinical trials.

Generally, I would not recommend censoring any data if you can model it.

If you know the non-linearity, then just model that. A brute force non-linear approach would be to break the values into ranges and fit an intercept to each, perhaps smoothing across the ranges. This is popular, for example, with age effects in biomedical models. It sounds like a lot of the other values may also be non-linear. To the extent possible, I’d suggest trying to plot them against outcomes (perhaps for different subgroups if you’re building a hierarchical model) in order to see what the effects are.
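As a rough illustration of the range-based approach, here is a minimal Python sketch. The toy ages, outcomes, and cut points are made up, and the helper names `bin_index` and `outcome_rate_by_bin` are hypothetical. It assigns each value to a range and computes the raw outcome rate per range, which is the kind of quantity you might plot before deciding on a model. The 1-based bin index is what you would pass to a model that fits one intercept per range.

```python
import numpy as np

def bin_index(values, edges):
    """Return a 1-based bin index for each value, given interior bin edges."""
    # np.digitize returns 0..len(edges); add 1 to get 1-based indices,
    # which is convenient when the index is later passed to Stan.
    return np.digitize(values, edges) + 1

def outcome_rate_by_bin(bins, outcome, n_bins):
    """Empirical proportion of outcome == 1 in each bin (NaN for empty bins)."""
    rates = np.full(n_bins, np.nan)
    for b in range(1, n_bins + 1):
        mask = bins == b
        if mask.any():
            rates[b - 1] = outcome[mask].mean()
    return rates

# Hypothetical toy data, not taken from the attached file.
age = np.array([34, 41, 47, 52, 58, 63, 69, 71])
ts = np.array([0, 0, 0, 1, 0, 1, 1, 1])
edges = np.array([45, 55, 65])  # four ranges: <45, 45-54, 55-64, >=65

bins = bin_index(age, edges)
rates = outcome_rate_by_bin(bins, ts, n_bins=len(edges) + 1)
print(bins)
print(rates)
```

The raw rates are only a diagnostic. Smoothing across ranges would happen inside the model, for example by putting a prior on the per-range intercepts that ties neighbouring ranges together.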

You can model missing data in Stan using the general techniques outlined in the Stan User’s Guide chapter on missing data.
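On the data-preparation side, the usual pattern is to split a variable into observed values plus index arrays for the observed and missing positions, so that the missing entries can be declared as parameters in the Stan program. The sketch below is a hedged example of that split in Python. The helper `split_missing` and the dictionary keys (`N_obs`, `N_mis`, `ii_obs`, `ii_mis`, `y_obs`) are illustrative names that would have to match whatever your own Stan data block declares, and the BMI values are invented.

```python
import numpy as np

def split_missing(y):
    """Split a vector with NaNs into observed values and 1-based index arrays."""
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    return {
        "N_obs": int((~missing).sum()),
        "N_mis": int(missing.sum()),
        # Stan arrays are 1-based, so shift the NumPy positions by one.
        "ii_obs": (np.flatnonzero(~missing) + 1).tolist(),
        "ii_mis": (np.flatnonzero(missing) + 1).tolist(),
        "y_obs": y[~missing].tolist(),
    }

# Hypothetical BMI measurements with two missing follow-up values.
bmi = [22.5, float("nan"), 27.1, 30.4, float("nan")]
stan_data = split_missing(bmi)
print(stan_data)
```

This works for continuous variables such as BMI or MethylationValue. A missing binary value such as TS status cannot be declared as a parameter in Stan, because Stan does not sample discrete parameters, so it would need to be marginalized out instead.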

Your questions (1) and (2) sound like empirical questions.