Joint model yields inflated association

I am using a multivariate joint model to test for main and interactive effects of two longitudinal timeseries on survival. It’s working almost as expected, except that one of the two longitudinal effects yields weirdly inflated associations. I have tried a univariate joint model, various association structures, and also working with different data subsets, but the effect remains, or becomes even bigger (like, 10^9 ridiculously big).
I don’t know what this means and how I can fix it. Because I am new to bayesian modelling, I am also not sure which additional information provide.

EDIT: I added the association structure, which I forgot to provide for this post.

survival_jm_full <-
  stan_jm(
    formulaLong = list(
      Pigmentation ~ Time_num * Food  + (1|Fam_Ind),
      Length ~ Time_num *  Food + (1|Fam_Ind)
      ),
    dataLong = mod_data_longi,
    formulaEvent = survival::Surv(Time_num, Survival) ~ Food,
    dataEvent = mod_data_event,
    assoc = list(c("etavalue", "etavalue_etavalue(2)", "etaslope_data(~ Food)"), 
             c("etavalue", "etaslope_data(~ Food)")),
    time_var = "Time_num",
    chains = 4, cores=8, seed = 12345, iter = 1000) 

  • Operating System: Ubuntu 18.04
  • rstanarm Version: 2.18.2
1 Like

If something in the posterior is implausible it could mean that there’s some prior information that isn’t included. So maybe a more informative prior is needed. I’m not
so familiar with these models but maybe @sambrilleman has some ideas?

@mluerig This looks like it is almost certainly a bug.

The model you’ve specified uses the default association structure (i.e. assoc = "etavalue"), meaning that the expected value for Pigmentation and the expected value for Length should be included as time-varying covariates in the survival submodel. That would correspond to a model with two association parameters.

But, as we can see from the estimates you’ve plotted, the model is trying to estimate five association parameters! My guess is that the use of the interaction term Time_num * Food in each longitudinal submodel is being incorrectly handled when I’m forming the model matrices for the association structure; hence there are association parameters shown in the output that correspond to some nonsense interactions that aren’t even in the model you specified.

To try and diagnose whether this might be the issue, can you try fitting the multivariate joint model without the interaction terms in the longitudinal submodel (you only need to run it for a few iterations, say iter = 20), e.g.

survival_jm_full <-
  stan_jm(
    formulaLong = list(
      Pigmentation ~ Time_num + Food + (1|Fam_Ind),
      Length ~ Time_num + Food + (1|Fam_Ind)
      ),
    dataLong = mod_data_longi,
    formulaEvent = survival::Surv(Time_num, Survival) ~ Food,
    dataEvent = mod_data_event,
    time_var = "Time_num",
    chains = 4, cores=8, seed = 12345, iter = 20) 

and then plot the estimates – there should only be two association parameters estimated (called Assoc|Long1|etavalue and Assoc|Long2|etavalue).

It the above model returns the appropriate number of parameters, then we can be pretty certain it is the interaction term causing the problem. … Unfortunately I can’t form a reproducible example, even trying to replicate what I perceive to be the issue. If you can show a reproducible example, e.g. using the pbcLong/pbcSurv dataset, or email me the dataset you are using then that would help me to troubleshoot.

Another thing I noticed in your code is that you have the variable Time_num specified as both (i) the longitudinal time variable and (ii) the event time variable in the survival submodel. The variable names being the same isn’t an issue in itself, since they are drawn from different datasets. But if those are in fact the same variable, then that is an issue. The variable denoting the time of the event is not the same as the variable denoting the time of the longitudinal measurements.

thanks very much for your response @sambrilleman

I just realized that I provided code for a model where I did not specify the association structures shown in the figure. At the time I played around with association structures quite a bit an then took the wrong one. So, the models is in fact returning the correct number of parameters. I updated the code to match the figure - my apologies for that.

But what you said about the time variable makes me think that I am maybe misunderstanding something about the data structure for joint models. So maybe I should explain what I am trying to do.
I measure Length and Pigmentation of my organisms at regular intervals (Time_num), but when I find that they are dead the time series becomes truncated, and the Time_num becomes the event time (see below). I now would like to test whether rates with which both traits increase (and the interaction of these rates) affect how fast they die. Is this possible with those model specifications?

A question I asked at CV pointed me towards joint models; there I also explain the data structure in more detail.

In any case, I also provided a subset of my data
mod_data_event.csv (25.9 KB)
mod_data_longi.csv (86.8 KB)

> mod_data_longi[Fam_Ind=="fam115_01", c("Time_num","Food","Length","Pigmentation")]
   Time_num         Food         Length  Pigmentation
1:        0 High protein -0.45390385457 -0.1697128974
2:        1 High protein -0.07202093855  0.2546282424

> mod_data_longi[Fam_Ind=="fam115_02", c("Time_num","Food","Length","Pigmentation")]
   Time_num        Food        Length   Pigmentation
1:        0 Low protein -0.4141243842 -0.23499614968
2:        1 Low protein -0.1409720206  0.05877848556
3:        2 Low protein  0.5856663057  0.05877848556
4:        3 Low protein  0.8508627752  0.28726986852
5:        4 Low protein  1.0736278095  0.48311962535

> mod_data_event[Fam_Ind=="fam115_01", c("Time_num","Food","Survival")]
   Time_num         Food Survival
1:        1 High protein   1
> mod_data_event[Fam_Ind=="fam115_02", c("Time_num","Food","Survival")]
   Time_num        Food Survival
1:        4 Low protein   1

I forgot to mention that I also tried with adding priors, which made it a bit better, but not much:

## priors on scaled data
priorEvent_intercept = normal(0,1),
priorEvent_aux = normal(0,1),
priorEvent = normal(0,1),
priorEvent_assoc = normal(0,1)

@mluerig Just fyi – I haven’t forgotten about this – will plan to take a closer look tomorrow.

But on first glance, it looks like you have heavily interval censored survival times. Unfortunately stan_jm doesn’t handle interval censoring yet (so some bias would already be induced by assuming that events occurs at the discrete (planned) measurement times, i.e. time_num == 0,1,2,3,4, when really the deaths likely occur in continuous time).

But I think a larger problem is not the interval censoring per-se, but the fact that events only occur at a very small number of discrete times. I’m not certain of the exact implications – but at least one implication would likely be difficulties in estimating a baseline hazard (unless you simplified to say an exponential, i.e. constant, baseline hazard).

Some other thoughts I had on first glance:

  • I saw you have Time_num equal to 0 in the event data for a number of individuals, which will cause issues

  • the association structure you specified in your example code is very complex – for example, including the interaction of two continuous covariates (length and pigmentation) – which may be where the model is running into problems (e.g. because there is little data with which to estimate those complex interaction effects). Although I think you mentioned you still had estimation problems even with the simpler association structure(s)?

  • perhaps the autoscaling of the priors isn’t working well, particularly for terms like the interaction between length and pigmentation

I’ll try take a closer look tomorrow, starting with a simpler model, to see if I can diagnose where things go astray.

2 Likes

@mluerig Sorry for the delay. I’ve now had a chance to look at your data and model. You have in fact discovered a bug with the etavalue_data and etaslope_data association structures. :-( Just a small bug but with pretty serious consequences for those estimated association parameters. I really should have caught this earlier (those association structures aren’t tested in my simulations, but will be now!). Thanks for reporting it. (Hopefully it hasn’t been insidious enough to catch other people without realising. Hopefully the wild standard errors would have alerted them to the fact that there was an issue, eek).

I’ve opened an rstanarm issue here, fixed the bug on this branch, and opened a pull request to get this merged into the master branch and hopefully up on CRAN asap.

Nonetheless, having now had a look at your data, there are another bunch of other things / issues you might want to consider for your analysis, aside from rerunning the models with any *_data association structures. Some things that came to mind:

  • there are a small number of unique event times (i.e. lots of ties). This means there is little information with which to shape of the baseline hazard. You should therefore assume a simple shape for the baseline hazard, e.g. a Weibull, by specifying basehaz = "weibull". The default baseline hazard used by stan_jm uses B-splines to approximate the log baseline hazard, and you won’t have enough information in your data to do that successfully.

  • your data is effectively interval censored. For instance, if length and pigmentation are observed at time 0 and time 1, and then at time 2 the critter is known to be dead, then the event time is known to be in the interval (1,2]. Unfortunately, stan_jm doesn’t handle interval censoring, so the best we can do is impute the event time as being at the centre of the interval. Note that, in the data you sent me, you had the event time (Time_num in the survival dataset) as the lower limit of the interval – this will cause problems, because (i) some individuals die at time 0, and (ii) critters die at the exact same time as their last biomarker measurements are taken. To fix this issue, I added 0.5 to the event times. (NB: I think the JMbayes package is supposed to fit multivariate JMs with interval censored event times, so that is another option that may be worth exploring).

  • in your data, the event indicator was equal to 1 for all observations. Is this correct? This suggests that all critters died out by the last biomarker observation time. If not, then the critters who were alive after the last biomarker measurement should be given a right censoring time and the event indicator set to 0 to indicate right censoring.

  • the biomarkers (Length and Pigmentation) seem pretty highly correlated. So you may have to fit both univariate and multivariate joint models to try and unpick potential confounding and/or multicollinearity issues. My guess is that the estimates for the association parameter(s) related to one biomarker are likely to be highly dependent on whether you adjust for the other biomarker.

  • the etaslope association structure assumes that there is information in the biomarker slopes that can inform the hazard of death. If the biomarker submodel(s) only include a random intercept (and no random slopes), then the only differences in slopes across individuals is due to the Food:Time_num interaction term in your biomarker submodels (i.e. there are only 2 unique values for each biomarker slope – one corresponding to each food group). This might be fine, but it probably deserves a bit of thought, especially since you are also adjusting for Food as a main effect in the survival submodel --so it becomes a little unclear to me whether the model is actually identifiable. I ran into quite a few difficulties trying to fit models with etaslope to your data, and I suspect that might be the reason why. EDIT: I did a little simulation study to assess the potential for this identifiability issue, and it confirmed it. The issue being that the ‘current slope’ (used an implicit covariate in the survival submodel) and the binary baseline covariate itself are perfectly collinear. You may have already been aware of this, but I needed to confirm it for myself! Code is attached in case it helps. Cheers.comparing_etaslope_models.R (5.1 KB)

3 Likes

@sambrilleman thanks a thousand for your effort and your incredibly thorough answer! I need some more time to digest and think about the points you made, but I really want to include this analysis in the paper I am writing right now, so I’ll definitely check if etavalue_data works with the new rstanarm version once its out (can I find a release roadmap somewhere?).

Following up what you said about etaslope, wouldn’t it help for etaslope to be identifiable to fit random intercepts with the group argument (i.e. (Time_num | Fam_Ind)? Otherwise, I am not sure whether etaslope makes much sense then with just two unique values. But I may have mistaken that part…

Also, I am curious about whether it is possible to test three-way interactions between main effects and the two biomarkers (i.e. Food * Growth rate * Pigmentation rate) on survival. This would effectively test for differences in fitness landscapes between the treatments. You don’t mention it in the function reference, so I assume that this is probably not possible, but I was wondering what you thought about this.

Thanks again, I will keep you posted.

@mluerig No worries, not sure of the exact roadmap for the next rstanarm release (perhaps @bgoodri can advise), but he did use the word “imminent” in previous correspondence! ;-)

Yeah, specifying (Time_num | Fam_Ind) would definitely make more sense for an etaslope association structure. But from memory, fitting a model with a random effect for Time_num was not easy with your data (if I recall correctly, perhaps even lme4::lmer failed to converge for the univariate longitudinal case?) which might make sense given the limited longitudinal data. I’m not sure though, you’d have to investigate.

At the moment, a three-way interaction would only be possible if two of the terms in the interaction were known covariates. A three-way interaction in which two of the terms in the interaction are (modelled) longitudinal biomarkers is not possible. I don’t know if I’d plan to add that any time soon (I think there would be very few use cases).