Survivorship bias in multiple outcome model

StuRu · April 25, 2018, 8:49am

Hi all,

Theory question from a veterinarian working outside academia & playing with stats & some pretty nice datasets mostly ‘for fun’… please forgive my ignorance!

I’m trying to get my head around a bias-heavy dataset. I am wondering if you have any suggestions on how to deal with an outcome variable missing not at random because of early exit from the study.

I’m modelling something related to whole-lactation (305 days) milk yield in dairy cows. The lactation curve is heavily nonlinear (another Stan project of mine…), so I’m trying to keep things simple by using this standardised Gaussian outcome. I’m comfortable using standard predictions if a cow leaves the study before 305 days, but those predictions are based upon recorded milk production until that point.

In some cases cows leave the herd in the first few weeks after giving birth, and some don’t have any recorded milk production, so predicted 305-d milk yields are not available. Those cows are likely to have been sold, ill or injured, and are likely to have produced less milk. Early exits will almost certainly correlate with my particular variable of interest (attributes of the calf she produced, suggesting higher or lower risk of disease/injury around birth), i.e. the MAR assumption is violated. For perspective, ~5% of some subsets have missing 305-d milk yields.

I had initially tried to deal with this survivorship bias with a multivariate mixed-response (Gaussian, Bernoulli) hierarchical model, jointly estimating 305-d milk yield and probability of early exit (some of the early-exit cows have no milk yield data). The results look extremely helpful, but I’m now worried about my model choice, as it was intuitive, and I have no idea whether or not it was appropriate.

What also doesn’t help: both yield and certain calf attributes are likely to affect the probability of early exit, i.e. early exit is a collider as I understand it. Is this even possible?

Thoughts welcomed. Thanks,

Stuart

betanalpha · April 26, 2018, 2:36pm

In order to do this correctly you will need to model the selection generatively. For example, you might have some latent health parameter for each cow and that health parameter determines not only the milk yield for cows that go on to produce milk but also how likely the cow is to drop out due to illness or being sold, etc. This is a bit more subtle than simply modeling everything jointly as you have to encode specific dependencies in your model.
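
To make the latent-health idea concrete, here is a minimal simulation sketch in plain Python (not Stan, and not a fitted model). Everything in it is a hypothetical assumption for illustration: the `risky_calf` attribute, the coefficients, and the logistic form of the exit probability. It only shows the generative structure, where one latent health value drives both the standardised yield and the chance of early exit, and how dropping the early exits distorts the apparent calf effect.

```python
import math
import random
import statistics


def simulate_herd(n_cows=40000, beta_calf=-0.5, seed=1):
    """Simulate cows whose latent health drives both yield and early exit.

    All numbers are made up for illustration; none come from real data.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_cows):
        # Hypothetical binary calf attribute (e.g. a higher-risk birth).
        risky_calf = rng.random() < 0.3
        # Latent health: shifted down for cows with a risky calf.
        health = rng.gauss(0.0, 1.0) + (beta_calf if risky_calf else 0.0)
        # Standardised 305-d yield depends on health plus noise.
        yield_std = 0.8 * health + rng.gauss(0.0, 0.6)
        # Probability of early exit falls as health rises (logistic link).
        p_exit = 1.0 / (1.0 + math.exp(3.0 + 1.5 * health))
        exited = rng.random() < p_exit
        rows.append((risky_calf, yield_std, exited))
    return rows


def calf_effect(rows, observed_only):
    """Mean yield difference, risky minus non-risky calves.

    With observed_only=True, cows that exited early are dropped, which
    mimics analysing only the cows that have a 305-d yield.
    """
    keep = [r for r in rows if not (observed_only and r[2])]
    risky = [y for calf, y, _ in keep if calf]
    safe = [y for calf, y, _ in keep if not calf]
    return statistics.fmean(risky) - statistics.fmean(safe)


rows = simulate_herd()
true_effect = calf_effect(rows, observed_only=False)   # uses every cow
naive_effect = calf_effect(rows, observed_only=True)   # survivors only
exit_rate = sum(exited for _, _, exited in rows) / len(rows)

print(f"early-exit rate:           {exit_rate:.3f}")
print(f"calf effect, all cows:     {true_effect:.3f}")
print(f"calf effect, survivors:    {naive_effect:.3f}")
```

Under these assumptions the survivors-only estimate is pulled toward zero, because the least healthy cows in the risky group are the ones most likely to be missing. A Stan model following the same structure would treat health as a per-cow parameter and let both the yield likelihood and the exit likelihood depend on it.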

You may find helpful examples in the survival models already implemented in Stan, for example http://www.hammerlab.org/2017/06/26/introducing-survivalstan/.

StuRu · April 26, 2018, 6:18pm

Perfect. I’ll take a look at the link & adjust my approach as needed. Many thanks for the pointer.
