Survivorship bias in multiple outcome model

Hi all,

Theory question from a veterinarian working outside academia & playing with stats & some pretty nice datasets mostly ‘for fun’… please forgive my ignorance!

I’m trying to get my head around a bias-heavy dataset. I am wondering if you have any suggestions on how to deal with an outcome variable missing not at random because of early exit from the study.

I’m modelling something related to whole-lactation (305 days) milk yield in dairy cows. The lactation curve is heavily nonlinear (another Stan project of mine…), so I’m trying to keep things simple by using this standardised Gaussian outcome. I’m comfortable using standard predictions if a cow leaves the study before 305 days, but those predictions are based upon recorded milk production until that point.

In some cases cows leave the herd in the first few weeks after giving birth, and some don’t have any recorded milk production, so predicted 305-d milk yields are not available. Those cows are likely to have been sold, ill or injured and are likely to have produced less milk. Early exits will almost certainly correlate with my particular variable of interest (attributes of the calf she produced, suggesting higher or lower risk of disease/injury around birth), i.e. the MAR assumption is violated. For perspective, ~5% of some subsets have missing 305-d milk yields.

I had initially tried to deal with this survivorship bias with a multivariate mixed-response (Gaussian, Bernoulli) hierarchical model, jointly estimating 305-d milk yield and probability of early exit (some of the early-exit cows have no milk yield data). The results look extremely helpful, but I’m now worried about my model choice, as it was intuitive, and I have no idea whether or not it was appropriate.

What also doesn’t help: both yield and certain calf attributes are likely to affect the probability of early exit, i.e. early exit is a collider, as I understand it. Is it even possible to account for this?

Thoughts welcomed. Thanks,


To do this correctly you will need to model the selection generatively. For example, you might have a latent health parameter for each cow that determines not only the milk yield for cows that go on to produce milk but also how likely the cow is to drop out due to illness, being sold, etc. This is a bit more subtle than simply modeling everything jointly, as you have to encode specific dependencies in your model.
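To make the idea concrete, here is a minimal simulation sketch in plain Python (not a fitted model, and not Stan code). It assumes a hypothetical binary calf attribute `x`, a latent `health` value per cow, and made-up coefficients; none of these names or numbers come from your data. Health raises yield and lowers the chance of early exit, and `x` raises the chance of early exit. Comparing the group difference in yield on all simulated cows with the difference on only those cows that stayed shows how dropping early exits can distort the estimated effect of `x`.

```python
import math
import random
import statistics

# All numbers below are made up for illustration; none come from real herd data.
TRUE_EFFECT = -300.0   # assumed effect of the calf attribute on 305-d yield (kg)


def simulate(n=100_000, seed=1):
    """Simulate cows whose latent health drives both yield and early exit."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0   # hypothetical binary calf attribute
        health = rng.gauss(0.0, 1.0)         # latent health, never observed
        # Yield depends on the calf attribute and on latent health.
        y = 9000.0 + TRUE_EFFECT * x + 800.0 * health + rng.gauss(0.0, 500.0)
        # Early exit is more likely when x == 1 and when health is low.
        logit = -4.0 + 1.0 * x - 1.5 * health
        p_exit = 1.0 / (1.0 + math.exp(-logit))
        exited = rng.random() < p_exit
        rows.append((x, y, exited))
    return rows


def mean_difference(rows):
    """Mean yield for x == 1 minus mean yield for x == 0."""
    y1 = [y for x, y, _ in rows if x == 1]
    y0 = [y for x, y, _ in rows if x == 0]
    return statistics.fmean(y1) - statistics.fmean(y0)


def summarize(rows):
    """Return (all-cow difference, complete-case difference, fraction missing)."""
    observed = [r for r in rows if not r[2]]   # cows that did not exit early
    frac_missing = 1.0 - len(observed) / len(rows)
    return mean_difference(rows), mean_difference(observed), frac_missing


if __name__ == "__main__":
    full, naive, frac = summarize(simulate())
    print(f"fraction with missing yield: {frac:.3f}")
    print(f"difference using all cows (not available in practice): {full:.1f}")
    print(f"difference using only cows with a yield: {naive:.1f}")
```

In this setup the complete-case difference is pulled toward zero, because the low-health cows are removed more aggressively from the `x == 1` group. A generative model in Stan would encode the same two dependencies (health to yield, health and `x` to exit) with `health` as a per-cow latent parameter, so that the exit indicator carries information about the yields you never observed.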

You may find helpful examples in the survival models already implemented in Stan, for example

Perfect. I’ll take a look at the link & adjust my approach as needed. Many thanks for the pointer.