Hello, a simple question whose answer evades me.
I have a count model with a continuous predictor modelled as a spline.
I’d like to compute the predicted cumulative count given a range of the predictor.
Example: Poisson model of the number of patients by age for a hospital, with age being modelled as a spline.
How do I get the posterior predictive distribution of the number of patients between two ages?
The only solutions I could think of were:
- to model the cumulative distribution instead of the age distribution directly, then predict for two values and take the difference.
- use the original spline model to make predictions for each year of age in the range of interest and then sum them up. I’m not sure if I’d be underestimating the final count by discretizing age this way.
- just discretize age into groups already in the training data. I don’t like it since I’d like to keep the smoothing effect of splines.
Is there a way to obtain what I need directly from the original spline model?
Sorry this one got left behind. Because this is a count model, the counts are not observed continuously along the predictor but at discrete locations. For example, if the predictor is age and the response is the number of patients, you aren’t looking at counts of patients by age measured to the nanosecond; you are probably looking at counts by age measured in years or something like that. When you fit a spline to such data, a prediction at an arbitrary point isn’t interpretable as the number of patients of that exact age, but rather as the number of patients falling into an age bin of some width (e.g. one year) starting at that point.
So for example, if you have 10 patients who are 20 years old (i.e. between 20 and 21 years old) and 20 patients who are 21 years old (i.e. between 21 and 22 years old), and you fit a line through the points (20, 10) and (21, 20), you might predict 15 patients at age 20.5, by which you would really mean that you predict 15 patients between the ages of 20.5 and 21.5. So you will need to make predictions at some set of discrete values and sum over them.
Note that there is no guarantee that the fitted spline will even be self-consistent. For example, imagine that the spline runs through the points (20, 1), (20.5, 3), and (21, 1). There is no way for all of these predictions to be right simultaneously: packing 3 patients into the age range from 20.5 to 21.5 requires that either the 20-year-old category or the 21-year-old category has at least 2 patients in it.
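To make the self-consistency point concrete, here is a minimal sketch of the arithmetic for the three example points. The variable names are just for illustration, and each value is read as the expected count in the one-year window starting at that age.

```python
import numpy as np

# Hypothetical fitted values from the example above: the expected number of
# patients in the one-year window that starts at each age.
window_start = np.array([20.0, 20.5, 21.0])
expected_count = np.array([1.0, 3.0, 1.0])

# The window [20.5, 21.5) lies inside [20, 21) together with [21, 22), so its
# count cannot exceed the sum of the counts in those two windows.
upper_bound = expected_count[0] + expected_count[2]
is_consistent = expected_count[1] <= upper_bound

print(upper_bound, is_consistent)
```

The check fails for these values, which is the contradiction described above.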
I think the simplest suggestion I can give is: (1) make sure that your data come in age bins of equal width, or else things could get really funky; (2) evaluate the spline at the age bins in the data (which will still induce some smoothing) and sum these predictions to get the expected count within a range. This only works if the start and end of the range you are interested in line up with the breaks between bins.
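Since the question asks for the posterior predictive distribution rather than just the expected count, here is a rough sketch of how the summing could be done per posterior draw. It assumes you can extract a matrix of posterior draws of the expected count for each age bin from your fitted model. The `mu_draws` array below is a synthetic placeholder, not real output, and `predictive_count` is a hypothetical helper name. It also relies on the bins being conditionally independent Poisson counts given the rates, so that their sum is Poisson with the summed rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-year age bins, labelled by their lower edge (assumed to match the data).
bin_start = np.arange(0, 100)

# PLACEHOLDER posterior draws of the expected count per bin, with shape
# (n_draws, n_bins). In practice these would come from the fitted spline model.
n_draws = 4000
base = 50.0 * np.exp(-0.5 * ((bin_start - 60.0) / 20.0) ** 2)
mu_draws = base * rng.lognormal(mean=0.0, sigma=0.05, size=(n_draws, 1))


def predictive_count(mu_draws, bin_start, age_from, age_to, rng):
    """Posterior predictive draws of the total count for ages in [age_from, age_to)."""
    in_range = (bin_start >= age_from) & (bin_start < age_to)
    # Sum the expected counts over the selected bins, separately for each draw.
    total_mu = mu_draws[:, in_range].sum(axis=1)
    # A sum of independent Poisson counts is Poisson with the summed mean.
    return rng.poisson(total_mu)


total = predictive_count(mu_draws, bin_start, 20, 30, rng)
print(total.mean(), np.percentile(total, [2.5, 97.5]))
```

Summing within each draw before simulating keeps the posterior correlation between neighbouring bins, which you would lose by summing per-bin summaries such as means or intervals.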