Order of operations and when to apply a transformation

Hi all, this isn’t a technical problem per se, although I am using Stan via the rethinking package. Rather, I could use some advice from smart Bayesians. Apologies if this doesn’t belong here.

I have a model for sizes of prey items eaten at Peregrine Falcon nests. The data are expressed as a proportion of a full prey item, and the idea is that the proportion eaten will vary depending on prey type, the number of nestlings in the nest, and their age, etc… So build a GLM assuming a Beta distribution and away you go, right? Well not quite. Unfortunately, the resolution of the proportion data is such that the model produces unrealistically large estimated meal sizes for nestlings of young age. For example, large prey items were estimated to the nearest 1/8th of a prey item, but even 1/8th of those prey items is physically too large for young nestlings to eat in one sitting.

So I had the idea to create an expression for maximum meal size, which assumes maximum meal size scales linearly with nestling growth (which is logistic). The expression is parameterized using values from the literature. My idea was then to cap the predictions of my original model wherever they exceed the estimates produced by that expression, so that predictions are capped at realistic meal sizes.
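As a sketch of that capping idea (all parameter values below are placeholders, not the literature values referred to above, and the function name is my own invention):

```python
import numpy as np

def max_meal_size(age_days, asymptote=0.6, growth_rate=0.25, inflection=15.0):
    """Hypothetical cap on meal size (as a proportion of a prey item),
    assumed to scale linearly with logistic nestling growth.
    asymptote, growth_rate, and inflection are illustrative placeholders."""
    growth = 1.0 / (1.0 + np.exp(-growth_rate * (age_days - inflection)))
    return asymptote * growth

ages = np.array([2.0, 10.0, 25.0])       # nestling ages in days
predicted = np.array([0.125, 0.125, 0.25])  # stand-in model predictions
caps = max_meal_size(ages)
capped = np.minimum(predicted, caps)     # cap only where predictions exceed the limit
```

For a very young nestling the cap bites (the 1/8 prediction is pulled down), while older nestlings' predictions pass through unchanged.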

However, it has occurred to me that perhaps the better approach would be to transform the data based on my expression first (i.e., cap the data that are unrealistic), and then model the transformed dataset.

Any advice on which approach is more sound? I’ve been going back and forth on this for weeks.


Given the choice between the two options (update the model vs. postprocess the predictions), you’re probably better off updating the model so that the predictions improve. The fact that the predictions are off is evidence that the model is off.

And since this is a GLM, maybe this just changes how the input covariates are encoded. For instance, maybe age groups as factors instead of age as a continuous variable, or something like that.

And then as far as capping the input data goes: is the data really unrealistic if you measured it? :D At least that’s the difficulty with that argument.

Simulated data experiments are the way to play with new models. Real data can just get confusing because you don’t have ground truth. Assume a True Data Generating Process, generate data, fit your model, and check the results against your Known Truth.
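A minimal version of that experiment, assuming an illustrative Beta data-generating process and the observation quirk described above (estimates to the nearest 1/8, never recorded as 0):

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "true" data-generating process: meal proportions from a Beta
# distribution. The parameters (2, 18) are illustrative only.
true_prop = rng.beta(2.0, 18.0, size=10_000)

# Observation process: round to the nearest 1/8, but never record 0
# (a recorded meal is at least 1/8), mimicking the upward bias at the
# small end described in the thread.
obs = np.maximum(np.round(true_prop * 8) / 8, 1 / 8)

print(true_prop.mean())  # known truth
print(obs.mean())        # biased upward, because small meals can't round to 0
```

Fitting the candidate model to `obs` and comparing against `true_prop` then shows exactly how much damage the rounding does, and whether a proposed fix recovers the truth.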


Just to recap here and make sure I’ve got this right:
You’ve got data estimated to the nearest 1/8, but sometimes 1/8 is an unrealistic value when the true value is only slightly over 1/16 (if the true value is less than 1/16, you estimate a proportion of 0). You’re therefore concerned about making implausible predictions when the true value is near 1/16.

First, I wonder what you are really interested in predicting. If you just want to keep track of the energy budget integrated over a period of multiple meals, then you might hope/expect that proportions that get rounded down will offset the proportions that have been rounded up to unrealistic meal sizes. Thus, you might not have such a big problem on your hands.

If you really need to make good predictions about individual meals, then the first thing I notice is that these predictions stand out because they are unrealistic, but in absolute terms they are no worse than the rounding errors introduced elsewhere: some values are rounded down, some are rounded up. If capping the predictions based on a logistic growth model accurately identifies some meals that are unrealistically large, you will end up underestimating the total amount of food eaten overall, because you are correcting a subset of the errors that arise from rounding up without correcting any equivalent subset of the errors that arise from rounding down.

One potential solution is to estimate a suitable measurement error model, where the true meal size is constrained to be near the observed size but is also informed by covariates such as nestling size, etc.
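The core of such a measurement error model is the rounding likelihood: the probability that a latent true proportion lands in the interval that gets recorded as a given multiple of 1/8. A sketch of that piece, assuming an illustrative latent Beta distribution (this is not the poster's actual model, and the function name is invented):

```python
from scipy.stats import beta

def rounding_likelihood(obs_eighths, a, b):
    """Probability that a latent Beta(a, b) meal proportion is recorded
    as obs_eighths/8 when estimates are rounded to the nearest 1/8.
    Sketch of the interval-censoring idea; a and b are placeholders."""
    lo = max((obs_eighths - 0.5) / 8, 0.0)  # lower edge of rounding interval
    hi = min((obs_eighths + 0.5) / 8, 1.0)  # upper edge of rounding interval
    return beta.cdf(hi, a, b) - beta.cdf(lo, a, b)
```

The intervals for 0/8 through 8/8 tile [0, 1], so the probabilities sum to one; in a full model the Beta parameters would themselves depend on covariates like nestling age.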

Another potential problem could arise if you always rounded up for proportions below 1/8 (i.e. you never estimated a proportion of 0/8). This will potentially introduce more serious bias at the low end (bias that almost certainly will not “come out in the wash” when averaging over multiple meals), and you might want to consider handling it with some kind of bespoke measurement error step that’s applied exclusively to estimated proportions of 1/8.

Lastly, while these measurement error models might sound cool, be aware that they are fundamentally limited in what they can accomplish. There won’t be any magic cure for a realization that the resolution of your data is inadequate for the inference that you were hoping to achieve. You can fiddle around until you squash obviously unrealistic predictions, but that does not mean that your predictions are accurate.


Thanks to both of you!

@jsocolar: You got it. Your second scenario, where meal sizes are consistently overestimated for small meals, is what I’m dealing with. I can see why recording 0/8 for some proportion of small meals makes statistical sense in terms of then getting unbiased predictions, but if you record 0/8 for meal size, was there even a meal? For that reason the data were not collected that way.

Yeah, I understand that my solution, whatever it is, is going to be kind of hacky. Visual estimation of meal size from grainy nest-camera images just isn’t a high-resolution method.

As for what I need the predictions for: I’m looking for estimates of average meal size to combine with predictions of the number of meals per day (a separate model) and estimates of prey body size (draws from a normal distribution to account for variation). The whole scheme is prey-type specific, so, for example, there will be y meals of small birds in a day, and they will have average meal size x (a proportion) and body mass z (drawn from N(30 g, 5 g)). The goal is to get an idea of how much biomass nestlings are predicted to ingest on a daily basis.
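That combination scheme can be sketched as a Monte Carlo calculation. Everything numeric below is a placeholder (the Poisson rate, the Beta parameters for meal proportion), except the N(30 g, 5 g) body mass distribution, which comes from the post:

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 5_000

# Illustrative draws for one prey type ("small birds"); the meals-per-day
# and meal-proportion distributions are placeholders, not fitted values.
meals_per_day = rng.poisson(4, size=n_draws)           # y: meals per day
meal_prop = rng.beta(3.0, 12.0, size=n_draws)          # x: proportion of prey eaten
body_mass_g = rng.normal(30.0, 5.0, size=n_draws)      # z: prey body mass (g), from the post

# Daily ingested biomass per draw, propagating uncertainty from all three parts
daily_biomass = meals_per_day * meal_prop * body_mass_g
print(daily_biomass.mean())
```

Working with draws like this, rather than plugging in point estimates, keeps the uncertainty from all three component models in the final biomass estimate.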

@bbbales2: In this case there is obvious measurement bias, so the data do unfortunately turn out to be unrealistic. Once nestlings are older and can consume meals larger than the resolution of my data, I’m confident the data are unbiased. Before that, meals are consistently biased large.

I agree simulations are probably the way to evaluate solutions. I’ll give it a go.

Thanks for your thoughts.