Hey guys,
I have a general conceptual modeling question I’d like to pose. I am working with some time-series data at monthly resolution going back about 20 years. I am looking into generating predictive models for total catch using one model for effort (angler-hours, month level) and another for catch-per-angler-hour (total catch per month / total angler-hours per month). While I would like to make inferences from each of these models separately, I would also like to generate complete posterior predictive distributions for each month from the effort and catch-per-angler-hour models separately, then multiply the individual scans together to simulate what predicted total catch would be. The basic equation is simple enough:
Catch = angler-hours x catch-per-angler-hour
However, before I proceed to writing the MS, I wanted to make sure this is appropriate. The models both fit fine to their respective datasets, and the amalgamation seems to recreate the aggregate catch data well.
Moreover, if this is ok, should I combine them scan-wise or use some multiplicative analog to a convolution to generate the combined result?
Finally, if this is indeed appropriate, there is a third iteration of modeling I would like to add on top of this, but I am trying to be as careful as I can not to use the models inappropriately.
Thanks!
Is there a reason you can’t build a joint model of everything? What you’re proposing is a kind of “cut” inference where information won’t be able to flow between the two models.
I would think the natural thing to do would be to have one model of angler-hours and another of catch per angler-hour, then put them together to get a model of expected catch, and then model the actual catch as a noisy realization of that. So if y[n] is the catch in month n, and we fit alpha[n] angler-hours and beta[n] catch/hour, then something like
y[n] ~ normal(alpha[n] * beta[n], sigma);
The key thing is that catch isn’t angler-hours x catch per angler-hour; it’s that the expected catch is given by angler-hours * catch/hour, and then the observation has additional noise. As @andrewgelman likes to say, “uncertainty greases the wheels of commerce.” If you try to make this work out to be exact, you’ll run into all sorts of problems.
That’s just two models, though, right? And ideally, it’d be just one model, I think.
First, thanks for lending a hand here. I’ve read a ton of your comments on other forum pages that have been extremely helpful in my own work.
Second, I was considering a single model, and wondered if that was the direction to go. It would certainly make inference easier.
So just so I am following you: in the same Stan model, specify three likelihoods? Something like…
E_i \sim N(\mu_i, \sigma_E)
CPUE_i \sim N(\eta_i, \sigma_{CPUE})
C_i \sim N(\mu_i \cdot \eta_i, \sigma_C)
where E_i, CPUE_i, \text{ and } C_i are effort in month i, CPUE in month i, and catch in month i? That shouldn’t be too bad at all. Note that I used normal distributions here just for convenience; the actual distributions would be lognormal and hurdle-lognormal, respectively.
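To make that concrete, here is a rough sketch of what I’m picturing (normal likelihoods and placeholder priors just for illustration; in practice mu and eta would get whatever time-series structure I end up using, and all the names are just mine):

data {
  int<lower=1> N;                 // number of months
  vector[N] E;                    // effort (angler-hours) per month
  vector[N] CPUE;                 // catch per angler-hour per month
  vector[N] C;                    // total catch per month
}
parameters {
  vector<lower=0>[N] mu;          // expected effort per month
  vector<lower=0>[N] eta;         // expected CPUE per month
  real<lower=0> sigma_E;
  real<lower=0> sigma_CPUE;
  real<lower=0> sigma_C;
}
model {
  // placeholder priors; these would be replaced by time-series structure
  mu ~ lognormal(0, 1);
  eta ~ lognormal(0, 1);
  sigma_E ~ normal(0, 1);
  sigma_CPUE ~ normal(0, 1);
  sigma_C ~ normal(0, 1);

  // three likelihoods, one per data stream
  E ~ normal(mu, sigma_E);
  CPUE ~ normal(eta, sigma_CPUE);
  C ~ normal(mu .* eta, sigma_C);
}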
Lastly, there is one more component I was hoping to include, which is the probability of retaining/discarding given N animals caught, in a logistic regression framework. If you have the raw numbers of fish retained and discarded as data, would you endorse adding that as one more likelihood function in the model block? Total retained/discards are the end goal here and the third iteration I alluded to in the original post.
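Concretely, I was imagining adding something like this (names are placeholders; retained[i] and discarded[i] would be the monthly counts, x[i] a row vector of covariates, and gamma the logistic-regression coefficients):

  // in the data block
  array[N] int<lower=0> retained;     // fish retained per month
  array[N] int<lower=0> discarded;    // fish discarded per month

  // in the model block: retention probability on the logit scale
  for (i in 1:N)
    retained[i] ~ binomial_logit(retained[i] + discarded[i], x[i] * gamma);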
Thank you so much for taking the time to respond to these. I hope these questions aren’t too basic; I’ve done a lot of complex model building in Stan, just never with quite this type of structure.
If I’ve understood correctly, you have data on total catch and data on total angler hours, from which you can compute catch per effort (or you have data on any two of these three quantities, from which you can compute the third). That is, you don’t have three independent-ish data streams that give you effort, catch per effort, and total catch, but rather you have strongly dependent data streams such that you can perfectly recover any one of them from the other two. You want a model, or a pair of models, that coherently predicts all three quantities.
Maybe I’m missing something and @Bob_Carpenter can set me straight, but I suspect that his answer assumes that the data on total catch are collected separately from the data on catch per effort. It seems to me that when this isn’t the case, it will be at best quite subtle and at worst nearly impossible to correctly include three likelihood terms (one for each of the three data streams) without getting overconfident inference from using the same data twice in the likelihood.
My overall suggestion would be to pick whichever pair of data streams you have a better intuitive sense of how to model appropriately, and to fit both of them jointly in one model that includes estimation of the covariance between the residuals in the two data streams, and then derive predictions for the third data stream using the appropriate sample arithmetic.
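For example, something along these lines (very much a sketch, assuming you pick effort and CPUE as the two modeled streams; all names, priors, and the log scale are placeholders):

data {
  int<lower=1> N;
  vector[N] log_E;                 // log effort per month
  vector[N] log_CPUE;              // log catch-per-angler-hour per month
}
parameters {
  vector[N] mu_E;                  // expected log effort (give this time-series structure)
  vector[N] mu_CPUE;               // expected log CPUE (likewise)
  vector<lower=0>[2] tau;          // residual scales
  cholesky_factor_corr[2] L;       // residual correlation
}
model {
  tau ~ normal(0, 1);
  L ~ lkj_corr_cholesky(2);
  // ... priors / time-series structure for mu_E and mu_CPUE go here ...
  for (n in 1:N)
    [log_E[n], log_CPUE[n]]' ~ multi_normal_cholesky(
        [mu_E[n], mu_CPUE[n]]', diag_pre_multiply(tau, L));
}
generated quantities {
  // derive the third stream (total catch) from the other two, draw by draw
  vector[N] C_pred;
  for (n in 1:N) {
    vector[2] z = multi_normal_cholesky_rng(
        [mu_E[n], mu_CPUE[n]]', diag_pre_multiply(tau, L));
    C_pred[n] = exp(z[1] + z[2]);  // catch = effort * CPUE on the natural scale
  }
}

Then C_pred gives you a posterior predictive distribution for total catch without the derived data stream ever entering the likelihood.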