Multiply scans from posterior distributions from separate models?

Hey guys,

I have a general conceptual modeling question I’d like to pose. I am working with some time-series data at the month resolution going back about 20 years. I am looking into generating predictive models for total catch using one model for effort (angler-hours, month level) and another for catch-per-angler-hour (total catch per month / total angler-hours per month). While I would like to make inferences on each of these models separately, I would also like to generate complete posterior predictive distributions for each model at each month for effort and catch-per-angler-hour separately, then multiply the individual scans together to simulate what predicted total catch would be. The basic equation is simple enough:

Catch = angler-hours × catch-per-angler-hour

However, before I proceed to writing the MS, I wanted to make sure this is appropriate. The models both fit fine to their respective datasets, and the amalgamation seems to recreate the aggregate catch data well.

Moreover, if this is ok, should I combine them scan-wise or use some multiplicative analog to a convolution to generate the combined result?
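For concreteness, the "scan-wise" (draw-wise) combination would look something like the following. This is a minimal sketch with simulated draws standing in for real posterior output, and it assumes both models were fit over the same months with their draws aligned; all the numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_months = 4000, 240  # ~20 years of monthly data

# Stand-ins for posterior draws from the two separate models
effort_draws = rng.lognormal(mean=6.0, sigma=0.3, size=(n_draws, n_months))   # angler-hours
cpue_draws = rng.lognormal(mean=-1.0, sigma=0.5, size=(n_draws, n_months))    # catch per angler-hour

# Draw-wise ("scan-wise") product: multiply matched draws, month by month.
# Each row is then one joint simulated realization of total catch.
catch_draws = effort_draws * cpue_draws

# Summaries of the implied posterior predictive for total catch
catch_mean = catch_draws.mean(axis=0)
catch_ci = np.percentile(catch_draws, [2.5, 97.5], axis=0)
```

Multiplying matched draws (rather than, say, multiplying summaries) is what preserves any dependence between the two quantities within a draw.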

Finally, if this is indeed appropriate, there is a third iteration of modeling I would like to add on top of this, but I am trying to be as careful as I can not to use the models inappropriately.

Thanks!

Is there a reason you can’t build a joint model of everything? What you’re proposing is a kind of “cut” inference where information won’t be able to flow between the two models.

I would think the natural thing to do would be to have one model of angler-hours and another of catch per angler-hour, then put them together to get a model of expected catch, and finally model the actual catch as a noisy realization of that. So if y[n] is the catch in month n, and we fit alpha[n] hours and beta[n] catch/hour, then something like

y[n] ~ normal(alpha[n] * beta[n], sigma);

The key thing is that catch isn’t exactly angler-hours × catch per angler-hour; rather, the expected catch is given by angler-hours * catch/hour, and the observation has additional noise on top of that. As @andrewgelman likes to say, “uncertainty greases the wheels of commerce.” If you try to make this work out to be exact, you’ll run into all sorts of problems.
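To make that generative story concrete, here is a minimal simulation sketch of the y[n] ~ normal(alpha[n] * beta[n], sigma) idea. All the parameter values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 240  # months

alpha = rng.lognormal(6.0, 0.3, size=n)   # "true" angler-hours per month
beta = rng.lognormal(-1.0, 0.5, size=n)   # "true" catch per angler-hour
sigma = 5.0                               # observation noise on catch

# y[n] ~ normal(alpha[n] * beta[n], sigma): expected catch is the product,
# and the observed catch is a noisy realization of that expectation.
y = rng.normal(alpha * beta, sigma)
```

The product alpha * beta plays the role of the expected catch; the noise term is what keeps the decomposition from having to hold exactly in the data.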

That’s just two models, though, right? And ideally, it’d be just one model, I think.

First, thanks for lending a hand here. I’ve read a ton of your comments on other forum pages that have been extremely helpful in my own work.

Second, I was considering a single model, and wondered if that was the direction to go. It would certainly make inference easier.

So, just so I am following you: in the same Stan model, specify three likelihoods? Something like…

E_i \sim N(\mu_i, \sigma_E)
CPUE_i \sim N(\eta_i, \sigma_{CPUE})
C_i \sim N(\mu_i \cdot \eta_i, \sigma_C)

where E_i, CPUE_i, \text{ and } C_i are effort in month i, CPUE in month i, and catch in month i? That shouldn’t be too bad at all. Note that I used normal distributions here just for convenience; in practice the effort and CPUE distributions would be lognormal and hurdle-lognormal, respectively.

Lastly, there is one more component I was hoping to include, which is the probability of retaining/discarding given N animals caught, in a logistic regression framework. If you have the raw numbers of fish retained and discarded as data, would you endorse adding that as one more likelihood function in the model block? Total retained/discards are the end goal here and the third iteration I alluded to in the original post.

Thank you so much for taking the time to respond to these. I hope these questions aren’t too basic; I’ve done a lot of complex model building in Stan, just never with quite this type of structure.

If I’ve understood correctly, you have data on total catch and data on total angler hours, from which you can compute catch per effort (or you have data on any two of these three quantities, from which you can compute the third). That is, you don’t have three independent-ish data streams that give you effort, catch per effort, and total catch, but rather you have strongly dependent data streams such that you can perfectly recover any one of them from the other two. You want a model, or a pair of models, that coherently predicts all three quantities.

Maybe I’m missing something and @Bob_Carpenter can set me straight, but I suspect that his answer assumes that the data on total catch are collected separately from the data on catch per effort. It seems to me that when this isn’t the case, it will be at best quite subtle and at worst nearly impossible to correctly include three likelihood terms (one for each of the three data streams) without getting overconfident inference that comes from reusing data twice in the likelihood.

My overall suggestion would be to pick whichever pair of data streams you have a better intuitive sense of how to model appropriately, and to fit both of them jointly in one model that includes estimation of the covariance between the residuals in the two data streams, and then derive predictions for the third data stream using the appropriate sample arithmetic.
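As a sketch of that "sample arithmetic": with simulated draws standing in for the posterior of a joint model of effort and CPUE (with correlated log-scale residuals), the third stream falls out draw by draw. All values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_months = 4000, 240
rho, s1, s2 = 0.4, 0.3, 0.5  # assumed residual correlation and scales

# Correlated log-scale residuals for the two modeled data streams
cov = np.array([[s1**2, rho * s1 * s2],
                [rho * s1 * s2, s2**2]])
eps = rng.multivariate_normal(np.zeros(2), cov, size=(n_draws, n_months))

log_effort = 6.0 + eps[..., 0]   # log angler-hours
log_cpue = -1.0 + eps[..., 1]    # log catch per angler-hour

# Third stream derived draw-by-draw: log(catch) = log(effort) + log(cpue)
catch = np.exp(log_effort + log_cpue)
```

Because the residuals are drawn jointly, the derived catch distribution inherits the estimated dependence instead of treating the two streams as independent.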

Forgive me, some of the notation below is a little sloppy.

You are correct. Thanks for that clarifying answer; this is very helpful. Let’s go with effort and catch per angler hour just to stick with the two data streams that I’d actually like to model with parameters.

The effort model is lognormal, and the catch-per-angler-hour model is a hurdle lognormal to account for 0’s. In the past, with non-Gaussian models, I’ve modeled dependence between two response variables with random effects drawn from a multivariate normal distribution. Something like this:

\begin{bmatrix}\theta_{1i}\\ \theta_{2i}\end{bmatrix} \Big|\, \Sigma \sim \text{MVN}(\mathbf{0}, \Sigma), \qquad \Sigma = \begin{bmatrix} \sigma_{1}^2 & \rho \sigma_{1}\sigma_{2}\\ \rho \sigma_{1} \sigma_{2} & \sigma_{2}^2 \end{bmatrix}

where \theta_i are the random effects for the ith month and \rho describes the degree of correlation between effort and catch per angler hour in the covariance matrix. \sigma_1 and \sigma_2 are noise terms associated with the random effects.

I presume one would need to do something similar here for the lognormal components, but I’m a little unsure of whether/how to add in the random effect for the hurdle component. I’m still new to hurdle models. Perhaps something like this?

\ln (E_i) \sim N(\mu_i + \theta_{1i}, \sigma_E)\\ CPUE_i \sim \text{HLN}(\eta_i + \theta_{2i}, \sigma_{CPUE}, \pi_i)

where \pi_i is the probability of observing 0 catch per angler hour for the hurdle, and HLN denotes the hurdle lognormal probability distribution. Is that something similar to what you were envisioning @jsocolar?
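For what it’s worth, the generative side of that hurdle structure can be sketched in a few lines. This is only a forward simulation with made-up fixed effects, not a fitted model; the correlated random effects \theta_{1i}, \theta_{2i} are shared across the effort and CPUE components, and the hurdle zeroes out CPUE with probability \pi:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 240
rho, s1, s2 = 0.4, 0.3, 0.5

# Correlated monthly random effects (theta_1i for effort, theta_2i for CPUE)
cov = np.array([[s1**2, rho * s1 * s2],
                [rho * s1 * s2, s2**2]])
theta = rng.multivariate_normal(np.zeros(2), cov, size=n)

mu, eta = 6.0, -1.0          # illustrative fixed effects
sigma_E, sigma_C = 0.2, 0.4  # lognormal scales
pi = 0.1                     # hurdle: probability of a zero-CPUE month

# Lognormal effort: normal on the log scale, then exponentiate
effort = np.exp(rng.normal(mu + theta[:, 0], sigma_E))

# Hurdle lognormal CPUE: zero with probability pi, lognormal otherwise
nonzero = rng.random(n) >= pi
cpue = np.where(nonzero, np.exp(rng.normal(eta + theta[:, 1], sigma_C)), 0.0)
```

The key point is that the random effect for the lognormal part of CPUE can be shared with (or correlated to) the effort model; whether the hurdle probability \pi_i also gets its own correlated random effect is a separate modeling choice.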

The last question I have is then whether I can include a third model of the probability of retention/discard of any given fish caught. I presume that would likely be correlated with the other two data streams, so I could add a third random effect here and a more complex covariance structure. However, predicting the total number of fish kept and discarded is the ultimate goal here. So I guess what I’m asking is: if the covariance between the probability of retention/discard and both effort and CPUE is taken into account, would multiplying those posterior predictive distributions together to predict the total number of fish retained/discarded be appropriate? I worry that even with the covariance taken into account, this is still similar to the “cut” inference that @Bob_Carpenter mentioned in his reply.
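One way to think about the retention step is as a binomial thinning of the predicted catch rather than a separate distribution to be multiplied in: given N_i fish caught and a retention probability p_i from the logistic component, retained counts are binomial and discards are the remainder. A minimal sketch with hypothetical numbers (the catch counts and logit-scale values here are invented, not fitted):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 240

# Hypothetical posterior-predictive catch counts and logit-scale retention
caught = rng.poisson(150, size=n)
logit_p = rng.normal(0.5, 0.3, size=n)
p_retain = 1.0 / (1.0 + np.exp(-logit_p))  # inverse-logit

# Retained fish as a binomial thinning of total catch; discards are the rest
retained = rng.binomial(caught, p_retain)
discarded = caught - retained
```

Thinning within each draw keeps retained + discarded = caught exactly, which a product of independently summarized distributions would not guarantee.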

Once again, thank you both for the advice!