This is a question about how to create a composite prediction from two models in a way that accounts for uncertainty in both of the original models. The models involved are essentially independent, insofar as they derive from different data sources.
In this case, the outcome of my first model is daily time allocation (in hours) for individuals in a population, and my second model predicts daily production (kilocalories/day) for individuals in the population. Both models run smoothly and generate the desired posterior distributions with good convergence. What I would like to do now is combine predictions from those two models to calculate rates (kilocalories/hour) with a composite error estimate that accounts for uncertainty in both production per day and time allocation.
If I run both models using Stan, I end up with posterior distributions of A) production (kcal/day) and B) time spent foraging (hours/day). Could I then simply divide the posterior draws of A by those of B to obtain a third posterior distribution that reflects uncertainty in both A and B?
My reading of McElreath (2015) suggests that a more principled approach might involve simulating from the posterior distribution, but it is unclear to me how that should work in this case. I would also be curious whether a different approach would be required if the combining operation were something other than division (e.g., a sum or product). Thanks!
If the two sets of individuals are identical (you have samples of both time allocation and production from the same individuals), I think you need to combine the models, and the combined model will need to account for the correlation within individuals.
If there are any independent variables or factors in the models (e.g. larger vs. smaller individuals, sex, or age), then you will need to combine the models or results such that results for the same levels of the independent variables are matched together and divided. I would think you could do all of the new work in the generated quantities block. You'll need to consider how to generate a representative population (given the distribution of the independent variables in the populations) and generate a sample for each member of that representative population. One way this could be done is to generate a sample for each individual in both data sets (assuming that the data sets are themselves random samples of the population), or for a random subset of the data on each iteration if the data size is too large.
If they are truly independent, then you might not NEED to combine them, but it shouldn't hurt to combine them into a single Stan model. The combined model would just concatenate the two models (dataA and dataB in the data block, etc.), making sure the variable names are unique. Then in the generated quantities block you can define any computation you want on the posterior samples, and you will produce posterior estimates of that quantity.
(Assuming true independence.) If the individual models are producing posterior samples, there are a few approaches that should be sound, sketched in R just below:
1) If the two sets of draws have the same dimensions and are independent and randomly ordered, you can do any elementwise calculation on them directly. This is essentially the same computation as would be done in generated quantities, just performed outside of the sampling.
2) If the lengths differ, you could sample from or permute each set of draws to generate up to n*m posterior samples.
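A minimal R sketch of both approaches, assuming p_A and p_B are numeric vectors of posterior draws (kcal/day and hours/day respectively) and n_draws is an arbitrary size:
# Approach 1: equal-length, independently ordered draws -- combine elementwise,
# exactly the computation generated quantities would perform.
rate <- p_A / p_B  # kcal/hour
# Approach 2: unequal lengths -- resample each vector to a common size first.
n_draws <- 4000
rate <- sample(p_A, n_draws, replace = TRUE) / sample(p_B, n_draws, replace = TRUE)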
Some things I would avoid, even if the data are truly independent (a toy demonstration of point 4 follows this list):
3) Fitting parametric distributions to the posteriors and running a Monte Carlo simulation based on those fitted distributions. This will definitely introduce fitting error; if you want more samples of the posterior, generate them directly.
4) Doing any calculations based on the parameter posteriors or maximum likelihood values (e.g. the mean and sd if you fit the draws with a normal distribution). This is where the form of the distribution and the type of calculation need to be compatible. Even if the operation is defined for random variables from that distribution, in many cases the posterior is not actually the modeled distribution (because the parameters have distributions and are not constants). This path is VERY dangerous.
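A toy R demonstration of point 4, using made-up lognormal draws as stand-ins for posteriors; the mean of a ratio is not the ratio of the means:
set.seed(1)
a <- rlnorm(1e5, meanlog = 8, sdlog = 0.5)  # stand-in draws for kcal/day
b <- rlnorm(1e5, meanlog = 1, sdlog = 0.5)  # stand-in draws for hours/day
mean(a / b)        # posterior mean of the rate, computed from the draws
mean(a) / mean(b)  # ratio of the posterior means -- a noticeably different number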
Thank you @KariOlson for the thoughtful reply. This agrees with my intuition, and because the models are fitted to entirely different sets of individuals, I see no advantage to fitting the models together and performing calculations in the generated quantities block (there is no new information in the correlation structure).
This leaves me with your approaches 1 and 2. However, the numbers of samples from the two models are different, which leads to your sampling option 2. To make sure that I understand correctly, here is some concrete code:
Say I have modelA and modelB:
new_data <- data.frame(age = c(40))  # for simplicity, predictions at just 1 age
postA <- rethinking::extract.samples(modelA)
p_A <- linkA(new_data)  # linkA() is a custom function that generates predictions from draws in postA
postB <- rethinking::extract.samples(modelB)
p_B <- linkB(new_data)  # linkB() is a custom function that generates predictions from draws in postB
# Now I have two vectors of different lengths (different numbers of samples from the stanfits).
# To accommodate the different lengths, sample from the distributions?
sp_A <- sample(p_A, size = length(p_A) * length(p_B), replace = TRUE)
sp_B <- sample(p_B, size = length(p_A) * length(p_B), replace = TRUE)
# Then combine directly?
out <- sp_A / sp_B
Is this a proper understanding of your suggestion? My apologies if I have misunderstood; I just want to be perfectly clear.
As a small follow-up: it seems like this could also be a reasonable approach to propagating uncertainty when combining a posterior distribution with a summary estimate for which you don't have any information beyond a mean and standard deviation. For example, from above you could also replace sp_B with:
sp_B <- rnorm(length(p_A), mean = summary_mu, sd = summary_sd)
(although I understand that you're advocating against this approach if you have knowledge of the full distribution)
I think you understand how to accomplish suggestion 2 better than I do; R is not my strong suit. However, one thing concerns me: if you are going to generate "samples" for all the combinations of A and B, it will be more efficient just to do the cross-join and get all the combinations (see the sketch below). I would only sample if the desired number of draws were fewer than the maximum.
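In R the cross-join could look something like this (treat it as a sketch, given my R; note the result has length(p_A) * length(p_B) elements, so watch memory):
# All pairwise ratios p_A[i] / p_B[j], with no resampling noise.
out <- as.vector(outer(p_A, p_B, "/"))
# Equivalently, as an explicit cross-join:
grid <- expand.grid(A = p_A, B = p_B)
out <- grid$A / grid$B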
My gut says that these are NOT independent - that more foraging is likely correlated with more output. Your response adds to that feeling, in particular:
new_data <- data.frame(age = c(40))  # for simplicity, predictions at just 1 age
If you are predicting each of these values on the basis of some independent variables (features, factors, X's, or whatever you call them in this field), then the results are correlated, and care must be taken to define an appropriate population and generate samples for that population.
Concerning your additional follow-up: you are right, this is exactly what I was suggesting you not do in number 3. If you only had a mean and standard deviation, that might be appropriate, but only if a normal distribution is the correct distribution. Not every random variable with a mean and standard deviation is normally distributed; perhaps a different distribution is a better fit. If you assume normality where it doesn't apply, you will likely get incorrect predictions, particularly when you are multiplying or dividing rather than summing (for sums of independent variables the variances add, Var(A + B) = Var(A) + Var(B), so you might get the correct answer despite doing something stupid along the way).
Pardon me, but I do not quite understand your suggestion.
You are correct that the outcome variables from the 2 models are not independent within an individual, but those outcomes were measured on different subsets of the population, so I am unable to assess their correlation empirically; the same individuals were not measured on both variables.
So indeed I will be predicting both outcomes for individuals of the same age, and that is where I would like to combine them. In that case, am I wrong to sample from the posteriors of each model and then combine those samples directly? What do you have in mind when you suggest defining an appropriate population and generating samples for that population?
My apologies for not understanding.
If your only predictor is age and you are careful about sampling at each age, then you can get posterior samples of your computed variable for each age using the two models' samples for that age. If you want to know about the population distribution, you then need to combine these age-specific samples into a population-level sample based on the distribution of ages in your population. It is not appropriate simply to combine n samples from each age; the number of samples from each age should be proportional to that age's frequency in the population as a whole. A rough sketch follows.
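Something like this in R (again, treat as a sketch given my R). Here rate_draws() is a hypothetical helper that returns the combined kcal/hour draws for a given age (e.g., your linkA/linkB division above), and age_freq holds the population's age frequencies:
ages <- 20:70                     # assumed age range
age_freq <- rep(1, length(ages))  # replace with the real population age distribution
n_draws <- 4000
# Draw ages in proportion to their population frequency, then take one
# combined-rate draw per sampled age.
sampled_ages <- sample(ages, n_draws, replace = TRUE, prob = age_freq)
pop_rate <- vapply(sampled_ages, function(a) sample(rate_draws(a), 1), numeric(1))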
If you have additional predictors, you need to consider all of the predictors jointly, not one at a time.