Posterior predictive checks not matching distribution in multivariate model

I am still a novice, but I am trying to solve a particular problem that I was wondering if someone had advice about this.

I have a data set with three months of average daily elevation use in cattle (and other metrics like slope, distance traveled, distance to water, etc.) across two years. So each cow has ~90 repeated measures of avg daily elevation use within a year. I want to understand the consistency of use across the two years, so I have a multivariate mixed model with elevation year 1 and elevation year 2 as my response variables and things like day (spline), age, pregnancy status, temperature and social network centrality as my predictor variables (screenshots of models attached) - they are specific to only one year. I want to account for this information but also understand their effect on elevation use. All model checks (plotting the model, loo, Rhats, etc.) look great.

The issue is that my posterior draws are not matching up with my data when I check the distribution. My data is apparently multimodal (photos attached of year 1 and year 2 distributions against draws), and the posterior draws are averaging out the modality (photos attached of means). I tried to run a mixture model but ran into some serious issues. I have tried truncation, skew_normal, etc and they still do not match my distribution. My priors are weakly informative with mean 0, std dev 1 since my response variables are scaled. They are sampled from a normal distribution (but perhaps I should change this?) Mostly, I just want to know if the cows are consistent across the years given (or despite?) the fluctuations in their daily movement patterns.

I am using the brms package.

Any advice on how to get the posterior data to match better? How important is this at this stage of just trying to understand consistency? Is there a way to alter my priors and/or the distribution I am using to match the posteriors?

I am including the worst of the bunch - elevation data.



Howdy! Would you be able to repost the posterior predictive check plots as histograms using pp_check(model, resp="mean_elev_1", type="hist", binwidth=0.01) ? The density plots are difficult to read, but it does look like the data is truncated. I would suspect though, that there is some variable (that maybe you don’t have) that provides information that separates the two modes. For example, I am not sure what “daily elevation use” actually means, but if it is the total elevation climbed by cattle in there daily travels, then perhaps there is some factor like a particular water source location that forces some cattle to travel a different route than others and by necessity they have more elevation (just as hypothetical example). I’m not sure switching around through various model families will solve this kind of problem. For a mixture model, you might need pretty strong priors.

Hi!

Thanks for the reply! Here are the histograms:

I tried a truncation model and the posterior checks of the means and medians were so much worse, and the overlay didn’t look that much better.

The scenario you brought up is correct - they have a choice of water sources on the pasture and I believe the cows have preferences for these, that’s kind of what I am trying to get at! I am not really sure what to include in my model to account for this I guess…because I am trying to get at their consistency of preferences to travel high or low (for water, resources, etc.) across the two years. The averages are an average across their elevation that is extracted from their gps fixes (elevation every 10 min, grouped and averaged across day).

Sorry if this is more of a complex stats question than purely brms modeling.

Thanks. It looks like the response has been centered and scaled to 1 standard deviation, but I suspect elevation is a continuous variable > or = 0, is that correct? If so, I wouldn’t center and scale the response to 1 sd. I would use a model family that is more appropriate for data that is on the positive reals. For example, you might start with lognormal. That should help immensely with the apparent truncation problem.

The multimodality is more difficult, because it sounds like you need information in the model that separates the modes, but you don’t know (or have) the variables that would do that. I think this is where an obviously ill-fitting model is providing some good evidence that some piece of information is missing! I’m not sure that is really a stats question, but more of a science question. Is there some hard boundary like a fence? Ease of access like a trail? etc.

One thing I wonder is why you are using a multivariate model with an elevation response variable for each year, rather than using all the data for a single elevation response variable in a single model and including “year” as a fixed effect or varying intercept?

1 Like

Hi!

I will try the lognormal distribution, thank you for the suggestion! Can I do a truncation and lognormal model together? i.e. mean_elev_1|trunc (lb = 315) ~ [parameters] family = lognormal. Can I add an upper bound to this? I had some issues previously when I tried to do a lower and upper bound, but I was probably using a gaussian distribution.

I do think it is interesting that something explanatory is ‘missing’. There are fences, but not between elevations and yes there are trails, but it is odd that it is not averaging out, the pasture is on a large slope at varying elevations. I guess I am wondering if I can still trust that cows are being consistent across years even if the model is ill-fitting for modality (they are correlated across response variables according to the summary of the model).

The reason I am using a multivariate model is because I want to get at individual consistency across years. By using a multivariate, I’m not just getting the population level change across years, but if an individual is consistent in their elevation use (mean and variation around mean) across the two years. I can extract essentially ‘blups’ per individual across the two years. Like this example explaining behavior syndromes, but with years instead of different traits.

You should be able to do truncation, but I would try without truncation first.

I would not trust the current Gaussian model. I would be pretty suspicious of a model with such multimodality that isn’t accounted for. Does the data represent more than one herd that does not associate and has different behavior patterns? I don’t know much about cattle haha.

I’m not quite sure exactly what you are getting at with the multivariate model… the brms example has two completely different outcome variables. Why couldn’t you make year a factor variable and include it in a single model as a population-level effect and varying effect for each cow, like: elevation ~ 1 + .... + year_factor + (year_factor | cow_id)
That way, you could see both the population-level change between the two years and the change for each cow between the two years.
But I may not understand exactly what you are getting at.

Hi, thank you for your reply.

The data does not represent different herds, they are the same herd provided the same pasture. It is speculated in the literature that there are cows that climb more and cows that do not, but this is kind of what I am trying to get at in the model of elevation, so then I do not know how to account for it in the model itself. I can’t really categorize the cows as using more or less elevation and then put that in the model accounting for elevation, right?

I’ll try it out in a single model of elevation, I don’t think it is exactly what I am trying to get at, but if I am having trouble explaining/justifying, I should try this way too and see if it gives me similar information and a better model fit!

Thank you so much for all of your help, I appreciate it.

1 Like

That would defeat the purpose of the analysis, right? Seems like what you are trying to find is what variable separates them in terms of elevation. Also, if it was just the individual cows, then I would have thought the varying intercepts would have accounted for this, and you would have seen a clear separation in the values of the varying intercepts (as well as better pp checks).

It’s difficult to say much without seeing the data (or knowing much about cattle or the study), but when I look at the histogram of the actual data, what comes to mind is two distributions that are maybe pretty similar in shape but are offset by a certain amount. The peaks on the left seem to indicate that most cows don’t traverse much elevation (sort of the minimum necessary to get to the water), but that this minimum is different for different groups. Of course, I’m only guessing.

Thank you so much for your help, I’ll keep working on this, but the lognormal distribution significantly helped the pp_check. I am going to also try varying sigma per individual. Appreciate all your thoughts!!

1 Like

cool. You’re welcome. I’d try the model with a varying slope for year just to see, as well.

Absolutely, will do, like what you said here: elevation ~ 1 + … + year_factor + (year_factor | cow_id)

It may also be useful to fit a model with (0 + year_factor | cow_id).
In such a model, the random effects would be estimates of elevation in year 1 and year 2 for each cow, as well as the correlation between them.

I only have two years though, so this does not qualify year as an appropriate random effect or grouping factor for a varying slope model?

That’s fine. You need to have more than a few cows with elevation values in both years—but it’s not a problem that you have only 2 years.