I am wondering whether and how the distribution of my data affects the outcome of my brms models, and whether I would need to take action beyond scaling it.
For instance, imagine roughly 1000 measurements of body size in an invertebrate species collected from 15 different sites, with substantially varying sample size per site. Now I want to test how the concentration of three amino acids in the substrate collected at each site affects body size, so the concentration of each amino acid can only take 15 different values. So, if I wanted to do something like:
brm(length ~ his + trp + glu)
Does it matter that my independent variable is not as continuous as my dependent variable? I.e., one can take 1000 values, the other only 15, so this results in the following distributions (data already scaled to mean 0 and SD 1):
If I understand the structure of your data correctly, you have 15 values for concentration of each amino acid because you have one concentration of each amino acid for all specimens collected at each site. If I’ve got that right, I believe you can run the analysis the way you propose, but what you’ll find is that you are (essentially) fitting the regression to the mean body size at each site. The good thing, from my point of view, is that since you’re using brms to fit the regression, sites with larger sample sizes will naturally carry more weight than those with smaller sample sizes, because the mean will be estimated more securely with larger sample sizes.
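To make the "fitting to site means" point concrete, here is a minimal sketch using plain `lm()` rather than brms (an assumption made so it runs in a second). The sample sizes, concentrations, and coefficients are all invented for illustration. It shows that a regression on every individual gives the same coefficients as a regression on the 15 site means weighted by sample size.

```r
set.seed(1)

# Hypothetical site-level data: 15 sites with unequal sample sizes (sum = 1000)
n_site <- c(150, 120, 110, 100, 90, 80, 75, 70, 60, 50, 40, 30, 15, 7, 3)
site   <- rep(seq_along(n_site), times = n_site)
his    <- rnorm(15)
trp    <- rnorm(15)
glu    <- rnorm(15)

# Invented "true" site means, then individual body sizes scattered around them
mu <- 0.5 * his - 0.3 * trp + 0.2 * glu
d <- data.frame(
  site   = site,
  his    = his[site],
  trp    = trp[site],
  glu    = glu[site],
  length = rnorm(length(site), mean = mu[site], sd = 1)
)

# (a) regression on all individuals
fit_ind <- lm(length ~ his + trp + glu, data = d)

# (b) regression on the site means, weighted by the number of individuals
m   <- aggregate(length ~ site + his + trp + glu, data = d, FUN = mean)
m$n <- n_site[m$site]
fit_mean <- lm(length ~ his + trp + glu, data = m, weights = n)

# The two sets of coefficients agree
print(all.equal(coef(fit_ind), coef(fit_mean)))
```

The standard errors differ between the two fits because the individual-level fit also uses the within-site scatter to estimate the residual variance, but the point estimates are the same.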
you have 15 values for concentration of each amino acid because you have one concentration of each amino acid for all specimens collected at each site.
That is correct.
sites with larger sample sizes will naturally carry more weight than those with smaller sample sizes, because the mean will be estimated more securely with larger sample sizes.
I see, that makes sense. It also sort of answers another question I had, which is whether I would somehow have to penalize sites with very few individuals so they wouldn’t skew the analysis. I played around with excluding sites with very low sample sizes (N<20) from the analysis, which gives me quite dramatic differences: while the overall patterns are pretty much the same, excluding those samples increases ambiguity (more overlaps with zero) and the spread in the posterior. I somehow would have expected it to be the opposite…
I was already thinking of how to deal with this - or do I have to deal with it at all? If the weighting is done automatically, I would just discuss this in my paper and include both analyses.
If you have access to Gelman & Hill (Data analysis using regression and multilevel/hierarchical models), you might find it useful to review section 12.6 on group-level predictors. The county-level uranium measure is analogous to your site-level amino acid concentrations.
The effect of excluding sites with low sample sizes will depend on (a) how large a proportion of the data those sites are and (b) what the mean of those sites is relative to the sites with larger sizes. On (a), you are effectively fitting four parameters (an intercept and three slopes) to fifteen points (the site means). If you exclude more than a few sites, you have very little data left to estimate the parameters, and the estimates become more uncertain as a result. On (b), if the sites you exclude have means that lie outside the range observed in the sites you keep, including them ties down the regression more strongly, so dropping them makes the estimates less certain (unless there’s a different pattern of relationship in sites with many vs. few samples).
What you might want to do to get a better feel for all of this is to simulate some data sets with a structure similar to yours and run the regressions on the simulated data sets. That way you’d be able to see directly how the estimates you get line up with the values you know they ought to have.
I had a look - so there they even include county (site in my case) as a random effect. I hadn’t done that before because I thought meaningful variation would somehow be removed by the random term, but if I understand correctly, random variation would be estimated independently of the between-site variation in amino acids in the fixed term?
I think this is what I am looking for: Simulate from Existing Data. I’m not sure though which simulations you are suggesting: would you i) simulate a dataset with identical N within and across all sites but simulated data or ii) supplement the existing dataset, i.e., “filling up” low N sites? And would you do this only a few times to get a general idea, or were you thinking of a “bootstrap” approach - i.e., do this a few hundred or thousand times and get quantitative results?
These are my Ns - I have been dropping those last three to the left:
In your case, you only have one observation per site, so including a random site effect doesn’t make sense. The only variance is residual variance around the regression.
I’m suggesting something a bit different.
1. Set up a vector n where each element is the sample size for a given site and corresponding vectors for the amino acid concentrations at each site.
2. Pick values for the coefficient associated with each amino acid. (See below.)
3. Calculate the mean body size at each site from #1 and #2.
4. Generate pseudo-data from a normal distribution with a sample size from #1 and site mean set from #1 and #3, and a standard deviation roughly the same as what you see in your analysis.
5. Run a regression and record the results.
6. Repeat #4 and #5 100+ times and examine both (a) how close the posterior means of the regression coefficients are to the values set in #2 and (b) how well the credible intervals do at including the values in #2, e.g., do 80% of 80% credible intervals contain the values in #2.
This will give you a good sense of how well the statistical model does under ideal conditions when the estimation model matches the generating process. I presume it will do well, but this step will give you a good sense of what counts as “close”.
Now repeat the whole simulation, but drop the sites with a sample size less than 20. Unless the relationship between body size and amino acid concentrations is a lot different in those sites than it is in the remaining sites, the only difference you should see is that the credible intervals on the amino acid coefficients are larger. (I’m not sure offhand whether the residual variance will be larger, smaller, or unchanged.)
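The whole procedure, including the second pass without the small sites, could be sketched roughly like this. To keep it fast, the sketch uses `lm()` and 80% confidence intervals as a stand-in for `brm()` and 80% credible intervals (that substitution is my assumption, not part of the recipe above). The sample sizes, concentrations, coefficients, and residual SD are all made up, and `simulate_once()` and `run_sim()` are just illustrative helper names.

```r
set.seed(42)

# Step 1: sample size per site and site-level predictors (hypothetical values)
n_site <- c(150, 120, 110, 100, 90, 80, 75, 70, 60, 50, 40, 30, 15, 7, 3)
aa <- data.frame(his = rnorm(15), trp = rnorm(15), glu = rnorm(15))

# Step 2: coefficients chosen for the simulation (invented)
beta  <- c(intercept = 0, his = 0.5, trp = -0.3, glu = 0.2)
sigma <- 1  # residual SD; in practice, roughly what the real fit reports

# Step 3: mean body size at each site
mu <- beta[["intercept"]] +
  drop(as.matrix(aa) %*% beta[c("his", "trp", "glu")])

# Steps 4 and 5: generate one pseudo-data set, fit, record the results
simulate_once <- function(keep) {
  site <- rep(which(keep), times = n_site[keep])
  d    <- cbind(aa[site, ], length = rnorm(length(site), mu[site], sigma))
  fit  <- lm(length ~ his + trp + glu, data = d)  # stand-in for brm()
  ci   <- confint(fit, level = 0.80)
  data.frame(
    term     = names(beta)[-1],
    estimate = unname(coef(fit)[-1]),
    covered  = unname(ci[-1, 1] <= beta[-1] & beta[-1] <= ci[-1, 2]),
    width    = unname(ci[-1, 2] - ci[-1, 1])
  )
}

# Step 6: repeat and summarise (mean estimate, interval coverage, mean width)
run_sim <- function(keep, n_rep = 200) {
  res <- do.call(rbind, replicate(n_rep, simulate_once(keep), simplify = FALSE))
  aggregate(cbind(estimate, covered, width) ~ term, data = res, FUN = mean)
}

all_sites <- run_sim(keep = rep(TRUE, length(n_site)))
big_sites <- run_sim(keep = n_site >= 20)  # drop sites with fewer than 20

print(all_sites)
print(big_sites)
```

With brms, the `lm()` and `confint()` lines would be replaced by a model fit and its posterior summary. As far as I know, `update(fit, newdata = d)` refits an existing brmsfit without recompiling, and `fixef(fit, probs = c(0.1, 0.9))` gives 80% intervals, but check both against the brms documentation before relying on them.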
I need some clarifications - sorry for being slow:
first, rephrasing these steps to see if I got it right: I take the length-AA coefficients from my original brms model with all (including N<20) sites and use them and the actual AA concentrations to calculate body size at each site (through division).
second, there is the difficulty that I have scaled the data before the fitting. I’m a bit at a loss as to what to re-scale - the coefficients, or just length and AA concentration?
Sorry for the slow reply. I’ve been offline for a while. The first part of what you mention matches what I was thinking. I hadn’t thought about your scaling. That complicates things a bit, but not too much. If you scale the AA concentrations and use the estimated regression coefficients, you’ll get scaled body size at each site. That shouldn’t affect the simulations.
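If it helps, here is a small sketch of how the scaled and unscaled versions relate, using a single predictor and `lm()` for simplicity. All the numbers are invented, and it assumes both variables were standardised over the individual-level rows. It shows that scaled concentrations times the scaled coefficient give scaled site means, and that these can be converted back to original units if you ever need them.

```r
set.seed(7)

# Hypothetical raw data: one concentration per site, many individuals per site
n_site     <- c(150, 120, 110, 100, 90, 80, 75, 70, 60, 50, 40, 30, 15, 7, 3)
site       <- rep(seq_along(n_site), times = n_site)
his_raw    <- runif(15, min = 1, max = 10)[site]
length_raw <- 20 + 0.8 * his_raw + rnorm(length(site), sd = 2)

# Standardise both variables to mean 0 and SD 1
his_z    <- as.numeric(scale(his_raw))
length_z <- as.numeric(scale(length_raw))

fit_raw <- lm(length_raw ~ his_raw)
fit_z   <- lm(length_z ~ his_z)
b_raw   <- coef(fit_raw)[["his_raw"]]
b_z     <- coef(fit_z)[["his_z"]]

# The scaled slope is the raw slope times sd(x) / sd(y)
print(all.equal(b_z * sd(length_raw) / sd(his_raw), b_raw))

# Scaled concentrations times the scaled slope give scaled site means
# (the intercept is zero because both variables are centred)
mu_z <- b_z * his_z

# Back-transform to the original units
mu_raw <- mean(length_raw) + sd(length_raw) * mu_z
print(all.equal(mu_raw, unname(fitted(fit_raw))))
```

For the simulation itself, staying on the scaled scale throughout is enough, as long as the residual SD you plug in is also the one from the scaled fit.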