Variance, priors, and "maximum entropy" for a simple Phylogenetic/group-effect model


#1

Hello Stan Forum,

This is my first time posting, so I apologize for any errors, confusion, or poor formatting. I used brms in my analysis, so I am keeping the scripts in that format.

I am working with a simple binary multilevel model exploring the effect of a continuous variable (length) on the probability of two states (0:1) for a group of 385 species. State represents a discreet microhabitat, ground (0) or tree (1). I have included the covariance matrix derived from the phylogeny as a random effect/group-effect in the model. One thing that may be important to note is that of the 385 species, 52 are state “1”, and 333 are state “0”. These states are dispersed pretty evenly across the phylogeny.

I am having a hard time understanding priors and variance in my group-level effect (phylo). First, here is the model:

fit <- brm(
  state ~ length + (1|phylo), data = data1, 
  family = bernoulli(), cov_ranef = list(phylo = A),
  iter = 3000, 
  prior = c(
    set_prior("normal(0, 10)", class = "b", coef = "length"),
    set_prior("normal(0, 50)", class = "Intercept"),
    set_prior("XXXXXXXXXXX", "sd")
  ))

The “XXX…” symbolizes the various priors I have implemented, explained below. I noticed that running the model with a standard half-cauchy prior on “phylo” leads to a very large variance for the sd estimate, which in turn is mirrored in the error in my intercept and “length” estimates. This seemed a little unusual, so I started investigating. I read about exponential priors being used to solve problems of extreme variance in group-level effects (Statistical Rethinking, McElreath 2016, pg. 363) as they represent the maximum entropy prior for standard deviation, and prevent extreme values in the posterior. I ran a few models using exponential priors and indeed the variance was drastically minimized, and the results “seem reasonable” for my dataset. I ran a few more models with different “sd” priors, and compared using AIC. Here are the models and results:

List of Models: These represent the “XXXXXXXX” in the model above

fit1_E     exponential(0.1) 
fit2_E     exponential(1)
fit3_E     exponential(2)
fit1_S     prior(student_t(3, 0, 10) 
fit2_S     prior(student_t(3, 0, 4)
fit1_HC    cauchy(0,5)               
fit2_HC    cauchy(0,2)
fit3_HC    cauchy(0,1)

Results for each prior:

fit1_E          term   estimate std.error      lower     upper
1         b_Intercept -25.022634 18.378072 -69.313583  2.251715
2            b_length   3.527981  1.882504   1.005739  8.096916
3 sd_phylo__Intercept  30.780155 14.304652  10.953545 65.831346

fit2_E          term  estimate std.error       lower      upper
1         b_Intercept -9.552696 5.9436732 -23.7319120  0.4054055
2            b_length  1.279336 0.5420407   0.4991412  2.5891605
3 sd_phylo__Intercept 10.474395 3.5360108   5.3332635 18.7458643

fit3_E          term   estimate std.error      lower      upper
1         b_Intercept -6.9097195 3.9284101 -15.824364 -0.2717995
2            b_length  0.9090803 0.3363955   0.382051  1.7084082
3 sd_phylo__Intercept  7.0747783 1.9632313   3.989964 11.6119434

fit1_S          term   estimate std.error       lower      upper
1         b_Intercept -28.778864 23.325681 -86.2666068   2.385186
2            b_length   4.279439  2.948905   0.9648861  11.732086
3 sd_phylo__Intercept  38.792357 26.448354  10.0547181 109.472044

fit2_S          term   estimate std.error       lower     upper
1         b_Intercept -21.795929 19.801790 -74.7482721  2.610887
2            b_length   3.157116  2.431173   0.7307612 10.061481
3 sd_phylo__Intercept  27.953530 21.785459   7.4199031 89.639040

fit1_HC        term  estimate std.error       lower      upper
1         b_Intercept -41.21102 32.763435 -117.790693   9.265238
2            b_length   7.25270  4.893182    1.196928  19.708175
3 sd_phylo__Intercept  74.85902 60.718119   12.131478 232.738342

fit2_HC        term   estimate std.error      lower      upper
1         b_Intercept -41.299566 32.745079 -119.54397   6.357867
2            b_length   6.981974  4.713753    1.08501  18.763991
3 sd_phylo__Intercept  72.586361 59.047678   11.16131 228.011888

fit3_HC        term   estimate std.error       lower      upper
1         b_Intercept -42.285867 32.968770 -120.137500   6.356373
2            b_length   7.345756  4.882662    1.246481  19.664864
3 sd_phylo__Intercept  76.800876 63.936976   12.765590 236.180079

It is pretty obvious looking at these results that the error and variance is all over the place, and especially large (though consistent) with the cauchy prior. You can also see that the variance was greatly reduced using the exponential priors

It is important to note that every run converged very nicely, has an ESS > 1000, and rhat = 1.00

I performed an WAIC and was a little surprised by the results:

                     WAIC    SE
fit1_E              45.72  4.15
fit2_E             101.12  9.62
fit3_E             127.45 12.23
fit4_E             151.88 14.39
fit1_S              42.84  3.97
fit2_S              58.28  5.41
fit1_HC             27.72  2.45
fit2_HC             29.08  2.57
fit3_HC             26.71  2.35

It seems the models with cauchy priors have a drastically lower WAIC score than the student-t or exponential runs.

So here are the questions:

  • Is it safe to always follow the WAIC test to select a best model of reality, regardless of the extreme variance generated by the model(s) of choice?

  • Is the data itself, strongly weighted to score 0 over 1 (333 to 52) a possible cause of this variance and discrepancy in results? Is there a better distribution other than bernoulli() to model this kind of weighted data?

  • I understand that exponential() represents a vague “maximum entropy” prior here, which should limit the range of variance by excluding extreme results. Did I interpret this correctly?

  • Ultimately all of the models agree that length has a strong positive effect on score. Is it wrong/illadvised of me to fit the stronger prior in this case to minimize the large variance?

I am not including any data as I believe the questions do not require it, but if someone thinks it would help then I’d be happy to include it.

I am really looking forward to hearing the forum’s thoughts. Thanks for taking the time to read this!

Jon


#2

No. None of the information criteria have to do with reality, only which model is expected to predict future data best. And WAIC is not as good as LOOIC at diagnosing when the assumptions underlying it do not hold.

Bernoulli is pretty much the only distribution for data that only take the values 0 and 1. But you can and should investigate different link functions besides the logistic one.

Maximum entropy given that the only thing you know about a positive random variable is its expectation. It sounds as if you know more than that.

Presumably.