Convergence issues for random effects sd/Intercept in a dyadic (social relations) model

  • Operating System: MacOS 10.14.5 Mojave
  • brms Version: 2.9.0

I’m trying to fit a symmetric dyadic model (akin to Kenny et al. (1979)'s social relations model) in brms. It has the form y_ij ~ b * X_ij + a_i + a_j, where i and j are two different individuals, y_ij is the similarity between them on some dimension, and X_ij is their similarity on some other dimension(s). The specification in brms I’m using is:

brm(similarity ~ x_similarity + (1 | mm(ego, alter)))

ego and alter are the identities of the two individuals, but which one is ego and which one is alter is random, hence the mm (multi-membership) term. (If it’s clearer, the model is essentially the same one described here: Distance matrix regression)
Sampling is fine for all of the parameters except for the Intercept and the random effects sd. With 4 chains of 500 samples each (300 warmup), I can’t get the ESS for the Intercept and RE sd parameters above 10 (or the Rhats anywhere near 1.01).

The outcome variable is a cosine similarity between two texts, so it varies from 0 to 1. I expect the betas to be quite small, so I am currently using the following priors:

mod_prior <- c(
  prior(student_t(3, 0, .05), class = b),
  prior(student_t(3, 0, .005), class = sd),
  prior(student_t(3, 0, .5), class = sigma),
  prior(student_t(3, 0, .5), class = Intercept)
)

However, I have also previously tried setting the scale of the prior on sd to .5 or .05 without much change in convergence. I set it so low because, regardless of the prior, the sampling behavior for the sd parameter is really bizarre. It simply keeps drifting towards 0, i.e.
sd_hyperprior.pdf (14.5 KB)
Even with the super strong prior, it doesn’t seem to have reached anything like the typical set by the end of the 500 samples. Does this mean I just need a longer warmup? I should mention that I am running this with inits = 0 although I guess by the time sampling starts, the sd seems to have drifted significantly away from that.

It is somewhat surprising to me that these random effects are so hard to fit, as each individual appears in the dataset as either ego or alter at least 130 times (and many around 2000 times, since there are 2300 individuals in the dataset and pairwise similarity is available for many of the pairs). Because it is such a large dataset (~2 million rows), the model takes quite a while to fit, so I can’t just take 10x the number of samples to get the ESS where it needs to be. Are there any other obvious changes I could make to improve convergence for these random effects? This seems to be a pretty straightforward model and the population-level effects are well estimated. The only other potentially strange thing is that the similarity is constrained to be between 0 and 1 but I am using a Gaussian model. I normally wouldn’t worry about this kind of thing but there is some bunching near 0, i.e.
similarity_hist_raw.pdf (4.7 KB)
Would it be worth trying to put this in the model (use a truncated Gaussian instead or something)?

What response distribution are you currently assuming for your similarity variable? Simply Gaussian or something else? For a score between 0 and 1, you may consider using the Beta family.

Thanks! I had been assuming a Gaussian but perhaps the Beta would work better. The similarity function is just cosine similarity between two lists of words. No word is included multiple times, so it reduces to the number of words that occur in both lists divided by the geometric mean of the numbers of words in the two lists. The assumption is that the probability of two people including the same word on their lists depends on how similar the two people are in other aspects.
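To make that reduction concrete, here is a minimal base R sketch. The function names and the example word lists are made up for illustration and are not from the original analysis. It computes the similarity once from the set counts and once from binary indicator vectors, and the two agree.

```r
# Cosine similarity between two word lists with no repeated words:
# shared words divided by the geometric mean of the two list lengths.
cosine_sets <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / sqrt(length(a) * length(b))
}

# The same quantity from binary indicator vectors over the joint vocabulary.
cosine_binary <- function(a, b) {
  vocab <- union(a, b)
  va <- as.numeric(vocab %in% a)
  vb <- as.numeric(vocab %in% b)
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# Hypothetical word lists, for illustration only
ego_words   <- c("stan", "prior", "model", "chain")
alter_words <- c("model", "chain", "beta")

cosine_sets(ego_words, alter_words)    # 2 shared words / sqrt(4 * 3)
cosine_binary(ego_words, alter_words)  # same value
```

Because the numerator is a count of shared words, the outcome is exactly 0 whenever two lists share no words, which may be what produces the bunching near 0.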

I am realizing as I write this that I should make the phi of the Beta/sigma of the Gaussian be dependent on that geometric mean as well (which is easy in brms!).
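A minimal sketch of what that distributional formula might look like in brms. The column name log_geo_mean (the log of the geometric mean list length) and the data frame name dyads are assumptions, not names from the original data. Note that the Beta family needs outcomes strictly between 0 and 1, so exact zeros would have to be handled separately, for example with the zero_inflated_beta family.

```r
library(brms)

# Beta version: the precision phi depends on the (hypothetical) predictor
# log_geo_mean = log(sqrt(n_words_ego * n_words_alter)).
dyad_formula_beta <- bf(
  similarity ~ x_similarity + (1 | mm(ego, alter)),
  phi ~ log_geo_mean
)

# Gaussian version: the residual sd sigma depends on the same predictor.
dyad_formula_gauss <- bf(
  similarity ~ x_similarity + (1 | mm(ego, alter)),
  sigma ~ log_geo_mean
)

# Fitting is left commented out because `dyads` is a hypothetical data frame:
# fit_beta  <- brm(dyad_formula_beta,  data = dyads, family = Beta())
# fit_gauss <- brm(dyad_formula_gauss, data = dyads, family = gaussian())
```

Both phi and sigma use a log link by default, so the coefficient on log_geo_mean is on the log scale.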

There may also be a better way to model this. I’m not quite smart enough to figure it out, but maybe someone on here knows how to do it. It seems like there should be a way to model the probability of a given number of common elements in two samples drawn from two different distributions as some function of the similarity (mutual information?) between those distributions. We’d ideally like to learn what predicts the similarity of these distributions, but that similarity is latent, as we only observe a single sample from each distribution. Maybe some sort of chi-square distribution makes sense?

This sounds as if there is some research to be done to come up with a proper model for this kind of data. If no such model has been developed in the literature yet, this will likely be its own research project and will not be solved within a Discourse thread. But I haven’t thought about this in much detail yet, so it may be much simpler than I think.