The 'right' way to convert a specific group into a population-level parameter

I’ve got a logistic regression with population-level effects and group-level effects (N = 2217). Group A is very large (n = 1227), while the other groups are small, with sample sizes ranging from 1 to 46.

What would be the soundest way to model Group A as a fixed effect while continuing to model the other groups as group-level effects? Simply adding a dummy variable for Group A to the population-level part seems suboptimal, because the model would then have two perfectly correlated parameters (the fixed and the random effect) both referring to the same group.

I already tried creating an alternate version of the grouping factor in which Group A’s value has been set to NA. But this leads brms to discard those rows of data altogether.

An alternative approach that comes to mind is creating a version of the grouping factor in which Group A has been merged with a bunch of “uninteresting groups” (I picked all the ones with n = 1, of which there are 180). This certainly allows the model to be fit, but the new heterogeneous group still largely reflects the patterns of Group A, thus taking credit away from its fixed effect.

A better but highly labor-intensive approach would be to merge Group A with other, hand-selected groups whose random effects collectively cancel out that of Group A, thus removing all overlap between Group A’s new fixed effect and the random-effect category into which it has been subsumed. But this would be an enormous hassle, requiring a large number of fits and refits that would take a lot of time.

Is there a simpler, more elegant solution?

Michael Betancourt has a case study which covers this situation - Hierarchical Modeling

and a followup case study: Impact Factor

I plowed through most of that page but unfortunately failed to understand enough of it, at least on a first pass. The level of mathematical abstraction on Betancourt’s GitHub page seems to be rather high.

Is there a gentler introduction to the concept of non-centered parameterizations somewhere?

One point I did manage to grasp, at least approximately, is that a non-centered parameterization might be needed to avoid divergent transitions. But my models do not suffer from divergent transitions – at least with adapt_delta at 0.96.
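
(For reference, adapt_delta is passed to brms like this – a minimal sketch with a placeholder formula and data:)

library(brms)

fit <- brm(
  y ~ x + (1 | group),               # placeholder formula and data
  family = bernoulli(),              # logistic regression
  data = dat,
  control = list(adapt_delta = 0.96) # tighten step-size adaptation
)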

Here’s some additional detail:
Group A is very large and the others are small. In addition, Group A correlates strongly (at 0.8) with the binary fixed effect X, while the Intercept variability among the small groups is very minor. As a consequence of this small overall variability, the model in which Group A is a random effect is too sceptical of that effect, shrinking it too much and misattributing most of it to X instead. Two approaches get around this problem: one is merging Group A with a bunch of other groups and then giving it a dedicated binary indicator (fixed effect); the other is dropping the grouping factor altogether (yielding a non-mixed model) and using just the binary indicator. The first approach yields a virtually identical elpd_loo (diff = 0.6, SE = 1.8), while the simpler one yields a slightly improved one (diff = 4, SE = 2.5). But I have to discuss both approaches, as well as justify the way the random-effect categories are merged under the first approach – even though that model will ultimately not be selected.
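
(For context, the elpd_loo comparisons above come from something like the following – the model names are placeholders:)

loo_merged    <- loo(fit_merged)          # Group A merged + binary indicator
loo_indicator <- loo(fit_indicator_only)  # non-mixed model, indicator only
loo_compare(loo_merged, loo_indicator)    # reports elpd_diff and se_diff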

I’ll keep ruminating on this problem and will eventually opt for an approach, and I’ll let the forum know what I decide. Meanwhile, any further clarifying input would be appreciated.

I agree that the presentation is oblique. Furthermore, I should have said more the first time around: the specific idea in that discussion is presented in section 4.2.3, “Unbalanced Likelihood Functions”.

The idea is this:

We can then model the two well-informed contexts with centered parameterizations and the two sparsely-informed contexts with non-centered parameterizations.

This is done by choosing a threshold for the number of observations per category and then assigning each category to either a centered or a non-centered parameterization. It requires analyzing your data beforehand in order to set up the proper set of indices: in addition to an N-length array indicating which group each observation belongs to, there are two sets of indices picking out which categories are centered and which are non-centered. With this up-front work, Stan’s index expressions let you write the model pretty cleanly:

data {
  int<lower=0> N;      // Number of observations
  vector[N] y;         // Observations
  real<lower=0> sigma; // Measurement variability

  // Number of individual contexts in hierarchy
  int<lower=0> K;
  // Individual context from which each observation is generated
  array[N] int<lower=1, upper=K> indiv_idx;

  int<lower=0, upper=K> K_ncp;                // Number of non-centered individuals
  array[K_ncp] int<lower=1, upper=K> ncp_idx; // Indices of non-centered individuals

  int<lower=0, upper=K> K_cp;                 // Number of centered individuals
  array[K_cp] int<lower=1, upper=K> cp_idx;   // Indices of centered individuals
}

parameters {
  real mu;                  // Population location
  real<lower=0> tau;        // Population scale
  vector[K_ncp] theta_ncp;  // Non-centered individual parameters
  vector[K_cp]  theta_cp;   // Centered individual parameters
}

transformed parameters {
  // Recentered individual parameters
  vector[K] theta;
  theta[ncp_idx] = mu + tau * theta_ncp;
  theta[cp_idx] = theta_cp;
}

model {
  mu ~ normal(0, 5);          // Prior model
  tau ~ normal(0, 5);         // Prior model
  
  theta_ncp ~ normal(0, 1);   // Non-centered hierarchical model
  theta_cp ~ normal(mu, tau); // Centered hierarchical model
  
  y ~ normal(theta[indiv_idx], sigma); // Observational model
}
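
To make the up-front indexing concrete, here is a hypothetical R sketch (untested, like the model above; dat, group, and the threshold of 10 are placeholder assumptions) that builds the index arrays from a data frame:

counts  <- table(dat$group)          # observations per category
cp_idx  <- which(counts >= 10)       # well-informed categories -> centered
ncp_idx <- which(counts < 10)        # sparsely-informed categories -> non-centered

stan_data <- list(
  N         = nrow(dat),
  y         = dat$y,
  sigma     = 1,                     # assumed known in this toy model
  K         = nlevels(dat$group),
  indiv_idx = as.integer(dat$group), # N-length array of category memberships
  K_ncp     = length(ncp_idx),
  ncp_idx   = as.array(unname(ncp_idx)),
  K_cp      = length(cp_idx),
  cp_idx    = as.array(unname(cp_idx))
)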

To me this seems pretty elegant. Disclaimer: I haven’t tried this, and I question how computationally expensive it might be.

The factor model case study introduces the idea of making one category the baseline - i.e., the global intercept - where each category gets its own offset from the global intercept, with category A having offset 0 - see section 1.4 https://betanalpha.github.io/assets/case_studies/factor_modeling.html#14_Residual_Hierarchical_Models

Addendum: perhaps I didn’t understand what you want to do because I’ve drunk too much hierarchical-modeling kool-aid. You want partial pooling - what you know about category A should inform what you know about categories X, Y, and Z, of which you’ve seen only one observation each.

You have a bunch of group specific parameters. If you desire, you can give the parameter for Group A its own (non-hierarchical) prior and give the rest of the groups the usual hierarchical prior.

A bigger question is why you want to do this in the first place. If it’s just computational issues due to the imbalance in group sizes, you can avoid these by non-centering some groups and centering others, as @mitzimorris shows. If you really want to set up a fixed effect for just one group, you need a justification for why none of the other groups should be pooled towards that group. If you think the pooling is too aggressive, you could always use a longer-tailed distribution than the Gaussian as the random-effect distribution, as sketched below.
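
For example (a sketch with placeholder names – if I remember correctly, brms exposes this through the dist argument of gr()):

fit <- brm(
  y ~ x + (1 | gr(group, dist = "student")), # t-distributed random intercepts
  family = bernoulli(),
  data = dat
)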


It does sound elegant to my limited brain, at a conceptual level. Those ‘funnel degeneracies’ sound a lot like what frequentists call ‘singularity’ in the context of mixed models – when there’s insufficient data to estimate variation between the groups, the sigma-hat gets pulled to zero with a warning of ‘singular fit’. The resulting model is then equivalent to an ordinary non-mixed model (assuming the absence of other group-level terms). Differential treatment of large and small groups does sound like a potential workaround. I am using brms though, so the Stan code you posted is not directly applicable.

Clarification: I have a group-level term with 331 groups, of which Group A is the largest with 1227 observations. The others are much smaller with 1 to 46 observations. What I referred to earlier as X is not a group but, rather, one of the fixed (population-level) effects, with which Group A happens to be strongly correlated. When Group A’s estimate gets shrunk near zero by the aggressive pooling, X gets most of the credit that belongs to Group A.

I do want partial pooling for groups other than A. It’s as if all the groups are individuals, and Group A is Brad Pitt whereas the others are random people from the street. Group A has so many observations that there’s no need to shrink its estimates toward the group mean. Under more typical circumstances, such shrinkage would presumably be minimal anyway because of this group’s large n. But in this case the variation among the smaller groups is so minimal (singular?) that sigma-hat becomes small and the model over-shrinks Group A’s estimates as well, in addition to misattributing its group effect to population-level effect X.

This is new to me. The grouping factor has an exponential(2) prior on its SD, but I have hitherto never set priors for individual groups within a grouping variable. When I call prior_summary(model) I don’t see any group-specific priors. I only see global priors for betas and intercepts (which equal normal(0, 2.5)), as well as the global prior for random-effect SDs. How would I set a diffuse prior for an individual group within a grouping factor? Something like prior(normal(0, 10), class = r, coef = "GroupA")?

The group is unique in a number of ways, as well as far larger than the others. Domain knowledge suggests there’s no need to shrink it toward the smaller groups, or vice versa. Yet all the groups are indeed categories of the same factor, so treating one of them as a fixed effect leads to the difficulties prompting the existence of this thread.


I didn’t notice the brms tag; apologies! To shoehorn this model into brms, you can specify it this way:

First, use the 0 + Intercept syntax to make sure we know what we’re talking about when dealing with the intercept. Second, create a binary indicator data variable that tells you whether you are in group A or any other group. Crucially, code membership in group A as 0 and membership in another group as 1. Additionally, retain your factor variable giving the group membership. Finally, use the formula ~ 0 + Intercept + binary_indicator + (0 + binary_indicator | group). Now the intercept corresponds to group A. Every other group gets an offset corresponding to the coefficient placed on binary_indicator. And all groups get a random slope for binary_indicator, which amounts to a random intercept for all groups other than group A, and a zero for group A.
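
In code, the model might be specified along these lines (a sketch only; y, dat, and group are placeholder names):

library(brms)

# 0 for group A, 1 for every other group
dat$not_A <- as.integer(dat$group != "A")

fit <- brm(
  y ~ 0 + Intercept + not_A + (0 + not_A | group),
  family = bernoulli(),
  data = dat
)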


Note that this model creates an extra parameter for the random-effect slope on group A. This parameter makes absolutely no likelihood contribution, so it will simply be sampled from its hierarchical prior and can be totally ignored.


The feedback has helped me realize that my question was due to an inadequate understanding of hierarchical models, abetted by a somewhat ambiguous wording in the literature. In their 3rd edition of Applied Logistic Regression, Hosmer, Lemeshow and Sturdivant (2013: 372) suggest to

which I mistook to mean you had to explicitly remove the influential group from the grouping variable, somehow. I failed to consider that one might simply add the new binary indicator on top of the existing parameters: the idea of creating a new parameter that was perfectly correlated with an existing one was unpalatable to me. The crucial distinction I missed was that such correlation is not dangerous when it occurs between a fixed and a random effect, because the random effect will then just

and this turns out to be exactly what Hosmer et al. meant by “excluding the cluster from the random portion”. It really is that simple: I can just add a binary Group A indicator to the population-level part and be done with it.

@jsocolar’s reparameterization is an alternative way of accomplishing the same thing. The two models will produce the same predictions for any given datum. In my case, the reparameterization is interpretively less convenient because it uses Group A as the baseline, whereas I prefer to have an explicit parameter for this group. However, for reasons I don’t understand, the reparameterization seems to be computationally better: as of this writing, I have fit both models with 7 different seeds, and divergent transitions have been so much more common without the reparameterization that I don’t think it’s a coincidence.

Anyway, thanks for all the help!


These parameterizations are different in one potentially important respect. The one I provide above puts all of the difference between group A and the rest-of-group mean into the indicator and samples group A’s random-effect offset from the prior, yielding a posterior with no correlation between the indicator and the group A random effect. The “standard” formulation splits the difference between A and the other groups across the indicator and the random effect, and tends to yield a negative correlation between the indicator and the group A random effect.


The “split” is not reflected in the posterior means, or indeed in any quantity of interest produced by the two models. In both cases, Group A’s random effect gets a posterior mean of 0, with the binary indicator duly getting all the credit. But you are right: the respective posteriors of the fixed and the random Group A effect are negatively correlated in the standard approach and uncorrelated in yours (I checked). I’m guessing this “posterior dependency” is the reason I see more divergences when fitting the standard version.
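
(The check was along these lines – a sketch, with parameter names that depend on the exact model specification:)

library(posterior)

draws <- as_draws_df(fit_standard)  # fit under the standard parameterization
cor(draws$b_groupA_ind, draws$`r_group[A,Intercept]`)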

A follow-up question has arisen. Please go ahead and move my post into a new thread if appropriate, but I felt like this would be the best place to obtain an answer, given that the topics are intimately related.

Whether I use @jsocolar’s parameterization or the standard one, Group A’s “random” effect will equal zero because the binary indicator gets all the credit for the Group A effect. But what, then, is the correct interpretation of \hat{\sigma}[Group]?

My initial assumption was that it now quantifies between-group variability within the set of groups not including A. This would be convenient because that variability is a quantity of interest. But then it occurred to me that my assumption may be wrong after all. Even though Group A’s random effect is sampled from the prior and equals zero, doesn’t its “official” inclusion among the groups nevertheless result in shrinkage for both \hat{\sigma}[Group] and the estimated random effects of the non-A groups? Group A is >50% of the dataset, so the shrinkage would be considerable. Unfortunately, this is what I now suspect is the case.

And does it then follow that if I want \hat{\sigma}[Group] to quantify overall inter-group variability only across the other groups, the only thing I can do is refit the model to a subset of the data from which Group A has been completely excluded?

This isn’t precisely true. The mean of group A’s effect will be zero; however, iteration-wise, group A’s effect will not be zero. In the “standard” parameterization, this means that, iteration-wise, group A’s random effect will get some of the credit for the overall group A effect, and it will change the shape of the posterior margin for the indicator (but not the mean of that margin). In “my” parameterization, the indicator will consistently get all the credit, and the value of the group A effect has absolutely no influence on the likelihood.

This is correct, up to a minor caveat (see below).

No, up to the same minor caveat (see below), because group A’s effect makes no likelihood contribution. It contributes to the target only via its prior, which is precisely the random-effect distribution. Put in the clearest possible terms: if I have a normal distribution with mean zero and some prior on the variance, then the realized prior on the variance does not care whether it’s a 1-dimensional or a 100-dimensional normal distribution. I can add levels willy-nilly, and as long as these levels make no likelihood contribution, the fitted variance continues to recapitulate its prior.

This doesn’t matter at all, because there is no likelihood contribution from the fitted value for group A.

The minor caveat:
In the “standard parameterization” this is all true only insofar as the prior for the binary indicator is weak enough to absorb all of the variation in the binary effect induced by variability in the group-A random effect with negligible impacts to the target. If the prior exerts a meaningful influence on the binary indicator, then in the standard parameterization the group effect can start to become meaningful since it effectively relaxes the prior on the sum of the indicator effect and the group effect, which is what matters to the likelihood. In “my” parameterization this doesn’t happen and is a non-issue.
