Sum to zero for BYM2 unstructured effect

I was going through the great case study by @mitzimorris on how to use the sum-to-zero vector type in a BYM2 model (here: The Sum-to-Zero Constraint in Stan).

I noticed that in the BYM2 model as specified in the notes, the unstructured effect is left with the classic unconstrained, unscaled \theta \sim N(0,1) specification.

Is there any reason we would prefer that classic parametrisation to one with an explicit sum_to_zero vector, including the corresponding scaling factor in the BYM2 convolution, as we do for the spatial component?

i.e. currently we have:

\eta = \frac{\phi}{\sqrt{s}} \sqrt{\rho} + \theta \sqrt{1-\rho}

where phi is the usual ICAR effect, and theta is declared as an unconstrained vector with standard normal prior:

 vector[n] theta;

– what if instead we define the convolution as:
\eta = \frac{\phi}{\sqrt{s}} \sqrt{\rho} + \theta \sqrt{\frac{n}{n-1}} \sqrt{1-\rho}

and declare \theta as a sum-to-zero vector:

 sum_to_zero_vector[n] theta;
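
For concreteness, here is a minimal sketch of what I have in mind (assuming n and the scaling factor s are declared in data, and leaving out the ICAR prior on phi and the overall scale sigma):

 parameters {
   vector[n] phi;                   // spatially structured (ICAR) effect
   sum_to_zero_vector[n] theta;     // unstructured effect, hard sum-to-zero
   real<lower=0, upper=1> rho;      // share of variance that is spatial
 }
 transformed parameters {
   // sqrt(n / (n - 1.0)) restores unit marginal variance for the constrained
   // theta, analogously to the 1 / sqrt(s) scaling for phi
   vector[n] eta = sqrt(rho) * phi / sqrt(s)
                   + sqrt(n / (n - 1.0)) * sqrt(1 - rho) * theta;
 }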

It would seem preferable to be consistent about whether we use hard or soft sum to zero throughout the model, rather than selectively for some effects and not others, unless there is a very good reason to be inconsistent.

theta will tend to sum to zero by construction - it is a vector of standard normals. phi, OTOH, is the ICAR component, which needs to be constrained: the sum-to-zero constraint is used when the model is otherwise non-identifiable.

Thanks @mitzimorris – my understanding is that \theta \times \sigma is basically a random effect specified via a non-centred parametrisation, and as such it’s “soft” centred (correct me if I’m wrong). Based on my reading of this thread, I had understood that it was preferable, for posterior-sampling efficiency, to specify traditional random intercepts with the sum_to_zero formulation where possible. Did I get that wrong? Thanks again.

yes the convolution of phi and theta is a random effect but it’s not a non-centered parameterization - cf. Stan User’s Guide. Unlike the discussion in the thread that you cite, this isn’t a nested regression - it’s a standard Poisson regression with an added spatial component - multi-level, non-nested.

The random effects vector theta will tend to sum to zero because each element of theta is given a normal(0, 1) prior. I think you’re correct that it doesn’t necessarily sum to zero, but (a) the BYM2 model of Riebler et al. 2016 doesn’t include a sum-to-zero constraint on theta, and (b) in my fairly limited experience, this hasn’t been a problem.

I tried out the simple sum_to_zero_vector on \theta (including the singletons in the constraint, just by a simple edit of your model here), and the st0 model gives basically the same estimates but has very slightly worse ESS per second.

Some ideas below – I’ve been thinking about this a bit (maybe wrongly).

I think it is odd that in the multi-component BYM2 we apply the per-component sum-to-zero to the spatial effects but not to the unstructured effects. I understand the former is necessary for identifiability, but it does change the qualitative interpretation of the parameters in a way that, to my mind, is not fully reconciled with the single-component case.

The motivation for the sum-to-0 on \theta, for me, would be to recover the same interpretation of the parameters as in the single-component case. In the single-component case I can say something like:

\theta^\star = \theta \times \sqrt{1-\rho} \times \sigma is the unstructured area-level departure from the global (component) mean;
\phi^\star = \frac{\phi}{\sqrt{s}} \times \sqrt{\rho} \times \sigma is the spatially structured area-level departure from the global (component) mean.

In the multi-component BYM2, I retain the interpretation around \phi – I impose a sum-to-0 constraint by component, such that each \phi is interpretable as the unit’s spatially-induced departure from its component-level mean.

But the interpretation of \theta becomes strange – it shares the variance across sub-national units over the full graph, so there is a soft centring effect – but this centring is global, not component-wise, and hence the original “unstructured deviations from the component mean” interpretation is lost. It also almost never actually sums to 0 in practice, in my experience.

Here I initially thought the easy fix would be to apply the per-component sum-to-zero to \theta as well, but that would basically remove the component-level effect from the model – because you can think of decomposing \theta_{a[j]} = \alpha_j + \delta_{a[j]}, and if I force \sum_{a \in C_j} \theta_a = 0, then that’s the same as setting \alpha_j = 0.
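
Spelling out the intermediate step (with n_j denoting the number of areas in component C_j – my notation, not from the case study):

\sum_{a \in C_j} \theta_a = \sum_{a \in C_j} (\alpha_j + \delta_a) = n_j \alpha_j + \sum_{a \in C_j} \delta_a

and since the \delta's are the within-component deviations (mean zero within the component, or with their mean absorbed into \alpha_j), forcing the left-hand side to 0 forces \alpha_j to 0.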

So then, if you force \theta to st0, you have to add \alpha_j back in as a component-wise random effect – but then the question becomes: do I do that inside the convolution or outside?

It is actually somewhat strange that the component-level effect \alpha_j is implicitly included in the \theta of BYM2, because the whole point of that model is that we are trying to identify effects (structured and unstructured) which compete to explain the share of the variance. But the cross-component variance, which is implicitly rolled into the variance of \theta, is by definition not explainable by a spatial effect, because the components are disconnected. So this looks like it is, in some sense, artificially inflating the share of unstructured variance relative to structured.

It would then make sense to keep \alpha_j outside the convolution. This changes the interpretation of \rho from the % of area-level variance that is spatial to the % of within-component variance that is spatial, but it should in principle work, I think.
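
Roughly, the structure I have in mind looks something like the sketch below (the names comp, J and tau are mine; comp[a] maps area a to its component and would sit in data. The per-component sum-to-zero on \theta and the corresponding variance correction would need extra machinery, so only the global constraint and the component intercept outside the convolution are shown):

 parameters {
   vector[J] alpha;                 // component-level intercepts, outside the convolution
   real<lower=0> tau;               // scale of the component effect
   vector[n] phi;                   // ICAR effect, sum-to-zero within each component
   sum_to_zero_vector[n] theta;     // unstructured effect
   real<lower=0, upper=1> rho;
   real<lower=0> sigma;
 }
 model {
   vector[n] eta = alpha[comp]
                   + sigma * (sqrt(rho) * phi / sqrt(s)
                              + sqrt(1 - rho) * theta);
   alpha ~ normal(0, tau);
   // ... ICAR prior on phi, priors on rho, sigma, tau, and the likelihood built on eta ...
 }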

I have tried a version of the model I describe, but it doesn’t work very well: it is very slow and mixes poorly. I guess I got carried away with the “sum_to_zero just works” idea, but I’m still not clear on why we would use it for a random intercept in a multilevel model and not in this case.



I think you’ve thought through the issues here quite thoroughly, and there is probably little to add.

A multi-component ICAR model fundamentally assumes that the between-component structure is indistinguishable from sampling variation in the unstructured effect (unless there’s an additional spatial effect beyond the multi-component ICAR). The BYM2 model partitions variance into the within-component spatially structured part and the entire unstructured part. This is the reason for the asymmetry you are encountering.

If desired, it would be possible to write down a model that attempts to partition the unstructured part in a way that reflects some further structure. In particular, we could imagine that the grouping of the data into components induces structure that is representable by a random effect of component. The “unstructured” variance is then actually modeled as two random effects: a random effect of the spatial unit and a random effect of the component. This induces a component-wise structure in the previously unstructured part of the unit-level variance. In other words, it is one example of an additional spatial effect that you might choose to include in the model.

If you feel it appropriate to do so, you could then choose whether to pack the component-wise unstructured effects into independent sum-to-zero vectors, but note that this would get awkward, because the random effect of component would no longer have homogeneous variance: it would need to accommodate the sampling variation in the component means under non-constant sample size.
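
To make that concrete (again writing n_j for the size of component j and \sigma_\theta for the unstructured scale – my notation): under the unconstrained model the component mean of the unstructured effect is \bar\theta_j \sim N(0, \sigma_\theta^2 / n_j). If the component-wise effects are instead packed into sum-to-zero vectors, the random effect of component has to absorb this term, so its variance becomes roughly \tau^2 + \sigma_\theta^2 / n_j, which differs across components whenever the n_j differ.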

The bigger point is that this is a different model than the previous one, not just a reparameterization. Whether it is likely to be more realistic or not depends on the problem.

Hope some of this helps :)
