Choosing between two hierarchical models


I am trying to fit a model to N data points with 3 covariates and 1 response. The model below is one Gaussian component in a mixture of two Gaussians; the other Gaussian captures the outliers from this model:

covariates: x, u, v
response: y

Model 1:

y_i = a(u_i, v_i) x_i + b(u_i, v_i) + \epsilon_i \\ \begin{cases} a(u_i, v_i) = a_0 + a_1 u_i + a_2 v_i \\ b(u_i, v_i) = b_0 + b_1 u_i + b_2 v_i \\ \sigma_{int}(u_i, v_i) = \sigma_0 + \sigma_1 u_i + \sigma_2v_i \\ \end{cases}\\ \epsilon_i \sim N(0, \sigma_{int}(u_i, v_i)^2 + \sigma_{y_i}^2 )

Here I fit for the parameters \{a_0, a_1, a_2, b_0, b_1, b_2, \sigma_0, \sigma_1, \sigma_2\}. I have five separate datasets and the fit looks good for each of them. The final model I had in mind would add a level for the variation between datasets to allow better shrinkage.
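
To make the structure of Model 1 concrete, here is a minimal sketch of its log-likelihood for a single dataset, covering only the inlier Gaussian and not the outlier component of the mixture. The function name `model1_log_lik` and the argument layout are hypothetical, and rejecting negative values of \sigma_{int} is an assumption of this sketch, since the linear form does not by itself keep the scale non-negative.

```python
import numpy as np

def model1_log_lik(theta, x, u, v, y, sigma_y):
    """Log-likelihood of Model 1 (inlier component only) for one dataset.

    theta = (a0, a1, a2, b0, b1, b2, s0, s1, s2); sigma_y holds the known
    per-point measurement scales sigma_{y_i}.
    """
    a0, a1, a2, b0, b1, b2, s0, s1, s2 = theta
    slope = a0 + a1 * u + a2 * v          # a(u_i, v_i)
    intercept = b0 + b1 * u + b2 * v      # b(u_i, v_i)
    sigma_int = s0 + s1 * u + s2 * v      # sigma_int(u_i, v_i)
    # Assumption of this sketch: treat a negative intrinsic scale as invalid.
    if np.any(sigma_int < 0):
        return float("-inf")
    var = sigma_int**2 + sigma_y**2       # total variance of epsilon_i
    resid = y - (slope * x + intercept)
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * resid**2 / var))
```

This function could be handed to an optimizer or sampler as is, or summed over the five datasets if the parameters are shared.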

Model 2:

People have used simple binning of the data: they bin in v, then bin in u, fit the model y = ax + b + \epsilon in each bin, and then look at how a, b, \sigma change as a function of u and v by fitting a linear model to these estimates after the first fit. Here v has a time-like nature and u is related to a property of the environment (there are k types of environments at each time slice).

To build a similar model, I am trying to fit the linear model for y(x) in each of these environments at different times, using a random slope and intercept model whose baseline at each level is linear in u and v: one level with a linear term in u and one level with a linear term in v, within a three-level model (i.e. using binned u and v as group-level covariates).

There are 4 types of environments at a given time, so with 5 epochs I'd have 20 level-2 parameters, where the variation in each parameter for a given environment type follows the same distribution; this variation is in addition to the linear baseline in terms of the 4 binned environments. There are also 5 level-3 parameters to allow for variation at the level of time, in addition to the linear baseline in terms of the 5 binned times. This would be a four-level model if I include another level for the different datasets.

y_i = a(\bar{u}_i, \bar{v}_i) x_i + b(\bar{u}_i, \bar{v}_i) + \epsilon_i \\ a(\bar{u}_i, \bar{v}_i) = a_0 + a_1 \bar{u}_i + a_2\bar{v}_i + \epsilon_{\bar{u}_i, \bar{v}_i} + \epsilon_{\bar{v}_i}\\ \epsilon_{\bar{u}_i, \bar{v}_i} \sim N(0, \tau^2_{a,\bar{u_i}}) \\ \epsilon_{\bar{v}_i} \sim N(0, \tau^2_{a, \bar{v}})\\ \epsilon_i \sim N(0, \sigma_{int}(\bar{u}_i, \bar{v}_i)^2 + \sigma_{y_i}^2 )

Here \tau_{a,\bar{u_i}} is the same for all points in the same environment, so there are k=4 such parameters. Also, \bar{u}, \bar{v} are the binned versions of the original u, v, so \bar{u}_i is the median or mean value of the bin into which the i-th data point falls. An equation similar to the one written for a(\bar{u}_i, \bar{v}_i) is used for b(\bar{u}_i, \bar{v}_i) and \sigma_{int}(\bar{u}_i, \bar{v}_i).
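
The binning step that produces \bar{u} and \bar{v} can be sketched as follows. The helper name `bin_to_medians` and the choice of bin edges are hypothetical; the median is used here, but the mean would work the same way.

```python
import numpy as np

def bin_to_medians(values, edges):
    """Return each value's bin index and the median of the bin it falls into.

    `edges` are the bin boundaries, including the outer ones; only the
    interior edges are needed to assign indices 0 .. len(edges) - 2.
    """
    values = np.asarray(values, dtype=float)
    idx = np.digitize(values, edges[1:-1])
    binned = np.empty_like(values)
    for k in np.unique(idx):
        mask = idx == k
        binned[mask] = np.median(values[mask])  # \bar{u}_i for points in bin k
    return idx, binned

# Usage with made-up values of u and three bins.
u = np.array([0.1, 0.3, 1.5, 2.2, 2.8])
u_idx, u_bar = bin_to_medians(u, edges=[0.0, 1.0, 2.0, 3.0])
```

The index array `u_idx` plays the role of the group label for the random effects, while `u_bar` is the group-level covariate entering the linear baseline.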

Question:

  1. Which one of these models is correct and if both/none are correct what would you recommend? Also, are there any benefits in adding these extra levels instead of using the data like in model 1?

  2. Also, should I expect to get similar results from both approaches for the slopes in u and v, namely a_1, a_2, b_1, b_2, \sigma_1, \sigma_2, or will they be washed out by the extra random variation terms added at different levels?

Neither model here is “correct” in any technical sense. They both make approximations, and the relative accuracy of those approximations will vary from application to application.

The first model can be thought of as a nested Taylor approximation. The location of a normal observational model is assumed to vary with the covariates x, u, and v. If the covariate values vary only slightly around some baseline value then the dependence can be approximated as linear. First the x dependence is accounted for,

m(x, u, v) \approx b(u, v) + a(u, v) \cdot x,

then the u and v dependence is considered,

m(x, u, v) \approx (b_0 + b_1 \cdot u + b_2 \cdot v) + (a_0 + a_1 \cdot u + a_2 \cdot v) \cdot x.

This model is exact if the actual dependence is linear in x, u, and v, and it can be a reasonable approximation if the variation in those values is reasonably small. If the variation is larger then higher-order, non-linear contributions become important. Similarly if there are any correlations between the variations then the cross terms, i.e. “interactions”, also become important.
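
Multiplying out the nested approximation gives a model that is linear in six coefficients, with columns 1, u, v, x, u \cdot x, and v \cdot x. The sketch below builds that design matrix and recovers the coefficients by least squares from noise-free simulated data. The coefficient values and the helper name `design_matrix` are made up for illustration. Note that the expansion contains u \cdot x and v \cdot x products but no u \cdot v term.

```python
import numpy as np

def design_matrix(x, u, v):
    """Columns 1, u, v, x, u*x, v*x, matching (b0, b1, b2, a0, a1, a2)."""
    return np.column_stack([np.ones_like(x), u, v, x, u * x, v * x])

rng = np.random.default_rng(0)
n = 200
x, u, v = rng.normal(size=(3, n))

# Hypothetical true values of (b0, b1, b2, a0, a1, a2).
true_coef = np.array([0.5, -0.2, 0.1, 2.0, 0.3, -0.4])

# Noise-free responses, so the linear model is exact by construction.
y = design_matrix(x, u, v) @ true_coef

coef, *_ = np.linalg.lstsq(design_matrix(x, u, v), y, rcond=None)
```

With exactly linear data the fitted `coef` matches `true_coef` up to floating-point error; adding curvature or a u \cdot v term to the simulated `y` would leave structure that these six columns cannot absorb.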

The second model turns u and v into discrete covariates which can then be used to group the observed data into different categories or contexts. If the actual dependence on u and v is relatively constant across the discretizing bins then the dependence can be reasonably approximated by constant values in each bin. Moreover, if the bins are wide enough that there’s not much continuity between neighboring constant values then exchangeability becomes a reasonable assumption and one can apply hierarchies. The benefit of this model is that it can capture non-linear dependencies in u and v. One weakness is that the binning can lose a lot of information; another is that the non-continuous model for the u and v dependence is so flexible that it can lead to large uncertainties without a lot of data. Also there’s still no accounting for interactions between u and v.

Which model is most appropriate in any given situation will depend on which assumptions one thinks are more reasonable. If dependencies are expected to be roughly linear then the first model will probably be better, but the second model would be needed to capture any non-linear behavior.
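
As a rough illustration of this trade-off, the sketch below compares a linear approximation with a constant-per-bin approximation of a one-dimensional dependence a(u), once for a quadratic dependence and once for a linear one. Everything here is hypothetical (the function name `rms_errors`, the four equal-width bins, the test functions), and because the example is noise-free it shows only approximation error, not the larger uncertainties of the more flexible model.

```python
import numpy as np

def rms_errors(a_true, u, n_bins=4):
    """RMS error of a linear fit and of a constant-per-bin fit to a(u)."""
    # Linear approximation a(u) ~ intercept + slope * u.
    slope, intercept = np.polyfit(u, a_true, 1)
    linear_fit = intercept + slope * u

    # Constant-per-bin approximation using equal-width bins in u.
    edges = np.linspace(u.min(), u.max(), n_bins + 1)
    idx = np.digitize(u, edges[1:-1])
    bin_means = np.bincount(idx, weights=a_true) / np.bincount(idx)
    binned_fit = bin_means[idx]

    def rms(r):
        return float(np.sqrt(np.mean(r**2)))

    return rms(a_true - linear_fit), rms(a_true - binned_fit)

u = np.linspace(-1.0, 1.0, 400)
linear_err_quadratic, binned_err_quadratic = rms_errors(u**2, u)
linear_err_linear, binned_err_linear = rms_errors(0.3 * u, u)
```

For the quadratic dependence the binned approximation has the smaller error, while for the linear dependence the linear approximation is exact and the binned one is not.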

Dear @betanalpha,

Thanks for your detailed answer, it cleared up a lot of my issues.

One thing I wanted to ask was when you say:

does it mean that under the conditions you mentioned one can ignore what I considered the baseline, a(\bar{u}_i, \bar{v}_i) = a_0 + a_1 \cdot \bar{u}_i + a_2 \cdot \bar{v}_i, and instead use a constant model in each bin, a(\bar{u}_i, \bar{v}_i) = a_{\bar{u}_i, \bar{v}_i}?

BTW, thank you very much for providing fantastic resources on these topics through your writings as I am learning a lot from them.

a_0 + a_1 \cdot u + a_2 \cdot v, u_{i} + v_{j} and uv_{i, j} are three different ways to approximate a(u, v), each with their own assumptions as discussed above (the latter two are best compared from a factor modeling perspective, https://betanalpha.github.io/assets/case_studies/factor_modeling.html). If the binning effects are negligible then the third model can accommodate the linear behavior of the first model as well as non-linear behavior, but as always a more flexible model will result in larger uncertainties if the flexibility isn’t needed.
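
A small sketch of how the three forms relate on a grid of bins, assuming 4 environment bins and 5 epoch bins with made-up bin centers and coefficients. The linear form has 3 parameters, the additive form u_{i} + v_{j} has 4 + 5 (one of which is redundant), and the full table uv_{i, j} has 4 \cdot 5 = 20. The helper `additive_fit` is hypothetical and uses the row-plus-column-mean least-squares solution for a complete, balanced table.

```python
import numpy as np

u_bins = np.array([0.0, 1.0, 2.0, 3.0])       # hypothetical environment bin centers
v_bins = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical epoch bin centers
U, V = np.meshgrid(u_bins, v_bins, indexing="ij")

def additive_fit(table):
    """Best least-squares approximation of a full table by u_i + v_j."""
    row = table.mean(axis=1, keepdims=True)   # per-environment effect
    col = table.mean(axis=0, keepdims=True)   # per-epoch effect
    return row + col - table.mean()

# A surface that is linear in u and v, and one with an added u*v term.
linear_surface = 1.0 + 0.5 * U - 0.2 * V
interacting_surface = linear_surface + 0.3 * U * V

additive_ok_linear = np.allclose(additive_fit(linear_surface), linear_surface)
additive_ok_interacting = np.allclose(additive_fit(interacting_surface), interacting_surface)
```

The full table can hold either surface exactly because it has one free value per cell. The additive form reproduces the linear surface but not the one with the u \cdot v term, which is the kind of behavior only the third form can absorb.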
