Hierarchical ordered multinomial regression

I have a hierarchical ordered multinomial probit regression as one part of my model and I’m looking for a sensible way to deal with the cut points. I have three categories \{1,2,3\}, and thus two (interior) cut points c_{1.5}, c_{2.5}. Section 15.2 in the Gelman/Hill textbook simply lets each cut point vary by group (student in their case), but does not talk about how to ensure c_{j,1.5} < c_{j,2.5} for all j.

In my model I have multiple grouping variables for which I want varying intercepts (cut points), and one grouping variable for which I want varying slopes (with a correlation matrix that includes the intercepts/cut points).

I have come across this great case study about ordinal regression by @betanalpha. I was wondering if the (arbitrary?) “anchor point” \phi, which is set to zero in the final model, could be used for this. Something like \phi_j \sim N(0, \sigma_\phi)?


You have multiple choices, each corresponding to a different model.

If you let the “anchor point”, or in a more general ordinal regression a location model like \phi = \beta \cdot x, vary between groups, then you can change only the offset of the latent logistic distribution. This forces all of the probabilities to change in a coherent way.

If you allow the cut points to vary by group then you essentially allow different latent distributions, which admit a richer variation of probabilities. The issue here, as you note, is allowing variation while respecting the ordering constraint. One option would be to parameterize the cut points as a left-most point and then log differences from that point (which is how ordered vectors work behind the scenes) and then allow variation on that unconstrained space, but then you have a subtle asymmetry between the models for the different cut points; a sketch of this parameterization follows.
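For concreteness, here is a minimal Stan sketch of that left-most-point-plus-log-differences parameterization. The names n_groups and K and the normal population models are illustrative assumptions, not something prescribed in this thread:

data {
  int<lower=2> n_groups;  // hypothetical: number of groups
  int<lower=3> K;         // hypothetical: number of ordinal categories
}

parameters {
  vector[n_groups] c_left;           // left-most cut point for each group
  matrix[n_groups, K - 2] log_diff;  // log gaps to the remaining cut points
  real mu_left;                      // population location of the left-most point
  real<lower=0> tau_left;            // population scale of the left-most point
  vector[K - 2] mu_diff;             // population locations of the log gaps
  vector<lower=0>[K - 2] tau_diff;   // population scales of the log gaps
}

transformed parameters {
  // Rebuild the ordered cut points; exp() keeps every gap positive,
  // so the ordering constraint holds by construction
  matrix[n_groups, K - 1] c;
  for (j in 1:n_groups) {
    c[j, 1] = c_left[j];
    for (k in 2:(K - 1))
      c[j, k] = c[j, k - 1] + exp(log_diff[j, k - 1]);
  }
}

model {
  c_left ~ normal(mu_left, tau_left);
  for (k in 1:(K - 2))
    col(log_diff, k) ~ normal(mu_diff[k], tau_diff[k]);
  // ... hyperpriors on mu_*, tau_* and the ordinal likelihood using c[j] ...
}

Note the asymmetry mentioned above: the left-most cut point gets an additive (normal) model while the remaining cut points get multiplicative (log-normal gap) models.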


To answer a follow-up question from @CerulloE: one can also model heterogeneity in the cut points with the induced Dirichlet prior.

Recall that a hierarchical model takes the form

\pi( \mathbf{c}_{n} | \phi) \cdot \pi(\phi),

where the population model \pi( \mathbf{c}_{n} | \phi) is some self-consistent model for the individual parameters \mathbf{c}_{n}.

Above I suggested reparameterizing the cut points, \mathbf{c}_{n} \mapsto \boldsymbol{\delta}_{n}, to remove the constraint, in which case the individual unconstrained parameters can be modeled with independent normal population models,

\pi( \boldsymbol{\delta}_{n} | \phi) = \prod_{k = 1}^{K} \pi(\delta_{n, k} | \mu_{k}, \tau_{k}).

But one can also just use the induced Dirichlet prior for the population model directly,

\pi( \mathbf{c}_{n} | \phi) = \text{induced-Dirichlet}( \mathbf{c}_{n} | \boldsymbol{\alpha}).

In other words, instead of fitting a model with

\text{induced-Dirichlet}( \mathbf{c}_{n} | \boldsymbol{\alpha})

where the \boldsymbol{\alpha} are constants, one can fit

\text{induced-Dirichlet}( \mathbf{c}_{n} | \boldsymbol{\alpha}) \cdot \pi( \boldsymbol{\alpha})

where the \boldsymbol{\alpha} are parameters.

For the population prior one might try something like \pi( \boldsymbol{\alpha}) = \prod_{k} \pi(\alpha_{k}), or even model the \log \alpha_{k} with a multivariate normal.
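For readers following along, the induced_dirichlet density used in the snippet below is the one derived in the ordinal regression case study linked above; a sketch of the Stan function, assuming a logistic latent distribution with anchor point phi, looks like

functions {
  // Log density over cut points c induced by a Dirichlet(alpha) prior
  // on the ordinal probabilities, for a logistic latent distribution
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);

    // Ordinal probabilities implied by the cut points
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];

    // Jacobian of the map from cut points to probabilities
    for (k in 1:K) J[k, 1] = 1;
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = -rho;
      J[k - 1, k] = rho;
    }

    return dirichlet_lpdf(p | alpha) + log_determinant(J);
  }
}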


Thanks for posting this here Michael, really useful.

I have tried the following in my model:

parameters {
  .............
  vector[n_thresh + 1] log_alpha;
  vector[n_thresh] C[n_studies];
}

model {
  .............
  log_alpha ~ std_normal();
  for (s in 1:n_studies)
    C[s] ~ induced_dirichlet(exp(log_alpha), 0);
}

However I get some errors saying “Dirichlet_lpdf: probabilities is not a valid simplex”, and that the probabilities must be > 0. These occur throughout sampling. Is my parameterisation wrong? If not, I guess something else in the model is causing it.

That’s not enough information for me to provide any constructive comments. The model snippet looks fine, but there may be all kinds of numerical issues or bugs in the rest of the program.


Thanks

How would you obtain the population-level (summary) cutpoints when using this parameterisation?

There are no population-level cutpoints with this model!

In general a hierarchical model has individual parameters and population parameters, but none of the population parameters need to admit an interpretation as an “average” individual or anything similar. The population parameters just characterize the distribution of the individual parameters.

In particular there’s no single mean-like summary of the Dirichlet distribution or the induced distribution of cut points. You can sample cut points from the population parameter posterior (in generated quantities, sample cut points given the alpha) and then average the distribution of each individual cut point to get an “average cut point”, but this won’t capture the correlation between the cut points or the variation within each cut point.
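As a rough sketch of that generated quantities step, reusing the names log_alpha and n_thresh from the snippet above together with the cut-point/probability relations from the case study (anchor point set to zero); this is an illustration, not code from the thread:

generated quantities {
  // Cut points for a hypothetical new group, drawn from the population model
  vector[n_thresh] C_new;
  {
    vector[n_thresh + 1] p = dirichlet_rng(exp(log_alpha));
    for (k in 1:n_thresh)
      C_new[k] = logit(sum(p[1:k]));  // c_k = phi + logit(P(y <= k)), phi = 0
  }
}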


I see, thanks Michael

Using the other parameterisation you suggested in the earlier post above (from Aug '19), I obtain the population means of the logs of the differences between successive cutpoints, and then obtain the second summary-level cutpoint by adding exp(log_diff_mean_1) to the first cutpoint (set to 0), etc.
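In symbols, writing \mu_{k} for the population mean of the k-th log difference (my notation, not from the thread), this reconstruction is the recursion

c_{1} = 0, \qquad c_{k + 1} = c_{k} + \exp(\mu_{k}).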

I guess with this model it is a little harder to incorporate informative priors than with the induced Dirichlet, but it seems to work.

In general a hierarchical model does not always have a well-defined “population mean” characterization. Really, outside of normal hierarchical models most hierarchical models aren’t well described by “base shape” + “variation”. In particular, while one can construct population averages like you suggest, they will not, in general, look like any of the individuals in the population!

I understand that this complicates analyses and papers where people have come to expect such a characterization, but this complication can’t be avoided. Trying to reduce the model to an average will always introduce some kind of compromise. For a somewhat related discussion see https://betanalpha.github.io/assets/case_studies/hierarchical_modeling.html#231_Inferring_Individuals_Verses_The_Latent_Population.


Thanks for the clarification. So, is there any way to construct a hierarchical model on the thresholds whilst obtaining well-described population-average cut-points? For my specific case I need summary-level estimates, so I suppose if that is not possible whilst allowing partial pooling on the threshold-specific parameters (with summary population cutpoints), the only way would be to keep them fixed between groups (complete pooling); however, this will lead to more misfit than the varying-thresholds case. The ‘fixed cutpoint’ model does still fit 24/26 of the groups in my dataset very well, but allowing varying thresholds improves the fit of the remaining two groups.

Once you accept that there is heterogeneity in a system, concepts like “average” don’t make much sense – by definition there is no single behavior that characterizes the individuals in the population, and hence no single summary to report! The temptation is to interpret a population average as some counterfactual behavior that would occur if the system were completely homogeneous, but that makes strong assumptions about the nature of the heterogeneity that typically can’t be verified. This is true for normal hierarchical models as well; it’s just something that many take for granted.

The problem becomes even more complicated when the parameters are subject to multidimensional constraints, since component-wise averages won’t generally respect those constraints. One can try to construct some heuristic average that does respect the constraints, but the statistical interpretation is dubious at best.

All of this is to say that when a journal requires you to report a single “effect”, there is no well-defined answer unless the system being studied is homogeneous and fully characterized by a single value. Any attempt to summarize the behavior with a single number will suffer in one way or another. In a publication setting all you can do is try to choose the summary that the readers will misinterpret the least, and the optimal choice will depend on the specific audience.


Thanks as usual for the nuanced comments. Really good points regarding attempting to summarise with only one measure, especially without accompanying measures/plots to show the heterogeneity.

Would this require an induced_dirichlet_rng function, or would you sample the probabilities from a Dirichlet and then somehow obtain the cutpoints from those? Also, just to clarify, I meant the average across the groups, not the average across the cutpoints (just in case I didn’t explain properly before). So obtaining average cutpoints like you describe from the population parameter posterior is not quite the same as a location parameter obtained from a normal hierarchical model, as the latter is directly estimated, whereas for the Dirichlet we obtain the averages a posteriori. Is such an estimate much more misleading than if we were able to (hypothetically) directly obtain mean cutpoint parameters without sampling from the posterior, akin to those obtained from normal hierarchical models? Or is it misleading simply because it does not convey the heterogeneity in the system without accompanying information? If one were to present summary cutpoints, then, it would be a good idea to also show the accompanying estimates of the group-specific cutpoints, but as you mention, the same goes for normal hierarchical models.

EDIT: never mind – after looking at your ordinal regression case study again I noticed you derived the relations between the latent cutpoints and the ordinal probabilities, so I just used those to convert the population-level posterior probabilities to the summary cutpoints.
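For reference, with a logistic latent distribution and anchor point \phi those relations invert to

c_{k} = \phi + \text{logit}\left( \sum_{j = 1}^{k} p_{j} \right),

so with \phi = 0 each summary cutpoint is just the logit of the corresponding cumulative probability.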

The normal hierarchical model

\theta_{k} \sim \text{normal}(\mu, \tau)

can equivalently be written as

\eta_{k} \sim \text{normal}(0, \tau) \\ \theta_{k} = \mu + \eta_{k},

allowing us to think about the population as some “baseline” behavior \mu plus some heterogeneity around that baseline, \eta_{k}. In this case one might report \mu or the \eta_{k} as the inferences of interest. At the same time one might interpret \mu as a counterfactual context where the heterogeneity vanishes,

\theta_{\text{counter}} = \mu + 0 = \mu.

In more general population models, however, it’s not always possible to decompose the population into separate baseline and heterogeneity contributions, in which case one can’t really talk about common behavior independently of the heterogeneity. In particular we can’t construct counterfactuals like the one above. Unfortunately the valid population models for most constrained variables, especially multi-dimensional constrained variables, almost never allow for the normal-like decomposition, and we have to rethink how we present results.

For example, all we can say here is how the cutpoints behave in each observed context (by reporting the posterior cut point distributions) and how they might behave in new, unobserved contexts (by sampling new cutpoints from the posterior population model).
