# Multinomial probit

I am trying to fit a multinomial probit model via data augmentation with Stan where the predictors are fixed effects for each dimension, and one alternative-specific variable that varies in the cross-section. In practical terms, I want to model the demand for products via their latent attributes (fixed effects) and the price that is unique to every customer.

For simplicity, I’ve fixed the covariance matrix at the identity, so that only the coefficients (i.e., preferences) are estimated. With simulated data, however, the model does not recover the parameters, and it is unclear to me why. I am aware of the comments by @bgoodri and @betanalpha in [this](// Scaling of Log-likelihood value in Hamiltonian Monte Carlo - #5 by Subodh) post, but (perhaps naively) I assumed that fixing the covariance matrix at the identity might alleviate the issues that other people have had.

What is particularly puzzling to me is that the E-BFMI is generally between 0.3 and 0.5, and the other diagnostics (as far as I understand them) are not particularly irregular either.

If anyone has a deep understanding of this, could you point me to potential errors in my model specification or explain why the model cannot be fitted properly?

Here’s my code:

```stan
functions {
  real row_max_except(row_vector a, int d) {
    int D = cols(a);
    real max_val;
    if (d == 1) {
      max_val = max(a[2:D]);
    } else if (d == D) {
      max_val = max(a[1:(D - 1)]);
    } else {
      max_val = max(a[1:(d - 1)]);
      max_val = fmax(max_val, max(a[(d + 1):D]));
    }
    return fmax(max_val, 0);
  }
  real partial_sum_lpdf(array[] matrix comb, int start, int end, vector beta,
                        matrix Sigma, int D) {
    array[end - start + 1] vector[D] xb;
    for (i in 1:(end - start + 1)) {
      xb[i] = comb[i, :, 2:(D + 2)] * beta;
    }
    return multi_normal_lupdf(comb[:, :, 1] | xb, Sigma);
  }
}
data {
  int<lower=1> D;                        // number of dimensions
  int<lower=1> N;                        // number of observations
  array[N] int<lower=0, upper=D + 1> y;  // outcome category
  matrix[N * D, D + 1] x;                // design matrix: stacked identities of size D
                                         // and a column vector of a price variable
}
parameters {
  vector[D + 1] beta;  // regression coefficients
  matrix[N, D] z_raw;  // unconstrained utilities
}
transformed parameters {
  matrix[N, D] z;  // constrained utilities
  profile("utilconst") {
    for (n in 1:N) {
      for (d in 1:D) {
        if (y[n] == d) {
          z[n, d] = exp(z_raw[n, d]) + row_max_except(row(z_raw, n), d);
        } else {
          z[n, d] = -exp(z_raw[n, d]) + row_max_except(row(z_raw, n), d);
        }
      }
    }
  }
  array[N] matrix[D, D + 2] comb;
  profile("combine") {
    for (n in 1:N) {
      comb[n] = append_col(z[n, :]', x[((n - 1) * D + 1):(n * D), :]);
    }
  }
}
model {
  profile("params") {
    beta ~ normal(0, 1);
  }
  profile("choice") {
    matrix[D, D] Sigma = identity_matrix(D);
    target += reduce_sum(partial_sum_lupdf, comb, 1, beta, Sigma, D);
  }
  for (n in 1:N) {
    for (d in 1:D) {
      target += z_raw[n, d];  // Jacobian is exp(z_raw) in both cases; adjustment is the log Jacobian
    }
  }
}
```
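One way to sanity-check a hand-written Jacobian adjustment like the one in the model block is to compare the transform against a finite-difference Jacobian. Below is a minimal NumPy sketch (my own re-implementation of `row_max_except` and the utility transform, for illustration only); it checks the diagonal terms, which are exactly `±exp(z_raw)` as the code comment states, while the off-diagonal entries coming from `row_max_except` are piecewise constant:

```python
import numpy as np

def row_max_except(a, d):
    # max over the other components, floored at 0 (mirrors the Stan helper)
    others = np.delete(a, d)
    return max(others.max(), 0.0)

def transform(r, y):
    # constrained utilities: chosen alternative above the running max, others below
    z = np.empty_like(r)
    for d in range(len(r)):
        m = row_max_except(r, d)
        z[d] = np.exp(r[d]) + m if d == y else -np.exp(r[d]) + m
    return z

r = np.array([0.3, -0.8, 1.1])
y = 0  # chosen alternative (0-based here)

# central finite-difference Jacobian of the transform
eps = 1e-6
J = np.empty((3, 3))
for k in range(3):
    dr = np.zeros(3)
    dr[k] = eps
    J[:, k] = (transform(r + dr, y) - transform(r - dr, y)) / (2 * eps)

# the diagonal is +/- exp(r), matching the comment in the model block
assert np.allclose(np.abs(np.diag(J)), np.exp(r), atol=1e-4)
```

The same finite-difference check can be run at many random points to see whether the full log-determinant really reduces to the sum of `z_raw` terms.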

Are you trying to code the following model?

It’s confusing because the “multivariate probit” is a different model.

One problem you may be having is with identifiability—you only identify parameters in the logistic form of the model up to additive constants because `softmax(x) = softmax(x + c)` for any constant `c`.
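That shift invariance is easy to verify numerically; a minimal NumPy sketch:

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability; this does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.5, -1.2, 2.0])
c = 3.7  # arbitrary constant shift
print(np.allclose(softmax(x), softmax(x + c)))  # True: shift-invariant
```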

If this is the model you want, I’d suggest just coding it up directly rather than trying to augment the data as the Wikipedia article shows. That’s useful for Gibbs sampling, but not for HMC.

If there are `D` possible outcomes, the outcome categories should be coded `<lower=1, upper=D>` in Stan. I expected to see a matrix of regression coefficients, with a vector for each outcome (or perhaps one fewer, with one vector pinned to zero). That’s what leads me to wonder what model you’re trying to fit.

There’s a direct implementation here (you’ll need to click on “multi-logit” as the URL from the outside doesn’t seem to work with our new doc nav):

https://mc-stan.org/docs/stan-users-guide/regression.html#multi-logit.section

It uses `categorical_logit`, which is defined to apply `softmax` to its arguments. You’d have to figure out the equivalent of `softmax` for the probit case to do that efficiently in Stan.

Thanks for reviewing my question so thoroughly! The multinomial probit model is exactly what I intend to fit, although ultimately with a covariance matrix that is not restricted to an identity matrix.

Do you have more information on why data augmentation is not useful in HMC? I had read a comment on this forum that direct integration (e.g., via GHK) would be quite hard on the autodiff, and had also hoped that data augmentation might improve mixing of the original parameters (as it does for Gibbs and MH samplers).

The number of outcomes is `D+1`, although the X matrix is already differenced against the base alternative and hence features only `D*N` rows (I should have mentioned this). The regression coefficients in turn are indeed just a vector, which allows for (implicitly) constraining certain coefficients to be equal across the `D` outcomes. This is somewhat different from the standard multivariate regression format, but not entirely unusual when a common (price) coefficient is desired.
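For concreteness, here is a hypothetical NumPy sketch of how such a stacked design matrix could be assembled (the variable names are my own; `price_diff` is the price differenced against the base alternative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 3  # customers, non-base alternatives

# prices for the D+1 alternatives per customer; difference against the base
price = rng.normal(size=(N, D + 1))
price_diff = price[:, 1:] - price[:, :1]  # shape (N, D)

# stack: for each customer, a D x D identity block for the alternative-specific
# intercepts, plus that customer's column of differenced prices
blocks = [np.hstack([np.eye(D), price_diff[n][:, None]]) for n in range(N)]
x = np.vstack(blocks)  # shape (N * D, D + 1), as declared in the data block
print(x.shape)  # (12, 4)
```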

Although multinomial logit is certainly more tractable in many scenarios, I would like to fit the probit version, specifically because it allows for reparametrizing the covariance matrix later on. The closest logit equivalent to this would be pairwise-nested logit, which is significantly less practical than the probit version.

@CerulloE has worked on these models and may have some additional information he can share.


In addition, Jim Savage worked on similar issues a while ago (albeit not directly this model), and his posts and code might be useful to examine:
https://khakieconomics.github.io/

Why does the link matter? Can’t you just use the logit version with a covariance matrix, or is it explicitly tied to marginalizing the data augmentation? Sorry, but we don’t do a lot of data augmentation with HMC, so I’m not that familiar with it.

Like conjugacy, we just don’t need it. The kind of data augmentation used in Gibbs increases dimensionality and requires sampling from truncated distributions, which can be unstable numerically. On the other hand, Stan should be OK with the increased dimensionality, so it’d be interesting to see what you find.

What’s GHK?

In Stan, this needs to be explicit.

Using normal + logit residuals (i.e., logit with a covariance matrix) is indeed an option, but a reviewer questioned it, and because only normal errors were considered feasible, we stuck with them.

The GHK method (Geweke, Hajivassiliou, and Keane) is the most numerically stable (and hence standard) way to compute the integral over the normal that would be needed here without data augmentation. I would have expected this to be almost impossible to get through the autodiff, and I think Ben Goodrich mentioned something to that effect on the Google Forum. He also posted his GHK implementation (which also didn’t work), so I decided to go with data augmentation.
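For readers unfamiliar with it, the core GHK recursion is short; here is a minimal NumPy/SciPy sketch (an illustrative estimator of P(Z < b) for zero-mean multivariate normal Z, not production code):

```python
import numpy as np
from scipy.stats import norm

def ghk(b, Sigma, n_draws=20000, seed=1):
    """GHK estimate of P(Z < b) for Z ~ N(0, Sigma), via sequential
    conditioning on the Cholesky factor of Sigma."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    d = len(b)
    u = rng.uniform(size=(n_draws, d))
    eta = np.zeros((n_draws, d))
    w = np.ones(n_draws)
    for j in range(d):
        # conditional truncation point for component j given earlier draws
        upper = (b[j] - eta[:, :j] @ L[j, :j]) / L[j, j]
        p = norm.cdf(upper)
        w *= p  # accumulate the conditional probabilities
        # draw eta_j from a standard normal truncated above at `upper`
        eta[:, j] = norm.ppf(u[:, j] * p)
    return w.mean()

# sanity check: independent 2-D case has the closed form Phi(0)^2 = 0.25
print(ghk(np.zeros(2), np.eye(2)))  # 0.25 (exact here: the weights are constant)
```

In the multinomial probit likelihood, `b` would be the differenced utilities implied by the data and parameters; this is where the autodiff concern comes in, since the gradient has to flow through the whole recursion.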

In the canonical multivariate regression setup there would be D price coefficients, whereas in the current setup there is only a single one. Hence, in the canonical setup you would need to constrain those D parameters to be equal in order to replicate what is parametrized here as a single coefficient. That is why this format is so convenient.
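That equivalence is easy to check numerically; a small NumPy sketch (hypothetical values) comparing the stacked single-vector parameterization with the canonical, equality-constrained one:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 3
alpha = rng.normal(size=D)  # alternative-specific intercepts
b_price = -0.7              # single shared price coefficient
p = rng.normal(size=D)      # differenced prices for one customer

# stacked form: identity block plus price column, one coefficient vector
x_n = np.hstack([np.eye(D), p[:, None]])
beta = np.append(alpha, b_price)
u_stacked = x_n @ beta

# canonical form: one coefficient vector per alternative,
# with the price coefficients constrained to be equal
u_canonical = alpha + b_price * p
assert np.allclose(u_stacked, u_canonical)
```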

Why would you need to compute an integral? Again, maybe I’m misunderstanding the model you want to fit. I thought the logit version was this multi-logit model we talk about in the User’s Guide.