Multinomial probit

I am trying to fit a multinomial probit model via data augmentation with Stan where the predictors are fixed effects for each dimension, and one alternative-specific variable that varies in the cross-section. In practical terms, I want to model the demand for products via their latent attributes (fixed effects) and the price that is unique to every customer.

For simplicity, I’ve fixed the covariance matrix at the identity, so that only the coefficients (i.e. preferences) need to be estimated. With simulated data, however, the model does not recover the parameters, and it is unclear to me why. I am aware of the comments by @bgoodri and @betanalpha in the “Scaling of Log-likelihood value in Hamiltonian Monte Carlo” thread (comment #5 by Subodh), but (perhaps naively) I assumed that fixing the covariance matrix at the identity might alleviate the issues that other people have had.

What is particularly puzzling to me is that the E-BFMI is generally between 0.3 and 0.5, and the other diagnostics (as far as I understand them) are not particularly irregular either.

If anyone has a deep understanding of this, could you point me to potential errors in my model specification or explain why the model cannot be fitted properly?

Here’s my code:

functions {
  real row_max_except(row_vector a, int d) {
    int D = cols(a);
    real max_val;
    
    if (d == 1) {
      max_val = max(a[2 : D]);
    } else if (d == D) {
      max_val = max(a[1 : (D - 1)]);
    } else {
      max_val = max(a[1 : (d - 1)]);
      max_val = fmax(max_val, max(a[(d + 1) : D]));
    }
    return fmax(max_val, 0);
  }
  real partial_sum_lpdf(array[] matrix comb, int start, int end, vector beta,
                        matrix Sigma, int D) {
    array[end - start + 1] vector[D] xb;
    for (i in 1 : (end - start + 1)) {
      xb[i] = comb[i,  : , 2 : (D + 2)] * beta;
    }
    return multi_normal_lupdf(comb[ : ,  : , 1] | xb, Sigma);
  }
}
data {
  int<lower=1> D; // Number of dimensions
  int<lower=1> N; // Number of observations
  array[N] int<lower=0, upper=D + 1> y; // Outcome category
  matrix[N * D, D + 1] x; // Design matrix: Stacked identities of size D and a column vector of a price variable
}
parameters {
  vector[D + 1] beta; // Regression coefficients
  matrix[N, D] z_raw; // Unconstrained utilities
}
transformed parameters {
  matrix[N, D] z; // Constrained utilities
  profile("utilconst") {
    for (n in 1 : N) {
      for (d in 1 : D) {
        if (y[n] == d) {
          z[n, d] = exp(z_raw[n, d]) + row_max_except(row(z_raw, n), d);
        } else {
          z[n, d] = -exp(z_raw[n, d]) + row_max_except(row(z_raw, n), d);
        }
      }
    }
  }
  
  array[N] matrix[D, D + 2] comb;
  profile("combine") {
    for (n in 1 : N) {
      comb[n] = append_col(z[n,  : ]', x[((n - 1) * D + 1) : (n * D),  : ]);
    }
  }
}
model {
  profile("params") {
    beta ~ normal(0, 1);
  }
  profile("choice") {
    matrix[D, D] Sigma = identity_matrix(D);
    target += reduce_sum(partial_sum_lupdf, comb, 1, beta, Sigma, D);
  }
  profile("jacadj") {
    for (n in 1 : N) {
      for (d in 1 : D) {
        target += z_raw[n, d]; // Jacobian is exp(z_raw) in both cases, adjustment is the log Jacobian
      }
    }
  }
}

Are you trying to code the following model?

https://en.wikipedia.org/wiki/Multinomial_probit

It’s confusing because the “multivariate probit” is a different model.

One problem you may be having is with identifiability—you only identify parameters in the logistic form of the model up to additive constants because softmax(x) = softmax(x + c) for any constant c.
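
Spelling that out for one component k:

\text{softmax}(x + c)_k = \frac{\exp(x_k + c)}{\sum_j \exp(x_j + c)} = \frac{\exp(c)\exp(x_k)}{\exp(c)\sum_j \exp(x_j)} = \text{softmax}(x)_k,

so only differences among the linear predictors are identified unless one of them is pinned (usually to zero).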

If this is the model you want, I’d suggest just coding it up directly rather than trying to augment the data as the Wikipedia shows. That’s useful for Gibbs sampling, but not for HMC.

If there are D possible outcomes, the outcome categories should be coded <lower=1, upper=D> in Stan. I expected to see a matrix of regression coefficients, with a vector for each outcome (or perhaps one fewer vectors with one pinned to zero). That’s what leads me to wonder what model you’re trying to fit.

There’s a direct implementation here (you’ll need to click on “multi-logit” as the URL from the outside doesn’t seem to work with our new doc nav):

https://mc-stan.org/docs/stan-users-guide/regression.html#multi-logit.section

It uses categorical_logit, which is defined to apply softmax to its arguments. You’d have to figure out the equivalent of softmax for the probit case to do that efficiently in Stan.
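
From memory, the model in that section looks roughly like this (check the guide for the exact version):

data {
  int<lower=2> K;                        // number of outcome categories
  int<lower=0> N;                        // number of observations
  int<lower=1> D;                        // number of predictors
  array[N] int<lower=1, upper=K> y;      // outcomes
  matrix[N, D] x;                        // design matrix
}
parameters {
  matrix[D, K] beta;                     // one coefficient vector per category
}
model {
  matrix[N, K] x_beta = x * beta;        // linear predictors for every category
  to_vector(beta) ~ normal(0, 5);
  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');  // applies softmax internally
  }
}

The guide also discusses the identifiability issue above, e.g. using K - 1 coefficient vectors with the last one pinned to zero.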

Thanks for reviewing my question so thoroughly! The multinomial probit model is exactly what I intend to fit, although ultimately with a covariance matrix that is not restricted to an identity matrix.

Do you have more information on why data augmentation is not useful with HMC? I had read a comment in this forum that direct integration (e.g. via GHK) would be quite hard on the autodiff, and I had also hoped that data augmentation might improve mixing of the original parameters (as it does for Gibbs and MH samplers).

The number of outcomes is D+1, but the X matrix is already differenced against the base alternative and hence has only D*N rows (I should have mentioned this). The regression coefficients, in turn, are indeed just a vector, which allows certain coefficients to be (implicitly) constrained to be equal across the D outcomes. This is somewhat different from the standard multivariate regression format, but not entirely unusual when a common (price) coefficient is desired.
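
In case it helps, the differenced-utility representation I have in mind is roughly

z_{nd} = U_{nd} - U_{n,\text{base}} = x_{nd}^\top \beta + \varepsilon_{nd}, \qquad d = 1, \dots, D,

with the base alternative chosen when \max_d z_{nd} < 0 and alternative d chosen when z_{nd} is the (positive) maximum; the rows of x are these differenced covariates.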

Although multinomial logit is certainly more tractable in many scenarios, I would like to fit the probit version, specifically because it allows for reparametrizing the covariance matrix later on. The closest logit equivalent to this would be pairwise-nested logit, which is significantly less practical than the probit version.

@CerulloE has worked on these models and may have some additional information he can share.


In addition, Jim Savage worked on similar issues a while ago (albeit not directly this model), and his posts and code might be useful to examine:
https://khakieconomics.github.io/

Why does the link matter? Can’t you just use the logit version with a covariance matrix, or is it explicitly tied to marginalizing the data augmentation? Sorry, but we don’t do a lot of data augmentation with HMC, so I’m not that familiar with it.

Like conjugacy, we just don’t need it. The kind of data augmentation used in Gibbs increases dimensionality and requires sampling from truncated distributions, which can be unstable numerically. On the other hand, Stan should be OK with the increased dimensionality, so it’d be interesting to see what you find.

What’s GHK?

In Stan, this needs to be explicit.

Using normal + logit residuals (i.e. logit with a covariance matrix) is indeed an option, but a reviewer questioned that, and because only normal errors are feasible for us, we stuck with them.

The GHK method (Geweke, Hajivassiliou and Keane) is the most numerically stable (and hence standard) way to compute the integral over the normal that would be needed here without data augmentation. I would have expected it to be almost impossible to get through the autodiff, and I think Ben Goodrich mentioned something to that effect in the Google Forum. He also posted his implementation with GHK (which also didn’t work), so I decided to go with data augmentation.
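
Roughly, with \Sigma = L L^\top (Cholesky factor L), GHK writes the rectangle probability as an expectation of a product of univariate normal CDFs,

\Pr(z < b) = \mathbb{E}\left[\, \prod_{d=1}^{D} \Phi\!\left(\frac{b_d - \sum_{k<d} L_{dk}\,\eta_k}{L_{dd}}\right) \right],

where each \eta_d is drawn from a standard normal truncated so that z_d stays below its bound. The univariate CDFs are what make it numerically stable, and the recursive truncated draws are what I expected to be hard on the autodiff.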

In the canonical multivariate regression setup there would be D parameters, whereas in the current setup you only have a single one. Hence, in the canonical setup you would need to constrain those D parameters to be equal to replicate what is parametrized as a single coefficient here. That is why it’s so convenient.
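
In symbols (my notation): the canonical setup has mean utilities \mu_{nd} = x_n^\top \beta_d with a separate coefficient vector per alternative, whereas here \mu_{nd} = \alpha_d + \beta_{\text{price}}\, p_{nd}, i.e. alternative-specific intercepts plus a single price coefficient shared across alternatives, which is exactly what the stacked-identity design matrix with an appended price column encodes.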

Why would you need to compute an integral? Again, maybe I’m misunderstanding the model you want to fit. I thought the logit version was this multi-logit model we talk about in the User’s Guide.

https://cran.r-project.org/web/packages/MNP/vignettes/MNP.pdf

https://cran.r-project.org/web/packages/mlogit/vignettes/c6.mprobit.html

I’m actually very curious as to why the multinomial probit model isn’t included in the Stan examples or the brms package. I think the above links and related literature may provide some information, including a discussion of the prior setup.

Assuming the multinomial probit is the generalization of probit regression the same way that multinomial logit is the generalization of logistic regression, as the Wikipedia indicates, then the answer is that it’s much easier to compute with logit/inverse-logit.

@torrell The implementations by Imai & van Dyk, as well as the one by Peter Rossi and co-authors, are a Gibbs sampler and a hybrid MH sampler, respectively. These differ fundamentally in how they explore the posterior.

@Bob_Carpenter I’m sure you are aware that what is being modelled in any discrete dependent variable model is the outcome’s probability. The linear predictor is augmented with an error term (normal or Gumbel), which is integrated out to obtain the (conditional) probability of the particular outcome. A difference of Gumbel random variables is logistic, and integrating out the Gumbel errors yields choice probabilities of the softmax form. (Part of this last sentence is what McFadden was awarded the Nobel prize for.)
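
Writing out what gets integrated (my notation), with systematic utilities V_{nd}:

Gumbel errors: \Pr(y_n = d) = \frac{\exp(V_{nd})}{\sum_j \exp(V_{nj})}, the closed-form softmax.

Normal errors: \Pr(y_n = d) = \int \mathbf{1}\{V_{nd} + \varepsilon_{nd} \ge V_{nj} + \varepsilon_{nj} \ \forall j\}\, \phi(\varepsilon \mid 0, \Sigma)\, d\varepsilon, which has no closed form; this is the integral that GHK approximates and that the data augmentation avoids evaluating.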

To ensure the multinomial logit model is valid, the independence of irrelevant alternatives (IIA) assumption must hold. This means that the odds in each logistic equation remain unchanged regardless of whether other categories are included or excluded from the model. I believe this contributes to the relative simplicity of the multinomial logit model. If the IIA assumption is violated—meaning that the errors are correlated across choices—the alternative-specific multinomial probit model becomes a viable option. This might explain why the author of the original post considered using that model.
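
In symbols, under multinomial logit the odds between any two alternatives reduce to

\frac{\Pr(y = i)}{\Pr(y = j)} = \frac{\exp(V_i)}{\exp(V_j)},

which involves none of the other alternatives in the choice set; a multinomial probit with a non-diagonal covariance matrix relaxes exactly this property.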

The complexity may also be why there has been limited discussion of it within the Stan environment. There was some brief discussion of the model roughly a decade ago by Andrew Gelman and Ben Goodrich, as noted in this thread:

https://groups.google.com/g/stan-users/c/8xYsh5WAiKM

As you mentioned, these two packages were developed before Hamiltonian Monte Carlo (HMC) and its extensions were invented or gained popularity. I hope it’s okay to ask, but if different MCMC sampling methods are used, would it not be possible to collectively compare their posterior samples and inferences?

Your mention of the Gumbel distribution also brings to mind the complementary log-log (cloglog) link function, which corresponds to it. However, I’m not entirely sure how this relates to the assumptions for multinomial probit models. Is the issue you raised similar to the Gumbel-Softmax reparameterization concept in machine learning?

https://arxiv.org/abs/1611.01144

Thanks for the clarifications, @mippeqf.

Always OK to ask anything!

Yes, it is possible to compare samples drawn from different samplers. For something like i.i.d. LOO, you need to have a log_lik variable defined in both.
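
As a minimal sketch, assuming the data and parameters of the multi-logit model above, that could look like this (names are illustrative):

generated quantities {
  vector[N] log_lik;
  {
    // recompute the linear predictors; model-block locals are not visible here
    matrix[N, K] x_beta = x * beta;
    for (n in 1:N) {
      // pointwise log-likelihood, e.g. for PSIS-LOO
      log_lik[n] = categorical_logit_lpmf(y[n] | x_beta[n]');
    }
  }
}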

There are all sorts of different parameterizations for distributions. In Stan, we transform everything to have support over all of \mathbb{R}^D, map the unconstrained variables back to the constrained scale, apply the change-of-variables correction, and evaluate the model on the constrained scale; HMC itself runs on the unconstrained scale. That determines which transforms for the changes of variables, and which likelihoods and priors, will be easier or harder to fit.
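
Concretely, if \theta = f(\theta_{\text{unc}}) with f mapping unconstrained to constrained values, the density on the unconstrained scale is

\log p(\theta_{\text{unc}}) = \log p\big(f(\theta_{\text{unc}})\big) + \log \left|\det J_f(\theta_{\text{unc}})\right|,

so, for example, a scalar \theta = \exp(\theta_{\text{unc}}) contributes \theta_{\text{unc}} to the target, which is the same adjustment as the jacadj block in the model above.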