Summarizing a multinomial posterior (in R)

I have two questions on this issue.

  1. Many non-Bayesian / machine learning scripts offer the option with a multinomial regression of predicting either a softmax or a softprob output, with the former being the most likely category for the observation and the latter being the respective probabilities of each category. Does a typical Bayesian posterior offer both options simply based on how one summarizes the posterior? In other words, would the softmax equivalent be the level that appears most often in the various draws, whereas the softprob would be the percentage of times each level appears over the series of draws?

  2. In a posterior for a continuous output, using posterior_predict, getting the expected mean output value is as simple as using ColMeans. With a multinomial, however, this seems to get tricky. To get the softmax equivalent, perhaps one could use an apply function looking for the mode [there doesn’t appear to be a built in function specific to this]? Getting the softprob seems trickier, because one needs to tally both the appearances of each category by observation and eventually of course relate them to the sum across all the draws, although this latter part is simple to do once the counts for each category have been tabulated for the output of each observation. If anyone has suggestions for efficiently executing either of these functions across a Bayesian posterior, I would appreciate hearing about it.

On question 2, I have hacked together some functions that seem to work, although I don’t doubt there are better ways:

Softmax derivation

softmax <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
apply(preds, 2, softmax)

Softpreds derivation

soft_preds <- list()
for (i in 1:nrow(preds)){
tabl.tmp <- table(preds[,i])
sp.preds <- dplyr::bind_rows(soft_preds, tabl.tmp)

Yeah, you could compute both of these things in generated quantities. The input to the multinomial is a simplex, so that is often the output of a softmax or something like that.

I think all this would depend on what exactly you’re doing, but I think you can achieve the thing you want without doing any processing (other than plotting or any extra analysis) outside of the generated quantities.

Usually when you do the posterior predictive you do it over your whole dataset. So if you have N objects you’re classifying over M classes, and you want to know about the distribution of most-likely outcomes of your softmax, then you’d just generate a softmax from each of your inputs (there’d be N of these) and then put the argmax (an integer 1 to M) of each of them in an array.

So each generated quantities gives you a length N vector of argmaxes, and then you can go about postprocessing them after the fit however you please.

Similarly for the simplexes, you could just save them (though that’ll be a lot of data).

I think though you might be better served sampling from the multinomial distribution itself though. One of the goals of posterior predictives is to generate data that is just like the stuff you fed in. You didn’t feed in a list of simplexes or most likely labels. Presumably you fed in a list of labels generated by a noisy process you’re trying to model – so that’s probably the most interesting output to try to duplicate (though the others can also be interesting as well – just it’s probably worth your time to add this one to the mix).

Hope that helps!

This is confusing because the softmax function maps unconstrained linear predictors to probabilities. Then taking the max gives you the “first bets” guess.

No. The softmax output gives the probabilities and those are what they are. You can store the softmax output and inspect its posterior. Typically you’d only look at the posterior mean here as that would give you the event probabilities for the different outcomes. I do’nt see the point of doing anything with the first-best output. It may vary across the posterior, but it’s not the same as if you were to draw randomly from the posterior simplex. The random draw gives you what you want in terms of uncertainty, but there’s no reason to do that when you have the posterior simplex—that gives you the probabilities directly, so it’s more efficient statistically to compute with that.

Here’s what it looks like in enerated quantities if x is the data matrix (N x J) and beta the coefficient matrix (J x K):

generated quantities {
  simplex[K] p_cat[N];
    matrix[N, K] log_odds = x * beta;
    for (n in 1:N)
      p_cat[n] = softmax(log_odds[n]');