Priors for highly skewed multinomial word counts

Suppose we have words i = 1, ..., n, and observe individual word counts x_1, ..., x_n with N = \sum_{i=1}^n x_i fixed. Often, the relative frequencies of words i and j are informative for some outcome, or generally of interest, so we use the model:

(X_1, ..., X_n) \sim \mathrm{Multinomial}(N, \pi)

where \pi lives on the (n-1)-simplex, and we want to estimate \pi_i and \pi_j, for example. A standard prior for \pi is then a \mathrm{Dirichlet}(\alpha) distribution, where \alpha > 0 can be interpreted as pseudo-counts: \alpha = (1, ..., 1) is uninformative, \alpha \to 0 element-wise is anti-conservative, and \alpha \to \infty element-wise is conservative.

However, we also know that the distribution of the word counts X_i is typically highly skewed, say, approximately log-normal or power-law distributed. Can I incorporate this information into the prior for \pi (or generally into the whole model)? Do people do things like put a LogNormal hyperprior on \alpha? Also, does \alpha \to 0 imply a particular skew on \pi?
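The last question can be checked empirically. A quick simulation sketch (not from the thread; the sample sizes and seed are arbitrary choices) draws from a symmetric Dirichlet and tracks how much mass the single largest component captures on average, for several values of alpha:

```python
# Sketch: does alpha -> 0 induce skew in Dirichlet draws?
import numpy as np

def mean_max_mass(alpha, n=1000, draws=200, seed=0):
    """Average mass of the largest component of Dirichlet(alpha * 1_n)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(n, alpha), size=draws)
    return pi.max(axis=1).mean()

for a in [0.1, 1.0, 100.0]:
    print(a, mean_max_mass(a))
```

Small alpha concentrates mass on a few components (sparse, highly skewed draws), while large alpha pushes every draw toward the uniform 1/n, so in that sense alpha -> 0 does imply a particular skew on \pi.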


Tagging @stemangiola and @martinmodrak.


Highly skewed compared to what? Count data is typically highly skewed, and the multinomial models that. For more overdispersed counts you can use the Dirichlet-multinomial, although none of these allows extremely large tails. I think of it as: the multinomial is to the Poisson as the Dirichlet-multinomial is to the negative binomial (but this is totally intuition and non-scientific :) )

// Equivalent to multinomial(dirichlet(alpha))
  real dirichlet_multinomial_lpmf(int[] y, vector alpha) {
    real alpha_plus = sum(alpha);

    return lgamma(alpha_plus) + sum(lgamma(alpha + to_vector(y)))
           - lgamma(alpha_plus + sum(y)) - sum(lgamma(alpha));
  }
where alpha is the vector of Dirichlet concentration parameters.
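The same algebra can be sanity-checked outside Stan. A sketch using scipy (my own check, not from the thread): the extra lgamma(N + 1) - sum(lgamma(y_i + 1)) term below is the multinomial coefficient, which the Stan function can drop because it is constant in alpha; adding it back, the pmf should sum to 1 over all count vectors with fixed N:

```python
import numpy as np
from scipy.special import gammaln

def dm_logpmf(y, alpha):
    """Dirichlet-multinomial log pmf, same algebra as the Stan function,
    plus the multinomial coefficient so the pmf normalizes exactly."""
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    N, alpha_plus = y.sum(), alpha.sum()
    coef = gammaln(N + 1) - gammaln(y + 1).sum()
    return (coef + gammaln(alpha_plus) - gammaln(alpha_plus + N)
            + gammaln(alpha + y).sum() - gammaln(alpha).sum())

# Normalization check: sum over all (y1, y2, y3) with y1 + y2 + y3 = N.
alpha = np.array([0.5, 1.0, 2.0])
N = 6
total = sum(
    np.exp(dm_logpmf([y1, y2, N - y1 - y2], alpha))
    for y1 in range(N + 1) for y2 in range(N + 1 - y1)
)
print(total)  # should be ~1.0
```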

If the log proportions look roughly normal, you should be OK with a Dirichlet prior. Otherwise, changing the prior on alpha does not do much; you should use something other than the Dirichlet, for example multinomial(softmax(parameter_coming_from_student_t)), but we didn't manage to go anywhere with that.
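To see what that alternative looks like, here is a hypothetical sketch (the variable names and distribution choices are mine, not the poster's): proportions built by pushing heavy-tailed logits through a softmax, as a fatter-tailed alternative to Dirichlet draws.

```python
# Sketch: softmax of heavy-tailed logits vs. Gaussian logits.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
n = 1000
pi_t = softmax(rng.standard_t(df=2, size=n))   # heavy-tailed logits
pi_gauss = softmax(rng.standard_normal(n))     # lighter-tailed baseline
```

With df = 2, an occasional extreme logit lets a single word grab most of the mass, which a Dirichlet with fixed alpha does not readily produce.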

You could use a rectangular-beta prior for proportional data with extremely large tails, although I have zero experience with that.

I think you might be well off with the dirichlet_multinomial. My experience is to not go too exotic.


I think my real question comes down to this: multinomial models allow for highly skewed distributions, but they do not enforce highly skewed outcomes. If we remove or penalize models in which \pi is spread approximately evenly across all words, does that improve our estimates of \pi? It just feels like there is prior information not making it into the Dirichlet-multinomial model, at least for word counts.

Can you explain this a bit more? I’m not seeing where it comes from.

I would not try to make a model enforce anything.

Anyhow, @Bob_Carpenter is much more knowledgeable than me on this. Before speculating altogether about what a model can or cannot do on certain data, maybe try modelling your data with a multinomial, and plot a posterior predictive check for an element that is poorly modelled. Then members of the community can propose solutions based on something concrete.
