Vague Proper Dirichlet Prior


#1

I am estimating a model that contains a N-simplex pi with prior distribution pi ~ dirichlet(pi0 * n_prior);. I am looking for prior specification guidelines akin to those posted here https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations but for the Dirichlet distribution. In particular, I would like to know:

(1) What is the smallest value of n_prior such that the prior is proper (and does this depend on the values of pi0)? Put differently, what is the smallest value that any element of the Dirichlet concentration vector alpha = pi0 * n_prior can take such that the resulting distribution is still proper? (My guess would be 1.)

(2) Less crucially, what is the “boundary-avoiding” Dirichlet prior for use in modal estimation?

Thanks for the help!
Devin


#2

What are your goals for these priors? Dirichlet’s a very weak class as there’s no way to specify correlation among the parameters.

The Dirichlet’s a proper distribution as long as the parameters are greater than zero. That’s implicit in everybody’s notation that alpha > 0 (our doc, Wikipedia, etc.). If alpha = 1, you get a uniform distribution.

But taking Dirichlet(alpha) for some really tiny alpha values is far from vague. As the values get smaller, it concentrates the mass in the corners and the numerics will all go crazy because we can only represent as little as 1e-300 or so.


#3

Thanks, Bob. To be more precise, my goal is to specify priors that are as uninformative as possible while avoiding numerical problems at the boundaries and other issues that lead to poor sampling. In my current application, where alpha has 8 components, I’ve been using the alpha = 1 uniform prior you suggest (ie, in my notation, pi0=1/8 and n_prior=8). However, I am considering specifying more substantively motivated values for pi0, in which case my intuition was that I should use something like n_prior=1/min(pi0) to avoid any alpha too close to 0. But my concern with that approach is that if min(pi0) is small, the resulting n_prior would be large enough to yield a quite informative prior. But from your email it sounds like a prior with alpha close to 0 is not only numerically problematic, but also quite informative. So I guess my question is: (1) is it even possible to specify an uninformative prior with elements of alpha that are highly unequal, and (2) if so, what are some guidelines for choosing a value of n_prior that is both fairly uninformative and leads to good sampling performance?

Thanks for your help!
Devin


#4

Every prior is informative. Dirichlet(1) is uniform over simplexes. It’s information is that every theta in theta ~ Dirichlet(1) is equally likely.

Rather than trying to aim to be as uninformative as possible, I would suggest trying to incorporate as much prior information as you have into your prior. At the very least, make them weakly informative about the scale of the answer.

You can always think of conjugate priors in terms of prior data. The Dirichlet is conjugate for multinomial. If alpha is a positive K-vector and y is a vector of counts, then if the model is

p(theta, y) = Dirichlet(alpha) * multinomial(y | theta)

then the posterior is

p(theta | y) = Dirichleta(theta | alpha + y)`

So you can see that alpha = 1 corresponds to no prior observations.

If alpha < 1, you have “negative” prior counts, which is why it favors draws in the corners. If alpha > 1, you have prior information. How much you have is determined by how much bigger than one.


#5

Thanks, Bob. I take your point about every prior being informative, and also about the virtues of incorporating prior information. I should probably have used “vague” or “diffuse” instead. Thanks again for the help!


#6

The Dirichlet doesn’t have a simple monotonic notion of diffuseness the way something like a normal does. For the normal, the larger the scale parameter sigma, the more diffuse y ~ normal(mu, sigma) is. In the limit as sigma -> infinity, the distribution approaches uniformity over R; in the limit as sigma -> 0, the distribution approaches a delta function.

The Dirichlet is different. It behaves like a generalized Beta distribution. Dirichlet(1) is the most diffuse by most measures of diffusion. For instance, it’s highest entropy because it spreads the probability mass out more evenly over the subspace of simplexes (vectors with non-negative entries that sum to one). Dirichlet(0.1) concentrates more of the mass in the corners, whereas Dirichlet(10) concentrates moreof the mass around the uniform simplex.

So I’d ask again what you’re trying to do. A Dirichlet(0.001) is very informative in the sense that it concentrates most mass on very sparse realizations (they’ll look like one-hot simplexes with a single 1 value and the rest 0 because of floating point rounding). A Dirichlet(1000) is also very informative in that it concentrates most mass very near uniform simplexes. Let’s see what that looks like in practice with

data {
  int<lower = 1> K;
  real<lower = 0> alpha;
}
generated quantities {
  vector[K] theta = dirichlet_rng(rep_vector(alpha, K));
}

Here are the first 10 draws for alpha = 0.001.

iterations   [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]  [,9]  [,10]
      [1,] 3e-203  0e+00 2e-298 9e-106  1e+00  0e+00  0e+00  1e-47 0e+00 4e-279
      [2,]  1e+00  0e+00 5e-279  2e-14 1e-275  0e+00 3e-285 9e-147 0e+00  0e+00
      [3,] 1e-308  0e+00 1e-213  0e+00  0e+00  8e-75  0e+00  1e+00 4e-58 7e-112
      [4,] 6e-166  5e-65  3e-68 3e-147  0e+00  1e+00 3e-249  0e+00 0e+00  0e+00
      [5,]  2e-91  0e+00  0e+00  0e+00  1e-60  0e+00 4e-312  1e+00 0e+00  0e+00
      [6,] 1e-114  0e+00  0e+00 1e-231  1e+00 1e-302  4e-67  0e+00 0e+00  3e-16
      [7,] 3e-311  5e-53 3e-249  0e+00  1e+00 5e-309  0e+00  0e+00 0e+00  0e+00
      [8,] 9e-267  0e+00  1e+00  0e+00  4e-20  0e+00 5e-143 4e-147 2e-90  0e+00
      [9,]  1e+00  0e+00 3e-230 5e-100  0e+00 3e-234 7e-121  6e-76 0e+00  0e+00
     [10,]  0e+00 3e-173  2e-96 3e-164  1e+00  0e+00 4e-257 1e-178 0e+00  2e-06

Here are the first 10 draws for alpha = 1

iterations [,1] [,2] [,3] [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
      [1,] 0.17 0.05 0.07 0.17 0.034 0.133 0.026 0.032 0.271  0.05
      [2,] 0.08 0.02 0.12 0.07 0.521 0.008 0.069 0.043 0.008  0.06
      [3,] 0.02 0.03 0.22 0.29 0.171 0.096 0.086 0.002 0.051  0.03
      [4,] 0.04 0.03 0.21 0.13 0.041 0.009 0.098 0.037 0.224  0.18
      [5,] 0.11 0.22 0.02 0.01 0.059 0.183 0.333 0.041 0.010  0.01
      [6,] 0.19 0.05 0.22 0.03 0.007 0.093 0.036 0.209 0.025  0.13
      [7,] 0.01 0.14 0.18 0.14 0.128 0.051 0.119 0.092 0.077  0.05
      [8,] 0.03 0.06 0.04 0.10 0.049 0.060 0.009 0.227 0.203  0.22
      [9,] 0.03 0.20 0.01 0.05 0.012 0.237 0.112 0.143 0.038  0.17
     [10,] 0.05 0.08 0.06 0.15 0.137 0.106 0.040 0.132 0.070  0.17

And here are the first 10 for alpha = 1000

iterations [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
      [1,]  0.1 0.10  0.1  0.1 0.10 0.10 0.10  0.1 0.10   0.1
      [2,]  0.1 0.10  0.1  0.1 0.10 0.10 0.11  0.1 0.10   0.1
      [3,]  0.1 0.10  0.1  0.1 0.10 0.10 0.10  0.1 0.10   0.1
      [4,]  0.1 0.10  0.1  0.1 0.10 0.10 0.10  0.1 0.10   0.1
      [5,]  0.1 0.10  0.1  0.1 0.10 0.10 0.10  0.1 0.10   0.1
      [6,]  0.1 0.10  0.1  0.1 0.10 0.10 0.09  0.1 0.11   0.1
      [7,]  0.1 0.10  0.1  0.1 0.10 0.09 0.10  0.1 0.10   0.1
      [8,]  0.1 0.09  0.1  0.1 0.10 0.10 0.10  0.1 0.10   0.1
      [9,]  0.1 0.11  0.1  0.1 0.09 0.10 0.10  0.1 0.09   0.1
     [10,]  0.1 0.10  0.1  0.1 0.10 0.10 0.10  0.1 0.10   0.1

As the parameter alpha increases, the simplexes produced are increasingly uniform.

I was using this in RStan to fit:

> fit <- stan("dir.stan", data = list(K = 10, alpha = 1),
              chains=1, iter=10, warmup = 0,
              algorithm = "Fixed_param")

> print(extract(fit)$theta, digits=1)

#7

Hey @Bob_Carpenter, since the Dirichlet cannot support correlation among the parameters (due to it’s alpha being a neutral vector), would you recommend putting priors on the \pi_i for the multinomial directly? That is, forgo the Dirichlet entirely and work with priors on the probabilities for the multinomial?


#8

Maybe this answer by @Bob_Carpenter and this wikipedia article can be of interest. There are other possibilities for building priors on simplices.


#9

That’s the usual solution—the logistic multivariate normal. There’s an example in the user’s guide (now the Stan Book) for the correlated topic model.


#10

Do you know of anywhere with an updated correlated topic model (for stability)? I attempted the following:

data {
  int K;                            // number of possible combinations
  int<lower=0> N_new;               // number of samples to generate
  int<lower=0> N;                   // Number of observations
  int<lower=0,upper=1>  y[N,K];      // Value of each observation
  vector[K] mu_fixed;         // To produce samples
  cov_matrix[K] Sigma_fixed;  // To produce samples
  
  int<lower = 0, upper = 1> run_estimation; // a switch to evaluate the likelihood
}

parameters {
  vector[K] mu;   
  cholesky_factor_corr[K] L_Omega; 
  vector<lower=0>[K] tau;
  
  vector[K] eta;
}

transformed parameters {
  simplex[K] theta;
  theta = softmax(eta);
  
} 

model {
  mu ~ normal(0,1);
  L_Omega ~ lkj_corr_cholesky(2.0);
  tau ~ cauchy(0, 2.5); 
  
  eta ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
  
  // likelihood, which we only evaluate conditionally
  if(run_estimation==1){
    for (n in 1:N) {
      y[n] ~ multinomial(theta);
    }
  }
}

generated quantities {
    int<lower=0,upper=1>  y_hat[N_new,K];
    vector<lower=0,upper=1>[K] theta_fixed;
    theta_fixed = softmax(multi_normal_rng(mu_fixed,Sigma_fixed));

    for (n in 1:N_new) {
      y_hat[n] = multinomial_rng(theta_fixed, 1);
    }
}

and I will generate N_new samples then attempt to recover the model parameters from those samples when doing inference.

Does this seem reasonable?