Dirichlet Priors

In section 20.2 of the manual, some example priors for the beta distribution are given (quoting from the manual):

parameters {
  real<lower=0,upper=1> phi;
  real<lower=0.1> lambda;
  ...
transformed parameters {
  real<lower=0> alpha = lambda * phi;
  real<lower=0> beta = lambda * (1 - phi);
  ...
model {
  phi ~ beta(1, 1); // uniform on phi, could drop
  lambda ~ pareto(0.1, 1.5);
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
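
For intuition, here is a small R sketch (mine, not from the manual) of what this prior implies for theta; since base R has no Pareto sampler, lambda is drawn via the inverse CDF:

N <- 1e5
# phi ~ beta(1, 1): uniform prior on the mean of theta
phi <- runif(N)
# lambda ~ pareto(0.1, 1.5): inverse CDF is x_min * (1 - u)^(-1 / alpha)
lambda <- 0.1 * (1 - runif(N))^(-1 / 1.5)
# theta | phi, lambda ~ beta(lambda * phi, lambda * (1 - phi))
theta <- rbeta(N, lambda * phi, lambda * (1 - phi))
hist(theta, breaks = 100)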

However, for the Dirichlet only the form is specified, without example prior distributions:

parameters {
  simplex[K] phi;
  real<lower=0> kappa;
  simplex[K] theta[N];
  ...
transformed parameters {
  vector[K] alpha = kappa * phi;
  ...
}
model {
  phi ~ ...;
  kappa ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
}

Obviously this will depend on the problem, but in general, what would be some sensible choices for phi and kappa?

Sorry for taking so long to reply. I’m a bit uncertain myself about suitable priors, but my hope is that @martinmodrak could help, or find the expertise to do it :)

Thanks, I ended up doing:

parameters {
  simplex[K] phi;
  real kappa;
  simplex[K] theta[N];
  ...
}
transformed parameters {
  vector[K] alpha = exp(kappa) * phi;
  ...
}
model {
  phi ~ dirichlet(rep_vector(1, K)); // uniform on the simplex, redundant
  kappa ~ normal(0, 5); // allows a lot of variation in the Dirichlet hyperparameters
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
  ...
}

I guess the prior predictive checks looked sane?

So the combination of

vector[K] alpha = exp(kappa) * phi;

and

kappa ~ normal(0, 5) ;

is indeed extremely wide - but hey, as long as you have enough data to constrain it, it should not be a problem. To be safe, I would also consider running the model with a narrower (say normal(0,1)) prior on kappa and see if your inferences change. If not, you are probably good.
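
As a quick illustration (my own back-of-the-envelope R, nothing rigorous), the two priors imply very different scales for the total concentration sum(alpha) = exp(kappa):

# Prior draws of the total Dirichlet concentration sum(alpha) = exp(kappa)
kappa_wide   <- rnorm(1e5, 0, 5)
kappa_narrow <- rnorm(1e5, 0, 1)

# Central 90% prior intervals for the concentration under each prior
quantile(exp(kappa_wide),   c(0.05, 0.5, 0.95))
quantile(exp(kappa_narrow), c(0.05, 0.5, 0.95))

Under normal(0, 5) the 95th prior percentile of exp(kappa) is around exp(8.2) ≈ 3700, versus roughly exp(1.6) ≈ 5 under normal(0, 1).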

Here is some quick code I wrote to visualise bivariate marginal distributions from the prior, to let me better understand it (you could work this out analytically, but I write code faster than I do math, so I chose simulation :-)).

library(dplyr)
library(ggplot2)

N <- 1e5  # number of prior draws
K <- 10   # number of categories

res <- array(NA_real_, dim = c(K, N))
kappa <- rnorm(N, 0, 5)
phi <- MCMCpack::rdirichlet(N, rep(1, K))  # uniform on the simplex
for (i in 1:N) {
  res[, i] <- MCMCpack::rdirichlet(1, exp(kappa[i]) * phi[i, ])
}

# rdirichlet can return NAs when some elements of exp(kappa) * phi are tiny
print(paste0(sum(is.na(res)), " NAs"))

df_to_plot <- data.frame(x1 = res[1, ], x2 = res[2, ]) %>% filter(!is.na(x1), !is.na(x2))
ggplot(df_to_plot, aes(x1, x2)) + geom_hex() + scale_fill_continuous(trans = "log10")

So for K = 10 the bivariate implied prior is:

[hex-bin plot of x1 vs x2 from the prior draws]

which is pretty concentrated at “both almost zero” and “single almost 1, other almost zero” (note the log scale on fill).
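
You can also quantify this concentration directly from the res array in the snippet above, e.g.:

# Fraction of prior draws where a single category takes almost all the mass
mean(apply(res, 2, max, na.rm = TRUE) > 0.9)
# Fraction of draws where both plotted components are almost zero
mean(res[1, ] < 0.01 & res[2, ] < 0.01, na.rm = TRUE)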

Best of luck with your model.

Thanks!

This model is part of a more complicated model involving a multivariate ordered probit, and I’m using a hierarchical induced-Dirichlet model on the latent cutpoints. For prior predictive checks of this part of the model, I have been plotting the implied priors for the cumulative probabilities. I have incorporated domain expertise into other aspects of the model, but I don’t want to do that for this part, which uses ordered regression (there isn’t much domain expertise available). In this case, would I want the priors for the cumulative probabilities to be approximately uniform? Or should I not look at the cumulative probabilities for prior predictive checks at all? I guess just because these are uniform it doesn’t mean the prior is less informative, since the priors might be “forcing” them to be more uniform…
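
For reference, this is the kind of simulation sketch I have been using to look at the implied prior on the cumulative probabilities; it reuses the same phi/kappa hyperprior as above with K = 17, and column 8 is just an arbitrary middle category:

N <- 1e4
K <- 17

kappa <- rnorm(N, 0, 5)
phi <- MCMCpack::rdirichlet(N, rep(1, K))  # uniform on the simplex
cum_probs <- matrix(NA_real_, nrow = N, ncol = K - 1)
for (i in 1:N) {
  theta <- MCMCpack::rdirichlet(1, exp(kappa[i]) * phi[i, ])
  # cumulative probabilities P(Y <= k) for k = 1, ..., K - 1
  cum_probs[i, ] <- cumsum(theta)[1:(K - 1)]
}

# Implied prior on one middle cumulative probability (NAs dropped by hist)
hist(cum_probs[, 8], breaks = 50)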

I have 17 categories (K = 17) and have plotted:

[plots of the implied category probabilities under normal(0, 5) and normal(0, 1) priors on kappa]

It looks like N(0, 5) would more easily allow 2 out of the 17 categories to have most of the weight, compared to N(0, 1)?

I actually get better mixing and fewer divergent transitions (none) with normal(0, 5) vs normal(0, 1) on kappa. I guess N(0, 1) conflicts with the likelihood too much, as the posterior median for kappa is 4.81 under this prior.
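
A quick check of how surprising that posterior median is under each prior:

# Prior probability that kappa >= 4.81 under each choice
1 - pnorm(4.81, mean = 0, sd = 1)  # normal(0, 1): ~7.5e-07
1 - pnorm(4.81, mean = 0, sd = 5)  # normal(0, 5): ~0.17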

Posterior draw summaries:

Normal(0, 5):

[posterior summary table]

Normal(0, 1):

[posterior summary table]

Experiment beats my speculation :-). Yes, it indeed looks like normal(0, 1) would be a bad choice and normal(0, 5) is better.

Best of luck with your model!
