Recovering parameters from fake Dirichlet data

I’ve made some fake data and I’m trying to recover the parameters with a Stan model as a training exercise. I’m not getting what I want at all and I’d really appreciate if someone could help me out.

The fake dataset as shown in the code below is just two Dirichlet mixtures with \alpha's (1,2,10) and (1,1,1) combined in equal parts to create a matrix with 3 columns and N=250 rows.

require(rstan)
require(MCMCpack)
S=runif(250)
S=rbind(rdirichlet(NROW(S),c(1,2,10))[S>.5,], rdirichlet(NROW(S),c(1,1,1))[S<=.5,])
d=list(N=NROW(S),y=S)
f=stan("dirilecht-test.stan",data=d)

Here’s my Stan model, which I hoped would recover the original parameter values from the fake data.


data {
  int N;
  simplex[3] y[N];
}

parameters {
  simplex[3] theta1;
  simplex[3] theta2;
  real<lower=0,upper=1> lambda;
}

model {
  lambda ~ uniform(0,1);
  for (n in 1:N) {
    target += log_sum_exp(log(lambda) + dirichlet_lpdf(y[n]|theta1),
                          log(1-lambda) +dirichlet_lpdf(y[n]|theta2));
  }
}

The model fails to converge, and for some reason theta1 and theta2 are not estimated at all.

          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
lambda    0.09    0.00 0.01    0.07    0.08    0.09    0.09    0.11  1396    1
lp__   -346.83    0.02 0.74 -349.08 -346.97 -346.55 -346.37 -346.32  1743    1

I’d appreciate if someone could explain how they’d approach this.

I don’t see this issue. When I run your code I get

           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
theta1[1]  0.28    0.00 0.08  0.12  0.25  0.27  0.29  0.51  1098 1.00
theta1[2]  0.32    0.00 0.08  0.14  0.30  0.32  0.33  0.52  1745 1.00
theta1[3]  0.40    0.00 0.09  0.15  0.39  0.41  0.43  0.58   786 1.00
theta2[1]  0.28    0.00 0.08  0.11  0.25  0.27  0.29  0.49  1108 1.00
theta2[2]  0.32    0.00 0.09  0.12  0.30  0.32  0.34  0.57  1520 1.00
theta2[3]  0.40    0.00 0.10  0.11  0.38  0.41  0.43  0.61  1173 1.01
lambda     0.52    0.03 0.41  0.00  0.06  0.59  0.96  1.00   157 1.03
lp__      96.45    0.11 2.34 91.07 95.05 96.77 98.29 99.84   476 1.01

The model cannot recover the parameters because thetas are declared to be simplexes but $\alpha$s are not simplexes. It works (somewhat) if you change the parameter constraints to

vector<lower=0>[3] theta1;
vector<lower=0>[3] theta2;

There’s still a label switching problem because it could be either theta1=(1,2,10),theta2=(1,1,1) or theta1=(1,1,1),theta2=(1,2,10). If you run only one chain it converges to one solution but if you run multiple chains different chains choose different labeling at random and that messes up the diagnostics.

arghh… I see . Thank you. You are right, of course. Is there any canonical approach to dealing with label switching for direchlet data?

Sorry, I don’t know anything that isn’t in the Stan User’s Guide. There’s some ideas there but I think in general label switching is an unsolved problem.

Thank you for the help! Very much appreciated.

For the label-switching, a possible approach could be an ordered simplex: Ordered Simplex

But you should also read Michael Betancourt’s case study on label-switching for more info

1 Like

That’s a cunning idea; thank you. The ordered simplex is useful to know about but as @nhuurre mentioned, the a parameter of the dirichlet distribution isn’t actually a simplex :-) Will read Betancourt’s article on identifiability in mixture models.

For my use case, I could model my problem as a multi-normal mixture with a common covariance, but would that give me more options to tackle identifiability I wonder?