Hierarchical Dirichlet Process: divergent transitions with hyperprior

I am trying to model a Hierarchical Dirichlet Process in Stan. The mathematical model is as follows:

G_j | G_0 , \alpha_0 \sim DP(\alpha_0, G_0) \quad j=1,2
G_0 \sim DP(2, \mathcal N(0, 5))

The data are drawn from two univariate normal distributions with different means.

I use a truncated version of the DP with the same number of components (H) for all three DPs (G_0 and the two G_j).
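
Concretely, the weights for group j (and analogously the top-level weights) come from the usual truncated stick-breaking construction:

w_{j,1} = \nu_{j,1}, \qquad w_{j,h} = \nu_{j,h} \prod_{l=1}^{h-1} (1 - \nu_{j,l}) \quad (h = 2, \dots, H-1), \qquad w_{j,H} = 1 - \sum_{h=1}^{H-1} w_{j,h}

Here is the Stan model: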

data {
  int<lower=0> H; // number of components in the DP
  int<lower=0> J; // number of groups
  int<lower=0> max_num_samples;
  int<lower=0> num_samples[J]; // number of samples in each group
  matrix[J, max_num_samples] samples;
}

parameters {
  real<lower=0> alpha_0;
  vector<lower=0, upper=1>[H-1] nus[J];
  vector<lower=0, upper=1>[H-1] nu_top;
  vector[H] means; // cluster means
}

transformed parameters {
  simplex[H] weights[J];
  simplex[H] weights_top;
  real prod1_nu = 0;

  for (j in 1:J) {
    weights[j][1] = nus[j][1];
    prod1_nu = 1 - nus[j][1];
    for (h in 2:(H-1)) {
      weights[j][h] = nus[j][h] * prod1_nu;
      prod1_nu *= (1 - nus[j][h]);
    }
    weights[j][H] = fmax(0.0, 1 - sum(weights[j][1:(H-1)]));
  }

  weights_top[1] = nu_top[1];
  prod1_nu = 1 - nu_top[1];
  for (h in 2:(H-1)) {
    weights_top[h] = nu_top[h] * prod1_nu;
    prod1_nu *= (1 - nu_top[h]);
  }
  weights_top[H] = fmax(0.0, 1 - sum(weights_top[1:(H-1)]));
}

model {

  // hyperparameters of the top-level DP: G_0 ~ DP(gamma, N(0, sigmaH))
  real sigmaH = 5;
  real gamma = 1.0;

  // Top level DP
  for (h in 1:H) {
    means[h] ~ normal(0, sigmaH);
  }

  nu_top ~ beta(1, gamma);

  alpha_0 ~ gamma(3,3);

  // Bottom level DPs
  for (j in 1:J) {
    nus[j] ~ beta(1, alpha_0);
  }

  // likelihood: marginalize over the cluster assignment of each observation
  for (j in 1:J) {
    for (i in 1:num_samples[j]) {
      real partial_sums[H];
      for (h in 1:H) {
        partial_sums[h] = log(weights[j][h]) + normal_lpdf(samples[j, i] | means[h], 1.0);
      }
      target += log_sum_exp(partial_sums);
    }
  }
}

This model, together with the following R code,

library(rstan)

mu = 0
sigma = 1
J = 2
num_samples = 100

samples = matrix(nrow=J, ncol=num_samples)

# group j is drawn from N(mu + 2*j, sigma), i.e. means 2 and 4
for (j in 1:J) {
  samples[j, ] = rnorm(num_samples, mu + 2*j, sigma)
}

dat = list(
  H=10,
  J=2,
  max_num_samples=num_samples,
  num_samples=c(num_samples,num_samples),
  samples=samples)

fit = stan(file="hdp.stan", data=dat)

produces divergent transitions: if I try to plot the trace of the means, I get four flat lines.

If instead the hyperparameter \alpha_0 is fixed to 1, the model recovers the data (apart from a bit of label switching).
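
(By fixing \alpha_0 I mean treating it as known rather than sampling it; in Stan that amounts to declaring it as data and dropping its gamma(3,3) prior, along the lines of this fragment:)

data {
  // ... same declarations as before ...
  real<lower=0> alpha_0; // fixed value supplied from R, e.g. alpha_0 = 1
}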

Has anyone encountered the same problem?

Don't know if that's the problem, but your stick-breaking construction is incorrect for an HDP. As described in Teh et al. (2006), the priors on the nus for the group-level DPs depend on both alpha_0 and nu_top (through the top-level weights), not on alpha_0 alone. Intuitively this makes sense: a concentrated instance of G_0 should result in equally concentrated children.
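
Concretely, writing \beta for the top-level weights (weights_top in your code), the group-level stick proportions should get the prior

\nu_{j,h} \sim \mathrm{Beta}\left( \alpha_0 \beta_h, \; \alpha_0 \left( 1 - \sum_{l=1}^{h} \beta_l \right) \right)

rather than Beta(1, \alpha_0). A rough, untested sketch of what that could look like in your model block (the leftover top-level mass can underflow to zero near the truncation level, hence the floor):

  // replaces the current nus[j] ~ beta(1, alpha_0); loop
  for (j in 1:J) {
    for (h in 1:(H-1)) {
      // mass that G_0 assigns beyond component h, floored so the Beta parameter stays positive
      real remaining = fmax(1e-8, 1 - sum(weights_top[1:h]));
      nus[j][h] ~ beta(alpha_0 * weights_top[h], alpha_0 * remaining);
    }
  }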