Parameterising a latent simplex

Hi everyone,

I’m working on a Jolly-Seber (mark-recapture) model, specifically the superpopulation approach, where the entry probabilities are modeled as a simplex. We have individuals i = 1, \dots, I that enter the study during one of j = 1, \dots, J surveys, so that the entry occasion b_i \sim \mathrm{Categorical} (\boldsymbol{\gamma}), where \sum_{j=1}^J \gamma_j = 1. A natural uninformative prior distribution is \gamma \sim \textrm{Dirichlet}(\boldsymbol{1}), which works fine.

I wish to extend this to account for unequal survey intervals \tau_j, where we expect a higher entry probability for surveys preceded by a longer interval since the previous survey. One way to do this would be via Dirichlet regression, e.g. in Stan:

parameters {
  simplex[J] gamma;
  real alpha, beta;
}
transformed parameters {
  vector[J] mu;
  mu[1] = 0;  // pin first entry probability for identifiability
  mu[2:J] = alpha + beta * tau;
}
model {
  gamma ~ dirichlet(softmax(mu));
}

However, this has some divergent transitions, I think for the same reason that the centered parameterisation of Gaussian random effects gets them. Also, I’d prefer to think of the \tau's as an offset, where we have some baseline rate that’s offset by the survey length, and therefore I’d rather not estimate \beta. If we think of generating random Dirichlet variables by simulating \textrm{Gamma}(\alpha, 1) variates and normalising them, I want to find a good way to model the entry “rates” and to interpret them as the expected number of entries per survey.
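
To make that construction explicit: if z_j \sim \textrm{Gamma}(\alpha_j, 1) independently for j = 1, \dots, J, then

\boldsymbol{\gamma} = \boldsymbol{z} \big/ \textstyle\sum_{j=1}^J z_j \sim \textrm{Dirichlet}(\boldsymbol{\alpha}),

so the unnormalised z_j, with \mathrm{E}[z_j] = \alpha_j, play exactly the role of per-survey entry rates.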

The following two models (I’ll call them centered and non-centered) give the same loo output and parameter estimates:

parameters {
  simplex[J] gamma;
}
model {
  gamma ~ dirichlet(ones_vector(J));  // centered
}

And “non-centered”:

parameters {
  vector<lower=0>[J] z;
}
transformed parameters {
  vector[J] gamma = z / sum(z);
}
model {
  z ~ gamma(1, 1);  // non-centered: normalising Gamma(1, 1) = Exponential(1) variates gives Dirichlet(1)
}

What would be a good way to incorporate the survey intervals here? I’ve tried a few things, including just adding log(tau), with and without coefficients, but I want a principled approach. Note that the first entry probability doesn’t have a survey interval, as it’s the first occasion, so I wish to estimate it separately from the remaining surveys 2:J.

Thanks,

Matt

This is only going to give you values in the (0, 1) range for the parameters. Was that intentional? You can reparameterize the Dirichlet to take in a mean simplex and a total concentration, i.e.,

\textrm{DirichletMean}(\theta, \kappa) = \textrm{Dirichlet}(\theta \cdot \kappa),

for \theta \in \Delta^{J-1} and \kappa > 0.
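
For example, in Stan that parameterization might look like this (theta and kappa are illustrative names, not from your program):

data {
  int<lower=1> J;
}
parameters {
  simplex[J] gamma;
  simplex[J] theta;     // Dirichlet mean
  real<lower=0> kappa;  // total concentration
}
model {
  gamma ~ dirichlet(theta * kappa);  // mean theta, total concentration kappa
}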

Your program didn’t define a tau, so I’m not sure where that’s coming from.

Our current simplex parameterization doesn’t work so well for Dirichlet distributions with entries \alpha_j < 1. It’s an unusual situation in that \alpha_j < 1 means you’re expecting very small probabilities for the corresponding entry of the simplex, which pushes sampling into the corners of parameter space.

Hopefully, for our next release, we’ll be rolling out the isometric log ratio transform that will improve behavior of simplexes in just this case.

You could probably also help things by putting priors on alpha and beta.

If this is not part of a larger model, you can move the sampling statement down to generated quantities, which should get rid of the divergences that arise from sampling.

generated quantities {
  vector[J] gamma = dirichlet_rng(softmax(mu));
}

P.S.

The standard “reference prior” for a probability is \text{Beta}(0.5, 0.5), but I don’t know what the analogue is for a Dirichlet. That is, uniform isn’t always the “uninformative” prior, because uniformity is scale dependent. For example, a uniform prior on a probability p \in (0, 1) is equivalent to a standard logistic prior on the log odds \log(p / (1 - p)).
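
To spell out that equivalence: if p \sim \textrm{Uniform}(0, 1) and \eta = \log(p / (1 - p)), then p = 1 / (1 + e^{-\eta}), and the change of variables gives

f(\eta) = |dp / d\eta| = e^{-\eta} / (1 + e^{-\eta})^2,

which is the density of the standard logistic distribution.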

It might be good to post the rest of the code, since I can envision some difficulties in implementing the superpopulation approach to the Jolly-Seber model in Stan. Specifically, the data augmentation part of the likelihood, where you have to sum over the individuals never encountered, seems difficult to implement. It could be that the divergent transitions are arising from that part of the likelihood.

However, one approach you might consider is to implement survey length directly by borrowing from the survival modeling work and parameterizing the probability of entry using something like the CDF of the exponential distribution (or other possible distributions).
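
As a rough sketch of what that could look like, assuming a constant entry hazard (lambda and the separate first-occasion probability p1 below are illustrative, not something from your model):

data {
  int<lower=2> J;
  vector<lower=0>[J - 1] tau;  // intervals between consecutive surveys
}
transformed data {
  vector[J - 1] t = cumulative_sum(tau);  // elapsed time at survey j > 1
}
parameters {
  real<lower=0> lambda;        // entry hazard, assumed constant
  real<lower=0, upper=1> p1;   // entry probability at the first occasion
}
transformed parameters {
  simplex[J] gamma;
  {
    vector[J - 1] F = 1 - exp(-lambda * t);  // exponential CDF at survey times
    vector[J - 1] inc;                       // CDF increment per interval
    inc[1] = F[1];
    if (J > 2) inc[2:J - 1] = F[2:J - 1] - F[1:J - 2];
    gamma = append_row(p1, (1 - p1) * inc / F[J - 1]);  // split remaining mass
  }
}
model {
  lambda ~ exponential(1);  // illustrative prior
}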

You might also consider using a Poisson distribution to describe the probability of the first encounter for each individual.

Hey Bob,

Thanks for getting back so quickly. First, tau are the survey intervals and are defined in the data. Note that in mark-recapture models tau is often expressed in some arbitrary unit, for instance years if people are interested in annual survival, even if the surveys happened every month:

data {
  int<lower=0> I, J;  // number of observed individuals and surveys
  vector<lower=0>[J - 1] tau;  // survey intervals
}

This is only going to give you values in the (0, 1) range for the parameters. Was that intentional?

It wasn’t, thanks for the catch. I tried the following, but it took about twice as long as my “non-centered” version and had 64% divergent transitions compared to 0%:

parameters {
  simplex[J] gamma;
  real<lower=0> kappa;
  real beta;
}
transformed parameters {
  vector[J] mu;
  mu[1] = 0;
  mu[2:J] = beta * log(tau);
}
model {
  gamma ~ dirichlet(softmax(mu) * kappa);
  kappa ~ exponential(0.01);
  beta ~ normal(0, 10);
}

Therefore, I’d like to try to proceed with my original approach of expressing mu as the log “expected number of entries” and then normalising them.

It might be good to post the rest of the code, since I can envision some difficulties in implementing the superpopulation approach to the Jolly-Seber model in Stan. Specifically, the data augmentation part of the likelihood, where you have to sum over the individuals never encountered, seems difficult to implement. It could be that the divergent transitions are arising from that part of the likelihood.

My new parameterisation marginalises out the entry occasions for all observed and augmented individuals and works just fine with a Dirichlet prior for \boldsymbol{\gamma}. It’s just incorporating the survey intervals (ideally without a coefficient, as I’d like it to function as an offset) that I’m struggling with.
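
For reference, the marginalisation is along these lines, where lp_given_entry[i] is a placeholder for the J-vector of log-likelihoods of individual i’s capture history conditional on entering at survey j (computed elsewhere in the model):

model {
  // marginalise the latent entry occasion b_i over the J surveys
  for (i in 1:I)
    target += log_sum_exp(log(gamma) + lp_given_entry[i]);
}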

Maybe it’s useful if I post my stab at it. To reiterate, I’ve got a simplex gamma which I wish to model as normalised rates (like the Dirichlet) as a function of the time interval between surveys tau, where the first survey doesn’t have a time interval:

data {
  int<lower=2> J;
  vector<lower=0>[J - 1] tau;
}

In the transformed parameters block I’m creating a J-length vector mu with the first element pinned to 0; I think this is common in Dirichlet regression, and in some loo comparisons, estimating an extra parameter instead gave very similar results, so I’ll proceed with pinning. Without accounting for time intervals, we might just specify J - 1 gamma or exponential variates and normalise them. If we add time intervals, which could be in some arbitrary unit, we should probably estimate the rate of the exponential variables, so we have:

parameters {
  vector<lower=0>[J - 1] z;
  real<lower=0> theta;
}
transformed parameters {
  vector[J] mu, gamma;
  mu[1] = 0;
  mu[2:J] = log(z) + log(tau);  // survey intervals as offset
  gamma = softmax(mu);
}
model {
  z ~ exponential(theta);  // centered
  theta ~ exponential(1);
}

Let’s call this the centered parameterisation. ChatGPT tells me we can non-center like this, which seems to give the same results but much better sampling:

transformed parameters {
  vector[J] mu, gamma;
  mu[1] = 0;
  mu[2:J] = log(z) - log(theta) + log(tau);  // z / theta ~ Exponential(theta)
  gamma = softmax(mu);
}
model {
  z ~ exponential(1);  // non-centered
  theta ~ exponential(1);
}
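
I believe the equivalence follows from the scaling property of the exponential distribution: if z \sim \textrm{Exponential}(1), then z / \theta \sim \textrm{Exponential}(\theta), so \log(z) - \log(\theta) is distributed as the log of an \textrm{Exponential}(\theta) variate, matching the centered version.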

So, that’s where I’ve gotten. Does this make sense?

The Dirichlet is just generating proportions, so you lose the expected number of entries. On the other hand, if you do a Dirichlet-multinomial with count N and Dirichlet parameter alpha, then N * alpha / sum(alpha) gives you the expected counts.

That’s the standard way to randomly generate Dirichlet variates, so you can use it for non-centering.

We’re about to roll out a new simplex parameterization that should be more robust.

Hey Bob,

The Dirichlet is just generating proportions, so you lose the expected number of entries. On the other hand, if you do a Dirichlet-multinomial with count N and Dirichlet parameter alpha, then N * alpha / sum(alpha) gives you the expected counts.

Re: generating proportions, yes, but I think I’d like my priors to be exponential or gamma instead of normal to keep it closer to the Dirichlet distribution. I don’t think I can do Dirichlet-multinomial because N is a generated quantity, not a parameter.

That’s the standard way to randomly generate Dirichlet variates, so you can use it for non-centering.

So how would you go about incorporating the time intervals, while recognising that the first entry doesn’t have a time interval and needs to be accounted for separately?

I’d try the first approach you had of just making it an additive model. You can basically turn anything that looks like a constant into a regression. That is, if y[n] ~ gamma(a[n], 1), then y / sum(y) ~ Dirichlet(a).

In this case, just replace the a[n] with the result of a regression. Say you have a matrix of predictors x of size N x K (i.e., K covariates per entry). Then you can set up a regression:

data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;  // predictors: K covariates per entry
}
parameters {
  vector[K] beta;
  vector<lower=0>[N] phi;
}
transformed parameters {
  simplex[N] theta = phi / sum(phi);
}
model {
  vector[N] a = exp(x * beta);
  phi ~ gamma(a, 1);
  // implies theta ~ Dirichlet(exp(x * beta))
}

Thanks Bob, that makes perfect sense and I implemented it. However, sampling is slow, with some divergences and E-BFMI warnings. I think it’s because some of the estimated probabilities are super low, and it essentially shows the same issues as parameterising the Dirichlet directly. Is there a way to re-parameterise the gamma distribution to make sampling easier?

You should try simulation to figure out the convergence issues (just have ChatGPT write one from the Stan code). Tightening priors will likely help with convergence, and that is a good way to test.

Hey, yeah, I’ve already done lots of SBC and parameters are recovered under multiple parameterisations; the poor mixing seems to happen with very low entry probabilities for some of the surveys.

Prior SBC can miss this kind of issue. Check out Posterior SBC: Simulation-Based Calibration Checking Conditional on Data, arXiv preprint arXiv:2502.03279 (it’s not going to fix the problem, but at least you’ll know better how big a problem you have).
