Fitting a Beta mixture model

Hello,

I am trying to fit a Beta mixture model as shown below. I reparameterized the beta distribution in terms of \phi and \lambda and ordered both \phi and \lambda to avoid label switching. In the diagnostics (beta_mixture_log.txt (1.2 KB)), several parameters have R-hat values greater than 1.01, and the log recommends a more effective parameterization. I would appreciate any suggestions. Thank you!

Input data
beta_mixture_data.txt (9.9 KB)

data {
    int<lower=1> N;
    int<lower=1> K;
    array[N] real<lower=0, upper=1> y;
}

parameters {
    simplex[K] theta;
    ordered[K] phi_raw;
    ordered[K] lambda_raw;
}

transformed parameters {
    array[K] real<lower=50, upper=100> lambda;
    array[K] real<lower=1e-3, upper=1 - 1e-3> phi;
    array[K] real<lower=0> alpha;
    array[K] real<lower=0> beta;

    for (k in 1:K) {
        phi[k] = 1e-3 + (1 - 2e-3) * inv_logit(phi_raw[k]);  // mean in (1e-3, 1 - 1e-3)
        lambda[k] = 50 + 50 * inv_logit(lambda_raw[k]);       // concentration in (50, 100)
        alpha[k] = lambda[k] * phi[k];
        beta[k]  = lambda[k] * (1 - phi[k]);
    }
}

model {
    vector[K] log_theta = log(theta);

    //Priors
    theta ~ dirichlet(rep_vector(2.0, K));
    phi_raw ~ normal(0, 2);
    lambda_raw ~ normal(0, 1);

    for (n in 1:N) {
        vector[K] lps = log_theta;
        for (k in 1:K) {
            lps[k] += beta_lpdf(y[n] | alpha[k], beta[k]);
        }
        target += log_sum_exp(lps);
    }
}

generated quantities {
    array[N] int<lower=1, upper=K> component_assignment;
    array[N, K] real log_prob_component;

    for (n in 1:N) {
        vector[K] lps = log(theta);
        for (k in 1:K) {
            lps[k] += beta_lpdf(y[n] | alpha[k], beta[k]);
        }

        vector[K] probs = softmax(lps);

        for (k in 1:K) {
            log_prob_component[n, k] = lps[k] - log_sum_exp(lps);
        }

        component_assignment[n] = categorical_rng(probs);
    }
}

Are you running into problems with standard <lower=0, upper=1> declarations? That is, does phi_raw somehow devolve to a really small or really large number prone to overflow/underflow of inv_logit?

You don’t need to do the beta reparameterization manually and reconstruct alpha and beta. You can just write

lps[k] += beta_proportion_lpdf(y[n] | phi[k], lambda[k]);
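For example (just a sketch, keeping your phi and lambda from transformed parameters and dropping alpha and beta entirely):

for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
        // beta density parameterized by mean (phi) and concentration (lambda)
        lps[k] += beta_proportion_lpdf(y[n] | phi[k], lambda[k]);
    }
    target += log_sum_exp(lps);
}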

There’s a lot you can do to make things more efficient, e.g., replacing the loop over K in generated quantities with a single vectorized statement that does it collectively with only one call to log_sum_exp:

log_prob_component[n] = lps - log_sum_exp(lps);

(For this assignment to type-check, log_prob_component needs to be declared as array[N] vector[K] rather than array[N, K] real.)

The biggest source of inefficiency in Stan programs is redundant calculations.
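For instance, log(theta) is recomputed for every n in your generated quantities block. Here is a rough sketch of a leaner version (untested; it assumes the beta_proportion form above and declares log_prob_component as array[N] vector[K] so the vectorized assignment type-checks):

generated quantities {
    array[N] int<lower=1, upper=K> component_assignment;
    array[N] vector[K] log_prob_component;

    {
        vector[K] log_theta = log(theta);  // compute once, not once per observation
        for (n in 1:N) {
            vector[K] lps = log_theta;
            for (k in 1:K) {
                lps[k] += beta_proportion_lpdf(y[n] | phi[k], lambda[k]);
            }
            log_prob_component[n] = lps - log_sum_exp(lps);
            // categorical_logit_rng normalizes the unnormalized log probabilities itself
            component_assignment[n] = categorical_logit_rng(lps);
        }
    }
}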

With your categorical_rng(probs), what is your goal? If you just want to calculate means, you can use probs to calculate the expectation of component_assignment.
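For example (sketch only; expected_assignment is a name I made up), the posterior mean of the component label for each observation can be computed directly from probs rather than from a discrete draw:

// declared at the top of generated quantities
array[N] real expected_assignment;

// inside the loop over n, after computing probs = softmax(lps)
expected_assignment[n] = dot_product(linspaced_vector(K, 1, K), probs);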

I didn’t really understand what was going on with the coefficients and ranges like 50 to 100. We generally discourage this kind of hard boundary, as you typically get probability mass bunching up at the boundaries, which then biases the result relative to just having an unconstrained lambda.
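For example, something along these lines (the lognormal prior here is just an illustration of a soft constraint, not a specific recommendation for your data):

parameters {
    // other parameters unchanged
    vector<lower=0>[K] lambda;  // positive, with no hard upper bound
}
model {
    // other priors and the likelihood unchanged
    lambda ~ lognormal(log(75), 0.5);  // gentle preference for values near 50-100
}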

To avoid label switching, you only need to order one of the two (and I would declare the mean as an ordered vector). Currently, you are forcing the model to pair the lowest lambda with the lowest phi, the highest with the highest, and so on, which is not necessarily the case, right? As long as the mean vector (phi) is ordered, it should be possible to identify and estimate the concentration (lambda) associated with each mean.
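Concretely, something like this (a sketch; phi_logit is a name I made up) keeps the ordering only on the means:

parameters {
    simplex[K] theta;
    ordered[K] phi_logit;        // ordering the means on the logit scale identifies the labels
    vector<lower=0>[K] lambda;   // concentrations left unordered
}
transformed parameters {
    vector<lower=0, upper=1>[K] phi = inv_logit(phi_logit);
}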

Thank you both for your suggestions!

The issue was that label switching was not affecting posterior inference but was affecting the convergence diagnostics. For instance, the R-hat values look good for a single chain. This was previously discussed in this post. In that post, it was suggested that post-hoc relabelling would not be appropriate if the uncertainties of the components overlap. I’d welcome suggestions for any other methods to estimate convergence across all 4 chains!

I removed the ordering on the \lambda as suggested!

I added the hard constraints simply to get convergence (even with one chain). I will remove them and use stronger priors instead.

Unless you can identify the model, standard convergence monitoring of the parameter means with R-hat will not work.

What will work is monitoring convergence of quantities of interest, like posterior predictive quantities. Those typically marginalize out the mixture component. I wrote a bit about this in the User’s Guide chapter on mixtures.
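For example (a sketch; y_rep and z are names I made up), a posterior predictive draw that averages over the mixture components is invariant to relabeling, so its R-hat is meaningful across chains:

generated quantities {
    real y_rep;  // one posterior predictive draw per iteration
    {
        int z = categorical_rng(theta);                  // draw a component
        y_rep = beta_proportion_rng(phi[z], lambda[z]);  // then draw y given that component
    }
}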