Suggestions for speeding up a mixture model

performance

#1

I wrote a two-component mixture model to estimate the true effect sizes from GWAS summary statistics for phenotypes with genetic correlation (Stan code pasted at the end). The model works fine on a small dataset (N ≈ 200, M = 2), taking around 20 minutes to run 100 warmup + 1000 sampling iterations for 4 chains in parallel. However, when I tried a larger dataset (N ≈ 3 × 10^5, M = 2), it has taken around 5 days to get 500 samples and hasn’t converged yet (most Rhat values are around 1.1 and Neff around 10; sampler parameters from stansummary are also pasted below).

name,Mean,MCSE,StdDev,5%,50%,95%,N_Eff,N_Eff/s,R_hat
"lp__",-4.01751e+07,1.96137e+07,9.6404e+07,-2.89641e+08,-1.43121e+06,756479,24.1586,inf,1.01487
"accept_stat__",0.999343,0.000218408,0.00346515,0.998102,0.999955,1,251.712,inf,0.99805
"stepsize__",3.48344e-05,2.06904e-06,4.04532e-06,3.21151e-05,3.2273e-05,4.23705e-05,3.82268,inf,1.57626
"treedepth__",9.52953,0.231384,1.07193,7,10,10,21.4616,inf,1.01066
"n_leapfrog__",874.465,71.2052,320.677,127,1023,1023,20.2821,inf,1.01249
"divergent__",0,0,0,0,0,0,508,inf,-nan
"energy__",4.32656e+07,2.07136e+07,1.0122e+08,-417360,1.91271e+06,3.02495e+08,23.8792,inf,1.01533

Another thing that doesn’t work on the larger dataset is the generated quantity gamma, the probability of an observation belonging to each mixture component: most of its values are -nan. I would appreciate any suggestions for debugging the model and speeding it up!

data {
    int<lower=0> N;                 // number of loci
    int<lower=1> M;                // number of phenotypes
    matrix[N, M] B;                  // observed effect sizes
    matrix[N, M] SE;                // standard errors
    int<lower=1> K;                 // number of mixture components (K = 2 for this model)
}

transformed data{
    vector[M] zeros;
    vector[K] ones; 
    vector[M] SE_vec[N];
    
    zeros = rep_vector(0, M);
    ones = rep_vector(1, K);
    for (n in 1:N) {
        SE_vec[n] = to_vector(SE[n]);
    }
}

parameters {
    simplex[K] pi;                              // mixing proportions
    cholesky_factor_corr[M] L_e;     // cholesky factor of correlation matrix of errors
    cholesky_factor_corr[M] L_b;     // cholesky factor of correlation matrix of bias
    cholesky_factor_corr[M] L_g;     // cholesky factor of correlation matrix of effect
    vector[M] beta_vec[N];               // true effect size
    vector<lower=0>[M] tau_0;        // standard deviation of each phenotype (null component)
    vector<lower=0>[M] tau_1;        // standard deviation of each phenotype (nonnull component)
}

transformed parameters{
    matrix[M, M] Sigmas[K];
    
    Sigmas[1] = diag_pre_multiply(tau_0, L_b);
    Sigmas[2] = diag_pre_multiply(tau_1, L_g);
}

model {
    real betaloglik[N, K];
    
    tau_0 ~ double_exponential(0, 1e-3); 
    tau_1 ~ cauchy(0, 2.5);                      
    L_e ~ lkj_corr_cholesky(2.0);
    L_b ~ lkj_corr_cholesky(2.0);
    L_g ~ lkj_corr_cholesky(2.0);
    pi ~ dirichlet(ones);
    
    for (n in 1:N) {
        for (k in 1:K) {
            betaloglik[n, k] = log(pi[k]) + multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[k]); 
        }
        target += log_sum_exp(betaloglik[n]);
    }
}

generated quantities {
    matrix[M,M] cor_e;
    matrix[M,M] cor_b;
    matrix[M,M] cor_g;
    matrix[N,K] gamma; // probability of being in each mixture component
    
    cor_e = multiply_lower_tri_self_transpose(L_e);
    cor_b = multiply_lower_tri_self_transpose(L_b);
    cor_g = multiply_lower_tri_self_transpose(L_g);
    
    for (n in 1:N) {
        row_vector[K] gamma_unnormalized;
        for (k in 1:K) {
            gamma_unnormalized[k] = pi[k] * exp(multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[k]));
        }
        gamma[n] = gamma_unnormalized/sum(gamma_unnormalized);
    }
}

#2

If I understand correctly, you are trying to run a two-component mixture model over the variances.

I’m concerned by the high accept_stat and small step_size. Did you turn up adapt_delta to avoid divergent transitions? If not, are you getting any warnings about parameters being NaN, or 0 when they should not be?

I’m also a little concerned about the identifiability of all of the L_e, L_b, L_g components. I would look at the pairs plots of these to check for potential issues.

Aside from those specific issues, in my experience the number-one reason these mixture models fail to converge is that the model is not identifiable with respect to the identity of the mixture components (label switching). That causes low n_eff and high Rhat, but not the high accept_stat and small step_size issues.

You may be able to see this by looking at the traceplots and pairs plots of tau_0 and tau_1.

To fix this, you need to find some way to break the symmetry. Typically, this is done by imposing an ordered constraint on one of the parameters, but in your case, I don’t see a natural way to do this.

I would suggest you either specify 0 < pi_0 < 0.5 to break the symmetry in the size of the components, or impose the ordering on the variance of the components.
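The first option is just a constrained scalar declaration in place of the simplex; a minimal sketch (pi0 is a name I’m assuming here for the pinned-down component’s proportion):

```stan
parameters {
    real<lower=0, upper=0.5> pi0;   // proportion of one component, constrained below 1/2
}
```

with 1 - pi0 used as the other component’s weight in the likelihood. The second option looks like this: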

parameters {
    ...
    simplex[M] tau_simplex[K];        // how each component's scale is split across phenotypes
    positive_ordered[K] component_sd; // overall scale of each component, ordered to break the symmetry
}
transformed parameters {
    vector<lower=0>[M] tau_0 = tau_simplex[1] * component_sd[1];
    vector<lower=0>[M] tau_1 = tau_simplex[2] * component_sd[2];
    ...
}

#3

Thank you for the suggestions! I tried putting a constraint on pi0 (the prior knowledge is that the null component has pi0 > 0.5); the updated code is pasted at the end. I didn’t turn up adapt_delta, and I got a few warning messages during the warmup phase about the double exponential prior I placed on tau_0 (null component) and the log_mix function used in the model block (warnings pasted below). The log_mix lambda1 warning also seems to come from the lp of the null component. Was the double exponential prior problematic? My intent with this prior is to constrain tau_0 to very small values, on the order of 1e-3.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: double_exponential_lpdf: Random variable[2] is inf, but must be finite!  (in '/oak/stanford/groups/mrivas/users/szmamie/repos/multilevel-models/notebook/mvpmm_sharpDensity_orderpi2.stan' at line 42)

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: log_mix: lambda1 is -nan, but must not be nan!  (in '/oak/stanford/groups/mrivas/users/szmamie/repos/multilevel-models/notebook/mvpmm_sharpDensity_orderpi2.stan' at line 50)

And here are the pairs plots for the L_e, L_b, L_g components and the tau_0 and tau_1 components for the updated model. The updated model still has low Neff and high Rhat. What specific aspects should I look at to diagnose non-identifiability issues?

data {
    int<lower=0> N;                 // number of loci
    int<lower=1> M;                // number of phenotypes
    matrix[N, M] B;                  // observed effect sizes
    matrix[N, M] SE;                // standard errors
    int<lower=1> K;                 // number of mixture components
}

transformed data{
    vector[M] zeros;
    vector[M] SE_vec[N];

    zeros = rep_vector(0, M);
    for (n in 1:N) {
        SE_vec[n] = to_vector(SE[n]);
    }
}

parameters {
    real<lower=0.5, upper=1> pi0;    // null component mixing proportion
    cholesky_factor_corr[M] L_e;     // cholesky factor of correlation of errors
    cholesky_factor_corr[M] L_b;     // cholesky factor of correlation of bias
    cholesky_factor_corr[M] L_g;     // cholesky factor of correlation of effect
    vector[M] beta_vec[N];               // true effects
    vector<lower=0>[M] tau_0;        
    vector<lower=0>[M] tau_1;        
}

transformed parameters{
    matrix[M, M] Sigmas[K];

    Sigmas[1] = diag_pre_multiply(tau_0, L_b);
    Sigmas[2] = diag_pre_multiply(tau_1, L_g);
}

model {
    tau_0 ~ double_exponential(0, 1e-3);
    tau_1 ~ cauchy(0, 2.5);
    L_e ~ lkj_corr_cholesky(2.0);
    L_b ~ lkj_corr_cholesky(2.0);
    L_g ~ lkj_corr_cholesky(2.0);
    pi0 ~ beta(1, 1);

    for (n in 1:N) {
        target += log_mix(pi0, multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[1]), 
                                           multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[2]));
    }
}

generated quantities {
    matrix[M,M] cor_e;
    matrix[M,M] cor_b;
    matrix[M,M] cor_g;
    matrix[N,K] gamma;

    cor_e = multiply_lower_tri_self_transpose(L_e);
    cor_b = multiply_lower_tri_self_transpose(L_b);
    cor_g = multiply_lower_tri_self_transpose(L_g);

    for (n in 1:N) {
        row_vector[K] gamma_unnormalized;

        gamma_unnormalized[1] = pi0 * exp(multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[1]));

        gamma_unnormalized[2] = (1 - pi0) * exp(multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[2]));

        gamma[n] = gamma_unnormalized/sum(gamma_unnormalized);
    }
}

#4

Quite likely. I would look at the traceplot of tau_0. You could also parameterize it as tau_0_raw, make tau_0 = tau_0_raw * 1e-3, and put the prior on tau_0_raw ~ double_exponential(0, 1). But I don’t see why you are using the double exponential when tau_0 is constrained to be greater than 0. Why not just use the exponential distribution?
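A minimal sketch of that reparameterization, assuming M phenotypes as in your model (the <lower=0> constraint already folds the double exponential into a half-Laplace):

```stan
parameters {
    vector<lower=0>[M] tau_0_raw;                // unit-scale parameter the sampler works on
}
transformed parameters {
    vector<lower=0>[M] tau_0 = 1e-3 * tau_0_raw; // rescale to the intended magnitude
}
model {
    tau_0_raw ~ double_exponential(0, 1);        // implies tau_0 ~ double_exponential(0, 1e-3)
}
```

This keeps the sampler working on a well-conditioned unit scale instead of forcing it to adapt tiny step sizes.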

Also, I see you are at Stanford, some colleagues and I organize a Stan User group there. I’ll send you the details.


#5

Just tried adding the ordering constraint on tau_0 and tau_1 (Stan code below), and it helps Neff. But it still seems hard to learn L_g, L_b, and tau_0. Here is the trace plot.

Another problem is that when I ran the updated model on the larger dataset, the posterior means of tau_0 and tau_1 are on the order of 2 and 5 (instead of 1e-3 and 1e-1 on a downsampled version of the original dataset, which also matches the prior knowledge), and the generated quantity gamma (the probability of being in each mixture component) is mostly -nan for the larger dataset. Does that mean underflow is occurring somewhere? I tried giving tau_1 a prior with a smaller scale, like cauchy(0, 0.2), but it only slightly decreases the posterior means of tau_0 and tau_1. Is there any way to shrink the estimates of these two parameters?

data {
    int<lower=0> N;                 // number of loci
    int<lower=1> M;                 // number of phenotypes
    matrix[N, M] B;                 // observed effect sizes
    matrix[N, M] SE;                // standard errors
    int<lower=1> K;                 // number of mixture components
}

transformed data{
    vector[M] zeros;
    vector[K] ones;
    vector[M] SE_vec[N];

    zeros = rep_vector(0, M);
    ones = rep_vector(1, K);
    for (n in 1:N) {
        SE_vec[n] = to_vector(SE[n]);
    }
}

parameters {
    real<lower=0.5, upper=1> pi;      // null component mixing proportion
    cholesky_factor_corr[M] L_e;     // cholesky factor of correlation of errors
    cholesky_factor_corr[M] L_b;     // cholesky factor of correlation of bias
    cholesky_factor_corr[M] L_g;     // cholesky factor of correlation of effect
    vector[M] beta_vec[N];               // true effects
    positive_ordered[K] taus[M];      // tau_0 < tau_1
}

transformed parameters{
    matrix[M, M] Sigmas[K];
    simplex[K] pi_vec;
    vector<lower=0>[M] tau_0;
    vector<lower=0>[M] tau_1;

    for (m in 1:M) {
        tau_0[m] = taus[m,1];
        tau_1[m] = taus[m,2];
    }
    pi_vec[1] = pi;
    pi_vec[2] = 1 - pi;
    Sigmas[1] = diag_pre_multiply(tau_0, L_b);
    Sigmas[2] = diag_pre_multiply(tau_1, L_g);
}

model {
    real betaloglik[K];

    for (m in 1:M) {
        taus[m,1] ~ exponential(1e3);
        taus[m,2] ~ cauchy(0, 2.5);
    }
    L_e ~ lkj_corr_cholesky(2.0);
    L_b ~ lkj_corr_cholesky(2.0);
    L_g ~ lkj_corr_cholesky(2.0);
    pi ~ beta(1, 1);

    for (n in 1:N) {
        for (k in 1:K) {
            betaloglik[k] = log(pi_vec[k]) + multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[k]);
        }
        target += log_sum_exp(betaloglik);
    }
}

generated quantities {
    matrix[M,M] cor_e;
    matrix[M,M] cor_b;
    matrix[M,M] cor_g;
    matrix[N,K] gamma;

    cor_e = multiply_lower_tri_self_transpose(L_e);
    cor_b = multiply_lower_tri_self_transpose(L_b);
    cor_g = multiply_lower_tri_self_transpose(L_g);

    for (n in 1:N) {
        row_vector[K] gamma_unnormalized;
        for (k in 1:K) {
            gamma_unnormalized[k] = pi_vec[k] * exp(multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e)) + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[k]));
        }
        gamma[n] = gamma_unnormalized/sum(gamma_unnormalized);
    }
}

#6

You wrote taus[m,1] ~ exponential(1e3); rather than taus[m,1] ~ exponential(1e-3);, which I suspect is why your tau values are so large. Try the tau_raw trick I mentioned earlier; it should help with stability.
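On the -nan gamma values from earlier: they are consistent with exp() underflowing to 0 for both components, giving 0/0 in the normalization. One sketch of a generated quantities block that normalizes on the log scale with softmax instead of exponentiating raw log densities (using the same variable names as your latest model):

```stan
generated quantities {
    matrix[N, K] gamma;
    for (n in 1:N) {
        vector[K] log_gamma;
        for (k in 1:K) {
            log_gamma[k] = log(pi_vec[k])
                + multi_normal_cholesky_lpdf(B[n] | beta_vec[n], diag_pre_multiply(SE_vec[n], L_e))
                + multi_normal_cholesky_lpdf(beta_vec[n] | zeros, Sigmas[k]);
        }
        gamma[n] = softmax(log_gamma)';  // stable normalization; avoids underflow to 0/0
    }
}
```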