Attempt at vectorizing a mixture results in a slight *reduction* in speed

I need to fit a mixture model to a (quite large) data set. I know Stan can be slower at mixtures than Gibbs-type samplers, and I also know that it is not currently possible to vectorize mixture likelihoods within Stan.

So I decided to take matters into my own hands and write my own (somewhat) vectorized code. To my surprise, my code is actually somewhat slower than the traditional “loop over n and K” code.

I’ll preface this by saying I’m a statistician, not a computer scientist. So I apologize if my logic is poor.

My code is below:

functions {
  
  real mixture_normal_lpdf(
    vector y, vector logp, vector mu, vector sigma
  ) {
    real NEGHALFLOG2PI = -0.918938533204673;
    int n = size(y);
    int K = size(mu);
    matrix[n,K] logComp;
    real logProb = 0;
    vector[n] M = rep_vector(negative_infinity(), n);
    
    // Loop through num. components and compute likelihood of each component
    for ( k in 1:K ) {
      logComp[, k] = logp[k] + NEGHALFLOG2PI 
        - log(sigma[k]) 
        - 0.5 * square((y - mu[k]) ./ sigma[k]);
    }
    
    // Loop through num. obs. and use log-sum-exp trick
    for ( i in 1:n )
      logProb += log_sum_exp(logComp[i, ]);
    
    // Return log likelihood as a scalar
    return logProb;
  }
  
}

data {
  int<lower=1> n;                  // Number of observations
  int<lower=1> K;                  // number of mixture components
  vector[n] y;                     // Outcome variable
  int<lower=0,upper=1> vectorize;  // Whether to use custom (vectorized) likelihood
}
parameters {
  vector[K] mu;
  vector<lower=0>[K] sigma;
  simplex[K] p;
}
model {
  vector[K] logf;
  vector[K] logp = log(p);
  mu    ~ std_normal();
  sigma ~ std_normal();
  
  // Likelihood
  if ( vectorize )
    y ~ mixture_normal(logp, mu, sigma);
  else {
    for ( i in 1:n ) {
      for ( k in 1:K )
        logf[k] = logp[k] + normal_lpdf(y[i] | mu[k], sigma[k]);
      target += log_sum_exp(logf);
    }
  }
}

I would have bet that my code would be at least as fast as the traditional version for the following reasons:

  1. My code loops through n + K iterations, whereas the traditional approach loops through n*K
  2. The traditional code results in n*K function calls to the normal_lpdf function
  3. It should be more efficient to have a single sampling statement

Stan should be way faster than a Gibbs-type sampler that samples the discrete mixture components. Stan's marginalized version is Rao-Blackwellized compared to the Gibbs sampler (see the Rao-Blackwell theorem, which is from stats, not CS). Gibbs will also be faster if you marginalize out the responsibility parameters. See the chapter in the User's Guide on latent discrete parameters for more explanation and some examples.

In both implementations, each component of the n-vector y gets a mixture distribution. Both of them do n * K amount of work; it's just that in your version the work gets vectorized inside the function rather than sitting in an explicit double loop.

    for (i in 1:n) {
      for (k in 1:K)
        logf[k] = logp[k] + normal_lpdf(y[i] | mu[k], sigma[k]);
      target += log_sum_exp(logf);
    }

You may think it’d be faster to expand this on your own, but there's a lot going on that complicates the situation. In your function, you have:

real mixture_normal_lpdf(vector y, vector logp, vector mu, vector sigma) {
    int n = size(y);
    int K = size(mu);
    matrix[n, K] logComp;
    real logProb = 0;
    for (k in 1:K) {
      logComp[, k] = logp[k]
        - log(sigma[k]) 
        - 0.5 * square((y - mu[k]) ./ sigma[k]);
    }
    for (i in 1:n)
      logProb += log_sum_exp(logComp[i, ]);
    return logProb;
}

I removed the unused vector M and the unnecessary additive constant NEGHALFLOG2PI. Adding or subtracting a constant from logComp only shifts the result by a constant, which the sampler doesn't care about, and dropping the vector M in particular will help a bit with speed. That is, for a scalar c and vector v, we have

log_sum_exp(c + v) = c + log_sum_exp(v)

In your case, you just get a constant on the outside in logProb, which doesn’t change anything in the sampler, so it can be dropped.
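In case it helps, the identity is nothing more than the definitions expanded (c a scalar, v a vector with elements v[k]):

log_sum_exp(c + v) = log( sum_k exp(c + v[k]) )
                   = log( exp(c) * sum_k exp(v[k]) )
                   = c + log( sum_k exp(v[k]) )
                   = c + log_sum_exp(v)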

It may be a bit faster to do this

vector[K] log_sigma = log(sigma);

and then use log_sigma[k] in the code.
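Concretely, the component loop would then look something like this (same variable names as in the function above; I haven't timed it):

vector[K] log_sigma = log(sigma);
for (k in 1:K) {
  logComp[, k] = logp[k]
    - log_sigma[k]
    - 0.5 * square((y - mu[k]) ./ sigma[k]);
}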

Counterintuitively, I’m guessing it’ll be faster to do this:

vector[n] log_probs;
for (i in 1:n) {
  log_probs[i] = log_sum_exp(logComp[i]);  // don't need second index here
}
return sum(log_probs);

It feels counterintuitive and does extra allocation, but applying sum() all at once to a vector of log probs is much faster, because the whole sum becomes a single node in the autodiff expression graph rather than n separate accumulations. That said, I'm not sure either of these vectorizations will help much here.

In the end, it’s hard to beat our efficient C++ built-ins. What you have in your function definition has to be autodiffed through, which is less efficient than our built-ins; on the other hand, your version makes fewer calls, so it's a trade-off.
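If you do want to lean on the built-ins as much as possible, one option is to let log_mix handle the mixing weights and the log-sum-exp in one call. This is a sketch, assuming your Stan version provides the vectorized log_mix signature (check the functions reference for your release):

for (i in 1:n) {
  vector[K] lps;
  for (k in 1:K)
    lps[k] = normal_lpdf(y[i] | mu[k], sigma[k]);
  target += log_mix(p, lps);  // p is the simplex of mixing proportions
}

It's the same n*K amount of work as the explicit double loop; it just pushes the bookkeeping into library code.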