Code efficiency wrt the transformed parameters block

I have the code below for two different implementations of a Markov-switching model, one using a multinomial and one a probit specification for the transition probabilities. It turns out that the multinomial implementation takes about twice as long as the probit implementation, and I was wondering why that is. Both models recover parameters drawn from the respective prior distributions and are initialized from the same prior. Most of the additional time is spent in warmup, before even the first warmup draws come in.

multinomial


transformed parameters {
  vector[K] logalpha[T, N];
  vector[K] A[N, T, K];             // A[t][i][j] = p(z_t = j | z_{t-1} = i)
  matrix[Md_var, K] lambda_prime[K];
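  // Unpack the lambda vector into one Md_var x K coefficient matrix per previous
  // state i; the self-transition column (j == i) is fixed to 0 as the reference category.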
  {
    int count = 1;
    for (i in 1:K){
      for (j in 1:K){
        for (q in 1:Md_var){
          if (i == j){
            lambda_prime[i, q, j] = 0;
          } else {
            lambda_prime[i, q, j] = lambda[count]; count += 1;
          }
        }
      }
    }
  }

  for (nn in 1:N){
    for (t in 1:T){
      for (i in 1:K){
        vector[K] unA;
        vector[K] nom;
        real den;
        nom = to_vector(exp(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i]));
        den = log(log_sum_exp(nom));
        unA = exp(log(nom) - den);
        A[nn, t, i] = softmax(unA);
      }
    }
  }
  { // Forward algorithm log p(z_t = j | x_{1:t})
  // logalpha[t, j] = Pr(z_t = j| x_{1:t})
  // logalpha[t-1, i] = Pr(z_t = i| x_{1:t-1})
  // normal_lpdf(y[t] | mu[j] + phi[j] * y[t - 1], sigma):  p(x_t|z_t = j) local evidence at t for j

  for (nn in 1:N){
    for (j in 1:K){
      logalpha[1, nn, j] = log(pi1[nn, j]) + normal_lpdf(y[startstop[nn, 1]] | mu[j], sigma);
    }
    for (t in 2:T){
      for (j in 1:K) {
        real accumulator[K];
        for (i in 1:K) {
          accumulator[i] = logalpha[t - 1, nn, i] +
                            log(A[nn, t, i, j]) +
                            normal_lpdf(y[startstop[nn, 1] + t - 1] | mu[j] +
                                        phi[j] * y[startstop[nn, 1] + t - 2] +
                                        x_sha[startstop[nn, 1] + t - 1] * zeta +
                                        x_var[startstop[nn, 1] + t - 1] * beta[:,j], sigma);
        }
        logalpha[t, nn, j] = log_sum_exp(accumulator);
      }
    }
  }
  }
}

probit


transformed parameters {
  vector[2] logalpha[T, N];
  simplex[2] A[N, T, 2];             // A[t][i][j] = p(z_t = j | z_{t-1} = i)
  for (nn in 1:N){
    for (t in 1:T){
      A[nn, t, 1, 1] = normal_cdf(z_sha[startstop[nn, 1] + t - 1] *   gamma +
                                  z_var[startstop[nn, 1] + t - 1] *  lambda[:,1] , 0, 1);
      A[nn, t, 2, 2] = normal_cdf(z_sha[startstop[nn, 1] + t - 1] *   gamma +
                                  z_var[startstop[nn, 1] + t - 1] *  lambda[:,2], 0, 1);
      A[nn, t, 1, 2] = 1 - A[nn, t, 1, 1];
      A[nn, t, 2, 1] = 1 - A[nn, t, 2, 2];
    }
  }
  { // Forward algorithm log p(z_t = j | x_{1:t})
  // logalpha[t, j] = Pr(z_t = j| x_{1:t})
  // logalpha[t-1, i] = Pr(z_t = i| x_{1:t-1})
  // normal_lpdf(y[t] | mu[j] + phi[j] * y[t - 1], sigma):  p(x_t|z_t = j) local evidence at t for j

  for (nn in 1:N){
    for (j in 1:2){
      logalpha[1, nn, j] = log(pi1[nn][j]) + normal_lpdf(y[startstop[nn, 1]] | mu[j], sigma);
    }
    for (t in 2:T){
      for (j in 1:2) {
        real accumulator[2];
        for (i in 1:2) {
          accumulator[i] = logalpha[t - 1, nn, i] +
                            log(A[nn, t, i, j]) +
                            normal_lpdf(y[startstop[nn, 1] + t - 1] | mu[j] +
                                        y[startstop[nn, 1] + t - 2] * phi[j] +
                                        x_sha[startstop[nn, 1] + t - 1] * zeta +
                                        x_var[startstop[nn, 1] + t - 1] * beta[:,j], sigma);
        }
        logalpha[t, nn, j] = log_sum_exp(accumulator);
      }
    }
  }
  }
}

nom = to_vector(exp(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i]));
den = log(log_sum_exp(nom));
unA = exp(log(nom) - den);
A[nn, t, i] = softmax(unA);

This looks like a suspicious number of exponentiations and logs.

Like den looks something like:

\log \left( \log \left( e^{e^{z_1}} + e^{e^{z_2}} + \dots \right) \right)

Is that right? Then this is eventually gonna go through another softmax?

So the way I want this to work would be, for i = 1:
nom = \begin{bmatrix} e^{0} & e^{z'\lambda_{1,2}} & e^{z'\lambda_{1,3}} \end{bmatrix}
den = \log \left( 1 + e^{z'\lambda_{1,2}} + e^{z'\lambda_{1,3}} \right)
unA = \begin{bmatrix} e^{\log(1) - den} & e^{\log(e^{z'\lambda_{1,2}}) - den} & e^{\log(e^{z'\lambda_{1,3}}) - den} \end{bmatrix}

And then the softmax to turn this into probabilities. But given the sum for the denominator I don’t think that I can just remove the exponentiations/logs in the first and third line.

Why not just softmax the z \lambda_{1, j} vector?
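
Concretely, something like this (just a sketch against your declarations above, not tested): since softmax already exponentiates and normalizes internally, the whole nom/den/unA construction for each i should collapse to a single call on the linear predictor.

  for (nn in 1:N){
    for (t in 1:T){
      for (i in 1:K){
        // softmax(v)[j] = exp(v[j]) / sum_k exp(v[k]), so pass the linear predictor directly
        A[nn, t, i] = softmax(to_vector(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i]));
      }
    }
  }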


So I replaced it with the code below and it seems to work (parameters recovered for one simulation). It cuts the time in half, despite running the models in a separate R session. Thanks!

Just out of curiosity: why is the effect so substantial? I thought the transformed parameters block runs only once, and I'd assumed all these logs are just transformations.

  for (nn in 1:N){
    for (t in 1:T){
      for (i in 1:K){
        A[nn, t, i] = softmax(to_vector(exp(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i])));
      }
    }
  }

Oh okay, cool. The first reason for looking here, I guess, is that all the exps probably make the numerics really finicky if you're not near the solution (e^{e^{z}} already overflows a double once z gets to about 6.6), and hence might make warmup slow.

This bit of code also appeared to be what was different between the probit and non-probit version. If the difference is just in how we do link functions, my understanding is that that choice is pretty arbitrary and mostly driven by ease/convention, so the thought was maybe we could just try something different for fun?


I see. I ended up going through the derivation and both are identical for that matter, so I figure that's also why it works. Are there any other efficiency gains to be had here?
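
For reference, the step that makes it work out (with \lambda_{1,1} = 0 as above):

unA_j = e^{\log(e^{z'\lambda_{1,j}}) - \log \left( \sum_k e^{z'\lambda_{1,k}} \right)} = \frac{e^{z'\lambda_{1,j}}}{\sum_k e^{z'\lambda_{1,k}}} = \text{softmax}(z'\lambda_1)_j

so normalizing by hand and calling softmax on the linear predictor give the same transition probabilities.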

Well that’s a better way to do it lol.

Transformed parameters gets run every leapfrog step, so it's evaluated many times per iteration, not just once.

Maybe :D?

Nah I kid I kid, lemme not be sarcastic and answer your question.

Loops in Stan are fast, so I don’t suspect that.

I think the only practical way you’d make your code faster is if you see ways to reorder things to reduce computation.

Lotta that stuff would be guess and check though. There's a Phi_approx function (or something like that) in the manual to approximate the standard normal CDF, I think. But it's not clear if that's the most costly thing in the probit code.
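
If you want to try it, the sketch below swaps it in for the two normal_cdf calls (untested; Phi_approx is a cheap logistic approximation to the standard normal CDF, so the probit link becomes approximate):

  for (nn in 1:N){
    for (t in 1:T){
      // Phi_approx(x) approximates normal_cdf(x, 0, 1) via an inverse-logit of a cubic
      A[nn, t, 1, 1] = Phi_approx(z_sha[startstop[nn, 1] + t - 1] * gamma +
                                  z_var[startstop[nn, 1] + t - 1] * lambda[:,1]);
      A[nn, t, 2, 2] = Phi_approx(z_sha[startstop[nn, 1] + t - 1] * gamma +
                                  z_var[startstop[nn, 1] + t - 1] * lambda[:,2]);
      A[nn, t, 1, 2] = 1 - A[nn, t, 1, 1];
      A[nn, t, 2, 1] = 1 - A[nn, t, 2, 2];
    }
  }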

I will try that at some point. I am just happy that this implementation takes about 2 minutes per chain rather than the previous six… Good enough for now. Again thanks! My laptop will be thankful as well going forward.

Current:
Chain 2: Elapsed Time: 77.3039 seconds (Warm-up)
Chain 2: 43.4298 seconds (Sampling)
Chain 2: 120.734 seconds (Total)

Previously:
Chain 2: Elapsed Time: 188.119 seconds (Warm-up)
Chain 2: 161.055 seconds (Sampling)
Chain 2: 349.174 seconds (Total)