Code efficiency wrt the transformed parameters block

I have the code below for two different implementations of a Markov-switching model, one using a multinomial and one a probit specification for the transition probabilities. It turns out that the multinomial implementation takes about twice as long as the probit implementation, and I was wondering why that is. Both models recover parameters drawn from the respective prior distributions and are initialized from the same prior. Most of the additional time is spent in warmup, before even the first warmup draws come in.

multinomial


transformed parameters {
  vector[K] logalpha[T, N];
  vector[K] A[N, T, K];             // A[t][i][j] = p(z_t = j | z_{t-1} = i)
  matrix[Md_var, K] lambda_prime[K];
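  // Unpack the lambda vector into one Md_var x K coefficient matrix per previous
  // state i; the self-transition column (j == i) is fixed to 0 as the reference category.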
  {
    int count = 1;
    for (i in 1:K){
      for (j in 1:K){
        for (q in 1:Md_var){
          if (i == j){
            lambda_prime[i, q, j] = 0;
          } else {
            lambda_prime[i, q, j] = lambda[count]; count += 1;
          }
        }
      }
    }
  }

  for (nn in 1:N){
    for (t in 1:T){
      for (i in 1:K){
        vector[K] unA;
        vector[K] nom;
        real den;
        nom = to_vector(exp(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i]));
        den = log(log_sum_exp(nom));
        unA = exp(log(nom) - den);
        A[nn, t, i] = softmax(unA);
      }
    }
  }
  { // Forward algorithm log p(z_t = j | x_{1:t})
  // logalpha[t, j] = Pr(z_t = j| x_{1:t})
  // logalpha[t-1, i] = Pr(z_t = i| x_{1:t-1})
  // normal_lpdf(y[t] | mu[j] + phi[j] * y[t - 1], sigma):  p(x_t|z_t = j) local evidence at t for j

  for (nn in 1:N){
    for (j in 1:K){
      logalpha[1, nn, j] = log(pi1[nn, j]) + normal_lpdf(y[startstop[nn, 1]] | mu[j], sigma);
    }
    for (t in 2:T){
      for (j in 1:K) {
        real accumulator[K];
        for (i in 1:K) {
          accumulator[i] = logalpha[t - 1, nn, i] +
                            log(A[nn, t, i, j]) +
                            normal_lpdf(y[startstop[nn, 1] + t - 1] | mu[j] +
                                        phi[j] * y[startstop[nn, 1] + t - 2] +
                                        x_sha[startstop[nn, 1] + t - 1] * zeta +
                                        x_var[startstop[nn, 1] + t - 1] * beta[:,j], sigma);
        }
        logalpha[t, nn, j] = log_sum_exp(accumulator);
      }
    }
  }
  }
}

probit


transformed parameters {
  vector[2] logalpha[T, N];
  simplex[2] A[N, T, 2];             // A[t][i][j] = p(z_t = j | z_{t-1} = i)
  for (nn in 1:N){
    for (t in 1:T){
      A[nn, t, 1, 1] = normal_cdf(z_sha[startstop[nn, 1] + t - 1] *   gamma +
                                  z_var[startstop[nn, 1] + t - 1] *  lambda[:,1] , 0, 1);
      A[nn, t, 2, 2] = normal_cdf(z_sha[startstop[nn, 1] + t - 1] *   gamma +
                                  z_var[startstop[nn, 1] + t - 1] *  lambda[:,2], 0, 1);
      A[nn, t, 1, 2] = 1 - A[nn, t, 1, 1];
      A[nn, t, 2, 1] = 1 - A[nn, t, 2, 2];
    }
  }
  { // Forward algorithm log p(z_t = j | x_{1:t})
  // logalpha[t, j] = Pr(z_t = j| x_{1:t})
  // logalpha[t-1, i] = Pr(z_t = i| x_{1:t-1})
  // normal_lpdf(y[t] | mu[j] + phi[j] * y[t - 1], sigma):  p(x_t|z_t = j) local evidence at t for j

  for (nn in 1:N){
    for (j in 1:2){
      logalpha[1, nn, j] = log(pi1[nn][j]) + normal_lpdf(y[startstop[nn, 1]] | mu[j], sigma);
    }
    for (t in 2:T){
      for (j in 1:2) {
        real accumulator[2];
        for (i in 1:2) {
          accumulator[i] = logalpha[t - 1, nn, i] +
                            log(A[nn, t, i, j]) +
                            normal_lpdf(y[startstop[nn, 1] + t - 1] | mu[j] +
                                        y[startstop[nn, 1] + t - 2] * phi[j] +
                                        x_sha[startstop[nn, 1] + t - 1] * zeta +
                                        x_var[startstop[nn, 1] + t - 1] * beta[:,j], sigma);
        }
        logalpha[t, nn, j] = log_sum_exp(accumulator);
      }
    }
  }
  }
}

nom = to_vector(exp(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i]));
den = log(log_sum_exp(nom));
unA = exp(log(nom) - den);
A[nn, t, i] = softmax(unA);

This looks like a suspicious number of exponentiations and logs.

Like den looks something like:

\log \left( \log \left( e^{e^{z_1}} + e^{e^{z_2}} + \dots \right) \right)

Is that right? Then this is eventually gonna go through another softmax?

So the way I want this to work would be, for i = 1:
nom = \begin{bmatrix} e^{0} & e^{z'\lambda_{1,2}} & e^{z'\lambda_{1,3}} \end{bmatrix}
den = \log \left( 1 + e^{z'\lambda_{1,2}} + e^{z'\lambda_{1,3}} \right)
unA = \begin{bmatrix} e^{\log(1) - den} & e^{\log(e^{z'\lambda_{1,2}}) - den} & e^{\log(e^{z'\lambda_{1,3}}) - den} \end{bmatrix}

And then the softmax to turn this into probabilities. But given the sum for the denominator I don’t think that I can just remove the exponentiations/logs in the first and third line.

Why not just softmax the z \lambda_{1, j} vector?
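
Concretely, something like this (just a sketch against your declarations above, not tested): since softmax already exponentiates and normalizes internally, the whole nom/den/unA construction for each i should collapse to a single call on the linear predictor.

  for (nn in 1:N){
    for (t in 1:T){
      for (i in 1:K){
        // softmax(v)[j] = exp(v[j]) / sum_k exp(v[k]), so pass the linear predictor directly
        A[nn, t, i] = softmax(to_vector(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i]));
      }
    }
  }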


So I replaced it with the code below and it seems to work (parameters recovered for one simulation). It cuts the time in half, despite running the models in a separate R session. Thanks!

Just out of curiosity: why is the effect so substantial? I thought the transformed parameters block runs only once, and I'd assumed all these logs are just transformations.

  for (nn in 1:N){
    for (t in 1:T){
      for (i in 1:K){
        A[nn, t, i] = softmax(to_vector(exp(z_var[startstop[nn, 1] + t - 1] * lambda_prime[i])));
      }
    }
  }

Oh okay, cool. The first reason for looking here, I guess, is that all the exps probably make the numerics really finicky if you're not near the solution (e^{e^{z}} already overflows a double once z gets to about 6.6), and hence might make warmup slow.

This bit of code also appeared to be what was different between the probit and non-probit version. If the difference is just in how we do link functions, my understanding is that that choice is pretty arbitrary and mostly driven by ease/convention, so the thought was maybe we could just try something different for fun?


I see. I ended up going through the derivation and both are identical for that matter, so I figure that's also why it works. Are there any other efficiency gains to be had here?
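
For reference, the step that makes it work out (with \lambda_{1,1} = 0 as above):

unA_j = e^{\log(e^{z'\lambda_{1,j}}) - \log \left( \sum_k e^{z'\lambda_{1,k}} \right)} = \frac{e^{z'\lambda_{1,j}}}{\sum_k e^{z'\lambda_{1,k}}} = \text{softmax}(z'\lambda_1)_j

so normalizing by hand and calling softmax on the linear predictor give the same transition probabilities.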

Well that’s a better way to do it lol.

Transformed parameters gets run every leapfrog step, so it's evaluated many times per iteration, not just once.

Maybe :D?

Nah I kid I kid, lemme not be sarcastic and answer your question.

Loops in Stan are fast, so I don’t suspect that.

I think the only practical way you’d make your code faster is if you see ways to reorder things to reduce computation.

Lotta that stuff would be guess and check though. There's a Phi_approx function (or something like that) in the manual to approximate the standard normal CDF, I think. But it's not clear if that's the most costly thing in the probit code.
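
If you want to try it, the sketch below swaps it in for the two normal_cdf calls (untested; Phi_approx is a cheap logistic approximation to the standard normal CDF, so the probit link becomes approximate):

  for (nn in 1:N){
    for (t in 1:T){
      // Phi_approx(x) approximates normal_cdf(x, 0, 1) via an inverse-logit of a cubic
      A[nn, t, 1, 1] = Phi_approx(z_sha[startstop[nn, 1] + t - 1] * gamma +
                                  z_var[startstop[nn, 1] + t - 1] * lambda[:,1]);
      A[nn, t, 2, 2] = Phi_approx(z_sha[startstop[nn, 1] + t - 1] * gamma +
                                  z_var[startstop[nn, 1] + t - 1] * lambda[:,2]);
      A[nn, t, 1, 2] = 1 - A[nn, t, 1, 1];
      A[nn, t, 2, 1] = 1 - A[nn, t, 2, 2];
    }
  }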

I will try that at some point. I am just happy that this implementation takes about 2 minutes per chain rather than the previous six… Good enough for now. Again thanks! My laptop will be thankful as well going forward.

Current:
Chain 2: Elapsed Time: 77.3039 seconds (Warm-up)
Chain 2: 43.4298 seconds (Sampling)
Chain 2: 120.734 seconds (Total)

Previously:
Chain 2: Elapsed Time: 188.119 seconds (Warm-up)
Chain 2: 161.055 seconds (Sampling)
Chain 2: 349.174 seconds (Total)