Help speeding up a non-standard multi-logit model

Dear wonderful Stan team,

I’m returning to some very old Stan code (from 2014!) that I’m hoping to “modernize.”

The setup is very similar to a standard multi-logit problem. Units belong to one of K types and we want to model the probability that a unit belongs to a given type as a function of covariates X. The complication is that types are (partially) latent. For types {A, B, C}, we can directly observe a subset of units that are type A or type B, but we can never observe units of type C directly — for those units we only know that they are, say, either A or C. As a result, I can no longer use the off-the-shelf multi-logit Stan models (e.g., Section 9.6 of the Stan manual).

The good news is that I can easily modify the likelihood for these units directly, which seems to work well (see attached code). At the same time, I feel like my code must be inefficient for this non-standard setting. I’m planning to expand this to much more complex models, so any help would be greatly appreciated.
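To make the structure concrete, here is roughly the shape of that modified likelihood (a sketch, not the attached code; the `obs` coding and the shapes of `beta` and `x` are just for illustration):

for (n in 1:N) {
  // log type probabilities for unit n, ordered [A, B, C]
  vector[3] log_pi = log_softmax(beta * col(x, n));
  if (obs[n] == 1)          // directly observed to be type A
    target += log_pi[1];
  else if (obs[n] == 2)     // directly observed to be type B
    target += log_pi[2];
  else if (obs[n] == 3)     // known only to be A or C: marginalize over both
    target += log_sum_exp(log_pi[1], log_pi[3]);
  else                      // known only to be B or C
    target += log_sum_exp(log_pi[2], log_pi[3]);
}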

[The motivation for this setup is a classic causal inference problem where we’re trying to estimate compliance types: {Always Takers, Never Takers, and Compliers}. We can directly observe some Always Takers and some Never Takers. But we can never directly observe Compliers.]

Stan code, R file, and knitted PDF attached (apparently I can’t post Rmd or html?).

Many thanks,
Avi

pscore_script_for_list.R (1.4 KB)
pscore_script_for_list.pdf (167.6 KB)
pscore_model_for_list.stan (1.3 KB)

parameters {
  matrix[N_type - 1, N_cov] beta_raw;
}

transformed parameters {
  matrix[N_type, N_cov] beta;
  // ...
}

// ... in the model block:
to_vector(beta) ~ normal(0, 5);

In your posted code I found this one. I didn’t look further.

Thanks for taking a look. But that code chunk is copy-and-paste from the
Stan manual. Is that no longer the recommendation? It’s certainly easy to
change that part.

If you have a minute, I’m most worried about the main model block.

Thanks again.

Avi

Which version/page is that?

You have to use:

to_vector(beta_raw) ~ normal(0, 5);
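For context, the surrounding blocks would then look roughly like this (a sketch; the zero row pins the reference type's coefficients for identifiability, and its position at the bottom is my assumption):

parameters {
  matrix[N_type - 1, N_cov] beta_raw;
}

transformed parameters {
  // fix the reference type's coefficients at zero for identifiability
  matrix[N_type, N_cov] beta = append_row(beta_raw, rep_row_vector(0, N_cov));
}

model {
  to_vector(beta_raw) ~ normal(0, 5);  // prior on the free rows only
  // ... likelihood as before ...
}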

Ah. Good catch. That was my mistake!

Again, I was focused on the main model block (hence missing that typo!).
I’d especially welcome comments there.

Thanks again,
Avi

Also, while I obviously just made a coding error, it might be worth noting this simple point in the discussion on p. 135 of the manual. Looking at it again, it was easy to gloss over when going through that section.

While that would certainly speed this up, it unfortunately destroys the underlying structure of the problem. For example, I don't believe I could use the posterior for beta to make a marginal prediction about a unit being a Complier (corresponding to pi[3] in what you wrote). Perhaps I'm missing something?

You could omit the log_sum_exp and log_softmax and work directly on the probability scale:

matrix[N, N_type] pi = exp((beta * x)');  // x declared as matrix[N_cov, N]

Then, for a unit whose type G[n] is directly observed, you could try:

target += log(pi[n, G[n]] / sum(pi[n]));

just to get the idea. You could also build a one-hot design matrix X from G and take row-wise dot products with pi, normalizing by the row sums:

target += sum(log(rows_dot_product(X, pi))) - sum(log(pi * rep_vector(1, N_type)));

(Sorry not my day).
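Put together, a fuller sketch of that vectorized version (hypothetical, since it assumes every unit has a single directly observed type G[n], which covers only the fully observed units here; note also that the exp scale can overflow for large linear predictors):

transformed data {
  // one-hot design matrix: X[n, G[n]] = 1, all other entries 0
  matrix[N, N_type] X = rep_matrix(0, N, N_type);
  for (n in 1:N)
    X[n, G[n]] = 1;
}
model {
  matrix[N, N_type] pi = exp((beta * x)');
  // numerator picks out pi[n, G[n]] for every n; denominator is the row sums
  target += sum(log(rows_dot_product(X, pi)))
            - sum(log(pi * rep_vector(1, N_type)));
}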

This was an excellent suggestion, cutting compute time by about 80 percent for similar ESS relative to the previous model (with my typo fixed). It's amazing that one change made such a difference.

Thanks again for the help. Incredibly useful.

Avi

@andre.pfeuffer got the big one by converting to matrix multiply rather than a loop for beta * x.

The next step would be to break down x based on z and d because you only ever need one or two columns out of the result. And there are a lot of zeroes that aren’t efficient to multiply.

Some smaller suggestions:

    to_vector(beta) ~ normal(0, 5);

should be

    to_vector(beta_raw) ~ normal(0, 5);

because you don’t need the priors on the zero entries.

It doesn’t help with speed, but

  row_vector[N_cov] zeros;
  zeros = rep_row_vector(0, N_cov);

can use declare-define to put it on a single line

  row_vector[N_cov] zeros = rep_row_vector(0, N_cov);

And was that really supposed to be 0 and not 1? Using a 1 would be an intercept, but using 0 doesn’t do anything I can see.

Many thanks, Bob.

I’m not sure if I totally follow your idea to break down x based on z and d, since I need the full multinomial structure.

Regardless, that new declare-define notation is pretty slick!

Thanks again,

Avi

Let me write the relevant loop out. I moved the inner multiplies out into a single matrix multiply; that requires declaring x as a matrix (see the comment below).

{
  matrix[N, N_type] beta_x = (beta * x)';  // declare `x` as matrix[N_cov, N]
  for (n in 1:N) {
    // row n holds the type log-odds for unit n; transpose to a column vector
    vector[N_type] log_pi = log_softmax(beta_x[n]');
    if (z[n] == 0 && d[n] == 1)       // took treatment unassigned: always-taker
      target += log_pi[1];
    else if (z[n] == 1 && d[n] == 0)  // refused when assigned: never-taker
      target += log_pi[2];
    else if (z[n] == 0 && d[n] == 0)  // never-taker or complier
      target += log_sum_exp(log_pi[2], log_pi[3]);
    else if (z[n] == 1 && d[n] == 1)  // always-taker or complier
      target += log_sum_exp(log_pi[1], log_pi[3]);
  }
}

I had thought you’d be able to avoid computing some of the components, but that won’t work given the softmax.

Softmax itself isn't the most efficient way to compute just log_pi[1] in these cases because it produces a multivariate return and builds up a bigger structure than necessary. So instead of

target += log(softmax(a)[n]);

it's better to do

target += a[n] - log_sum_exp(a);

which can be coded in Stan as

n ~ categorical_logit(a);

This only computes the gradients needed.
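For example, the two fully observed branches in the loop above could be written this way (a sketch; beta_x[n]' is unit n's vector of type log-odds):

if (z[n] == 0 && d[n] == 1)
  1 ~ categorical_logit(beta_x[n]');  // observed always-taker
else if (z[n] == 1 && d[n] == 0)
  2 ~ categorical_logit(beta_x[n]');  // observed never-taker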

I actually don’t see how to do what I suggested, since you need all of the columns of beta * x.

What is the size of log_pi (i.e., N_type) here? If it's 3, then the last two branches can be reduced to the first two components, because softmax(a) sums to 1: for instance, log_sum_exp(log_pi[2], log_pi[3]) = log1m_exp(log_pi[1]).
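In code, the marginalized branches would then become something like (a sketch, assuming N_type == 3):

else if (z[n] == 0 && d[n] == 0)
  target += log1m_exp(log_pi[1]);  // P(not always-taker) = pi[2] + pi[3]
else if (z[n] == 1 && d[n] == 1)
  target += log1m_exp(log_pi[2]);  // P(not never-taker) = pi[1] + pi[3]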