Speed-up for hierarchical conditional logistic / softmax regression with ragged data structure

Dear Stan community,

I have a ragged-data hierarchical conditional logistic / softmax regression model (code below) which I would like to speed up. The model is relatively simple, but there is a lot of data, at least by my standards: there are around 4,500 ‘choice events’ (E), and at each event there are 60-80 options (this is a type of relational event model, for those familiar with network analysis), yielding around 300k ‘observations’ (N). There are 3 groups (L) and 5-15 predictors (K).

Currently, the model takes about 15h to run on an 8-core workstation, with gradient evaluation estimated at around 0.06 seconds.

data {
  int<lower=0> N;
  int<lower=0> E;
  int<lower=1> K;
  int<lower=1> L;
  int<lower=1,upper=L> ll[N];
  vector[N] y;
  matrix[N,K] x;
  int<lower=1> start[E];
  int<lower=1> end[E];
}

parameters {
  vector[K] mu;
  vector<lower=0>[K] sigma;
  vector[K] z[L];
}

transformed parameters {
  vector[K] beta[L];

  for(l in 1:L)
    beta[l] = mu + sigma .* z[l];
}

model {
  vector[N] log_prob;

  mu ~ normal(0,2);
  sigma ~ normal(0,5);
  for (l in 1:L)
    z[l] ~ std_normal();

  for (e in 1:E) 
    log_prob[start[e]:end[e]] = log_softmax(x[start[e]:end[e]] * beta[ll[e]]);

  target += log_prob' * y;
}
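
For readers who find Stan's ragged-indexing idiom opaque, here is a rough NumPy sketch of the same likelihood computation (toy made-up data; note Python's 0-based, end-exclusive slices where Stan's start/end are 1-based and inclusive):

```python
import numpy as np

def log_softmax(v):
    # numerically stable log-softmax
    v = v - v.max()
    return v - np.log(np.exp(v).sum())

rng = np.random.default_rng(0)

# toy ragged data: 2 events with 3 and 4 options, K = 2 predictors
N, K = 7, 2
x = rng.normal(size=(N, K))
start = [0, 3]   # 0-based, end-exclusive (unlike Stan's 1-based inclusive)
end = [3, 7]
beta = rng.normal(size=K)

# one-hot y marking the chosen option of each event
y = np.zeros(N)
y[1] = 1.0       # event 1: its 2nd option was chosen
y[5] = 1.0       # event 2: its 3rd option was chosen

log_prob = np.empty(N)
for s, e in zip(start, end):
    log_prob[s:e] = log_softmax(x[s:e] @ beta)

# total log-likelihood: the dot product keeps only the chosen options
ll = log_prob @ y
```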

I have no experience with Stan's map-reduce and GPU capabilities, and am unsure how much could be gained, or whether it is even possible for this model (it might be difficult to split ragged data into shards).

On the other hand, I’m also unsure if there are any further model-specific improvements to make sampling more efficient. Predictors are mostly functions of former events (e.g. how often has actor A chosen B in the past?) and I guess this could induce a complicated posterior geometry.

Does anyone have any advice or experiences to share on how to optimize similar models?

Best regards and thanks


Hi, sorry for taking so long to respond. I am by no means an expert on this, but here are my two cents:

If I understand correctly, for each event e you have end[e] - start[e] + 1 choices, and y records how many times each choice was taken? Then you are trying to regress the probabilities of the individual choices, right?

If so, there are a few things that might (or might not) be hindering your efficiency. I would guess a well-behaved model would likely be faster than what you describe, though possibly not by much:

  • shouldn’t y be of type int[N]?
  • log_prob' * y: I am not sure where that’s coming from; I would believe you want to have y[start[e]:end[e]] ~ multinomial(softmax(x[start[e]:end[e]] * beta[ll[e]])). Maybe that is equivalent, but I don’t see it immediately.
  • You have quite wide priors for mu and sigma; remember you are on the log scale. Checking the implied prior:
hist(rnorm(1e4,0, abs(rnorm(1e4,0,5))))


So log odds of ±15, i.e. odds ratios of exp(15) ≈ 3,269,017, are still a priori quite plausible.

  • Finally, and maybe most importantly, softmax of a vector of length K has only K-1 degrees of freedom: for any scalar c and vector beta, softmax(beta) == softmax(beta + c), so you need to constrain your betas. One way that I think is used quite often is to fix the coefficient of a baseline option to 0. Then the other coefficients represent log odds relative to this baseline option.
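
The additive invariance is easy to verify numerically; a small NumPy check (toy numbers) that a constant shift leaves the probabilities untouched, while scaling, by contrast, changes them:

```python
import numpy as np

def softmax(v):
    # subtract the max for numerical stability; this also makes the
    # invariance visible: any constant shift cancels in the normalisation
    e = np.exp(v - v.max())
    return e / e.sum()

beta = np.array([0.5, -1.2, 2.0])
c = 3.7

p_orig = softmax(beta)
p_shifted = softmax(beta + c)   # identical probabilities
p_scaled = softmax(beta * 2.0)  # scaling does change them
```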

Hope I am making sense, please double check my reasoning.
Good luck with your model!

Hi Martin,

thanks for coming back to this!

In this case, y is a dummy (0/1) vector that, for each event e, indicates which of the available options/choices was actually chosen (so a categorical likelihood). The model is basically this one proposed by @James_Savage, who uses the target += log_prob' * y trick to sum up the log probabilities at the realized y's and ignore the rest of the log-likelihood contributions.
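
To spell the trick out in NumPy (a toy sketch with made-up numbers): with a one-hot y, the dot product with the log-softmax vector is exactly the categorical log-likelihood of the realized choice.

```python
import numpy as np

# log-softmax output for a single event with three options (toy numbers)
log_prob = np.log(np.array([0.1, 0.6, 0.3]))

chosen = 1                    # the option actually chosen
y = np.zeros(3)
y[chosen] = 1.0               # one-hot indicator, as in the model's y

via_dot = log_prob @ y        # the log_prob' * y trick
via_index = log_prob[chosen]  # categorical log-likelihood of the choice
```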

Regarding the priors, you’re right that they are too wide; the scales should have been something like 1 or even 0.5.

As an update on the performance issue: I managed to get a significant performance increase (I’d say at least 10x) by switching to a more fine-grained grouping variable with around 30 groups, which apparently allowed the sampler to traverse the posterior more easily. This also allowed me to incorporate a full model for the parameter covariance matrix (as in the model linked above) without any loss of speed. So I think for now I’m happy without the GPU or map_rect stuff :)


You could try rearranging the array-of-vectors z into a matrix. Then you can hopefully get rid of some for loops and also write to_vector(z_as_matrix) ~ std_normal().
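
As a sanity check on why the flattened statement is legitimate (a NumPy sketch of the math, not Stan code): the z's are i.i.d. standard normal, so summing the log density row by row or over the flattened matrix gives the same target increment.

```python
import numpy as np

def std_normal_lpdf(v):
    # summed log density of independent standard normals
    return float(np.sum(-0.5 * v**2 - 0.5 * np.log(2.0 * np.pi)))

rng = np.random.default_rng(1)
L, K = 3, 5
z = rng.normal(size=(L, K))

# row by row, like `for (l in 1:L) z[l] ~ std_normal();` ...
looped = sum(std_normal_lpdf(z[l]) for l in range(L))

# ... equals one vectorised statement over the flattened matrix,
# like `to_vector(z_as_matrix) ~ std_normal();`
flat = std_normal_lpdf(z.ravel())
```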

The other thing which I noticed earlier is that

  target += log_prob' * y;

is slower than the equivalent

  target += dot_product(log_prob, y);

at least this is what I recall.
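
The two forms compute the same scalar, so this is purely a question of which code path Stan's autodiff takes; a toy NumPy check of the equivalence:

```python
import numpy as np

rng = np.random.default_rng(2)
log_prob = rng.normal(size=10)
y = np.zeros(10)
y[4] = 1.0

a = log_prob @ y          # log_prob' * y
b = np.dot(log_prob, y)   # dot_product(log_prob, y)
```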


I already had the matrix form for z in my current version, but I’ll give dot_product a try.

So you did manage to make the model work without further constraining the betas? That would be surprising to me, so maybe it is an opportunity to learn something :-) Just out of curiosity, how does the pairs plot (or similar) look for, say, beta[1] (or a subset thereof if K is large)?

Sorry, I forgot to address the identifiability issue. I followed the advice given here and added an “outside good/choice” to make the model identifiable. However, I got the model to fit decently even without the outside good; I guess the parameter space was constrained sufficiently by the priors.
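
To illustrate why an outside good helps (a hypothetical NumPy sketch; the convention of giving the outside option a fixed utility of 0 is my assumption here): without it, shifting all utilities by a constant leaves the choice probabilities untouched, but with a pinned zero-utility option the shift becomes identifiable.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # numerically stable softmax
    return e / e.sum()

u = np.array([0.5, -1.2, 2.0])   # modelled utilities of the real options
c = 3.7                          # an arbitrary common shift

# no outside good: the shift cancels, so c is not identified
p_plain = softmax(u)
p_plain_shift = softmax(u + c)

# outside good with fixed utility 0 (assumed convention): the shift matters
p_out = softmax(np.append(u, 0.0))
p_out_shift = softmax(np.append(u + c, 0.0))
```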
