Dear Stan community,

I have a ragged-data hierarchical conditional logistic / softmax regression model (code below), which I would like to speed up. The model is relatively simple, but there is a lot of data, at least by my standards: the number of 'choice events' (`E`) is around 4500, and at each event there are 60-80 options (this is a type of relational event model, for those familiar with network analysis), yielding around 300k 'observations' (`N`). There are 3 groups (`L`) and 5-15 predictors (`K`).

Currently, the model takes about 15h to run on an 8-core workstation, with gradient evaluation estimated at around 0.06 seconds.

```
data {
  int<lower=0> N;                  // total option rows across all events
  int<lower=0> E;                  // number of choice events
  int<lower=1> K;                  // number of predictors
  int<lower=1> L;                  // number of groups
  int<lower=1,upper=L> ll[E];      // group of each event (sized E to match its use below)
  vector[N] y;                     // choice indicator: 1 if the option was chosen, else 0
  matrix[N,K] x;                   // predictor matrix
  int<lower=1,upper=N> start[E];   // first row of each event
  int<lower=1,upper=N> end[E];     // last row of each event
}
parameters {
  vector[K] mu;                    // population-level coefficient means
  vector<lower=0>[K] sigma;        // between-group coefficient scales
  vector[K] z[L];                  // standardized group offsets (non-centered)
}
transformed parameters {
  vector[K] beta[L];               // group-level coefficients
  for (l in 1:L)
    beta[l] = mu + sigma .* z[l];
}
model {
  vector[N] log_prob;
  mu ~ normal(0, 2);
  sigma ~ normal(0, 5);
  for (l in 1:L)
    z[l] ~ std_normal();
  // per-event softmax over that event's options
  for (e in 1:E)
    log_prob[start[e]:end[e]] = log_softmax(x[start[e]:end[e]] * beta[ll[e]]);
  // y picks out the chosen options' log probabilities
  target += log_prob' * y;
}
```
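For context, one model-specific optimization I have considered (a sketch, assuming `y` is a one-hot indicator with exactly one 1 per event): since only the chosen option's log probability enters the target, the full `log_softmax` vector and the dot product with `y` can be replaced by a single `log_sum_exp` per event. The `chosen` array below is a hypothetical helper derived from `y`:

```
transformed data {
  // hypothetical helper: row index of the chosen option in each event
  int chosen[E];
  for (e in 1:E)
    for (n in start[e]:end[e])
      if (y[n] == 1)
        chosen[e] = n;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 5);
  for (l in 1:L)
    z[l] ~ std_normal();
  for (e in 1:E) {
    // linear predictor for this event's options only
    vector[end[e] - start[e] + 1] eta = x[start[e]:end[e]] * beta[ll[e]];
    // log probability of the chosen option
    target += eta[chosen[e] - start[e] + 1] - log_sum_exp(eta);
  }
}
```

I would expect at most a modest gain from this, since the matrix-vector products dominate, but it avoids materializing the full `log_prob` vector.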

I have no experience with Stan's map-reduce and GPU capabilities, and I am unsure how much there is to gain and whether it is even feasible for this model (it might be difficult to split ragged data into shards).
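From what I have read, `reduce_sum` (Stan 2.23+, with threading enabled, e.g. `STAN_THREADS=true` in CmdStan) might sidestep the sharding problem, since one can slice over event indices rather than the ragged rows themselves, leaving the raggedness inside the partial-sum function. A sketch of what I have in mind (the `partial_sum` name and `grainsize` of 1 are placeholders, and `ll` is assumed to index the group of each event, matching how it is used in the model above):

```
functions {
  real partial_sum(int[] event_slice, int start_e, int end_e,
                   vector y, matrix x, vector[] beta,
                   int[] ll, int[] start, int[] end) {
    real lp = 0;
    for (e in event_slice)
      lp += dot_product(log_softmax(x[start[e]:end[e]] * beta[ll[e]]),
                        y[start[e]:end[e]]);
    return lp;
  }
}
data {
  int<lower=0> N;
  int<lower=0> E;
  int<lower=1> K;
  int<lower=1> L;
  int<lower=1,upper=L> ll[E];
  vector[N] y;
  matrix[N,K] x;
  int<lower=1> start[E];
  int<lower=1> end[E];
}
transformed data {
  // slice over event indices so each partial sum gets whole events
  int event_idx[E];
  for (e in 1:E) event_idx[e] = e;
}
parameters {
  vector[K] mu;
  vector<lower=0>[K] sigma;
  vector[K] z[L];
}
transformed parameters {
  vector[K] beta[L];
  for (l in 1:L)
    beta[l] = mu + sigma .* z[l];
}
model {
  int grainsize = 1;  // let the scheduler choose slice sizes; tune if needed
  mu ~ normal(0, 2);
  sigma ~ normal(0, 5);
  for (l in 1:L)
    z[l] ~ std_normal();
  target += reduce_sum(partial_sum, event_idx, grainsize,
                       y, x, beta, ll, start, end);
}
```

I have not benchmarked this, so I would welcome corrections if the slicing strategy is off.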

On the other hand, I'm also unsure whether there are further model-specific improvements that would make sampling more efficient. The predictors are mostly functions of past events (e.g. how often has actor A chosen B before?), and I suspect this could induce a complicated posterior geometry.

Does anyone have any advice or experiences to share on how to optimize similar models?

Best regards and thanks

Jakob