Performance issues with non-centered parametrization for custom likelihood

performance

#1

Hello,

I am currently trying to fit a hierarchical variant of a specific type of network-based choice model (link for the interested, I am here only focusing on the second of the two sub-models: https://www.sociologicalscience.com/articles-v4-14-318/). The outcome is the choice of an interaction partner from a set of available target actors for a given relational event, where the sender in a given event can be considered fixed here. The number of available target actors can vary from 60 to 90 in my case, the total number of events is ca. 4500. The predictors consist of statistics regarding the history of past interactions linking the sender of a given interaction event with potential target actors (e.g. the number of past times the sender i chose a target j). The hierarchical model structure is over different types of interactions (in my case 3 different types of economic exchanges).

I have managed to build a working version based on a custom likelihood function, but it is relatively slow. I have tried to vectorize where possible but might have missed some opportunities. I also implemented a non-centered parametrization as suggested in the manual because I encountered divergend iterations, but the fitting process is actually quite a bit slower than with the centered parametrization, which seems odd. Gradient evaluation is in the ballpark of 0.15 to 0.2 seconds.

Here is the Stan model:

functions {
  real choiceModel_lpmf(int receiver, int sender, vector beta, vector[] X, int A){
   
   real lprob;
   real denom = 0;

   for (k in 1:A){
     if (k == sender) continue;
     denom += exp(dot_product(beta, X[k]));
   }
   lprob = dot_product(beta, X[receiver]) - log(denom); 
   return(lprob);
  }
}

data {
  int<lower=0> N_events;              // number of events
  int<lower=0> N_pred;                // number of predictors / network statistics
  int<lower=0> A[N_events];           // number actors present at each event
  int<lower=1> N_L;                   // number of levels 
  int<lower=0,upper=N_L> L[N_events]; // levels for events
  int<lower=1> receiver[N_events];    // receiver position at current event
  int<lower=1> sender[N_events];      // sender position at current event
  vector[N_pred] X[sum(A)];           // predictor matrix in unragged format
}

parameters {
  vector[N_pred] mu;
  vector<lower=0>[N_pred] sigma;
  vector[N_pred] beta_raw[N_L];
}

transformed parameters {
  vector[N_pred] beta[N_L];

  for(l in 1:N_L)
    beta[l] = mu + sigma .* beta_raw[l];
}

model {
  int pos; // needed for segmenting the unragged array X
  pos = 1;

   // Priors
   mu ~ normal(0,2);
   sigma ~ exponential(1);

   for(l in 1:N_L)
      beta_raw[l] ~ std_normal();

  // likelihood (evaluated only at the currently active actors)
  for(n in 1:N_events){
    receiver[n] ~ choiceModel(sender[n], beta[L[n]], segment(X, pos, A[n]), A[n]);
    pos = pos + A[n];
  }
}

generated quantities {
  vector[N_events] log_lik;
  int pos;
  pos = 1;
 
  for(n in 1:N_events){
    log_lik[n] = choiceModel_lpmf(receiver[n] | sender[n], beta[L[n]], segment(X, pos, A[n]), A[n]);
    pos = pos + A[n];
  }   
}

I would greatly appreciate any advice on how to improve performance or on more efficent coding / overall model structure, as I am relatively new to Stan and Bayesian Stats in general.

Thank you,
Jakob


#2

It wasn’t the question, but I’d be scared of this:

for (k in 1:A){
   if (k == sender) continue;
   denom += exp(dot_product(beta, X[k]));
}
lprob = dot_product(beta, X[receiver]) - log(denom); 

It looks like you’re doing some sort of softmax transformation. Using the internal softmax or doing a log_sum_exp is going to be better. It’s really easy for sums of exponentials to overflow.


#3

Hey Ben,

thanks for the tip. In the meantime i found this thread: Mixed Logit Model. The models discussed there have basically the same structure but indeed utilize the internal softmax function and I adapted the model provided by @James_Savage to my case. So far it doesn’t seem an awful lot faster, but hopefully the internal softmax variant will result in less divergence issues. The bottleneck regarding speed is probably just in the amount of data, given that there are more than 300k ‘observations’; It seems I just have to find more computing power and/or be patient.


#4

@Jakob The only suggestion I would have is to ask how much data you need to identify the parameters of the model to a degree of precision that makes you comfortable. Sometimes fitting your model to a subset of the data really makes practical sense.


#5

Hey James,

actually, the variant of your model with the fully modelled covariance matrix both ran decently fast and without divergence issues on the full dataset, so thanks a lot for putting those on the forums. I now figured that estimation time and convergence seem to depend much more on the grouping I choose for the hierarchy (I have one grouping into 3 groups and one into 14 groups, with the latter running much more smoothely) and the coding of the predictors. I’ll just have to figure out which one of those works best.


#6

Jakob,

Glad to hear. Yep – more groups will provide more information about the
covariance matrix of the random slopes. Smart priors will make a big
difference too. It’s worth simulating choices by drawing beta from your prior,
(using the actual choice attribute matrix) then storing out the maximum choice
share. Repeat many times. What you’ll find is something like this

What does this mean? Wide priors in these models indicate a prior that
you believe a single choice is always dominant, and that we should ex ante
expect very strong concentration of choices. This interacts strongly with your
choice attribute matrix. Good to keep in mind!

Jim


#7

Hey Jim,

I wanted to set priors based on prior predictive checks and so far I went for fairly strict priors along the lines of normal(0,1) due to the concentration issue. However, when I set up a script for prior predictions and set priors to something ridiculous like normal(0,100), i don’t get the strong concentration on the top choice like in your plots.
Is something wrong with simulating data with categorical_rng for this example, like the following?

generated quantities {
  vector[E] y_rep;
  for(i in 1:E) {
    y_rep[i] = categorical_rng(softmax(X[start[i]:end[i]] * beta[cat[i]]'));
  }
}

And am I correct in the assumption that your example plots show something along the lines of:
t <- apply(y_rep, 2, function(x) table(x)[which.max(table(x))]/length(x))
hist(t)

I wasn’t too sure what you meant by “maximum choice share”.

Thanks a lot for your help,
Jakob


#8

I don’t know, but I’m guessing it might be something about looking at the maximum value of softmax(X[start[i]:end[i]] * beta[cat[i]]') instead of drawing from the categorical_rng.

The output of the softmax is the choice share, right?


#9

Hey Ben,

computing the probabilities for all the choice options via the softmax and then looking at which target got the highest choice probability across all samples was actually what I was doing before to get (posterior) predictions, instead of the categorical_rng.
The primary problem I had with that again was size as even a moderate number of samples results in a huge (around 7Gb) matrix since there are 300k+ choice options. Also, when I sample probabilities for all choices from the prior like in the script below, it takes about an hour before sampling starts, while sampling itself is fast and another hour after sampling is finished. The categorical_rng as above is lightning fast, not sure what causes the huge difference.

data {
  int E;       // number of relational events
  int A[E];    // number of available targets at each event
  int N;       // number of rows
  int K;       // number of random effects
  int K2;      // number of fixed effects
  int cat[E];  // event categories
  int L;       // number of event categories
  vector<lower=0,upper=1>[N] choice; // binary choice indicator
  int whichChoice[E]; // position of chosen targets in choice[N] 
  matrix[N,K] X;      // predictors for random effects
  matrix[N,K2] Z;     // predictors for fixed effects
  int start[E]; // the starting observation for each event
  int end[E];   // the ending observation for each event
}

parameters {
  vector[K] mu;
  vector<lower=0>[K] tau;
  matrix[L,K] beta_raw;
  vector[K2] gamma;
  cholesky_factor_corr[K] omega; // cholesky factor of the beta correlation matrix
}

transformed parameters {
  matrix[L,K] beta = rep_matrix(mu', L) + beta_raw*diag_pre_multiply(tau, omega);
}

model {
  vector[N] log_prob;
  
   // priors
   gamma ~ normal(0,0.5);
   mu ~ normal(0, 0.4);
   tau ~ exponential(2);
   to_vector(beta_raw) ~ normal(0, 1);
   omega ~ lkj_corr_cholesky(4);
}

generated quantities {
  vector[N] log_prob;

  for(i in 1:E) {
    log_prob[start[i]:end[i]] = log_softmax(X[start[i]:end[i]]*beta[cat[i]]' 
                                              + Z[start[i]:end[i]]*gamma);
  }
}

#10

Yeah, for large N it’d be pretty easy to get I/O bound here saving the full simplexes. What if you just save the value of the maximum probability for each E? That should be small (only E numbers instead of N).

I think it’s the distribution of maximum probabilities for each E you want anyway.


#11

And by “I think” I mean “I’m totally guessing” based on what @James_Savage posted haha.


#12

You’re right, it makes sense to just get the maximum value from Stan and not the whole simplex and I guess you’re right in that those are what you want in the end as they indicate the degree of ‘concentration’ in the prior predictions. The respective plots do also look and behave more or less like @James_Savage’s, only that there is also a varying degree of concetration around zero, which however makes sense given that for some events the data don’t contain much variation and the buest guess is accordingly more or less random (i.e. 1 / number of choice alternatives).
Does Stan contain a function along R’s which.max() by any chance so you could extract the ‘best guess’ target of a choice in a similar fashion to the maximum choice probability? I couldn’t find any in a cursory search in the manuals.


#13

In the marketing world at least, it’s more common to save the probability of the choice that was actually made rather than the most probable outcome.

Might I suggest doing the posterior predictions out of Stan? That way you can batch it a lot easier.

I’m about to post a new thread but you might be interested in this method for prior choice: http://khakieconomics.github.io/2019/03/17/Choosing-priors-for-logit-choice-models.html

Jim