I am currently trying to fit a hierarchical variant of a specific type of network-based choice model (link for the interested, I am here only focusing on the second of the two sub-models: https://www.sociologicalscience.com/articles-v4-14-318/). The outcome is the choice of an interaction partner from a set of available target actors for a given relational event, where the sender in a given event can be considered fixed here. The number of available target actors can vary from 60 to 90 in my case, the total number of events is ca. 4500. The predictors consist of statistics regarding the history of past interactions linking the sender of a given interaction event with potential target actors (e.g. the number of past times the sender i chose a target j). The hierarchical model structure is over different types of interactions (in my case 3 different types of economic exchanges).
I have managed to build a working version based on a custom likelihood function, but it is relatively slow. I have tried to vectorize where possible but might have missed some opportunities. I also implemented a non-centered parametrization as suggested in the manual because I encountered divergend iterations, but the fitting process is actually quite a bit slower than with the centered parametrization, which seems odd. Gradient evaluation is in the ballpark of 0.15 to 0.2 seconds.
Here is the Stan model:
functions {
real choiceModel_lpmf(int receiver, int sender, vector beta, vector[] X, int A){
real lprob;
real denom = 0;
for (k in 1:A){
if (k == sender) continue;
denom += exp(dot_product(beta, X[k]));
}
lprob = dot_product(beta, X[receiver]) - log(denom);
return(lprob);
}
}
data {
int<lower=0> N_events; // number of events
int<lower=0> N_pred; // number of predictors / network statistics
int<lower=0> A[N_events]; // number actors present at each event
int<lower=1> N_L; // number of levels
int<lower=0,upper=N_L> L[N_events]; // levels for events
int<lower=1> receiver[N_events]; // receiver position at current event
int<lower=1> sender[N_events]; // sender position at current event
vector[N_pred] X[sum(A)]; // predictor matrix in unragged format
}
parameters {
vector[N_pred] mu;
vector<lower=0>[N_pred] sigma;
vector[N_pred] beta_raw[N_L];
}
transformed parameters {
vector[N_pred] beta[N_L];
for(l in 1:N_L)
beta[l] = mu + sigma .* beta_raw[l];
}
model {
int pos; // needed for segmenting the unragged array X
pos = 1;
// Priors
mu ~ normal(0,2);
sigma ~ exponential(1);
for(l in 1:N_L)
beta_raw[l] ~ std_normal();
// likelihood (evaluated only at the currently active actors)
for(n in 1:N_events){
receiver[n] ~ choiceModel(sender[n], beta[L[n]], segment(X, pos, A[n]), A[n]);
pos = pos + A[n];
}
}
generated quantities {
vector[N_events] log_lik;
int pos;
pos = 1;
for(n in 1:N_events){
log_lik[n] = choiceModel_lpmf(receiver[n] | sender[n], beta[L[n]], segment(X, pos, A[n]), A[n]);
pos = pos + A[n];
}
}
I would greatly appreciate any advice on how to improve performance or on more efficent coding / overall model structure, as I am relatively new to Stan and Bayesian Stats in general.
It wasnât the question, but Iâd be scared of this:
for (k in 1:A){
if (k == sender) continue;
denom += exp(dot_product(beta, X[k]));
}
lprob = dot_product(beta, X[receiver]) - log(denom);
It looks like youâre doing some sort of softmax transformation. Using the internal softmax or doing a log_sum_exp is going to be better. Itâs really easy for sums of exponentials to overflow.
thanks for the tip. In the meantime i found this thread: Mixed Logit Model. The models discussed there have basically the same structure but indeed utilize the internal softmax function and I adapted the model provided by @James_Savage to my case. So far it doesnât seem an awful lot faster, but hopefully the internal softmax variant will result in less divergence issues. The bottleneck regarding speed is probably just in the amount of data, given that there are more than 300k âobservationsâ; It seems I just have to find more computing power and/or be patient.
@Jakob The only suggestion I would have is to ask how much data you need to identify the parameters of the model to a degree of precision that makes you comfortable. Sometimes fitting your model to a subset of the data really makes practical sense.
actually, the variant of your model with the fully modelled covariance matrix both ran decently fast and without divergence issues on the full dataset, so thanks a lot for putting those on the forums. I now figured that estimation time and convergence seem to depend much more on the grouping I choose for the hierarchy (I have one grouping into 3 groups and one into 14 groups, with the latter running much more smoothely) and the coding of the predictors. Iâll just have to figure out which one of those works best.
Glad to hear. Yep â more groups will provide more information about the
covariance matrix of the random slopes. Smart priors will make a big
difference too. Itâs worth simulating choices by drawing beta from your prior,
(using the actual choice attribute matrix) then storing out the maximum choice
share. Repeat many times. What youâll find is something like this
What does this mean? Wide priors in these models indicate a prior that
you believe a single choice is always dominant, and that we should ex ante
expect very strong concentration of choices. This interacts strongly with your
choice attribute matrix. Good to keep in mind!
I wanted to set priors based on prior predictive checks and so far I went for fairly strict priors along the lines of normal(0,1) due to the concentration issue. However, when I set up a script for prior predictions and set priors to something ridiculous like normal(0,100), i donât get the strong concentration on the top choice like in your plots.
Is something wrong with simulating data with categorical_rng for this example, like the following?
And am I correct in the assumption that your example plots show something along the lines of:
t <- apply(y_rep, 2, function(x) table(x)[which.max(table(x))]/length(x))
hist(t)
I wasnât too sure what you meant by âmaximum choice shareâ.
I donât know, but Iâm guessing it might be something about looking at the maximum value of softmax(X[start[i]:end[i]] * beta[cat[i]]') instead of drawing from the categorical_rng.
The output of the softmax is the choice share, right?
computing the probabilities for all the choice options via the softmax and then looking at which target got the highest choice probability across all samples was actually what I was doing before to get (posterior) predictions, instead of the categorical_rng.
The primary problem I had with that again was size as even a moderate number of samples results in a huge (around 7Gb) matrix since there are 300k+ choice options. Also, when I sample probabilities for all choices from the prior like in the script below, it takes about an hour before sampling starts, while sampling itself is fast and another hour after sampling is finished. The categorical_rng as above is lightning fast, not sure what causes the huge difference.
data {
int E; // number of relational events
int A[E]; // number of available targets at each event
int N; // number of rows
int K; // number of random effects
int K2; // number of fixed effects
int cat[E]; // event categories
int L; // number of event categories
vector<lower=0,upper=1>[N] choice; // binary choice indicator
int whichChoice[E]; // position of chosen targets in choice[N]
matrix[N,K] X; // predictors for random effects
matrix[N,K2] Z; // predictors for fixed effects
int start[E]; // the starting observation for each event
int end[E]; // the ending observation for each event
}
parameters {
vector[K] mu;
vector<lower=0>[K] tau;
matrix[L,K] beta_raw;
vector[K2] gamma;
cholesky_factor_corr[K] omega; // cholesky factor of the beta correlation matrix
}
transformed parameters {
matrix[L,K] beta = rep_matrix(mu', L) + beta_raw*diag_pre_multiply(tau, omega);
}
model {
vector[N] log_prob;
// priors
gamma ~ normal(0,0.5);
mu ~ normal(0, 0.4);
tau ~ exponential(2);
to_vector(beta_raw) ~ normal(0, 1);
omega ~ lkj_corr_cholesky(4);
}
generated quantities {
vector[N] log_prob;
for(i in 1:E) {
log_prob[start[i]:end[i]] = log_softmax(X[start[i]:end[i]]*beta[cat[i]]'
+ Z[start[i]:end[i]]*gamma);
}
}
Yeah, for large N itâd be pretty easy to get I/O bound here saving the full simplexes. What if you just save the value of the maximum probability for each E? That should be small (only E numbers instead of N).
I think itâs the distribution of maximum probabilities for each E you want anyway.
Youâre right, it makes sense to just get the maximum value from Stan and not the whole simplex and I guess youâre right in that those are what you want in the end as they indicate the degree of âconcentrationâ in the prior predictions. The respective plots do also look and behave more or less like @James_Savageâs, only that there is also a varying degree of concetration around zero, which however makes sense given that for some events the data donât contain much variation and the buest guess is accordingly more or less random (i.e. 1 / number of choice alternatives).
Does Stan contain a function along Râs which.max() by any chance so you could extract the âbest guessâ target of a choice in a similar fashion to the maximum choice probability? I couldnât find any in a cursory search in the manuals.
In the marketing world at least, itâs more common to save the probability of the choice that was actually made rather than the most probable outcome.
Might I suggest doing the posterior predictions out of Stan? That way you can batch it a lot easier.