HB-MNL with ragged data (i.e., variation in obs, outcomes)

HB logit is easy in Stan. However, if we allow the number of observations/choices to vary between subjects, it is a little harder. If we also allow the number of possible outcomes/choice alternatives to vary between subjects and tasks, it gets even harder.

Data needs to be organized in a long format, and one needs to be good at indexing.
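To make the indexing concrete: the long X vector below stores the nvar covariates of an alternative as one contiguous block, the alternatives of a scenario back to back, and the scenarios of a respondent one after another. A single scenario is then recovered with the segment/to_matrix pattern used in the model further down (a sketch; pos points at the first covariate of the scenario's first alternative):

// columns of the reshaped segment are alternatives, rows are covariates,
// so the transpose is the (alternatives x covariates) design matrix of scenario rs
matrix[nalts[rs], nvar] x_s
  = to_matrix(segment(X, pos, nvar * nalts[rs]), nvar, nalts[rs])';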

So I did that, and the model seems to work fine. However, it isn't particularly fast.

Suspected issues:
I suspect that some copies are made in memory instead of working through pointers, because I am using both segment and to_matrix. If I were writing the Eigen / Armadillo code myself, I'd probably use advanced constructors (e.g., Eigen::Map) to keep using a pointer.
I also think the recursive nature of the indexing with pos and rs makes it hard to parallelize. If I were coding this from scratch, I'd generate a lookup matrix / lookup vectors first, and then I could access the right portions of the long data in parallel (e.g., using an OpenMP #pragma for).

So here is the model. I’d appreciate any ideas that help make it more efficient.

Performance: Using a synthetic example (nresp = 100, nsc[r] = 30 for all r), 2000 draws take me 580 seconds. Using a non-ragged data structure, it is down to 220 seconds, i.e. about 2.6 times slower to allow for ragged data. Ouch.


data {
  //dimensions
  int<lower=1> nresp;           // # of respondents
  int<lower=1> nvar;            // # of covariates of alternatives
  int<lower=0> nz;              // # of respondent covariates, i.e. upper level
  int<lower=1> ntot;            // total # of obs
  int<lower=1> ntotsc;          // total # of scenarios
  int nsc[nresp];               // # of scenarios for each resp
  int<lower=1> nalts[ntotsc];   // number of alternatives in each scenario

  //data
  matrix[nz, nresp] Z;      //'upper level' covariates
  int<lower=1> Y[ntotsc];   // observed choices (long format, 1-d array)
  vector[ntot] X;           // choice alternative covariates (long format, vector)
}

parameters {
  matrix[nvar, nresp] Beta;
  matrix[nvar, nz] Theta;
  corr_matrix[nvar] Omega;
  vector<lower=0>[nvar] tau;
}

transformed parameters {
  cov_matrix[nvar] Sigma = quad_form_diag(Omega, tau);
}


model {
  int rs;  //unique scenario counter, position in y array
  int pos; //position in x vector
  rs = 1;
  pos = 1;
  
//priors
  to_vector(Theta) ~ normal(0, 15);
  tau ~ cauchy(0, 2.5); 
  Omega ~ lkj_corr(2);
  
//likelihood  
  for (r in 1:nresp) {
    Beta[,r] ~ multi_normal(Theta*Z[,r], Sigma);	
    
    //go through all scenarios of the r-th resp
    for (s in 1:nsc[r]){
   
      //actual ll
      Y[rs] ~ categorical_logit(to_matrix(segment(X, pos, nvar*nalts[rs]),nvar,nalts[rs])'*Beta[,r]);
      
      //update counter
      pos=pos+nvar*nalts[rs]; 
      rs=rs+1; 
    }
  }
}

Booted back into Linux, and it runs faster. However, the long-format / ragged code is still behind (260 seconds vs. 60 seconds for the dense version).

So I changed the model to use pre-specified indexing vectors that are loaded as data (see code below). This brought the time down to 88 seconds, much closer to the dense-array version of the code (a factor of about 1.46, down from 2-4).

I still see little multicore use (in either the ragged or the dense version). Out of 32 cores, only one is fully loaded, and maybe 4 more see partial load.

I am not quite sure where the speed gains come from or whether copies are avoided. Running it a few more times, the ragged version lands in a 70-90 second window and the dense version in 60-70 seconds, so the gap is not so bad. Still, I would like this to run even faster, and I don't see why the individual log-likelihood contributions aren't parallelized more efficiently.


data {
  //dimensions
  int<lower=1> nresp;           // # of respondents
  int<lower=1> nvar;            // # of covariates of alternatives
  int<lower=0> nz;              // # of respondent covariates, i.e. upper level
  int<lower=1> ntot;            // total # of obs
  int<lower=1> ntotsc;          // total # of scenarios
  int nsc[nresp];               // # of scenarios for each resp
  int<lower=1> nalts[ntotsc];   // number of alternatives in each scenario

//lookups
  int lookup1a[nresp];     // first row in the lookup2* vectors for each respondent
  int lookup1b[nresp];     // last row in the lookup2* vectors for each respondent
  int lookup2pos[ntotsc];  // starting position of each scenario's block in the long X vector
  int lookup2len[ntotsc];  // length (nvar * nalts) of each scenario's block in X
  int lookup2nalt[ntotsc]; // number of alternatives in each scenario
  int lookup2rs[ntotsc];   // position of each scenario's choice in Y

  //data
  matrix[nz, nresp] Z;      //'upper level' covariates
  int<lower=1> Y[ntotsc];   // observed choices (long format, 1-d array)
  vector[ntot] X;           // choice alternative covariates (long format, vector)
}

parameters {
  matrix[nvar, nresp] Beta;
  matrix[nvar, nz] Theta;
  corr_matrix[nvar] Omega;
  vector<lower=0>[nvar] tau;
}

transformed parameters {
  cov_matrix[nvar] Sigma = quad_form_diag(Omega, tau);
}


model {
//priors
  to_vector(Theta) ~ normal(0, 15);
  tau ~ cauchy(0, 2.5); 
  Omega ~ lkj_corr(2);
  
//likelihood  
  for (r in 1:nresp) {
    Beta[,r] ~ multi_normal(Theta*Z[,r], Sigma);	
    
    //go through all scenarios of the r-th resp
    for (s in lookup1a[r]:lookup1b[r]){

      //actual ll
      Y[lookup2rs[s]] ~ categorical_logit(to_matrix(segment(X, lookup2pos[s], lookup2len[s]),nvar,lookup2nalt[s])'*Beta[,r]);
      
    }
  }
}

R code for simulation of ragged data:

library(dplyr)   # needed for the pipe and the lookup construction below

#simulation set-up
  nresp = 100
  nsc   = rep(30,nresp); ntotsc= sum(nsc)
  nalts = rep(3,ntotsc)
  Theta = matrix(rep(1, 4), nrow=1) 
  Sigma = diag(0.1, 4)

#dimensions
  nvar  =ncol(Theta)
  ntot = sum(nalts*nvar)
  
#attributes
  X <- rnorm(ntot)      # normal attributes
  
#output storage
  Y <- rep(NA,ntotsc)
  
# no upper level model
  Z <- array(1, dim=c(1, nresp)) 
  nz=nrow(Z)
  
# i-level parameters
  Beta <- array(dim=c(nvar, nresp))
  for (r in 1:nresp) {
    Beta[,r] = Z[,r]%*%Theta + matrix(rnorm(nvar), ncol = nvar) %*% chol(Sigma)
  }
  
  lookup=c()

  rs=1
  pos=1
  for (r in 1:nresp){
    for (s in 1:nsc[r]){
      #logit choice
      x=matrix(X[pos:(pos+nvar*nalts[rs]-1)],nalts[rs],nvar,byrow=T)
      Y[rs] = sample(x=nalts[rs], size=1, prob=exp(x%*%Beta[,r])) 
      
      lookup=rbind(lookup,c(r,s,rs,pos,nvar*nalts[rs],nalts[rs]))
      
      #counters (update pos before rs so the current scenario's length is used)
      pos=pos+nvar*nalts[rs]
      rs=rs+1
    }
  }
  
  colnames(lookup)=c('r','s','rs','pos','len','nalt')
  

  lookup1=lookup %>% data.frame %>% group_by(r) %>% summarise(fr=min(rs),to=max(rs)) %>% select(-r)
  lookup2=lookup %>% data.frame
  
  testdata= 
    list(nresp=nresp,   # of respondents
         nvar=nvar,     # of covariates of alternatives
         nz=nz,         # of respondent covariates, i.e. upper level
         ntot=ntot,     # of obs
         ntotsc=ntotsc, # of scenarios
         nsc=nsc,       # of scenarios for each resp
         nalts=nalts,   # number of alternatives in each scenario
         Y=Y, X=X, Z=Z, 
         Beta.true=Beta, Theta.true=Theta, Sigma.true=Sigma,
         lookup1a=lookup1$fr%>% as.integer(),lookup1b=lookup1$to%>% as.integer(),
         lookup2pos=lookup2$pos%>% as.integer(),lookup2len=lookup2$len%>% as.integer(),lookup2nalt=lookup2$nalt%>% as.integer(),lookup2rs=lookup2$rs%>% as.integer())



You can get a speedup by working with the Cholesky version.

cholesky_factor_corr[nvar] L_Omega;
L_Omega ~ lkj_corr_cholesky(2.0);
matrix[nvar, nvar] L_Sigma = diag_pre_multiply(tau, L_Omega);

for (r in 1:nresp) {
Beta[,r] ~ multi_normal_cholesky(Theta*Z[,r], L_Sigma);

However, I'd recommend replacing the multi_normal_cholesky with a non-centered parameterization:

Beta[,r] = Theta*Z[,r] + L_Sigma * beta_par[,r];

with beta_par a parameter of the same dimensions as Beta, Beta a transformed parameter, and a prior for beta_par

to_vector(beta_par) ~ std_normal();
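
Putting those pieces together, a minimal sketch of the non-centered version (assuming beta_par is declared as matrix[nvar, nresp], Beta moves to transformed parameters, and the per-respondent mean is written as one matrix product; the likelihood loop stays as before):

parameters {
  matrix[nvar, nresp] beta_par;       // standardized respondent-level effects
  matrix[nvar, nz] Theta;
  cholesky_factor_corr[nvar] L_Omega;
  vector<lower=0>[nvar] tau;
}

transformed parameters {
  matrix[nvar, nvar] L_Sigma = diag_pre_multiply(tau, L_Omega);
  // non-centered parameterization: mean plus correlated, scaled standard normals
  matrix[nvar, nresp] Beta = Theta * Z + L_Sigma * beta_par;
}

model {
  to_vector(Theta) ~ normal(0, 15);
  tau ~ cauchy(0, 2.5);
  L_Omega ~ lkj_corr_cholesky(2.0);
  to_vector(beta_par) ~ std_normal();
  // ... likelihood loop over respondents and scenarios unchanged ...
}

If the covariance matrix itself is of interest, it can be recovered in generated quantities via multiply_lower_tri_self_transpose(L_Sigma).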

Second, there is also a variant of categorical_logit that uses OpenCL, i.e. GPU support.


Thanks. I'll try the Cholesky-decomposition version. However, I'd expect more gain from parallel computation of categorical_logit. While GPU support is interesting, the issue seems to be that the Y[rs] ~ categorical_logit part is not parallelised at all; running it on 16 CPU cores would be good enough for me. Is there any way I can see the C++ code that Stan generates?

There is a function categorical_logit_glm_lpmf which goes beyond categorical_logit; I think you can use it in your case. You may also consider map_rect for multiple CPUs.

The benefits of the Cholesky decomposition should by no means be underestimated. Stan has to calculate derivatives for every operation, and these are memory-intensive dynamic operations. A full matrix inversion is O(n^3), plus its derivatives; with the Cholesky factor you only need triangular solves in O(n^2) and get the log-determinant from the diagonal, so the whole matrix-inversion process is avoided.

Thank you. I will also look at categorical_logit_glm_lpmf.

I considered map_rect; however, the variable sizes of x may make it tricky. The map_rect examples I have seen use fixed-size arrays.

With map_rect one has to partition the data, sure. reduce_sum is “very much converging”, so there will be another way soon.
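
For what it's worth, once reduce_sum is available the scenario loop could be sliced over Y roughly like this (a sketch, not tested; it assumes an additional integer data array lookup2r mapping each scenario to its respondent, and grainsize is illustrative):

functions {
  real scenario_ll(int[] y_slice, int start, int end,
                   vector X, matrix Beta, int nvar,
                   int[] lookup2pos, int[] lookup2len,
                   int[] lookup2nalt, int[] lookup2r) {
    real lp = 0;
    for (s in start:end) {
      // same segment / to_matrix unpacking as in the serial model
      lp += categorical_logit_lpmf(y_slice[s - start + 1] |
              to_matrix(segment(X, lookup2pos[s], lookup2len[s]),
                        nvar, lookup2nalt[s])' * Beta[, lookup2r[s]]);
    }
    return lp;
  }
}

and in the model block the nested loops over respondents and scenarios would be replaced by

  target += reduce_sum(scenario_ll, Y, grainsize, X, Beta, nvar,
                       lookup2pos, lookup2len, lookup2nalt, lookup2r);

while the multi_normal prior on Beta[,r] keeps its own loop.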