Tips for speeding up model

Hello, I am currently adapting and extending an algorithm, NetRate (Gomez-Rodriguez et al 2011, described here: http://snap.stanford.edu/class/cs224w-readings/rodriguez11diffusion.pdf) in RStan to estimate likely transmission pathways for infectious diseases using data about locations and times of cases showing symptoms. It works well on smaller datasets (N=100) but I am looking to apply the algorithm to larger datasets (N=10,000) and am keen to improve the speed of the model (perhaps through running in parallel / vectorization). Does anyone have any advice for how best to improve the speed?

The code is below, and I can also provide dummy data for reproducibility in RStan if useful.

//data comprising of 
//n total cases (all infectors)
//q total locally acquired cases (all potential infectees)
//NI is vector of locally acquired cases
//z dimensions of matrix n*q
//dt times between symptom onset of pairs of cases
//dis distance between pairs of cases
data {
  int n;
  int q;
  int NI[q];
  int z;
  vector[z] dt;
  vector[z] dis;
}
//parameters - A  determines connectedness of pairs in network
//epsilon edge - chance of being infected by outside source
//distance_scale - param shaping distance likelihood functon
parameters {
  vector < lower = 0, upper = 0.01 > [z] A;
  vector < lower = 0 > [q] epsilon_edge;
  real < lower = 0, upper = 100 > distance_scale;
}
transformed parameters {}
//model  adapted from NetRate (Gomexz-Rodriguez,2014)
model {
  real H = 0;
  real S = 0;
  int counter = 1;
  //priors
  for (i in 1: z) {
    A[i] ~normal(0.005, 0.001) T[0, 0.01];
  }
  for (i in 1: q) {
    epsilon_edge[i] ~exponential(10);
  }
  distance_scale~exponential(10) T[0, 100];
  counter = 1;
  for (i in 1: q) {
    H = 0;
    //sum of log hazards of infection across all candidate infectees
    for (j in 1: (NI[i])) {
      H = H + A[counter] * fmax(dt[counter], 0) * distance_scale * exp(-dis[counter] * distance_scale); //Rayleigh Hazard is b+at
      counter = counter + 1;
    }
    target += log(H + epsilon_edge[i]);
  }
  counter = 1;
  //sum of log survivals across infectees - chances of reaching time t without being infected by another case
  for (i in 1: q) {
    for (k in 1: (NI[i])) {
      target += -A[counter] * ((fmax(dt[counter], 0) ^ 2) * 0.5); //Rayleigh Survival on log is -a*0.5*t*t 
      counter = counter + 1;
    }
    target += -epsilon_edge[i];
  }

}

Thanks!

1 Like