Hello, I am currently adapting and extending an algorithm, NetRate (Gomez-Rodriguez et al 2011, described here: http://snap.stanford.edu/class/cs224w-readings/rodriguez11diffusion.pdf) in RStan to estimate likely transmission pathways for infectious diseases using data about locations and times of cases showing symptoms. It works well on smaller datasets (N=100) but I am looking to apply the algorithm to larger datasets (N=10,000) and am keen to improve the speed of the model (perhaps through running in parallel / vectorization). Does anyone have any advice for how best to improve the speed?
The code is below, and I can also provide dummy data for reproducibility in RStan if useful.
//data comprising of
//n total cases (all infectors)
//q total locally acquired cases (all potential infectees)
//NI is vector of locally acquired cases
//z dimensions of matrix n*q
//dt times between symptom onset of pairs of cases
//dis distance between pairs of cases
data {
int n;
int q;
int NI[q];
int z;
vector[z] dt;
vector[z] dis;
}
//parameters - A determines connectedness of pairs in network
//epsilon edge - chance of being infected by outside source
//distance_scale - param shaping distance likelihood functon
parameters {
vector < lower = 0, upper = 0.01 > [z] A;
vector < lower = 0 > [q] epsilon_edge;
real < lower = 0, upper = 100 > distance_scale;
}
transformed parameters {}
//model adapted from NetRate (Gomexz-Rodriguez,2014)
model {
real H = 0;
real S = 0;
int counter = 1;
//priors
for (i in 1: z) {
A[i] ~normal(0.005, 0.001) T[0, 0.01];
}
for (i in 1: q) {
epsilon_edge[i] ~exponential(10);
}
distance_scale~exponential(10) T[0, 100];
counter = 1;
for (i in 1: q) {
H = 0;
//sum of log hazards of infection across all candidate infectees
for (j in 1: (NI[i])) {
H = H + A[counter] * fmax(dt[counter], 0) * distance_scale * exp(-dis[counter] * distance_scale); //Rayleigh Hazard is b+at
counter = counter + 1;
}
target += log(H + epsilon_edge[i]);
}
counter = 1;
//sum of log survivals across infectees - chances of reaching time t without being infected by another case
for (i in 1: q) {
for (k in 1: (NI[i])) {
target += -A[counter] * ((fmax(dt[counter], 0) ^ 2) * 0.5); //Rayleigh Survival on log is -a*0.5*t*t
counter = counter + 1;
}
target += -epsilon_edge[i];
}
}
Thanks!