Speeding up a large double sum in the log-likelihood

I’m fitting a large self-exciting point process model (preprint here). The data are the location and times of thousands of events (crimes), and each event temporarily increases the risk of more events in its vicinity. The risk decays exponentially in time and like a Gaussian in space.

Unfortunately the log-likelihood for point processes is pretty hairy: the model specifies an intensity function lambda(s, t), the rate of events at every point in time and space, and the log-likelihood is (eq 4 in the paper)

\sum_i \log \lambda(s_i, t_i) - \int \lambda(s, t) \, ds \, dt

And since lambda(s, t) involves a sum over all previous events (eq 5), the log-likelihood is a double sum that’s O(n^2) in the number of events.
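
Concretely, as implemented in the code below, the intensity has the form (writing \omega for lengthscaleT and \sigma for lengthscaleS)

\lambda(s, t) = \exp(x(s)^\top \beta) + \sum_{j : t_j < t} \frac{\theta}{\omega} e^{-(t - t_j)/\omega} \cdot \frac{1}{2 \pi \sigma^2} e^{-\lVert s - s_j \rVert^2 / (2 \sigma^2)},

where x(s) are the spatial covariates, so evaluating \lambda at event i requires a sum over all i - 1 earlier events.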

I implemented maximum likelihood estimation using EM and a bunch of custom parallelized code to speed up that sum. Now I’m trying to implement the model in Stan. I increment target directly using the sum:

transformed parameters {
  // intensity experienced by each event
  vector[n] tents;

  real lp;

  tents = exp(event_covariates * beta);

  for (i in 2:n) {
    tents[i] += ((theta / lengthscaleT) *
                 dot_product(exp(timeD[i, 1:(i - 1)] / lengthscaleT),
                             gaussian(spaceD2[i, 1:(i - 1)], lengthscaleS)));
  }

  // full log-likelihood
  lp = sum(log(tents)) - time_window * dot_product(cell_areas, exp(cell_covariates * beta)) +
          theta * (sum(exp(timeD2 / lengthscaleT)) - n);
}

model {
  target += lp;

  lengthscaleS ~ uniform(0, 1000);

  // Gamma(shape, rate), so mean is shape/rate, variance shape/rate^2
  // measured in seconds
  lengthscaleT ~ gamma(1, 1.0 / (60.0 * 60 * 24 * 60));

  theta ~ uniform(0, 1);

  beta[1] ~ normal(-30, 5);
  for (i in 2:p) {
    beta[i] ~ normal(0, 5);
  }
}

(complete model code is below)

There are a bunch of equivalent ways to write that double sum – nested for loops, rearranging the arithmetic expressions – but I’m new to Stan and don’t have any intuition about which way is best. Is there an obviously better way to write the loop so it’s most efficient for Stan?
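
For example, one equivalent version with explicit nested loops instead of dot_product would be something like this (untested; it should compute the same quantity as the loop above):

for (i in 2:n) {
  real excitation = 0;
  for (j in 1:(i - 1)) {
    excitation += exp(timeD[i, j] / lengthscaleT)
                  * exp(-0.5 * spaceD2[i, j] / lengthscaleS^2)
                  / (2 * pi() * lengthscaleS^2);
  }
  tents[i] += (theta / lengthscaleT) * excitation;
}

I don’t know whether one form autodiffs more efficiently than the other.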

(Right now, for reasonable datasets of about 2000 events, 2000 draws take about 10 hours on a fast machine. I can implement the method with naive Metropolis-Hastings and my custom parallelized code and draw in 30 minutes, but with much worse mixing.)

Full model code:

// Adapted from https://github.com/flaxter/gunshot-contagion
// by Charles Loeffler and Seth Flaxman

functions {
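  // Isotropic 2D Gaussian density evaluated at squared distances dist2s,
  // with standard deviation `lengthscale` in each coordinate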
  row_vector gaussian(row_vector dist2s, real lengthscale) {
    return(exp(-0.5 * dist2s / lengthscale^2) / (2 * pi() * lengthscale^2));
  }
}

data {
  // Total number of events
  int<lower=1> n;

  // Total number of covariates
  int<lower=1> p;

  // Total number of covariate cells
  int<lower=1> C;

  // Event coordinates (in feet)
  matrix[n,2] Space;

  // Event times (in seconds)
  // NB: these must be sorted from smallest to largest!
  vector[n] Time;

  // Total length of time observed
  real time_window;

  // Covariate value at each event
  matrix[n,p] event_covariates;

  // Covariate values in each covariate cell
  matrix[C,p] cell_covariates;

  // Total land area of each covariate cell
  vector[C] cell_areas;
}

transformed data {
  // Time elapsed between each event and predecessors
  matrix[n,n] timeD;

  // Time between each event and the end of the observation period
  vector[n] timeD2;

  // Squared Euclidean distance between each event
  matrix[n,n] spaceD2;

  for (i in 1:n) {
    for (j in 1:n) {
      timeD[i,j] = -(Time[i] - Time[j]);
      spaceD2[i,j] = distance(Space[i], Space[j])^2;
    }

    timeD2[i] = -(time_window - Time[i]);
  }
}

parameters {
  // Spatial decay distance (sigma, feet)
  real<lower=0> lengthscaleS;

  // Temporal decay time (omega, seconds)
  real<lower=0> lengthscaleT;

  // Self-excitation amount
  real<lower=0> theta;

  // Background covariate coefficients
  vector[p] beta;
}

transformed parameters {
  // intensity experienced by each event
  vector[n] tents;

  real lp;

  tents = exp(event_covariates * beta);

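  // self-excitation: add the triggering contribution of every earlier event
  // to event i's intensity (this relies on Time being sorted)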
  for (i in 2:n) {
    tents[i] += ((theta / lengthscaleT) *
                 dot_product(exp(timeD[i, 1:(i - 1)] / lengthscaleT),
                             gaussian(spaceD2[i, 1:(i - 1)], lengthscaleS)));
  }

  // full log-likelihood
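  // = sum of log intensities at the events, minus the integral of the
  //   background rate over space and time, minus the integral of each
  //   event's triggering kernel over the remaining observation window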
  lp = sum(log(tents)) - time_window * dot_product(cell_areas, exp(cell_covariates * beta)) +
          theta * (sum(exp(timeD2 / lengthscaleT)) - n);
}

model {
  target += lp;

  lengthscaleS ~ uniform(0, 1000);

  // Gamma(shape, rate), so mean is shape/rate, variance shape/rate^2
  // measured in seconds
  lengthscaleT ~ gamma(1, 1.0 / (60.0 * 60 * 24 * 60));

  theta ~ uniform(0, 1);

  beta[1] ~ normal(-30, 5);
  for (i in 2:p) {
    beta[i] ~ normal(0, 5);
  }
}

generated quantities {
  real lengthscale_days;

  lengthscale_days = lengthscaleT / 60 / 60 / 24;
}

The measure of speed that matters is effective sample size divided by time (or its inverse). Iteration speed doesn’t matter if the chains don’t mix. You typically only need an effective sample size of around 100 for expectation calculations (tail quantiles require many more draws).

Step 1 is to include the whole Stan program. When we don’t know what’s a parameter and what’s data, we can’t help you optimize.

If you know C++, the loops compile to the obvious C++ loops. In terms of log density evaluation speed, all the action is in the autodiff (formulas involving parameters or functions of them). There’s a chapter of the manual that goes over the basics. The important thing is that there’s also statistical efficiency to deal with.

The most important principle is to never repeat computations. For instance, you square the length scale over and over in your gaussian definition. Exponentiating is also going to be very prone to underflow if you’re out in the tails.

That’s why you want to vectorize wherever possible and pull divisions out of sums, etc.
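
For instance, something like this pulls the repeated scale factors out of the loop while keeping the dot_product vectorized (an untested sketch of the same computation as your loop):

{
  real sigma2 = square(lengthscaleS);
  real norm_const = theta / (lengthscaleT * 2 * pi() * sigma2);
  for (i in 2:n)
    tents[i] += norm_const
                * dot_product(exp(timeD[i, 1:(i - 1)] / lengthscaleT),
                              exp(-0.5 * spaceD2[i, 1:(i - 1)] / sigma2));
}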

Another important principle is to never use priors like this:

  lengthscaleS ~ uniform(0, 1000);

And if you insist on doing that, declare lengthscaleS with matching bounds.
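
For example (the numbers here are purely illustrative, not a recommendation for your problem):

// parameters block: if you keep the interval prior, the constraint should
// match its support
real<lower=0, upper=1000> lengthscaleS;

// model block: or drop the hard upper bound entirely and use something weakly
// informative, e.g. a half-normal implied by the <lower=0> constraint
lengthscaleS ~ normal(0, 500);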

We have a wiki page on prior recommendations (well Google indexed) as well as discussion in the manual.

The second most important principle is to always work on the log scale for as long as possible to avoid overflow and underflow. So you want to keep tents on the log scale, rather than computing it on the linear scale and then taking log(tents) later.

Yes, so I’ve been tracking effective sample size per minute as well; I get about 4-8 effective samples per minute under Stan, versus about 10-15 per minute for my custom code (though that depends on which parameter you look at, as some mix worse than others).

The whole program is in my original post. I split it up to call out the loop; sorry if that was confusing.

Good point, thanks. I’ll look at this and at working on the log scale, though I think I’m limited there because each entry of tents is a sum, so I can’t calculate the individual terms on the log scale.

I’ll also look at the prior choice recommendations page and see what I can do for each of my parameters.

Is there a Stan equivalent of a profiling tool? Something which, given a particular dataset, shows how much autodiff work is spent on each part of the log density, so I can see which expressions are worth optimizing and which duplicate expressions are most expensive? I find profiling immensely useful for my normal work, and it’d help my intuition here a lot if I knew where Stan was spending its effort.

Hi Alex,

To add to Bob’s suggestion of keeping things on the log scale, I think the log_sum_exp function for vectors should be helpful when calculating tents.
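
Something roughly like this, maybe (an untested sketch using the variable names from your code; sum(log_tents) would then replace sum(log(tents)) in lp):

vector[n] log_tents;
vector[n] Xb = event_covariates * beta;
real sigma2 = square(lengthscaleS);
real log_norm = log(theta) - log(lengthscaleT) - log(2 * pi() * sigma2);

log_tents[1] = Xb[1];
for (i in 2:n) {
  // log of each earlier event's contribution to event i's intensity
  row_vector[i - 1] log_terms = timeD[i, 1:(i - 1)] / lengthscaleT
                                - 0.5 * spaceD2[i, 1:(i - 1)] / sigma2
                                + log_norm;
  // combine the background and excitation terms in log space
  log_tents[i] = log_sum_exp(Xb[i], log_sum_exp(log_terms));
}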

Thanks, that turned out to be helpful: using log_sum_exp inside the loop seems to have cut sampling time by 30% (when initialized with the same data and seed).

I also noticed the manual’s discussion of matrices being stored in column-major order, and the suggestion to use arrays of vectors if you need efficient access to rows. I do a lot of indexing of the rows of timeD and spaceD2, so I converted each to an array of vectors. That seemed to improve runtime by about 10%.
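
In transformed data that’s roughly this change (with the loop then indexing rows as timeD[i][1:(i - 1)] and spaceD2[i][1:(i - 1)]):

// instead of matrix[n, n] timeD and matrix[n, n] spaceD2 (column-major, so
// pulling out a row gathers non-contiguous memory), store each row contiguously
row_vector[n] timeD[n];
row_vector[n] spaceD2[n];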

Some experiments show that sampling is reasonably practical for a few thousand events, but for ~10,000 events each gradient evaluation might take 20 seconds or more. I assume this is because the loop puts 10,000 terms in the gradient to be evaluated, but I don’t see any obvious ways to rewrite it in a single vectorized operation. I’ll have to keep thinking about it.

You’re right—just saw that. Not confusing, I just read through these things too quickly because there are so many of them.

Stan builds up the whole expression graph as the log density is evaluated, then propagates the chain rule in a backward pass. So the minimum you get for a function from \mathbb{R}^N \rightarrow \mathbb{R}^M is M new nodes connected to the N input nodes, with edges labeled with the partial derivatives (the whole thing is the Jacobian). When things get vectorized, it potentially reduces the size of the expression graph and allows shared calculations.
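
For example, assuming a and x are vectors of length N and at least one of them is a parameter, something like

// one new node holding N stored partials
real y1 = dot_product(a, x);

// versus on the order of 2 * N new nodes, one per multiply and one per add
real y2 = 0;
for (k in 1:N)
  y2 += a[k] * x[k];

adds a single node in the vectorized form and roughly 2N in the loop form, so the backward pass has far less to traverse.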

Using our normal_lpdf instead of your gaussian function (it already works on the log scale) should also be more efficient.