Issues with slow sampling

Hi all,

I am learning to use Stan and attempting to build a model to find rings in a 2D image. The rings are circular with a Gaussian profile in the radial direction, and an image can contain several rings. We have written a model that fits the central coordinates of the rings, the ring widths, the distance between the rings, and the location of the first ring. We enforce that the distance between rings is constant, so the location of the first ring and the ring spacing are enough to specify the positions of the remaining rings.
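In symbols, ring $j$ then has radius $r_j = r_1 + (j - 1)\,\delta$ for $j = 1, \dots, N_r$, where $r_1$ is the radius of the first ring and $\delta$ is the common spacing between rings.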

There is a simple Python function below that we can use to generate the data:

import numpy as np

def make_some_rings(x0, y0, xcoords, ycoords, radii, widths):
    """
    Generate an image containing one Gaussian-profile ring per entry of radii
    (e.g. 15 rings on a 240x80 2D image).
    Args:
        x0 (float): x-coordinate of the centre of the rings.
        y0 (float): y-coordinate of the centre of the rings.
        xcoords (1D-array): x-coordinates of the 2D image pixels.
        ycoords (1D-array): y-coordinates of the 2D image pixels.
        radii (list): radius of each ring.
        widths (list): standard deviation of each ring's radial profile, in pixels.
    Returns:
        1D-array: pixel values of the image, same shape as xcoords.
    """
    # Generate radial position for every coordinate in the image
    pixel_radii = np.sqrt((xcoords-x0)**2 + (ycoords-y0)**2)
    image = np.zeros(xcoords.shape)
    for i,dist in enumerate(radii):
        # For each ring add a radial gaussian to each pixel
        image += np.exp(-(pixel_radii-dist)**2/(2*widths[i]**2))
    return image

We used 240x80 images and flattened them to a 1D vector for analysis in our model. We adopt a statistical model we have previously found to work in other contexts. More specifically, we analytically integrate out an uninformative prior on both the ring amplitudes and the variance of the assumed Gaussian noise, which leaves a Student-t likelihood.
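Concretely, the quantity our model block adds to target is (up to an additive constant), writing $G$ for the $N_d \times N_r$ matrix G_theta and $d$ for the vector of pixel counts:

$$\log p(d \mid \theta) = -\tfrac{1}{2}\log\left|G^{\top}G\right| - \frac{N_d - N_r}{2}\,\log\!\left(d^{\top}d - d^{\top}G\,(G^{\top}G)^{-1}G^{\top}d\right) + \text{const},$$

so term_1, term_2, term_3 and term_4 in the code correspond to $|G^{\top}G|^{-1/2}$, $d^{\top}d$, $d^{\top}G(G^{\top}G)^{-1}$ and $G^{\top}d$ respectively.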
Our Stan model is shown below:


data {
    int<lower=1> N_d; // length of data vector
    vector[N_d] pixels; // counts for all pixels
    int<lower=1> N_r; // Number of rings
    vector[N_d] x_pos; // x coordinate of every datapoint in 2D image
    vector[N_d] y_pos; // y coordinate of every datapoint in 2D image

}

parameters {
    real<lower=0,upper=1> delta_unit; // Distance between rings
    real<lower=0,upper=1> r_1_unit; // Radius of first ring
    real<lower=0,upper=1> x_0_unit; // x-coordinate of circle centre
    real<lower=0,upper=1> y_0_unit; // y-coordinate of circle centre
    vector<lower=0,upper=1>[N_r] sigma_r_unit; // Radial extent of each ring

}

transformed parameters {
    real delta = (delta_unit*10) + 8;
    real r_1 = (r_1_unit*30) + 5;
    real x_0 = (x_0_unit*35) + 85;
    real y_0 = (y_0_unit*20)+ 240;
    vector[N_r] sigma_r = (sigma_r_unit*2) + 1;

}

model {
    matrix[N_d, N_r] G_theta; // Gaussian ring profile at every pixel, one column per ring
    vector[N_r] r;
    real term_1;
    real combined_2_3_4;
    real term_2;
    row_vector[N_r] term_3;
    vector[N_r] term_4;

    for (j in 1:N_r) {
      r[j] = r_1 + (delta * (j - 1));
      G_theta[:, j] = exp(-square(sqrt(square(x_pos - x_0) + square(y_pos - y_0)) - r[j]) / (2 * square(sigma_r[j])));
    }


    term_1 = 1 / sqrt(abs(determinant(G_theta' * G_theta)));
    term_2 = dot_product(pixels', pixels);
    term_3 = pixels'*G_theta * inverse(G_theta'* G_theta);
    term_4 = G_theta' * pixels;
    combined_2_3_4 = ((N_r - N_d) / 2.0) * log(term_2 - term_3 * term_4);
   
    target += log(term_1) + combined_2_3_4;

    // Prior
    delta_unit ~ uniform(0, 1);
    r_1_unit ~ uniform(0, 1);
    x_0_unit ~ uniform(0, 1); 
    y_0_unit ~ uniform(0, 1);
    sigma_r_unit ~ uniform(0, 1);

}

Issues we are encountering:
• Sampling and warm-up are prohibitively slow, and we frequently see exceptions that suggest over-constrained parameters involving G_theta: "Exception: multiply: B[1] is nan, but must not be nan! (in 'Stan_stage_1.stan' at line 51)". This predominantly appears during the warm-up phase.
• The optimiser gives inconsistent results; is this because the likelihood has many local minima?

Any help with these issues would be greatly appreciated.

Thanks,

David

Hi again,

In the interests of providing a bit more detail, here are some typical values for the data we are dealing with:

N_d = 240*80
N_r ≃ 15
delta ≃ 12
r_1 ≃ 25
x_0 ≃ 95
y_0 ≃ 250

Hope that helps.

Best regards,

David

Looks like profiling is coming “soon”: https://github.com/stan-dev/cmdstan/issues/960

The other idea (thanks @bgoodri) mentioned in a conversation just now was to write a bespoke C++ function for the bit you suspect is slow (see example here: https://github.com/stan-dev/rstanarm/blob/master/inst/include/csr_matrix_times_vector2.hpp). I suspect that'll be the dot products of sparse vectors with one another that are implicitly involved in calculating the elements of G_theta' * G_theta, and the dot products of a sparse vector with a dense vector involved in calculating the elements of G_theta' * pixels.

Yeah so it’ll be really easy shortly (edit: if we successfully add this feature in 2.26), or not-any-less-difficult-than-it-currently-is shortly in the future (edit: if we fail to add this feature in 2.26) :D.

Simon said there were a lot of zeroes. Which things had lots of zeros?

In this case, I would write out the complexity for each of the operations and the assumptions for the matrices. Like is G_theta' * G_theta positive definite?
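For the sizes quoted above (N_d = 19200, N_r ≈ 15) that works out roughly as follows: building G_theta evaluates about N_d * N_r ≈ 2.9e5 exponentials per log-density evaluation, forming G_theta' * G_theta costs on the order of N_d * N_r^2 ≈ 4.3e6 multiply-adds, while the determinant or inverse of the resulting 15 x 15 matrix is only O(N_r^3) ≈ 3.4e3 operations. So the element-wise construction of G_theta and the big products should dominate, not the small decomposition.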

You might be able to save time by doing something like:

matrix[N_r, N_r] chol_GG = cholesky_decompose(G_theta' * G_theta);

And then you can compute the determinant from this, and then probably do the solve here:

pixels'*G_theta * inverse(G_theta'* G_theta)

more quickly (make sure to do the pixels' * G_theta product before the solve, etc.).
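A fuller sketch of that idea (untested; gtp, z and log_det_gtg are names I'm making up here, and it assumes G_theta' * G_theta is positive definite so the Cholesky factorisation exists):

matrix[N_r, N_r] chol_GG = cholesky_decompose(G_theta' * G_theta);
row_vector[N_r] gtp = pixels' * G_theta;   // do this product first
// log|G_theta' * G_theta| = 2 * sum(log(diagonal(chol_GG)))
real log_det_gtg = 2 * sum(log(diagonal(chol_GG)));
// term_3 * term_4 = gtp * inverse(G_theta' * G_theta) * gtp',
// which one triangular solve gives without ever forming the inverse
vector[N_r] z = mdivide_left_tri_low(chol_GG, gtp');
target += -0.5 * log_det_gtg
          + ((N_r - N_d) / 2.0) * log(dot_self(pixels) - dot_self(z));

The point is that the same Cholesky factor gives you both the log determinant (from its diagonal) and the solve, and dot_self(z) replaces term_3 * term_4 without computing inverse(G_theta' * G_theta).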

@bbbales2: It is G_theta that is sparse.

A development version of "soon" is here: https://github.com/stan-dev/cmdstan/issues/960#issuecomment-748466895

Example link here: https://github.com/stan-dev/cmdstan/issues/960#issuecomment-748328670

It might break cuz it’s development, but this will give you the info you need!

@bbbales2: Thanks for the links to the profiling project (and for your earlier suggestions). We will definitely try them out and post an update.

Here are profiling results from a small sampling run (25 warm-up and 50 sampling iterations) with 2 chains. They show that generating the G_theta matrix is very slow, as is calculating the terms involving matrix multiplications with G_theta' * G_theta.

Profiling information (chain 1):

name, total, fwd, rev
calculating combined, 0.0960929,0.0802959,0.015797
calculating term_1, 583.702,236.118,347.584
calculating term_2, 1.11552,1.11169,0.00383175
calculating term_3, 605.202,278.712,326.49
calculating term_4, 91.9088,49.2647,42.6441
generating g_theta, 2243.6,1436.43,807.1
priors, 0.21295,0.18607,0.0268796
updating lp, 0.0826979,0.0643497,0.0183482

Profiling information (chain 2):

name, total, fwd, rev
calculating combined, 0.090818,0.0753556,0.0154624
calculating term_1, 557.624,226.966,330.657
calculating term_2, 1.06241,1.05872,0.00368262
calculating term_3, 586.637,267.351,319.287
calculating term_4, 90.1564,48.0262,42.1302
generating g_theta, 2203.23,1404.38,798.846
priors, 0.215369,0.188583,0.0267864
updating lp, 0.0687928,0.0511221,0.0176707

Here is the model block with the profile statements:

model {
    matrix[N_d, N_r] G_theta; // Gaussian ring profile at every pixel, one column per ring
    vector[N_r] r;
    real term_1;
    real combined_2_3_4;
    real term_2;
    row_vector[N_r] term_3;
    vector[N_r] term_4;

    {
    profile("generating g_theta");
    for (j in 1:N_r) {
      r[j] = r_1 + (delta * (j - 1));
      G_theta[:, j] = exp(-square(sqrt(square(x_pos - x_0) + square(y_pos - y_0)) - r[j]) / (2 * square(sigma_r[j])));
    }
    }
    
    {
    profile("calculating term_1");
    term_1 = 1 / sqrt(abs(determinant(G_theta' * G_theta)));
    }
    
    {
    profile("calculating term_2");
    term_2 = dot_product(pixels', pixels);
    }
    {
    profile("calculating term_3");
    term_3 = pixels'*G_theta * inverse(G_theta'* G_theta);
    }
    {
    profile("calculating term_4");
    term_4 = G_theta' * pixels;
    }
    {
    profile("calculating combined");
    combined_2_3_4 = ((N_r - N_d) / 2.0) * log(term_2 - term_3 * term_4);
    }
  
    {
    profile("updating lp");
    target += log(term_1) + combined_2_3_4;
    }

    // Prior
    {
    profile("priors");
    delta_unit ~ uniform(0, 1); 
    r_1_unit ~ uniform(0, 1); 
    x_0_unit ~ uniform(0, 1); 
    y_0_unit ~ uniform(0, 1); 
    sigma_r_unit ~ uniform(0.9, 1.1); 
    }

}
It looks like you will get some speed improvements by spotting that:

  1. you can compute gtg = G_theta' * G_theta once and then calculate both the determinant and the inverse of gtg.
  2. you can compute gtp = pixels' * G_theta once and then exploit the fact that term_4 is equal to gtp'.

However, I think the key to making this run fast will be to calculate (a high-fidelity approximation to) gtg and gtp directly, without explicitly calculating G_theta, in a way that exploits the fact that G_theta is (well approximated as) sparse.
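To spell that out, writing $\rho_i$ for the radius of pixel $i$ (as computed in the ring loop), $d_i$ for the pixel counts, and $r_j$, $\sigma_j$ for the ring radii and widths:

$$(G^{\top}G)_{jk} = \sum_{i=1}^{N_d}\exp\!\left(-\frac{(\rho_i - r_j)^2}{2\sigma_j^2} - \frac{(\rho_i - r_k)^2}{2\sigma_k^2}\right), \qquad (G^{\top}d)_j = \sum_{i=1}^{N_d} d_i\,\exp\!\left(-\frac{(\rho_i - r_j)^2}{2\sigma_j^2}\right).$$

A pixel only contributes appreciably when $\rho_i$ lies within a few $\sigma$ of the ring(s) involved, so with $\delta \approx 12$ and $\sigma_r$ between 1 and 3 each sum only needs a thin annulus of pixels and $G^{\top}G$ should be very nearly diagonal.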

@bbbales2: Just thinking about this:

  1. is it likely that we’ll lose a lot of speed because we will be calculating the dot products as an explicit for loop rather than as a one-line instruction?
  2. do dot products in Stan already use the map-reduce work that I recall offering significant speed-ups?
  3. how can we adapt the Stan code to use GPUs to calculate the dot products efficiently?

I’m not sure what version of Stan we are using, but @david_p can probably confirm.

Most of the work has been carried out using PyStan version 2.19.1.1.
We have also used the development version of CmdStan posted by @bbbales2 for profiling.

Cheers

Yeah. Building matrices like this sometimes ends up being more expensive than the decomposition (edit: or, if not more expensive, surprisingly expensive relative to it). It happens in GPs too. It's just that so much stuff ends up getting pushed onto the autodiff stack (one entry per element). Even though a lot of those operations are vectorized at the interface level, they're still creating a ton of temporaries for every element in each operation.
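One concrete example of that in this model (a sketch, untested): the pixel-radius expression inside the ring loop doesn't depend on j, so you can hoist it out and evaluate those N_d-length square/sqrt terms once instead of N_r times in the "generating g_theta" hotspot:

// Compute the radius of every pixel once, outside the ring loop
vector[N_d] pixel_radii = sqrt(square(x_pos - x_0) + square(y_pos - y_0));
for (j in 1:N_r) {
  r[j] = r_1 + delta * (j - 1);
  // Only the ring-specific part stays inside the loop
  G_theta[:, j] = exp(-square(pixel_radii - r[j]) / (2 * square(sigma_r[j])));
}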

@tadej can you comment here? I think the 2.26 situation would probably change things too if everything works out (there’s a bunch of GPU stuff in it)

Dot products are not very efficient to calculate on GPUs, because of the large amount of data that has to be copied to the GPU relative to the amount of computation. That changes, however, if you can rewrite your model in a way that replaces many dot products with a single matrix product without too much extra calculation.

Even if you cannot, you can wait for the next release, which will enable you to run OpenCL on a CPU or an integrated GPU without copying the data, and will probably add an OpenCL implementation for the dot product. That might offer some speedup compared to the sequential implementation.
