# Issues with slow sampling

Hi all,

I am learning to use Stan and attempting to build a model to find rings in a 2D image. These rings are circular in shape with a Gaussian profile in the radial direction, and an image can consist of a number of rings. We have written a model to fit to these images, by fitting to the central coordinates of the rings, the ring widths, the distance between the rings and the location of the first rings. We are enforcing that the distance between rings remains constant, so knowing the location of the first ring and the distance between rings is enough to specify the position of the rest of the rings.

There is a simple Python function below that we can use to generate the data:

``````def make_some_rings(x0, y0, xcoords, ycoords, radii, widths):
"""
Function to generate 15 rings on a 240, 80 2D image.
Args:
x0 (float): x-coordinate of centre of circle.
y0 (float): y-coordinate of centre of circle.
xcoords (1D-array): x-coordinates of 2D image pixels.
ycoords (1D-array): y-coordinates of 2D image pixels.
widths (list): standard deviation of each ring width in pixel space.
"""
# Generate radial position for every coordinate in the image
image = np.zeros(xcoords.shape)
return image
``````

We used 240x80 images and flattened them to a 1D list for analysis in our model. We adopt a statistical model we have previously found to work in other contexts. More specifically, we integrate over an uninformative prior on both the amplitude of the rings and the variance of the assumed Gaussian noise such that the likelihood is a Student-T.
Our Stan model is shown below:

``````
data {
int<lower=1> N_d; // length of data vector
vector[N_d] pixels; // counts for all pixels
int<lower=1> N_r; // Number of rings
vector[N_d] x_pos; // x coordinate of every datapoint in 2D image
vector[N_d] y_pos; // y coordinate of every datapoint in 2D image

}

parameters {
real<lower=0,upper=1> delta_unit; // Distance between rings
real<lower=0,upper=1> r_1_unit; // Radius of first ring
real<lower=0,upper=1> x_0_unit; // x-coordinate of circle centre
real<lower=0,upper=1> y_0_unit; // y-coordinate of circle centre
vector<lower=0,upper=1>[N_r] sigma_r_unit; // Radial extent of each ring

}

transformed parameters {
real delta = (delta_unit*10) + 8;
real r_1 = (r_1_unit*30) + 5;
real x_0 = (x_0_unit*35) + 85;
real y_0 = (y_0_unit*20)+ 240;
vector[N_r] sigma_r = (sigma_r_unit*2) + 1;

}

model {
matrix[N_d, N_r] G_theta; //
vector[N_r] r;
real term_1;
real combined_2_3_4;
real term_2;
row_vector[N_r] term_3;
vector[N_r] term_4;

for (j in 1:N_r) {
r[j] = r_1 + (delta * (j - 1));
G_theta[:, j] = exp(-square(sqrt(square(x_pos - x_0) + square(y_pos - y_0)) - r[j]) ./ square(sigma_r[j]) * 2);
}

term_1 = 1 / sqrt(abs(determinant(G_theta' * G_theta)));
term_2 = dot_product(pixels', pixels);
term_3 = pixels'*G_theta * inverse(G_theta'* G_theta);
term_4 = G_theta' * pixels;
combined_2_3_4 = ((-N_d+N_r) / 2) * log(term_2 - term_3* term_4);

target += log(term_1) + combined_2_3_4;

// Prior
delta_unit ~ uniform(0, 1);
r_1_unit ~ uniform(0, 1);
x_0_unit ~ uniform(0, 1);
y_0_unit ~ uniform(0, 1);
sigma_r_unit ~ uniform(0, 1);

}
``````

Issues we are encountering:
• Sampling and warm up are prohibitively slow, frequently complaining of over constrained parameters regarding G_theta. “ Exception: multiply: B[1] is nan, but must not be nan! (in ‘Stan_stage_1.stan’ at line 51)” Predominantly appears to happen in the warm up phase.
• Optimiser is inconsistent – is this due to there being many likelihood minima?

Any help with these issues would be greatly appreciated.

Thanks,

David

1 Like

Hi again,

In the interests of providing a bit more detail, here are some typical values for the data we are dealing with:

N_d = 240*80
N_r ≃ 15
delta ≃ 12
r_1 ≃ 25
x_0 ≃ 95
y_0 ≃ 250

Hope that helps.

Best regards,

David

Looks like profiling is coming “soon”: https://github.com/stan-dev/cmdstan/issues/960

1 Like

The other idea (thanks @bgoodri) mentioned in a conversation just now was to write a bespoke C++ function for the bit you suspect is slow (see example here: https://github.com/stan-dev/rstanarm/blob/master/inst/include/csr_matrix_times_vector2.hpp). I suspect that’ll be the dot products of sparse vectors with one another that is implicitly involved in calculating the elements of `G_theta'*G_theta` and the dot product of a sparse vector and a dense vector involved in calculating the elements of `G_theta' * pixels`.

Yeah so it’ll be really easy shortly (edit: if we successfully add this feature in 2.26), or not-any-less-difficult-than-it-currently-is shortly in the future (edit: if we fail to add this feature in 2.26) :D.

Simon said there were a lot of zeroes. Which things had lots of zeros?

In this case, I would write out the complexity for each of the operations and the assumptions for the matrices. Like is `G_theta' * G_theta` positive definite?

You might be able to save time by doing something like:

``````matrix chol_GG = cholesky_decompose(G_theta' * G_theta);
``````

And then you can compute the determinant from this, and then probably do the solve here:

``````pixels'*G_theta * inverse(G_theta'* G_theta)
``````

quicker (make sure and do the `pixels' * G_theta` product before the solve, etc. etc.)

1 Like

@bbales2: It is `G_theta` that is sparse.

A development version of soon is here: https://github.com/stan-dev/cmdstan/issues/960#issuecomment-748466895

It might break cuz it’s development, but this will give you the info you need!

@bbbales2: Thanks for the links to the profiling project (and for your earlier suggestions). We will definitely try them out and post an update.

Here are profiling results from a small sampling run (25 warm-up iterations and 50 sampling) with 2 chains. Shows that generating the G_theta matrix is very slow, as is calculating terms involving matrix multiplications of G_theta’ * G_theta.

Profiling information (chain 1):

``````name, total, fwd, rev
calculating combined, 0.0960929,0.0802959,0.015797
calculating term_1, 583.702,236.118,347.584
calculating term_2, 1.11552,1.11169,0.00383175
calculating term_3, 605.202,278.712,326.49
calculating term_4, 91.9088,49.2647,42.6441
generating g_theta, 2243.6,1436.43,807.1
priors, 0.21295,0.18607,0.0268796
updating lp, 0.0826979,0.0643497,0.0183482
``````

Profiling information (chain 2):

``````name, total, fwd, rev
calculating combined, 0.090818,0.0753556,0.0154624
calculating term_1, 557.624,226.966,330.657
calculating term_2, 1.06241,1.05872,0.00368262
calculating term_3, 586.637,267.351,319.287
calculating term_4, 90.1564,48.0262,42.1302
generating g_theta, 2203.23,1404.38,798.846
priors, 0.215369,0.188583,0.0267864
updating lp, 0.0687928,0.0511221,0.0176707
``````

here is the model block with the profile statements:

``````model {
matrix[N_d, N_r] G_theta; //
vector[N_r] r;
real term_1;
real combined_2_3_4;
real term_2;
row_vector[N_r] term_3;
vector[N_r] term_4;

{
profile("generating g_theta");
for (j in 1:N_r) {
r[j] = r_1 + (delta * (j - 1));
G_theta[:, j] = exp(-square(sqrt(square(x_pos - x_0) + square(y_pos - y_0)) - r[j]) ./ square(sigma_r[j]) * 2);
}
}

{
profile("calculating term_1");
term_1 = 1 / sqrt(abs(determinant(G_theta' * G_theta)));
}

{
profile("calculating term_2");
term_2 = dot_product(pixels', pixels);
}
{
profile("calculating term_3");
term_3 = pixels'*G_theta * inverse(G_theta'* G_theta);
}
{
profile("calculating term_4");
term_4 = G_theta' * pixels;
}
{
profile("calculating combined");
combined_2_3_4 = ((-N_d+N_r) / 2) * log(term_2 - term_3* term_4);
}

{
profile("updating lp");
target += log(term_1) + combined_2_3_4;

}

// Prior
{
profile("priors");
delta_unit ~ uniform(0, 1);
r_1_unit ~ uniform(0, 1);
x_0_unit ~ uniform(0, 1);
y_0_unit ~ uniform(0, 1);
sigma_r_unit ~ uniform(0.9, 1.1);
}

}
``````
2 Likes

It looks like you will get some speed improvements by spotting that:

1. you can compute `gtg=G_theta'*G_theta` once and then calculate the determinant and inverse of `gtg`.
2. you can compute `gtp=pixels'*G_theta` once and then exploit the fact that `term4` is equal to `gtp'`.

However, i think the key to this running fast will be to calculate (a high fidelity approximation to) `gtg` and `gtp` (directly: without explicitly calculating `G_theta`) in a way that exploits the fact that `G_theta` is (well approximated as) sparse.

1 Like

1. is it likely that we’ll lose a lot of speed because we will be calculating the dot products as an explicit for loop rather than as a one-line instruction?
2. do dot products in Stan already use the map-reduce work that I recall offering significant speed-ups?
3. how we can adapt the Stan code to use GPUs to calculate the dot products efficiently?

I’m not sure what version of Stan we are using, but @david_p can probably confirm.

Most of the work has been carried out using PyStan version 2.19.1.1
We have also used the development version of cmdstan as posted by @bbbales2 for profiling.

Cheers

Yeah. Building matrices like this sometimes ends up being more expensive than the decomposition (edit: or if not more expensive – surprisingly expensive relative to). It happens in GPs too. It’s just so much stuff ends up getting pushed on the autodiff stack (for each element). Even though a lot of those operations are vectorized at the interface level, they’re still creating a ton of temporaries for every element in each operation.

@tadej can you comment here? I think the 2.26 situation would probably change things too if everything works out (there’s a bunch of GPU stuff in it)

1 Like

Dot products are not very efficient to calculate with GPUs due to large amount of data that needs to be copied to the GPU compared to the amount of calculations. This changes, however, if you can rewrite your model in a way that replaces multiple dot products with a matrix product without too much extra calculations.

Even if you can not, you can wait for the next release that will enable you to run OpenCL on a CPU or an integrated GPU without copying the data and probably add OpenCL implementation for the dot product. That might offer some speedup compared to sequential implementation.

1 Like