Stan threads/reduce_sum doesn't seem to make any difference

Hi everyone,

I’ve recently been trying to use reduce_sum and STAN_THREADS on Linux, and so far I haven’t been very successful.

My configuration is a pretty standard 64-bit Ubuntu 20.04 Linux with gcc-9.3.
I have cmdstan-2.27.0 with STAN_THREADS=true set in make/local (and cmdstan was recompiled after changing that setting).

I then have the following Stan code (a bit of a toy example just for testing reduce_sum; it evaluates a spline on a large vector of points):

functions{
#include spline.stan

  real partial_lpdf_sum(int[] x_pos_knots_slice,
			int start, int end,
			int nknots,
			vector xknots,
			vector yknots,
			vector spl_coeffs,
			int N,
			vector x,
			vector y,
			vector ey
		    )
  {
    // the slice covers start..end inclusive, i.e. end-start+1 points
    vector[end-start+1] ymod;
    ymod = spline_eval(nknots, xknots,
		       yknots, spl_coeffs, end-start+1,
		       x[start:end], x_pos_knots_slice);

    return normal_lpdf(y[start:end]|ymod,ey[start:end]);
  }
}

data{
  int N;
  int nknots;
  vector[N] x;
  vector[N] y;
  vector[N] ey;
  vector[nknots] xknots;
  int grainsize;
}
transformed data
{
  // determine which knot interval each point belongs to
  int x_pos_knots[N] = spline_findpos(nknots, xknots, N, x);
}
parameters
{
  // the parameters of our spline model are
  // the values at the knots
  vector[nknots] yknots;
}
transformed parameters
{
  vector[nknots] spl_coeffs = spline_getcoeffs(nknots, xknots, yknots);
  // these are the spline coefficients corresponding to the current model
}

model
{
  yknots  ~ normal (0,100);

    target += reduce_sum(partial_lpdf_sum, x_pos_knots, N,
		       nknots,
		       xknots,
		       yknots,
		       spl_coeffs,
		       N,
		       x,
		       y,
		       ey); 
}

This code is then compiled using CmdStan with the make command (and I clearly see the ‘-DSTAN_THREADS’ option being passed to the compiler).

When I run the compiled program

env STAN_NUM_THREADS=20 example_precompute id=1 random seed=434 data file=/tmp/fitting/uy9xoo6e.json output file=/tmp/fitting/example_precompute-202107211727-1-w9zlxy4w.csv method=sample algorithm=hmc adapt engaged=1

I see in the output
num_threads = 20
But at the same time:

  1. top clearly shows 100%CPU for the corresponding process (and not more, indicating lack of threading activity)
  2. I’ve checked if there are any threads in /proc/$PID/task/ and I don’t see any.

So I’m now suspecting that the threads are not used at all for some reason.
Does anyone have an idea what I’m doing wrong here?

Thanks,
Sergey

The reduce_sum signature is

real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)

Did you intend to set the grainsize equal to N? Though I don’t think that alone would cause no threads to be created. If you have some example data, I could take a look on my computer.


Ah, thank you @stevebronder , I’ve misplaced the arguments indeed. I needed grainsize there instead of N. After fixing that I see threads being created.
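
For anyone landing here later, this is a sketch of what the corrected model block would look like, assuming the rest of the program above is unchanged. The only difference is the third argument to reduce_sum: the grainsize declared in the data block is passed instead of N. The later N stays, since it is one of the shared arguments forwarded to partial_lpdf_sum.

```stan
model
{
  yknots ~ normal(0, 100);

  // third argument is the grainsize (from the data block), not N;
  // everything after it is forwarded unchanged to partial_lpdf_sum
  target += reduce_sum(partial_lpdf_sum, x_pos_knots, grainsize,
                       nknots,
                       xknots,
                       yknots,
                       spl_coeffs,
                       N,
                       x,
                       y,
                       ey);
}
```

As far as I understand the documentation, a grainsize of 1 leaves the partitioning to the scheduler, so that is a reasonable value to start with before tuning.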

Huh, glad to hear! I’m 99% sure I’ve done the same thing before.