Simple Gaussian model: three variants with dramatically different sampling times

So, let's start with a simple standard IID Gaussian model:

// standard.stan
data{
  int n;
  vector[n] y ;
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;
  y ~ normal(mu,sigma) ;
}

Fits fast even for big n:

mod = cmdstanr::cmdstan_model('standard.stan')
mod$sample(
	data = tibble::lst(
		n = 1e4
		, y = rnorm(n)
	)
	, parallel_chains = parallel::detectCores()/2
	, refresh = 0 # too fast to bother with progress updates
	, iter_warmup = 1e4 #just to slow things down
	, iter_sampling = 1e4 #ditto
)
#about 6s

Still, one can get a substantial speedup by replacing the likelihood with one expressed in terms of sufficient statistics:

// sufficient.stan
data{
  int<lower=1> n ;
  vector[n] y ;
}
transformed data{
  real obs_mean = mean(y) ;
  real obs_var = variance(y) ;
  real sqrt_n = sqrt(n) ;
  real var_gamma_par = 0.5 * (n - 1) ; // 0.5 * avoids integer division
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;
  obs_mean ~ normal( mu , sigma/sqrt_n ) ;
  obs_var  ~ gamma( var_gamma_par , var_gamma_par / pow(sigma,2) ) ;
}

Which, for n=1e4, samples in about 0.3s! (more speed comparisons here).

Now, that trick is of rather limited application since it depends on the observations being IID, but we can induce IID structure for any model with a y~normal(...) likelihood by centering and scaling the observations with respect to the model's mean and standard-deviation parameters. This of course means that instead of a data variable on the left of the ~'s, we'll have variables that are non-linear transforms of parameters, thereby requiring Jacobian adjustments. Happily the amazing @spinkney provided these for me, yielding the model:

// transformed_sufficient.stan
data{
  int n;
  vector[n] y ;
}
transformed data{
  real sqrt_n = sqrt(n) ;
  real var_gamma_par = 0.5 * (n - 1) ; // 0.5 * avoids integer division
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;

  // d: deviations of y from mu
  vector[n] d = y-mu ;
  real sigma_sq = pow(sigma,2) ;

  // mz_se: mean z divided by standard error
  //    should be distributed ~normal(0,1)
  real mz_se = sum(d)/(sqrt_n*sigma_sq) ;
  mz_se ~ std_normal() ;

  // v: variance of d divided by expected variance
  //    should be distributed ~gamma( (n-1)/2 , (n-1)/2 )
  real zv = variance(d)/sigma_sq ;
  zv ~ gamma(var_gamma_par,var_gamma_par) ;

  //jacobians:
  target += log(zv) -3*log(sigma);

}

However, compared to both the earlier models, this one is much slower. What gives?

Because you declare d??

Is the double question mark intended to convey that this should be an obvious source of the time difference? If so, you'll need to dumb it down for me, sorry. I really didn't expect that to take things from 20x faster (6s → 0.3s) to around 3x slower (6s → 15s).

Any Stan program's performance is limited by the size of its AD tree (the structure used to compute the gradient). In the last example you declare d, which has the size of the data (n entries). You should avoid declaring d and instead store only reductions of it in variables within the model block: save sum(y - mu) and sum((y - mu)^2), then use these to compute the final quantities you need. This avoids putting a full vector of size n on the AD tape. You should also consider using the profiling facility when doing these things.
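Something like the following is what I have in mind (untested, and the file name is just for illustration; it keeps the distributional statements and Jacobian terms from transformed_sufficient.stan unchanged and only replaces the size-n local vector with scalar reductions, noting that variance(y - mu) is just variance(y) and so can be precomputed):

// reduced_sufficient.stan (sketch, not tested)
data{
  int<lower=1> n ;
  vector[n] y ;
}
transformed data{
  // reductions of the data, computed once and kept off the AD tape
  real sum_y = sum(y) ;
  real var_y = variance(y) ; // variance(y - mu) == variance(y)
  real sqrt_n = sqrt(n) ;
  real var_gamma_par = 0.5 * (n - 1) ;
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;

  real sigma_sq = pow(sigma,2) ;

  // sum(y - mu) expressed via the precomputed sum_y
  real mz_se = (sum_y - n * mu) / (sqrt_n * sigma_sq) ;
  mz_se ~ std_normal() ;

  // variance(y - mu) needs no size-n intermediate at all
  real zv = var_y / sigma_sq ;
  zv ~ gamma(var_gamma_par, var_gamma_par) ;

  // same jacobian terms as before
  target += log(zv) - 3*log(sigma) ;
}

With the reductions precomputed, the only autodiff variables left in the model block are a handful of scalars, which is essentially the same situation as in sufficient.stan.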

And my "??" was to note that I am not sure the above will really help you. It is based on long years of Stan experience, but with performance questions it can be surprising what one finds. The rule of avoiding declared intermediate variables is usually a good one for getting more speed (this is why the size of the AD tape is reported in the profiling output).
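To be concrete about the profiling facility: you can wrap sections of the model block in profile() statements (the section names below are arbitrary), e.g. for the model above:

// transformed_sufficient.stan, model block only, with profiling added (sketch)
model{
  profile("priors") {
    mu ~ std_normal() ;
    sigma ~ weibull(2,1) ;
  }
  profile("likelihood") {
    vector[n] d = y-mu ;
    real sigma_sq = pow(sigma,2) ;
    real mz_se = sum(d)/(sqrt_n*sigma_sq) ;
    mz_se ~ std_normal() ;
    real zv = variance(d)/sigma_sq ;
    zv ~ gamma(var_gamma_par,var_gamma_par) ;
    target += log(zv) - 3*log(sigma) ;
  }
}

In cmdstanr the per-section results (timings and AD tape sizes) are then available via fit$profiles().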

Makes more sense?


@mike-lawrence can you post a simple example where y will be a parameter? Let’s make sure those jacobians are correct.

Ah, that’s something new for me! Thanks!

Hm, no luck. Tried this plus a bunch of other ideas but everything I tried slowed things down even more.

I guess I should have gone into this with a stronger prior that I was unlikely to have happened upon an optimization left undiscovered for such a ubiquitous model structure!

A shame though; that speed up for the large-N IID case is spectacular.

This was intended for the other thread on Jacobians, right?