Simple Gaussian model: three variants with dramatically different sampling times

mike-lawrence · May 30, 2021, 5:36pm

So, start with a simple standard IID Gaussian model:

// standard.stan
data{
  int n;
  vector[n] y ;
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;
  y ~ normal(mu,sigma) ;
}

Fits fast even for big n:

mod = cmdstanr::cmdstan_model('standard.stan')
mod$sample(
	data = tibble::lst(
		n = 1e4
		, y = rnorm(n)
	)
	, parallel_chains = parallel::detect_cores()/2
	, refresh = 0 #too fast for bothering to update
	, iter_warmup = 1e4 #just to slow things down
	, iter_sampling = 1e4 #ditto
)
#about 6s

Still, one can get substantial speedup by replacing the likelihood with sufficient statistics:

// sufficient.stan
data{
  int<lower=1> n ;
  vector[n] y ;
}
transformed data{
  real obs_mean = mean(y) ;
  real obs_var = variance(y) ;
  real sqrt_n = sqrt(n) ;
  real var_gamma_par = (n - 1) / 2 ;
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;
  obs_mean ~ normal( mu , sigma/sqrt_n ) ;
  obs_var  ~ gamma( var_gamma_par , var_gamma_par ./ pow(sigma,2) ) ;
}

Which for n=1e4 samples in about 0.3s! (more speed comparison here).

Now that trick is of rather limited application, depending on IID, but we can induce IID for any model with a y~normal(...) likelihood by centering and scaling the observations with respect to the model’s mean and standard deviation parameters. This of course means that instead of a data variable to the left of the ~'s, we’ll have variables that reflect non-linear transforms of parameters, thereby requiring jacobian adjustment. Happily the amazing @spinkney provided these for me, yielding the model:

// transformed_sufficient.stan
data{
  int n;
  vector[n] y ;
}
transformed data{
  real sqrt_n = sqrt(n) ;
  real var_gamma_par = (n-1)/2 ;
}
parameters{
  real mu ;
  real<lower=0> sigma ;
}
model{
  mu ~ std_normal() ;
  sigma ~ weibull(2,1) ;

  // d: deviations of y from mu
  vector[n] d = y-mu ;
  real sigma_sq = pow(sigma,2) ;

  // mz_se: mean z divided by standard error
  //    should be distributed ~normal(0,1)
  real mz_se = sum(d)/(sqrt_n*sigma_sq) ;
  mz_se ~ std_normal() ;

  // v: variance of d divided by expected variance
  //    should be distributed ~gamma( (n-1)/2 , (n-1)/2 )
  real zv = variance(d)/sigma_sq ;
  zv ~ gamma(var_gamma_par,var_gamma_par) ;

  //jacobians:
  target += log(zv) -3*log(sigma);

}

However, compared to both the earlier models, this one is much slower. What gives?

wds15 · May 30, 2021, 6:33pm

Because you declare d??

mike-lawrence · May 30, 2021, 7:47pm

Is the double query intended to convey that this should be an obvious source of the time difference? If so, you’ll need to dumb it down for me, sorry. I really didn’t expect that to make it go from 20x faster (6s → 0.3s) to around 3x slower (6s → 15s).

wds15 · May 31, 2021, 7:05am

Any Stan program is limited in performance by its AD tree size (the thing to get the gradient). In the last example you do declare “d” which has the size of the data (n entries). You should try to avoid declaring “d” and instead only store reductions of it in variables within the model block. So try to save sum(y-mu) and sum((y-mu)^2) and then use these to get the final quantities you need. This avoids storing a full vector of size “n”. You should also consider the use of the profiling facility when doing these things.

And my “?” is for noting that I am not sure if the above will really help you. It relies on long-year Stan experience, but for performance stuff it can be surprising what one will find. The rule “avoid” declared parameters is usually a good one to get more speed (this is why the size of AD tape is being reported in the profiling outputs).

Makes more sense?

spinkney · May 31, 2021, 10:46am

@mike-lawrence can you post a simple example where y will be a parameter? Let’s make sure those jacobians are correct.

mike-lawrence · May 31, 2021, 12:25pm

Ah, that’s something new for me! Thanks!

mike-lawrence · May 31, 2021, 3:31pm

Hm, no luck. Tried this plus a bunch of other ideas but everything I tried slowed things down even more.

I guess I should have had a stronger prior going into this that it was unlikely that I’d happened upon an optimization left undiscovered for such a ubiquitous model structure!

A shame though; that speed up for the large-N IID case is spectacular.

mike-lawrence · May 31, 2021, 3:33pm

This was intended for the other thread on jacaobians, right?

Topic		Replies	Views
Seeking expert stan modeler for help speeding up a complex stan model Jobs fitting-issues , specification , performance	4	928	July 29, 2020
PSA: using stantargets for SBC(-esque) checks, & big speedups for big gaussian likelihoods Publicity techniques , specification , performance	1	976	May 30, 2021
Speeding up multivariate normal model Modeling techniques	7	3167	September 9, 2017
Assessing speed of multilevel model Modeling	1	409	March 25, 2020
"Fixed_param" mode agonisingly slow General	7	1010	August 3, 2019

Simple Gaussian model: three variants with dramatically different sampling times

Related topics