Using truncated negative binomial - large time increase (~100x)

Using a dummy negative binomial model, I compared the execution time with and without truncation. Is such an enormous increase in computation time normal?

# Sampling from a truncated distribution

library(tidyverse)  # for %>%, enframe, filter, pull

x = rnbinom(1000, mu = 100, size = 10) %>% enframe %>% filter(value < 150) %>% pull(value)
X = x %>% length()
rstan::stan(
	data = list(x = x, X = X),
	model_code = "
	data{int X; int x[X];}
	parameters{real mu; real<lower=0> sigma;}
	model{ 
		for(i in 1:X) x[i] ~ neg_binomial_2(exp(mu), sigma);
	}" 	
)

3.61 seconds (Total)

rstan::stan(
	data = list(x = x, X = X),
	model_code = "
	data{int X; int x[X];}
	parameters{real mu; real<lower=0> sigma;}
	model{ 
	for(i in 1:X) x[i] ~ neg_binomial_2(exp(mu), sigma) T[,150];
	}" 	
)

258.33 seconds (Total)

Is this expected?


Your truncation does not line up with the parameter constraints. Those should align - otherwise NUTS runs terribly inefficiently. Have a read on that in the manual.

I have read https://mc-stan.org/docs/2_18/stan-users-guide/truncated-data-section.html but I am not sure if the examples there apply to my case.

My parameters are mu and sigma (the truncation point is known). Do you mean restricting real<upper=log(150)> mu;? (Although theoretically the mean could be higher than the truncation point.)

rstan::stan(
 	data = list(x = x, X = X),
 	model_code = "
 	data{int X; int x[X];}
	parameters{real<upper=log(150)> mu; real<lower=0> sigma;}
 	model{ 
 	for(i in 1:X) x[i] ~ neg_binomial_2(exp(mu), sigma) T[,150];
 	}" 	
 )

241.07 seconds (Total)

If this is what you are referring to, it does not change the execution time.


Unless I am missing something, my example is exactly the NB version of the manual example.


data {
  int<lower=0> N;
  real U;
  real<upper=U> y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[,U];
}

Uh… too early for me… wrong direction in this case.

Can you try writing the likelihood without the T? That is, spell out the log-likelihood as neg_binomial_2 divided by Pr(x < 150), so on the log scale you subtract the log CDF of the negative binomial.
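
In other words, a sketch of the upper-truncated log-likelihood this implies, with the truncation point at 150 and φ the dispersion (called sigma in the models above):

$$
\log p_{\text{trunc}}(x_i \mid \mu, \phi) \;=\; \log \text{NB2}(x_i \mid \mu, \phi) \;-\; \log F_{\text{NB2}}(150 \mid \mu, \phi),
$$

so for X observations sharing the same truncation point the correction is a single term, $-X \log F_{\text{NB2}}(150 \mid \mu, \phi)$.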

:) Thanks for your reply nonetheless.

Yes, experimenting with the explicit formulation, I discovered that neg_binomial_2_lcdf is not vectorisable at the moment.

A simple formulation, assuming the upper truncation point is constant:

res1 = rstan::stan(
	data = list(x = x, X = X), cores = 4,
	model_code = "
	data{
	int X; 
	int x[X];
	}
	parameters{
		real mu; 
		real<lower=0> sigma;}
	model{ 
	real lp[2];
	
	lp[1] = neg_binomial_2_lpmf(x | exp(mu), sigma);
	lp[2] = - neg_binomial_2_lcdf(150 | exp(mu), sigma) * X;
		
	target += sum(lp);

	}" 
)

3.79 seconds (Total)

A formulation assuming the upper truncation point varies by observation (as in my actual model):


res2 = rstan::stan(
	data = list(x = x, X = X), cores = 4,
	model_code = "
	data{
	int X; 
	int x[X];
	}
	transformed data{
	int U[X];
	for(i in 1:X) U[i] = 150;
	}
	parameters{
		real mu; 
		real<lower=0> sigma;}
	model{ 
	real lp[2];
	
	lp[1] = neg_binomial_2_lpmf(x | exp(mu), sigma);
	lp[2] = - neg_binomial_2_lcdf(U | exp(mu), sigma);
		
	target += sum(lp);

	}" 
)

309.7 seconds (Total)

I am wondering: is there any particular limitation, or is it just a matter of coding it? Without vectorisation, applying the truncated NB is not viable. :(

Well, you are calling the negative binomial CDF which involves very nasty functions (regularised incomplete beta).

Could you please recompile your model, but add this compiler option (add it to your ~/.R/Makevars… maybe to both CXXFLAGS and CXX14FLAGS to be sure):

-DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false

This may speed things up at the cost of slightly lower precision. I am curious what you get, really.
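
For reference, a minimal sketch of the corresponding ~/.R/Makevars entries (whatever other flags you already have there stay as they are):

CXXFLAGS += -DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false
CXX14FLAGS += -DBOOST_MATH_PROMOTE_DOUBLE_POLICY=false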

An alternative route is to try a Poisson-gamma representation of the negative binomial. This is equivalent, but maybe the numerics are a bit friendlier with respect to truncation.
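
For reference, the representation in question is the standard mixture identity, stated here with Stan's neg_binomial_2 mean/dispersion parameterisation (μ the mean, φ the dispersion):

$$
\text{NB2}(x \mid \mu, \phi) \;=\; \int_0^{\infty} \text{Poisson}(x \mid \lambda)\,\text{Gamma}(\lambda \mid \phi,\ \phi/\mu)\, d\lambda,
$$

i.e. x | λ ~ Poisson(λ) with λ ~ Gamma(φ, φ/μ) (shape, rate) gives x ~ NB2(μ, φ) marginally; equivalently x | g ~ Poisson(μ g) with a unit-mean g ~ Gamma(φ, φ).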

For 2 or 3 of my applications I found that a Poisson model with a moderately complex random-effect structure (like varying intercepts for two grouping variables or so) becomes very much like a negative binomial model in terms of regression coefficients, posterior predictions, LOO… which is not really surprising, since it essentially becomes a lognormal-Poisson model (and the gamma and log-normal are not that different…). I actually never even use the negative binomial anymore, as it is often so slow…


Very interesting.

The model I built would be a check for negative binomiality in the case of NB regression (differential gene transcript abundance). I need the truncation because, after I quarantine the outliers (0.95 CI), I want to recalculate the coefficients (the truncation is necessary to avoid increasing the false positive rate) and estimate the probability of a data point lying outside the NB distribution of the new model (generated quantities). If that makes sense…

I will try the gamma-Poisson implementation tomorrow. @wds15 I imagine I have to place the truncation only on the Poisson.

Yup, if you are willing to trade the gamma for a log-normal (they are similar, yes), then you should gain a lot of performance… I often do that myself indeed.

OK… so I was curious enough to try it out myself - and the compile flag makes a huge difference for me. With the standard settings I am getting 105 s of inference time for one chain. With the BOOST option turned on as suggested I am getting 54 s of inference time - so a doubling of the speed. The results are basically the same. So this looks like a good thing (I have an open PR which would make this fast behaviour the default).

… but just to get back to your comment that the truncation would vary by item… I cannot see that in your model right now. Or is the real model you are interested in one with varying means?


Interesting.

If you feel like trying the Poisson + gamma, that would be amazing.

My model is a bit more complex: an NB multiple linear model for many genes (with hyperpriors), which is run once to find possible outliers and then a second time with 0.95 CI truncation (still to be worked out).

You can see the example for two genes (blue and red treatments), where I detect an outlier and run the model a second time, observing a decrease (red CI) in the difference between the two groups compared with the previous dashed CI, which was driven by one outlier.

(all the mess with non-centered parametrisation for the second step is because otherwise the truncation gives under/overflow)

Most of the code is for threading (please make this simpler! ;) ).

functions{

	real exp_gamma_meanSd_lpdf(vector x_log, real m_log, real s){

    // This function is the probability density of the gamma distribution
    // in case you have data that is already in log form

    real m = exp(m_log);
    real v = m + square(m) * s;
    real a = square(m) / v;
    real b = m / v;

		vector[rows(x_log)] jacob = x_log; //jacobian
		real norm_constant = a * log(b) -lgamma(a);
		real a_minus_1 = a-1;
		return sum( jacob ) + norm_constant * rows(x_log) + sum(  x_log * a_minus_1 - exp(x_log) * b ) ;

	}

	real exp_gamma_meanSd_rng(real m_log, real s){
	  // This function takes care of the two prior choices
	  // without complicating the model itself too much

    real m = exp(m_log);
    real v = m + square(m) * s;
    real a = square(m) / v;
    real b = m / v;

	  return log(gamma_rng(a, b));
	}

	vector[] get_reference_parameters_MPI(int n_shards, int M, int[] G_per_shard, int[,] G_ind, matrix lambda_log, vector sigma, vector exposure_rate){

		int S = rows(exposure_rate);
		vector[(M*S) + M + S] lambda_sigma_exposure_MPI[n_shards];

		for( i in 1:n_shards ) {

			int size_buffer = ((M*S) + M) - ((G_per_shard[i]*S) + G_per_shard[i]) ;
		  vector[ size_buffer] buffer = rep_vector(0.0,size_buffer);

			lambda_sigma_exposure_MPI[i] =
	  		append_row(
	  		  append_row(
		  		    append_row(
		  		    	to_vector(lambda_log[,G_ind[i, 1:G_per_shard[i]]]),
		      		  sigma[G_ind[i, 1:G_per_shard[i]]]
		      		),
	      		buffer
	      	),
	      	exposure_rate
	      );
		}

		return(lambda_sigma_exposure_MPI);
	}

	vector lp_reduce( vector global_parameters , vector local_parameters , real[] real_data , int[] int_data ) {

		real lp;

		// Integer data unpack
	 	int M = int_data[1];
	 	int N = int_data[2];
	 	int S = int_data[3];
	 	int G_per_shard = int_data[4];
	 	int symbol_end[M+1] = int_data[(4+1):(4+1+M)];
	 	int sample_idx[N] = int_data[(4+1+M+1):(4+1+M+1+N-1)];
	 	int counts[N] = int_data[(4+1+M+1+N-1+1):(4+1+M+1+N-1+N)];

		// Truncation // Only DE genes: T < N
	 	int T =    int_data[(4+1+M+1+N-1+N+1)]; // Size all truncations
	 	int my_T = int_data[(4+1+M+1+N-1+N+1+1)]; // Size truncations in this shard
		int lower_truncation[T] = int_data[(4+1+M+1+N-1+N+1+1+1):(4+1+M+1+N-1+N+1+1+T)];
		int upper_truncation[T] = int_data[(4+1+M+1+N-1+N+1+1+T+1):(4+1+M+1+N-1+N+1+1+T+T)];

	 	// Data to exclude for outliers
	 	int size_exclude = int_data[(4+1+M+1+N-1+N+1+1+T+T+1)];
		int to_exclude[size_exclude] = int_data[(4+1+M+1+N-1+N+1+1+T+T+1+1):(4+1+M+1+N-1+N+1+1+T+T+1+size_exclude)]; // conveniently, for packaging, it is the last variable

		// Parameters unpack
	 	vector[G_per_shard*S] lambda_MPI = local_parameters[1:(G_per_shard*S)];
	 	vector[G_per_shard] sigma_MPI = local_parameters[((G_per_shard*S)+1):((G_per_shard*S) + G_per_shard)];
	 	vector[S] exposure_rate = local_parameters[(((M*S) + M)+1):rows(local_parameters)];

		// Vectorise lpmf
		//vector[symbol_end[G_per_shard+1]] lambda_MPI_c;
		vector[symbol_end[G_per_shard+1]] sigma_MPI_c;
		for(g in 1:G_per_shard){
			int how_many = symbol_end[g+1] - (symbol_end[g]);
			sigma_MPI_c [(symbol_end[g]+1):symbol_end[g+1]] = rep_vector(sigma_MPI[g],  how_many);
		}

		lp =
			neg_binomial_2_log_lpmf(
    		counts[1:symbol_end[G_per_shard+1]] |
    		exposure_rate[sample_idx[1:symbol_end[G_per_shard+1]]] +
    		lambda_MPI,
	    	sigma_MPI_c
    	)	-
    	(
    		// Truncation, if needed
    		my_T > 0 ?
    		neg_binomial_2_lccdf(
	    		lower_truncation[1:my_T] |
	    		exp(exposure_rate[sample_idx[1:my_T]] +	lambda_MPI[1:my_T]),
	    		sigma_MPI_c[1:my_T]
	    	)
	    	// +
	    	// neg_binomial_2_lcdf(
	    	// 	upper_truncation[1:my_T] |
	    	// 	exp(exposure_rate[sample_idx[1:my_T]] +	lambda_MPI[1:my_T]),
	    	// 	sigma_MPI_c[1:my_T]
	    	// )
	    	:
	    	0
    	) -
    	(
    		// Exclude outliers by subtracting probability, if needed
    		size_exclude > 0 ?
    		neg_binomial_2_log_lpmf(
	    		counts[1:symbol_end[G_per_shard+1]][to_exclude] |
	    		exposure_rate[sample_idx[1:symbol_end[G_per_shard+1]]][to_exclude] +
	    		lambda_MPI[to_exclude],
		    	sigma_MPI_c[to_exclude]
	    	) :
	    	0
    	);



		// Return
		return [lp]';

  }

matrix create_alpha(row_vector intercept, row_vector alpha_sub_1, matrix alpha_2,  int C, int S, int G){
	matrix[C,G] my_alpha;


	if(C==1)
		my_alpha = to_matrix(intercept);
	else if(C>1)
		my_alpha =
			append_row(
				append_row(
					intercept,
					append_col(alpha_sub_1, rep_row_vector(0.0, G-cols(alpha_sub_1)))
				),
				append_col(alpha_2, rep_matrix(0.0, rows(alpha_2), G-cols(alpha_2)))
			);

	return my_alpha;
}

}
data {

	// Reference matrix inference
  int<lower=0> N;
  int<lower=0> M;
	int<lower=0> G;
	int<lower=0> S;
  int n_shards;
	int<lower=0> counts[n_shards, N];
	int<lower=0> symbol_end[n_shards, M+1];
	int<lower=0> G_ind[n_shards, M];
	int<lower=0> sample_idx[n_shards, N];
	int<lower=0> G_per_shard[n_shards];
	int<lower=0> G_per_shard_idx[n_shards + 1];

	int<lower=0> CP; // Counts package size
  int<lower=0> counts_package[n_shards, CP];

	int<lower=1> C; // Covariates
	matrix[S,C] X; // Design matrix

	real<lower=0> lambda_mu_mu;

	int<lower=0> how_many_to_check;

	// For better adaptation
	real exposure_rate_multiplier;
	real intercept_shift_scale[2];

	// Prior information needed for truncation
	int<lower=0, upper=1> has_prior;

  vector[S*has_prior] prior_exposure_rate[2];
  row_vector[G*has_prior] prior_intercept[2];
  row_vector[how_many_to_check*has_prior] prior_alpha_sub_1[2];
  //matrix[max(0, C-2),how_many_to_check] prior_alpha_2; // Linear model for calculating lambda_log
  vector[G*has_prior] prior_sigma[2];


}
transformed data {

  vector[0] global_parameters;
  real real_data[n_shards, 0];

}
parameters {


  // Overall properties of the data
  real lambda_mu_raw; // So it is compatible with the logGamma prior
  real<lower=0> lambda_sigma;
  real sigma_slope;
  real sigma_intercept;
  real<lower=0> sigma_sigma;

  // Gene-wise properties of the data
  vector[S] exposure_rate_raw;
  row_vector[G] intercept_raw;
  row_vector[how_many_to_check] alpha_sub_1_raw;
  matrix[max(0, C-2),how_many_to_check] alpha_2_raw; // Linear model for calculating lambda_log
  vector[G] sigma_raw;


}
transformed parameters {

	// For better adaptation
	real lambda_mu = lambda_mu_raw + lambda_mu_mu;

	// For truncation
	vector[S] exposure_rate = exposure_rate_raw * exposure_rate_multiplier;
	// vector[S] exposure_rate =
	// 	has_prior == 0 ?
	// 	exposure_rate_raw * exposure_rate_multiplier :
	// 	prior_exposure_rate[1] + exposure_rate_raw .* prior_exposure_rate[2];

  row_vector[G] intercept =
  has_prior == 0 ?
  intercept_raw :
  prior_intercept[1] + intercept_raw .* prior_intercept[2];

  row_vector[how_many_to_check] alpha_sub_1 =
  has_prior == 0 ?
  alpha_sub_1_raw :
  prior_alpha_sub_1[1] + alpha_sub_1_raw .* prior_alpha_sub_1[2];

  matrix[max(0, C-2),how_many_to_check] alpha_2 =
  has_prior == 0 ?
  alpha_2_raw:
  alpha_2_raw;


	// Linear algebra
	//matrix[C,G] alpha = create_alpha(intercept, alpha_sub_1, alpha_2,  C,  S,  G);
	matrix[S,G] lambda_log_param = X * create_alpha(intercept, alpha_sub_1, alpha_2,  C,  S,  G);

  // Sigma
  vector[G] sigma =
  has_prior == 0 ?
  1.0 ./ exp(sigma_raw) :
  1.0 ./ exp(prior_sigma[1] + sigma_raw .* prior_sigma[2]);
//print(has_prior);
}

model {

	lambda_mu_raw ~ normal(0,2);
  lambda_sigma ~ normal(0,2);

  sigma_intercept ~ normal(0,2);
  sigma_slope ~ normal(0,2);
  sigma_sigma ~ normal(0,2);

  // Gene-wise properties of the data
 	target +=
	has_prior == 0 ?
	exp_gamma_meanSd_lpdf(to_vector(intercept_raw) | lambda_mu, lambda_sigma) :
	std_normal_lpdf(to_vector(intercept_raw));

  if(C>=2)
  	target +=
  		has_prior == 0 ?
  		double_exponential_lpdf(alpha_sub_1_raw | 0,1) :
  		std_normal_lpdf(to_vector(alpha_sub_1_raw));

	if(C>=3) to_vector(alpha_2) ~ normal(0,2.5);

  target +=
  	has_prior == 0 ?
  	normal_lpdf(sigma_raw | sigma_slope * intercept + sigma_intercept, sigma_sigma) :
  	std_normal_lpdf(sigma_raw);

  // Exposure prior
  exposure_rate_raw ~ normal(0,1);
  sum(exposure_rate_raw) ~ normal(0, 0.001 * S);

	// Gene-wise properties of the data
	target += sum(map_rect(
		lp_reduce ,
		global_parameters ,
		get_reference_parameters_MPI(
			n_shards,
			M,
			G_per_shard,
			G_ind,
			lambda_log_param,
			sigma,
			exposure_rate
		),
		real_data,
		counts_package
	));


}
generated quantities{
	vector[G] counts_rng[S];

	for(g in 1:G) for(s in 1:S)
		counts_rng[s,g] =	neg_binomial_2_log_rng(exposure_rate[s] + lambda_log_param[s,g],	sigma[g]);

}

I see… life is more interesting than just a single mean.

If you are fine with coding up simple C++ and can live with prototype code (big warning here, of course), you can already have a relatively simple-to-use parallel reduce. See the other thread on the TBB benefits of a thread pool.

You should really look into the offset+multiplier feature to make your life easier with NCP (non-centered parametrisation).

(whoa, that’s ugly code… yack… sorry… we are actively working on cleaning that up, so we should hopefully see an easy to use parallel reduce eventually)

No stress, I know you guys are working on it. I will wait patiently.

I am not sure what you are referring to.

Scroll down to affine transforms.

Here is the 8schools example using this technique:

data {
  int<lower=0> J;         // number of schools 
  real y[J];              // estimated treatment effects
  real<lower=0> sigma[J]; // standard error of effect estimates 
}
parameters {
  real mu;                // population treatment effect
  real<lower=0> tau;      // standard deviation in treatment effects
  vector<offset=mu,multiplier=tau>[J] theta;          // unscaled deviation from mu by school
}
transformed parameters {
  //vector[J] theta = mu + tau * eta;        // school treatment effects
}
model {
  target += normal_lpdf(theta | mu, tau);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
}

So NCP coding is super easy now with this. What an amazing feature (I just learned about it, as I had ignored it so far).

I read about it when it was proposed. Good that it is out!

I hope offset and multiplier can be vectors themselves, for vector variables… Thanks for the tip!

As I recall you can give it vectors, yes.
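
A minimal sketch of what that could look like, assuming (as recalled above) that offset and multiplier accept containers of matching dimension; mu_g and tau_g here are hypothetical per-element location/scale parameters:

data {
  int<lower=1> G;
}
parameters {
  vector[G] mu_g;
  vector<lower=0>[G] tau_g;
  // stored internally on the non-centered scale, (theta - mu_g) ./ tau_g
  vector<offset=mu_g, multiplier=tau_g>[G] theta;
}
model {
  mu_g ~ normal(0, 5);
  tau_g ~ normal(0, 2);
  theta ~ normal(mu_g, tau_g);
}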

It has been out since 2.19 as far as I know.


An update on the gamma-Poisson parametrisation (example taken from a post here).

This requires ~5x the time compared with the plain NB.

	data {
		int<lower=1> X;
		int<lower=0> x[X];
		}
	parameters {
		real mu;
		real sigma;
		vector<lower=0>[X] g;
		}
		
	model {
		mu ~ student_t(4, 0, 2.3);
		sigma ~ student_t(4, 0, 2.3);
		g ~ gamma(1/exp(sigma), 1/exp(sigma));
		
		x ~ poisson( exp(mu) * g);
 }

17.54 seconds (Total)
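
For reference, connecting this to the Poisson-gamma identity mentioned earlier: with unit-mean g ~ Gamma(1/exp(sigma), 1/exp(sigma)), the marginal of x is

$$
x \;\sim\; \text{NB2}\!\left(\exp(\mu),\ \tfrac{1}{\exp(\sigma)}\right),
$$

where μ and σ correspond to mu and sigma in the code, so the dispersion is parametrised here as 1/exp(sigma) rather than directly as in the earlier models.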

Gamma-Poisson, with truncation and assuming different truncation points:

	data {
		int<lower=1> X;
		int<lower=0> x[X];
	}
	transformed data{
		int U[X];
	  for(i in 1:X) U[i] = 150;
	}
	parameters {
		real mu;
		real sigma;
		vector<lower=0>[X] g;
		}
		
	model {
		real lp[2];
			
		mu ~ student_t(4, 0, 2.3);
		sigma ~ student_t(4, 0, 2.3);
		g ~ gamma(1/exp(sigma), 1/exp(sigma));
		

		
		lp[1] = poisson_lpmf(x | exp(mu) * g);
		lp[2] = - poisson_lcdf(U | exp(mu) * g);
	
		target += sum(lp);
		
 }

This gives the wrong mu, as if the truncation weren't there. I am pretty sure this is not the right modelling, but I cannot think of anything else.

81.39 seconds (Total)

There is an advantage compared with the truncated NB (~300 s), but still a big loss compared with the naive NB (~3 s).
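
One observation about the formulation (just a possible explanation, not a fix): subtracting poisson_lcdf conditional on g truncates each conditional Poisson, which in general is not the same model as truncating the NB marginal that generated the data, because the normalising constant depends on g and cannot be pulled out of the mixture (here μ stands for exp(mu), g for the gamma random effect, and U for the truncation point):

$$
\int \frac{\text{Poisson}(x \mid \mu g)}{F_{\text{Pois}}(U \mid \mu g)}\, p(g)\, dg \;\neq\; \frac{\int \text{Poisson}(x \mid \mu g)\, p(g)\, dg}{F_{\text{NB2}}(U \mid \mu, \phi)} .
$$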

I would start with a Poisson + log-normal random effect. Doing that lets you use poisson_log easily, and you can use a non-centered parametrisation of the random effect as well. Once that works you can go back to the Poisson-gamma if you really want to.

The Poisson-lognormal:


data{
  int<lower=0> X;
  int x[X];
}

parameters {
  real mu;
  real<lower=0> sigma;
  vector[X] log_lambda;
}

model {
  sigma ~ cauchy(0, 5);
  mu ~ normal(0, 10);
  log_lambda ~ normal(mu, sigma);
  x ~ poisson_log( log_lambda);
}

5.33 seconds (Total)

The truncated Poisson-lognormal:

data{
  int<lower=0> X;
  int x[X];
}
	transformed data{
		int U[X];
	  for(i in 1:X) U[i] = 150;
	}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[X] log_lambda;
}

model {
	real lp[2];
	
  sigma ~ cauchy(0, 5);
  mu ~ normal(0, 10);
  log_lambda ~ normal(mu, sigma);

  lp[1] = poisson_log_lpmf(x | log_lambda);
	lp[2] = - poisson_lcdf(U | exp(log_lambda));
	
		target += sum(lp);
		
}

Still the wrong mu is retrieved. I must be doing something wrong in the truncation; I imagine I should truncate just the Poisson…

64.11 seconds (Total). Similar performance to the Poisson-gamma.

However, the point is that I am testing an NB assumption, so I should use a non-approximated form of the NB. If the Poisson-gamma truncation worked, I could almost go with that for my algorithm.

Could you please sample log_lambda via a non-centered parametrisation? As an alternative you can use the cool new multiplier/offset magic by doing this:

replace
vector[X] log_lambda;
with
vector<offset=mu,multiplier=sigma>[X] log_lambda;

that’s the same as coding it as NCP (given how I understand this super new thing).
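
For completeness, a minimal sketch of the truncated Poisson-lognormal above with that one-line change (everything else as before, so it keeps the same conditional truncation discussed above; untested):

data {
  int<lower=0> X;
  int x[X];
}
transformed data {
  int U[X];
  for (i in 1:X) U[i] = 150;
}
parameters {
  real mu;
  real<lower=0> sigma;
  // non-centered via the affine transform instead of a separate *_raw parameter
  vector<offset=mu, multiplier=sigma>[X] log_lambda;
}
model {
  sigma ~ cauchy(0, 5);
  mu ~ normal(0, 10);
  log_lambda ~ normal(mu, sigma);

  target += poisson_log_lpmf(x | log_lambda);
  target += -poisson_lcdf(U | exp(log_lambda));
}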