Elementwise multiplications with row sums appear to be slow

mike-lawrence · August 22, 2021, 6:18pm

A & C aren’t going to be positive-constrained, eh? One thing I’m noticing as I play is that there can be rapid amplification of values either towards Infinity or zero depending on the range of values in A & C, so if A & C are positive-constrained, you’ll have safer numeric representation by doing things on the log scale.

Samir_Bhatt · August 22, 2021, 6:27pm

both are positive yes, exp(random walk), exp(spline) and C is a discrete probability [0,1]. What this process models is a branching process, so A typically cant be above one (super critical) for long or you will get a rapid explosion, C being [0,1] and sum to 1 helps “damp” the explosive behaviour. Briefly, the branching process is a stochastic process Z(t) for which you can define a probability generating function. Log(Z(t)) is a more nasty thing to deal with even if numerically it makes life easy.

A very simple worked example is here, you seem to understand this all very fast so this should be easy to understand.

https://github.com/ImperialCollegeLondon/BellmanHarris_simulation_and_inference/blob/main/notebooks/simulate_prevalence_incidence_cumincidence.ipynb

mike-lawrence · August 22, 2021, 7:54pm

Ah, one efficiency boost comes from incrementing the row selection inside the loop. Using the R code from the notebook you linked:

# Simulation length and resolution
dx <- 5#0.05 # step size
max_time <- 15 # time horizon
xx <- seq(0, max_time, dx)
nn <- length(xx)


# Generation distribution
rbase <- function(n) rgamma(n, theta1, theta2)
dbase <- function(n) dgamma(n, theta1, theta2)
pbase <- function(n) pgamma(n, theta1, theta2)

# specify parameters for generation interval
theta1 <- 5
theta2 <- 6

g <- pbase(xx[2:nn]) - pbase(xx[1:(nn-1)]) # precompute generation intervals in the bins
g <- c(g, 1-sum(g)) # ensure sum to 1
G <- 1 - pbase(xx) # survival function


# Time-varying reproduction number
R0 <- 1.3
Rt <- function(x) {
	R0 + sin(0.5 * x)
}


# Reproduction number and generation time arrays
A <- matrix(0, nrow = nn, ncol = nn) # initialise Rt matrix
C <- matrix(NA, nrow = nn, ncol = nn) # initialise g matrix

for(i in 1:nn){
	A[,i] <- Rt(pmax(0, xx-xx[i])) # populate Rt matrix
	C[i,] <- g # populate g matrix
}

A[upper.tri(A)] <- 0 # make sure we dont have negative values

Bci = matrix(NA, nrow = nn, ncol = nn)
Bci[,1] <- 1 
Bci[,2] <- 1 + C[,1] * A[,1] 

Bci2 = matrix(NA, nrow = nn, ncol = nn)
Bci2[,1] <- 1 
Bci2[2:nn,2] <- 1 + C[2:nn,1] * A[2:nn,1] 

print(A)
print(C)
print(Bci)
print(Bci2)

for(i in 2:(nn-1)){
	cat('---------------------\n')
	cat(i,'revC:\n')
	print(C[,i:1])
	cat(i,'A:\n')
	print(A[,1:i])
	cat(i,'Bci:\n')
	print(Bci[,1:i])
	prods = C[,i:1] * A[,1:i] * Bci[,1:i]
	rowsums <- rowSums(prods)#, na.rm=T)
	cat(i,'prods:\n')
	print(cbind(prods,NA,rowsums))
	Bci[,i+1] <- 1 + rowsums
	cat(i,'Bci:\n')
	print(Bci)

	r = (i+1):nn
	prods2 = C[r,i:1] * A[r,1:i] * Bci[r,1:i]
	if(length(r)>1){
		rowsums2 <- rowSums(prods2)
	}else{
		rowsums2 <- sum(prods2)
	}
	cat(i,'prods2:\n')
	print(cbind(prods2,NA,rowsums2))
	Bci2[r,i+1] <- 1 + rowsums2
	cat(i,'Bci2:\n')
	print(Bci2)
	
}
all.equal(diag(Bci),diag(Bci2))

And my attempt to implement in Stan efficiently (inc. minimal variable declarations, which can slow things down apparently)

transformed data{
	vector[N] ones = rep_vector(1.0,N) ;
}
transformed parameters {
	vector[N] E_inc ;
	{
		matrix[N,N] B ;
		B[,1] = 1.0 ;
		B[2:N,2] = 1.0 + C[2:N,1] .* A[2:N,1] ;
		for(i in 2:(N-1)){
			B[(i+1):N,i+1] = ones[(i+1):N] + (
				C[(i+1):N,i:1] //removed revmat
				.*A[(i+1):N,1:i]
				.*B[(i+1):N,1:i]
			) * ones[(i+1):N] ;
		}
		E_inc = diagonal(B);
	}
}

mike-lawrence · August 22, 2021, 8:48pm

And here’s a version that attempts to avoid numeric instability by doing the products as sums on the log-scale:

functions{
	vector rows_log_sum_plus1_exp(matrix x){
		int N = rows(x) ;
		vector[N] result ;
		for(i in 1:N){
			result[i] = log_sum_exp( append_col( x[i], 0.0 ) ) ;
		}
		return(result) ;
	}
}
data{
	int N ;
	matrix[N,N] logA ;
	matrix[N,N] logC ;
}
transformed data {
	vector[N] E_inc ;
	{
		matrix[N,N] logB ;
		logB[,1] = rep_vector(0.0,N) ; //no terms for first col
		//second col has two terms & needs extra step of to_matrix
		logB[2:N,2] = rows_log_sum_plus1_exp(
			to_matrix(
				logC[2:N,1]
				+ logA[2:N,1]
				, N-1, 1
			)
		) ;
		//cols 3:N need 3 terms
		for(i in 3:N){
			logB[i:N,i] = rows_log_sum_plus1_exp(
				logC[i:N,(i-1):1]
				+ logA[i:N,1:(i-1)]
				+ logB[i:N,1:(i-1)]
			) ;
		}
		E_inc = exp(diagonal(logB));
	}
}

``

mike-lawrence · August 22, 2021, 9:25pm

@wds15 & @spinkney Just double-checking my self-taught-math understanding: If the problem was just:

data{
    int N ;
    vector[N] X ;
    vector[N] Y ;
}
transformed parameters {
    vector[N] Z ;
    for(i in 1:N){	
      Z[i] = sum( X[i:1]) .* Y[1:i]) ;
    }
}

then that’s just convolution of two signals, for which there appear to be many tricks available compute-wise (fft, etc).

But because the desired structure involves a third “signal”, none of those tricks are applicable, right?

Samir_Bhatt · August 23, 2021, 8:12am

thank you so much for your time @mike-lawrence, it is really appreciated. Let try that with the profiling approach you mentioned. Once again, thank you Mike for your engagement in helping with this. Both your suggestions are very clever

Regarding the convolution, you could use the convolution theorem which converts a convolution into a product via a Fourier transform (as you say). This means in the discrete case we could use a fast fourier transform (as you say). I have tried this, and the overheads means it is only really useful for long sequences which is rarely the case.

Note in what you wrote, that is only valid for circular convolutions, so some annoying padding is required for linear convolutions, but that should be easy enough to do in stan. We would need also to do the inverse FFT in stan.

For a simple convolution one could just use a dot product, something like

dot_product(b[1:(i-1)],tail(c, i-1));

Our loop is a bit more complex due to the inclusion of l and so a simple approach like above does not seem possible. I would love to be able to make things faster with an FFT on the matricies, but I am afraid its outside my ability to think up this solution.

wds15 · August 23, 2021, 8:20am

can you run a simplified model which can use the dot_product for the convolution? That should make the point that dot_product should give major speedups and in turn justify to think a bit harder about how to get something similar for your more involved model.

wds15 · August 23, 2021, 8:27am

forgot… you should be using 2.27.0 cmdstan for good Eigen expressions and then you also want to have

STAN_NO_RANGE_CHECKS=true

in your make/local (at least whenever you are sure about indices are correct)

Samir_Bhatt · August 23, 2021, 4:14pm

Forgive one annoying practical question just to implement what has been suggested above

does

C[(i+1):N,i:1]

Work for you? by this i mean specifically the i:1. The reason I wrote revmat was I never got this to work and always got an error

Exception: elt_multiply: Columns of m1 (0) and columns of m2 (2) must match in size

I am using cmdstan “2.27.0”

I assuming reverse indexing does not work in stan, but from your code this is not the case?

Thanks @wds15 Im pretty bad with versioning and will fix this

mike-lawrence · August 23, 2021, 5:01pm

Weird, I don’t get that error with the log model in my latest post (I did get an error related to using the wrong row/col sizes in to_matrix, but just updated my post to fix that).

Samir_Bhatt · August 23, 2021, 5:56pm

Yes weird, I reinstalled everything and it seems fine now, one less thing to worry about. Ill go about implementing your log code - hope that is faster, its a nice idea.

I should mention there is one thing that can also be used for speed ups that I forgot to mention (incase you are interested or spotted this and wondered) - we can truncate the matrices as each row of C is a probability distribution which quickly becomes close to zero. However, this is likely TMI as even truncating to just 50 rows is still very slow.

    B[1:N,1] = one_mat[1:N,1];
    B[2:N,2] = one_mat[2:N,2]+(C[2:N,1].*A[2:N,1].*B[2:N,1]);
    
    for(i in 3:T){	
      B[i:N,i] = one_mat[i:N,i] + (C[i:N,(i-1):1].*A[i:N,1:(i-1)].*B[i:N,1:(i-1)])* one_vec[1:(i - 1)];
    }
    for(i in (T+1):N){	
      B[i:N,i] = one_mat[i:N,i] + (C[i:N,T:1].*A[i:N,(i-T):(i-1)].*B[i:N,(i-T):(i-1)])* one_vec[1:T];
    }
    E_inc = diagonal(B);

wds15 · August 23, 2021, 6:40pm

A full working model with some fake data would help by now.

A few more comments: While we cannot index things in reverse order 5:1, what you can do is create an integer vector c(5, 4, 3, 2, 1) and index with that. Creating integer sequences needs its own function, but is cheap to do. Maybe you want to create an issue for reverse indexing? That should go to the stan repository. Though…I recall, that we have reverse for vectors. So you can write reverse(vec[1:5])… and with Eigen expressions that won’t be too terribly bad in terms of performance; but better profile this stuff.

To me it looks as if you cannot do a simple dot_product(a, b) for the convolution, but instead need to fold into it another matrix multiply like dot_product(a, C * b), right? You should still gain a lot of speed if you were to spell out the dot_product.

Samir_Bhatt · August 23, 2021, 8:08pm

branching_process.stan (1.4 KB)
test_for_stan_forum.r (1.8 KB)

Thanks. I have tried to put something together quickly that looks to work. It’s a simple, simulate data from the model, add some noise, fit. The code is not super clean, but wanted to share now and ill comment and clean up and put on our git repository asap. This should run fast - once i figure out how to profile (sorry I am a luddite) that should be useful

Topic		Replies	Views
Improving the performance of a model with lots of iteration Modeling specification , performance	14	2009	January 14, 2022
Improving the efficiency of *_dot_product for trinary-and-redundant matrices Modeling	38	1218	May 3, 2023
Matrix operations (1x1) run 10x slower than using reals RStan performance	8	775	June 3, 2020
Looking for suggetions to make codes more efficient General rstan	4	188	April 28, 2024
Speed of matrix multiplication versus multiplication loops Modeling rstan , techniques , performance	4	1697	November 8, 2022

Elementwise multiplications with row sums appear to be slow

Related topics