Stan how to optimize the estimation process?


#1

Hi guys!
I have been working on a hierachical mixed effect multinomila choice model. The dataset is not huge, around 20,000 obs, so I guess it is doable using bayesian.

I know it is going to take a huge amount of time before hand. I coded the model using a matrix form without any loops, and also avoided using any complicated distributions.

I ended up finding that avoiding using loops didn’t decreases the running time significantly. I tried on a subsample of 260 obs using 4 chians 1500 iterations. The time consumed for coding without loop is around 8 hours and with loops is around 10 hours, which makes no sense to me. Usually in R, matlab or any other softwares avoiding loops saves a lot of time.

That’s also to say, it will take “forever” to fit the whole dataset. I tried to find answers in the user manual but found nothing.

I would like to know where the majority of time is spent and how to improve the efficiency. I will appreciate if you could give me any suggestions or feedbacks.

Thanks in advance,
Austin
stan2.txt (3.1 KB)


#2

Loops are bad in Stan for different reasons than in R. So in general, removing loops in both languages is ‘good’ but exactly how and when to remove loops can vary.

In R, loops are bad because R itself is slow, and so it is better to vectorize things and let the internal C components handle things, which runs quickly.

Stan programs are compiled down to C++, so the loops themselves are actually very fast. However allocating things onto the autodiff stack can be very slow, so using loops to copy things into a vector and then running the vectorized function can speed things up.


#3
theta[,1]=-log1p(exp(-u[,1]))-log1p(exp(-u[,2]));  //probablity vector for multinomial model: p*q
theta[,2]=-log1p(exp(-u[,1]))-log1p(exp(u[,2])); //probablity vector for multinomial model: p*(1-q)
theta[,3]=-log1p(exp(u[,1])); //probablity vector for multinomial model: 1-p

summands=y[,1].*theta[,1]+y[,2].*theta[,2]+y[,3].*theta[,3];
target+=sum(summands);

This looks like a mistake. Is this a mixture model?

You probably want a log_sum_exp in there somewhere. Check Chapter 13: Finite Mixtures in the manual.

But concerning the performance, usually when Stan samples models slowly it’s because it’s taking a long time for NUTS to move around in the posterior and generate informative samples. Usually the fix for stuff like this (if there is one) is in reparameterizations (edit: and priors).

If this is a mixture model you’re fitting, there’s a fair bit about them in Chapter 24: Problematic Posteriors, which probably isn’t what you wanted to hear :P.

If you’re using R, the easiest place to start diagnosing a model is with the pairplots/Shinystan stuff. This is the sorta thing you wanna work through: http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html. This page also has some good stuff on it: https://github.com/stan-dev/stan/wiki/Stan-Best-Practices .

This stuff make sense at all? tldr; is there’s probably an issue with your model, and there might be a reparameterization that can help you. Trick is finding it :D.


#4

Hi bbbales2,

Thanks for your reply. The mixture modes are actually for the u[,1] and u[,2] since I have both fixed and random coefficients.

As I checked myself, theta seems to be right. For here, p=exp(u[,1])/(1+exp(u[,1]))=1/(1+exp(-u[,1])), q=exp(u[,2])/(1+exp(u[,2]))=1/(1+exp(-u[,2])), and prob vector=[p.q, p.(1-q), (1-p)].

Since I decide to write the loglikelihood out, theta in the program is actually the log(theta), say theta=[log§+log(q),log§+log(1-q), log(1-p)]. The loglikelihood=y1log(pq)+y2log(p(1-q))+y3log(1-p)

I think target+= is to maximize the likelihood of the function, so it should be given the log-likelihood (correct me if I am wrong).

Anyway, thanks for the info, I will check them out and see if I could find anything.


#5

Thanks for the info Aaron. Let me do a few rounds of simulation and see how it works.

Cheers!


#6

One problem in your test could be that the 260 observations in your test don’t provide enough information to estimate all the hierarchical coefficients. The geometry of the model (and the speed) could be better behaved if you include more observations. I understand you might not be able to easily test this but you could get an idea with less iterations but more observations.

Another way could be to use the ‘brms’ package if you are an R user.

To narrow down the problem it might be a good idea to test simpler models first. Maybe try one with just the fixed effect, than one with a hierachical intercept, than the full model. That might already give you an idea whether your model is off or whether it’s an efficiency problem.

I also had a quick look at the model and the folllowing bit seemed the least robust part to me.

umu[,1]=xu*a; //for fixed coefficients
umu[,2]=xu*b; //for fixed coefficients
 for (j in 1:K_c){
 umu[nkeyword[j,1]:nkeyword[j,2],1]=xc[nkeyword[j,1]:nkeyword[j,2],]*(alpha[j,]')+umu[nkeyword[j,1]:nkeyword[j,2],1]; //for random coefficients
 umu[nkeyword[j,1]:nkeyword[j,2],2]=xc[nkeyword[j,1]:nkeyword[j,2],]*(beta[j,]')+umu[nkeyword[j,1]:nkeyword[j,2],2]; //for random coefficients
 }
u=umu+(sigmay*z)';
}

It looks to me like you are matching the fixed effects xu * a and xu * b with the random effect through the use of nkeyword for each “random” variable. I assume nkeyword is something like the a start and stop index for each keyword. I think this loop should be over all keywords i.e. for ( j in 1:R). You want to match on the keywords not on the number of variables.


#7

Hi Stijn,

Thanks for the reply. That’s a good catch, thanks. I did try to fit the model part by part, this bug is actually caused by revision. I will check out the brms package to see if it works.