Using rstan::optimizing to obtain MLE does not seem to run with large number of parameters

Dear stan users: I am working on a MLE problem. In my problem there are 185 parameters.
Essentially, there are 35 groups and each is from a bivariate normal distribution with group-specific mean, var-covariance matrix. I am modelling on (\mu[1:35, 1:2],\sigma[1:35,1:2], \rho[35], altogether there are 175 parameters)
In addition, I assume that \mu[1:35,1:2] come from a multivariate student-t distribution with means vector to be dependent on a covariate and a universal variance covariance matrix made up of (\sigma_{\mu}[1:2 ],
\rho_{\mu} ) and a degree of freedom.

\mu[1:35,1 ] = coef[1] + coef[2] * risk_lvl[1:35] + coef[3] * risk_lvl[1:35]^2
\mu[1:35,2 ] = coef[4] + coef[5] * risk_lvl[1:35] + coef[6] * risk_lvl[1:35]^2

Therefore, altogether there are 185 parameters. I just want to do MLE so I am calling rstan::optimizing.
The model seems to get stuck at
Initial log joint probability = -1.27423e+006

Here is how I call the model, I do supply the initial values
initf1<- list( mu = cbind(theta.hist.norm[,1], theta.hist.norm[,2]),
sigma = cbind(sqrt(theta.hist.norm[,3]), sqrt(theta.hist.norm[,5])),
rho = c(theta.hist.norm[,4]/sqrt(theta.hist.norm[,3]*theta.hist.norm[,5])),
coef = c(Coef1[3],Coef1[2],Coef1[1],
nu = 1, sigma_mu = c(sqrt(CCC[1,1]),sqrt(CCC[2,2])), rho_mu= CCC[1,2]/sqrt(
CCC[1,1] * CCC[2,2]))
sym_mle <-optimizing(sym_bivariate_mle, data=data_set,init=initf1, iter=50)

I have no idea whether is because there are too many parameters or there is something wrong with my model

Below is my model. Can someone please suggest what’s the problem causing optimizing not running?
Thanks in advance.

  real binormal_cdf(real z1, real z2, real rho) {
    if (z1 != 0 || z2 != 0) {
      real denom = fabs(rho) < 1.0 ? sqrt((1 + rho) * (1 - rho)) : not_a_number();
      real a1 = (z2 / z1 - rho) / denom;
      real a2 = (z1 / z2 - rho) / denom;
      real product = z1 * z2;
      real delta = product < 0 || (product == 0 && (z1 + z2) < 0);
      return 0.5 * (Phi(z1) + Phi(z2) - delta) - owens_t(z1, a1) - owens_t(z2, a2);
    return 0.25 + asin(rho) / (2 * pi());

//joint likelihood function for random vector (q1,m,q3)
data {
  int<lower=0> n;//number of risk groups ==35
  int<lower=0> m;//number of breaks in x.breaks and y.breaks ==6
  vector[m] x_breaks[n];//matrix of bivaraite min max in x
  vector[m] y_breaks[n];//matrix of bivaraite min max in y
  int<lower=0> B;//number of bins per x and y-dimension histogram ==5
  matrix[B,B] counts[n];//counts 
  vector[n] risk_lvl;//risk level for each bivaraite histogram

  vector<lower=0>[2] mu[n];
  vector<lower=0>[2] sigma[n];
  vector<lower=-1, upper=1>[n] rho;
  vector[6] coef;//coefficients for linear regression on mu 
  //coef[1:3] are intercept, beta1, beta2 for mu[1:35,1]
  //coef[4:6] are intercept, beta1, beta2 for mu[1:35,2]
  vector<lower=0>[2] sigma_mu;//sigma_mu[1] is the std dev for mu[1:35,1]
  //sigma_mu[2] is the std dev for mu[1:35,2]
  real<lower=-1,upper=1> rho_mu;
  real<lower=0> nu;//degree of freedom of the multivariate student-t distn

transformed parameters{
   cov_matrix[2] Sigma;
  Sigma[1,1] = square(sigma_mu[1]);
  Sigma[1,2] = rho_mu * sigma_mu[1] * sigma_mu[2];
  Sigma[2,1] = Sigma[1,2];
  Sigma[2,2] = square(sigma_mu[2]);

  vector[m] x_breaks_std[n];//matrix of bivaraite min max in x
  vector[m] y_breaks_std[n];//matrix of bivaraite min max in y
  matrix[m,m] bi_normal_cdf[n];//contain bivariate normal cdf at every combination per risk group 
  vector[B*B] logg[n];//intermediate container for loglikelihood
  vector[2] mean_mu[n];
  for(i in 1:n){
   mean_mu[i,1] = coef[1] + coef[2]* risk_lvl[i] + coef[3]*square(risk_lvl[i]);
   mean_mu[i,2] = coef[4] + coef[5]* risk_lvl[i] + coef[6]*square(risk_lvl[i]);

  for(i in 1:n){
    target += multi_student_t_lpdf(to_vector(mu[i,])| nu, mean_mu[i,1:2], Sigma );
   for(i in 1:n){
  x_breaks_std[i,] = (x_breaks[i,] - mu[i,1])/sigma[i,1];
  y_breaks_std[i,] = (y_breaks[i,] - mu[i,2])/sigma[i,2];
  for(i in 1:n){
   for(j in 1:m){
     for(k in 1:m){
bi_normal_cdf[i,j,k]= binormal_cdf(x_breaks_std[i,j], y_breaks_std[i,k], rho[i]) ;


for(i in 1:n){
    for(j in 1:B){
      for(k in 1:B){
      if(counts[i,j,k] != 0){
          logg[i,(k+5*(j-1))] = bi_normal_cdf[i,j+1, k+1] - bi_normal_cdf[i,j, k+1] - bi_normal_cdf[i, j+1,k] + bi_normal_cdf[i,j,k];

if(logg[i,(k+5*(j-1))] <=0){
  target += log(1e-320) * counts[i,j,k];

 target += log( logg[i,(k+5*(j-1))] ) * counts[i,j,k];


Well if a model is sick and you don’t know why, first thing to do is try to build the simplest model that you know should work, and then layer the complexity back till things break (so you can figure out why they break).

Can you work with a one group model? Is that a fair place to start here?

I’m not familiar with the binormal distribution, but it looks like you’re working with binned data that you assume actually comes from a binormal and then was binned? This sort of binned stuff is tricky I think. There was at least one recent question on it: Gaussian mixture model with interval censored data -- where do the divergences come from?

It’s tricky to get numerics right on the CDFs. There was another thread on this awhile ago but I failed to find it last time I looked for it.

Is there any way you can model this with a discrete distribution?

Thank you for your reply. I modified my model slightly but now it is running but it seems to take forever to run (I have been waiting for more than 1 hour and so far nothing seems to come out from the model).
I am just wondering whether it is the complexity of the model that causes it to be extremely slow or there might be something else causing the problem.

I read the manual regarding the optimizing section, in my model I have 70 \sigma and 35 \rho to be estimated and a variance and co-variance matrix. Initially I set all these parameters to be properly constrained but the manual suggests not to constrain the parameters and supply initial values and this way should be more efficient, I am running both constrained and unconstrained versions but neither seems to run.

Also I ran one group at a time and it is running and giving reasonable results.

sorry for rambling but any suggestions would be greatly appreciated .

Do you get anywhere using 2 groups together?

What does the sampler do? I think some hierarchical models don’t have maximums, so optimizing doesn’t make sense for them.

I read a paper
and found that perhaps the way I constructed the log likelihood function is not exactly. Essentially, I think I am doing a hierarchical model MLE.

where l( \phi, \theta_i | y) = \int l(y| \theta_i) l(\theta_i | \phi) d \theta_i

Here \theta_i is the group-level parameters and \phi is the global parameters governing all groups.

so the right way to do hierarchical model MLE you have to integrate out group-level parameters and then to get MLE for the global parameters and based on the MLE of the global parameters then you can infer group level parameters \theta_i. However, based on my current stan model, I only have the product of the two models ( the likelihood and the prior on the group-level) but I didn’t integrate out the group-level parameters \theta_i that is perhaps where it goes wrong?

Can anyone suggest me how i can integrate out the group-level parameters within stan? are there any examples lying around that I can refer to before implementing it for my model?

Thanks in advance.

If you just run your sampler on the full model, then just ignoring the variables you wanted to integrate out in the posterior is the same as integrating them out.

So in your case sample all the \theta_i and \phi parameters but only look at \phi in the results. If all your diagnostics are good (no divergences or treedepths or BFMI warnings), then you’re basically good to go.

Integrating these parameters out manually usually requires a really specific model structure. Lots of normals, etc, for the algebra to be easy.

thanks for your fast reply.
but what you suggested is for doing a hierarhical Bayesian right?
but in my case I just want to do hierarchical maximum likelihood using rstan::optimising
so the joint likelihood l(\phi | y ) = \int l(y|\theta_i) l(\theta_i| \phi) d\theta_i .

In the paper, the author used fourth-order Gauss–Lobatto quadrature, with 20 points along each parameter and integration limits at the .01 and .99 quantiles of each parent distribution. Then minimized the negative log-likelihood by using the SIMPLEX algorithm.


I’m not going to say this is impossible, but this isn’t what Stan is built for and so you’re looking at writing quite a bit of custom code to achieve this with Stan. Why avoid the sampler though? It should give better results.

There is no hierarchical MLE in the usual cases. The density is unbounded as the hierarchical variance approaches 0 and the lower-level parameters approach the hierarchical mean.

You need to add in something like a zero-avoiding prior as discussed by Anrew Gelman and others in several papers about ten years back. I’ll point this out in the next revision of the manual.

If they were doing quadrature, wouldn’t that have been to solve the expectations in the Bayesain posterior? Or were they doing quadrature to integrate out the low-level parameters?

Thank you both for your explanation. I will dig up the papers and have a look.

I am wondering with the current model construction where I have just have the likelihood for each group and a prior model for group-level mean without assigning any prior distributions on the rest of the parameters, then what is the difference between using rstan::optimising versus rstan::sampling?

I tried rstan::sampling on the current model above and R hat looks terrible and all chains are not mixing well/ but the point estimate does not look far away from what I expect. but I never got the model to run using rstan::optimising.

are there any multivariate hierarchical model example lying around? I looked through manual but couldnt find any example code for me to study.

thanks for your reply.

The short answer is that optimizing finds the posterior mode whereas sampling draws from the posterior. The posterior mean minimizes expected square error in parameter estimation (assuming the model is well specified, of course).

There’s a longer answer in the two book chapters. But details on how they differ in use are hard to come by as people don’t usually mix and match in the same problems.

thank you for your quick reply. I will look into zero-avoiding prior on variance parameter and also simplify my model a bit to see if I can get sampling working without horrible Rhat etc.

Yes, this is the way to go. Look for non-identifiabilities and places to use non-centered parameterizations.

My instincts say that in a hierarchical model where quadrature was useful you should be able to sample that as well. If the sampler isn’t working, I’m led to doubt the quadrature. I could easily be wrong