Using rstan::optimizing to obtain MLE does not seem to run with large number of parameters


Dear stan users: I am working on a MLE problem. In my problem there are 185 parameters.
Essentially, there are 35 groups and each is from a bivariate normal distribution with group-specific mean, var-covariance matrix. I am modelling on (\mu[1:35, 1:2],\sigma[1:35,1:2], \rho[35], altogether there are 175 parameters)
In addition, I assume that \mu[1:35,1:2] come from a multivariate student-t distribution with means vector to be dependent on a covariate and a universal variance covariance matrix made up of (\sigma_{\mu}[1:2 ],
\rho_{\mu} ) and a degree of freedom.

\mu[1:35,1 ] = coef[1] + coef[2] * risk_lvl[1:35] + coef[3] * risk_lvl[1:35]^2
\mu[1:35,2 ] = coef[4] + coef[5] * risk_lvl[1:35] + coef[6] * risk_lvl[1:35]^2

Therefore, altogether there are 185 parameters. I just want to do MLE so I am calling rstan::optimizing.
The model seems to get stuck at
Initial log joint probability = -1.27423e+006

Here is how I call the model, I do supply the initial values
initf1<- list( mu = cbind(theta.hist.norm[,1], theta.hist.norm[,2]),
sigma = cbind(sqrt(theta.hist.norm[,3]), sqrt(theta.hist.norm[,5])),
rho = c(theta.hist.norm[,4]/sqrt(theta.hist.norm[,3]*theta.hist.norm[,5])),
coef = c(Coef1[3],Coef1[2],Coef1[1],
nu = 1, sigma_mu = c(sqrt(CCC[1,1]),sqrt(CCC[2,2])), rho_mu= CCC[1,2]/sqrt(
CCC[1,1] * CCC[2,2]))
sym_mle <-optimizing(sym_bivariate_mle, data=data_set,init=initf1, iter=50)

I have no idea whether is because there are too many parameters or there is something wrong with my model

Below is my model. Can someone please suggest what’s the problem causing optimizing not running?
Thanks in advance.

  real binormal_cdf(real z1, real z2, real rho) {
    if (z1 != 0 || z2 != 0) {
      real denom = fabs(rho) < 1.0 ? sqrt((1 + rho) * (1 - rho)) : not_a_number();
      real a1 = (z2 / z1 - rho) / denom;
      real a2 = (z1 / z2 - rho) / denom;
      real product = z1 * z2;
      real delta = product < 0 || (product == 0 && (z1 + z2) < 0);
      return 0.5 * (Phi(z1) + Phi(z2) - delta) - owens_t(z1, a1) - owens_t(z2, a2);
    return 0.25 + asin(rho) / (2 * pi());

//joint likelihood function for random vector (q1,m,q3)
data {
  int<lower=0> n;//number of risk groups ==35
  int<lower=0> m;//number of breaks in x.breaks and y.breaks ==6
  vector[m] x_breaks[n];//matrix of bivaraite min max in x
  vector[m] y_breaks[n];//matrix of bivaraite min max in y
  int<lower=0> B;//number of bins per x and y-dimension histogram ==5
  matrix[B,B] counts[n];//counts 
  vector[n] risk_lvl;//risk level for each bivaraite histogram

  vector<lower=0>[2] mu[n];
  vector<lower=0>[2] sigma[n];
  vector<lower=-1, upper=1>[n] rho;
  vector[6] coef;//coefficients for linear regression on mu 
  //coef[1:3] are intercept, beta1, beta2 for mu[1:35,1]
  //coef[4:6] are intercept, beta1, beta2 for mu[1:35,2]
  vector<lower=0>[2] sigma_mu;//sigma_mu[1] is the std dev for mu[1:35,1]
  //sigma_mu[2] is the std dev for mu[1:35,2]
  real<lower=-1,upper=1> rho_mu;
  real<lower=0> nu;//degree of freedom of the multivariate student-t distn

transformed parameters{
   cov_matrix[2] Sigma;
  Sigma[1,1] = square(sigma_mu[1]);
  Sigma[1,2] = rho_mu * sigma_mu[1] * sigma_mu[2];
  Sigma[2,1] = Sigma[1,2];
  Sigma[2,2] = square(sigma_mu[2]);

  vector[m] x_breaks_std[n];//matrix of bivaraite min max in x
  vector[m] y_breaks_std[n];//matrix of bivaraite min max in y
  matrix[m,m] bi_normal_cdf[n];//contain bivariate normal cdf at every combination per risk group 
  vector[B*B] logg[n];//intermediate container for loglikelihood
  vector[2] mean_mu[n];
  for(i in 1:n){
   mean_mu[i,1] = coef[1] + coef[2]* risk_lvl[i] + coef[3]*square(risk_lvl[i]);
   mean_mu[i,2] = coef[4] + coef[5]* risk_lvl[i] + coef[6]*square(risk_lvl[i]);

  for(i in 1:n){
    target += multi_student_t_lpdf(to_vector(mu[i,])| nu, mean_mu[i,1:2], Sigma );
   for(i in 1:n){
  x_breaks_std[i,] = (x_breaks[i,] - mu[i,1])/sigma[i,1];
  y_breaks_std[i,] = (y_breaks[i,] - mu[i,2])/sigma[i,2];
  for(i in 1:n){
   for(j in 1:m){
     for(k in 1:m){
bi_normal_cdf[i,j,k]= binormal_cdf(x_breaks_std[i,j], y_breaks_std[i,k], rho[i]) ;


for(i in 1:n){
    for(j in 1:B){
      for(k in 1:B){
      if(counts[i,j,k] != 0){
          logg[i,(k+5*(j-1))] = bi_normal_cdf[i,j+1, k+1] - bi_normal_cdf[i,j, k+1] - bi_normal_cdf[i, j+1,k] + bi_normal_cdf[i,j,k];

if(logg[i,(k+5*(j-1))] <=0){
  target += log(1e-320) * counts[i,j,k];

 target += log( logg[i,(k+5*(j-1))] ) * counts[i,j,k];



Well if a model is sick and you don’t know why, first thing to do is try to build the simplest model that you know should work, and then layer the complexity back till things break (so you can figure out why they break).

Can you work with a one group model? Is that a fair place to start here?

I’m not familiar with the binormal distribution, but it looks like you’re working with binned data that you assume actually comes from a binormal and then was binned? This sort of binned stuff is tricky I think. There was at least one recent question on it: Gaussian mixture model with interval censored data -- where do the divergences come from?

It’s tricky to get numerics right on the CDFs. There was another thread on this awhile ago but I failed to find it last time I looked for it.

Is there any way you can model this with a discrete distribution?


Thank you for your reply. I modified my model slightly but now it is running but it seems to take forever to run (I have been waiting for more than 1 hour and so far nothing seems to come out from the model).
I am just wondering whether it is the complexity of the model that causes it to be extremely slow or there might be something else causing the problem.

I read the manual regarding the optimizing section, in my model I have 70 \sigma and 35 \rho to be estimated and a variance and co-variance matrix. Initially I set all these parameters to be properly constrained but the manual suggests not to constrain the parameters and supply initial values and this way should be more efficient, I am running both constrained and unconstrained versions but neither seems to run.

Also I ran one group at a time and it is running and giving reasonable results.

sorry for rambling but any suggestions would be greatly appreciated .


Do you get anywhere using 2 groups together?

What does the sampler do? I think some hierarchical models don’t have maximums, so optimizing doesn’t make sense for them.