Gaussian copula for discrete marginals

Hi,

I would like to estimate a model that uses a Gaussian copula to piece together discrete marginals for count data (e.g. poisson or negative binomial distributed). To do this requires the multivariate Gaussian, which I don’t think currently exists in Stan. (Please correct me if I am wrong, though.)

It also requires the inverse normal cdf, which I also don’t think exists in Stan? Again, shoot me down if I am wrong.

I attach the paper that I am working off. In particular I would like to use eqn. (2.5) to work out the probability of a discrete data point.

If anyone has any bright ideas how to do this I am all ears! (I don’t want to resort to using multivariate poissons etc. – my problem has too many dimensions for this to be feasible.) Otherwise I will go and code up the Gibbs sampler in section 4 of the paper.

Best,

Ben

Gaussian copula better.pdf (272.2 KB)

4 Likes

We don’t have the Gaussian copula density if that is what you are asking, but you can use multi_normal_prec with precision matrix equal to Gamma inverse minus the identity matrix and then correct the determinant afterwards. But you would probably be better off writing your own function that takes a Cholesky factor of a correlation matrix.

This is totally wrong; it is the inv_Phi function.

Hi Ben,

Thank you, and my apologies for the misunderstanding here – I meant to say multivariate gaussian CDF, rather than just multivariate gaussian, (which I don’t think exists in Stan?).

Also apologies for not knowing where to look in the pages of the Stan manual for the univariate inverse normal cdf. I could not find it next to normal_cdf and so assumed it wasn’t there.

Best,

Ben

The multivariate Gaussian CDF does not exist anywhere really. But you don’t need it to do the stuff in section 4 of that paper.

Ok, thank you. Sorry am still being a bit slow though (first time working with Copulas) – to code this up in Stan would it be ok to work with with augmented likelihood (eqn. at the top of section 4.1), setting priors on Gamma and Theta?

Best, Ben

You don’t want to do that indicator function stuff with NUTS. Rather, you want to declare a bunch of nuisance parameters that are between 0 and 1, and then adjust them to the relevant bounds, and then make the Jacobian adjustment. It is a bit like what I outlined here:

https://groups.google.com/d/msg/stan-users/GuWUJogum1o/LvxjlUBnBwAJ

Hi Ben,

Thank you very much for your response, and I must say I did enjoy your approach to sampling from a truncated multivariate normal in the other question.

I do see the similarities between this problem and the copula one, however I am still not quite joining all the dots. Are you suggesting to do something like the following,

  • Create a load of u_ij which are uniformly distributed between the following bounds: a_ij <= u_ij < b_ij, where a_ij = F_j(y_ij-,theta_ij) and b_ij = F(y_ij,theta_ij). Here F(y_ij-) = F(y_ij-1) since y_ij is discrete. (Here F(.) is the CDF for a discrete distribution, say the poisson.)
  • Transform these to x_ij = Phi^{-1}(u_ij).
  • Calculate the likelihood for a N(x,|0,Gamma), and use this to increment the log probability.
  • Correct for the non-linear transformation of u - > x via a Jacobian.

I would be really grateful if you have the chance to look over the above, and tell me your thoughts. I do suspect that these types of model would be really helpful for people in Stan as they look for alternatives to multivariate discrete distributions.

Best,

Ben

Just following up on my own point here: would the likelihood be an unbounded N(x,Gamma) distribution, or a truncated one? Or am I still way off here?

Best,

Ben

So the following R code generates data from a bivariate gaussian copula with poisson marginals,

library(MASS)
fBivariateGaussianCopulaPoisson <- function(aVar,aCor,aMean,bMean){
    aSigma <- matrix(c(aVar,aCor,aCor,aVar),ncol = 2)
  vX <- mvrnorm(1,c(0,0),aSigma)
  vU <- pnorm(vX)
  aY <- qpois(vU[1],aMean)
  bY <- qpois(vU[2],bMean)
  return(c(aY,bY))
}

N <- 100
mCopula <- matrix(nrow = N,ncol=2)
for(i in 1:N){
    mCopula[i,] <- fBivariateGaussianCopulaPoisson(1,-0.5,5,3)  
}

I then use the below Stan file to attempt to estimate both the covariance matrix and the means of the marginals. It is able to reproduce the means of the marginals, but it does not seem to get the covariance matrix right. I have also tried the below but replacing inv_Phi(mv[i,j]) with inv_Phi(u[i,j]) but this does not seem to help. Note it is hard coded for the bivariate case.

Any ideas anyone?

functions{
  vector [] fFindBounds(row_vector vU, matrix L, int[] vY, vector vLambda){
    matrix[2,2] mBounds;
    vector[2] lLogProbU;
    matrix[2,2] u_star;
    vector[2] z_star;
    real z;
    vector[2] v;
    vector[2] out[2];

    mBounds[1,1] = poisson_cdf(vY[1]-1,vLambda[1]);
    mBounds[1,2] = poisson_cdf(vY[1],vLambda[1]);
    mBounds[2,1] = poisson_cdf(vY[2]-1,vLambda[2]);
    mBounds[2,2] = poisson_cdf(vY[2],vLambda[2]);

    z_star[1] = mBounds[1,1]/L[1,1];
    z_star[2] = mBounds[1,2]/L[1,1];

    u_star[1,1] = Phi(z_star[1]); 
    u_star[1,2] = Phi(z_star[2]);

    v[1] = u_star[1,1] + (u_star[1,2]-u_star[1,1]) * vU[1];
    z  = inv_Phi(v[1]);

    u_star[2,1] = Phi((mBounds[2,1]-L[2,1] * z)/L[2,2]); 
    u_star[2,2] = Phi((mBounds[2,2]-L[2,1] * z)/L[2,2]); 

    v[2] = u_star[2,1] + (u_star[2,2]-u_star[2,1]) * vU[2];

    lLogProbU[1] = u_star[1,2] - u_star[1,1];
    lLogProbU[2] = u_star[2,2] - u_star[2,1];

    out[1] = lLogProbU;
    out[2] = v;

    return(out);
   }
}

data{
  int N;
  int Y[N,2];
}

transformed data{
  vector[2] vZeros;
  for(i in 1:2)
    vZeros[i] = 1;
}

parameters{
  cholesky_factor_cov[2] L; // covariance between Xs
  vector<lower=0>[2] lambda; // mean of each marginal poisson
  
  // augmented liKelihood parameters
  matrix<lower=0,upper=1>[N,2] u;
}


transformed parameters{
  matrix[N,2] x;
  vector[N] lLogProb;
  matrix[N,2] mLogProbU;
  matrix[N,2] z;
  vector[2] lOut[N,2];
  matrix[N,2] mv;
  
  for(i in 1:N){
    lOut[i] = fFindBounds(u[i],L,Y[i],lambda);
    mv[i] = to_row_vector(lOut[i,2]);
    mLogProbU[i] = to_row_vector(lOut[i,1]);
  }
    
  for(i in 1:N){
    for(j in 1:2){
      z[i,j] = inv_Phi(mv[i,j]);
    }
  }
      
      
  for(i in 1:N){
       x[i] =  to_row_vector(L * to_vector(z[i]));
  }
    
  for(i in 1:N){
    lLogProb[i] = multi_normal_cholesky_lpdf(to_vector(x[i])|vZeros,L);
  }
}

model{
  for(i in 1:N){
    target += lLogProb[i];
    for(j in 1:2)
      target += log(mLogProbU[i,j]);
  }
  
  L ~ lkj_corr_cholesky(2.0);
  lambda ~ normal(0,10);
}

generated quantities{
  cov_matrix[2] Omega;
  Omega = L * L';
}

That should be cholesky_factor_corr[2] for starters. Also, I don’t think fFindBounds is doing the right thing. When it calculates zstar I believe those should be values for standard normal variates, such that if you computed their standard normal CDF, those are the breakpoints in the quantile function for the Poisson. Specifically, the calculation of z_star should involve inv_Phi

So, if an outcome is 5 and its mean is 3, then

z_star_high <- qnorm(ppois(4, 3))
qpois(pnorm(z_star_high), 3) # yields 4
qpois(pnorm(z_star_high) + 1e-13, 3) # now yields 5

z_star_low <- qnorm(ppois(4 - 1, 3))
qpois(pnorm(z_star_low), 3) # yields 3
qpois(pnorm(z_star_low) + 1e-13, 3) # now yields 4

So, you want a uniform between u_star_low = Phi(z_star_low) and u_star_high = Phi(z_star_high) but the Stan language only lets you put scalar bounds so you have a uniform between 0 and 1. So, do a change of variables there.

Then, for the second observed count, the critical values depend on L too.

 real fcop(real tau,real mu,real ni) {
  return( pow( pow(mu,-tau) + pow(ni,-tau) - 1.0,-1/tau));

}

real clayton_poisson(int y1, int y2, real la1, real la2, real tau) {
if(y1==0 && y2==0) {
return fcop(tau,exp(-la1), exp(-la2));
}
else if(y1>0 && y2==0) {
real C11m = poisson_cdf(y1-1,la1);
real C1 = C11m + exp(poisson_lpmf(y1 | la1));
return fcop(tau,C1, exp(-la2))
- fcop(tau,C11m, exp(-la2));
}
else if(y1==0 && y2>0) {
real C21m = poisson_cdf(y2-1, la2);
real C2 = C21m + exp(poisson_lpmf(y2 | la2));
return fcop(tau,exp(-la1), C2)
- fcop(tau,exp(-la1), C21m);
}
else {
real C11m = poisson_cdf(y1-1,la1);
real C1 = C11m + exp(poisson_lpmf(y1 | la1));
real C21m = poisson_cdf(y2-1, la2);
real C2 = C21m + exp(poisson_lpmf(y2 | la2));
return fcop(tau, C1, C2)
- fcop(tau, C11m, C2)
- fcop(tau, C1, C21m)
+ fcop(tau, C11m, C21m);
}
}

la1, la2 is on log-scale. It uses a bivariate poisson using clayton copula. The gaussian copula I posted one time in the
google Stan discussion board.

Added: Take a look at A PRIMER ON COPULAS FOR COUNT DATA, CHRISTIAN GENEST
AND JOHANNA NESLEHOVA for details and how and if you can expand this in the
multivariate case. There is also a multivariate poisson distribution.

Andre

Hi Ben,

Thanks for your advice here.

I agree I need to be taking the inverse standard normal of the poisson cdf value first. I have also changed to a correlation cholesky matrix.

Now I have the following code, which unfortunately seems to generate ```Rejecting initial value: gradient evaluated at the initial value is not finite’. I have been debugging by print but can’t seem to find the cause of this. Any ideas?

functions{
  vector [] fFindBounds(row_vector vU, matrix L, int[] vY, vector vLambda){
    matrix[2,2] mBounds;
    vector[2] lLogProbU;
    vector[2] u_star_lower;
    vector[2] u_star_upper;
    real z_star_lower;
    real z_star_upper;
    real z;
    vector[2] v;
    vector[2] out[2];

mBounds[1,1] = inv_Phi(poisson_cdf(vY[1]-1,vLambda[1]));
mBounds[1,2] = inv_Phi(poisson_cdf(vY[1],vLambda[1]));
mBounds[2,1] = inv_Phi(poisson_cdf(vY[2]-1,vLambda[2]));
mBounds[2,2] = inv_Phi(poisson_cdf(vY[2],vLambda[2]));

z_star_lower = mBounds[1,1]/L[1,1];
z_star_upper = mBounds[1,2]/L[1,1];

u_star_lower[1] = Phi(z_star_lower); 
u_star_upper[1] = Phi(z_star_upper);

v[1] = u_star_lower[1] + (u_star_upper[1]-u_star_lower[1]) * vU[1];

z  = inv_Phi(v[1]);

u_star_lower[2] = Phi((mBounds[2,1]-(L[2,1] * z))/L[2,2]); 
u_star_upper[2] = Phi((mBounds[2,2]-(L[2,1] * z))/L[2,2]); 

v[2] = u_star_lower[2] + (u_star_upper[2]-u_star_lower[2]) * vU[2];

lLogProbU[1] = u_star_upper[1] - u_star_lower[1];
lLogProbU[2] = u_star_upper[2] - u_star_lower[2];

out[1] = lLogProbU;
out[2] = v;

return(out);
  }
}

data{
  int N;
  int Y[N,2];
}

transformed data{
  vector[2] vZeros;
  for(i in 1:2)
    vZeros[i] = 0;
}

parameters{
  cholesky_factor_corr[2] L; // correlation between Xs
  vector<lower=0>[2] lambda; // mean of each marginal poisson
  
  // augmented liKelihood parameters
  matrix<lower=0,upper=1>[N,2] u;
}


transformed parameters{
  matrix[N,2] x;
  vector[N] lLogProb;
  matrix[N,2] mLogProbU;
  matrix[N,2] z;
  vector[2] lOut[N,2];
  matrix[N,2] mv;
  
  for(i in 1:N){
    lOut[i] = fFindBounds(u[i],L,Y[i],lambda);
    mv[i] = to_row_vector(lOut[i,2]);
    mLogProbU[i] = to_row_vector(lOut[i,1]);
  }
    
  for(i in 1:N){
    for(j in 1:2){
      z[i,j] = inv_Phi(mv[i,j]);
    }
  }
      
  for(i in 1:N){
       x[i] =  to_row_vector(L* to_vector(z[i]));
  }
    
  for(i in 1:N){
    lLogProb[i] = multi_normal_cholesky_lpdf(to_vector(x[i])|vZeros,L);
  }
}

model{
  for(i in 1:N){
    target += lLogProb[i];
    for(j in 1:2)
      target += log(mLogProbU[i,j]);
  }
  
  L ~ lkj_corr_cholesky(2.0);
  lambda ~ normal(0,10);
}

Hi Ben,

Sorry to bother you, but am really quite stuck on this one. Any ideas on the cause of the issue I mentioned in the previous message?

Best,

Ben

On Discourse, you have to tag @bgoodri in order for it to show up in the list at the top right. Also, for things like this, you have to install StanHeaders and RStan from the 2.14.x release in order to see the actual error message, which in your case is

[407] "Exception thrown at line 88: multi_normal_cholesky_log: Random variable[1] is -nan, but must not be nan!"  

It looks like inv_Phi is returning NaN, which I would guess means sometimes its input is outside of the (0,1) interval.

@bgoodri – thanks for your response. I can replicate the error for the few datasets for which the code appears to run. These are inevitably the smaller (fewer N) datasets, although it only appears to run about 1 time in 10, for a sample size of N=5. It does appear to produce reasonable results for the times when it does execute. However I cannot figure out what drives the failure during the 9/10 times when it fails, apart from it seems more likely to run when N is small.

There are occasions when the inputs appear to print as either 0 or 1 (for example when printing v[1]), and hence inv_Phi evaluates to - or + inf. However I cannot understand what is driving the sampler to visit these values…

I have tried passing initial values for the parameters which are sensible but this doesn’t seem to ameliorate the situation either.

Am really at wit’s end here, so any help would be most appreciated.

Best,

Ben

The observed counts that are zero mess up inv_Phi(poisson_cdf(vY[1]-1,vLambda[1])) and inv_Phi(poisson_cdf(vY[2]-1,vLambda[2])).
You need to handle those cases separately because it is just uniform between 0 and whatever the upper bound is.

@bgoodri Thank you very much for all your help with this. I have now coded up the example for a multivariate distribution with K dimensions (see below). All seems to work – the model is actually excellent at returning back the parameters that I use to generate the data (R code also below). Many thanks again for your help here, and include the code below in case others find it useful. I am, in particular, looking forward to extending this model to encompass a variety of discrete marginals: bernoulli, binomial, and negative binomial for starters. Best, Ben

R code to generate data from a Gaussian Copula with Poisson marginals (using a covariance matrix aSigma, and with marginal means given by the vector lMeans):

fMultivariateGaussianCopulaPoisson <- function(aSigma,lMeans){
  K <- length(lMeans)
  vX <- mvrnorm(1,rep(0,K),aSigma)
  vU <- pnorm(vX)
  lY <- vector(length=K)
  for(i in 1:K)
    lY[i] <- qpois(vU[i],lMeans[i])
  return(lY)
}

// For N observations from the above
fGenerateMultivariateCopulaPoisson <- function(N,aSigma,lMeans){
  K <- nrow(aSigma)
  mCopula <- matrix(nrow = N,ncol=K)
  for(i in 1:N){
    mCopula[i,] <- fMultivariateGaussianCopulaPoisson(aSigma,lMeans) 
  }
  return(mCopula)
}

Stan code (note I had to constrain the Poisson marginal parameters to avoid initialising these parameters with too large values that caused the log probability to go to -inf):

functions{
  vector [] fFindBounds(row_vector vU, matrix L, int[] vY, vector vLambda){
    int K = cols(vU);
    matrix[K,2] mBounds;
    vector[K] lLogProbU;
    vector[K] u_star_lower;
    vector[K] u_star_upper;
    real z_star_lower;
    real z_star_upper;
    vector[K] z;
    vector[K] v;
    vector[K] out[2];
    for(k in 1:K){
      int km1 = k - 1;
      if(vY[k]<0.001){ ## upper bound only
        mBounds[k,1] = -1;
        mBounds[k,2] = inv_Phi(poisson_cdf(vY[k],vLambda[k]));
        z_star_lower = -1;
        z_star_upper = (mBounds[k,2] - ((k > 1) ? L[k,1:km1] * head(z,km1) : 0)) / L[k,k];
        u_star_lower[k] = 0;
        u_star_upper[k] = Phi(z_star_upper);
      }else{
        mBounds[k,1] = inv_Phi(poisson_cdf(vY[k]-1,vLambda[k]));
        mBounds[k,2] = inv_Phi(poisson_cdf(vY[k],vLambda[k]));
        z_star_lower = (mBounds[k,1] - ((k > 1) ? L[k,1:km1] * head(z,km1) : 0)) / L[k,k];
        z_star_upper = (mBounds[k,2] - ((k > 1) ? L[k,1:km1] * head(z,km1) : 0)) / L[k,k];
        u_star_lower[k] = Phi(z_star_lower); 
        u_star_upper[k] = Phi(z_star_upper);
      }
      v[k] = u_star_lower[k] + (u_star_upper[k]-u_star_lower[k]) * vU[k];
      z[k] = inv_Phi(v[k]);
      lLogProbU[k] = u_star_upper[k] - u_star_lower[k];
    }
    
    out[1] = lLogProbU;
    out[2] = v;
    
    return(out);
  }
}

data{
  int N;
  int K;
  int Y[N,K];
}

transformed data{
  vector[K] vZeros;
  for(i in 1:K)
    vZeros[i] = 0;
}

parameters{
  cholesky_factor_corr[K] L; // correlation between Xs
  vector<lower=0,upper=30>[K] lambda; // mean of each marginal poisson
  vector<lower=0>[K] sigma;
  
  // augmented likelihood parameters
  matrix<lower=0,upper=1>[N,K] u;
}


transformed parameters{
  matrix[N,K] x;
  vector[N] lLogProb;
  matrix[N,K] mLogProbU;
  matrix[N,K] z;
  vector[K] lOut[N,2];
  matrix[N,K] mv;
  cholesky_factor_cov[K] L_cov;
  
  L_cov = diag_pre_multiply(sigma,L);
  
  for(i in 1:N){
    lOut[i] = fFindBounds(u[i],L_cov,Y[i],lambda);
    mv[i] = to_row_vector(lOut[i,2]);
    mLogProbU[i] = to_row_vector(lOut[i,1]);
  }
    
  for(i in 1:N){
    for(j in 1:K){
      z[i,j] = inv_Phi(mv[i,j]);
    }
  }
      
  for(i in 1:N){
       x[i] =  to_row_vector(L_cov * to_vector(z[i]));
  }
    
  for(i in 1:N){
    lLogProb[i] = multi_normal_cholesky_lpdf(to_vector(x[i])|vZeros,L_cov);
  }
}

model{
  for(i in 1:N){
    target += lLogProb[i];
    for(j in 1:K)
      target += log(mLogProbU[i,j]);
  }
  
  L ~ lkj_corr_cholesky(2.0);
  lambda ~ normal(5,10);
  sigma ~ lognormal(0,1);
}

generated quantities{
  cov_matrix[K] Omega;
  Omega = L_cov * L_cov';
}
1 Like

Did it not work to just initialize the parameters with smaller values?

Yes, this is what I actually have done subsequently.

@Ben_Lambert thank you for posting your code here. I’ve tried it out and it works well. I work a lot with counts of organisms and these sorts of models could be extremely useful. I’d be interested to hear if you’ve developed this any further, e.g., for other marginal distributions such as the negative binomial, as you suggested.