Hello Stan Community,
I’m attempting to perform probabilistic clustering utilizing mixture models, where my mixture components are copulas. Ultimately, my final copula components would be comprised of varying marginals, but just to attempt a working example I have generated and written my stan program to accept only copula components with lognormal marginals.
The likelihood function I am aiming to model takes the form of:
L(x_1,x_2) = \Sigma_i \hspace{2mm}\pi_i C_i(x_1,x_2)f_{i,1}(x_1)f_{i,2}(x_2)
Where C_i is the gaussian copula and f_{i,\_} are lognormal pdfs.
Here is the R code I’ve used to generate some sample data.
library(cmdstanr)
library(tidyverse)
## Generate Data
n_comp <- 2; n_mix <- 2; mixing_prob <- c(0.3, 0.7); n_tot <- 300;
gen_corr_mat <- function(n_comp){
rnorm(n_comp^2) |> matrix(nrow=n_comp) |> qr() |> qr.Q() %>%
{crossprod(., .*(n_comp:1))} |> cov2cor() |> round(4)
}
corr_list <- lapply(1:n_mix, \(x) gen_corr_mat(n_comp))
params <- lapply(1:n_mix, \(y) lapply(
1:(n_comp+1),
\(x) {
if (x == n_comp + 1){
corr_list[[y]]
}
else {
c('meanlog' = runif(1,0.1,20), 'sdlog' = runif(1,0.1,2))
}
}
))
gen_from_lnorm_cop <- function(n, p_list){
## Assumes that last element is the correlation matrix
z <- MASS::mvrnorm(n, mu=rep(0,length(p_list)-1), Sigma = p_list[[length(p_list)]])
for (i in 1:ncol(z)){
z[,i] <- z[,i] |> pnorm() |> qlnorm(meanlog=p_list[[i]]['meanlog'], sdlog=p_list[[i]]['sdlog'])
}
z
}
data <- matrix(NA, nrow=n_tot, ncol=n_comp + 1)
data[,ncol(data)] <- sample(seq(1,n_mix), n_tot, replace=TRUE, prob=mixing_prob)
for (i in seq_along(mixing_prob)){
data[data[,ncol(data)] == i, -ncol(data)] <- gen_from_lnorm_cop(sum(data[,ncol(data)] == i), params[[i]])
}
plot(data[,1] |> log(), data[,2] |> log())
m1 <- cmdstan_model('copula_mix.stan')
fit <- m1$sample(data=data_list, seed=42, chains = 4, parallel_chains = 4)
As well as the stan code to fit the model, using the awesome Gaussian Copula function provided by @spinkney :
functions {
real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
int N = rows(U);
int J = cols(U);
matrix[J, J] Gammainv = chol2inv(L);
return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
}
}
data{
int<lower=0> N; // Number of obs
int<lower=0> p; // Number of features
int<lower=0> nMix; // Numer of mixture components
array[N] vector[p] X;
}
parameters {
array[p] ordered[nMix] logMean;
array[p] vector<lower=0>[nMix] logSd;
array[nMix] cholesky_factor_corr[p] L;
real<lower=0, upper=1> mixing;
}
model {
for (i in 1:nMix){
logMean[:,i] ~ normal(0,2);
logSd[i, :] ~ normal(0,3);
L[i, :] ~ lkj_corr_cholesky(2.0);
}
mixing ~ beta(3,3);
matrix [N, nMix] marginalLogLik = rep_matrix(0, N, nMix);
for (i in 1:nMix) {
for (j in 1:N) {
matrix[1,p] u;
for (k in 1:p){
u[1,k] = inv_Phi(lognormal_cdf(X[j,k] | logMean[k,i], logSd[k,i]));
}
marginalLogLik[j,i] = multi_normal_cholesky_copula_lpdf(u | L[i]);
}
}
for (i in 1:nMix){
for (j in 1:p){
marginalLogLik[:,i] += lognormal_lupdf(X[:,i] | logMean[j,i], logSd[j,i]);
}
marginalLogLik[:,i] += i == 1 ? log(mixing) : log1m(mixing);
}
for (i in 1:N){
target += log_sum_exp(marginalLogLik[i,:]);
}
}
generated quantities {
array[nMix] matrix[p,p] Gamma;
for (i in 1:nMix){
Gamma[i] = multiply_lower_tri_self_transpose(L[i]);
}
}
I’m currently running into identifiability issues where the mixing
parameter converges to basically 1 or 0, and the LogMean
components for each component become identical.
Has anyone tried to fit this type of model before? Is there an alternative parameterization that I should be aware of?
Thanks for any feedback you can offer!