Summary
In this post I am doing two things:
- Providing a final update on a copula model that I have been working on for a year and that has been greatly improved thanks to this community.
- Asking an additional question: how can I use the samples from the fit of this model for downstream prediction using fit$generated_quantities? (see below)
A Big Thank You
Before I get to my question, I want to send a huge shoutout to the community here. I have lurked on this forum for well over a year while trying to fit a Mixed Discrete and Continuous Gaussian Copula, and I can finally say that I have been successful. My full model is here: MDCG_V3.stan (49.0 KB). A smaller, condensed version is pasted below where I ask my prediction question. Essentially, I have a 54-dimensional dataset with some continuous, normally distributed observed values y_c and some ordered probit observed values (latent continuous) y_o. These are both conditional on a positive, real independent variable x and linked through the correlation matrix \Sigma in a Gaussian copula. There is missing data in the multivariate response vector, which is accounted for in the model linked above.

I discovered that, given the complexity of the model, I needed tight priors to avoid inefficiency and BFMI issues. I achieved this by first fitting each marginal univariately and then using the posterior expectation of each parameter from that fit to center its prior in the larger multivariate model (a sketch of one such univariate fit is below). I then found the prior on the Cholesky factor finicky and tuned it until all R-hat values were 1.01 or below, with appropriate ESS and no BFMI warnings. I did have to set max_treedepth to 15, but did not change adapt_delta. I think the big reason these tight priors are needed is the missing data: I had to make sure there was enough information to regularize those parameters, or the sampler would wander off in any random direction.

I have both recovered simulated parameters from this model (see the smaller code chunk below) and done a posterior predictive check on every marginal in the large model. It took ~30 hours to fit with a 1316 x 54 dataset, 2000 warmup iterations, and 4000 sampling iterations across 4 chains on an AMD Threadripper. Given the amount of missing data this is certainly a p >> n problem. I would like to make it faster, although I am unsure whether any real speedup will be gained with either OpenCL or multi-threading (or how to prep it for multi-threading…). I hope these steps serve others in the future. I am indebted to everyone on this forum for their assistance.
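To make the prior-elicitation workflow concrete, here is a minimal sketch of roughly what one of those stand-alone univariate fits looks like, using the continuous outcome z1 from the condensed model below. The vague priors shown here are placeholders; the posterior means from a fit like this are what I used to center the tight priors in the full copula model.

data{
  int N;
  array[N] real x;
  array[N] real z1;
}
parameters{
  real<lower=0> z1_constant;
  real<lower=0> z1_exponent;
  real<lower=0> z1_offset;
  real<lower=0> z1_noise1;
  real<lower=0> z1_noise2;
}
model{
  // deliberately vague priors for the stand-alone marginal fit
  z1_constant ~ normal(0, 50);
  z1_exponent ~ normal(0, 1);
  z1_offset ~ normal(0, 50);
  z1_noise1 ~ normal(0, 10);
  z1_noise2 ~ normal(0, 1);
  for(n in 1:N){
    // power-law mean with sd growing linearly in x, as in the full model
    z1[n] ~ normal(z1_constant * x[n]^z1_exponent + z1_offset,
                   z1_noise1 * (1 + x[n] * z1_noise2));
  }
}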
Question
All of this said, I do have a question for the community. As I prepare to write this model up, I am looking at possible downstream applications. One of these is certainly the prediction of an unobserved x (age) given observed y (continuous and ordinal outcomes). Is there a way to do this in the generated quantities block using the generated_quantities function in cmdstanr, and if so, what is the best way to start? I have a held-out test sample with observed y values but unobserved x, and I want to use the information learned from the fit of the model to predict x. I post an abbreviated model below. I think this involves taking the CDF of the new observed y and then doing something with the quantile function, but admittedly I am lost as to the best way to start, particularly because of the multivariate nature of the response. I am also unsure what to do if I want to (or should?) include a prior on the unobserved x to make this prediction.
Point of clarity: currently I have p(y|\boldsymbol{\theta}, x) where x is observed. Essentially, I am looking for p(x|\boldsymbol{\theta}, y) \propto p(y|\boldsymbol{\theta}, x)\,p(x), where \boldsymbol{\theta} denotes the fitted model parameters, using the posterior draws of those parameters as the starting point for estimating x.
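For context on where my thinking currently is, below is a rough sketch (not something I trust yet) of one direction: in a standalone generated quantities program, evaluate the unnormalized log p(x|\boldsymbol{\theta}, y_{new}) on a grid of candidate x values for each posterior draw, then normalize over the grid. To keep it readable it uses only the continuous marginal z1 and ignores the ordinal margins, the copula correlation, and missingness. The names G, x_grid, z1_new, x_prior_mean, and x_prior_sd are hypothetical data that would have to be declared and passed in alongside the held-out outcomes.

generated quantities{
  // Hypothetical inputs for a standalone generated-quantities program:
  // G (number of grid points), x_grid (vector[G] of candidate x values),
  // z1_new (one held-out continuous outcome), and x_prior_mean / x_prior_sd
  // for an optional prior on x, all declared in its data block.
  vector[G] log_p_x; // unnormalized log p(x | theta, z1_new) over the grid
  vector[G] p_x;     // normalized over the grid, one set per posterior draw
  for(g in 1:G){
    real mu_new = z1_constant * x_grid[g]^z1_exponent + z1_offset;
    real sd_new = z1_noise1 * (1 + x_grid[g] * z1_noise2);
    log_p_x[g] = normal_lpdf(z1_new | mu_new, sd_new)
                 + normal_lpdf(x_grid[g] | x_prior_mean, x_prior_sd); // optional prior on x
  }
  p_x = softmax(log_p_x);
}

If the full multivariate likelihood (ordinal margins plus the copula) has to be inverted, I suspect x would instead need to be a parameter in a second Stan program that fixes \boldsymbol{\theta} at (or around) the posterior from this fit, which is part of what I am asking.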
functions{
// Gaussian Copula Log Probability Density
real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
int N = rows(U);
int J = cols(U);
matrix[J, J] Gammainv = chol2inv(L);
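// The expression below is the log Gaussian-copula density (constants dropped):
// -N * log|L| - 0.5 * tr[(Gamma^{-1} - I) U'U], with Gamma = L * L' and U the
// matrix of latent standard-normal scores.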
return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
}
// Prepare data for LPDF
real centered_gaussian_copula_cholesky_(array[,] matrix marginals, matrix L) {
// Extract dimensions of final outcome matrix
int N = rows(marginals[1][1]);
int J = rows(L);
matrix[N, J] U;
// Iterate through marginal arrays, concatenating the outcome matrices by column
// and aggregating the log-likelihoods (from continuous marginals) and jacobian
// adjustments (from discrete marginals)
real adj = 0;
int pos = 1;
for (m in 1 : size(marginals)) {
int curr_cols = cols(marginals[m][1]);
U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];
adj += sum(marginals[m][2]);
pos += curr_cols;
}
// Return the sum of the log-probability for copula outcomes and jacobian adjustments
return multi_normal_cholesky_copula_lpdf(U | L) + adj;
}
// Continuous Marginal Distribution (Normal)
array[] matrix normal_marginal(array[] real y, array[] real mu, array[] real sigma, int N) {
array[2] matrix[N, 1] rtn; // empty 2D array to return
// Initialise the jacobian adjustments; each entry is overwritten below with the marginal normal log-density
rtn[2] = rep_matrix(0, N, 1);
for (n in 1 : N) {
rtn[1][n, 1] = (y[n] - mu[n]) / sigma[n]; // center RV
rtn[2][n, 1] = normal_lpdf(y[n] | mu[n], sigma[n]); // "jacobian"
}
return rtn;
}
array[] matrix probit_marginal(array[] int y, array[] real mu_glm, array[] real u_raw, vector cutpoints) {
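// Ordinal (and missing) margins via uniform data augmentation: u_raw[n] is a
// nuisance uniform(0,1) parameter that places the centered latent standard-normal
// value inside the interval implied by the observed category, or anywhere on the
// real line when the observation is missing (coded as 99).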
int N = num_elements(mu_glm); // number of observations
array[2] matrix[N, 1] rtn; // empty 2D array to return
int C = num_elements(cutpoints) + 1; // total number of ordinal categories
for(n in 1:N){
if(y[n] == 99){ // missing data
rtn[1][n,1] = inv_Phi(u_raw[n]); // missing RV
rtn[2][n,1] = 0;
} else if(y[n] == 1){ // lowest bound
real bound = Phi((cutpoints[1] - mu_glm[n])); // data augmentation
rtn[1][n,1] = inv_Phi((bound*u_raw[n])); // latent RV
rtn[2][n,1] = log(bound); // jacobian
} else if (y[n] == C){ // highest bound
real bound = Phi((cutpoints[C - 1] - mu_glm[n])); // data augmentation
rtn[1][n,1] = inv_Phi(bound + (1-bound)*u_raw[n]); // latent RV
rtn[2][n,1] = log1m(bound); // jacobian
} else { // in between
real ub = Phi((cutpoints[y[n]] - mu_glm[n])); // data augmentation
real lb = Phi((cutpoints[y[n] - 1] - mu_glm[n])); // data augmentation
rtn[1][n,1] = inv_Phi((lb + (ub-lb)*u_raw[n])); // latent RV
rtn[2][n,1] = log(ub-lb); // jacobian
}
}
return rtn;
}
}
data{
int N; // number of observations
int M; // number of outcome dimensions (3 in this condensed model)
array[N] real x; // observed independent variable (age)
array[N] real z1; // continuous outcome
array[N] int z2; // ordinal outcome (99 = missing)
array[N] int z3; // ordinal outcome (99 = missing)
int C_z2; // number of categories for z2
int C_z3; // number of categories for z3
}
transformed data{
int thresh_z2 = C_z2 - 1;
int thresh_z3 = C_z3 - 1;
}
parameters{
cholesky_factor_corr[M] L;
real<lower=0> z1_constant;
real<lower=0> z1_exponent;
real<lower=0> z1_offset;
real<lower=0> z1_noise1;
real<lower=0> z1_noise2;
real<lower=0> z2_constant;
array[N] real<lower=0, upper=1> z2_u;
ordered[thresh_z2] z2_t_pars;
real<lower=0> z3_constant;
array[N] real<lower=0, upper=1> z3_u;
ordered[thresh_z3] z3_t_pars;
}
transformed parameters{
array[N] real z1_mean;
array[N] real z1_sd;
array[N] real z2_mean;
array[N] real z3_mean;
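// Marginal regressions on x: power-law mean with linearly increasing sd for the
// continuous outcome z1, and linear latent means for the ordinal outcomes z2, z3.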
for(n in 1:N){
z1_mean[n] = z1_constant*x[n]^z1_exponent + z1_offset;
z1_sd[n] = z1_noise1*(1+x[n]*z1_noise2);
z2_mean[n] = z2_constant*x[n];
z3_mean[n] = z3_constant*x[n];
}
}
model{
L ~ lkj_corr_cholesky(20.0);
z1_constant ~ normal(0,50);
z1_exponent ~ normal(0,1);
z1_offset ~ normal(0,50);
z1_noise1 ~ normal(0,10);
z1_noise2 ~ normal(0,1);
z2_constant ~ normal(0,1);
for(i in 1:size(z2_t_pars)){
z2_t_pars[i] ~ normal(i+1,1);
}
z3_constant ~ normal(0,1);
for(i in 1:size(z3_t_pars)){
z3_t_pars[i] ~ normal(i+1,1);
}
target += centered_gaussian_copula_cholesky_(
{normal_marginal(z1, z1_mean, z1_sd, N),
probit_marginal(z2, z2_mean, z2_u, z2_t_pars),
probit_marginal(z3, z3_mean, z3_u, z3_t_pars)}, L);
}
generated quantities{
corr_matrix[M] corr_mat = multiply_lower_tri_self_transpose(L);
}
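Any guidance on whether a grid-based generated quantities approach like the sketch above is sensible, or whether predicting x really requires treating it as a parameter in a second model, would be greatly appreciated. Thanks again for all of the help over the past year.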