Hi everyone,
Apologies beforehand if any of this comes across as a bit clumsy, this is my first post. I have a problem that’s quite prevalent yet mostly ignored in the field of biomarkers; data is quite often missing, yet not completely at random - it’s just below the assay levels of quantitation. Hence, we have information such that the missing values for each biomarker have a range of <lower = 0 upper = less than the assay LLOQ
>. The LLOQ is typically provided by the company who makes the assay, so we have the values. Essentially I would like to model the missing biomarker data as parameters in a simple factor analysis while using the prior information I have that the value is between zero and either 1) a provided vector of numbers representing the LLOQ for each biomarker, or possibly more simply, 2) lower than the minimum observed value for each biomarker in the data. There’s a great factor analysis discussion here with some of the community members (https://groups.google.com/g/stan-users/c/1jh0pdZgnDs ; thanks to Mike Lawrence, Bob Carpenter and GitHub user mbjoseph!!). From this discussion is the following code below which deals with missing data imputation for factor analysis. My issue is that I just don’t know how to code in the aforementioned missing data condition.
stan code for factor analysis from GitHub user mbjoseph: Factor analysis sandbox · GitHub
stan code
data {
int<lower = 0> m; // dimensions - number of biomarkers
int<lower = 0> k; // number of latent factors
int<lower = 0> n; // number of participants
int<lower = 0> n_obs; // number of observations
int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
vector[n_obs] y_obs;
int<lower=0> n_miss; // number of missing observations
int<lower = 1, upper = n> row_miss[n_miss];
int<lower = 1, upper = m> col_miss[n_miss];
}
transformed data {
int<lower = 1> p; // number of nonzero lower triangular factor loadings
vector[m] zeros;
zeros = rep_vector(0, m);
p = k * (m - k) + k * (k - 1) / 2;
}
parameters {
vector<lower = 0>[k] beta_diag;
vector[p] beta_lower_tri;
vector<lower = 0>[m] sigma_y; // residual error
real<lower = 0> sigma_L; // hyperparameter for loading matrix elements
vector [n_miss] y_miss;
}
transformed parameters {
matrix[m, k] L;
cov_matrix[m] Sigma;
matrix[m, m] L_Sigma;
vector[m] y[n];
// build n by m matrix y by combining observed and missing observations
for (i in 1:n_obs) {
y[row_obs[i]][col_obs[i]] = y_obs[i];
//y[row_obs[i], col_obs[i]] = y_obs[i];
}
for (i in 1:n_miss) {
y[row_miss[i]][col_miss[i]] = y_miss[i];
//y[row_miss[i], col_miss[i]] = y_miss[i];
}
{ // temporary scope to define loading matrix L
int idx = 0;
for (i in 1:m){
for (j in (i + 1):k){
L[i, j] = 0; // constrain upper tri elements to zero
}
}
for (j in 1:k){
L[j, j] = beta_diag[j];
for (i in (j + 1):m){
idx = idx + 1;
L[i, j] = beta_lower_tri[idx];
}
}
}
Sigma = multiply_lower_tri_self_transpose(L);
for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
L_Sigma = cholesky_decompose(Sigma);
}
model {
// priors
beta_lower_tri ~ normal(0, sigma_L);
sigma_L ~ normal(0, 2);
sigma_y ~ normal(0, 2);
// priors for diagonal entries (Leung and Drton 2016)
for (i in 1:k) {
target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
}
// likelihood
y ~ multi_normal_cholesky(zeros, L_Sigma);
}
Specifically, my issue is here:
Stan code
parameters {
vector<lower = 0>[k] beta_diag;
vector[p] beta_lower_tri;
vector<lower = 0>[m] sigma_y; // residual error
real<lower = 0> sigma_L; // hyperparameter for loading matrix elements
vector [n_miss] y_miss; // How can I modulate the upper limit of each variable vector here???
}
transformed parameters {
matrix[m, k] L;
cov_matrix[m] Sigma;
matrix[m, m] L_Sigma;
vector[m] y[n];
// build n by m matrix y by combining observed and missing observations
for (i in 1:n_obs) {
y[row_obs[i]][col_obs[i]] = y_obs[i];
//y[row_obs[i], col_obs[i]] = y_obs[i];
}
for (i in 1:n_miss) {
y[row_miss[i]][col_miss[i]] = y_miss[i];
//y[row_miss[i], col_miss[i]] = y_miss[i];
}
I should also mention that Rick Farouni has a fantastic explanation and code for factor analysis that has also been quite helpful in the process thus far: Bayesian Factor Analysis.
Thank you in advance. Any help/direction here would be amazing!
Alex.