# Imputing non-MCAR missing data in a latent factor model

Hi everyone,

Apologies in advance if any of this comes across as a bit clumsy; this is my first post. I have a problem that’s quite prevalent yet mostly ignored in the field of biomarkers: data are often missing, but not completely at random - the values are simply below the assay’s lower limit of quantitation (LLOQ). We therefore know that each missing value for a given biomarker lies in the range <lower = 0, upper = the assay LLOQ>. The LLOQ is typically provided by the assay manufacturer, so we have those values. Essentially, I would like to model the missing biomarker data as parameters in a simple factor analysis while using the prior information that each value is between zero and either 1) a provided vector of numbers representing the LLOQ for each biomarker, or, possibly more simply, 2) the minimum observed value for that biomarker in the data. There’s a great factor analysis discussion here with some of the community members (https://groups.google.com/g/stan-users/c/1jh0pdZgnDs ; thanks to Mike Lawrence, Bob Carpenter and GitHub user mbjoseph!!). That discussion produced the code below, which handles missing data imputation for factor analysis. My issue is that I just don’t know how to code in the aforementioned missing data constraint.
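For concreteness, here is a small sketch (Python/NumPy, with made-up numbers) of how the observed/missing index vectors that the Stan program below expects could be assembled from a participants-by-biomarkers matrix, using NaN to mark a below-LLOQ value:

```python
import numpy as np

# Toy example (hypothetical numbers): n = 4 participants, m = 3 biomarkers.
# NaN marks a value below the assay LLOQ (left-censored), not generically missing.
Y = np.array([
    [0.8, np.nan, 1.2],
    [0.5, 0.9, np.nan],
    [np.nan, 1.1, 0.7],
    [0.6, 0.4, 1.0],
])

miss = np.isnan(Y)
row_obs, col_obs = np.where(~miss)   # indices of observed entries
row_miss, col_miss = np.where(miss)  # indices of censored entries

# Inputs for the Stan program (Stan indexes from 1).
stan_data = {
    "n": Y.shape[0], "m": Y.shape[1], "k": 2,
    "n_obs": len(row_obs), "n_miss": len(row_miss),
    "row_obs": row_obs + 1, "col_obs": col_obs + 1,
    "y_obs": Y[~miss],
    "row_miss": row_miss + 1, "col_miss": col_miss + 1,
}
```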

Stan code for factor analysis from GitHub user mbjoseph (Factor analysis sandbox · GitHub):

```stan
data {
  int<lower = 0> m;     // dimensions - number of biomarkers
  int<lower = 0> k;     // number of latent factors
  int<lower = 0> n;     // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
  int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
  vector[n_obs] y_obs;

  int<lower = 0> n_miss; // number of missing observations
  int<lower = 1, upper = n> row_miss[n_miss];
  int<lower = 1, upper = m> col_miss[n_miss];
}

transformed data {
  int<lower = 1> p; // number of nonzero lower triangular factor loadings
  vector[m] zeros;

  zeros = rep_vector(0, m);
  p = k * (m - k) + k * (k - 1) / 2;
}

parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector[n_miss] y_miss;
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss[i];
  }

  { // temporary scope to define loading matrix L
    int idx = 0;
    for (i in 1:m) {
      for (j in (i + 1):k) {
        L[i, j] = 0; // constrain upper tri elements to zero
      }
    }
    for (j in 1:k) {
      L[j, j] = beta_diag[j];
      for (i in (j + 1):m) {
        idx = idx + 1;
        L[i, j] = beta_lower_tri[idx];
      }
    }
  }

  Sigma = multiply_lower_tri_self_transpose(L);
  for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
  L_Sigma = cholesky_decompose(Sigma);
}

model {
  // priors
  beta_lower_tri ~ normal(0, sigma_L);
  sigma_L ~ normal(0, 2);
  sigma_y ~ normal(0, 2);
  // priors for diagonal entries (Leung and Drton 2016)
  for (i in 1:k) {
    target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
  }

  // likelihood
  y ~ multi_normal_cholesky(zeros, L_Sigma);
}
```



Specifically, my issue is here:

```stan
parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector[n_miss] y_miss;        // How can I modulate the upper limit of each element here???
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss[i];
  }
```



I should also mention that Rick Farouni has a fantastic explanation of, and code for, factor analysis that have also been quite helpful in the process thus far: Bayesian Factor Analysis.

Thank you in advance. Any help/direction here would be amazing!

Alex.

I believe you can pass a vector as the upper/lower bounds of a parameter. So something like this should work.

```stan
data {
  vector<lower = 0>[n_miss] upper_bound;
}
parameters {
  vector<lower = 0, upper = upper_bound>[n_miss] y_miss;
}
```


Thanks for the response! I’ve tried to pass a vector to the upper bound but got the error “Expression denoting real required; found type = vector.” If I switch the declaration to `real`, the error just reverses and asks for a vector. This seems to be confirmed by the Stan User’s Guide: “Stan only allows a single lower and upper bound to be declared in the constraints for a container data type.” However, this did lead me to a section in the guide called “Varying Upper and Lower Bounds”, which has the following code snippet.

```stan
data {
  int N;
  vector[N] L; // lower bounds
  vector[N] U; // upper bounds
  ...
parameters {
  vector<lower = 0, upper = 1>[N] alpha_raw;
  ...
transformed parameters {
  vector[N] alpha = L + (U - L) .* alpha_raw;
  ...
```


The explanation that goes with it states:
“The expression U - L is multiplied by alpha_raw elementwise to produce a vector of variables in (0, U - L); then adding L results in a variable ranging over (L, U). In this case it is important that L and U are constants, otherwise a Jacobian would be required when multiplying by U - L.”
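A quick numeric check of that transform (a Python/NumPy sketch with made-up bounds) shows each element lands strictly inside its own (L, U) interval:

```python
import numpy as np

# Per-element bounds (hypothetical values) and a raw draw in (0, 1),
# as Stan would sample for a <lower = 0, upper = 1> parameter.
L = np.array([-2.0, -1.5])
U = np.array([-0.5, -1.0])
alpha_raw = np.array([0.25, 0.8])

# The manual's transform: shift and scale into (L, U) elementwise.
alpha = L + (U - L) * alpha_raw
# element 0: -2.0 + 1.5 * 0.25 = -1.625
# element 1: -1.5 + 0.5 * 0.8  = -1.1
```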

I’ve tried to implement this in my transformed parameters block, with the caveat that each missing value has an upper and a lower bound (the lower bound corresponds to 0 on the raw scale for any particular biomarker, but in z-score space that number can vary) determined by the column (biomarker) the missing value comes from. That is, if there are 5 biomarkers, there would be a 5-element vector of lower bounds and a 5-element vector of upper bounds.
The relevant parts of the code I tried are here:

```stan
data {
  ...
  vector[m] Up; // upper bound for missing values, one per biomarker
  vector[m] Lo; // lower bound for missing values, one per biomarker
}
transformed parameters {
  vector[m] y[n];

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    // map each missing value into (Lo, Up) for its biomarker
    y[row_miss[i]][col_miss[i]] = Lo[col_miss[i]]
                                  + (Up[col_miss[i]] - Lo[col_miss[i]]) * y_miss[i];
  }
```



However, when I try to implement this, the estimates for the missing data are way too high: most estimates are between 1 and 4 when in fact they should be somewhere between -2 and -1. I’m not sure what I can try next.

I apologize for the incorrect guidance! It seemed to be working for me, but I’ll double-check. I’m using the most recent version of cmdstanr.

Edit: I do think this was implemented in a later version. See here.


Thanks for getting back! I apologize for the delay on my end; since it was my first post, the forum chatbot withheld any further posts for a couple of days while my account was verified. You are definitely correct: the updated manual suggests that you can pass a vector to the lower and upper bound arguments. I have no idea why this isn’t working for me - I uninstalled and reinstalled the latest versions of rstan (2.21) and cmdstanr (2.26.1). However, I think I figured out why ‘Up’ and ‘Lo’ were not working - they should be vectors of length equal to the total number of missing values, which in my case is ‘n_miss’; ‘m’ is just the number of columns. I had clumsily tried to get around that in the transformed parameters block by indexing ‘Lo’ and ‘Up’ with ‘col_miss’. A simpler version is just to create two vectors ‘Up’ and ‘Lo’ that hold the bounds for each individual missing value (length ‘n_miss’). I’ve done that here:

```stan
data {
  int<lower = 0> m;     // dimensions - number of biomarkers
  int<lower = 0> k;     // number of latent factors
  int<lower = 0> n;     // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
  int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
  vector[n_obs] y_obs;
  int<lower = 0> n_miss; // number of missing observations
  int<lower = 1, upper = n> row_miss[n_miss];
  int<lower = 1, upper = m> col_miss[n_miss];
  vector[n_miss] Up; // upper bound for each missing value
  vector[n_miss] Lo; // lower bound for each missing value
}
```
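As a sanity check on building those inputs, the per-missing-value ‘Up’/‘Lo’ vectors can be expanded from per-biomarker bounds using ‘col_miss’ (a Python/NumPy sketch with hypothetical bound values):

```python
import numpy as np

# Hypothetical per-biomarker bounds in z-score space (m = 3 biomarkers).
lo_biomarker = np.array([-2.1, -1.8, -2.4])
up_biomarker = np.array([-1.2, -0.9, -1.5])

# Column index (1-based, as passed to Stan) of each missing value (n_miss = 4).
col_miss = np.array([2, 3, 1, 3])

# Expand to one bound per missing value, aligned with y_miss.
Lo = lo_biomarker[col_miss - 1]
Up = up_biomarker[col_miss - 1]
```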


It then becomes easier to create a new transformed parameter ‘y_miss2’, as suggested in the manual, which in turn makes it easier to build the transformed parameter ‘y’:

```stan
parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector[n_miss] y_miss;
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];
  vector[n_miss] y_miss2 = Lo + (Up - Lo) .* y_miss;

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss2[i];
  }
```



The problem now is that it does an absolutely horrible job of imputation (a correlation of about 3% between the true and imputed values). I wonder if this is because factor analysis isn’t good at imputation unless the values are missing completely at random? Once again, thanks a ton for your help!

Alex.

A few observations.

To transform correctly, you still need to constrain these to (0, 1).

If I understand your data, the minimum value is 0, yet you’re modelling the means as 0. To put it another way, it looks like you have a model for the covariances but not the means, and you are in fact estimating the covariances conditional on nearly impossible values. This could be causing some of your problems. Sampling the mean parameters might help.

```stan
parameters {
  ...
  vector[m] mu;
}
model {
  ...
  y ~ multi_normal_cholesky(mu, L_Sigma);
}
```


I am only somewhat familiar with the Cholesky-parameterized distributions, so double-check that I’m not misunderstanding how the mean works here. (Edit: \mu is indeed the mean; see here.)


Very good point, and I think you’re correct - I’m just getting into modelling latent variables myself, so the Cholesky parameterization is also fairly new to me. I’ve looked back at Bishop’s Pattern Recognition and Machine Learning, as well as Kevin Murphy’s Probabilistic Machine Learning, and essentially mu should be zero in order to properly model the covariance structure (it’s also described well by Ben Lambert here: Factor analysis assumptions - YouTube).

I was getting slightly confused by modelling individual observations as parameters, because although the raw biomarker value is between zero and some number below the LLOQ, after a z-score transform my lower bound of ‘zero’ becomes a negative number. This is why I didn’t constrain the original y vector to (0, 1). However, you’re right: the model is not designed to estimate an individual y observation quite like a standard linear model; it’s designed to model the covariance between the observed and latent variables, and will thus estimate values constrained to mu = 0 and a covariance matrix.

My initial thought is that it may not be possible to constrain the model parameters based on the limits of their raw scores in this model. I was just initially optimistic because the model by mbjoseph (see above post) was able to do a decent job of recovering raw values in a toy example (albeit with values missing at random). I’ll try sampling the mu parameters to get a better feel for the model, but I’m not sure how else it can be constrained for my particular use case. Once again, thanks - this has been very informative and I appreciate the time you’ve taken.
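To illustrate the z-score point above (a Python/NumPy sketch with made-up values): standardizing with the observed mean and SD maps the raw censoring interval (0, LLOQ) to a pair of negative bounds, which is why the bounds vary per biomarker in z-score space.

```python
import numpy as np

# Observed (non-censored) raw values for one hypothetical biomarker.
raw = np.array([0.9, 1.4, 0.7, 1.1])
lloq = 0.5  # assay LLOQ on the raw scale

mu, sd = raw.mean(), raw.std(ddof=1)
lo_z = (0.0 - mu) / sd   # raw 0 becomes a negative z-score lower bound
up_z = (lloq - mu) / sd  # raw LLOQ becomes the z-score upper bound
```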


Stan code for factor analysis with varying bounds on the imputed parameter estimates:

```stan
data {
  int<lower = 0> m;     // dimensions - number of biomarkers
  int<lower = 0> k;     // number of latent factors
  int<lower = 0> n;     // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
  int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
  vector[n_obs] y_obs;
  int<lower = 0> n_miss; // number of missing observations
  int<lower = 1, upper = n> row_miss[n_miss];
  int<lower = 1, upper = m> col_miss[n_miss];
  vector[n_miss] Lo; // lower bound for each missing value
  vector[n_miss] Up; // upper bound for each missing value
}

transformed data {
  int<lower = 1> p; // number of nonzero lower triangular factor loadings
  vector[m] zeros;
  zeros = rep_vector(0, m);
  p = k * (m - k) + k * (k - 1) / 2;
}

parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector<lower = 0, upper = 1>[n_miss] y_miss; // Have to do this!
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];
  // Because you can't pass a vector of upper and lower bounds in this
  // version of rstan, transform the (0, 1) missing-value parameter into
  // (Lo, Up) instead, per the Stan manual:
  vector[n_miss] y_miss2 = Lo + (Up - Lo) .* y_miss;

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss2[i];
  }

  { // temporary scope to define loading matrix L
    int idx = 0;
    for (i in 1:m) {
      for (j in (i + 1):k) {
        L[i, j] = 0; // constrain upper tri elements to zero
      }
    }
    for (j in 1:k) {
      L[j, j] = beta_diag[j];
      for (i in (j + 1):m) {
        idx = idx + 1;
        L[i, j] = beta_lower_tri[idx];
      }
    }
  }

  Sigma = multiply_lower_tri_self_transpose(L);
  for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
  L_Sigma = cholesky_decompose(Sigma);
}

model {
  // priors
  beta_lower_tri ~ normal(0, sigma_L);
  sigma_L ~ normal(0, 2);
  sigma_y ~ normal(0, 2);
  // priors for diagonal entries (Leung and Drton 2016)
  for (i in 1:k) {
    target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
  }

  // likelihood
  y ~ multi_normal_cholesky(zeros, L_Sigma);
}
```