# Imputing non-MCAR missing data in a latent factor model

Hi everyone,

Apologies in advance if any of this comes across as a bit clumsy; this is my first post. I have a problem that’s quite prevalent yet mostly ignored in the field of biomarkers: data are often missing, but not completely at random - the values are simply below the assay’s lower limit of quantitation (LLOQ). We therefore know that each missing value for a given biomarker lies in the range <lower = 0, upper = the assay LLOQ>. The LLOQ is typically provided by the assay manufacturer, so we have those values. Essentially, I would like to model the missing biomarker data as parameters in a simple factor analysis while using the prior information that each value is between zero and either 1) a provided vector of numbers representing the LLOQ for each biomarker, or, possibly more simply, 2) the minimum observed value for that biomarker in the data. There’s a great factor analysis discussion here with some of the community members (https://groups.google.com/g/stan-users/c/1jh0pdZgnDs ; thanks to Mike Lawrence, Bob Carpenter and GitHub user mbjoseph!!). That discussion produced the code below, which handles missing data imputation for factor analysis. My issue is that I just don’t know how to code in the aforementioned missing data constraint.
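For concreteness, here is a small sketch (Python/NumPy, with made-up numbers) of how the observed/missing index vectors that the Stan program below expects could be assembled from a participants-by-biomarkers matrix, using NaN to mark a below-LLOQ value:

```python
import numpy as np

# Toy example (hypothetical numbers): n = 4 participants, m = 3 biomarkers.
# NaN marks a value below the assay LLOQ (left-censored), not generically missing.
Y = np.array([
    [0.8, np.nan, 1.2],
    [0.5, 0.9, np.nan],
    [np.nan, 1.1, 0.7],
    [0.6, 0.4, 1.0],
])

miss = np.isnan(Y)
row_obs, col_obs = np.where(~miss)   # indices of observed entries
row_miss, col_miss = np.where(miss)  # indices of censored entries

# Inputs for the Stan program (Stan indexes from 1).
stan_data = {
    "n": Y.shape[0], "m": Y.shape[1], "k": 2,
    "n_obs": len(row_obs), "n_miss": len(row_miss),
    "row_obs": row_obs + 1, "col_obs": col_obs + 1,
    "y_obs": Y[~miss],
    "row_miss": row_miss + 1, "col_miss": col_miss + 1,
}
```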

Stan code for factor analysis from GitHub user mbjoseph (Factor analysis sandbox · GitHub):

```stan
data {
  int<lower = 0> m;     // dimensions - number of biomarkers
  int<lower = 0> k;     // number of latent factors
  int<lower = 0> n;     // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
  int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
  vector[n_obs] y_obs;

  int<lower = 0> n_miss; // number of missing observations
  int<lower = 1, upper = n> row_miss[n_miss];
  int<lower = 1, upper = m> col_miss[n_miss];
}

transformed data {
  int<lower = 1> p; // number of nonzero lower triangular factor loadings
  vector[m] zeros;

  zeros = rep_vector(0, m);
  p = k * (m - k) + k * (k - 1) / 2;
}

parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector[n_miss] y_miss;
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss[i];
  }

  { // temporary scope to define loading matrix L
    int idx = 0;
    for (i in 1:m) {
      for (j in (i + 1):k) {
        L[i, j] = 0; // constrain upper tri elements to zero
      }
    }
    for (j in 1:k) {
      L[j, j] = beta_diag[j];
      for (i in (j + 1):m) {
        idx = idx + 1;
        L[i, j] = beta_lower_tri[idx];
      }
    }
  }

  Sigma = multiply_lower_tri_self_transpose(L);
  for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
  L_Sigma = cholesky_decompose(Sigma);
}

model {
  // priors
  beta_lower_tri ~ normal(0, sigma_L);
  sigma_L ~ normal(0, 2);
  sigma_y ~ normal(0, 2);
  // priors for diagonal entries (Leung and Drton 2016)
  for (i in 1:k) {
    target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
  }

  // likelihood
  y ~ multi_normal_cholesky(zeros, L_Sigma);
}
```



Specifically, my issue is here:

```stan
parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector[n_miss] y_miss;        // How can I modulate the upper limit of each element here???
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss[i];
  }
```



I should also mention that Rick Farouni has a fantastic explanation of, and code for, factor analysis that have also been quite helpful in the process thus far: Bayesian Factor Analysis.

Thank you in advance. Any help/direction here would be amazing!

Alex.

I believe you can pass a vector as the upper/lower bounds of a parameter. So something like this should work.

```stan
data {
  vector<lower = 0>[n_miss] upper_bound;
}
parameters {
  vector<lower = 0, upper = upper_bound>[n_miss] y_miss;
}
```


Thanks for the response! I’ve tried to pass a vector to the upper bound but got the error “Expression denoting real required; found type = vector.” If I switch the declaration to `real`, the error just reverses and asks for a vector. This seems to be confirmed by the Stan User’s Guide: “Stan only allows a single lower and upper bound to be declared in the constraints for a container data type.” However, this did lead me to a section in the guide called “Varying Upper and Lower Bounds”, which has the following code snippet.

```stan
data {
  int N;
  vector[N] L; // lower bounds
  vector[N] U; // upper bounds
  ...
parameters {
  vector<lower = 0, upper = 1>[N] alpha_raw;
  ...
transformed parameters {
  vector[N] alpha = L + (U - L) .* alpha_raw;
  ...
```


The explanation that goes with it states:
“The expression U - L is multiplied by alpha_raw elementwise to produce a vector of variables in (0, U - L); then adding L results in a variable ranging over (L, U). In this case it is important that L and U are constants, otherwise a Jacobian would be required when multiplying by U - L.”
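A quick numeric check of that transform (a Python/NumPy sketch with made-up bounds) shows each element lands strictly inside its own (L, U) interval:

```python
import numpy as np

# Per-element bounds (hypothetical values) and a raw draw in (0, 1),
# as Stan would sample for a <lower = 0, upper = 1> parameter.
L = np.array([-2.0, -1.5])
U = np.array([-0.5, -1.0])
alpha_raw = np.array([0.25, 0.8])

# The manual's transform: shift and scale into (L, U) elementwise.
alpha = L + (U - L) * alpha_raw
# element 0: -2.0 + 1.5 * 0.25 = -1.625
# element 1: -1.5 + 0.5 * 0.8  = -1.1
```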

I’ve tried to implement this in my transformed parameters block, with the caveat that each missing value has an upper and a lower bound (the lower bound corresponds to 0 on the raw scale for any particular biomarker, but in z-score space that number can vary) determined by the column (biomarker) the missing value comes from. That is, if there are 5 biomarkers, there would be a 5-element vector of lower bounds and a 5-element vector of upper bounds.
The relevant parts of the code I tried are here:

```stan
data {
  ...
  vector[m] Up; // upper bound for missing values, one per biomarker
  vector[m] Lo; // lower bound for missing values, one per biomarker
}
transformed parameters {
  vector[m] y[n];

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    // map each missing value into (Lo, Up) for its biomarker
    y[row_miss[i]][col_miss[i]] = Lo[col_miss[i]]
                                  + (Up[col_miss[i]] - Lo[col_miss[i]]) * y_miss[i];
  }
```



However, when I try to implement this, the estimates for the missing data are way too high: most estimates are between 1 and 4 when in fact they should be somewhere between -2 and -1. I’m not sure what I can try next.

I apologize for the incorrect guidance! It seemed to be working for me, but I’ll double-check. I’m using the most recent version of cmdstanr.

Edit: I do think this was implemented in a later version. See here.


Thanks for getting back! I apologize for the delay on my end; since it was my first post, the forum chatbot withheld any further posts for a couple of days while my account was verified. You are definitely correct: the updated manual suggests that you can pass a vector to the lower and upper bound arguments. I have no idea why this isn’t working for me - I uninstalled and reinstalled the latest versions of rstan (2.21) and cmdstanr (2.26.1). However, I think I figured out why ‘Up’ and ‘Lo’ were not working - they should be vectors of length equal to the total number of missing values, which in my case is ‘n_miss’; ‘m’ is just the number of columns. I had clumsily tried to get around that in the transformed parameters block by indexing ‘Lo’ and ‘Up’ with ‘col_miss’. A simpler version is just to create two vectors ‘Up’ and ‘Lo’ that hold the bounds for each individual missing value (length ‘n_miss’). I’ve done that here:

```stan
data {
  int<lower = 0> m;     // dimensions - number of biomarkers
  int<lower = 0> k;     // number of latent factors
  int<lower = 0> n;     // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
  int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
  vector[n_obs] y_obs;
  int<lower = 0> n_miss; // number of missing observations
  int<lower = 1, upper = n> row_miss[n_miss];
  int<lower = 1, upper = m> col_miss[n_miss];
  vector[n_miss] Up; // upper bound for each missing value
  vector[n_miss] Lo; // lower bound for each missing value
}
```
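As a sanity check on building those inputs, the per-missing-value ‘Up’/‘Lo’ vectors can be expanded from per-biomarker bounds using ‘col_miss’ (a Python/NumPy sketch with hypothetical bound values):

```python
import numpy as np

# Hypothetical per-biomarker bounds in z-score space (m = 3 biomarkers).
lo_biomarker = np.array([-2.1, -1.8, -2.4])
up_biomarker = np.array([-1.2, -0.9, -1.5])

# Column index (1-based, as passed to Stan) of each missing value (n_miss = 4).
col_miss = np.array([2, 3, 1, 3])

# Expand to one bound per missing value, aligned with y_miss.
Lo = lo_biomarker[col_miss - 1]
Up = up_biomarker[col_miss - 1]
```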


It then becomes easier to create a new transformed parameter ‘y_miss2’, as suggested in the manual, which in turn makes it easier to build the transformed parameter ‘y’:

```stan
parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector[n_miss] y_miss;
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];
  vector[n_miss] y_miss2 = Lo + (Up - Lo) .* y_miss;

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss2[i];
  }
```



The problem now is that it does an absolutely horrible job of imputation (a correlation of about 3% between the true and imputed values). I wonder if this is because factor analysis isn’t good at imputation unless the values are missing completely at random? Once again, thanks a ton for your help!

Alex.

A few observations.

To transform correctly, you still need to constrain these to (0, 1).

If I understand your data, the minimum value is 0, yet you’re modelling the means as 0. To put it another way, it looks like you have a model for the covariances but not the means, and you are in fact estimating the covariances conditional on nearly impossible values. This could be causing some of your problems. Sampling the mean parameters might help.

```stan
parameters {
  ...
  vector[m] mu;
}
model {
  ...
  y ~ multi_normal_cholesky(mu, L_Sigma);
}
```


I am only somewhat familiar with the Cholesky-parameterized distributions, so double-check that I’m not misunderstanding how the mean works here. (Edit: \mu is indeed the mean; see here.)


Very good point, and I think you’re correct - I’m just getting into modelling latent variables myself, so the Cholesky parameterization is also fairly new to me. I’ve looked back at Bishop’s Pattern Recognition and Machine Learning, as well as Kevin Murphy’s Probabilistic Machine Learning, and essentially mu should be zero in order to properly model the covariance structure (it’s also described well by Ben Lambert here: Factor analysis assumptions - YouTube).

I was getting slightly confused by modelling individual observations as parameters, because although the raw biomarker value is between zero and some number below the LLOQ, after a z-score transform my lower bound of ‘zero’ becomes a negative number. This is why I didn’t constrain the original y vector to (0, 1). However, you’re right: the model is not designed to estimate an individual y observation quite like a standard linear model; it’s designed to model the covariance between the observed and latent variables, and will thus estimate values constrained to mu = 0 and a covariance matrix.

My initial thought is that it may not be possible to constrain the model parameters based on the limits of their raw scores in this model. I was just initially optimistic because the model by mbjoseph (see above post) was able to do a decent job of recovering raw values in a toy example (albeit with values missing at random). I’ll try sampling the mu parameters to get a better feel for the model, but I’m not sure how else it can be constrained for my particular use case. Once again, thanks - this has been very informative and I appreciate the time you’ve taken.
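To illustrate the z-score point above (a Python/NumPy sketch with made-up values): standardizing with the observed mean and SD maps the raw censoring interval (0, LLOQ) to a pair of negative bounds, which is why the bounds vary per biomarker in z-score space.

```python
import numpy as np

# Observed (non-censored) raw values for one hypothetical biomarker.
raw = np.array([0.9, 1.4, 0.7, 1.1])
lloq = 0.5  # assay LLOQ on the raw scale

mu, sd = raw.mean(), raw.std(ddof=1)
lo_z = (0.0 - mu) / sd   # raw 0 becomes a negative z-score lower bound
up_z = (lloq - mu) / sd  # raw LLOQ becomes the z-score upper bound
```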


Stan code for factor analysis with varying bounds on the imputed parameter estimates:

```stan
data {
  int<lower = 0> m;     // dimensions - number of biomarkers
  int<lower = 0> k;     // number of latent factors
  int<lower = 0> n;     // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = n> row_obs[n_obs]; // row indices for observations
  int<lower = 1, upper = m> col_obs[n_obs]; // column indices for observations
  vector[n_obs] y_obs;
  int<lower = 0> n_miss; // number of missing observations
  int<lower = 1, upper = n> row_miss[n_miss];
  int<lower = 1, upper = m> col_miss[n_miss];
  vector[n_miss] Lo; // lower bound for each missing value
  vector[n_miss] Up; // upper bound for each missing value
}

transformed data {
  int<lower = 1> p; // number of nonzero lower triangular factor loadings
  vector[m] zeros;
  zeros = rep_vector(0, m);
  p = k * (m - k) + k * (k - 1) / 2;
}

parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L;      // hyperparameter for loading matrix elements
  vector<lower = 0, upper = 1>[n_miss] y_miss; // Have to do this!
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  vector[m] y[n];
  // Because you can't pass a vector of upper and lower bounds in this
  // version of rstan, transform the (0, 1) missing-value parameter into
  // (Lo, Up) instead, per the Stan manual:
  vector[n_miss] y_miss2 = Lo + (Up - Lo) .* y_miss;

  // build n by m array y by combining observed and missing observations
  for (i in 1:n_obs) {
    y[row_obs[i]][col_obs[i]] = y_obs[i];
  }
  for (i in 1:n_miss) {
    y[row_miss[i]][col_miss[i]] = y_miss2[i];
  }

  { // temporary scope to define loading matrix L
    int idx = 0;
    for (i in 1:m) {
      for (j in (i + 1):k) {
        L[i, j] = 0; // constrain upper tri elements to zero
      }
    }
    for (j in 1:k) {
      L[j, j] = beta_diag[j];
      for (i in (j + 1):m) {
        idx = idx + 1;
        L[i, j] = beta_lower_tri[idx];
      }
    }
  }

  Sigma = multiply_lower_tri_self_transpose(L);
  for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
  L_Sigma = cholesky_decompose(Sigma);
}

model {
  // priors
  beta_lower_tri ~ normal(0, sigma_L);
  sigma_L ~ normal(0, 2);
  sigma_y ~ normal(0, 2);
  // priors for diagonal entries (Leung and Drton 2016)
  for (i in 1:k) {
    target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
  }

  // likelihood
  y ~ multi_normal_cholesky(zeros, L_Sigma);
}
```