So I’ve made some poor life choices and landed myself with multiple impossible tasks at once. I’m not overly familiar with SEM techniques, but I have moderate exposure to Bayesian methods and a pile of other math and physics, and I’ve been asked to help a startup with a ‘machine learning’ model that a data scientist whipped up for them a year ago. Little did I know that when I opened the hood it would be a doozy of a challenge.
Now, rather than bumble around here bothering people for free help, I’d like to post what I can (anonymized, unfortunately, to protect IP) along with what I’ve managed to decipher so far, so people can tell me just how wrong I am. I’ll also mention that if someone wants the opportunity to run me through a focused crash course on SEM, I’ll happily see if I can get the CEO to front a budget. I will do my best to demonstrate that I’m putting in effort, and I’m asking for help with as much humility as I can muster.
The core goal is to fit a model to a range of manifest observations so that we can produce a new “metric” that evaluates the goodness of a sample. That is, we would take the observations from a new sample and, using the fitted parameters, calculate the value of a latent variable in the model to use as a relative metric.
The model I have inherited is based on Lee and Song (2003), “Bayesian analysis of structural equation models with dichotomous variables”, Statist. Med., DOI 10.1002/sim.1544. In our case only some of the data is ordinal, with the rest being a mix of nominal and ratio data. Once I have file-attachment privileges I will attach a sample data set, which was standardized using R’s scale function.
One immediate flaw I will point out before anyone even gets to the Stan code: the original model was trained on 9 samples (!?!?) against 30 manifest variables. I know this is wildly insufficient for a good fit, but the startup’s plan was always to collect more data and retrain. One of the reasons I was called in was to check that they could re-run the model fit, which they can’t, which is why I’m here: we need a working fit to justify collecting more sample data.
The model code I was provided with is as follows (comments are mine):
// declare the data to be used for the model
data{
int<lower=1> I; // number of samples
int run_estimation; // switch for the estimation, appears to be set to 1 in test R code
matrix[I, 30] y; // matrix with all manifest observations
int<lower=0> matrix_miss; // length of missing observations vector
}
// define the unknowns to be estimated, including any restrictions on their values
parameters{
matrix[I, 10] xi; // latent variables matrix
vector<lower=0>[10] tau; // scale (SD) of each latent variable (xi)
corr_matrix[10] Omega; // correlation between latent variables (xi)
positive_ordered[10] gamma; // coefficients of latent variables (xi) to prediction (eta), fixed order I believe to aid model identification
vector[I] z; // residual (disturbance) term of the final prediction (eta)
// real<lower=0> sigma_delta; // --appears to have been replaced
vector[19] lambda; // values to form coefficient matrix from manifest observed variables (y) to latent variables (xi)
vector<lower=0>[30] scale_eps; // residual SDs (errors) of the manifest observed variables
real lambda_out; // --appears unused
vector[matrix_miss] missing_values; // vector of missing manifest variable values to fill in y
}
// transformations of the parameters and data declared above
transformed parameters{
matrix[I, 30] y_hat; // actual manifest observations matrix used by model
matrix[11, 30] Lambda; // actual matrix of coefficients between (y) and (xi) formed from (lambda)
vector[I] eta; // final prediction for each sample
eta = xi * gamma + z; // definition for eta
// Populate matrix Lambda from vector lambda
Lambda = rep_matrix(0.0, 11, 30);
Lambda[1,1] = 1.0;
Lambda[2,2] = 1.0;
Lambda[3,3] = 1.0;
Lambda[4,4] = 1.0;
Lambda[5,5] = 1.0;
Lambda[5,6:10] = lambda[1:5]';
Lambda[6,11] = 1.0;
Lambda[6,12:15] = lambda[6:9]';
Lambda[7,16] = 1.0;
Lambda[7,17:20] = lambda[10:13]';
Lambda[8,21] = 1.0;
Lambda[9,22] = 1.0;
Lambda[10,23] = 1.0;
Lambda[10,24:29] = lambda[14:19]';
Lambda[11,30] = 1.0;
// Populate y_hat from y and missing_values
{
int count = 0;
for(i in 1:I){
for(j in 1:30){
if (y[i,j] == -9.0){
count += 1;
y_hat[i,j] = missing_values[count];
} else {
y_hat[i,j] = y[i,j];
}
}
}
}
}
// where the full probability model is defined
model{
tau ~ normal(0,1);
Omega ~ lkj_corr(10);
z ~ normal(0,1);
gamma ~ normal(0,1);
// sigma_delta ~ normal(.5, .3);
lambda ~ normal(0,1);
scale_eps ~ normal(1,.1); // I believe because scale() standardizes the inputs to unit variance, so residual SDs should sit near 1
lambda_out ~ normal(0,1);
// Define latent variables with respect to correlation matrix and variance
for(i in 1:I){
xi[i] ~ multi_normal(rep_vector(0.0, 10), quad_form_diag(Omega, tau));
}
// The apparent only use of the run_estimation flag
if(run_estimation == 1){
for(i in 1:I){
y_hat[i]' ~ normal(Lambda' * append_row(rep_vector(eta[i], 1), xi[i]'), scale_eps);
}
}
}
// generates a range of outputs from the model
generated quantities{
matrix[I, 30] y_sim; // simulated possible values of the y_hat manifest observations based on the model fit
for(i in 1:I){
y_sim[i] = multi_normal_rng(Lambda' * append_row(rep_vector(eta[i], 1), xi[i]'), diag_matrix(square(scale_eps)))'; // covariance = diag(scale_eps^2)
}
}
What I have gathered from some reading, online tutorials, and the model code itself is that, per sample i, the above model is:
\eta_i = \xi_i \cdot \gamma + \zeta_i
where
\xi_i = \Lambda \cdot \widehat{y}_i
assuming that we exclude the manifest \widehat{y} observations that are caused by the desired latent variable \eta we are using as our metric.
So I expect that I can take a new observation \widehat{y}_{new}, set to zero the manifest variable that is caused by \eta, multiply by the fitted \Lambda extracted using get_posterior_mean(model.fit, pars = "Lambda") (retaining only the mean.all.chains column), and then multiply by \gamma, extracted from the fit in the same way, to get my new \eta_{new} for the sample. A minimal sketch of this calculation follows.
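In R, the plan I just described would look something like this (a minimal sketch, assuming an rstan fit object named model.fit; y_new is a hypothetical placeholder for a standardized new observation):

library(rstan)

# posterior means: get_posterior_mean() returns one column per chain plus an
# all-chains mean as the final column
pm_Lambda <- get_posterior_mean(model.fit, pars = "Lambda")
pm_gamma  <- get_posterior_mean(model.fit, pars = "gamma")

# rstan flattens matrix parameters in column-major order, matching R's matrix()
Lambda_hat <- matrix(pm_Lambda[, ncol(pm_Lambda)], nrow = 11, ncol = 30)
gamma_hat  <- pm_gamma[, ncol(pm_gamma)]  # length-10 vector

# hypothetical new observation: 30 standardized manifest values, with the
# eta-caused manifest (column 1, which loads on row 1 of Lambda) zeroed out
y_new <- rep(0, 30)  # placeholder; replace with a real standardized observation

# row 1 of Lambda corresponds to eta and rows 2:11 to xi[1..10], so the
# proposed "inversion" uses only rows 2:11
xi_new  <- as.vector(Lambda_hat[2:11, ] %*% y_new)
eta_new <- sum(gamma_hat * xi_new)

I realize this treats \Lambda as a projection from observations to latents, which runs opposite to the direction of the measurement model, which is partly why I'm asking question 1 below.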
Everything in the model is assumed to be normally distributed (questionable, but what can you do with 9 samples).
Questions I’m struggling to answer:
0. Where is the best place online to bootstrap a core understanding of the SEM principles applied in Stan?
1. Is it even valid to take new data and use the model fit to calculate a new value of \eta?
2. Is my assumption above about how to extract the relevant model parameters for that calculation valid?
3. Are there some basic sanity checks I can do on the fit object to determine whether things have gone right off the rails, like getting an estimated degrees of freedom post-fit? (The first sketch after this list shows the checks I've found so far.)
4. Obviously there were some major issues with missing data in the observations. Post-fit, if data is missing, is it valid to set it to zero? I suggest this because the basic linear algebra would then let me estimate how much the value of \eta could shift due to the missing data. Alternatively, replacing missing data with the training-data mean of that manifest variable might make sense; given that the data was standardized with scale(), the per-column training means are zero, so the two options should actually coincide. (The second sketch after this list shows what I mean.)
5. A subsequent iteration of the model, for which I have fitted values of \Lambda and \gamma (but no longer the Stan model code), appears to have significantly fewer elements of \Lambda fixed at 1.0. This has had me scratching my head quite a bit; the only explanations I can think of are that the original data scientist changed the identification constraints (freeing some of the loadings fixed at 1.0), or that whoever took the extract from the model did a very bad job and copied the wrong numbers.
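For question 3, these are the post-fit checks I've been able to piece together so far (a sketch against an rstan fit object; corrections very welcome):

library(rstan)

# HMC-specific diagnostics: divergences, max-treedepth hits, E-BFMI
check_hmc_diagnostics(model.fit)

# convergence and information content per parameter
fit_summary <- summary(model.fit)$summary
range(fit_summary[, "Rhat"], na.rm = TRUE)   # should all be very close to 1
range(fit_summary[, "n_eff"], na.rm = TRUE)  # tiny values mean poor mixing

# crude posterior predictive check: do the y_sim draws from generated
# quantities actually bracket the observed y?
y_sim <- extract(model.fit, pars = "y_sim")$y_sim        # draws x I x 30 array
ppc_bounds <- apply(y_sim, c(2, 3), quantile, probs = c(0.05, 0.95))
# compare ppc_bounds[1, , ] and ppc_bounds[2, , ] against the observed y matrix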
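And for question 4, here is the replacement idea made concrete (a sketch reusing Lambda_hat, gamma_hat, and y_new from the earlier block; -9.0 is the missing-value code from the Stan model):

# the training data went through scale(), so every column mean is 0;
# "set missing to zero" and "impute the training-column mean" therefore coincide
j_missing <- which(y_new == -9.0)  # -9.0 marks missing entries, as in the Stan code
y_new[j_missing] <- 0

# rough bound on how much eta could shift per missing manifest variable:
# eta depends on y linearly through gamma' * Lambda[2:11, ], so each column's
# weight times a plausible +/- 2 SD standardized value bounds the swing
eta_weight <- as.vector(gamma_hat %*% Lambda_hat[2:11, ])  # length-30 weights
outer(eta_weight[j_missing], c(-2, 2))  # worst-case eta swing per missing entry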
Seriously, any help is much appreciated. I find the SEM idea quite interesting, but a deeper understanding has so far been somewhat impenetrable to me.