So I’ve made some poor life choices and landed myself with multiple impossible tasks at once. I’m not overly familiar with SEM techniques, but I have moderate exposure to Bayesian methods and a pile of other math and physics, and I’ve been asked to help a startup with a ‘machine learning’ model that a data scientist whipped up for them a year ago. Little did I know that when I opened the hood it would be a doozy of a challenge.
Now, rather than bumble around here bothering people for free help, I’d like to post what I can (anonymized, unfortunately, to protect IP) along with what I’ve managed to decipher so far, so people can tell me just how wrong I am. I’ll also mention that if someone wants the opportunity to run me through a focused crash course on SEM, I’ll happily see if I can get the CEO to front a budget. I will do my best to demonstrate that I’m putting in effort, and I’m asking for help with as much humility as I can muster.
The core goal is to fit a model to a range of manifest observations so that we can produce a new “metric” that evaluates the goodness of a sample. That is, we would take the observations from a new sample and, using the fitted parameters, calculate the value of a latent variable in the model to use as a relative metric.
The model I have inherited is based on Lee and Song (2003), “Bayesian analysis of structural equation models with dichotomous variables”, Statist. Med., DOI 10.1002/sim.1544. In our case only some of the data is ordinal, with the rest being a mix of nominal and ratio data. Once I have file-attachment privileges I will attach a sample data set, which was standardized using R’s scale function.
One immediate flaw I will point out before anyone even gets to the Stan code: the original model was trained on 9 samples (!?!?) against 30 manifest variables. I know this is wildly insufficient for a good fit, but the startup’s plan was always to collect more data and retrain. One of the reasons I was called in was to check that they could re-run the model fit, which they can’t, which is why I’m here: we need a working fit to justify collecting more sample data.
The model code I was provided with is as follows (comments are mine):
// declare the data to be used for the model
data{
int<lower=1> I; // number of samples
int run_estimation; // switch for the estimation, appears to be set to 1 in test R code
matrix[I, 30] y; // matrix with all manifest observations
int<lower=0> matrix_miss; // length of missing observations vector
}
// define the unknowns to be estimated, including any restrictions on their values
parameters{
matrix[I, 10] xi; // latent variables matrix
vector<lower=0>[10] tau; // scale (SD) of each latent variable (xi)
corr_matrix[10] Omega; // correlation between latent variables (xi)
positive_ordered[10] gamma; // coefficients of latent variables (xi) to prediction (eta), fixed order I believe to aid model identification
vector[I] z; // residual (disturbance) term of the final prediction (eta)
// real<lower=0> sigma_delta; // --appears to have been replaced
vector[19] lambda; // values to form coefficient matrix from manifest observed variables (y) to latent variables (xi)
vector<lower=0>[30] scale_eps; // residual SDs (errors) of the manifest observed variables
real lambda_out; // --appears unused
vector[matrix_miss] missing_values; // vector of missing manifest variable values to fill in y
}
// transformations of the parameters and data declared above
transformed parameters{
matrix[I, 30] y_hat; // actual manifest observations matrix used by model
matrix[11, 30] Lambda; // actual matrix of coefficients between (y) and (xi) formed from (lambda)
vector[I] eta; // final prediction for each sample
eta = xi * gamma + z; // definition for eta
// Populate matrix Lambda from vector lambda
Lambda = rep_matrix(0.0, 11, 30);
Lambda[1,1] = 1.0;
Lambda[2,2] = 1.0;
Lambda[3,3] = 1.0;
Lambda[4,4] = 1.0;
Lambda[5,5] = 1.0;
Lambda[5,6:10] = lambda[1:5]';
Lambda[6,11] = 1.0;
Lambda[6,12:15] = lambda[6:9]';
Lambda[7,16] = 1.0;
Lambda[7,17:20] = lambda[10:13]';
Lambda[8,21] = 1.0;
Lambda[9,22] = 1.0;
Lambda[10,23] = 1.0;
Lambda[10,24:29] = lambda[14:19]';
Lambda[11,30] = 1.0;
// Populate y_hat from y and missing_values
{
int count = 0;
for(i in 1:I){
for(j in 1:30){
if (y[i,j] == -9.0){
count += 1;
y_hat[i,j] = missing_values[count];
} else {
y_hat[i,j] = y[i,j];
}
}
}
}
}
// where the full probability model is defined
model{
tau ~ normal(0,1);
Omega ~ lkj_corr(10);
z ~ normal(0,1);
gamma ~ normal(0,1);
// sigma_delta ~ normal(.5, .3);
lambda ~ normal(0,1);
scale_eps ~ normal(1,.1); // I believe because scale() standardizes the inputs to unit variance, so residual SDs should sit near 1
lambda_out ~ normal(0,1);
// Define latent variables with respect to correlation matrix and variance
for(i in 1:I){
xi[i] ~ multi_normal(rep_vector(0.0, 10), quad_form_diag(Omega, tau));
}
// The apparent only use of the run_estimation flag
if(run_estimation == 1){
for(i in 1:I){
y_hat[i]' ~ normal(Lambda' * append_row(rep_vector(eta[i], 1), xi[i]'), scale_eps);
}
}
}
// generates a range of outputs from the model
generated quantities{
matrix[I, 30] y_sim; // simulated possible values of the y_hat manifest observations based on the model fit
for(i in 1:I){
y_sim[i] = multi_normal_rng(Lambda' * append_row(rep_vector(eta[i], 1), xi[i]'), diag_matrix(square(scale_eps)))'; // covariance = diag(scale_eps^2)
}
}
What I have gathered from some reading, online tutorials, and the model code itself is that, per sample i, the above model is:
\eta_i = \xi_i \cdot \gamma + \zeta_i
where
\xi_i = \Lambda \cdot \widehat{y}_i
assuming that we exclude the manifest \widehat{y} observations that are caused by the desired latent variable \eta we are using as our metric.
So I expect that I can take a new observation \widehat{y}_{new}, set to zero the manifest variable that is caused by \eta, multiply by the fitted \Lambda extracted using get_posterior_mean(model.fit, pars = "Lambda") (retaining only the mean.all.chains column), and then multiply by \gamma, extracted from the fit in the same way, to get my new \eta_{new} for the sample. A minimal sketch of this calculation follows.
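In R, the plan I just described would look something like this (a minimal sketch, assuming an rstan fit object named model.fit; y_new is a hypothetical placeholder for a standardized new observation):

library(rstan)

# posterior means: get_posterior_mean() returns one column per chain plus an
# all-chains mean as the final column
pm_Lambda <- get_posterior_mean(model.fit, pars = "Lambda")
pm_gamma  <- get_posterior_mean(model.fit, pars = "gamma")

# rstan flattens matrix parameters in column-major order, matching R's matrix()
Lambda_hat <- matrix(pm_Lambda[, ncol(pm_Lambda)], nrow = 11, ncol = 30)
gamma_hat  <- pm_gamma[, ncol(pm_gamma)]  # length-10 vector

# hypothetical new observation: 30 standardized manifest values, with the
# eta-caused manifest (column 1, which loads on row 1 of Lambda) zeroed out
y_new <- rep(0, 30)  # placeholder; replace with a real standardized observation

# row 1 of Lambda corresponds to eta and rows 2:11 to xi[1..10], so the
# proposed "inversion" uses only rows 2:11
xi_new  <- as.vector(Lambda_hat[2:11, ] %*% y_new)
eta_new <- sum(gamma_hat * xi_new)

I realize this treats \Lambda as a projection from observations to latents, which runs opposite to the direction of the measurement model, which is partly why I'm asking question 1 below.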
Everything in the model is assumed to be normally distributed (questionable, but what can you do with 9 samples).
Questions I’m struggling to answer:
0. Where is the best place online to bootstrap a core understanding of the SEM principles applied in Stan?
1. Is it even valid to take new data and use the model fit to calculate a new value of \eta?
2. Is my assumption above about how to extract the relevant model parameters for that calculation valid?
3. Are there some basic sanity checks I can do on the fit object to determine whether things have gone right off the rails, like getting an estimated degrees of freedom post-fit? (The first sketch after this list shows the checks I've found so far.)
4. Obviously there were some major issues with missing data in the observations. Post-fit, if data is missing, is it valid to set it to zero? I suggest this because the basic linear algebra would then let me estimate how much the value of \eta could shift due to the missing data. Alternatively, replacing missing data with the training-data mean of that manifest variable might make sense; given that the data was standardized with scale(), the per-column training means are zero, so the two options should actually coincide. (The second sketch after this list shows what I mean.)
5. A subsequent iteration of the model, for which I have fitted values of \Lambda and \gamma (but no longer the Stan model code), appears to have significantly fewer elements of \Lambda fixed at 1.0. This has had me scratching my head quite a bit; the only explanations I can think of are that the original data scientist changed the identification constraints (freeing some of the loadings fixed at 1.0), or that whoever took the extract from the model did a very bad job and copied the wrong numbers.
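For question 3, these are the post-fit checks I've been able to piece together so far (a sketch against an rstan fit object; corrections very welcome):

library(rstan)

# HMC-specific diagnostics: divergences, max-treedepth hits, E-BFMI
check_hmc_diagnostics(model.fit)

# convergence and information content per parameter
fit_summary <- summary(model.fit)$summary
range(fit_summary[, "Rhat"], na.rm = TRUE)   # should all be very close to 1
range(fit_summary[, "n_eff"], na.rm = TRUE)  # tiny values mean poor mixing

# crude posterior predictive check: do the y_sim draws from generated
# quantities actually bracket the observed y?
y_sim <- extract(model.fit, pars = "y_sim")$y_sim        # draws x I x 30 array
ppc_bounds <- apply(y_sim, c(2, 3), quantile, probs = c(0.05, 0.95))
# compare ppc_bounds[1, , ] and ppc_bounds[2, , ] against the observed y matrix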
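And for question 4, here is the replacement idea made concrete (a sketch reusing Lambda_hat, gamma_hat, and y_new from the earlier block; -9.0 is the missing-value code from the Stan model):

# the training data went through scale(), so every column mean is 0;
# "set missing to zero" and "impute the training-column mean" therefore coincide
j_missing <- which(y_new == -9.0)  # -9.0 marks missing entries, as in the Stan code
y_new[j_missing] <- 0

# rough bound on how much eta could shift per missing manifest variable:
# eta depends on y linearly through gamma' * Lambda[2:11, ], so each column's
# weight times a plausible +/- 2 SD standardized value bounds the swing
eta_weight <- as.vector(gamma_hat %*% Lambda_hat[2:11, ])  # length-30 weights
outer(eta_weight[j_missing], c(-2, 2))  # worst-case eta swing per missing entry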
Seriously, any help is much appreciated. I find the SEM idea quite interesting, but a deeper understanding has so far been somewhat impenetrable to me.