# Identifying variable influence in SEM

Consider the following SEM (note that the graphic below doesn’t follow all conventional SEM DAG rules; it is more of a dependency graph):

A common latent loads on two derived latents that in turn connect to two outcomes. The weights are constrained to `lower=-1, upper=1` (with `r1` further constrained to 0–1 for identifiability) and are thus interpretable as correlations, with their squares quantifying the proportion of variability in each derived latent attributable to the common factor.
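As a quick sanity check on that interpretation, here is a small Python simulation (the value of `r1` is arbitrary, chosen just for illustration) verifying that the weighted combination keeps unit variance and that the weight equals the correlation with the common factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
r1 = 0.6  # example constrained weight in (0, 1)

common = rng.standard_normal(n)
res1 = rng.standard_normal(n)
# derived latent: weighted sum of common factor and residual
mu1 = common * r1 + res1 * np.sqrt(1 - r1**2)

print(np.var(mu1))                     # ~1: the derived latent stays standardized
print(np.corrcoef(common, mu1)[0, 1])  # ~r1: the weight is the correlation
```

So `r1**2` (here 0.36) is the share of the derived latent’s variance attributable to `common`.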

The Stan code for this model:

``````stan
data{
  int n_id ;
  int n_obs ;
  matrix[n_obs,n_id] y1 ;
  matrix[n_obs,n_id] y2 ;
}
parameters{
  real<lower=0> noise1 ;
  real<lower=0> noise2 ;
  real<lower=0,upper=1> r1 ;
  unit_vector[2] r2 ;  // for unconstrained SEM correlations, unit_vector makes for less by-hand compute
  vector[n_id] common ;
  vector[n_id] res1 ;
  vector[n_id] res2 ;
}
model{
  noise1 ~ weibull(2,1) ;
  noise2 ~ weibull(2,1) ;
  res1 ~ std_normal() ;
  res2 ~ std_normal() ;
  common ~ std_normal() ;
  vector[n_id] mu1 = (
    common * r1
    + res1 * sqrt(1-pow(r1,2)) // have to compute this term by-hand for constrained correlations
  ) ;
  vector[n_id] mu2 = (
    common * r2[1]
    + res2 * r2[2] // for unconstrained SEM correlations, unit_vector gives us this term automatically
  ) ;
  for(i_id in 1:n_id){
    y1[,i_id] ~ normal( mu1[i_id] , noise1 ) ;
    y2[,i_id] ~ normal( mu2[i_id] , noise2 ) ;
  }
}
``````
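To see why the `unit_vector` parameterization removes the by-hand `sqrt(1 - r^2)` term: any point on the unit circle already satisfies `r2[1]^2 + r2[2]^2 = 1`, so the derived latent keeps unit variance automatically, for any angle. A quick Python illustration (the angle `theta` is an arbitrary hypothetical value; Python’s `r2[0]`/`r2[1]` correspond to Stan’s 1-based `r2[1]`/`r2[2]`):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
theta = 2.2  # arbitrary angle; any theta yields a valid unit vector
r2 = np.array([np.cos(theta), np.sin(theta)])  # plays the role of Stan's unit_vector[2]

common = rng.standard_normal(n)
res2 = rng.standard_normal(n)
mu2 = common * r2[0] + res2 * r2[1]

print(np.var(mu2))                     # ~1 for any theta, no sqrt(1 - r^2) needed
print(np.corrcoef(common, mu2)[0, 1])  # ~r2[0], which can land anywhere in [-1, 1]
```

This is why the unconstrained correlation `r2[1]` can range over the full [-1, 1] interval while the variance decomposition stays coherent.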

Now, consider an extension of this model with two more observed variables: one modelled directly by a single new latent, and the other modelled as a combination of that new latent plus the `common` latent from the previous model:

With Stan code:

``````stan
data{
  int n_id ;
  int n_obs ;
  matrix[n_obs,n_id] y1 ;
  matrix[n_obs,n_id] y2 ;
  matrix[n_obs,n_id] y3 ;
  matrix[n_obs,n_id] y4 ;
}
parameters{
  real<lower=0> noise1 ;
  real<lower=0> noise2 ;
  real<lower=0> noise3 ;
  real<lower=0> noise4 ;
  real<lower=0,upper=1> r1 ;
  unit_vector[2] r2 ;
  vector[n_id] res1 ;
  vector[n_id] res2 ;
  vector[n_id] common ;
  vector[n_id] mu3 ;
  vector[n_id] res4 ;
  unit_vector[3] r_mu4 ;
}
model{
  noise1 ~ weibull(2,1) ;
  noise2 ~ weibull(2,1) ;
  noise3 ~ weibull(2,1) ;
  noise4 ~ weibull(2,1) ;
  res1 ~ std_normal() ;
  res2 ~ std_normal() ;
  common ~ std_normal() ;
  mu3 ~ std_normal() ;
  res4 ~ std_normal() ;
  vector[n_id] mu1 = (
    common * r1
    + res1 * sqrt(1-pow(r1,2))
  ) ;
  vector[n_id] mu2 = (
    common * r2[1]
    + res2 * r2[2]
  ) ;
  vector[n_id] mu4 = (
    common * r_mu4[1]
    + mu3 * r_mu4[2]
    + res4 * r_mu4[3]
  ) ;
  for(i_id in 1:n_id){
    y1[,i_id] ~ normal( mu1[i_id] , noise1 ) ;
    y2[,i_id] ~ normal( mu2[i_id] , noise2 ) ;
    y3[,i_id] ~ normal( mu3[i_id] , noise3 ) ;
    y4[,i_id] ~ normal( mu4[i_id] , noise4 ) ;
  }
}
``````

When the data `y3` are low-noise, the model is well identified and it’s possible to interpret `r_mu4[1]^2` as the r-square for predicting `mu4` from `common` alone, and `r_mu4[1]^2 + r_mu4[2]^2` as the r-square for predicting `mu4` from both `common` and `mu3`.
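That decomposition can be checked directly by simulation (no model fitting involved; the unit-vector values below are hypothetical, chosen so the entries square-sum to 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
r = np.array([0.5, 0.7, np.sqrt(1 - 0.5**2 - 0.7**2)])  # a valid unit_vector[3]

common = rng.standard_normal(n)
mu3 = rng.standard_normal(n)
res4 = rng.standard_normal(n)
mu4 = common * r[0] + mu3 * r[1] + res4 * r[2]

def r_square(X, y):
    # R^2 from least-squares projection of y onto the columns of X
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print(r_square(common[:, None], mu4))                 # ~ r[0]^2 = 0.25
print(r_square(np.column_stack([common, mu3]), mu4))  # ~ r[0]^2 + r[1]^2 = 0.74
```

The independence of `common`, `mu3`, and `res4` is what makes the r-squares simply additive here.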

However, when `y3` is high-noise, I think an identifiability issue creeps in: `mu3` begins to take on a “random noise” character and can start to exchange roles with `res4` as the random-noise residual. For model fitting this doesn’t seem to pose much of an issue, merely inducing a negative correlation between `r_mu4[2]` and `r_mu4[3]` that HMC can handle, BUT it does render the r-square derivations/interpretations above invalid, at least as far as I can discern. That is, if the “added value” of `mu3` for predicting `mu4` is of critical importance, then when `mu3` itself is not well constrained by the `y3` data, `mu3` will trade roles in the SEM with the residual `res4` and thereby yield inflated (or at least non-meaningful) values for its influence on `y4`.
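The role exchange can be made concrete: once `mu3` is unconstrained by `y3`, both `mu3` and `res4` are just standard-normal draws, so the implied distribution of `mu4` depends on `r_mu4[2]` and `r_mu4[3]` only through `r_mu4[2]^2 + r_mu4[3]^2`. A quick check with two hypothetical unit vectors that split that sum of squares differently:

```python
import numpy as np

# two unit vectors sharing r[0] and r[1]^2 + r[2]^2,
# but splitting the mu3/res4 loadings differently
r_a = np.array([0.6, 0.8, 0.0])
r_b = np.array([0.6, 0.0, 0.8])

rng = np.random.default_rng(3)
n = 500_000
common = rng.standard_normal(n)
mu3 = rng.standard_normal(n)   # unconstrained by data: effectively a std-normal draw
res4 = rng.standard_normal(n)

mu4_a = common * r_a[0] + mu3 * r_a[1] + res4 * r_a[2]
mu4_b = common * r_b[0] + mu3 * r_b[1] + res4 * r_b[2]

# same variance and same covariance with common in both settings,
# so y4 alone cannot distinguish them
print(np.var(mu4_a), np.var(mu4_b))                              # both ~1
print(np.cov(common, mu4_a)[0, 1], np.cov(common, mu4_b)[0, 1])  # both ~0.6
```

Since both settings imply the same Gaussian law for `mu4`, the data carry no information about the split, which is exactly the ridge of posterior indeterminacy described above.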

Is this non-identifiability circumventable in any obvious way other than to collect enough `y3` data to properly constrain `mu3`?

What was the motivation for using a `weibull(2,1)` prior on the noise standard deviations? That its mass is concentrated near 1? Why not use a `student_t` distribution? Did you make a pairs plot and look at the correlations between the sampled parameters?
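For reference, a quick check of where `weibull(2,1)` actually puts its mass (closed-form math only, standard library): its CDF is 1 − exp(−x²), its mean is Γ(1.5) ≈ 0.886, and its mode is √(1/2) ≈ 0.707, so the bulk of the mass sits somewhat below 1, with a soft shoulder pushing away from 0:

```python
import math

shape, scale = 2.0, 1.0  # the weibull(2, 1) prior used for the noise parameters

def weibull_cdf(x):
    # CDF of the Weibull distribution: 1 - exp(-(x/scale)^shape)
    return 1 - math.exp(-((x / scale) ** shape))

mean = scale * math.gamma(1 + 1 / shape)             # ~0.886
mode = scale * ((shape - 1) / shape) ** (1 / shape)  # ~0.707
print(mean, mode)
print(weibull_cdf(1.0))                     # ~0.632: about 63% of mass below 1
print(weibull_cdf(2.0) - weibull_cdf(0.2))  # ~0.94: bulk of mass between 0.2 and 2
```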