Learn parameters of a Gaussian linear model in Pearl's backdoor model

We have a model that consists of three variables U, X, and Y, with dimensions D \times L, D \times N, and D \times 1, respectively. Our model is as follows:

U := N_{U} \sim N(\mu, \Sigma_{UU})
X := U \alpha + N_{X}
Y := X \beta + U \eta + N_{Y}

Where N_X and N_Y have a Normal distribution with mean zero and identity covariance matrix. Hence,

U \sim N(\mu,\Sigma_{UU})
X \sim N(\mu \alpha , \Sigma_{XX}) = N(\mu \alpha, \alpha^T \Sigma_{UU} \alpha + I_{N_X N_X})
Y \sim N(\mu \alpha \beta + \mu \eta, \Sigma_{YY})

Where \Sigma_{YY} is,
\beta^T \Sigma_{XX}\beta + \beta^T \alpha^T \Sigma_{UU} \eta + \eta^T \Sigma_{UU} \alpha \beta + \eta^T \Sigma_{UU} \eta + \eta^T \mu^T \mu \eta + I_{N_Y N_Y}.
For simplicity we assume that \Sigma_{UU} is an identity matrix. The dimensions of \mu, \alpha, \beta and \eta are 1 \times L, L \times N, N \times 1, and L \times 1 respectively.

Our goal is to learn the parameters \mu, \alpha, \beta and \eta. I have created the following Stan model. My problem is that, using the optimizing method, I can learn all of the parameters correctly except \eta. Below I post my data-generation process in R and my Stan model. Any help would be greatly appreciated.

f_u, f_x, and f_y are the functions that generate the data u, x, and y together with \mu, \alpha, \beta and \eta (the run below uses L = 5, N = 8, and D = 1000).

library(rstan)  # for stan_model(), optimizing(), and sampling()

L <- 5; N <- 8; D <- 1000  # dimensions used by the generators (they match data_list below)

f_u <- function(){
  mu <- rnorm(L, 0, 10)
  Sigma_uu = diag(L)
  u <- matrix(0, nrow=D, ncol=L)
  for(i in 1:D){
    for(j in 1:L){
      u[i, j] <- rnorm(1, mu[j], sqrt(Sigma_uu[j,j]))
    }
  }
  return(list(u = u, mu = mu, Sigma_uu = Sigma_uu))
}
sim <- f_u()
mu <- sim$mu
u  <- sim$u
Sigma_uu <- sim$Sigma_uu

f_x <- function(u, Sigma_uu){
  alpha <- matrix(0, nrow = L, ncol = N)
  for(i in 1:L){
    for(j in 1:N){
      alpha[i, j] <- rnorm(1, 0, 10)
    }
  }
  linear_exp = u %*% alpha
  Sigma_xx = t(alpha) %*% Sigma_uu %*% alpha + diag(N)
  x <- matrix(0, nrow = D, ncol = N)
  for(i in 1:D){
    for(j in 1:N){
      x[i, j] <- rnorm(1, linear_exp[i,j],sqrt(Sigma_xx[j,j]))
    }
  }
  return(list(x = x, alpha = alpha, Sigma_xx = Sigma_xx))
}
sim_x <- f_x(u,Sigma_uu)
alpha <- sim_x$alpha
x <- sim_x$x
Sigma_xx <- sim_x$Sigma_xx

f_y <- function(x,Sigma_xx, u, Sigma_uu){
  beta <- matrix(0, nrow = N, ncol = 1)
  eta <- matrix(0, nrow = L, ncol = 1)
  for (i in 1:N) {
    beta[i,1] <- rnorm(1,0,10)
  }
  for (i in 1:L) {
    eta[i,1] <- rnorm(1,0,10)
  }

  linear_exp = x %*% beta + u %*% eta
  var_yy = t(beta) %*% Sigma_xx %*% beta + t(eta) %*% Sigma_uu %*% eta + 1
  std_y = sqrt(var_yy[1,1])
  y <- matrix(0, nrow = D, ncol = 1)
  for(i in 1:D){
    y[i, 1] <- rnorm(1, linear_exp[i,1],std_y)
  }
  return(list(y = y, beta = beta, eta = eta))
}
sim_y <- f_y(x, Sigma_xx, u, Sigma_uu)
beta <- sim_y$beta
eta <- sim_y$eta
y <- sim_y$y

The Stan model is:

model_str <- "
    data {
        int L;
        int D;
        int N;
        matrix[D, L] u;
        matrix[D, N] x;
        matrix[D, 1] y;
        matrix[L, L] sigma_uu;
    }
    parameters {
       vector[L] mu; 
       matrix[L, N] alpha;
       matrix[N, 1] beta;
       matrix[L, 1] eta;
    }
    transformed parameters {
       matrix[D, N] x_loc;
       matrix[D, 1] y_loc;
       vector[N] x_scale;
       vector[1] y_scale;
       x_scale = sqrt(diagonal(alpha'*sigma_uu*alpha) + rep_vector(1,N));
       y_scale = sqrt(diagonal(beta'*(alpha'*sigma_uu*alpha + diag_matrix(rep_vector(1, N)))*beta
                               + beta'*alpha'*sigma_uu*eta + eta'*sigma_uu*alpha*beta
                               + eta'*sigma_uu*eta + eta'*(mu'*mu)*eta)
                      + rep_vector(1, 1));
       for (i in 1:D){
           x_loc[i, ] = u[i, ] * alpha;
           y_loc[i, ] = x[i, ] * beta + u[i, ] * eta;
       }
    }
    model {
        target += normal_lpdf(mu | 0, 10);           // priors
        for (j in 1:L){
            target += normal_lpdf(alpha[j, ] | 0, 10);
            target += normal_lpdf(eta[j, ] | 0, 10);
        }
        for (j in 1:N){
            target += normal_lpdf(beta[j, ] | 0, 10);
        }
        for (i in 1:D){
             target += normal_lpdf(u[i, ] | mu, 1);      // likelihood
             target += normal_lpdf(x[i, ] | x_loc[i, ], x_scale);
             target += normal_lpdf(y[i, ] | y_loc[i, ], y_scale);
        }
    }
"

I compile and fit the model as follows:

mod <- stan_model(model_code = model_str)
data_list <- list(u=u, L=5, D=1000, x = x, N =8, y = y, sigma_uu = diag(L))
optim_fit <- optimizing(mod, data=data_list)

Let’s compare the estimates of parameters with their actual values:

optim_fit$par[1:5] #estimated mu
mu  #real mu
optim_fit$par[6:45] #estimated alpha
t(alpha) #real alpha
optim_fit$par[46:53] #estimated beta
t(beta) #real beta
optim_fit$par[54:58] #estimated eta
t(eta) #real eta
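
Equivalently - assuming the names in optim_fit$par follow rstan's usual "eta[1,1]"-style labels - the same estimates can be pulled out by name instead of by position:

optim_fit$par[grep("^eta\\[", names(optim_fit$par))]  #estimated eta, selected by name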

All the parameters are learned correctly except for \eta. I can't figure out why. Can anyone please help me identify the issue?

Welcome srtaheri to the Stan forum,

Y \sim N(\mu \alpha \beta + \mu \eta, \Sigma_{YY}) = N(\mu \alpha \beta + \mu \eta, \beta^T \Sigma_{XX}\beta + \eta^T \Sigma_{UU} \eta + 2 \eta^T \Sigma_{UU} \alpha \beta + I_{N_Y N_Y})

If \eta < 0 then 2 \eta^T \Sigma_{UU} \alpha \beta can become negative. Short question: is + 2 \eta^T \Sigma_{UU} \alpha \beta in the RHS for Y correct? (Is there an \eta missing?)

Thank you for your reply. I edited my \Sigma_{YY}, and also in the Stan model. It is now equal to,

\beta^T \Sigma_{XX}\beta + \beta^T \alpha^T \Sigma_{UU} \eta + \eta^T \Sigma_{UU} \alpha \beta + \eta^T \Sigma_{UU} \eta + \eta^T \mu^T \mu \eta + I_{N_Y N_Y}

But I still can't learn the parameter \eta correctly. Even if I put the exact correct std of y into my Stan model, instead of calculating y_scale, I still can't get the estimate of \eta correctly.

Y \sim N(\mu \alpha \beta + \mu \eta, \Sigma_{YY})

\mu \eta expands to \eta^T\mu^T\mu\eta + I_{N_Y N_Y}

but in the expansion of \mu\alpha\beta, why is there an \eta? This part is referring to X\beta?

Y = X \beta + U \eta + N_Y, so,

E(Y) = E(X) \beta + E(U) \eta = \mu \alpha \beta + \mu \eta

Sure, and the variance \Sigma_{YY}? See the first post where it's specified. The \mu\eta part is OK.

\Sigma_{YY} is equal to,

\beta^T \Sigma_{XX}\beta + \beta^T \alpha^T \Sigma_{UU} \eta + \eta^T \Sigma_{UU} \alpha \beta + \eta^T \Sigma_{UU} \eta + \eta^T \mu^T \mu \eta + I_{N_Y N_Y}

Am I missing something?

It's probably me. I don't understand the variance transformation. Specifically, I don't understand why \eta becomes a factor together with \alpha and \beta.

Y = X \beta + U \eta + N_Y = (U \alpha + N_X) \beta + U \eta + N_Y

so,

Y = U \alpha \beta + N_X \beta + U \eta + N_Y

now,

\Sigma_{YY} = E[(Y-E[Y])^T(Y-E[Y])] = E[(\beta^T \alpha^T U^T + \beta^T N^T_{X} + \eta^T U^T + N^T_{Y}- \beta^T \alpha^T \mu^T)(U\alpha \beta + N_X \beta + U \eta + N_Y - \mu \alpha \beta)]

If you work out the math, you will probably get something similar to what I got. But that is not my problem here. Let's assume that I take the real \Sigma_{YY} from my data-generating process and put the correct \Sigma_{YY} directly into my Stan model. Even in this case, I can't get the estimates of \eta correctly.

@martinmodrak maybe you can help here.

First, thanks for the tag. I don't see an obvious problem, but I also can't follow the math, sorry.

What I do in such circumstances is to work from a simpler model (and corresponding simulated data) - either move most parameters to data (i.e. provide true values) or even completely remove parts of the model until I find the simplest model that still manifests the issue and the most complex model that works as expected. Then it is usually much easier to guess what is going wrong.
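
For instance, here is a minimal sketch of the "move most parameters to data" idea (I have not run this): mu, alpha, and beta are passed in at their true simulated values, so eta is the only remaining parameter; the y_scale expression is taken unchanged from your model, and the names model_str_eta_only / fit_eta_only are just illustrative.

model_str_eta_only <- "
    data {
        int L;
        int D;
        int N;
        matrix[D, L] u;
        matrix[D, N] x;
        matrix[D, 1] y;
        matrix[L, L] sigma_uu;
        vector[L] mu;          // true simulated values supplied as data
        matrix[L, N] alpha;    // true simulated values supplied as data
        matrix[N, 1] beta;     // true simulated values supplied as data
    }
    parameters {
       matrix[L, 1] eta;
    }
    transformed parameters {
       matrix[D, 1] y_loc;
       vector[1] y_scale;
       y_scale = sqrt(diagonal(beta'*(alpha'*sigma_uu*alpha + diag_matrix(rep_vector(1, N)))*beta
                               + beta'*alpha'*sigma_uu*eta + eta'*sigma_uu*alpha*beta
                               + eta'*sigma_uu*eta + eta'*(mu'*mu)*eta)
                      + rep_vector(1, 1));
       for (i in 1:D){
           y_loc[i, ] = x[i, ] * beta + u[i, ] * eta;
       }
    }
    model {
        for (j in 1:L){
            target += normal_lpdf(eta[j, ] | 0, 10);
        }
        for (i in 1:D){
             target += normal_lpdf(y[i, ] | y_loc[i, ], y_scale);
        }
    }
"
fit_eta_only <- optimizing(stan_model(model_code = model_str_eta_only),
                           data = c(data_list, list(mu = mu, alpha = alpha, beta = beta)))
fit_eta_only$par  # compare against t(eta)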

Also, it is possible that the problem is the use of optimizing - maybe for this model the maximum a posteriori estimate for eta simply isn’t really representative of the posterior? Or maybe the posterior for eta is actually very wide (i.e. the data don’t really inform eta and we are basically left with the prior?)

I’ve tried running the model with sampling for D = 10 and this is actually what I observe, the results for eta are:

               mean   se_mean       sd      2.5%       25%        50%      75%    97.5%    n_eff      Rhat
eta[1,1] -0.4300312 0.1129946 9.775357 -19.08538 -7.136960 -0.3743708 6.346911 18.43398 7484.282 0.9998207
eta[2,1]  0.3988653 0.1310413 9.723525 -18.90360 -5.962849  0.4988552 6.976638 19.36335 5505.933 0.9996996
eta[3,1]  0.1182934 0.1199940 9.613023 -18.74033 -6.245467  0.1024488 6.606756 19.08571 6418.018 1.0003555
eta[4,1]  0.4920918 0.1169741 9.565062 -18.73486 -5.699214  0.5851166 6.727748 19.36741 6686.459 0.9995837
eta[5,1]  0.3872079 0.1216390 9.581643 -18.47652 -6.183869  0.4228076 6.814392 19.18640 6204.894 0.9995415

Which is basically what I would expect from the normal(0, 10) prior on eta. Investigating further, here’s my posterior for y_scale:

              mean  se_mean       sd     2.5%      25%      50%      75%    97.5%   n_eff     Rhat
y_scale[1] 916.504 3.715834 176.0317 586.9176 795.7389 909.1492 1028.136 1284.638 2244.24 1.001895

Which seems to explain the problem - y_scale comes out so big that y just can't inform y_loc and thus also eta.

Does that make sense?

Thanks for your answer. I reduced the variance for my parameters to get a smaller y_scale. I still have the same problem, but it gets better. Why does it only affect \eta and not \beta? I am also interested to know how you got those results for eta and y_scale. I am very new to Stan and don't know it well.

Good question. In your simulations, x contains much larger values than u, so the effect of \beta on y is larger than the effect of \eta.
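
A quick way to see this from your simulated objects (just a sketch):

summary(abs(as.vector(u)))  # entries of u are on the order of mu, i.e. roughly tens
summary(abs(as.vector(x)))  # entries of x are roughly u %*% alpha plus noise, i.e. much larger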

I used sampling (which as I said is preferable as this gives you the full Bayesian posterior) and then looked at posterior summaries:

sampling_fit <- sampling(mod, data=data_list)
summary(sampling_fit, pars = "eta")$summary
summary(sampling_fit, pars = "y_scale")$summary

however, sampling is much slower than optimizing (that’s why I ran with D = 10) so it might easily take hours with D = 1000.
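
If you want to reproduce it, something along these lines should work (a sketch that reuses your generating functions; data_list_small is an illustrative name):

D <- 10
sim <- f_u();  u <- sim$u;  mu <- sim$mu;  Sigma_uu <- sim$Sigma_uu
sim_x <- f_x(u, Sigma_uu);  x <- sim_x$x;  alpha <- sim_x$alpha
sim_y <- f_y(x, sim_x$Sigma_xx, u, Sigma_uu);  y <- sim_y$y;  beta <- sim_y$beta;  eta <- sim_y$eta
data_list_small <- list(u = u, L = L, D = D, x = x, N = N, y = y, sigma_uu = diag(L))
sampling_fit <- sampling(mod, data = data_list_small)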

No worries, you’ll figure it out :-) Don’t hesitate to ask.

Makes sense. x contains larger values than u, and this explains why the effect of \beta on y is larger than the effect of \eta on y. This solved my problem! Thank you so much :)