Learn parameters of a Gaussian linear model in Pearl's backdoor model

We have a model that consists of three variables U, X, and Y, with dimensions D \times L, D \times N, and D \times 1, respectively. Our model is as follows:

U := N_{U} \sim N(\mu, \Sigma_{UU})
X := U \alpha + N_{X}
Y := X \beta + U \eta + N_{Y}

Where N_X and N_Y have a Normal distribution with mean zero and identity covariance matrix. Hence,

U \sim N(\mu,\Sigma_{UU})
X \sim N(\mu \alpha , \Sigma_{XX}) = N(\mu \alpha, \alpha^T \Sigma_{UU} \alpha + I_{N_X N_X})
Y \sim N(\mu \alpha \beta + \mu \eta, \Sigma_{YY})

Where \Sigma_{YY} is,
\beta^T \Sigma_{XX}\beta + \beta^T \alpha^T \Sigma_{UU} \eta + \eta^T \Sigma_{UU} \alpha \beta + \eta^T \Sigma_{UU} \eta + \eta^T \mu^T \mu \eta + I_{N_Y N_Y}.
For simplicity we assume that \Sigma_{UU} is an identity matrix. The dimensions of \mu, \alpha, \beta and \eta are 1 \times L, L \times N, N \times 1, and L \times 1 respectively.

Our goal is to learn the parameters \mu, \alpha, \beta and \eta. I have created the following Stan model. My problem is that, using the optimizing method, I can learn all of the parameters correctly except \eta. Below I post my data-generation process in R and my Stan model. Any help would be greatly appreciated.

f_u, f_x, and f_y are the functions that generate the data u, x, and y together with \mu, \alpha, \beta and \eta (the run below uses L = 5, N = 8, and D = 1000).

library(rstan)  # for stan_model(), optimizing(), and sampling()

L <- 5; N <- 8; D <- 1000  # dimensions used by the generators (they match data_list below)

f_u <- function(){
  mu <- rnorm(L, 0, 10)
  Sigma_uu = diag(L)
  u <- matrix(0, nrow=D, ncol=L)
  for(i in 1:D){
    for(j in 1:L){
      u[i, j] <- rnorm(1, mu[j], sqrt(Sigma_uu[j,j]))
    }
  }
  return(list(u = u, mu = mu, Sigma_uu = Sigma_uu))
}
sim <- f_u()
mu <- sim$mu
u  <- sim$u
Sigma_uu <- sim$Sigma_uu

f_x <- function(u, Sigma_uu){
  alpha <- matrix(0, nrow = L, ncol = N)
  for(i in 1:L){
    for(j in 1:N){
      alpha[i, j] <- rnorm(1, 0, 10)
    }
  }
  linear_exp = u %*% alpha
  Sigma_xx = t(alpha) %*% Sigma_uu %*% alpha + diag(N)
  x <- matrix(0, nrow = D, ncol = N)
  for(i in 1:D){
    for(j in 1:N){
      x[i, j] <- rnorm(1, linear_exp[i,j],sqrt(Sigma_xx[j,j]))
    }
  }
  return(list(x = x, alpha = alpha, Sigma_xx = Sigma_xx))
}
sim_x <- f_x(u,Sigma_uu)
alpha <- sim_x$alpha
x <- sim_x$x
Sigma_xx <- sim_x$Sigma_xx

f_y <- function(x,Sigma_xx, u, Sigma_uu){
  beta <- matrix(0, nrow = N, ncol = 1)
  eta <- matrix(0, nrow = L, ncol = 1)
  for (i in 1:N) {
    beta[i,1] <- rnorm(1,0,10)
  }
  for (i in 1:L) {
    eta[i,1] <- rnorm(1,0,10)
  }

  linear_exp = x %*% beta + u %*% eta
  var_yy = t(beta) %*% Sigma_xx %*% beta + t(eta) %*% Sigma_uu %*% eta + 1
  std_y = sqrt(var_yy[1,1])
  y <- matrix(0, nrow = D, ncol = 1)
  for(i in 1:D){
    y[i, 1] <- rnorm(1, linear_exp[i,1],std_y)
  }
  return(list(y = y, beta = beta, eta = eta))
}
sim_y <- f_y(x, Sigma_xx, u, Sigma_uu)
beta <- sim_y$beta
eta <- sim_y$eta
y <- sim_y$y

The Stan model is:

model_str <- "
    data {
        int L;
        int D;
        int N;
        matrix[D, L] u;
        matrix[D, N] x;
        matrix[D, 1] y;
        matrix[L, L] sigma_uu;
    }
    parameters {
       vector[L] mu; 
       matrix[L, N] alpha;
       matrix[N, 1] beta;
       matrix[L, 1] eta;
    }
    transformed parameters {
       matrix[D, N] x_loc;
       matrix[D, 1] y_loc;
       vector[N] x_scale;
       vector[1] y_scale;
       x_scale = sqrt(diagonal(alpha'*sigma_uu*alpha) + rep_vector(1,N));
       y_scale = sqrt(diagonal(beta'*(alpha'*sigma_uu*alpha + diag_matrix(rep_vector(1, N)))*beta
                               + beta'*alpha'*sigma_uu*eta + eta'*sigma_uu*alpha*beta
                               + eta'*sigma_uu*eta + eta'*(mu'*mu)*eta)
                      + rep_vector(1, 1));
       for (i in 1:D){
           x_loc[i, ] = u[i, ] * alpha;
           y_loc[i, ] = x[i, ] * beta + u[i, ] * eta;
       }
    }
    model {
        target += normal_lpdf(mu | 0, 10);           // priors
        for (j in 1:L){
            target += normal_lpdf(alpha[j, ] | 0, 10);
            target += normal_lpdf(eta[j, ] | 0, 10);
        }
        for (j in 1:N){
            target += normal_lpdf(beta[j, ] | 0, 10);
        }
        for (i in 1:D){
             target += normal_lpdf(u[i, ] | mu, 1);      // likelihood
             target += normal_lpdf(x[i, ] | x_loc[i, ], x_scale);
             target += normal_lpdf(y[i, ] | y_loc[i, ], y_scale);
        }
    }
"

I compile and fit the model as follows:

mod <- stan_model(model_code = model_str)
data_list <- list(u=u, L=5, D=1000, x = x, N =8, y = y, sigma_uu = diag(L))
optim_fit <- optimizing(mod, data=data_list)

Let’s compare the estimates of parameters with their actual values:

optim_fit$par[1:5] #estimated mu
mu  #real mu
optim_fit$par[6:45] #estimated alpha
t(alpha) #real alpha
optim_fit$par[46:53] #estimated beta
t(beta) #real beta
optim_fit$par[54:58] #estimated eta
t(eta) #real eta
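
Equivalently - assuming the names in optim_fit$par follow rstan's usual "eta[1,1]"-style labels - the same estimates can be pulled out by name instead of by position:

optim_fit$par[grep("^eta\\[", names(optim_fit$par))]  #estimated eta, selected by name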

All the parameters are learned correctly except for \eta. I can't figure out why. Can anyone please help me identify the issue?

Welcome srtaheri to the Stan forum,

Y \sim N(\mu \alpha \beta + \mu \eta, \Sigma_{YY}) = N(\mu \alpha \beta + \mu \eta, \beta^T \Sigma_{XX}\beta + \eta^T \Sigma_{UU} \eta + 2 \eta^T \Sigma_{UU} \alpha \beta + I_{N_Y N_Y})

If \eta < 0 then 2 \eta^T \Sigma_{UU} \alpha \beta can become negative. Short question: is + 2 \eta^T \Sigma_{UU} \alpha \beta in the RHS for Y correct? (Is there an \eta missing?)

Thank you for your reply. I edited my \Sigma_{YY}, and also in the Stan model. It is now equal to,

\beta^T \Sigma_{XX}\beta + \beta^T \alpha^T \Sigma_{UU} \eta + \eta^T \Sigma_{UU} \alpha \beta + \eta^T \Sigma_{UU} \eta + \eta^T \mu^T \mu \eta + I_{N_Y N_Y}

But I still can't learn the parameter \eta correctly. Even if I put the exact correct std of y into my Stan model, instead of calculating y_scale, I still can't get the estimate of \eta correctly.

Y \sim N(\mu \alpha \beta + \mu \eta, \Sigma_{YY})

\mu \eta expands to \eta^T\mu^T\mu\eta + I_{N_Y N_Y}

but in the expansion of \mu\alpha\beta, why is there an \eta? This part is referring to X\beta?

Y = X \beta + U \eta + N_Y, so,

E(Y) = E(X) \beta + E(U) \eta = \mu \alpha \beta + \mu \eta

Sure, and the variance \Sigma_{YY}? See the first post where it's specified. The \mu\eta part is OK.

\Sigma_{YY} is equal to,

\beta^T \Sigma_{XX}\beta + \beta^T \alpha^T \Sigma_{UU} \eta + \eta^T \Sigma_{UU} \alpha \beta + \eta^T \Sigma_{UU} \eta + \eta^T \mu^T \mu \eta + I_{N_Y N_Y}

Am I missing something?

It's probably me. I don't understand the variance transformation. Specifically, I don't understand why \eta becomes a factor together with \alpha and \beta.

Y = X \beta + U \eta + N_Y = (U \alpha + N_X) \beta + U \eta + N_Y

so,

Y = U \alpha \beta + N_X \beta + U \eta + N_Y

now,

\Sigma_{YY} = E[(Y-E[Y])^T(Y-E[Y])] = E[(\beta^T \alpha^T U^T + \beta^T N^T_{X} + \eta^T U^T + N^T_{Y}- \beta^T \alpha^T \mu^T)(U\alpha \beta + N_X \beta + U \eta + N_Y - \mu \alpha \beta)]

If you work out the math, you will probably get something similar to what I got. But that is not my problem here. Let's assume that I take the real \Sigma_{YY} from my data-generating process and put the correct \Sigma_{YY} directly into my Stan model. Even in this case, I can't get the estimates of \eta correctly.

@martinmodrak maybe you can help here.

First, thanks for the tag. I don't see an obvious problem, but I also can't follow the math, sorry.

What I do in such circumstances is to work from a simpler model (and corresponding simulated data) - either move most parameters to data (i.e. provide true values) or even completely remove parts of the model until I find the simplest model that still manifests the issue and the most complex model that works as expected. Then it is usually much easier to guess what is going wrong.
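
For instance, here is a minimal sketch of the "move most parameters to data" idea (I have not run this): mu, alpha, and beta are passed in at their true simulated values, so eta is the only remaining parameter; the y_scale expression is taken unchanged from your model, and the names model_str_eta_only / fit_eta_only are just illustrative.

model_str_eta_only <- "
    data {
        int L;
        int D;
        int N;
        matrix[D, L] u;
        matrix[D, N] x;
        matrix[D, 1] y;
        matrix[L, L] sigma_uu;
        vector[L] mu;          // true simulated values supplied as data
        matrix[L, N] alpha;    // true simulated values supplied as data
        matrix[N, 1] beta;     // true simulated values supplied as data
    }
    parameters {
       matrix[L, 1] eta;
    }
    transformed parameters {
       matrix[D, 1] y_loc;
       vector[1] y_scale;
       y_scale = sqrt(diagonal(beta'*(alpha'*sigma_uu*alpha + diag_matrix(rep_vector(1, N)))*beta
                               + beta'*alpha'*sigma_uu*eta + eta'*sigma_uu*alpha*beta
                               + eta'*sigma_uu*eta + eta'*(mu'*mu)*eta)
                      + rep_vector(1, 1));
       for (i in 1:D){
           y_loc[i, ] = x[i, ] * beta + u[i, ] * eta;
       }
    }
    model {
        for (j in 1:L){
            target += normal_lpdf(eta[j, ] | 0, 10);
        }
        for (i in 1:D){
             target += normal_lpdf(y[i, ] | y_loc[i, ], y_scale);
        }
    }
"
fit_eta_only <- optimizing(stan_model(model_code = model_str_eta_only),
                           data = c(data_list, list(mu = mu, alpha = alpha, beta = beta)))
fit_eta_only$par  # compare against t(eta)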

Also, it is possible that the problem is the use of optimizing - maybe for this model the maximum a posteriori estimate for eta simply isn’t really representative of the posterior? Or maybe the posterior for eta is actually very wide (i.e. the data don’t really inform eta and we are basically left with the prior?)

I’ve tried running the model with sampling for D = 10 and this is actually what I observe, the results for eta are:

               mean   se_mean       sd      2.5%       25%        50%      75%    97.5%    n_eff      Rhat
eta[1,1] -0.4300312 0.1129946 9.775357 -19.08538 -7.136960 -0.3743708 6.346911 18.43398 7484.282 0.9998207
eta[2,1]  0.3988653 0.1310413 9.723525 -18.90360 -5.962849  0.4988552 6.976638 19.36335 5505.933 0.9996996
eta[3,1]  0.1182934 0.1199940 9.613023 -18.74033 -6.245467  0.1024488 6.606756 19.08571 6418.018 1.0003555
eta[4,1]  0.4920918 0.1169741 9.565062 -18.73486 -5.699214  0.5851166 6.727748 19.36741 6686.459 0.9995837
eta[5,1]  0.3872079 0.1216390 9.581643 -18.47652 -6.183869  0.4228076 6.814392 19.18640 6204.894 0.9995415

Which is basically what I would expect from the normal(0, 10) prior on eta. Investigating further, here’s my posterior for y_scale:

              mean  se_mean       sd     2.5%      25%      50%      75%    97.5%   n_eff     Rhat
y_scale[1] 916.504 3.715834 176.0317 586.9176 795.7389 909.1492 1028.136 1284.638 2244.24 1.001895

Which seems to explain the problem - y_scale comes out so big that y just can't inform y_loc and thus also eta.

Does that make sense?

Thanks for your answer. I reduced the variance for my parameters to get a smaller y_scale. I still have the same problem, but it gets better. Why does it only affect \eta and not \beta? I am also interested to know how you got those results for eta and y_scale. I am very new to Stan and don't know it well.

Good question. In your simulations, x contains much larger values than u, so the effect of \beta on y is larger than the effect of \eta.
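
A quick way to see this from your simulated objects (just a sketch):

summary(abs(as.vector(u)))  # entries of u are on the order of mu, i.e. roughly tens
summary(abs(as.vector(x)))  # entries of x are roughly u %*% alpha plus noise, i.e. much larger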

I used sampling (which as I said is preferable as this gives you the full Bayesian posterior) and then looked at posterior summaries:

sampling_fit <- sampling(mod, data=data_list)
summary(sampling_fit, pars = "eta")$summary
summary(sampling_fit, pars = "y_scale")$summary

however, sampling is much slower than optimizing (that’s why I ran with D = 10) so it might easily take hours with D = 1000.
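
If you want to reproduce it, something along these lines should work (a sketch that reuses your generating functions; data_list_small is an illustrative name):

D <- 10
sim <- f_u();  u <- sim$u;  mu <- sim$mu;  Sigma_uu <- sim$Sigma_uu
sim_x <- f_x(u, Sigma_uu);  x <- sim_x$x;  alpha <- sim_x$alpha
sim_y <- f_y(x, sim_x$Sigma_xx, u, Sigma_uu);  y <- sim_y$y;  beta <- sim_y$beta;  eta <- sim_y$eta
data_list_small <- list(u = u, L = L, D = D, x = x, N = N, y = y, sigma_uu = diag(L))
sampling_fit <- sampling(mod, data = data_list_small)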

No worries, you’ll figure it out :-) Don’t hesitate to ask.

Makes sense. x contains larger values than u, and this explains why the effect of \beta on y is larger than the effect of \eta on y. This solved my problem! Thank you so much :)