Latent instrument variable to control for endogeneity - getting error

Hello all, I am trying to implement a latent instrumental variable model based on Ebbes (2004).
I want to see whether I can recover an unbiased parameter estimate when there is endogeneity in the regression model.

The problem is I keep getting this error: Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: normal_lpdf: Location parameter[17] is nan, but must be finite! (in ‘string’, line 32, column 2 to column 28)

Here is my R code, which generates the data:

# Set the seed for reproducibility
set.seed(42)

beta0_true <- 1  # true intercept
beta1_true <- 2  # true coefficient for x
sigma_true <- 1  # true error standard deviation

# Simulate the endogenous independent variable x
N <- 400                          # number of observations
theta_true <- rbeta(N, 2, 5)      # true values for theta (latent instrument)
x <- rnorm(N, 5 * theta_true, 1)  # endogenous independent variable x

# Simulate the dependent variable y
v <- rnorm(N, 0, 1)               # endogenous part of x
mu <- beta0_true + beta1_true * x # linear predictor
y <- rnorm(N, mu, sigma_true)     # dependent variable y

model <- lm(y ~ x)
summary(model)

data <- list(N = N, x = x, y = y)
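
I then fit the model from R roughly like this ("latent_iv.stan" is just a placeholder name for wherever the Stan program below is saved):

library(rstan)

# Compile and sample; file name and sampler settings are placeholders
fit <- stan(file = "latent_iv.stan", data = data, chains = 1, seed = 42)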

And here is the Stan code:


data {
  int<lower=0> N;                // number of observations
  vector[N] y;                   // outcome variable
  vector[N] x;                   // endogenous variable
}

parameters {
  real<lower=0> alpha;            // concentration parameter for Dirichlet process
  real<lower=0> lambda;           // parameter for the baseline prior distribution
  simplex[N] theta;               // latent instrument variable
  real beta0;                    // intercept
  real beta1;                    // coefficient for x
  real<lower=0> sigma;            // error standard deviation
}

transformed parameters {
  vector[N] v;                   // endogenous part of x

  // Compute the endogenous part of x
  vector[N] theta_normalized = softmax(theta);
  v = theta_normalized .* sqrt(x) / sum(theta_normalized);
}

model {
  vector[N] mu;                  // linear predictor

  // Priors
  alpha ~ gamma(1, 1);           // prior for the concentration parameter
  lambda ~ gamma(1, 1);          // prior for the parameter of the baseline prior distribution
  theta ~ dirichlet(rep_vector(lambda, N));  // prior for the latent instrument from Dirichlet process
  beta0 ~ normal(0, 1);          // prior for the intercept
  beta1 ~ normal(0, 1);          // prior for the coefficient
  sigma ~ cauchy(0, 5);          // prior for the error standard deviation

  // Linear predictor
  mu = beta0 + x * beta1;

  // Likelihood
  y ~ normal(mu + v, sigma);
}

generated quantities {
  vector[N] v_generated;
  v_generated = softmax(theta) .* sqrt(x) / sum(softmax(theta));
}

The error message says line 32, but it's actually coming from line 39:

  y ~ normal(mu + v, sigma);

and the real error happens on line 21:

  v = theta_normalized .* sqrt(x) / sum(theta_normalized);

where you take the square root of x but don't ensure x is nonnegative. Your data simulation can produce negative values of x, and the square root of a negative number is NaN, which then propagates into the location parameter of the likelihood, so you have to deal with those values in some other way. I skimmed the Ebbes (2004) thesis but didn't see where you're getting the square root from, so I don't know exactly what to do about it.
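
You can check that quickly in R with the simulation settings from your post:

# theta_true ~ Beta(2, 5) keeps 5 * theta_true well below 5 most of the time,
# so a normal draw with sd 1 dips below zero fairly often
set.seed(42)
theta_true <- rbeta(400, 2, 5)
x <- rnorm(400, 5 * theta_true, 1)
sum(x < 0)      # typically a few dozen of the 400 draws are negative
sqrt(x[x < 0])  # all NaN, with a "NaNs produced" warning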

However, I do think that your code isn't quite right anyway. softmax() always sums to one, i.e. sum(theta_normalized) == 1, so there's no point dividing by it. Furthermore, the purpose of softmax() is usually to turn an unconstrained vector into a simplex, but in your model theta is already declared as a simplex. In other words, theta_normalized is redundant because theta is already normalized.
So the code should probably be simply

  v = theta .* sqrt(x);

Thanks, I have updated the code; please find it below. It runs now, but it is still not recovering the true parameters. Any suggestions @nhuurre @spinkney?

data {
  int<lower=0> N;       // number of observations
  vector[N] y;          // outcome variable
  vector[N] x;          // endogenous variable
}

parameters {
  real<lower=0> alpha;  // concentration parameter for Dirichlet process
  real<lower=0> lambda; // parameter for the baseline prior distribution
  simplex[N] theta;     // latent instrument variable
  real beta0;           // intercept
  real beta1;           // coefficient for x
  real<lower=0> sigma;  // error standard deviation
}

model {
  vector[N] mu;         // linear predictor

  // Priors
  alpha ~ gamma(1, 1);  // prior for the concentration parameter
  lambda ~ gamma(1, 1); // prior for the parameter of the baseline prior distribution
  theta ~ dirichlet(rep_vector(lambda, N)); // prior for the latent instrument
  beta0 ~ normal(0, 1); // prior for the intercept
  beta1 ~ normal(0, 1); // prior for the coefficient
  sigma ~ cauchy(0, 5); // prior for the error standard deviation

  // Linear predictor
  mu = beta0 + x * beta1;

  // Likelihood
  y ~ normal(mu + theta, sigma);
}

Whether the model recovers the true parameters depends on how closely the model follows the true data-generating process.
It's hard to say what the problem is without knowing how you obtained your true parameter values and the data.
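
One thing I notice in the R code you posted: v is drawn but never added to x or y, so x is not actually endogenous in that simulation. A quick sanity check is that plain OLS should then already be unbiased:

# v never enters the data-generating process, so there is no endogeneity
# and lm() should recover beta0 = 1 and beta1 = 2 up to sampling noise
coef(lm(y ~ x))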

Reading the Ebbes (2004) thesis, I don't think it's going to work well with Stan. The model in chapter 7 (which I believe is what you're going for) is built around a Dirichlet process.
Stan does not support discrete parameters, but a Dirichlet process needs discrete hidden levels. (You could try marginalizing them out, but that looks very messy.)
(Also, the code you've written doesn't implement a Dirichlet process; Stan's dirichlet() is the Dirichlet distribution, a related but very different concept. And you don't even use the so-called concentration parameter alpha.)


Thanks for the suggestion.
Dirichlet process! Okay! Is it something related to Stan's latent Dirichlet allocation? https://mc-stan.org/docs/stan-users-guide/latent-dirichlet-allocation.html

Yes, latent Dirichlet allocation is very much like a Dirichlet process. I believe a marginalized Dirichlet process looks like this:

data {
  int<lower=0> N; // number of observations
  int<lower=0> K; // number of levels
  vector[N] y; // outcome variable
  vector[N] x;
}
parameters {
  real<lower=0> alpha; // concentration parameter for Dirichlet process
  real<lower=0> lambda; // parameter for the baseline prior distribution
  simplex[K] p; // latent level probabilities
  vector[K] theta; // latent instrument variable
  real beta0; // intercept
  real beta1; // coefficient for x
  real<lower=0> sigma; // error standard deviation
}
model {
  alpha ~ gamma(1, 1);
  lambda ~ gamma(1, 1);
  p ~ dirichlet(rep_vector(alpha, K)); // K, not N: p is a simplex over the K levels
  theta ~ normal(0, lambda); // baseline distribution is normal
  beta0 ~ normal(0, 1);
  beta1 ~ normal(0, 1);
  sigma ~ cauchy(0, 5);
  vector[N] mu = beta0 + x * beta1;
  for (n in 1:N) {
    array[K] real gamma;
    for (k in 1:K) {
      gamma[k] = log(p[k]) + normal_lpdf(y[n]| mu[n] + theta[k], sigma);
    }
    target += log_sum_exp(gamma);  // likelihood
  }
}
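
To fit this you also need to choose the number of levels K up front in the data; for example, from R (K = 10 is an arbitrary choice):

# The marginalized model takes K in the data list; K = 10 is arbitrary
data <- list(N = N, K = 10, x = x, y = y)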

But that’s not quite the Ebbes model from chapter 7. Ebbes also has some kind of correlated errors. It might look something like this:

  for (n in 1:N) {
    array[K] real gamma;
    for (k in 1:K) {
      real v = x[n] - theta[k];
      gamma[k] = log(p[k]) + multi_normal_lpdf([y[n], v]| [mu[n], 0], Sigma);
      // Sigma is the covariance matrix for v and epsilon (equation 7.3)
    }
    target += log_sum_exp(gamma); // likelihood
  }

This is getting quite convoluted, and I expect these types of models to require a lot of further tuning for Stan to sample efficiently.


Thank you @nhuurre