We would like to estimate a non-linear regression model, e.g.:
\log y = f( x_1 , x_2, \beta ) + u,
where y, x_1, and x_2 are observed variables, \beta is a vector of unknown parameters (to be estimated), f( \cdot ) is a non-linear function, and u is a random error term (representing the influence of unobserved variables in the regression model) that can be assumed to follow a normal distribution with zero mean. However, our domain knowledge indicates that the error term u is very likely substantially correlated with the right-hand-side variables x_1 and x_2. Such a correlation results in severely biased and inconsistent estimates in linear regression models estimated by OLS, and we suspect that the same problem arises in non-linear regression models estimated by Bayesian methods.
Therefore, we considered the following specification of the regression model, which aligns with what we assume the ‘true’ data-generating process could be:
\log y = f( x_1 \cdot e^{u_1}, x_2 \cdot e^{u_2}, \beta ),
where e is Euler's number and u_1 and u_2 are two random error terms that can be assumed to follow a bivariate normal distribution. Two approaches for estimating this model specification came to our mind:
- Modelling u_1 and u_2 as latent variables that follow a bivariate normal distribution with both means equal to zero, and adding a third error term u to the regression equation (as in the first equation above) that is assumed to follow a normal distribution with zero mean and a tiny standard deviation, so that this third error term is always approximately zero.
- Solving the regression equation for u_2, i.e., u_2 = g( y, x_1, x_2, u_1, \beta ), modelling u_1 as a latent variable, calculating u_2 as g(\cdot), and calculating the value of the (log-)likelihood function by applying the density function of a bivariate normal distribution to u_1 and u_2.
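For concreteness, the first approach could be sketched numerically as follows. Here, f(a, b, \beta) = \beta_0 + \beta_1 \log(a + b) is only a hypothetical stand-in for our actual f, and sd_tiny is the small standard deviation of the third error term, so this is an illustration under these assumptions rather than our actual model:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Hypothetical stand-in for the non-linear function f (for
# illustration only): f(a, b, beta) = beta_0 + beta_1 * log(a + b).
def f(a, b, beta):
    return beta[0] + beta[1] * np.log(a + b)

def log_density_approach1(y, x1, x2, u1, u2, beta, Sigma, sd_tiny=1e-3):
    """Joint log density for the first approach: u1 and u2 are latent
    variables with a bivariate normal density (zero means, covariance
    Sigma), and the third error term u = log y - f(...) gets a normal
    density with a tiny standard deviation sd_tiny, which keeps it
    approximately zero."""
    u = np.log(y) - f(x1 * np.exp(u1), x2 * np.exp(u2), beta)
    lp_u = norm(0.0, sd_tiny).logpdf(u).sum()
    lp_u1u2 = multivariate_normal([0.0, 0.0], Sigma).logpdf(
        np.column_stack([u1, u2])).sum()
    return lp_u + lp_u1u2
```

In an MCMC sampler, u_1 and u_2 would be sampled alongside \beta and the covariance matrix, while the tiny standard deviation of the third error term effectively switches the additive error off.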
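The second approach could be sketched like this, again with the hypothetical stand-in f(a, b, \beta) = \beta_0 + \beta_1 \log(a + b), which can be solved analytically for u_2; note that this sketch assumes x_2 > 0 and a positive term inside the outer logarithm, so our actual f and the handling of observations with x_2 = 0 would differ:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical stand-in for f (illustration only), so that
#   log y = beta_0 + beta_1 * log(x1*exp(u1) + x2*exp(u2)).

def g(y, x1, x2, u1, beta):
    """Solve the regression equation for u_2 given u_1 (requires
    x2 > 0 and a positive term inside the outer logarithm)."""
    inner = np.exp((np.log(y) - beta[0]) / beta[1]) - x1 * np.exp(u1)
    return np.log(inner / x2)

def loglik_approach2(y, x1, x2, u1, beta, Sigma):
    """Log-likelihood value for the second approach: u1 is treated as
    a latent variable, u2 is computed by inverting the regression
    equation, and the bivariate normal density (zero means, covariance
    Sigma) is applied to (u1, u2)."""
    u2 = g(y, x1, x2, u1, beta)
    return multivariate_normal([0.0, 0.0], Sigma).logpdf(
        np.column_stack([u1, u2])).sum()
```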
What do you think about these two approaches? Do you have any concerns? Do you have suggestions for other approaches to estimating a single regression equation with two or more error terms, or for other ways to address a potential correlation between the (unknown) error term and the right-hand-side variables?
Note: y is always strictly positive, x_1 and x_2 are always non-negative, and x_1 + x_2 is always strictly positive. There is a notable number of observations with x_1 zero or very small and a notable number of observations with x_2 zero or very small, so that both the variance of u_1 and the variance of u_2 should be (statistically) identifiable.