Repeat values in input variable of regression?

Hello everyone,

I have a question about fitting a non-linear regression model with Stan when the input variables contain repeated values.


I have an observed variable y, which depends on some predictors, let’s call the variables x1, x2, and x3. The regression model is as follows:


data {
  int<lower=0> N;        // number of observations
  int<lower=0> k1;       // number of columns in x1
  int<lower=0> k2;       // number of columns in x2
  vector[N] y;
  vector[N] x3;
  matrix[N, k1] x1;
  matrix[N, k2] x2;
}
parameters {
  vector[k1] beta_1;
  vector[k2] beta_2;
  real<lower=0> sigma;   // a standard deviation must be positive
}
model {
  vector[N] mu;
  vector[N] reg_1;
  vector[N] reg_2;
  reg_1 = x1 * beta_1;
  reg_2 = x2 * beta_2;
  mu = (exp(reg_1) .* exp(reg_2) .* x3) ./ (rep_vector(1, N) + (exp(reg_1) .* x3));
  // Some priors here
  y ~ normal(mu, sigma);
}

However, there are multiple values of x3 observed for any given value of x1 and x2. The context is that x1 and x2 describe a material, while x3 describes measurement conditions, and the combination of the three can predict an outcome y.

What I am worried about is the error estimation of the parameters. Let’s say I only have 4 distinct values of x1 and x2 (i.e., I have 4 materials), but I measured each of them at 4 different conditions, so I have 16 values of x3 and y.

My question is: will the repeated values of x1 and x2 cause the posterior draws of the beta parameters from Stan to be overly confident/narrow? And if so, how can I rectify that? I was planning on making a hierarchical version of this model next, so I would like to know if that would help.

I’m sorry for the very naive question. I ask because my real data has ~3900 observations of y and x3 but only ~70 distinct observations of x1 and x2. The model fit successfully, but the resulting estimates seem to have very little uncertainty even though I used only a weakly informative prior. In fact, plotting the posterior draws with mcmc_intervals appears to show just a dot. This made me suspicious of the results.

Any help would be very appreciated.

Nope! Repeated observations of predictors should, and do, yield appropriately more certainty in the inference on their effects.
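
To make that concrete, here is a minimal sketch in Python using a deliberately simplified analog, not the model from the question: an ordinary linear regression y = a + b*x + noise with independent noise of known standard deviation. The function name `slope_standard_error` and the example values are made up for illustration. Under those assumptions, the standard error of the slope is sigma / sqrt(sum((x - mean(x))^2)), so measuring the same 4 predictor values 4 times each halves it.

```python
import math


def slope_standard_error(x, sigma):
    """Standard error of the OLS slope for y = a + b*x + noise.

    Assumes independent noise with known standard deviation sigma
    (a simplifying assumption for this illustration).
    """
    n = len(x)
    x_mean = sum(x) / n
    # Sum of squared deviations of the predictor from its mean.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sigma / math.sqrt(sxx)


# Hypothetical predictor values: 4 distinct "materials".
distinct_x = [0.0, 1.0, 2.0, 3.0]

# Each distinct value measured once versus 4 times.
se_single = slope_standard_error(distinct_x, sigma=1.0)
se_repeated = slope_standard_error(distinct_x * 4, sigma=1.0)

# Repeating every predictor value 4 times multiplies sxx by 4,
# so the standard error shrinks by a factor of sqrt(4) = 2.
ratio = se_repeated / se_single
print(ratio)
```

The narrowing is legitimate as long as the repeated measurements really do carry independent noise, which is what the likelihood `y ~ normal(mu, sigma)` in the posted model states.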


One way to think about this is that, for an exponential family, the log-likelihood is an affine function of the sum of the sufficient statistics of the individual observations. Specifically, additional observations directly affect the score function and thus change the sensitivity of the log-likelihood with respect to the parameters.
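
Written out, this argument might look as follows. This is a sketch under assumptions that are not in the original post: n i.i.d. observations and the natural parameterization, with h, T, and A denoting the base measure, sufficient statistic, and log-partition function. The non-linear regression in the question is not literally of this form, so take it as intuition only.

```latex
\begin{align*}
p(y \mid \eta) &= h(y)\,\exp\!\bigl(\eta^{\top} T(y) - A(\eta)\bigr) \\
\ell(\eta) = \log p(y_{1:n} \mid \eta)
  &= \sum_{i=1}^{n} \log h(y_i) + \eta^{\top} \sum_{i=1}^{n} T(y_i) - n\,A(\eta) \\
\nabla_{\eta}\,\ell(\eta) &= \sum_{i=1}^{n} T(y_i) - n\,\nabla A(\eta) \\
-\nabla^{2}_{\eta}\,\ell(\eta) &= n\,\nabla^{2} A(\eta)
\end{align*}
```

For fixed \eta, the log-likelihood is affine in \sum_i T(y_i), and its curvature grows linearly with n, which is why more observations give a more sharply peaked likelihood.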


Thank you very much for the answers and explanations! I think I understand it a little bit better now.