How do I stop a parameter from absorbing the effect of a missing covariate?

I am simulating two nested models, and I want to show that when I generate data according to the covariate model, the covariate model performs better in model comparison. However, I notice that the estimates of beta from my baseline model absorb the missing type effects. The beta estimates end up quite far from the true beta parameters, yet they still imply similar expected counts, so the model comparison finds little difference most of the time. How can I stop beta in my baseline model from "stealing" the effect of the missing type parameter?

Maybe I need to introduce hierarchical priors or something? To make my setup clear:

Theta represents an individual-level effect (person ability, for example).
Beta represents the difficulties of the items.
Type represents the effect of item type (for example multiple choice, open ended, etc.). Each item has exactly one type, and it keeps that type for every person. For example, I could have 40 items and 4 types, giving 10 items per type.
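
(So the covariate model's log rate for observation n is theta[jj[n]] + beta[ii[n]] + type[kk[n]], and the baseline model is identical except that the type term is dropped; see the Stan code below.)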

So maybe there is some nesting structure going on that I am not accounting for? Since each item has exactly one type, type[kk] is a deterministic function of ii, which means the baseline model can reproduce the covariate model's likelihood just by shifting each item's beta. And indeed, the results indicate that beta in the baseline model absorbs essentially 100% of the type effect on top of its own item effect.
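
To see this concretely, I compare the baseline posterior means of beta against the shifted truth beta_i + type of item i, rather than against beta_i (a minimal sketch using the objects from my simulation code below; the centering is because the model constrains beta to sum to zero):

item_type <- kk[match(1:I, ii)]        # each item's (fixed) type
beta_star <- beta + type[item_type]    # item effect the baseline likelihood identifies
beta_hat  <- colMeans(rstan::extract(fit_baseline, pars = "beta")$beta)
round(cbind(beta, beta_star = beta_star - mean(beta_star), beta_hat), 2)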

library(rstan)

model_baseline <- stan_model(model_code = "data {
  int<lower=0> I;                  // number of items
  int<lower=0> J;                  // number of persons
  int<lower=1> N;                  // number of observations (I * J)
  int<lower=1,upper=I> ii[N];      // item index per observation
  int<lower=1,upper=J> jj[N];      // person index per observation
  int<lower=0> y[N];               // counts
}
parameters {
  vector[J] theta;                 // person effects
  vector[I-1] b_free;              // free item effects
}
transformed parameters {
  vector[I] beta = append_row(b_free, -sum(b_free));  // sum-to-zero item effects
}
model {
  theta ~ normal(0, 1);
  target += normal_lpdf(beta | 0, 1);  // prior on the full constrained vector

  y ~ poisson_log(theta[jj] + beta[ii]);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    // poisson_log_lpmf avoids exp() overflow for large linear predictors
    log_lik[n] = poisson_log_lpmf(y[n] | theta[jj[n]] + beta[ii[n]]);
  }
}
")

model_covariate <- stan_model(model_code = "data {
  int<lower=0> I;                  // number of items
  int<lower=0> J;                  // number of persons
  int<lower=0> K;                  // number of item types
  int<lower=1> N;                  // number of observations
  int<lower=1,upper=I> ii[N];      // item index per observation
  int<lower=1,upper=J> jj[N];      // person index per observation
  int<lower=1,upper=K> kk[N];      // type index per observation
  int<lower=0> y[N];               // counts
}
parameters {
  vector[J] theta;                 // person effects
  vector[I-1] b_free;              // free item effects
  vector[K-1] k_free;              // free type effects
}
transformed parameters {
  vector[I] beta = append_row(b_free, -sum(b_free));  // sum-to-zero item effects
  vector[K] type = append_row(k_free, -sum(k_free));  // sum-to-zero type effects
}
model {
  theta ~ normal(0, 1);

  target += normal_lpdf(beta | 0, 1);  // priors on the full constrained vectors
  target += normal_lpdf(type | 0, 1);

  y ~ poisson_log(theta[jj] + beta[ii] + type[kk]);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    // poisson_log_lpmf avoids exp() overflow for large linear predictors
    log_lik[n] = poisson_log_lpmf(y[n] | theta[jj[n]] + beta[ii[n]] + type[kk[n]]);
  }
}
")

My data-generation code, in case it's needed:

simulation_conditions <- list(c(100, 40), c(200, 40))  # (J, I) pairs

for(condition in simulation_conditions) {
  J <- condition[1]  # persons
  I <- condition[2]  # items
  K <- 4             # item types

  # create the results frame once per condition so replications accumulate
  loo_results <- data.frame(replication = integer(),
                            model_base_loo = numeric(),
                            model_base_waic = numeric(),
                            model_cov_loo = numeric(),
                            model_cov_waic = numeric())

  for(x in 1:nrep){  # nrep (number of replications) is set earlier, not shown

    seed <- x*36
    set.seed(seed)

    results_base <- list(rep = x)
    results_cov  <- list(rep = x)

    print(paste("Working on RPCM vs MFRPCM: condition (J,I): (", J, ",", I, ") rep: ",
           x, " started at: ", format(Sys.time(), "%Y-%m-%d %H:%M:%S")))
    start_time <- Sys.time()

    theta <- rnorm(J, 0, 1)

    # beta <- runif(I, .5, 1.5)
    beta <- rnorm(I, 0, 1)

    # fixed type effects; the last element enforces the same sum-to-zero
    # constraint the model uses (note 1:(K-1), not 1:K-1)
    type <- c(-3.9, 4.3, -4.6, 4.2)
    type[K] <- -sum(type[1:(K - 1)])

    N <- I*J
    ii <- rep(1:I, times = J)      # item index
    jj <- rep(1:J, each = I)       # person index
    kk <- rep(1:K, times = N/K)    # types cycle over items, so each item keeps one type

    y <- numeric(N)
    for(n in 1:N){
      y[n] <- rpois(1, exp(theta[jj[n]] + beta[ii[n]] + type[kk[n]]))
    }
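
    # sanity check: each item maps to exactly one type (type is nested within
    # items), matching the setup described above
    stopifnot(all(tapply(kk, ii, function(z) length(unique(z))) == 1))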
    
    # data generated according to the covariate model
    
    fit_covariate <- sampling(model_covariate, data = list(I = I,
                                                           J = J,
                                                           K = K,
                                                           N = N,
                                                           ii = ii,
                                                           jj = jj,
                                                           kk = kk,
                                                           y = y), 
                              iter = 4000, control = list(max_treedepth = 12))
    
      
    fit_baseline <- sampling(model_baseline, data = list(I = I,
                                                         J = J,
                                                         N = N,
                                                         ii = ii,
                                                         jj = jj,
                                                         y = y),
                             iter = 4000, control = list(max_treedepth = 12))
etc.
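
The rest of the loop fills loo_results; roughly this (a sketch with the loo package, paraphrasing the code I cut for brevity):

    ll_base <- loo::extract_log_lik(fit_baseline, merge_chains = FALSE)
    ll_cov  <- loo::extract_log_lik(fit_covariate, merge_chains = FALSE)

    loo_base  <- loo::loo(ll_base, r_eff = loo::relative_eff(exp(ll_base)))
    loo_cov   <- loo::loo(ll_cov,  r_eff = loo::relative_eff(exp(ll_cov)))
    waic_base <- loo::waic(ll_base)
    waic_cov  <- loo::waic(ll_cov)

    loo_results <- rbind(loo_results,
                         data.frame(replication = x,
                                    model_base_loo  = loo_base$estimates["elpd_loo", "Estimate"],
                                    model_base_waic = waic_base$estimates["waic", "Estimate"],
                                    model_cov_loo   = loo_cov$estimates["elpd_loo", "Estimate"],
                                    model_cov_waic  = waic_cov$estimates["waic", "Estimate"]))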