Multilevel Logit with Few Observations per Individual

Hello!

I’ve been simulating data and attempting to recover the generating parameters with rstan, and have run into some issues I don’t quite understand. In particular, I’ve set up a logit choice problem in which N individuals face a decision between the same two goods (call them Mug and Pen) D times. To keep things simple, I assume I know the underlying values of the Mug and the Pen (they are the same for all N individuals), and that the decision between the two goods is a function of an individual parameter \lambda_i \sim \text{LogNormal}(\bar{\lambda}, \sigma_{\lambda}). My goal is to recover the parameters of the LogNormal distribution from the data.

The synthetic data generation is shown below. It’s a bit more involved than the paragraph above because the valuations of the Mug and Pen are assumed to follow the KR (Kőszegi–Rabin) model of reference-dependent preferences, but ultimately \lambda_i and the logit noise are the only unknowns that affect an individual’s choice between the two goods.
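Spelling out what the code below implements: for an individual endowed with the Mug, the two relevant (CPE) utilities are

\[ U(\text{Mug} \mid \text{Mug}) = u_{mug}, \qquad U(\text{Pen} \mid \text{Pen}) = q\,u_{pen} + (1-q)\,u_{mug} + q(1-q)(1-\lambda_i)(u_{mug} + u_{pen}), \]

with the Pen endowment handled symmetrically; the individual then prefers the Mug whenever the utility difference exceeds a standard Logistic draw.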

library(tidyverse)  # tibble, dplyr, purrr, etc.

set.seed(19)
# Synthetic data to explore prior implications.
gen_indiv_synth_exchange_data <- function(synth_N, D) {
  # One row per individual: endowment, individual lambda, and the
  # KR (CPE) utilities of keeping vs. attempting to exchange.
  synth_exchange <- tibble(id = seq_len(synth_N),
                           synth_mug_u = 5,
                           synth_pen_u = 4.4,
                           endowment = sample(c(1, 2), synth_N, replace = TRUE),
                           lambda = rlnorm(synth_N, 0.17, exp(-0.62)),
                           q = sample(c(0.1, 0.9), synth_N, replace = TRUE),
                           pen_utils = (endowment == 2) * synth_pen_u +
                             (endowment == 1) *
                             (q * synth_pen_u + (1 - q) * synth_mug_u +
                                q * (1 - q) * (1 - lambda) * (synth_mug_u + synth_pen_u)),
                           mug_utils = (endowment == 1) * synth_mug_u +
                             (endowment == 2) *
                             (q * synth_mug_u + (1 - q) * synth_pen_u +
                                q * (1 - q) * (1 - lambda) * (synth_mug_u + synth_pen_u)))

  # One row per (individual, decision), each with its own logistic draw.
  id_noise_grid <- expand.grid(id = seq_len(synth_N),
                               decision_num = seq_len(D)) %>%
    mutate(logit_noise = rlogis(n()))

  # An exchange occurs when the preferred good differs from the endowment.
  indiv_synth_exchange <- id_noise_grid %>%
    inner_join(synth_exchange, by = "id") %>%
    as_tibble() %>%
    mutate(prefer_mug = mug_utils - pen_utils > logit_noise,
           exchange = prefer_mug * (endowment == 2) +
             (1 - prefer_mug) * (endowment == 1))

  return(indiv_synth_exchange)
}

gen_agg_synth_exchange_data <- function(indiv_tib) {
  # Collapse the D decisions per individual into binomial counts.
  synth_agg_data_tib <- indiv_tib %>%
    group_by(id) %>%
    summarize(group_size = n(),
              group_exchange = sum(exchange),
              group_id = mean(id),
              mean_exchange = mean(exchange),
              endowment = mean(endowment),
              q = mean(q)) %>%
    ungroup()

  return(synth_agg_data_tib)
}

format_synth_exchange <- function(data_in) {
  synth_agg_dat <- list(
    G = nrow(data_in),
    utils_mug = 5,
    utils_pen = 4.4,
    exchange = data_in$group_exchange,
    endowment = data_in$endowment,
    q = data_in$q,
    group_size = data_in$group_size
  )
  return(synth_agg_dat)
}
# Number of choices per individual.
d_options <- c(1, 2, 3, 4, 10, 20, 50, 100)
# List of data sets when N = 100.
synth_indiv_tib_100 <- map(.x = d_options, .f = gen_indiv_synth_exchange_data,
                           synth_N = 100)
synth_agg_tib_100 <- map(.x = synth_indiv_tib_100, .f = gen_agg_synth_exchange_data)
synth_stan_dat_100 <- map(.x = synth_agg_tib_100, .f = format_synth_exchange)

# List of data sets when N = 1000.
synth_indiv_tib_1000 <- map(.x = d_options, .f = gen_indiv_synth_exchange_data,
                            synth_N = 1000)
synth_agg_tib_1000 <- map(.x = synth_indiv_tib_1000, .f = gen_agg_synth_exchange_data)
synth_stan_dat_1000 <- map(.x = synth_agg_tib_1000, .f = format_synth_exchange)
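As a quick sanity check on the simulated data, I look at one of the aggregated sets (the fifth element of d_options corresponds to D = 10):

# group_size should equal D (10 here), and the average exchange rate
# should sit strictly inside (0, 1) because of the logit noise.
synth_agg_tib_100[[5]] %>%
  summarize(mean_group_size = mean(group_size),
            mean_exchange_rate = mean(mean_exchange))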

The Stan model I use to estimate the distribution of the parameters is below. It explicitly defines the value of the Mug and the Pen as a function of the unknown individual \lambda_i, and accounts for decision noise by modeling the probability of exchange as inv_logit applied to the utility difference between the non-endowed and the endowed good. Because my primary interest is in recovering \bar{\lambda} and \sigma_{\lambda}, I set the model up as a multilevel model in which each individual’s \lambda_i is drawn from this LogNormal population distribution.

data{
  int G;
  int group_size[G];
  int exchange[G];
  int endowment[G];
  real q[G];
  real utils_mug;
  real utils_pen;
}
parameters{
  vector<lower=0, upper=5>[G] lambda;
  real<lower=-0.15, upper=0.7> lambda_bar;
  real<lower=-2, upper=-0.3> log_sig_lambda;
}

model{
  vector[G] kr_utils_mug;
  vector[G] kr_utils_pen;
  vector[G] p;

  log_sig_lambda ~ normal(-0.55, 0.125);
  lambda_bar ~ normal(0.25, 0.125);
  lambda ~ lognormal(lambda_bar, exp(log_sig_lambda));
  for (i in 1:G) {
    // CPE version of utility given an endowed Mug
    if (endowment[i] == 1) {
      // U(Mug | Mug)
      kr_utils_mug[i] = utils_mug;
      // U(Attempt Pen | Attempt Pen)
      kr_utils_pen[i] = q[i] * utils_pen + (1 - q[i]) * utils_mug +
        q[i] * (1 - q[i]) * (1 - lambda[i]) * (utils_mug + utils_pen);
      // Probability of exchange
      p[i] = inv_logit(kr_utils_pen[i] - kr_utils_mug[i]);
    }
    // CPE version of utility given an endowed Pen
    if (endowment[i] == 2) {
      // U(Attempt Mug | Attempt Mug)
      kr_utils_mug[i] = q[i] * utils_mug + (1 - q[i]) * utils_pen +
        q[i] * (1 - q[i]) * (1 - lambda[i]) * (utils_mug + utils_pen);
      // U(Pen | Pen)
      kr_utils_pen[i] = utils_pen;
      // Probability of exchange
      p[i] = inv_logit(kr_utils_mug[i] - kr_utils_pen[i]);
    }
    //target += (exchange[i] * log(p[i])) + ((1 - exchange[i]) * log(1-p[i]));
  }
  exchange ~ binomial(group_size, p);
}

I run the sampler via the R code below (note: I fit the models in parallel with furrr::future_map).
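For completeness, here is the setup the fitting code assumes: the Stan program above compiled once (from a local file; the name exchange_agg_hier.stan is just my label) and a parallel plan for furrr.

library(rstan)
library(furrr)
library(future)

# Compile the Stan program once; future_map then fits the eight
# data sets (one per value of D) in parallel sessions.
exchange_agg_hier <- stan_model("exchange_agg_hier.stan")
plan(multisession)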

n100_fit <- future_map(.x = synth_stan_dat_100, 
                       .f = ~sampling(object = exchange_agg_hier, data = .x, 
                                      chains = 1, warmup = 2000, iter = 7000))

n1000_fit <- future_map(.x = synth_stan_dat_1000,
                        .f = ~sampling(object = exchange_agg_hier, data = .x,
                                       chains = 1, warmup = 5000, iter = 15000,
                                       control = list(stepsize = 0.01,
                                                      adapt_delta = 0.99,
                                                      max_treedepth = 18)))
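
To judge recovery, I compare the posterior summaries of the hyperparameters against the generating values in rlnorm() above, i.e. \bar{\lambda} = 0.17 and \log\sigma_{\lambda} = -0.62:

# Population-level posterior summaries for the D = 1 fit; the true
# values used in the simulation are 0.17 and -0.62 respectively.
print(n100_fit[[1]], pars = c("lambda_bar", "log_sig_lambda"))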

Given this setup, my questions are:

  1. Is it appropriate to use multilevel models to recover the latent \lambda population distribution?
  2. Empirically, the posterior on \bar{\lambda} is quite good even when D = 1, i.e., when I observe only one choice per individual. Is this a lucky coincidence, or can I actually hope to recover the population distribution with one choice per individual (where I can’t hope to recover the individual \lambda_i)?
  3. I run into a “BFMI low” warning when I increase the sample size from N = 100 to N = 1000 (particularly at D = 1, 2, and 3). The E-FMI is around 0.15 according to check_energy(). Why would that be?
  4. I’m also not entirely sure how to construct the pairs() plot, which I think is the suggested diagnostic for the BFMI warning, since there are so many \lambda_i in the model.

Any help would be greatly appreciated, thanks!

Sorry for once again missing your question. A few things to note:

  • You seem to require a large adapt_delta and max_treedepth, presumably to avoid divergences. If I understand you correctly, you then get rid of the divergences but get BFMI warnings instead. This is likely just a different manifestation of the same problem; for diagnostic purposes it is usually better to run with the default settings, as that shows you more of the problems.

Regarding your question 1: I don’t think the multilevel setup is fundamentally problematic, but it can definitely be a problem in practice in some specific settings.

Regarding question 2: once again, this is not something that is IMHO problematic in principle, but it might be a problem in practice.

Regarding question 3: as I said above, I believe the actual problem is not BFMI, so I would not fixate too much on it. (I admit I do not understand BFMI much, so I don’t want to speculate.)

Regarding question 4: I would just look at the pairs plot for a bunch of different subsets of the variables, e.g. as in the sketch below. Note also the additional debugging strategies in Divergent transitions - a primer. I would start by making sure you can fit a simpler (e.g. non-hierarchical) version of the model without ramping up adapt_delta or max_treedepth.
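A minimal sketch, assuming the fits are stored in the lists from your post (adjust the subsets as needed):

# Pairs plot over the population-level parameters, a few individual
# lambdas, and the log posterior; n1000_fit[[1]] is the D = 1 fit.
pairs(n1000_fit[[1]],
      pars = c("lambda_bar", "log_sig_lambda",
               "lambda[1]", "lambda[2]", "lambda[3]", "lp__"))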

Additionally:
The hard lower and upper bounds on lambda, lambda_bar, and log_sig_lambda are suspicious. Why do you believe such hard bounds are sensible a priori?

Best of luck with the model!