Why does changing reference level of a factor change estimates (no intercept, no interactions)?

We are trying to achieve reproducible results between two programmers in our analysis. The only difference, as far as we can tell, between the two analyses is the reference level of the factor covariate. The two sets of code use the same model, dataset (except for the factor reference level), and the same random seed. While the results are similar, I expect them to be exactly the same.

The data cannot be shared, but here is a reproducible example using mtcars:

library(tidyverse)
library(rstanarm)
#> Loading required package: Rcpp
#> This is rstanarm version 2.32.1
#> - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
#> - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#>   options(mc.cores = parallel::detectCores())

options(mc.cores = parallel::detectCores())

df_3_ref <- mtcars %>%
  mutate(gear = factor(gear) %>% fct_relevel("3", "4", "5"))

df_5_ref <- mtcars %>%
  mutate(gear = factor(gear) %>% fct_relevel("5", "4", "3"))

f_3_ref <- stan_glm(
  mpg ~ -1 + gear,
  family = gaussian(),
  prior = normal(0, 5, autoscale = FALSE),
  seed = 123,
  data = df_3_ref
)

f_5_ref <- stan_glm(
  mpg ~ -1 + gear,
  family = gaussian(),
  prior = normal(0, 5, autoscale = FALSE),
  seed = 123,
  data = df_5_ref
)

summary(f_3_ref)
#> 
#> Model Info:
#>  function:     stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ -1 + gear
#>  algorithm:    sampling
#>  sample:       4000 (posterior sample size)
#>  priors:       see help('prior_summary')
#>  observations: 32
#>  predictors:   3
#> 
#> Estimates:
#>         mean   sd   10%   50%   90%
#> gear3 14.9    1.4 13.1  14.9  16.6 
#> gear4 22.3    1.6 20.2  22.4  24.3 
#> gear5 17.3    2.4 14.2  17.5  20.3 
#> sigma  5.4    0.9  4.4   5.3   6.6 
#> 
#> Fit Diagnostics:
#>            mean   sd   10%   50%   90%
#> mean_PPD 18.0    1.5 16.2  18.1  19.8 
#> 
#> The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
#> 
#> MCMC diagnostics
#>               mcse Rhat n_eff
#> gear3         0.0  1.0  2774 
#> gear4         0.0  1.0  2241 
#> gear5         0.1  1.0  2291 
#> sigma         0.0  1.0  1813 
#> mean_PPD      0.0  1.0  2656 
#> log-posterior 0.0  1.0  1330 
#> 
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(f_5_ref)
#> 
#> Model Info:
#>  function:     stan_glm
#>  family:       gaussian [identity]
#>  formula:      mpg ~ -1 + gear
#>  algorithm:    sampling
#>  sample:       4000 (posterior sample size)
#>  priors:       see help('prior_summary')
#>  observations: 32
#>  predictors:   3
#> 
#> Estimates:
#>         mean   sd   10%   50%   90%
#> gear5 17.3    2.5 14.1  17.4  20.4 
#> gear4 22.3    1.6 20.2  22.4  24.3 
#> gear3 14.9    1.4 13.1  15.0  16.7 
#> sigma  5.4    0.9  4.4   5.3   6.5 
#> 
#> Fit Diagnostics:
#>            mean   sd   10%   50%   90%
#> mean_PPD 18.1    1.4 16.2  18.2  19.8 
#> 
#> The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
#> 
#> MCMC diagnostics
#>               mcse Rhat n_eff
#> gear5         0.0  1.0  3086 
#> gear4         0.0  1.0  2467 
#> gear3         0.0  1.0  3683 
#> sigma         0.0  1.0  1965 
#> mean_PPD      0.0  1.0  2893 
#> log-posterior 0.0  1.0  1566 
#> 
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Created on 2024-04-25 with reprex v2.1.0```

As you can see, the posterior results are similar, but not exactly the same.

Why?

  • Operating System: macOS Ventura 13.2
  • rstanarm Version: 2.32.1

One possible source of variability is the initial values for starting each Markov chain. The seed guarantees that the initial values are the same for each model. However, the order of the coefficients is reversed in f_5_ref relative to f_3_ref (as shown in your model summary output).

So I think what may be happening is that the initial values for gear3 and gear5 are swapped between the two model fits, which presumably results in different trajectories for the chains for f_5_ref relative to f_3_ref, and therefore non-identical results.

You can see the initial values with rstan::get_inits(). Below I’ve extracted the initial values for the three beta coefficients (the gear coefficients) in each model and you can see they’re the same. However, due to the reverse ordering of the coefficients, in f_3_ref the first value for each chain is presumably applied to gear3, while in f_5_ref the first value is presumably applied to gear5 (and similarly for the third value).

library(tidyverse)
library(rstan)

lst(f_3_ref, f_5_ref) %>% 
  map(~get_inits(.x$stanfit) %>% map(~.x$beta))

$f_3_ref
$f_3_ref[[1]]
[1]  3.3701735  3.6394912 -0.7546425

$f_3_ref[[2]]
[1] -2.105675 -5.749702 -2.355853

$f_3_ref[[3]]
[1]  4.662232 -4.047375 -5.333816

$f_3_ref[[4]]
[1]  5.215248  5.281696 -5.188157


$f_5_ref
$f_5_ref[[1]]
[1]  3.3701735  3.6394912 -0.7546425

$f_5_ref[[2]]
[1] -2.105675 -5.749702 -2.355853

$f_5_ref[[3]]
[1]  4.662232 -4.047375 -5.333816

$f_5_ref[[4]]
[1]  5.215248  5.281696 -5.188157

I thought I could check my hypothesis by looking at the first warmup draw for each chain in each model. However, it looks like stan_glm does not save warmup draws. Instead, I reran the models with brms, which does save the warmup draws. Results are below.

You can see that the first warmup draws for gear3_ref3 and gear5_ref5 are very close to each other, as are the first draws for gear5_ref3 and gear3_ref5, which suggests that the init-swapping is indeed occurring.

library(tidyverse)
library(brms)

df_3_ref <- mtcars %>%
  mutate(gear = factor(gear) %>% fct_relevel("3", "4", "5"))

df_5_ref <- mtcars %>%
  mutate(gear = factor(gear) %>% fct_relevel("5", "4", "3"))

# brms models
ref3 = brm(mpg ~ -1 + gear, 
         prior=prior(normal(0,5), class="b"),
         data=df_3_ref, 
         seed=123, cores=4)

ref5 = brm(mpg ~ -1 + gear, 
         prior=prior(normal(0,5), class="b"),
         data=df_5_ref, 
         seed=123, cores=4)
# Extract first warmup draw from each chain for each model
lst(ref3, ref5) %>% 
  imap(
    ~as_draws_df(.x, pars=c("b_gear3","b_gear5"), inc_warmup=TRUE) %>% 
      group_by(.chain) %>% slice(1) %>% 
      select(.chain, matches("3|5")) %>% 
      mutate(model=.y) %>% 
      pivot_wider(names_from=model, values_from=matches("gear"))
  ) %>% reduce(full_join) %>% rename_all(~gsub("b_", "", .))
#> Joining with `by = join_by(.chain)`
#> # A tibble: 4 × 5
#> # Groups:   .chain [4]
#>   .chain gear3_ref3 gear5_ref3 gear5_ref5 gear3_ref5
#>    <int>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1      1      0.747     -0.144      0.741     -0.137
#> 2      2     -0.524     -0.368     -0.531     -0.361
#> 3      3      0.953     -1.06       0.946     -1.05 
#> 4      4      1.06      -1.06       1.05      -1.05
# Extract inits
lst(ref3, ref5) %>% 
  map(~rstan::get_inits(.x$fit) %>% map(~.x$b))
#> $ref3
#> $ref3[[1]]
#> [1]  0.6740347  0.7278982 -0.1509285
#> 
#> $ref3[[2]]
#> [1] -0.4211349 -1.1499405 -0.4711706
#> 
#> $ref3[[3]]
#> [1]  0.9324464 -0.8094750 -1.0667632
#> 
#> $ref3[[4]]
#> [1]  1.043050  1.056339 -1.037631
#> 
#> 
#> $ref5
#> $ref5[[1]]
#> [1]  0.6740347  0.7278982 -0.1509285
#> 
#> $ref5[[2]]
#> [1] -0.4211349 -1.1499405 -0.4711706
#> 
#> $ref5[[3]]
#> [1]  0.9324464 -0.8094750 -1.0667632
#> 
#> $ref5[[4]]
#> [1]  1.043050  1.056339 -1.037631
1 Like