Omit transformed parameters from output?

I am reproducing the tutorial rstanarm multilevel example rewriting it in “pure” ®Stan with some twists. I wanted to produce the mean predictions per school, and after about fifteen trials and errors, I ended with the combination of transformed parameters and generated quantities as follows:

//
// This Stan program defines a multilevel model
// expressed in lmer as
// 
// lmer(formula = course ~ 1 + female + (1 | school), data = GCSE)
//

// The input data
data {
  // number of observations
  int<lower=0> N;
  // number of schools
  int<lower=0> M;
  //  outcome variable: course
  vector[N] y;
  // predictor: female
  vector[N] female;
  // cluster variable
  int school[N];
  // school sizes, to be computed in R
  int s[M];
}

parameters {
  // intercept
  real beta0;
  // fixed effects slope
  real beta_f;
  // random effects
  vector[M] u_j;
  // residual error
  real<lower=0> sigma;
  // random effect variance
  real<lower=0> sigma_u;
}

transformed parameters {
  // level-1 unit specific means
  vector[N] mu_i;
  
  mu_i = beta0 + beta_f * female + u_j[school];
}

model {
  y ~ normal(mu_i, sigma);
  u_j ~ normal(0, sigma_u);
}

generated quantities {
  // running counter  
  int pos;

  // school empirical mean
  vector[M] m_j;

  pos = 1;
  for (j in 1:M) {
    m_j[j] = mean( segment(mu_i,pos,s[j]) );
    pos = pos + s[j];
  }
}

The R code that gets the data is

### pure Rstan multilevel example
library(tidyverse)
library(rstan)
library(here)
library(mlmRev)

data(Gcsemv, package = "mlmRev")
summary(Gcsemv)

# massage the data a bit
Gcsemv$female <- relevel(Gcsemv$gender, "M")
Gcsemv %>% 
  select(school, student, female, course) %>% 
  filter( !is.na(course) ) %>% 
  arrange(school, student) %>%               # sort by school and student within school
  group_by(school) %>% 
  mutate(s=n()) %>% ungroup() ->             # compute school sizes
  GCSE_nmy

# convert to lists for Stan
GCSE_stan_data <- list(
  # number of observations
  N = nrow(GCSE_nmy),
  # number of schools
  M = length(unique(GCSE_nmy$school)),
  #  outcome variable: course
  y = GCSE_nmy$course,
  # predictor: female
  female = as.integer(GCSE_nmy$female)-1,
  # cluster variable
  school = as.integer(GCSE_nmy$school),
  # school sizes for segmenting
  s = (GCSE_nmy %>% group_by(school) %>% slice(1) %>% ungroup() %>% select(s) %>% unlist() )
)

So when I run

multilevel_stan_fit <- stan(
  file = here('multilevel-stan.stan'), # program
  data = GCSE_stan_data,             # data
  chains = 4,             # number of Markov chains
  warmup = 2000,          # number of warmup iterations per chain
  iter = 10000,           # total number of iterations per chain
  cores = 4,              # number of cores to use
  refresh = 200,          # show progress every 'refresh' iterations
  seed = 10203
)

I end up with

> print( object.size(multilevel_stan_fit), units="Mb")
576.1 Mb

which takes a fairly long while to create and access.

The blame, of course, is with mu_i, an object of size 1725 x 4000, which I don’t really care that much about. Is there any way to omit this vector of transformed parameters from being written into R? (It has to be there in memory for the generated quantities to be created, but the write step of outputting it is a different one, as far as I understand from https://mc-stan.org/docs/2_21/reference-manual/output.html.)

1 Like

You can use the pars option along with the include option to list the names of parameters you want to keep (if include=TRUE) or remove (if include=FALSE).

4 Likes

I’ll also note that unless you’re trying to get stable inference about some pretty extreme quantiles in the posterior distributions, having 1e4 iterations per chain isn’t really helping you much.

1 Like

OK so @mcol’s suggestion amounted to

multilevel_stan_fit <- stan(
  file = here('multilevel-stan.stan'), # program
  data = GCSE_stan_data,             # data
  chains = 4,             # number of Markov chains
  warmup = 2000,          # number of warmup iterations per chain
  iter = 10000,           # total number of iterations per chain
  cores = 4,              # number of cores to use
  refresh = 200,          # show progress every 'refresh' iterations
  seed = 10203,
  include = FALSE,        # actually, this means 'exclude' the parameters listed below
  pars = "mu_i"
)

which produces

> names(extract(multilevel_stan_fit))
[1] "beta0"   "beta_f"  "u_j"     "sigma"   "sigma_u" "pos"     "m_j"     "lp__"   
> print(object.size(multilevel_stan_fit), units="Mb")
48.7 Mb

Much nicer. Problem solved, case closed.

@mike-lawrence, old BUGS habits die hard :).

3 Likes

Has this behavior changed? What is the process to only include specific parameters in the writing of the files?