Hi there, I’m new to using Stan and having a bit of trouble.
I have a fairly large dataset (uploaded) that I would like to use to build a predictive model.
The model should be a multilevel model that accounts for the expected non-linear response.
For some context: the data are from laying hen parent stock, looking at female pullet hatchability, fertility, and hatch of fertile eggs.
Response Variables:
- fmale_chick_pct (also available: fmale_chick_count)
- fertility
- hof
I primarily care about fmale_chick_pct.
Predictors/Covariates:
- hen_age – (in weeks; hens start producing eggs around 21 weeks and, depending on performance, can continue out to 80+ weeks)
- egg_age – (in days; length of time that eggs are stored prior to incubation)
Multilevel Structure:
- group_id – (individual flocks)
- farm_id – (location where individual flocks live)
- facility_id – (incubation sites; eggs from a given group_id may go to one or more facility_id for incubation)
- time components – h_year, h_month, h_day (a 10-year period, months 1-12, and days 1-31, respectively)
I am unsure how to include these time components in the model to account for the fact that flocks can span two years.
Other variables:
- treatment_id – (indicator of whether or not a standard industry practice was performed to improve fmale_chick_pct; supposedly helps when egg_age is high)
- type_id – (specific type of hen; 1 and 2)
- subtype_id – (subtype of hen; only type 2 has subtypes (2 and 3))
So far I have started with the following model:
library(brms)

data_type1 <- subset(clean_data, type_id == 1)

bayes_model_type1 <- brm(
  fmale_chick_pct ~ s(hen_age) + s(egg_age) + treatment_id +
    (1 | farm_id) + (1 | group_id) + (1 | h_year),
  data = data_type1,
  family = gaussian(),
  prior = c(set_prior("normal(0, 10)", class = "b")),
  chains = 2,
  warmup = 1000,
  iter = 2000,
  seed = 123
)
This model only uses the subset of the data where type_id = 1 (because I expect the response to differ between type_ids).
However, the model has been running for 6 hours so far, and it is only a “simplified” version of what I expected the final model to look like.
Any suggested improvements or pointers would be really appreciated.
data_sample_stanhelp.csv (4.5 MB)
I have not downloaded your data, but I’ll take you at your word it’s large. In my experience, sophisticated multilevel models with smooth terms and large data sets usually take several hours to fit. I’m so sorry. Welcome to the club.
However, you can cut down on the run time by changing some of your settings to something like
bayes_model_type1 <- brm(
<blah>,
<blah>,
<blah>,
chains = 4, warmup = 500, iter = 1000, cores = 4
)
You’ll end up with the same number of post-warmup draws, but with less warmup time and fewer post-warmup iterations per chain, spread over more chains. This all presumes your model has no problems warming up with fewer iterations, which is often the case in my experience. If you do need that many warmup iterations, you could instead do something like this:
bayes_model_type1 <- brm(
<blah>,
<blah>,
<blah>,
chains = 4, warmup = 1000, iter = 1500, cores = 4
)
Also, notice that I’m setting cores = 4 in both examples. Assuming your computer has several cores available (if it’s newish, it should), your model will run much faster if you run multiple chains in parallel, rather than run them one at a time.
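If it helps, you can also set the number of cores once per session rather than per model. A minimal sketch, assuming a 4-core machine (brm() picks this up via getOption("mc.cores")):

options(mc.cores = 4)
# or detect the core count automatically (parallel ships with base R):
options(mc.cores = parallel::detectCores())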
Anyway, @marsh_master, it looks like this is your first time posting on the Stan forums. Welcome!
Hi Solomon,
Thank you for the response, changing the number of chains and reducing the warmups did help to speed things up!
I do think, however, that my model is likely inappropriate, which is contributing to the long time it takes to converge.
I decided to take a step back, start from simpler models, and build my way up; the convergence times are drastically different.
Is there a particular forum/tag that I should post under to ask for conceptual/model building help?
Regards,
Kyle
I’m of two minds on your last question. On the one hand, when I see a model including (1 | farm_id) + (1 | group_id) + (1 | h_year) applied to a large data set, I would expect a long running time (several to many hours). So I’m not so sure you need to worry.
On the other hand, I think starting simple and adding complexity slowly is almost always the right thing to do with any model, especially Bayesian models. To that end, if I had your data challenge, I’d:
- take a random subset of my data to use for my model building steps (maybe 10 or 25%), to cut down my running time;
- start with a simple intercept-only model, brm(fmale_chick_pct ~ 1) (see the sketch after this list);
- make sure the output made sense;
- if so, add complexity one or two steps at a time; and
- iterate until I had the full model running smoothly.
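For concreteness, here is a rough sketch of that build-up using the variable names from your first post; the data_type1 data frame, the 10% subsample, and the order of the steps are all illustrative assumptions, not prescriptions:

library(brms)

# work on a 10% subsample to keep the model-building iterations fast
set.seed(123)
data_sub <- data_type1[sample(nrow(data_type1), floor(0.10 * nrow(data_type1))), ]

# step 1: intercept only
fit1 <- brm(fmale_chick_pct ~ 1, data = data_sub, cores = 4)

# step 2: add a population-level predictor
fit2 <- brm(fmale_chick_pct ~ treatment_id, data = data_sub, cores = 4)

# step 3: add grouping terms one at a time
fit3 <- brm(fmale_chick_pct ~ treatment_id + (1 | farm_id),
            data = data_sub, cores = 4)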
Part of that iteration process would also entail making sure I understood how to set my priors properly, and addressing any complications like divergent transitions. I’d make great use of functions like pp_check() and get_prior(). I also often practice writing out my models in formal statistical notation with LaTeX to make sure I have a good understanding of what I’m doing.
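For example (a sketch reusing the formula from your first post and the hypothetical fit3 from above):

# list every prior you could set for the full model, before fitting it
get_prior(fmale_chick_pct ~ s(hen_age) + s(egg_age) + treatment_id +
            (1 | farm_id) + (1 | group_id) + (1 | h_year),
          data = data_type1)

# overlay 100 posterior-predictive draws on the observed outcome
pp_check(fit3, ndraws = 100)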
I agree with everything Solomon has said, but if your computer has more cores available, you can also specify the threads argument to spread the work within each chain across more than n_chains cores on your machine.
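A minimal sketch of what that looks like, reusing the model from the first post; note the backend choice is an assumption about your setup, since threads = threading() needs the cmdstanr backend (and a working CmdStan installation):

bayes_model_type1 <- brm(
  fmale_chick_pct ~ s(hen_age) + s(egg_age) + treatment_id +
    (1 | farm_id) + (1 | group_id) + (1 | h_year),
  data = data_type1,
  family = gaussian(),
  chains = 4, cores = 4,
  backend = "cmdstanr",    # within-chain threading requires CmdStan
  threads = threading(2)   # 2 threads per chain, so up to 8 cores in use
)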