Hi there, I’m new to using Stan and having a bit of trouble.
I have a fairly large dataset (uploaded) that I would like to use to build a predictive model.
The model should be a multilevel model that accounts for the expected non-linear response.
For some context: the data are from laying hen parent stock, looking at female pullet hatchability, fertility, and hatch of fertile eggs.
Response Variables:
- fmale_chick_pct (also available: fmale_chick_count)
- fertility
- hof
I primarily care about fmale_chick_pct.
Predictors/Covariates:
- hen_age – (in weeks; hens start producing eggs around 21 weeks and, depending on performance, can continue out to 80+ weeks)
- egg_age – (in days; length of time that eggs are stored prior to incubation)
Multilevel Structure:
- group_id – (individual flocks)
- farm_id – (location where individual flocks live)
- facility_id – (incubation sites; eggs from a given group_id may go to one or more facility_id for incubation)
- time components – h_year, h_month, h_day (a 10-year period, months 1-12, and days 1-31, respectively)
I am unsure how to include these time components in the model to account for the fact that flocks can span two years.
Other variables:
- treatment_id – (indicator of whether or not a standard industry practice was performed to improve fmale_chick_pct; supposedly helps when egg_age is high)
- type_id – (specific type of hen; 1 and 2)
- subtype_id – (subtype of hen; only type 2 has subtypes (2 and 3))
So far I have started with the following model:
library(brms)

data_type1 <- subset(clean_data, type_id == 1)

bayes_model_type1 <- brm(
  fmale_chick_pct ~ s(hen_age) + s(egg_age) + treatment_id +
    (1 | farm_id) + (1 | group_id) + (1 | h_year),
  data = data_type1,
  family = gaussian(),
  prior = c(set_prior("normal(0, 10)", class = "b")),
  chains = 2,
  warmup = 1000,
  iter = 2000,
  seed = 123
)
This model only uses the subset of the data where type_id = 1 (because I expect the response to differ between type_ids).
However, the model has been running for 6 hours so far, and it is only a “simplified” version of what I expected the final model to look like.
Any suggested improvements or pointers would be really appreciated.
data_sample_stanhelp.csv (4.5 MB)
I have not downloaded your data, but I’ll take you at your word it’s large. In my experience, sophisticated multilevel models with smooth terms and large data sets usually take several hours to fit. I’m so sorry. Welcome to the club.
However, you can cut down on the run time by changing some of your settings to something like
bayes_model_type1 <- brm(
<blah>,
<blah>,
<blah>,
chains = 4, warmup = 500, iter = 1000, cores = 4
)
You’ll end up with the same number of post-warmup draws, but with less warmup time and fewer post-warmup iterations per chain, spread over more chains. This all presumes your model has no problems warming up with fewer iterations, which is often the case in my experience. If you do need that many warmup iterations, you could instead do something like this:
bayes_model_type1 <- brm(
<blah>,
<blah>,
<blah>,
chains = 4, warmup = 1000, iter = 1500, cores = 4
)
Also, notice that I’m setting cores = 4 in both examples. Assuming your computer has several cores available (if it’s newish, it should), your model will run much faster if you run multiple chains in parallel, rather than run them one at a time.
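If it helps, you can also set the number of cores once per session rather than per model. A minimal sketch, assuming a 4-core machine (brm() picks this up via getOption("mc.cores")):

options(mc.cores = 4)
# or detect the core count automatically (parallel ships with base R):
options(mc.cores = parallel::detectCores())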
Anyway, @marsh_master, it looks like this is your first time posting on the Stan forums. Welcome!
Hi Solomon,
Thank you for the response, changing the number of chains and reducing the warmups did help to speed things up!
I do think, however, that my model is likely inappropriate, which is contributing to the long time it takes to converge.
I decided to take a step back, start from simpler models, and build my way up; the convergence times are drastically different.
Is there a particular forum/tag that I should post under to ask for conceptual/model building help?
Regards,
Kyle
I’m of two minds on your last question. On the one hand, when I see a model including (1 | farm_id) + (1 | group_id) + (1 | h_year) applied to a large data set, I would expect a long running time (several to many hours). So I’m not so sure you need to worry.
On the other hand, I think starting simple and adding complexity slowly is almost always the right thing to do with any model, especially Bayesian models. To that end, if I had your data challenge, I’d:
- take a random subset of my data to use for my model building steps (maybe 10 or 25%), to cut down my running time;
- start with a simple intercept-only model, brm(fmale_chick_pct ~ 1) (see the sketch after this list);
- make sure the output made sense;
- if so, add complexity one or two steps at a time; and
- iterate until I had the full model running smoothly.
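For concreteness, here is a rough sketch of that build-up using the variable names from your first post; the data_type1 data frame, the 10% subsample, and the order of the steps are all illustrative assumptions, not prescriptions:

library(brms)

# work on a 10% subsample to keep the model-building iterations fast
set.seed(123)
data_sub <- data_type1[sample(nrow(data_type1), floor(0.10 * nrow(data_type1))), ]

# step 1: intercept only
fit1 <- brm(fmale_chick_pct ~ 1, data = data_sub, cores = 4)

# step 2: add a population-level predictor
fit2 <- brm(fmale_chick_pct ~ treatment_id, data = data_sub, cores = 4)

# step 3: add grouping terms one at a time
fit3 <- brm(fmale_chick_pct ~ treatment_id + (1 | farm_id),
            data = data_sub, cores = 4)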
Part of that iteration process would also entail making sure I understood how to set my priors properly, and addressing any complications like divergent transitions. I’d make great use of functions like pp_check() and get_prior(). I also often practice writing out my models in formal statistical notation with LaTeX to make sure I have a good understanding of what I’m doing.
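For example (a sketch reusing the formula from your first post and the hypothetical fit3 from above):

# list every prior you could set for the full model, before fitting it
get_prior(fmale_chick_pct ~ s(hen_age) + s(egg_age) + treatment_id +
            (1 | farm_id) + (1 | group_id) + (1 | h_year),
          data = data_type1)

# overlay 100 posterior-predictive draws on the observed outcome
pp_check(fit3, ndraws = 100)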
I agree with everything Solomon has said, but if your computer has more cores available, you can also specify the threads argument to spread the work within each chain across more than n_chains cores on your machine.
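A minimal sketch of what that looks like, reusing the model from the first post; note the backend choice is an assumption about your setup, since threads = threading() needs the cmdstanr backend (and a working CmdStan installation):

bayes_model_type1 <- brm(
  fmale_chick_pct ~ s(hen_age) + s(egg_age) + treatment_id +
    (1 | farm_id) + (1 | group_id) + (1 | h_year),
  data = data_type1,
  family = gaussian(),
  chains = 4, cores = 4,
  backend = "cmdstanr",    # within-chain threading requires CmdStan
  threads = threading(2)   # 2 threads per chain, so up to 8 cores in use
)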