Dear Paul Buerkner,
I read your article about IRT modelling with the package brms with great interest. Until now I have applied lme4 to fit multilevel Rasch models, but lme4 fell short when it came to polytomous Rasch models. As I understand from your paper, the package brms is able to handle these, so I am currently exploring this possibility.
My aim is to specify a multilevel multivariate PCM to assess DIF of a student survey administered in six different countries. Furthermore, I may want to use IRT linking (concurrent or separate calibration) to correct for non-invariance. For these purposes I need highly flexible modelling software. The data consist of student perception surveys gathered in six countries. I specify two random effects (students and classes) and one fixed effect for country. Items are clustered in six factors. I basically specified the following model:
formula = score.f ~ 1 + (0 + itemtype | item) + (1 | studentid) + (1 | teacherid) + country
model = brm(formula = formula, data = data[c(1:2100, 58801:60900), ], family = brmsfamily("acat", "logit"), cores = 5)
I selected five student ratings per class. The statement data[c(1:2100, 58801:60900), ] selects 60 teachers spread over two countries (to speed up testing of the model). The complete dataset counts six countries and over 1500 teachers.
I have no experience with brm and until now have analyzed dichotomous Rasch models using lme4. In what I have done so far, I find brm to be considerably slower than lme4, but then again lme4 cannot specify PCMs (so perhaps this is an unfair comparison). Nonetheless, any help with speeding up the process is very much appreciated.
At this moment I have three questions:
Is the model specification correct, or do I misunderstand the syntax? I realize that I provide you with limited information to answer this question. If additional information is required, please ask.
What priors would you advise for this type of model? You should know that I am unaccustomed to setting priors. In my prior experience with Bayesian statistics I specified uninformative priors. I guess that the current model (which does not specify any priors) also uses uninformative priors. In your paper https://arxiv.org/pdf/1905.09501.pdf you specify weakly informative priors. Unfortunately, I do not completely understand from the given example (on page 40) exactly which priors you specified for the PCM. I think you applied the same priors as for the GRM, but I am not completely sure about this.
Is it possible to obtain the random person or class effects? In lme4 I am used to calling the ranef function to obtain the random effects of the model. I guess I need to call the posteriors?
Many thanks for your help!
Rikkert van der Lans
Morning and welcome @RikkertvanderLans! I know @paul.buerkner is around and several other brms users. Hopefully they can speak to the paper. I can talk a bit about 2 and 3.
Welcome to the Stan forum.
I’ll try a few quick answers:
- You can specify random effects as you do in lme4
- brms is slower, because sampling from a posterior takes time (lme4 will only return point estimates, brms gives you the full posterior distribution)
- you can use the ranef function to get random effects
- with get_prior you can print the priors brms specifies for the model.
So for the full group-level estimates it's like this:
ranef(k_fit_brms, groups = "site", probs = 0.5)
You can use get_prior to see what the default priors are.
If they aren’t quite what you want, then you will have to look through the literature to see what other folks say the impact of your variables is on score.f. For example, if country can have a positive or negative influence on score.f, then normal(0, 1) or maybe a student_t distribution would be a good choice.
For each prior you should document why you picked that distribution. Usually that means listing the journal articles being referenced, copies of field notes, or which experts you polled.
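As an illustrative sketch of those two steps in brms (the formula and variable names are taken from the original post; the exact prior classes should be checked against what get_prior reports for your model):

```r
library(brms)

# Inspect the default priors brms would assign to this model
get_prior(score.f ~ 1 + (0 + itemtype | item) + (1 | studentid) +
            (1 | teacherid) + country,
          data = data, family = brmsfamily("acat", "logit"))

# Weakly informative priors along the lines suggested above
priors <- c(
  set_prior("normal(0, 1)", class = "b"),          # population-level effects
  set_prior("student_t(3, 0, 2.5)", class = "sd")  # group-level SDs
)

fit <- brm(score.f ~ 1 + (0 + itemtype | item) + (1 | studentid) +
             (1 | teacherid) + country,
           data = data, family = brmsfamily("acat", "logit"),
           prior = priors, cores = 4)
```

Whatever you end up choosing, the get_prior output is the authoritative list of which parameter classes exist in your particular model.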
Dear Guido_Biele and Ara Winter,
Thank you for your clear answers. At the moment I cannot really test the ranef function, since my computer has been running the model for over an hour, showing no more than “start sampling”. I guess the sampling should not take so long, should it?
1h without anything beyond “start sampling” is long, but it need not be unreasonable if you have lots of data and parameters (e.g. from student random effects).
You could try to start with less data and a simpler model (e.g. fewer random effects) to speed things up.
You can also reduce the number of iterations. The default of 2000 is in my experience on the high side, i.e. one often does not need 1000 warm-up samples with HMC as implemented in Stan. And if you have 4 chains with 250 post-warmup samples per chain, that should also be enough for most purposes.
So for testing purposes you could try it with 500 warm-up and 250 post-warmup samples if you have lots of data. It is, however, a good idea to run the final model with more iterations.
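In brm() terms that suggestion looks roughly like this (note that iter counts warmup plus post-warmup draws, so 500 warmup and 250 sampling iterations means iter = 750; formula and data are placeholders):

```r
# Test run with reduced iterations: 500 warmup + 250 post-warmup per chain
fit_test <- brm(formula, data = data,
                family = brmsfamily("acat", "logit"),
                chains = 4, warmup = 500, iter = 750,
                cores = 4)
```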
All that said, with the type of model and the amount of data you are describing, I would rather do the computation on a dedicated machine, possibly a high performance computing cluster.
My apologies, Guido, I feel like a newbie. I used multiple cores for the first time, and now I understand that when you do that, the sampling progress is no longer shown in RStudio - with one core it does show a percentage complete. I now understand how to visualize the progress and can follow that it works appropriately.
Now let’s see how it does with the complete data.
On Windows you can see the progress of brms (rstan) models fit with multiple cores in the Viewer tab on the right hand side.
Dear Guido, Ara and Paul,
I am still exploring brms. Guido mentions the necessity of a dedicated machine. Getting access to a high performance computing cluster might be a possibility, but my experience is that I will need to wait for months before I get access (there is a waiting queue). I could upgrade my own machine, giving it for example 8 or 16 times the RAM it currently has. How much time would that save me? I am now looking at time estimates in the range of 10,000 - 20,000 seconds per 1000 transitions. I frankly have no clue how to interpret this. All I know is that 500 seconds corresponds to approximately 2 hours of estimation time (though I didn’t clock it, to be honest).
Thank you, Rikkert
@RikkertvanderLans I usually recommend folks look at Amazon cloud services unless you want to build and maintain (almost always a losing proposition) a local cluster. Amazon is pretty cheap, way cheaper than buying RAM.
Hi, more RAM could help, but I am really not sure. I once read that lots of cache memory is crucial, but for that one would need a new CPU.
Having said that, before thinking about hardware, I’d check that the model is specified as well as possible.
For this I’d first check that the model estimation works well, i.e. no divergent transitions, Rhat close to 1, and no low-BFMI warnings. You can do this with a subset of the data.
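A quick way to run those checks on a fitted brms model (fit is a placeholder name; check_hmc_diagnostics comes from rstan and operates on the underlying stanfit object stored in fit$fit):

```r
library(rstan)

# Reports divergences, max-treedepth hits, and E-BFMI warnings
check_hmc_diagnostics(fit$fit)

# Rhat values for all parameters; these should all be close to 1
summary(rhat(fit))
```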
Next, you can go through the brms-generated Stan code and check whether you can call the likelihood function (…_lpdf) with vectors instead of looping through all observations. The Stan documentation has information about whether a vectorised version of the likelihood function exists. (Even if it exists, the speed-up is not always big.)
One can also use map_rect to speed things up, but this will again only be useful if you have more cores than chains.
As you can see, running somewhat complicated models with large data sets quickly gets demanding in Stan.
So, even though I prefer to use Bayesian estimation when possible, you might be in a situation where I would not use Stan, if other methods can fit the models you are after.
Having said that, I’ve myself fit models that took half a week or more to fit (on an HPC platform). Therefore AWS could be a solution.
Sorry for not having an easier proposal, and let me reiterate that I would make sure the model is well specified (which includes good priors) before investing time in anything else.
Dear Guido, Ara and Paul,
Thank you for pointing me back to the formulas, Guido. I made several updates to the model. The two formulas I apply now look like this:
score ~ 1 + (0 + country | itemtype) + (cs(0 + country) | item) + (0 + country | studentid) + (0 + Country | teacherid)
A second, competing model I am testing is:
score ~ 1 + (1 | itemtype:country) + (cs(1) | item:country) + (1 | studentid:country) + (1 | teacherid:country)
The output seems to make more sense than the output provided by the formula I posted earlier.
Rhats are all between 1.00 and 1.02, with the exception of one parameter having Rhat = 1.05. With the limited sample I selected, I get no further warnings concerning fit (the model I posted previously did give such warnings), but I do get warnings concerning the low sample size. When I add an interaction term teacherid:country:item to these models, I get a warning about 30-40 divergent transitions after warmup, which is why I left it out for now. I will decide later whether to include the term.
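One standard knob for a modest number of divergent transitions (not mentioned in the thread, but common brms practice) is to increase adapt_delta, which forces smaller step sizes at the cost of slower sampling:

```r
# A higher adapt_delta target (default 0.8) often removes a handful
# of divergences, in exchange for longer sampling times
fit2 <- brm(formula2, data = data, family = brmsfamily("acat", "logit"),
            control = list(adapt_delta = 0.95), cores = 4)
```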
What I do not completely understand is my system performance. It takes really long to estimate these models (with small samples (selecting 30 teachers per country = 180 teachers rated by 5*180 students) it takes approximately 1h to 2h). However, when I look at CPU and memory usage in the task manager, RStudio hardly ever uses more than 50% of my CPU, and memory usage is somewhere between 50-80% of my RAM.
Just for the sake of illustration, I am now running model two with the complete data: 1500 teachers rated by over 7000 students. This was meant to get an impression of the time. It reports that 1000 transitions will take around 7000 seconds. I started model estimation yesterday evening at 22h. It is now noon the following day, and brms has not completed 10% of the iterations. When I look in the task manager I see CPU at 40% and memory at 78%. The disk is not used. I will keep waiting until this evening to see the status then.
However, given that part of my CPU and memory capacity is not exploited, I am curious how to make better use of my computer’s capacity to speed up estimation.
Thank you guys for all the help and tips.
About your first model: you seem to be using category-specific effects over country (cs(0 + country)), which I assume is a non-ordered factor. In a similar vein, it is not clear to me what (0 + country | studentid) or (0 + Country | teacherid) are supposed to do, as I would expect students and teachers to be clustered within countries.
I know too little about hardware to have a clear opinion about how adequate your machine is for the model you are trying to fit.
Maybe I can clarify a few things:
The time per 1000 transitions is only part of the picture, because the number of transitions per MCMC sample varies over time. In particular, finding the right step size is part of the warmup process. If the step size is small (which it can be, especially at the beginning of warmup), the algorithm needs more transitions per MCMC sample. All that to say that one can’t simply use the time needed for the first 10% of the iterations to calculate the total time needed.
The warning about the low sample size was probably about a low number of effective samples. This would indicate that you need to run the chains longer.
It is not clear to me whether you run the chains in parallel (by setting e.g. options(mc.cores = 4) before running the model). This will of course help, but only if you have enough RAM to run multiple chains in parallel. (This is usually not a problem, but I am not sure for your data.)
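For completeness, a minimal sketch of both ways to parallelise the chains (they are equivalent; an explicit cores argument to brm() takes precedence over the global option):

```r
# Either set the option globally before fitting ...
options(mc.cores = 4)

# ... or pass cores directly to brm(); each chain then runs on its own core
fit <- brm(formula, data = data, family = brmsfamily("acat", "logit"),
           chains = 4, cores = 4)
```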
You can use the refresh argument (an integer) to control how often the progress of the sampling is reported; details are here.
I’ll be at work in a few. I would go for a simpler model running 4 chains on 6 cores (my default on a 5-year-old laptop) if you can. Again, it depends on your machine.
From the paper you mentioned
formula = y ~ 1 + (1 | item) + (1 | person)
That’s what I would start with.
Also, country really should be (1 | country), unless there is something I am missing about how country is coded?
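Putting the paper’s starting formula and that suggestion together would give something like this (my reading of the advice, not a formula from the thread):

```r
# 1PL-style baseline from the paper, with country as a grouping factor
formula_start <- bf(score ~ 1 + (1 | item) + (1 | studentid) +
                      (1 | teacherid) + (1 | country))
```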
One of my favorite papers on hierarchical modeling using Stan is:
Koster J, McElreath R, Hill K, Yu D, Shepard G, Vliet NV, Gurven M, Kaplan H, Trumble B, Bird RB, Bird D. The life history of human foraging: Cross-cultural and individual variation. Science Advances. 2020;6(26).
Figure 2 and Figure 3 would be relevant to this discussion but the overall paper is worth reading and using as a template for Bayesian analysis with Stan.