Hello, is there any knowledge on how Stan's computation time scales with sample size?
Say I run a multi-level model. At first, I randomly sample N=500 to test the code, using vectorization in the model block. I achieve convergence and the code finishes running in 5 minutes. Now I run the same model with the same parameters everywhere, but for a much larger N (say, 50K). Is it possible to make a statement about the corresponding computation time? Thanks.
Not as you described. Multiple factors determine sampling time, data volume being one, and data volume itself has nuance. See the generate_and_fit.r code here: at the top you'll see four data-generating parameters that show some of the ways there can be "more data". You could also use that project to explore the impact of different data-volume configurations on sampling time, to maybe get an estimate for your specific scenario. The model code there is a highly optimized version of the SUG 1.13 hierarchical model.
One can discuss scaling of the gradient evaluation time in the context of a given Stan program, but without that Stan program there's not much one can say. The entirety of Stan's computation time, however, is not determined by the gradient evaluation time alone, but by the gradient evaluation time multiplied by the number of gradient evaluations needed. The scaling of the number of gradient evaluations will depend on the particular data, the assumed model/Stan program, and the provenance of the data.
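To make that decomposition concrete, here is a hypothetical back-of-envelope sketch (in Python, purely illustrative; the per-observation cost and leapfrog count below are made-up numbers, not measurements). It shows why a naive linear extrapolation from the N=500 pilot run only holds if the number of gradient evaluations stays fixed, which in practice it typically will not:

```python
# Decomposition: total time ~ (cost per gradient evaluation) x (number of
# gradient evaluations). Here we assume, hypothetically, that gradient cost
# scales linearly in N for a vectorized model; the number of evaluations
# depends on posterior geometry and generally changes as N grows.

def estimated_total_seconds(n_obs, grad_cost_per_obs, n_grad_evals):
    """Rough estimate: per-gradient cost (linear in N here) times eval count."""
    return n_obs * grad_cost_per_obs * n_grad_evals

# Calibrate from the pilot run in the question: N=500 took ~300 s.
# Suppose (hypothetically) that run used ~60,000 total leapfrog steps.
pilot_evals = 60_000
grad_cost = 300.0 / (500 * pilot_evals)  # seconds per observation per eval

# Naive linear extrapolation to N=50,000 with the SAME eval count:
naive = estimated_total_seconds(50_000, grad_cost, pilot_evals)
print(round(naive))  # ~100x the pilot time, valid only if the number of
                     # gradient evaluations stays fixed as N grows
```

The point of the sketch is that only the first factor (per-evaluation cost) extrapolates mechanically; the second factor (how many evaluations the sampler needs) is the part that defies a general statement.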
For some more discussion, see Addressing Stan speed claims in general - #45 by emiruz and Chains stuck when use larger dataset, but not smaller.