Wish list for Stan interfaces

@sakrejda posted this on the logging topic, but I’m moving it to its own topic to avoid hijacking the logging topic.

My [Krzysztof’s] wish list for an interface is something like:

  • these loggers

  • sane input format, binary and text

  • typed output for the stuff like mass matrix / warmup / config messages

  • streamed binary output that can be read during the run

  • utilities to read/validate config

  • config you can read from a file

  • threading

  • job scheduling based on config files

I could do most of this in an interface by going off on my own, but I think
it’s better to push core Stan in a direction that makes it possible.

Do you have more concrete proposals?

  • these loggers: the current interface is low level. What do you want the high-level interfaces to be in the interfaces? Config by file? Config by static/global variable? Writing options to files, to the screen, to sockets, to databases?

  • sane input format: you’ll have to be more specific. I think everyone agrees that the R dump format has to go, but there’s not a concrete proposal on the table for a replacement as far as I know. If we want to go with something like JSON, someone will need to propose a schema for representing all of our data types. And when you say binary and text, I take it you mean two different formats? If so, should they be convertible?

  • typed output: not sure what you mean here. What do you consider to be output? In R, for example, if you have a matrix variable a in Stan and a fit object, then extract(fit)$a gives you structured output, but it’s indexed first by draw and then by the two matrix indices. Or do you mean file-based output?

  • streamed binary output: is that also typed in the sense of the previous item, or do you want multiple formats? I think everyone agrees having streaming output is desirable, if only so as not to blow out memory in R or Python and to make it easy to move output among interfaces (because the R tools have much richer visualizations than exist in Python at the moment).

  • config you can read from a file: config for calling sampling and optimization? Would that file then point to another file with inputs, or do you imagine data being in the same file? What about inits for mass matrices or parameters? Would this also configure MPI, GPUs, etc. (make-like config in addition to calling a sampler once everything’s made)?

  • threading: this can’t be done without tweaking the underlying memory handling in the autodiff library to make the global stacks thread local. There’s a performance hit per thread, but it allows you to run without copying data.

  • job scheduling: I’m not sure what you’re thinking here. Do you mean multiple fits scheduled over a cluster? Scheduled over multiple cores?

For loggers, there’s no interface to control logging level and everything just dumps to one level, so you’d have to first fix the underlying C++ code.

For binary inputs, you’d need to implement a var_context. @betanalpha wants to change how those work, so you need to coordinate with him.

For threading, you’d also need to modify the underlying C++ code and provide a way to control at compile time whether to use thread-local or global memory.
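As a rough sketch of the compile-time switch, assuming a hypothetical build flag SKETCH_THREADS and a toy tape struct standing in for the autodiff memory stack (neither is taken from the Stan code base), the choice between global and thread-local storage could be made with a single macro:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical build flag: compile with -DSKETCH_THREADS to give each
// thread its own stack; without it, all threads share one global stack.
#ifdef SKETCH_THREADS
#define SKETCH_TLS thread_local
#else
#define SKETCH_TLS
#endif

// Toy stand-in for an autodiff memory stack.
struct tape {
  std::vector<double> values;
};

// Access the stack. With SKETCH_THREADS defined, every thread sees a
// separate instance; otherwise there is a single shared instance.
inline tape& global_tape() {
  static SKETCH_TLS tape instance;
  return instance;
}

// Push one value from a worker thread, then report how many values the
// calling thread sees on its own stack.
inline std::size_t main_tape_size_after_worker() {
  global_tape().values.clear();
  std::thread worker([] { global_tape().values.push_back(1.0); });
  worker.join();
  return global_tape().values.size();
}
```

Built without the flag, main_tape_size_after_worker() returns 1 because the worker wrote to the shared stack; built with -DSKETCH_THREADS it returns 0 because the worker wrote to its own copy. The shared version is not safe for concurrent writers, which is the reason the thread-local variant is needed in the first place.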