GSOC 2017: NumFOCUS will be an umbrella organization

Hi

Organizations can start submitting applications for Google Summer of Code
2017 on January 19, and the deadline is February 9.

https://developers.google.com/open-source/gsoc/timeline?hl=en

NumFOCUS will be applying again this year. If you want to work with us,
please let me know. If you apply as an organization yourself or under a
different umbrella organization, please tell me as well. If you participate
with us, it would be great if you started adding possible projects to the
ideas page on GitHub soon. We have some general information for mentors on
GitHub.

We also have a template for ideas that might help. It lists the things
Google likes to see.

In case you participated in earlier years with NumFOCUS, there are some
small changes this year. Raniere won’t be the admin this year. Instead I’m
going to be the admin. We are also planning to include two explicit rules
for when a student should be failed: students have to communicate regularly
and commit code into your development branch by the end of the summer.

best, Max

Although Stan has never participated in GSOC, we at least had a wiki page for it in 2014 (and some of these things got implemented anyway in the meantime):

Guidelines

Stan is primarily a C++ project, but the ideas below vary in the degree to which skill with C++ is required from “none at all” to “familiarity with Boost or Eigen”. Stan is hosted on GitHub, so some experience with the git version-control system is a plus but git can be learned in a relatively short period of time.

Stan consists of a few different parts. The modeling language allows users to specify a Bayesian model for data. The parser inputs a file in the modeling language and outputs a C++ source file, which can then be compiled into an executable file that links to the library. Data are passed to this executable, which outputs results to the hard disk. These results can then be summarized.

Ideas

Interfaces to Stan

There are three interfaces to Stan currently. The first is the command line of a shell. The second is via the R package RStan. The third is via the Python module PyStan. Additional interfaces are welcome. The easiest approach is to use the calling program to write the data to the hard disk, use a shell call to the command-line interface to Stan, and read the results into the calling program. The harder approach (which is undertaken by RStan and PyStan) is to use the calling program to pass the data in memory to Stan and accept the results in memory from Stan. For a GSOC project, we favor starting with the first approach and implementing the second approach if there is time.
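The easier, file-based approach can be sketched roughly as follows. This is only an illustration, not how any existing interface is written: the helper names (`format_rdump`, `build_command`, `read_draws`, `run_model`) are made up for this example, the argument layout in `build_command` assumes a CmdStan-style command line, and the parser assumes the output CSV marks comment lines with `#`.

```python
import csv
import subprocess


def format_rdump(data):
    """Render scalars and flat lists in an R-dump-like text format (assumed layout)."""
    lines = []
    for name, value in data.items():
        if isinstance(value, (list, tuple)):
            body = "c(" + ", ".join(repr(v) for v in value) + ")"
        else:
            body = repr(value)
        lines.append(f"{name} <- {body}")
    return "\n".join(lines) + "\n"


def build_command(executable, data_file, output_file):
    """Assemble the shell call; the argument names are an assumption."""
    return [executable, "sample",
            "data", f"file={data_file}",
            "output", f"file={output_file}"]


def read_draws(text):
    """Parse CSV text into a dict of columns, skipping '#' comment lines."""
    rows = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    if not rows:
        return {}
    reader = csv.DictReader(rows)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def run_model(executable, data, data_file="data.R", output_file="output.csv"):
    """Write data to disk, call the compiled model, read the results back."""
    with open(data_file, "w") as handle:
        handle.write(format_rdump(data))
    subprocess.run(build_command(executable, data_file, output_file), check=True)
    with open(output_file) as handle:
        return read_draws(handle.read())
```

Something like `run_model("./my_model", {"N": 3, "y": [0.1, 0.4, 0.2]})` would then return one list of draws per output column, provided a compiled model executable exists at that (hypothetical) path.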

HTML5 Interface using PNaCL (mentor: Ben Goodrich)

The Portable Native Client (PNaCL) compiler takes C++ source code and compiles it into an intermediate representation that can be executed in a sandbox on a variety of platforms via the Chrome web browser. This mechanism makes it possible for users to distribute the intermediate representation of their Bayesian models for others to execute without additional compilation. We have already used the PNaCL compiler to create the intermediate representation but would greatly benefit from a GSOC project to write an HTML5 interface that would pass a data file and various options to the executable and pass results to the executable that prints a summary. This project would require no C++ but would require intermediate web design skill, including javascript and HTML5. See the Native Client website for more details on PNaCL, in particular the last few steps of the tutorial.

HTML5 Interface using emscripten (mentor: Ben Goodrich)

The emscripten compiler takes C++ source code and translates it into javascript that can be run on a variety of platforms via many web browsers. The development version of emscripten uses a fork of the PNaCL compiler to create the intermediate representation and generates the resulting javascript. We would benefit from a GSOC project to write an HTML5 interface that would pass a data file and various options to the executable and pass results to the executable that prints a summary. This project would require no C++ but would require intermediate web design skill, including javascript and HTML5. See the emscripten website for more details on this process.

Custom Stan extensions (mentor: Marcus Brubaker)

Stan currently has a large number of functions available but users often have their own functions that they’d like to be able to plug in to Stan. Currently, short of explicitly modifying and recompiling Stan, users can’t do this. This project would require developing a mechanism by which a user could write a function (with a suitable interface) in C++ and then make use of that function in their Stan model. Additionally, some users might want to implement custom gradients for these functions instead of using Stan’s autodiff implementation. Ideally the extension mechanism should be able to support this.

Java Interface (mentor: TBD)

Bob said in an email:

Having just said this, it occurs to me that having a Java
and Scala wrapper for Stan along the lines of PyStan and
RStan (and soon to be MStan for MATLAB) would make sense as
a project.

The performance hit from translating data structures shouldn’t
be an issue for sampling or optimization.

With a wrapper, it’d be possible to experiment with other algorithms
by calling the log probability function compiled in C++ for
a model. I’ve never used JNI, but it may be a bottleneck for
a tighter integration like this as the parameters would need to
be passed around for each call and the result passed back:

http://stackoverflow.com/questions/7699020/what-makes-jni-calls-slow

Julia Interface (mentor: TBD)

Stata Interface (mentor: TBD)

Matlab Interface (mentor: Marcus Brubaker)

A simple version is already in-progress. It may be worthwhile for a GSOC project to implement a more complicated in-memory interface. Doing so would require at least intermediate skill with Matlab and C++ programming.

Input / Output

The command-line interface to Stan currently only reads data from the disk in a format that is heavily influenced by BUGS and R. Also, the command-line interface to Stan only writes results to the disk in CSV format, although CSV files can be read by a wide variety of statistical programs. There have been many requests for additional input / output options, any of which would make a good GSOC project. In each case, some skill with C++ is necessary but the main requirement is experience with the input format.

JSON Input (mentor: TBD)

Stan-users discussion

CSV Input

Issue 421

File-backed Input

Stan-dev discussion

Protobuf Input

Link

Statistical Computing

These projects enable users to estimate new models with Stan or to estimate existing models in more computationally efficient ways. As such, they all require at least intermediate C++ skill and some knowledge of (or capacity to learn) the theory behind the algorithm.

Initialization block (mentor: TBD)

A Stan model requires starting values for all the unknown parameters. These starting values can be input from a file on the hard disk, but many users find doing so inconvenient. A better alternative would be to add an initialization block to the Stan modeling language that could be parsed into C++ code, which would allow users to specify particular starting values for (a subset of) the parameters or to specify a statistical distribution to randomly draw the corresponding starting values. Stan already has the capability to draw randomly from many statistical distributions but lacks the capability to utilize them for starting values.
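To make the intended behavior concrete, here is a minimal sketch of what such an initialization block might boil down to after parsing. Everything in it is hypothetical: `draw_inits` and the shape of `init_spec` are invented for this example, and the fallback of drawing uniformly from (-2, 2) is only meant to mirror Stan's default random initialization.

```python
import random


def draw_inits(param_names, init_spec, seed=0, default_radius=2.0):
    """Build starting values for every parameter.

    init_spec maps a parameter name to either a fixed number or a
    function taking a random.Random and returning a draw. Parameters
    missing from init_spec fall back to uniform(-radius, radius).
    """
    rng = random.Random(seed)
    inits = {}
    for name in param_names:
        spec = init_spec.get(name)
        if spec is None:
            inits[name] = rng.uniform(-default_radius, default_radius)
        elif callable(spec):
            inits[name] = spec(rng)      # random draw from a distribution
        else:
            inits[name] = float(spec)    # user-supplied fixed value
    return inits


inits = draw_inits(
    ["mu", "sigma", "tau"],
    {"mu": 0.0, "sigma": lambda rng: rng.expovariate(1.0)},
    seed=1,
)
```

Here `mu` starts at a fixed value, `sigma` is drawn from an exponential distribution, and `tau` is left to the default, which matches the "subset of the parameters" case described above.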

Limited memory BFGS optimization (mentor: Marcus Brubaker)

Stan already has two optimization algorithms, a Newton-based approach and a quasi-Newton approach. Both require that the Hessian matrix (or an approximation to it) with respect to the unknown parameters be allocated, which limits their applicability to optimization problems with many parameters. An alternative is to use some “limited memory” variant of the BFGS algorithm. These limited memory alternatives are already available in a variety of software, but Stan is somewhat unique in that it does not require the user to implement a function to calculate the gradient since the gradient can be calculated via auto-differentiation. Thus, having a limited memory optimization algorithm would be quite useful for Stan. See the mentions of “L-BFGS” in Nocedal and Wright’s book Numerical Optimization.
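For readers unfamiliar with the idea, the following is a small pure-Python sketch of the textbook L-BFGS two-loop recursion with a simple backtracking line search. It is not Stan code and says nothing about how Stan would implement it; the function names are made up, and the gradient is written by hand here where Stan would obtain it from auto-differentiation. The point is that only the last few (s, y) pairs are stored, never a full Hessian.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate -H^{-1} * grad from stored pairs."""
    q = list(grad)
    coeffs = []
    # First loop: newest pair to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        coeffs.append((alpha, rho))
    # Scale by an initial Hessian guess taken from the newest pair.
    gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1]) if s_hist else 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), (alpha, rho) in zip(zip(s_hist, y_hist), reversed(coeffs)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]


def lbfgs_minimize(f, grad_f, x0, memory=5, max_iter=100, tol=1e-8):
    x, g = list(x0), grad_f(x0)
    s_hist, y_hist = [], []
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = lbfgs_direction(g, s_hist, y_hist)
        fx, slope, step = f(x), dot(g, d), 1.0
        # Backtracking (Armijo) line search.
        while f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
            if step < 1e-12:
                break
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            s_hist.append(s)
            y_hist.append(y)
            if len(s_hist) > memory:  # the "limited memory" part
                s_hist.pop(0)
                y_hist.pop(0)
        x, g = x_new, g_new
    return x


# Toy objective with its minimum at (1, -2).
objective = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
gradient = lambda x: [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]
solution = lbfgs_minimize(objective, gradient, [0.0, 0.0])
```

Storage grows with `memory` times the number of parameters rather than with the square of the number of parameters, which is what makes the approach attractive for large models.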

Conditionally independent contributions to the log-posterior (mentor: TBD)

The Stan modeling language currently supports statements such as vector ~ distribution(parameters); which (when appropriate) indicates that the elements of the vector are conditionally independent given the parameters (which are often scalars) to the distribution and thus contribute additively to the log-posterior that is essential to the Markov Chain Monte Carlo (MCMC) schemes used by Stan. However, many users find this syntax is still too limiting and Stan would benefit from a GSOC project to support array ~ distribution(parameters); where array is a multidimensional object such as a matrix-like two-dimensional object. This syntax would indicate the user’s intention that all elements of the array are conditionally independent given the parameters to the distribution. Permitting this syntax is relatively easy, but the harder part is implementing the desired behavior for as many distributions as possible that are already supported by Stan.
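The intended semantics can be illustrated with a short sketch in which conditional independence means that the log density of the whole array is the sum of the elementwise log densities. The function names are invented for this example and a normal distribution with scalar parameters is assumed; a real implementation would live in Stan's C++ library and cover many distributions.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of a single normal observation."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def flatten(values):
    """Yield the scalar elements of an arbitrarily nested list."""
    if isinstance(values, (list, tuple)):
        for item in values:
            yield from flatten(item)
    else:
        yield values


def array_normal_lpdf(array, mu, sigma):
    """Additive contribution of every element, whatever the array's shape."""
    return sum(normal_lpdf(y, mu, sigma) for y in flatten(array))


matrix_total = array_normal_lpdf([[0.0, 1.0], [2.0, 3.0]], 1.0, 2.0)
vector_total = array_normal_lpdf([0.0, 1.0, 2.0, 3.0], 1.0, 2.0)
```

Both totals agree because only the set of elements matters, not how they are arranged into rows and columns.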

Improvement of multivariate distributions (mentors: Ben Goodrich or Marcus Brubaker)

Additional statistical distributions (mentor: Marcus Brubaker)

Stan already makes it possible for users to use dozens of statistical distributions in their models, but perhaps the most common request is for additional ones. We would be especially interested in a GSOC project that implemented more general statistical distributions that include many well-known distributions as special cases. Or if the student would be more comfortable implementing a few well-known statistical distributions, that would be great as well.

Explicit parallel sampling framework (mentor: Marcus Brubaker)

Stan users should always run multiple Markov chains in order to accurately determine convergence. Given the proliferation of multi-core systems and clusters, these should be run simultaneously. Unfortunately, Stan requires users to explicitly launch multiple processes and these processes are unable to interact. An excellent GSOC project would be one which implemented a framework to allow parallelization of sampling. Initially the chains would interact only minimally if at all but the framework should be targeted to allow more complex interactions (for algorithms like parallel tempering, DE, etc). The ideal student should have a strong C/C++ background and experience with, or a strong desire to learn, task-level parallel programming APIs like MPI.
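As a toy illustration of why several chains are needed and how they can be launched together, the sketch below runs independent random-walk Metropolis chains on a standard normal target and compares them with the classic Gelman-Rubin potential scale reduction factor. None of this reflects Stan's samplers or a proposed design: `run_chain`, `run_chains`, and `rhat` are made-up names, and the chains here never interact.

```python
import math
import random
from functools import partial


def run_chain(seed, n_draws=2000, step=2.4):
    """Random-walk Metropolis on a standard normal target (toy sampler)."""
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
        draws.append(x)
    return draws


def run_chains(seeds, executor=None, **kwargs):
    """Run one chain per seed, serially or through an executor's map()."""
    mapper = executor.map if executor is not None else map
    return list(mapper(partial(run_chain, **kwargs), seeds))


def rhat(chains):
    """Gelman-Rubin statistic from between- and within-chain variance."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


chains = run_chains(range(4))
diagnostic = rhat(chains)
```

Passing a `concurrent.futures.ProcessPoolExecutor` as `executor` would spread the chains over several cores; a framework of the kind described above would go further and let the chains exchange information while running.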

Thanks for reaching out on our forums. What’s the arrangement
with NumFOCUS? Would we still apply for someone to work full
time on Stan? Is NumFOCUS “the mentor” here or is it the Stan
project or an actual person?

  • Bob

NumFOCUS is the organization that Google will have contact with. We try
to handle all the organization details of GSoC and remind you when
important dates come up.

Stan is then a project under the umbrella of NumFOCUS and you offer
projects to different students. You will also have to find mentors
who supervise the students on specific projects.

Your first step if you want to participate is to write a list of
possible projects for students on the NumFOCUS repository (create a PR)
using the skeleton provided above.

Does that make things clearer?

best Max

Perfectly clear. And great. Thanks.

  • Bob

Great. If you want to participate, let me know. The time commitment per
week and student for a mentor is about 5-10 hours, maybe more; it depends
on the student and the project. There should also be some short daily
communication with the student (we tell the students that not
communicating often is a reason to fail them).

Are there any Stan developers willing to put in 5–10 hours/week
mentoring a Google Summer of Code student? If so, Max Linke has
offered to handle the bureaucratic details through NumFOCUS for all the
NumFOCUS projects. See the first post in the following topic for
details on what Max needs (a pull request with topics):

GSOC 2017: NumFOCUS will be an umbrella organization

I’m not going to try to coordinate this centrally—please communicate
directly with Max if you’re interested. I’d be happy to put a blog
post on Andrew’s blog or a note to our users’ list if people have
projects they want to advertise.

I see the final deadline is February 9 — Max, when do you need
those pull requests by?

  • Bob

Yes, the application deadline is on February 9. It would be nice to have
the PRs at least 1-2 days before so that I can look them over and they
are ready when Google reviews the application.

But you can still submit a PR for projects after that. If we are
accepted, the last deadline is March 20, when student applications start.

I’d be interested in getting somebody to work on the protobuf stuff, or in
getting something solid for building models via the clang API. I’ll see if
I get to a PR before the deadline. K

Krzysztof: I’m happy to do the PR for protocol buffer
interfaces if you want to supervise the student. It’s a nice
project with a clear deliverable that doesn’t require a deep
dive into Stan math or C++ templating.

Now that the services callback refactor is all done, it should also
be more manageable from the interface side. I’m happy to help with
overall design, too, and we know Allen was also interested.

  • Bob

It would be great to have the protocol buffer writer. I agree it’s a
good, focused project.

Let’s do it. It’s a really good match for GSOC and I can supervise the
student. Yes if you can do the PR it would be great, I don’t have time to
figure that out right now. The three of us can coordinate over design.
Krzysztof

Will do. I’m going to be a well honed grant machine
in a couple of years.

  • Bob

Thanks!

Last call for anyone interested in GSOC through NumFOCUS.

Krzysztof: You need to add yourself as a mentor for GSOC through Google.
Max pointed to this page, but I don’t see how to use it to add you
as a mentor:

https://www.google-melange.com/archive/gsoc/2015

  • Bob

The NumFOCUS project skeleton is way too much detail for me
to fill in for an open-ended project I’m not supervising,
so I’m withdrawing my offer to make the pull request. Given that
Krzysztof said he didn’t have time and nobody else on the
project expressed interest, I think we’re out. In the past,
we’ve concluded that GSOC was more trouble than it’s worth.

In case you change your mind, Krzysztof, the relevant
repo is here:

https://github.com/numfocus/gsoc

and you want to copy 2017/ideas/skeleton.md into a new file and
create a pull request. I’m appending the file so you can see what
you’d be getting yourself into.

  • Bob

{{ Title }}

Abstract

{{ Abstract }}

Technical Details

{{
Long description of the project.
Must include all technical details of the projects like libraries involved.
}}

Schedule of Deliverables

May 1st - May 28th, Community Bonding Period

{{ Delieverables }}

May 29th - June 3rd

{{ Delieverables }}

June 5th - June 9th

{{ Delieverables }}

June 12th - June 16th

{{ Delieverables }}

June 19th - June 23rd, End of Phase 1

{{ Delieverables }}

June 26 - June 30th, Begin of Phase 2

{{ Delieverables }}

July 3rd - July 7th

{{ Delieverables }}

July 10th - July 14th

{{ Delieverables }}

July 17th - July 21st, End of Phase 2

{{ Delieverables }}

July 24th - July 28th, Begin of Phase 3

{{ Delieverables }}

July 31st - August 4th

{{ Delieverables }}

August 7th - August 11th

{{ Delieverables }}

August 14th - August 18th

{{ Delieverables }}

August 21st - August 25th, Final Week

{{ Delieverables }}

August 28th - August 29th, Submit final work

Future works

{{ Future works }}

Development Experience

{{ Development Experience }}

Other Experiences

{{ Experience }}

Why this project?

{{ Why you want to do this project? }}

Appendix

{{ Extra content }}

As far as I can tell (I just spent a few minutes Googling for this) this is
not a thing. The organization registers and then is supposed to manage
mentors. (?)
K

Ok, yeah, it’s pretty detailed. I’ll see if I have time but it’s not a
priority right now. K

Just for clarification, this is the correct idea skeleton. The previous post contains the student application skeleton.

In short, give the project a name. List beneficial experience for a student and write one paragraph that explains the project in more detail and outlines the minimum functionality that you want. Ideally, include some links to useful information for the project if you have some. The last part in the skeleton can be skipped if you like. Otherwise, direct students to some easy/beginner bugs or to a page on how to create a dev environment. They should do that anyway to know what they are getting into for the summer.

About mentors signing up: this will only be possible after we have been accepted. I will contact you again if that is the case.

I hope that clarifies things.

Thanks, Max. I’ll send you a pull request with that
template filled in.

I only saw the other one in the subdirectory.

  • Bob