Text templates for code generation (?)

Mostly just wanted to throw this idea out there: I ran across a simple (two files) C++ lib for text templates and it made me think that something like it might be useful for simplifying the code generation. From the few changes I made to the code generation on branches it seems like so much of what makes that process unwieldy is the generation of boilerplate that could be separated out into text templates.

Here’s the library:
https://github.com/catnapgames/NLTemplate/tree/master/NLTemplate

Here’s an example they have:

Template file 1 (test.txt, loaded by the C++ below):

{% include header.txt %}

<p>Items:</p>
{% block items %}<p>
  Title: {{ title }}<br/>
  Text: {{ text }}<br/>
  {% block detailblock %}Detail: {{ detail }}{% endblock %}
</p>{% endblock %}

Template file 2 (header.txt, pulled in by the include above):

<html><body>
<h1>{{ text }}</h1>

C++:

#include <iostream>
#include "NLTemplate.h"

using namespace std;
using namespace NL::Template;

int main(int, char *[] ) {
    const char *titles[ 3 ] = { "Chico", "Harpo", "Groucho" };
    const char *details[ 3 ] = { "Red", "Green", "Blue" };

    LoaderFile loader; // Let's use the default loader that loads files from disk.

    Template t( loader );

    t.load( "test.txt" );               // Load & parse the main template and its dependencies.
    t.set( "text", "Hello, world" );    // Set a top-level variable
    t.block( "items" ).repeat( 3 );     // We need to know in advance that the "items" block will repeat 3 times.

    // Let's fill in the data for the repeated block.
    for ( int i=0; i < 3; i++ ) {
        // Set title and text by accessing the variable directly
        t.block( "items" )[ i ].set( "title", titles[ i ] );
        t.block( "items" )[ i ].set( "text", "Lorem Ipsum" );

        // We can get a shortcut reference to a nested block
        Block & block = t.block( "items" )[ i ].block( "detailblock" );
        block.set( "detail", details[ i ] );

        // Disable this block for the first item in the list. Can be useful for opening/closing HTML tables etc.
        if ( i==0 ) {
            block.disable();
        }
    }

    t.render( cout ); // Render the template with the variables we've set above

    return 0;
}

Sorry, I don’t see how this is relevant. Conceptually, we’re doing the same thing – we have a C++ concept. We’re code generating the concept.

If you’re wondering why I’m using the term “concept” instead of “class”, it’s because there is a difference. If you were to hand-write a C++ header that’s compatible with Stan, it doesn’t need to extend a class. It just needs to have a set of methods defined.

This is a little out of date, but this describes the concept:

I agree that conceptually we are doing the same thing, but our code-generator needs C++ code to track every last little bit of indentation, every last paren and semicolon. That’s all boilerplate that could go into a text file. When I did a branch to look at making doubles versions of functions callable from R it got old fast to have to get the formatting/matching parens/semicolons/namespace brackets right. I think it would be easier to work with if that stuff got broken out into templates. That said, I think it would be a lot of work to change things over so I just wanted to throw the idea out there. Not arguing that we should do it. There’s a lot of other refactoring that would make the code generator easier to work with.

Any concrete suggestions?

This suggestion is just too vague for me to get anything actionable out of it. If you can think of things that make the codebase better and have well-defined starts, ends, and improvements, that helps a lot.

I didn’t make issues because I didn’t have time to work on them yet but here’s a sampling based on a quick run-through. It’s a 5k line file so at least we should have mercy on anyone else who wants to understand the code generator and break it up.

  1. the generic visitor stuff is currently mixed in the same file as the code that uses streams to generate very specific things like semicolons. The generic code could just get moved to another file to make the core of the code generation more readable.

  2. namespaces get generated separately from their end-brackets; the functions for adding namespaces should instead wrap a chunk of generated code in a namespace. Same with class declarations (and maybe more?)

  3. code for calculating valid sizes should be separate from code generation of statements to check for valid sizes.

  4. generate_validate_positive code-generates a call to a function called “validate_non_negative_index” so the name you call doesn’t match the name you generate (literally or conceptually). I think there’s more of this.

  5. few of the functions have any documentation even when they make non-trivial assumptions (which are not clear from the name).

  6. There are a bunch of comments like “// see member_var_decl_visgen cut & paste” and at least some of them refer to actual cut-and-paste that should just go into a function.

  7. code-generation for type inits could be its own file, ditto for validation functions.

  8. ‘generate_type’ actually generates array types (pretty sure…), so it generates code for array of array of … of type. A clearer name for this specific function would help, but so would looking over the rest of them w.r.t. naming.

  9. generate_located_statement and generate_located_statements are mostly cut-and-paste; one of those should call the other.

  10. In ‘generate_function_template_parameters’ there’s a bunch of switching on booleans that indicate function type but you have to read the code closely to figure out how many function types there are, this could just be a switch on a well-named enum. Similar in ‘generate_function_arguments’ and ‘generate_functor_arguments’.

  11. generate_cpp is actually generating the ‘class concept’, that could be more explicit, especially since you might want to code-generate other cpp (like, e.g., just the functions block or just the generated quantities block).
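
To make item 2 concrete, here is a minimal sketch of a wrapping helper. The name wrap_in_namespace is hypothetical and is not something in the current generator; the point is only that the opening and closing of the namespace come from one function, so they cannot drift apart.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the existing generator): returns `body`
// wrapped in a namespace, so the opening line and the closing brace are
// emitted by the same function.
std::string wrap_in_namespace(const std::string& name,
                              const std::string& body) {
  std::ostringstream o;
  o << "namespace " << name << " {\n"
    << body
    << "}  // namespace " << name << "\n";
  return o.str();
}
```

The caller would generate the body into a string first and pass it in, and the same shape would presumably work for class declarations.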

Great! We’re now getting somewhere. We can start adding things as issues and knocking them off. We can comment on how to implement them on GitHub once issues are there. No need to add them all now. I’m glad you wrote out these improvements.

Your optimism is contagious! :)

Do you have an example of what this would look like in
our code? That looks complicated. And generally, I’m
against adding new dependencies unless we absolutely need
them (the lib’s MIT licensed, so it’s not ruled out on
those grounds).

  • Bob

What I’m missing is what’s the boilerplate? Is it generating
a statement that gets terminated with a semicolon? Generating
an indented block of code? Because if the semicolons need to
be inserted by hand, just in an offset piece of notation, it’s
just moving the problem around.

The other thing we’ve really wanted to avoid is runtime errors
from failed template substitutions. We went down the path
originally of using something like printf from boost, but it
was a huge pain when runtime errors occurred (bad testing because
most of it was under our control).

  • Bob

For the pieces of code with the least boilerplate we could keep the current approach. Here’s a sketch of what I’m thinking, using the write_array method as an example:

Current code:

   void generate_write_array_method(const program& prog,
                                     const std::string& model_name,
                                     std::ostream& o) {
      o << INDENT << "template <typename RNG>" << EOL;
      o << INDENT << "void write_array(RNG& base_rng__," << EOL;
      o << INDENT << "                 std::vector<double>& params_r__," << EOL;
      o << INDENT << "                 std::vector<int>& params_i__," << EOL;
      o << INDENT << "                 std::vector<double>& vars__," << EOL;
      o << INDENT << "                 bool include_tparams__ = true," << EOL;
      o << INDENT << "                 bool include_gqs__ = true," << EOL;
      o << INDENT
        << "                 std::ostream* pstream__ = 0) const {" << EOL;
      o << INDENT2 << "vars__.resize(0);" << EOL;
      o << INDENT2
        << "stan::io::reader<double> in__(params_r__,params_i__);"<< EOL;
      o << INDENT2 << "static const char* function__ = \""
        << model_name << "_namespace::write_array\";" << EOL;
      suppress_warning(INDENT2, "function__", o);

      // declares, reads, and sets parameters
      generate_comment("read-transform, write parameters", 2, o);
      write_array_visgen vis(o);
      for (size_t i = 0; i < prog.parameter_decl_.size(); ++i)
        boost::apply_visitor(vis, prog.parameter_decl_[i].decl_);

      // this is for all other values
      write_array_vars_visgen vis_writer(o);

     ...

A sketch of part of a write_array template is below. I just did this by taking the version from bernoulli.hpp and editing it so there’ll be some mistakes but hopefully it gets the message across. I’ll do it in chunks so I can comment. Boilerplate:

    template <typename RNG>
    void write_array(RNG& base_rng__,
                     std::vector<double>& params_r__,
                     std::vector<int>& params_i__,
                     std::vector<double>& vars__,
                     bool include_tparams__ = true,
                     bool include_gqs__ = true,
                     std::ostream* pstream__ = 0) const {
        vars__.resize(0);
        stan::io::reader<double> in__(params_r__,params_i__);

One bit of template to be filled in for the model name:

        static const char* function__ = "{{model_name}}_namespace::write_array";

Back to boilerplate:

        (void) function__; // dummy call to supress warning
        // read-transform, write parameters

Here’s the second place we get to something to fill in. The write_array_visgen currently gets used to generate declarations for parameters; instead it would generate just the type/name/constraint (is there anything else?) and we would use the result to fill in this template piece:

  {% block  write_visgen %}
        {{type}} {{name}} = in__.scalar_{{constraint}}_constrain(0,1);
        vars__.push_back({{name}});
  {% endblock %}

Maybe there would need to be some variations for the different container types. Then something similar for write_array_vars_visgen, then back to boilerplate:

        if (!include_tparams__) return;
        // declare and define transformed parameters

        double lp__ = 0.0;
        (void) lp__; // dummy call to supress warning
        stan::math::accumulator<double> lp_accum__;


        try {
        } catch (const std::exception& e) {
            stan::lang::rethrow_located(e,current_statement_begin__);
            // Next line prevents compiler griping about no return
            throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
        }

    ...

That sort of thing. I’m probably missing some stuff that needs to be provided by template arguments.

Oh yeah, the “block” syntax is for repeating elements, so how many times it appears in the final result depends on what the C++ side does.
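
As a rough sketch of what filling that write_visgen piece amounts to, here is a tiny stand-in for a template engine. To be clear, this is not NLTemplate’s API: fill and write_visgen_item are hypothetical names made up for illustration, and the only assumption is that the visitor hands back a type, a name, and a constraint as strings.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a template engine: replaces each {{key}} in
// `tmpl` with its value from `vars` and throws if a key has no value.
std::string fill(const std::string& tmpl,
                 const std::map<std::string, std::string>& vars) {
  std::string out;
  std::string::size_type pos = 0;
  while (true) {
    const std::string::size_type open = tmpl.find("{{", pos);
    if (open == std::string::npos) {
      out += tmpl.substr(pos);  // no more placeholders, copy the tail
      return out;
    }
    const std::string::size_type close = tmpl.find("}}", open + 2);
    if (close == std::string::npos)
      throw std::runtime_error("unterminated {{ in template");
    out += tmpl.substr(pos, open - pos);
    const std::string key = tmpl.substr(open + 2, close - open - 2);
    const std::map<std::string, std::string>::const_iterator it
        = vars.find(key);
    if (it == vars.end())
      throw std::runtime_error("no value for template variable: " + key);
    out += it->second;
    pos = close + 2;
  }
}

// One expansion of the write_visgen piece per parameter declaration.
std::string write_visgen_item(const std::string& type,
                              const std::string& name,
                              const std::string& constraint) {
  static const char* tmpl =
      "{{type}} {{name}} = in__.scalar_{{constraint}}_constrain(0,1);\n"
      "vars__.push_back({{name}});\n";
  std::map<std::string, std::string> vars;
  vars["type"] = type;
  vars["name"] = name;
  vars["constraint"] = constraint;
  return fill(tmpl, vars);
}
```

The C++ side would call write_visgen_item once per parameter and concatenate the results. Note that a misspelled variable only shows up when the generator runs, not when it compiles.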

I see — you just mean using something like printf.
I see what the intent of this is and how it can be more
readable than what we’re doing now with << (which I agree
is dreadful):

{% block write_visgen %}
{{type}} {{name}} = in__.scalar_{{constraint}}_constrain(0,1);
vars__.push_back({{name}});
{% endblock %}

Presumably you’d need to indent that by hand to match the
context in which it’s used or is there a mechanism to indent
blocks?

What I didn’t see is where this pattern goes in our code and how
the free variables type, name, and constraint get instantiated.

Can you just show me how to rewrite this function using the
template lib you’re suggesting?

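std::string generate(const std::string& type, const std::string& name,
                     const std::string& constraint) {
  std::stringstream s;
  // declaration that reads the constrained scalar
  s << type << " " << name << " = in__.scalar_" << constraint
    << "_constrain(0, 1);"
    << EOL
    << INDENT
    // statement that appends it to the output vector
    << "vars__.push_back(" << name << ");"
    << EOL;
  return s.str();
}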

Or if it’s easier, just write to the stream rather than having
to construct a string.

Thanks.

  • Bob

I’m happy to do a whole example for that function when I get a chance. There’s no free lunch on indentation, though: if you generate an independent code block and later want to place it at variable indents, you need to keep it as a string with newlines and then either write it via a stream that adds indentation (turning each newline into a newline followed by x spaces) or have a function that indents it prior to placing it in a bigger block.
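
For the second option, a minimal sketch of such an indenting function might look like this. The name indent and its signature are hypothetical; nothing like it is assumed to exist in the generator today.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: prefixes every non-empty line of `block` with
// `spaces` spaces so a generated chunk can be placed at any nesting depth.
std::string indent(const std::string& block, std::size_t spaces) {
  const std::string pad(spaces, ' ');
  std::string out;
  bool at_line_start = true;
  for (std::string::size_type i = 0; i < block.size(); ++i) {
    const char c = block[i];
    if (at_line_start && c != '\n')
      out += pad;  // blank lines stay blank, no trailing whitespace
    out += c;
    at_line_start = (c == '\n');
  }
  return out;
}
```

A filled-in template would be indented once, right before it gets dropped into the enclosing block.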

I just want to see what the code looks like for a simple
example. I still don’t understand where the template goes
or how it’ll get called.

I don’t know how many of those templates are actually boilerplate
in the sense of being reused. What I’d like to accomplish is
higher-level code consolidation. So there’s some notion of
implementing a block of code, or a way to generate statement plus semicolon.
Basically something that’d look like generating from an AST
for C++ :-)

  • Bob
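
One possible reading of “statement plus semicolon” and “a block of code” as small building blocks is sketched below. The names statement and braced_block are hypothetical, and the two-space indent is an arbitrary choice for the sketch, not the generator’s INDENT.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: the statement terminator is added in exactly one place.
std::string statement(const std::string& expr) {
  return expr + ";\n";
}

// Hypothetical helper: wraps `body` in braces after `header`, indenting each
// non-empty line of the body by two spaces.
std::string braced_block(const std::string& header, const std::string& body) {
  std::istringstream lines(body);
  std::string line;
  std::string out = header + " {\n";
  while (std::getline(lines, line)) {
    if (!line.empty())
      out += "  " + line;
    out += "\n";
  }
  out += "}\n";
  return out;
}
```

With these, braced_block("if (!include_tparams__)", statement("return")) produces a three-line if block with the semicolon and both braces supplied by the helpers.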

In their examples they load the template from a separate text file, just like any of the web templating frameworks (e.g., Python’s Mako or Jinja2). I’m also not sure exactly how much of our code is boilerplate… so I guess I don’t know if there’s really justification for pursuing this further yet.

I think the refactorings you’re talking about are really the place to start if we want the biggest bang for our buck. I got as far as developing a series of complaints (outlined for Daniel above) but I’m not sure if there are design issues to tackle first.

What I really don’t get is how the re-design you’re thinking of is different from straight up generating an AST for C++ and letting something like clang generate the C++ code. Is the big dependency the thing you’re trying to avoid? We’re already dependent on having a compiler installed…

What about that dependency? Should we keep in the back of our minds the hope that we could one day get rid of the dependency on an external compiler and use some sort of embedded JIT internal compilation step? (See this stackoverflow question for background.)

(I’m justifying this off-topic comment with the thought that all C++ code is potentially code generated by one or more intermediate languages.)

(edited for clarity)

You mean potentially replaced(?)