Logging in Stan services

Tl;dr: my recommendation is to have our own interface with logging levels that the services pass complete messages.

I spent the evening reading up on:

  • Java logging design
  • C++ logging systems: log4cplusplus, log4cxx, glog, boost log, pantheons, and a few others
  • how Stan’s services would use logging

I think all the packages are aligned on logging levels. I think we stick to those levels for now.

Where the C++ libraries differed was down to implementation. Some had function calls to log, others allowed you to stream into it. Some had pattern substitution, most didn’t.

I’m also thinking about @bgoodri’s use case of filtering based on messages. This seems to be calling out for having a secondary way of filtering the messages. In the Java logging utility, there is a name option that has a hierarchy. I think having a name is a decent way to handle that case.

I think if we ever want to move towards multitthreading, we want to keep the messages whole.

So my suggestion is to have our custom API for logging which includes a log level, name, and message. Thoughts?

(Sorry, won’t be at the meeting tomorrow.)

That was a little too terse for me to understand. Did you want to use one of the existing C++ packages or write something ourselves? If ourselves, then what will the API look like from the client code perspective?

Would there be some kind of singleton logger? Would it be set up with filters, like tee-ing and filtering by name? Is the message API something that uses << or takes a string argument?

I agree that we’d need to synchronize messages from multiple threads. The question is whether this is done on the client side or not. If we’re streaming in dribs and drabs, then it’d have to be done in the logger itself, otherwise it could be done in whatever handles the log messages.

Rather than starting a new thread, I’ll pile on here about the merge that just happened.

I think it’s going to be very problematic efficiency-wise to have a logging framework that takes strings rather than streams. The fundamental problem is that you have to do something like:

logger.info(a + b + c);

and because C++ is pass-by-value, that expression a + b + c gets evaluated. So that leads everyone with a logging framework like this to clutter up all their code with efficiency preserving guards,

if (logger.is_info())
  logger.info(a + b + c);

it doubles the testing (for the log level), but it prevents the expensive string concatenations that would otherwise be done eagerly.

What we’d really want is something that that lazily did this or had a way to ignore. If we had

std::ostream& logger.info();

and the usage pattern

logger.info() << a << b << c;

we couldn’t solve the problem of evaluating the a, b, and c expressions, but at least we wouldn’t need to concatenate. If the logger wasn’t up to the info level, it’d return a dummy stream. But that still doesn’t solve the bigger problem.

The problem it does solve, though is this one:

double x;
logger.info("value of x is " + x);  // FAIL, can't add arithmetic and string types!

which we’d have to code up as

std::stringstream ss;
ss << "value of x is " + x;
logger.info(ss.str());

which of course needs to be wrapped for efficiency, making the entire idiom this

if (logger.is_info()) {
  std::stringstream ss;
  ss << "value of x is " + x;
  logger.info(ss.str());
}

I don’t know of a good solution to this problem of not doing redundant work if the logger isn’t reporting the relevant logging level.

Also, I didn’t understand Daniel’s concern about breaking messages up for RStan.

RStan, there's usually a string that prefixes each line that indicates what chain the message comes from. So far, the message unit is a line, not a chunk. So if we wanted to use something like a stream operator with std::endl as a marker for the end of the line, RStan's logger would need to inspect the stream as it comes in and decide how to break apart lines.

I don’t quite get the efficiency thing here: Daniel’s code doesn’t include anything like logger.is_info. If the codes wants to log info it does, if it doesn’t it never bothers with the logger. That also means you know when you’re constructing the string. The one part I get is that free-form strings manipulated in efficiency-relevant pieces of code seem like a bad idea… but where do we do that?

So this use case kills me because it’s using a string for a key seems really heavy inside C++, outside of cpp it seems fine. This is the use case we have all over the place:

std::stringstream ss;
ss << "stepsize is " + x;
logger.info(ss.str());

How about taking Daniel’s current approach but for key-value stuff doing this (runs in cpp.sh):

// Example program
#include <iostream>
#include <string>

namespace stan {
    namespace out {
        struct stepsize {   
          static const std::string key;
        };
        const std::string stepsize::key = "stepsize";
    }
}

template <typename T>
void logger_A(double value) {
    std::cout << T::key << " " << value << std::endl;
}

template <typename T>
void logger_B(T key, double value)  {
    std::cout << T::key << " " << value << std::endl;
}

int main()
{
  logger_A<stan::out::stepsize>(3.3);
  logger_B(stan::out::stepsize(), 3.3);
}

I’m partial to logger_A. I’m pretty sure this is only as expensive as passing the double.

Sorry, I was being sloppy with what I was saying. When I mentioned breaking up messages for RStan and talking about efficiency, I meant the same thing. I didn’t put every write inside an if (and I didn’t include something like is_info() in the logging interface) because we currently call the logger so infrequently. Right now, the logger gets called (a few times at a time) once per iteration if there’s some sort of rejection and how ever often we print out iterations. The way we currently use the logger, I don’t think it’s a practical concern. As we go forward and can put out more information, this might differ.

More details:

In RStan, it used to spit out messages per chain, so something like:

chain 1, message 1
chain 1, message 2

If we had

std::ostream& logger.info();

and the usage pattern

logger.info() << a << b << c;

Then here’s a few examples of how that string could have been created:

logger.info() << “message 1” << std::endl;
logger.info() << “message 2” << std::endl;


2. 

logger.info() << “message 1” << std::endl << “message 2” << std::endl;


3. 

logger.info() << “message 1\nmessage2” << std::endl;


Under the current behavior, we'd want the output to be the same across the three different ways to create that string. I'd say for the first two, it's fairly straightforward: the logger would need to have state. When it's the start of a message, it would insert the prefix string, here "`chain 1,`" and then continue to append whatever gets passed in until it receives "`std::endl`."

With the third, it would also need state, but in addition, it would need to split the message at "`\n`" and insert `chain 1,` after the newline.

I didn't know how to do that part efficiently. With the current design, where a complete message / line is passed in as the argument, it's easy to add a prefix string before each message.

@sakrejda, I don’t think we need to worry about this for the logger. (I think we need to worry about it for the other types of output.) For the logging, there really aren’t key-value pairs. If you had a concrete example of where to use key-value pairs for messages in the interfaces, please let me know. (The example you pulled up is just for configuration and it’s supposed to be an informative error message for the user to figure out what setting is incorrect.)

Now that I’ve familiarized myself with different logging frameworks, two possible next steps are:

  1. adding is_info() type methods so we can avoid evaluating the string (@Bob_Carpenter’s point)
  2. adding topics that can be subscribed to (@sakrejda, this addresses your key-value points)

I’m not sure we’d gain much by adding both to the logger, unless we want to change the logger behavior.

When I reviewed the logger PR I don’t think it came up in the code. It does come up with the writers (e.g.-output of the mass matrix diagonal, as a string (!))

1 Like

YUP! Logger is now done! We can move towards removing strings from the other writers.

It’s going to need to include some kind of boolean methods to tell what the logger level is if we want to avoid string manipulation when not necessary. That’s how you know whether to construct the string.

Generally logging messages get restricted to not having newlines in them so that line-based processors can deal with the output and boilerplate can go at the front of lines.

So I don’t think this is a concern.

Does the current API for logger.info("foo") automatically append a newline? I think it should, because it means the user doesn’t have to repeat the newline and write things like logger.info("foo" + "\n") or logger.info("foo\n").

If we did expose a stream, I’d like to have the usage auto-include a newline, so that

logger.info() << "msg1";
logger.info() << "msg2";

is synonymous with

logger.info("msg1");
logger.info("msg2");

By the way, I’m not saying we need to put checks in the code to make sure there aren’t newlines. This is at the programmer interface level, so we should have unit tests to make sure we’re not doing it.

If we restrict log messages to those without newlines then it seems like we could expose streams and still allow R to construct a logger callback that prepends "chain 1: " or whatever to everything passed into the stream, right?

What about this solution: c++ - A function with variable number of arguments with known types, the c++11 way - Stack Overflow
That lets you delay string mangling till after the call:

void foobar(std::initializer_list<std::string> values)
{
    for (auto& value : values) {
        // do something with value
        std::cout << value << '\n';
    }
}

foobar({ "Hello", "World", "...", });

The linked Stackoverflow answer also has a template-based answer.

Awesome. We can do that.

That’s not going to work. We only need to build strings when we serialize things like doubles and matrices. The initializer list requires the template argument to be the same.

Good point, the template version will work though—you can have a parameter pack with varying types.

Funny you should mention this, as it’s exactly the solution I built into the Stan language itself—take variable number of arguments of any type to the print() and reject() statements—I figured that’d be easier than introducing a new function call syntax.

I still haven’t fully understood parameter packs. Is there a way to pass by reference? If not, we’d have to copy just to call it, but if so, this should work pretty easily with a recursive template function to do the actual streaming.

It looks to be a c++ level of straightforward: https://stackoverflow.com/questions/9103076/how-to-pass-a-reference-through-a-parameter-pack

As a consequence of splitting apart the logger from the writers, we really shouldn’t have the multi-line problem any more.

I think we can still generate some through the check functions, but I think most of them don’t print Eigen::Matrix any more.

Regarding efficiency, yes, we should be responsible. If you (@sakrejda and @Bob_Carpenter) think we should add an is_info() function or use template magic to be efficient, we can and should – it’s not too difficult. I just want to mention again that this isn’t a real bottleneck, I think. On a typical run of 2000 iterations with refresh at 100, there are about 5 calls for a message about timing, 22 calls about the iteration number, and 5 calls for the elapsed time. Per iteration, we get another half dozen calls, but I’ve always thought the info was useful + exception handling probably takes some time. (I’m just talking logging, not the rest of the writers.)