Logging and error reporting

Having a good system for logging and error reporting is essential to making a large project like Zulip successful. Without reliable error reporting, one has to rely solely on bug reports from users in order to produce a working product.

Our goal as a project is to have zero known 500 errors on the backend and zero known JavaScript exceptions on the frontend. While there will always be new bugs being introduced, that goal is impossible without an efficient and effective error reporting framework.

We provide integration with Sentry to make it easier for very large installations like zulip.com to manage their exceptions and ensure they are all tracked down, but our default email-based system is great for small installations.

Backend error reporting

The Django framework provides much of the infrastructure needed by our error reporting system, most importantly the ability to email the server's administrators about 500 errors and to filter sensitive data (such as passwords) out of those reports.

Since 500 errors in any Zulip server are usually a problem the server administrator should investigate and/or report upstream, we have this email reporting system configured to report errors by default.
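
For illustration, the heart of this machinery is Django's standard AdminEmailHandler attached to the django.request logger. The sketch below is a minimal, hypothetical configuration, not Zulip's actual one, which layers rate limiting and custom handlers on top of the same building blocks.

```python
# Minimal sketch only: Zulip's real logging configuration is more
# elaborate, but it builds on the same Django machinery shown here.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # Emails everyone listed in the ADMINS setting whenever an
        # ERROR-level record is logged for a request (i.e. a 500).
        "mail_admins": {
            "level": "ERROR",
            "class": "django.utils.log.AdminEmailHandler",
        },
        # Also keep a copy of warnings and errors on disk.
        "errors_file": {
            "level": "WARNING",
            "class": "logging.FileHandler",
            "filename": "/var/log/zulip/errors.log",
        },
    },
    "loggers": {
        "django.request": {
            "handlers": ["mail_admins", "errors_file"],
            "level": "WARNING",
            "propagate": True,
        },
    },
}
```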

Zulip's optional Sentry integration will aggregate errors to show which users and realms are affected, any logging which happened prior to the exception, local variables in each frame of the exception, and the full request headers which triggered it.

Backend logging

Django's logging system uses the standard Python logging infrastructure. We have configured it so that logging.exception and logging.error get emailed to the server maintainer, while logging.warning will just appear in /var/log/zulip/errors.log. Lower log levels just appear in the main server log (as well as in the log for the corresponding process, be it django.log for the main Django processes or the appropriate events_* log file for a queue worker).
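
In practice, this means the level you pick in backend code decides who sees the message. The following is a hypothetical sketch (the function, payload fields, and messages are made up for illustration):

```python
import logging

logger = logging.getLogger(__name__)

def process_presence_update(payload: dict) -> None:
    """Hypothetical example: the field names and messages are illustrative."""
    try:
        user_id = payload["user_id"]  # raises KeyError on a malformed payload
    except KeyError:
        # logger.exception (like logger.error, but with a traceback) is
        # emailed to the server maintainer in production and also lands
        # in /var/log/zulip/errors.log.
        logger.exception("Malformed presence payload: %r", payload)
        return

    if payload.get("status") not in ("active", "idle"):
        # logger.warning appears in /var/log/zulip/errors.log, but is
        # not emailed.
        logger.warning("Unexpected presence status: %r", payload.get("status"))

    # Lower levels only appear in the main server log and the log for
    # the corresponding process (django.log or an events_* log file).
    logger.info("Processed presence update for user %s", user_id)
```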

Backend logging format

The main Zulip server log contains a line for each backend request. It also contains warnings, errors, and the full tracebacks for any Python exceptions. In production, it goes to /var/log/zulip/server.log; in development, it goes to the terminal where you run run-dev.py.

In development, it's good to keep an eye on the run-dev.py console as you work on backend changes, since it's a great way to notice bugs you just introduced.

In production, one usually wants to look at errors.log for errors, since the main server log can be very verbose; that verbose main server log is, however, extremely valuable for investigating performance problems.

2016-05-20 14:50:22.056 INFO [zr] 127.0.0.1       GET     302 528ms (db: 1ms/1q) (+start: 123ms) / (unauth@zulip via ?)
[20/May/2016 14:50:22]"GET / HTTP/1.0" 302 0
2016-05-20 14:50:22.272 INFO [zr] 127.0.0.1       GET     200 124ms (db: 3ms/2q) /login/ (unauth@zulip via ?)
2016-05-20 14:50:26.333 INFO [zr] 127.0.0.1       POST    302  37ms (db: 6ms/7q) /accounts/login/local/ (unauth@zulip via ?)
[20/May/2016 14:50:26]"POST /accounts/login/local/ HTTP/1.0" 302 0
2016-05-20 14:50:26.538 INFO [zr] 127.0.0.1       POST    200  12ms (db: 1ms/2q) (+start: 53ms) /api/v1/events/internal [1463769771:0/0] (8@zulip via internal)
2016-05-20 14:50:26.657 INFO [zr] 127.0.0.1       POST    200  10ms (+start: 8ms) /api/v1/events/internal [1463769771:0/0] (8@zulip via internal)
2016-05-20 14:50:26.959 INFO [zr] 127.0.0.1       GET     200 588ms (db: 26ms/21q) / [1463769771:0] (8@zulip via website)

The format of this output is:
* Timestamp
* Log level
* Logger name, abbreviated as "zr" for these Zulip request logs
* IP address
* HTTP method
* HTTP status code
* Time to process
* (Optional perf data details, e.g. database time/queries, memcached time/queries, Django process startup time, Markdown processing time, etc.)
* Endpoint/URL from zproject/urls.py
* "email via client" showing user account involved (if logged in) and the type of client they used ("web", "Android", etc.)

The performance data details are particularly useful for investigating performance problems, since one can see at a glance whether a slow request was caused by delays in the database, in the Markdown processor, in memcached, or in other Python code.
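
For example, because the format is regular, a few lines of Python can pull the total request time out of each line to find slow requests worth investigating. This sketch is illustrative only and not part of Zulip; the regular expression covers only the leading fields shown above.

```python
import re

# Illustrative parser for the request-log lines shown above; the perf
# details, endpoint, and user info are left in a trailing "rest" blob.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>\w+) \[(?P<logger>\w+)\] (?P<ip>\S+)\s+"
    r"(?P<method>\S+)\s+(?P<status>\d{3})\s+(?P<total_ms>\d+)ms "
    r"(?P<rest>.*)$"
)

def slow_requests(lines, threshold_ms=200):
    """Yield (total_ms, detail) for request lines slower than threshold_ms."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and int(match["total_ms"]) >= threshold_ms:
            yield int(match["total_ms"]), match["rest"]

sample = (
    "2016-05-20 14:50:26.959 INFO [zr] 127.0.0.1       GET     200 588ms "
    "(db: 26ms/21q) / [1463769771:0] (8@zulip via website)"
)
print(list(slow_requests([sample])))
# [(588, '(db: 26ms/21q) / [1463769771:0] (8@zulip via website)')]
```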

One useful thing to note, however, is that the database time is only the time spent connecting to and receiving a response from the database. Especially when responses are large, there can often be a great deal of Python processing overhead to marshal the data from the database into Django objects that is not accounted for in these numbers.
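
One way to see this gap is to compare the query time Django records against the wall-clock time of the surrounding Python code. The sketch below is purely illustrative: it uses Django's CaptureQueriesContext test utility (which enables query logging even when DEBUG is off) and assumes a Zulip model like UserProfile just to have something to query.

```python
import time

from django.db import connection
from django.test.utils import CaptureQueriesContext

from zerver.models import UserProfile  # any large queryset would do

# The query log only records time spent talking to the database;
# turning the returned rows into Django model objects is ordinary
# Python work that shows up in the total request time instead.
with CaptureQueriesContext(connection) as ctx:
    start = time.monotonic()
    users = list(UserProfile.objects.all()[:1000])  # marshals rows into objects
    total_ms = (time.monotonic() - start) * 1000

db_ms = sum(float(q["time"]) for q in ctx.captured_queries) * 1000
print(f"total: {total_ms:.1f}ms, db: {db_ms:.1f}ms, "
      f"queries: {len(ctx.captured_queries)}")
```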

Blueslip frontend error reporting

We have a custom library, called blueslip (named after the form used at MIT to report problems with the facilities), that takes care of reporting JavaScript errors. In production, this means emailing the server administrators, though the setting controlling this, BROWSER_ERROR_REPORTING, is disabled by default: most problems are unlikely to be addressable by a system administrator, and it is very hard to keep JavaScript error reports from being at least somewhat spammy, given the variety of browser versions and extensions that users might have. In development, this means displaying a highly visible overlay over the message view area, so that exceptions hit while testing a new feature are hard to miss.

Blueslip supports several error levels:
* throw new Error(…): For fatal errors that cannot be easily recovered from. We try to avoid using it, since it kills the current JS thread rather than returning execution to the caller.
* blueslip.error: For logging events that are definitely caused by a bug and thus sufficiently important to be reported, but where we can handle the error without creating major user-facing problems (e.g. an exception when handling a presence update).
* blueslip.warn: For logging events that are a problem but not important enough to send an email about in production. They are, however, highlighted in the JS console in development.
* blueslip.log (and blueslip.info): Logged to the JS console in development and also to the blueslip log in production. Useful for data that might help discern what state the browser was in during an error (e.g. whether the user was in a narrow).
* blueslip.debug: Similar to blueslip.log, but not printed to the JS console in development.

Frontend performance reporting

In order to make it easier to debug potential performance problems in the critically latency-sensitive message sending code path, we log and report to the server a set of timing measurements whenever a message is sent.

The code is all in zerver/lib/report.py and static/js/sent_messages.js.

We have similar reporting for the time it takes to narrow / switch to a new view: