Analytics

Zulip has a cool analytics system for tracking various useful statistics. It currently powers the /stats page and, over time, will power other features, like showing usage statistics for the various streams. It is designed around the following goals:

There are a few important things you need to understand in order to effectively modify the system.

Analytics backend overview

There are three main components:

The next several sections will dive into the details of these components.

The *Count database tables

The Zulip analytics system is built around collecting time series data in a set of database tables. Each of these tables has the following fields:

There are four tables: UserCount, StreamCount, RealmCount, and InstallationCount. Every CountStat is initially collected into UserCount, StreamCount, or RealmCount. Every stat in UserCount and StreamCount is aggregated into RealmCount, and then all stats are aggregated from RealmCount into InstallationCount. So for example, "messages_sent:client:day" has rows in UserCount corresponding to (user, end_time, client) triples. These are summed to rows in RealmCount corresponding to triples of (realm, end_time, client). And then these are summed to rows in InstallationCount with totals for pairs of (end_time, client).
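The aggregation chain in this example can be sketched in plain Python. This is only an illustration: the rows, user and realm names, and the `sum_rows` helper are made up for the example, and the sketch says nothing about how the real system performs the sums.

```python
from collections import defaultdict

# Hypothetical UserCount rows for "messages_sent:client:day",
# keyed by (user, realm, end_time, client).
user_count = {
    ("alice", "realm_a", "2024-01-02", "web"): 5,
    ("bob", "realm_a", "2024-01-02", "web"): 3,
    ("bob", "realm_a", "2024-01-02", "mobile"): 2,
    ("carol", "realm_b", "2024-01-02", "web"): 4,
}


def sum_rows(rows, key_fn):
    """Sum row values after mapping each row key to a coarser key."""
    totals = defaultdict(int)
    for key, value in rows.items():
        totals[key_fn(key)] += value
    return dict(totals)


# UserCount -> RealmCount: drop the user, keep (realm, end_time, client).
realm_count = sum_rows(user_count, lambda k: (k[1], k[2], k[3]))

# RealmCount -> InstallationCount: drop the realm, keep (end_time, client).
installation_count = sum_rows(realm_count, lambda k: (k[1], k[2]))
```

Each level is derived only from the level directly below it, which mirrors the UserCount to RealmCount to InstallationCount flow described above.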

Note: In most cases, we do not store rows with value 0. See Performance strategy below.
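One consequence is that code reading these tables has to treat a missing row as 0. A minimal sketch of that idea, assuming hypothetical hourly rows and a made-up `fill_series` helper (neither is taken from the Zulip codebase):

```python
from datetime import datetime, timedelta


def fill_series(rows, start, end, step):
    """Return one value per end_time in (start, end], using 0 where no row exists."""
    series = []
    t = start + step
    while t <= end:
        series.append(rows.get(t, 0))
        t += step
    return series


# Only the non-zero hours are stored.
stored = {
    datetime(2024, 1, 1, 1): 7,
    datetime(2024, 1, 1, 4): 2,
}

values = fill_series(
    stored,
    start=datetime(2024, 1, 1, 0),
    end=datetime(2024, 1, 1, 4),
    step=timedelta(hours=1),
)
```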

CountStats

CountStats declare what analytics data should be generated and stored. The CountStat class definition and instances live in analytics/lib/counts.py. These declarations specify at a high level which tables should be populated by the system and with what data.

The FillState table

The default Zulip production configuration runs a cron job once an hour that updates the *Count tables for each of the CountStats in the COUNT_STATS dictionary. The FillState table simply keeps track of the last end_time that we successfully updated each stat. It also enables the analytics system to recover from errors (by retrying) and to monitor that the cron job is running and running to completion.
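The bookkeeping role of FillState can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: `process_stat`, `is_stale`, and the in-memory `fill_state` dictionary are hypothetical names standing in for the real cron job and table.

```python
from datetime import datetime, timedelta


def process_stat(fill_state, prop, now, interval, update_counts):
    """Advance one stat from its last recorded end_time up to `now`.

    fill_state maps a stat's property name to the last end_time that was
    filled successfully. update_counts(prop, end_time) does the real work
    and may raise; progress is recorded only after it returns.
    """
    end_time = fill_state[prop] + interval
    while end_time <= now:
        update_counts(prop, end_time)
        fill_state[prop] = end_time  # record success before moving on
        end_time += interval


def is_stale(fill_state, prop, now, max_lag):
    """Monitoring check: has this stat fallen too far behind?"""
    return now - fill_state[prop] > max_lag


# Simulate one failed run followed by a retry.
filled = []
failures = {datetime(2024, 1, 3): 1}  # fail once for this end_time


def update_counts(prop, end_time):
    if failures.get(end_time, 0) > 0:
        failures[end_time] -= 1
        raise RuntimeError("simulated failure")
    filled.append(end_time)


prop = "messages_sent:client:day"
fill_state = {prop: datetime(2024, 1, 1)}
now = datetime(2024, 1, 4)
day = timedelta(days=1)

try:
    process_stat(fill_state, prop, now, day, update_counts)
except RuntimeError:
    pass  # the next run resumes from the recorded end_time

state_after_failure = fill_state[prop]
process_stat(fill_state, prop, now, day, update_counts)
```

Because progress is stored per stat and only after a successful update, a later run can resume where the failed one stopped, and a monitoring check only needs to compare the stored end_time with the current time.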

Performance strategy

An important consideration with any analytics system is performance, since it's easy to end up processing a huge amount of data inefficiently and needing a system like Hadoop to manage it. For the built-in analytics in Zulip, we've designed something lightweight and fast that can be available on any Zulip server without any extra dependencies, through a carefully designed set of tables in PostgreSQL.

This requires some care to avoid making the analytics tables larger than the rest of the Zulip database or adding a ton of computational load, but with careful design, we can make the analytics system very low cost to operate. Also, note that a Zulip application database has 2 huge tables: Message and UserMessage, and everything else is small and thus not performance or space-sensitive, so it's important to optimize how many expensive queries we do against those 2 tables.

There are a few important principles that we use to make the system efficient:

Backend testing

There are a few types of automated tests that are important for this sort of system:

For manual backend testing, it can sometimes be valuable to use ./manage.py dbshell to inspect the tables and check that things look right. Usually, though, anything you feel the need to check manually should get some sort of assertion in the backend analytics tests, to make sure it stays that way as we refactor.

LoggingCountStats

The system discussed above is designed primarily around the technical problem of showing useful analytics about things where the raw data is already stored in the database (e.g. Message, UserMessage). This is great because we can always backfill that data to the beginning of time, but of course sometimes one wants to do analytics on things that aren't worth storing every data point for (e.g. activity data, request performance statistics, etc.). There is currently a reference implementation of a "LoggingCountStat" that shows how to handle such a situation.
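This section does not describe how the reference implementation works, so the following is only one plausible shape for such a stat: instead of storing every data point, increment a counter for the current time bucket when the event happens. The `counts` dictionary, `ceiling_to_hour`, `log_event`, and the "example_event:hour" property are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (property, end_time) -> value; stands in for a *Count table.
counts = defaultdict(int)


def ceiling_to_hour(t):
    """Round up to the end of the hour bucket containing t."""
    floor = t.replace(minute=0, second=0, microsecond=0)
    return floor if floor == t else floor + timedelta(hours=1)


def log_event(prop, event_time, increment=1):
    """Record an event by bumping the counter for its time bucket."""
    counts[(prop, ceiling_to_hour(event_time))] += increment


log_event("example_event:hour", datetime(2024, 1, 1, 9, 15))
log_event("example_event:hour", datetime(2024, 1, 1, 9, 50))
log_event("example_event:hour", datetime(2024, 1, 1, 10, 5))
```

The trade-off is that such counts cannot be backfilled: only events seen while the counter was in place are recorded.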

Analytics UI development and testing

Setup and testing

The main testing approach for the /stats page UI is manual testing. For most UI testing, you can visit /stats/realm/analytics while logged in as Iago (this is the server administrator view of stats for a given realm). The only thing you can't test there is the "Me" buttons, which won't have any data. For those, you can instead log in as shylock@analytics.ds in the analytics realm and visit /stats there (which is only a bit more work). Note that the analytics realm is a shell with no streams, so you'll only want to use it for testing the graphs.

If you're adding a new stat/table, you'll want to edit analytics/management/commands/populate_analytics_db.py and add code to generate fake data of the form needed for your new stat/table; you'll then run ./manage.py populate_analytics_db before looking at the updated graphs.

Adding or editing /stats graphs

The relevant files are:

Most of the code is self-explanatory, and when adding, say, a new graph, the answer to most questions is to copy what the other graphs do. It is easy when writing this sort of code to end up with a lot of semi-repeated code blocks (especially in stats.js); it's good to do what you can to reduce this.

Tips and tricks:

/activity page