Troubleshooting and monitoring

Zulip uses Supervisor to monitor and control its many Python services. Read the next section, Using supervisorctl, to learn how to use the Supervisor client to monitor and manage services.

If you haven't already, now might be a good time to read Zulip's architectural overview, particularly the Components section. This will help you understand the many services Zulip uses.

If you encounter issues while running Zulip, take a look at Zulip's logs, which are located in /var/log/zulip/. That directory contains one log file for each service, plus errors.log (has all errors), server.log (has logs from the Django and Tornado servers), and workers.log (has combined logs from the queue workers).

The section troubleshooting services on this page includes details about how to fix common issues with Zulip services.

If you run into additional problems, please report them so that we can update this page! The Zulip installation scripts logs its full output to /var/log/zulip/install.log, so please include the context for any tracebacks from that log.

Using supervisorctl

To see what Zulip-related services are configured to use Supervisor, look at /etc/supervisor/conf.d/zulip.conf and /etc/supervisor/conf.d/zulip-db.conf.

Use the supervisor client supervisorctl to list the status of, stop, start, and restart various services.

Checking status with supervisorctl status

You can check if the Zulip application is running using:

supervisorctl status

When everything is running as expected, you will see something like this:

process-fts-updates                                             RUNNING   pid 11392, uptime 19:40:06
smokescreen                                                     RUNNING   pid 3113, uptime 29 days, 21:58:32
teleport_node                                                   RUNNING   pid 15683, uptime 3 days, 13:01:58
zulip-django                                                    RUNNING   pid 11441, uptime 19:39:57
zulip-tornado                                                   RUNNING   pid 11397, uptime 19:40:03
zulip_deliver_scheduled_emails                                  RUNNING   pid 10289, uptime 19:41:04
zulip_deliver_scheduled_messages                                RUNNING   pid 10294, uptime 19:41:02
zulip-workers:zulip_events_deferred_work                        RUNNING   pid 10314, uptime 19:41:00
zulip-workers:zulip_events_digest_emails                        RUNNING   pid 10339, uptime 19:40:57
zulip-workers:zulip_events_email_mirror                         RUNNING   pid 10751, uptime 19:40:52
zulip-workers:zulip_events_email_senders                        RUNNING   pid 10769, uptime 19:40:49
zulip-workers:zulip_events_embed_links                          RUNNING   pid 11035, uptime 19:40:46
zulip-workers:zulip_events_embedded_bots                        RUNNING   pid 11139, uptime 19:40:43
zulip-workers:zulip_events_error_reports                        RUNNING   pid 11154, uptime 19:40:40
zulip-workers:zulip_events_invites                              RUNNING   pid 11261, uptime 19:40:36
zulip-workers:zulip_events_missedmessage_emails                 RUNNING   pid 11346, uptime 19:40:21
zulip-workers:zulip_events_missedmessage_mobile_notifications   RUNNING   pid 11351, uptime 19:40:19
zulip-workers:zulip_events_outgoing_webhooks                    RUNNING   pid 11358, uptime 19:40:17
zulip-workers:zulip_events_user_activity                        RUNNING   pid 11365, uptime 19:40:14
zulip-workers:zulip_events_user_activity_interval               RUNNING   pid 11376, uptime 19:40:11
zulip-workers:zulip_events_user_presence                        RUNNING   pid 11384, uptime 19:40:08

If you see any services showing a status other than RUNNING, or you see an uptime under 5 seconds (which indicates it's crashing immediately after startup and repeatedly restarting), that service isn't running. If you don't see relevant logs in /var/log/zulip/errors.log, check the log file declared via stdout_logfile for that service's entry in /etc/supervisor/conf.d/zulip.conf for details. Logs only make it to /var/log/zulip/errors.log once a service has started fully.

Restarting services with supervisorctl restart all

After you change configuration in /etc/zulip/settings.py or fix a misconfiguration, you will often want to restart the Zulip application. You can restart Zulip using:

supervisorctl restart all

Stopping services with supervisorctl stop all

Similarly, you can stop Zulip using:

supervisorctl stop all

If you're looking to shut down the server, it is often better to run ./scripts/stop-server.

Troubleshooting services

The Zulip application uses several major open source services to store and cache data, queue messages, and otherwise support the Zulip application:

If one of these services is not installed or functioning correctly, Zulip will not work. Below we detail some common configuration problems and how to resolve them:

Restrict unattended upgrades

:::{important} We recommend that you disable or limit Ubuntu's unattended-upgrades to skip some server packages. With unattended upgrades enabled but not limited, the moment a new PostgreSQL release is published, your Zulip server will have its PostgreSQL server upgraded (and thus restarted). If you do disable unattended-upgrades, do not forget to regularly install apt upgrades manually! :::

Restarting one of the system services that Zulip uses (PostgreSQL, memcached, Redis, or Rabbitmq) will drop the connections that Zulip processes have to the service, resulting in future operations on those connections throwing errors.

Zulip is designed to recover from system service downtime by creating new connections once the system service is back up, so the Zulip outage will end once the system service finishes restarting. But you'll get a bunch of error emails during the system service outage whenever one of the Zulip server's ~20 workers attempts to access the system service.

An unplanned outage will also result in an annoying (and potentially confusing) trickle of error emails over the following hours or days. These emails happen because a worker only learns its connection was dropped when it next tries to access the connection (at which point it'll send an error email and make a new connection), and several workers are commonly idle for periods of hours or days at a time.

You can prevent this trickle when doing a planned upgrade by restarting the Zulip server with /home/zulip/deployments/current/scripts/restart-server after installing system package updates to PostgreSQL, memcached, RabbitMQ, or Redis.

You can ensure that the unattended-upgrades package never upgrades PostgreSQL, memcached, Redis, or RabbitMQ, by configuring in /etc/apt/apt.conf.d/50unattended-upgrades:

// Python regular expressions, matching packages to exclude from upgrading
Unattended-Upgrade::Package-Blacklist {
    "libc\d+";
    "memcached$";
    "nginx-full$";
    "postgresql-\d+$";
    "rabbitmq-server$";
    "redis-server$";
    "supervisor$";
};

Monitoring

Chat is mission-critical to many organizations. This section contains advice on monitoring your Zulip server to minimize downtime.

First, we should highlight that Zulip sends Django error emails to ZULIP_ADMINISTRATOR for any backend exceptions. A properly functioning Zulip server shouldn't send any such emails, so it's worth reporting/investigating any that you do see.

Beyond that, the most important monitoring for a Zulip server is standard stuff:

Beyond that, Zulip ships a few application-specific end-to-end health checks. The Nagios plugins check_send_receive_time, check_rabbitmq_queues, and check_rabbitmq_consumers are generally sufficient to point to the cause of any Zulip production issue. See the next section for details.

Nagios configuration

The complete Nagios configuration (sans secret keys) used to monitor zulip.com is available under puppet/zulip_ops in the Zulip Git repository (those files are not installed in the release tarballs).

The Nagios plugins used by that configuration are installed automatically by the Zulip installation process in subdirectories under /usr/lib/nagios/plugins/. The following is a summary of the useful Nagios plugins included with Zulip and what they check:

Application server and queue worker monitoring:

Database monitoring:

Standard server monitoring:

If you're using these plugins, bug reports and pull requests to make it easier to monitor Zulip and maintain it in production are encouraged!

Memory leak mitigation

As a measure to mitigate the potential impact of any future memory leak bugs in one of the Zulip daemons, Zulip service automatically restarts itself every Sunday early morning. See /etc/cron.d/restart-zulip for the precise configuration.