Check the status of Tor services with status.torproject.org
The Tor Project now has a status page which shows the state of our major services.
You can check status.torproject for news about major outages in Tor services, including v3 and v2 onion services, directory authorities, our website (torproject.org), and the check.torproject.org tool. The status page also displays outages related to Tor internal services, like our GitLab instance.
This post documents why we launched status.torproject.org, how the service was built, and how it works.
Why a status page
The first step in setting up a service page was to realize we needed one in the first place. I surveyed internal users at the end of 2020 to see what could be improved, and one of the suggestions that came up was to "document downtimes of one hour or longer" and generally improve communications around monitoring. The latter is still on the sysadmin roadmap, but a status page seemed like a good solution for the former.
We already have two monitoring tools in the sysadmin team: Icinga (a fork of Nagios) and Prometheus, with Grafana dashboards. But those are hard to understand for users. Worse, they also tend to generate false positives, and don't clearly show users which issues are critical.
In the end, a manually curated dashboard provides important usability benefits over an automated system, and all major organisations have one.
Picking the right tool
It wasn't my first foray in status page design. In another life, I had setup a status page using a tool called Cachet. That was already a great improvement over the previous solutions, which were to use first a wiki and then a blog to post updates. But Cachet is a complex Laravel app, which also requires a web browser to update. It generally requires more maintenance than what we'd like, needing silly things like a SQL database and PHP web server.
At first, I wanted to deploy the site through GitLab CI, but at that time we didn't have GitLab pages set up. Even though we do have GitLab pages set up now, it's not (yet) integrated with our mirroring infrastructure. So, for now, the source is hosted and built in our legacy git and Jenkins services.
It is nice to have the content hosted in a git repository: sysadmins can just edit Markdown in the git repository and push to deploy changes, no web browser required. And it's trivial to setup a local environment to preview changes:
hugo serve --baseUrl=http://localhost/ firefox https://localhost:1313/
Only the sysadmin team and gitolite administrators have access to the repository, at this stage, but that could be improved if necessary. Merge requests can also be issued on the GitLab repository and then pushed by authorized personnel later on, naturally.
One of the concerns I have is that the site is hosted inside our normal mirror infrastructure. Naturally, if an outage occurs there, the site goes down. But I figured it's a bridge we'll cross when we get there. Because it's so easy to build the site from scratch, it's actually trivial to host a copy of the site on any GitLab server, thanks to the
.gitlab-ci.yml file shipped (but not currently used) in the repository. If push comes to shove, we can just publish the site elsewhere and point DNS there.
And, of course, if DNS fails us, then we're in trouble, but that's the situation anyway: we can always register a new domain name for the status page when we need to. It doesn't seem like a priority at the moment.
Comments and feedback are welcome!