Check the status of Tor services with status.torproject.org

by anarcat | May 5, 2021

The Tor Project now has a status page which shows the state of our major services.

You can check status.torproject for news about major outages in Tor services, including v3 and v2 onion services, directory authorities, our website (torproject.org), and the check.torproject.org tool. The status page also displays outages related to Tor internal services, like our GitLab instance.

This post documents why we launched status.torproject.org, how the service was built, and how it works.

Why a status page

The first step in setting up a service page was to realize we needed one in the first place. I surveyed internal users at the end of 2020 to see what could be improved, and one of the suggestions that came up was to "document downtimes of one hour or longer" and generally improve communications around monitoring. The latter is still on the sysadmin roadmap, but a status page seemed like a good solution for the former.

We already have two monitoring tools in the sysadmin team: Icinga (a fork of Nagios) and Prometheus, with Grafana dashboards. But those are hard to understand for users. Worse, they also tend to generate false positives, and don't clearly show users which issues are critical.

In the end, a manually curated dashboard provides important usability benefits over an automated system, and all major organisations have one.

Picking the right tool

It wasn't my first foray in status page design. In another life, I had setup a status page using a tool called Cachet. That was already a great improvement over the previous solutions, which were to use first a wiki and then a blog to post updates. But Cachet is a complex Laravel app, which also requires a web browser to update. It generally requires more maintenance than what we'd like, needing silly things like a SQL database and PHP web server.

So when I found cstate, I was pretty excited. It's basically a theme for the Hugo static site generator, which means that it's a set of HTML, CSS, and a sprinkle of Javascript. And being based on Hugo means that the site is generated from a set of Markdown files and the result is just plain HTML that can be hosted on any web server on the planet.

Deployment

At first, I wanted to deploy the site through GitLab CI, but at that time we didn't have GitLab pages set up. Even though we do have GitLab pages set up now, it's not (yet) integrated with our mirroring infrastructure. So, for now, the source is hosted and built in our legacy git and Jenkins services.

It is nice to have the content hosted in a git repository: sysadmins can just edit Markdown in the git repository and push to deploy changes, no web browser required. And it's trivial to setup a local environment to preview changes:

hugo serve --baseUrl=http://localhost/
firefox https://localhost:1313/

Only the sysadmin team and gitolite administrators have access to the repository, at this stage, but that could be improved if necessary. Merge requests can also be issued on the GitLab repository and then pushed by authorized personnel later on, naturally.

Availability

One of the concerns I have is that the site is hosted inside our normal mirror infrastructure. Naturally, if an outage occurs there, the site goes down. But I figured it's a bridge we'll cross when we get there. Because it's so easy to build the site from scratch, it's actually trivial to host a copy of the site on any GitLab server, thanks to the .gitlab-ci.yml file shipped (but not currently used) in the repository. If push comes to shove, we can just publish the site elsewhere and point DNS there.

And, of course, if DNS fails us, then we're in trouble, but that's the situation anyway: we can always register a new domain name for the status page when we need to. It doesn't seem like a priority at the moment.

Comments and feedback are welcome!

Comments

Please note that the comment area below has been archived.

That's right: components which don't have documented outages have this bizarre behavior that they give a 404 instead of a more helpful message like, say, "there were no reported incidents on this component".

It seems like it's actually been fixed upstream somewhat but for some reason it still doesn't work on our end. The "affected/404" page doesn't get triggered by the ErrorDocument but anyways, it doesn't get built in the current configuration. I filed upstream issue 183 to discuss that over there.

Thanks for the report!

May 11, 2021

In reply to anarcat

Permalink

Not the same person. But have you also thought about implementing a yellow then red depreciation warning on the onion "lock logo" itself similar to HTTP security warnings? This would be the best way in my opinion to solve the fact that some people are just never going to find out themselves and suddenly lose their v2 address without redirecting users to their new v3 onion first.

May 12, 2021

In reply to anarcat

Permalink

Thanks for promptly implementing a good suggestion from a user!

May 07, 2021

Permalink

I surveyed internal users at the end of 2020 to see what could be improved

"I outline the blog first because it's one of the most frequently used service, yet it's one of the "saddest", so it should probably be made a priority in 2021."

Thank you.