Collecting, Aggregating, and Presenting Data from The Tor Network
As the makers of software dedicated to privacy and freedom online, Tor must take special precautions when collecting data to ensure the privacy of its users while ensuring the integrity of its services.
Tor Metrics is the central mechanism that the Tor Project uses to evaluate the functionality and ongoing relevance of its technologies. Tor Metrics consists of several services that work together to collect, aggregate, and present data from the Tor network and related services. We're always looking for ways to improve, and we recently completed a project, the main points of which are included in this post, to document our pipeline and identify areas that could benefit from modernization.
As of August 2017, all user-facing Tor Metrics content has moved (back) to the Tor Metrics website. The main reason for gathering everything related to Tor Metrics on a single website is usability. In the background, however, there are several services distributed over a dozen hosts that together collect, aggregate, and present the data on the Tor Metrics website. Almost all Tor Metrics codebases are written using Java, although there is also R, SQL and Python. In the future we expect to see more Python code, although Java is still popular with academics and data scientists and we would like to continue supporting easy access to our data for those that want to use Java even if we are using it less ourselves.
Tor relays and bridges collect aggregated statistics about their usage including bandwidth and connecting clients per country. Source aggregation is used to protect the privacy of connecting users—discarding IP addresses and only reporting country information from a local database mapping IP address ranges to countries. These statistics are sent periodically to the directory authorities. CollecTor downloads the latest server descriptors, extra info descriptors containing the aggregated statistics, and consensus documents from the directory authorities and archives them. This archive is public and the metrics-lib Java library can be used to parse the contents of the archive to perform analysis of the data.
In order to provide easy access to visualizations of the historical data archived, the Tor Metrics website contains a number of customizable plots to show user, traffic, relay, bridge, and application download statistics over a requested time period and filtered to a particular country. In order to provide easy access to current information about the public Tor network, Onionoo implements a protocol to serve JSON documents over HTTP that can be consumed by applications that would like to display information about relays along with historical bandwidth, uptime, and consensus weight information.
An example of one such application is Relay Search which is used by relay operators, those monitoring the health of the network, and developers of software using the Tor network. Another example of such an application is metrics-bot which posts regular snapshots to Twitter and Mastodon including country statistics and a world map plotting known relays. The majority of our services use metrics-lib to parse the descriptors that have been collected by CollecTor as their source of raw data about the public Tor network.
As the volume of the data we are collecting scales over time, with both the growth of the Tor network and an increase in the number of services providing telemetry and ecosystem health metrics, some of our services become slow or begin to fail as they are no longer fit for purpose. The most critical part of the Tor Metrics pipeline, and the part that has to handle the most data, is CollecTor. In charge of collecting and archiving raw data from the Tor ecosystem, an outage or failure may mean lost data that we can never collect. For this reason we chose CollecTor as the service that we would examine in detail and explore options for modernization using technologies that had not been available previously.
The CollecTor service provides network data collected since 2004, and has existed in its current form as a Java application since 2010. Over time new modules have been added to collect new data and other modules have been retired as the services they downloaded data from no longer exist. As the CollecTor codebase has grown, technical debt has emerged as we have added new features without refactoring existing code. This results in it becoming increasingly difficult to add new data sources to CollecTor as the complexity of the application increases. Some of the requirements of CollecTor, such as concurrency or scheduling, are common to many applications and frameworks exist implementing best practices for these components that could be used in place of the current bespoke implementations.
During the process of examining CollecTor's operation, we discovered some bugs in the Tor directory protocol specification and created fixes for them:
• The extra-info-digest's sha256-digest isn't actually over the same data as the sha1-digest (#28415)
• The ns consensus isn't explicitly ns in the network-status-version line of flavored consensuses (#28416)
During the process we also requested some additional functionality from stem that would allow for a Python implementation, which was implemented to support our prototype:
• Parsing descriptors from a byte-array (#28450)
• Parsing of detached signatures (#28495)
• Generating digests for extra-info descriptors, votes, consensuses and microdescriptors (#28398)
A prototype of a "modern CollecTor" was implemented as part of this project. This prototype is known as bushel and the source code and documentation can be found on GitHub at https://github.com/irl/bushel. In the future, we hope to build on this work to produce an improved and modernized data pipeline that can better both help our engineering efforts and assist rapid response to attacks or censorship.
For the full details from this project, you can refer to the two technical reports:
Iain R. Learmonth and Karsten Loesing. Towards modernising data collection and archive for the Tor network. Technical Report 2018-12-001, The Tor Project, December 2018.
Iain R. Learmonth and Karsten Loesing. Tor metrics data collection, aggregation, and presentation. Technical Report 2019-03-001, The Tor Project, March 2019.