National Insecurity Agency
This is the second post in a two-part series on the build security improvements in the Tor Browser Bundle 3.0 release cycle.
The first post described why such security is necessary. This post is meant to describe the technical details with respect to how such builds are produced.
We achieve our build security through a reproducible build process that enables anyone to produce byte-for-byte identical binaries to the ones we release. Elsewhere on the Internet, this process is varyingly called "deterministic builds", "reproducible builds", "idempotent builds", and probably a few other terms, too.
To produce byte-for-byte identical packages, we use Gitian to build Tor Browser Bundle 3.0 and above, but that isn't the only option for achieving reproducible builds. We will first describe how we use Gitian, and then go on to enumerate the individual issues that Gitian solves for us, and that we had to solve ourselves through either wrapper scripts, hacks, build process patches, and (in one esoteric case for Windows) direct binary patching.
Gitian: What is it?
Gitian is a thin wrapper around the Ubuntu virtualization tools written in a combination of Ruby and bash. It was originally developed by Bitcoin developers to ensure the build security and integrity of the Bitcoin software.
Gitian uses Ubuntu's python-vmbuilder to create a qcow2 base image for an Ubuntu version and architecture combination and a set of git and tarball inputs that you specify in a 'descriptor', and then proceeds to run a shell script that you provide to build a component inside that controlled environment. This build process produces an output set that includes the compiled result and another "output descriptor" that captures the versions and hashes of all packages present on the machine during compilation.
Gitian requires either Intel VT support (for qemu-kvm), or LXC support, and currently only supports launching Ubuntu build environments from Ubuntu itself.
Gitian: How Tor Uses It
Tor's use of Gitian is slightly more automated and also slightly different than how Bitcoin uses it.
First of all, because Gitian supports only Ubuntu hosts and targets, we must cross compile everything for Windows and MacOS. Luckily, Mozilla provides support for MinGW-w64 as a "third tier" compiler, and does endeavor to work with the MinGW team to fix issues as they arise.
To our further good fortune, we were able to use a MacOS cross compiler created by Ray Donnelly based on a fork of "Toolchain4". We owe Ray a great deal for providing his compilers to the public, and he has been most excellent in terms of helping us through any issues we encountered with them. Ray is also working to merge his patches into the crosstools-ng project, to provide a more seamless build process to create rebuilds of his compilers. As of this writing, we are still using his binaries in combination with the flosoft MacOS10.X SDK.
For each platform, we build the components of Tor Browser Bundle in 3 stages, with one descriptor per stage. The first descriptor builds Tor and its core dependency libraries (OpenSSL, libevent, and zlib) and produces one output zip file. The second descriptor builds Firefox. The third descriptor combines the previous two outputs along with our included Firefox addons and localization files to produce the actual localized bundle files.
We provide a Makefile and shellscript-based wrapper around Gitian to automate the download and authentication of our source inputs prior to build, and to perform a final step that creates a sha256sums.txt file that lists all of the bundle hashes, and can be signed by any number of detached signatures, one for each builder.
It is important to distribute multiple cryptographic signatures to prevent targeted attacks against stealing a single build signing key (because the signing key itself is of course another single point of failure). Unfortunately, GPG currently lacks support for verifying multiple signatures of the same document. Users must manually copy each detached signature they wish to verify into its proper .asc filename suffix. Eventually, we hope to create a stub installer or wrapper script of some kind to simplify this step, as well as add multi-signature support to the Firefox update process. We are also investigating adding a URL and a hash of the package list the Tor Consensus, so that the Tor Consensus document itself authenticates our binary packages.
We do not use the Gitian output descriptors or the Gitian signing tools, because the tools are used to sign Gitian's output descriptors. We found that Ubuntu's individual packages (which are listed in the output descriptors) varied too frequently to allow this mechanism to be reproducible for very long. However, we do include a list of input versions and hashes used to produce each bundle in the bundle itself. The format of this versions file is the same that we use as input to download the sources. This means it should remain possible to re-build an arbitrary bundle for verification at a later date, assuming that any later updates to Ubuntu's toolchain packages do not change the output.
Gitian: Pain Points
Gitian is not perfect. In fact, many who have tried our build system have remarked that it is not even close to deterministic (and that for this and other reasons 'Reproducible Builds' is a better term). In fact, it seems to experience build failures for quite unpredictible reasons related to bugs in one or more of qemu-kvm/LXC, make, qcow copy-on-write image support. These bugs are often intermittent, and simply restarting the build process often causes things to proceed smoothly. This has made the bugs exceedingly tricky to pinpoint and diagnose.
Gitian's use of tags (especially signed tags) has some bugs and flaws. For this reason, we verify signatures ourselves after input fetching, and provide gitian only with explicit commit hashes for the input source repositories.
We maintain a list of the most common issues in the build instructions.
Remaining Build Reproducibility Issues
By default, the Gitian VM environment controls the following aspects of the build platform that normally vary and often leak into compiled software: hostname, build path, uname output, toolchain version, and time.
However, Gitian is not enough by itself to magically produce reproducible builds. Beyond what Gitian provides, we had to patch a number of reproducibility issues in Firefox and some of the supporting tools on each platform. These include:
- Reordering due to inode ordering differences (exposed via Python's os.walk())
Several places in the Firefox build process use python scripts to repackage both compiled library archives and zip files. In particular, they tend to obtain directory listings using os.walk(), which is dependent upon the inode ordering of the filesystem. We fix this by sorting those file lists in the applicable places.
- LC_ALL localizations alter sorting order
Sorting only gets you so far, though, if someone from a different locale is trying to reproduce your build. Differences in your character sets will cause these sort orders to differ. For this reason, we set the LC_ALL environment variable to 'C' at the top of our Gitian descriptors.
- Hostname and other OS info leaks in LXC mode
For these cases, we simply patch the pieces of Firefox that include the hostname (primarily for about:buildconfig).
- Millisecond and below timestamps are not fixed by libfaketime
Gitian relies on libfaketime to set the clock to a fixed value to deal with embedded timestamps in archives and in the build process. However, in some places, Firefox inserts millisecond timestamps into its supporting libraries as part of an informational structure. We simply zero these fields.
- FIPS-140 mode generates throwaway signing keys
A rather insane subsection of the FIPS-140 certification standard requires that you distribute signatures for all of your cryptographic libraries. The Firefox build process meets this requirement by generating a temporary key, using it to sign the libraries, and discarding the private portion of that key. Because there are many other ways to intercept the crypto outside of modifying the actual DLL images, we opted to simply remove these signature files from distribution. There simply is no way to verify code integrity on a running system without both OS and coprocessor assistance. Download package signatures make sense of course, but we handle those another way (as mentioned above).
- On Windows builds, something mysterious causes 3 bytes to randomly vary
in the binary.
Unable to determine the source of this, we just bitstomp the binary and regenerate the PE header checksums using strip... Seems fine so far! ;)
- umask leaks into LXC mode in some cases
We fix this by manually setting umask at the top of our Gitian descriptors. Additionally, we found that we had to reset the permissions inside of tar and zip files, as the umask didn't affect them on some builds (but not others...)
- Zip and Tar reordering and attribute issues
- Timezone leaks
To deal with these, we set TZ=UTC at the top of our descriptors.
The most common question we've been asked about this build process is: What can be done to prevent the adversary from compromising the (substantially weaker) Ubuntu build and packaging processes, and further, what about the Trusting Trust attack?
In terms of eliminating the remaining single points of compromise, the first order of business is to build all of our compilers and toolchain directly from sources via their own Gitian descriptors.
Once this is accomplished, we can begin the process of building identical binaries from multiple different Linux distributions. This would require the adversary to compromise multiple Linux distributions in order to compromise the Tor software distribution.
If we can support reproducible builds through cross compiling from multiple architectures (Intel, ARM, MIPS, PowerPC, etc), this also reduces the likelihood of a Trusting Trust attack surviving unnoticed in the toolchain (because the machine code that injects the payload would have to be pre-compiled and present for all copies of the cross-compiled executable code in a way that is still not visible in the sources).
If those Linux distributions also support reproducible builds of the full build and toolchain environment (both Debian and Fedora have started this), we can eliminate Trusting Trust attacks entirely by using Diverse Double Compilation between multiple independent distribution toolchains, and/or assembly audited compilers. In other words, we could use the distributions' deterministic build processes to verify that identical build environments are produced through Diverse Double Compilation.
As can be seen, much work remains before the system is fully resistant against all forms of malware injection, but even in their current state, reproducible builds are a huge step forward in software build security. We hope this information helps other software distributors to follow the example set by Bitcoin and Tor.
I've spent the past few months developing a new build system for the 3.0 series of the Tor Browser Bundle that produces what are called "deterministic builds" -- packages which are byte-for-byte identical no matter who actually builds them, or what hardware they use. This effort was extraordinarily involved, consuming all of my development time for over two months (including several nights and weekends), babysitting builds and fixing differences and issues that arose.
When describing my recent efforts to others, by far the two most common questions I've heard are "Why did you do that?" and "How did you do that?". I've decided to answer each question at length in a separate blog post. This blog post attempts to answer the first question: "Why would anyone want a deterministic build process?"
The short answer is: to protect against targeted attacks. Current popular software development practices simply cannot survive targeted attacks of the scale and scope that we are seeing today. In fact, I believe we're just about to witness the first examples of large scale "watering hole" attacks. This would be malware that attacks the software development and build processes themselves to distribute copies of itself to tens or even hundreds of millions of machines in a single, officially signed, instantaneous update. Deterministic, distributed builds are perhaps the only way we can reliably prevent these types of targeted attacks in the face of the endless stockpiling of weaponized exploits and other "cyberweapons".
The Dangerous Pursuit of "Cyberweapons" and "Cyberwar"
For the past several years, we've been seeing a steady increase in the weaponization, stockpiling, and the use of software exploits by multiple governments, and by multiple agencies of multiple governments. It would seem that no networked computer is safe from a discovered but undisclosed and already weaponized vulnerability against one or more of its software components -- with each vulnerability being resold an unknown number of times.
Worse still, with Stuxnet and Flame, this stockpile has grown to include weaponized exploits specifically designed to "bridge the air gap" against even non-networked computers. Examples include exploits against software/hardware USB stacks, filesystem drivers, hard drive firmware, and even disconnected Bluetooth and Wifi interfaces. Even if these exploits themselves don't leak, the fact that they are known to exist (and are known to be deliberately kept secret from manufacturers and developers) means that other parties can begin looking for (or simply re-purchasing) the underlying vulnerabilities themselves, without fear of their disclosure or mitigation.
Unfortunately, the use of such exploits isn't limited to attacks against questionable nuclear energy programs by hostile states. The clock is certainly ticking on how long it will be before multiple other intelligence agencies, along with elements of organized crime and "terrorist" groups, have replicated these weapons.
We are essentially risking all of computing (or at least major sectors of the world economy that are dependent on specific software systems) by stockpiling these weapons, as if there would be any possibility of retaliation after a serious cyberattack. Wakeup call: There is not. In fact, the more exploits exist, the higher the risk of the wrong one leaking -- and it really only takes a chain of just a few of the right exploits for this to happen.
Software Engineering Complexity: The Doomsday Scenario
The core problem is this: With the number of dependencies present in large software projects, there is no way any amount of global surveillance, network censorship, machine isolation, or firewalling can sufficiently protect the software development process of widely deployed software projects in order to prevent scenarios where malware sneaks into a development dependency through an exploit in combination with code injection, and makes its way into the build process of software that is critical to the function of the world economy.
Such malware could be quite simple: One day, a timer goes off, and any computer running the infected software turns into a brick. In fact, it's not that hard to destroy a computer via software. Linux distributions have been accidentally tripping on bugs that do it for two decades now. If the right software vector is chosen (for example, a popular piece of software with a rapid release cycle and an auto-updater), a logic bomb that infects the build systems could continuously update the timestamps in the distributed versions of itself to ensure that the infected computers are only destroyed in the event that the attacker actually loses control of the software build infrastructure. If the right systems are chosen, this destruction could mean the disruption of all industrial control or supply chain systems simultaneously, disabling the ability to provide food, water, power, and aid to hundreds of millions of people in a very short amount of time.
The malware could also be more elaborate, especially if the motives are financial as opposed to purely destructive. The ability to universally deploy a backdoor that would allow modification of various aspects of financial transaction processing, stock markets, insurance records, and the supply chain records of various industries would prove tremendously profitable in the right circumstances. Just about all aspects of business are computerized now, and if the computer systems say an event did or didn't happen, that is the reality. Even short of modification, early access to information about certain events is also valuable -- unreleased earnings data from publicly traded companies being the immediate example.
In this brave new world, without the benefit of anonymity and decentralization to protect single points of failure in the software engineering process from such targeted attacks, I don't believe it is possible to keep software signing keys secure any more, nor do I believe it is possible to keep even an offline build machine secure from malware injection any more, especially against the types of adversaries that Tor has to contend with.
As someone who regularly discusses software engineering practices with the best and the brightest minds in the computer industry, I can tell you with certainty that even companies that exercise current best practices -- such as keeping their software build machines offline (and even these companies are few and far between) can still end up being infected, due to the existence and proliferation of the air gap bridging exploits mentioned above.
A true air gap is also difficult to achieve even if it could be used to ensure build machine integrity. For example, all of the major Windows web browser vendors employ a Microsoft run-time optimization technique called "Profile Guided Optimization". This technique requires running an initial compiled binary on a machine to produce a profile output that represents which code paths were executed and which were most expensive. This output is used to transform its code and optimize it further. In the case of browsers, this means that an untrusted, proprietary, and opaque input is derived from non-deterministic network sources (such as the Alexa Top 1000) and transferred to the build machines, to produce executable code that is manipulated and rewritten based on this network-derived, untrusted input, and upon the performance and other characteristics of the specific machine that was used to generate this profile output.
This means that software development has no choice but to evolve beyond the simple models of "Trust our RSA-signed update feed produced from our trusted build machines", or even companies like Google, Mozilla, Apple, and Microsoft are going to end up distributing state-sponsored malware in short order.
Deterministic Builds: Integrity through Decentralization
This is where the "why" of deterministic builds finally comes in: in our case, any individual can use our anonymity network to privately download our source code, verify it against public signed, audited, and mirrored git repositories, and reproduce our builds exactly, without being subject to such targeted attacks. If they notice any differences, they can alert the public builders/signers, hopefully using a pseudonym or our anonymous trac account.
This also will eventually allow us to create a number of auxiliary authentication mechanisms for our packages, beyond just trusting a single offline build machine and a single cryptographic key's integrity. Interesting examples include providing multiple independent cryptographic signatures for packages, listing the package hashes in the Tor consensus, and encoding the package hashes in the Bitcoin blockchain.
I believe it is important for Tor to set an example on this point, and I hope that the Linux distributions will follow in making deterministic packaging the norm. Thankfully, due to our close relationship with Debian, after we whispered in a few of the right ears they have started work on this effort. Don't despair guys: it won't take two months for each Linux package. In our case, we had to cross-compile Firefox deterministically for four different target operating system and architecture combinations and fix a number of Firefox-specific build issues, all of which I will describe in the second post on the technical details.
By now, just about everybody has heard about the PRISM surveillance program, and many are beginning to speculate on its impact on Tor.
Unfortunately, there still are a lot of gaps to fill in terms of understanding what is really going on, especially in the face of conflicting information between the primary source material and Google, Facebook, and Apple's claims of non-involvement.
This apparent conflict means that it is still hard to pin down exactly how the program impacts Tor, and is leading many to assume worst-case scenarios.
For example, some of the worst-case scenarios include the NSA using weaponized exploits to compromise datacenter equipment at these firms. Less severe, but still extremely worrying possibilities include issuing gag orders to mid or low-level datacenter staff to install backdoors or monitoring equipment without any interaction what-so-ever with the legal and executive staff of the firms themselves.
We're going to save analysis of those speculative and invasive scenarios for when more information becomes available (though we may independently write a future blog post on the dangers of the government use of weaponized exploits).
For now, let's review what Tor can do, what tools go well with Tor to give you defense-in-depth for your communications, and what work needs to be done so we can make it easier to protect communications from instances where the existing centralized communications infrastructure is compromised by the NSA, China, Iran, or by anyone else who manages to get ahold of the keys to the kingdom.
The core Tor software's job is to conceal your identity from your recipient, and to conceal your recipient and your content from observers on your end. By itself, Tor does not protect the actual communications content once it leaves the Tor network. This can make it useful against some forms of metadata analysis, but this also means Tor is best used in combination with other tools.
Through the use of HTTPS-Everywhere in Tor Browser, in many cases we can protect your communications content where parts of the Tor network and/or your recipients' infrastructure are compromised or under surveillance. The EFF has created an excellent interactive graphic to help illustrate and clarify these combined properties.
Through the use of combinations of additional software like TorBirdy and Enigmail, OTR, and Diaspora, Tor can also protect your communications content in cases where the communications infrastructure (Google/Facebook) is compromised.
However, the real interesting use cases for Tor in the face of dragnet surveillance like this is not that Tor can protect your gmail/facebook accounts from analysis (in fact, Tor could never really protect account usage metadata), but that Tor and hidden services are actually a key building block to build systems where it is no longer possible to go to a single party and obtain the full metadata, communications frequency, *or* contents.
Tor hidden services are arbitrary communications endpoints that are resistant to both metadata analysis and surveillance.
A simple (to deploy) example of a hidden service based mechanism to significantly hinder exactly this type of surveillance is an XMPP client that also ships with an XMPP server and a Tor hidden service. Such a P2P communication system (where the clients are themselves the servers) is both end-to-end secure, and does *not* have a single central server where metadata is available. This communication is private, pseudonymous, and does not have involve any single central party or intermediary.
Despite these compelling use cases and powerful tool combination possibilities, the Tor Project is under no illusion that these more sophisticated configurations are easy, usable, or accessible by the general public.
We recognize that a lot of work needs to be done even for the basic tools like Tor Browser, TorBirdy, EnigMail, and OTR to work seamlessly and securely for most users, let alone complex combinations like XMPP or Diaspora with Hidden Services.
Additionally, hidden services themselves are in need of quite a bit of development assistance just to maintain their originally designed level of security, let alone scaling to support large numbers of endpoints.
Being an Open Source project with limited resources, we welcome contributions from the community to make any of this software work better with Tor, or to help improve the Tor software itself.
If you're not a developer, but you would still like to help us succeed in our mission of securing the world's communications, please donate! It is a rather big job, after all.
We will keep you updated as we learn more about the exact capabilities of this program.