During the month of December, we're highlighting other organizations and projects that rely on Tor, build on Tor, or are accomplishing their missions better because Tor exists. Check out our blog each day to learn about our fellow travelers. And please support the Tor Project! We're at the heart of Internet freedom. Donate today!
So far in this blog series we've highlighted mainly software and advocacy projects. Today is a little different: I'm going to explain more about Tor's role in the academic world of privacy and security research.
Part one: Tor matters to the research community
Just about every major security conference these days has a paper analyzing, attacking, or improving Tor. While ten years ago the field of anonymous communications was mostly theoretical, with researchers speculating that a given design should or shouldn't work, Tor now provides an actual deployed testbed. Tor has become the gold standard for anonymous communications research for three main reasons:
First, Tor's source code and specifications are open. Beyond its original design document, Tor provides a clear and published set of RFC-style specifications describing exactly how it is built, why we made each design decision, and what security properties it aims to offer. The Tor developers conduct design discussion in the open, on public development mailing lists, and the public development proposal process provides a clear path by which other researchers can participate.
Second, Tor provides open APIs and maintains a set of tools to help researchers and developers interact with the Tor software. The Tor software's "control port" lets controller programs view and change configuration and status information, as well as influence path selection. We provide easy instructions for setting up separate private Tor networks for testing. This modularity makes Tor more accessible to researchers because they can run their own experiments using Tor without needing to modify the Tor program itself.
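As a concrete taste of the control port, here is a minimal sketch using stem, the Tor Project's Python controller library. It assumes a local tor instance configured with "ControlPort 9051" and cookie authentication; the GETINFO key shown is a standard one.

```python
# Minimal stem sketch: talk to a local tor's control port.
# Assumes torrc contains "ControlPort 9051" with cookie authentication.
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()                       # reads the control auth cookie
    print("tor version:", controller.get_version())
    print("socks port :", controller.get_conf("SocksPort"))
    print("bytes read :", controller.get_info("traffic/read"))
```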
Third, real users rely on Tor. Every day hundreds of thousands of people connect to the Tor network and depend on it for a broad variety of security goals. In addition to its emphasis on research and design, The Tor Project has developed a reputation as a non-profit that fosters this community and puts its users first. This real-world relevance motivates researchers to help make sure Tor provides provably good security properties.
I wrote the above paragraphs in 2009 for our first National Science Foundation proposal, and they've become even more true over time. A fourth reason has also emerged: Tor attracts researchers precisely because it brings in so many problems that are at the intersection of "hard to solve" and "matter deeply to the world". How to protect communications metadata is one of the key open research questions of the century, and nobody has all the answers. Our best chance at solving it is for researchers and developers all around the world to team up and all work in the open to build on each other's progress.
Since starting Tor, I've done probably 100 Tor talks to university research groups all around the world, teaching grad students about these open research problems in the areas of censorship circumvention (which led to the explosion of pluggable transport ideas), privacy-preserving measurement, traffic analysis resistance, scalability and performance, and more.
The result of that effort, and of Tor's success in general, is a flood of research papers, plus a dozen research labs whose students regularly write their theses on Tor. The original Tor design paper from 2004 now has over 3200 citations, and in 2014 USENIX picked that paper out of all the security papers from 2004 to win its Test of Time award.
Part two: University collaborations
This advocacy and education work has also led to a variety of ongoing collaborations funded by the National Science Foundation, including with Nick Feamster's group at Princeton on measuring censorship, with Nick Hopper's group at University of Minnesota on privacy-preserving measurement, with Micah Sherr's group at Georgetown University on scalability and security against denial of service attacks, and an upcoming one with Matt Wright's group at RIT on defense against website fingerprinting attacks.
All of these collaborations are great, but there are precious few people on the Tor side keeping up with them, and those people need to balance their research time with development, advocacy, management, etc. I'm really looking forward to the time when Tor can have an actual research department.
And lastly, I would be remiss in describing our academic collaborations without also giving a shout-out to the many universities that run exit relays to help the network grow. As Professor Leo Reyzin of Boston University once explained, justifying why it is appropriate for his research lab to support the Tor network: "If biologists want to study elephants, they get an elephant. I want my elephant." So, special thanks to Boston University, University of Michigan, University of Waterloo, MIT, CMU (their computer science department, that is), University of North Carolina, University of Pennsylvania, Universidad Galileo, and Clarkson University. And if you run an exit relay at a university but you're not on this list, please reach out!
Part three: The Privacy Enhancing Technologies Symposium
Another critical part of the privacy research world is the Privacy Enhancing Technologies Symposium (PETS), which is the premier venue for technical privacy and anonymity research. This yearly gathering started as a workshop in 2000, graduated to being called a symposium in 2008, and in 2015 it became an open-access journal named Proceedings on Privacy Enhancing Technologies.
The editorial board and chairs for PETS over the years overlap greatly with the Tor community, with a lot of names you'll see at both PETS and the Tor twice-yearly meetings, including Nikita Borisov, George Danezis, Claudia Diaz, Roger Dingledine (me), Ian Goldberg, Rachel Greenstadt, Kat Hanna, Nick Hopper, Steven Murdoch, Paul Syverson, and Matt Wright.
But beyond community overlap, The Tor Project is actually the structure underneath PETS. The group of academics who run the PETS gatherings intentionally did not set up corporate governance and all the other bureaucracy that drags things down, so that they can focus on having a useful research meeting each year. Tor stepped in to be the de facto fiscal sponsor: we keep the bank accounts across years, and we act as the "owner" of the journal, since De Gruyter's paperwork assumes that some actual organization has to own it. We're proud that we can help provide stability and longevity for PETS.
Speaking of all these papers: we have tracked the most interesting privacy and anonymity papers over the years on the anonymity bibliography (anonbib). But anonbib is still mostly a two-man show, where Nick Mathewson and I update it when we find some spare time, and it's starting to show its age: it launched in 2003, the field has grown hugely since then, and tools like Google Scholar now cover much of the same ground. Probably the best answer is to trim it down so it's more of a "recommended reading list" than a collection of all relevant papers. If you want to help, let us know!
Part four: The Tor Research Safety Board
This post is running long, so I will close by pointing to the Tor Research Safety Board, a group of researchers who study Tor and who want to minimize privacy risks while fostering a better understanding of the Tor network and its users. That page lists a set of guidelines on what to consider when you're thinking about doing research on Tor users or the Tor network, and a process for getting feedback and suggestions on your plan. We did a soft launch of the safety board this past year in the rump session at PETS, and we've fielded four requests for advice so far. We've been taking it slow in terms of publicity, but if you're a researcher and you can help us refine our process, please take a look!
Research by Andrea Forte, Nazanin Andalibi and Rachel Greenstadt
Wikipedia blocks edits from Tor — how does this affect the quality and coverage of the "encyclopedia that anyone can edit"? How do captchas and blocking of anonymity services affect the experiences of Tor users when they are trying to contribute content? What can projects do to better support contributions from people who value their privacy?
We are a group of researchers from Drexel University studying these questions. Our initial study of privacy in open collaboration projects, entitled Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of Tor Users and Wikipedians, was recently published in advance of its presentation at the ACM conference on Computer-Supported Cooperative Work and Social Computing (CSCW) in February. Our findings offer a rare look at why people turn to privacy tools like Tor and how they experience the Internet as a result. This work was inspired by a previous Tor blog post, A call to arms: Helping Internet services accept anonymous users.
We interviewed 23 people from seven countries, ranging in age from 18 to 41: 12 Tor users who participate in online projects and 11 Wikipedia editors who use a variety of privacy tactics. The Tor Project and Wikimedia Foundation are organizations committed to similar ideals — a free global exchange of information in which everyone is able to participate. The study's central finding is that perceived threats from other individuals, groups, and governments are substantial enough to force users below the radar, curtailing their participation in order to protect their reputation, themselves, and their families.
In nearly all interviews, participants described being wary about how aspects of their participation in open collaboration projects would compromise their privacy or safety. Many participants described crisis experiences of their own or of someone they knew as antecedent to their model of threat in online projects.
Their reasons for guarding their privacy online ranged from concerns about providers obtaining and using their browsing history for targeted advertising to actual verbal abuse, harassment and threats of violence. The most common concern voiced by participants was a fear that their online communication or activities may be accessed or logged by parties without their knowledge or consent.
This threat, which became very real for many Americans after Snowden revealed the extent of the National Security Agency's surveillance and monitoring practices, has been ever-present for users in other countries for some time. According to one non-U.S. respondent, "in my country there's basically unknown surveillance going on ... and I don't know what providers to use so at some point I decided to use Tor for everything."
For a political activist, dissident, or just someone who has expressed strong political opinions, the threat is multiplied. One such participant who uses Tor said, "They busted [my friend's] door down and they beat the ever living crap out of him...and told him, 'If you and your family want to live, then you're going to stop causing trouble.'" This person's privacy strategies were quickly transformed after that experience.
Eleven of the study's participants were recruited from the ranks of Wikipedia editors who expressed concerns about maintaining their privacy. Compared to political dissent, helping to add information to Wikipedia might seem innocuous, but editors who work on controversial topics are also threatened and harassed. Wikipedia allows anonymous posting, but it does not permit users to mask their IP addresses, and it blocks Tor users except in special cases. So wading into controversial territory, even to present a fact-backed, neutral point of view, puts editors at risk. Some Wikipedians described threats of rape, physical assault, and death as reprisals for their contributions to the project.
Administrators of the site, who often spend their time on managerial tasks and enforcing policies, also reported being harassed or threatened with violence. "It's a lot of emotional work," said one study participant. "I remember being like 13 and getting a lot of rape threats and death threats and that was when I was doing administrative work."
Our analysis suggests that Wikipedia and other collaborative projects are losing valuable contributions to privacy concerns. If certain voices are systematically dampened by the threat of harassment, intimidation, violence, or opportunity and reputation loss, projects like Wikipedia cannot hope to attract the diversity of contributors required to produce "the sum of all human knowledge."
In response to this problem, our research agenda aims to support communities like Wikipedia in developing tools and norms that value and welcome anonymous contributions.
Watch the video from the 32C3 talk: What is the value of anonymous communication?
After ten years of volunteer maintenance of Tonga, Tor's Bridge Authority—a piece of critical infrastructure within the Tor network—our colleague and friend Lucky Green, a long-time cypherpunk and free speech and privacy advocate, has decided to step down from this role. Tonga's cryptographic keys will be destroyed this week. We are incredibly thankful to Lucky for all his support and selfless labour in maintaining a key component of our censorship circumvention efforts, grateful for the years we have spent working with him, and very sorry to see him go.
The Bridge Authority is a simple but essential piece of the Tor network. Unlike the other directory authorities, the Bridge Authority does not get a vote in Tor's consensus protocol. Instead, it aggregates the relay descriptors which Tor Bridges send to it, checking their cryptographic validity and testing that the Bridges' ORPorts within these descriptors are reachable. It then sends these descriptors to BridgeDB, which does all the deduplication, cryptographic signature verification (again), stability calculations, pluggable transport argument validation, assignment into the hashring of each Bridge distribution mechanism, and finally the distribution of Bridges to Tor clients.
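BridgeDB's real pipeline is considerably more involved, but a toy sketch can convey the hashring idea: a keyed hash of each Bridge's fingerprint deterministically assigns it to one distribution mechanism, so the same Bridge always lands in the same bucket. The mechanism names and key below are invented for illustration, not BridgeDB's actual values.

```python
# Toy sketch of hashring-style assignment: a keyed hash of a Bridge's
# fingerprint picks its distribution mechanism. Illustrative only; the
# mechanism names and HMAC key are invented, not BridgeDB's real ones.
import hashlib
import hmac

DISTRIBUTORS = ["https", "email", "unallocated"]    # assumed mechanism names
HASHRING_KEY = b"example-key-rotated-periodically"  # hypothetical secret

def assign_distributor(fingerprint: str) -> str:
    digest = hmac.new(HASHRING_KEY, fingerprint.encode(), hashlib.sha256).digest()
    position = int.from_bytes(digest[:8], "big")
    return DISTRIBUTORS[position % len(DISTRIBUTORS)]

print(assign_distributor("A5FA7F38B02A415E72FE614C64A1E5A9B4954405"))  # example
```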
This transition does not affect Tor users, regardless of whether or not Bridges are used to connect to the Tor network. However, it is extremely important that relay operators running Bridges upgrade to tor-0.2.8.7 or tor-0.2.9.2-alpha, which contain a patch to replace Tonga with the new Bridge Authority. Bridges which do not upgrade will cease to be distributed to new clients; however, clients which have connected to your Bridge previously will still be able to connect (at least until your Bridge's IP address, port, or fingerprint changes).
"The same thing, but made of rainbows and on fire."
As a replacement for Tonga, I am happy to announce that Greenhost has donated hardware and hosting for the new Bridge Authority, Bifröst. Bifröst is a Norse mythological bridge that connects Midgard, the mortal realm, and Asgard, the realm of the gods, and is described in the poem Grímnismál within the Poetic Edda as a burning bridge, constructed out of a rainbow whose end lies upon Himinbjorg, or "Heaven's cliffs." The name was suggested by both our colleagues Alison Macrina of the Library Freedom Project and Moritz Bartl of Torservers.net. Despite the personal temptation to follow Nick Mathewson's suggestion to christen it after that iconic symbol of my home, I could not help but name it Bifröst, because why go with some boring normal thing, when you could have the same thing, but made of rainbows and on fire. RAINBOWS. FIRE. Clear choice.
The Tor Project is incredibly thankful to Greenhost for their generous donation of hardware, hosting, and bandwidth. In particular, I am thankful to my colleagues at Greenhost, Sacha van Geffen and Jurre van Bergen, for all the work they put into the organisation, collaboration, and technical efforts in setting the server up quickly. Working with Greenhost, as always, is a pleasure, and I give Greenhost my highest recommendation for those seeking an ethical, friendly, and experienced hosting provider.
Future Research and Hacking
Moving forward, there are several improvements to these systems which could be made, some requiring further research.
- We currently don't have any mechanism for testing the bandwidth capacity of bridge relays. Additional design complications may arise when Bridges have their own Guard relays (#7144), e.g. causing fast Bridges which select slower Guards to not utilize their full capacity. This might be addressed by adding support for Bridges to do a self-bandwidth test before selecting a Guard.
- We also don't currently have anything that tests the reachability of the address/port for any of a Bridge's pluggable transports. Our previous attempts at a distributed/automated Bridge reachability testing system led me to believe that there is no way to test reachability in a distributed manner both reliably and securely, i.e., without literally burning the Bridge by attracting a censor's attention to it. Mixing in N different pluggable transports with varying indistinguishability, authentication, and security properties turns this into a game of Russian roulette and merely compounds the issue, adding to the likelihood that the secrecy of the best transport a Bridge provides is reduced to that of its worst. That said, a thorough analysis of the risks of a centralised system should be made, and there are likely other alternatives. For example, one might attempt to build a system which heuristically crowdsources this information from clients.
- There's no legitimate reason to have the Bridge Authority and BridgeDB be separate systems. It would make more sense to break apart the components into those which
- receive descriptors
- conduct reachability tests
- archive all descriptors
- access archived descriptors for which Bridges may currently be distributed to clients
- distribute Bridges to clients in some manner.
- Decentralise the Bridge Authority/BridgeDB systems without simply turning a single point-of-failure into multiple points-of-failure.
Researchers and hackers interested in these problems are welcome and encouraged to contribute. If these problems interest you (or your sufficiently bright, self-directed, and motivated students!), please feel free to contact me and/or our Research Director, Roger Dingledine, to discuss ideas and projects moving forward.
1. Goals of this document.
- In general, to describe how to conduct responsible research on Tor and similar privacy tools.
- To develop guidelines for research activity that researchers can use to evaluate their proposed plan.
- To produce a (non-exhaustive) list of specific types of unacceptable activity.
- To develop a “due diligence” process for research that falls in the scope of “potentially dangerous” activities. This process can require some notification and feedback from the Tor network or other third parties.
2. General principles
Experimentation does not justify endangering people. Just as in medicine, there are experiments in privacy that can only be performed by creating an unacceptable degree of human harm. These experiments are not justified, any more than the gains to human knowledge would justify unethical medical research on human subjects.
Research on humans' data is human research. Over the last century, we have made enormous strides in what research we consider ethical to perform on people in other domains. For example, we have generally decided that it's ethically dubious to experiment on human subjects without their informed consent. We should make sure that privacy research is at least as ethical as research in other fields.
We should use our domain knowledge concerning privacy when assessing risks. Privacy researchers know that information which other fields consider non-invasive can be used to identify people, and we should take this knowledge into account when designing our research.
Finally, users and implementors must remember that "should not" does not imply "can not." Guidelines like these can serve to guide researchers who are genuinely concerned with doing the right thing and behaving ethically; they cannot restrain the unscrupulous or unethical. Against invasions like these, other mechanisms (like improved privacy software) are necessary.
3. Guidelines for research
- Only collect data that is acceptable to publish. If it would be inappropriate to share it with the world, it is invasive to collect it. In the case of encrypted or secret-shared data, it can be acceptable to assume that the keys or some shares are not published.
- Only collect as much data as is needed: practice data minimization.
- Whenever possible, use analysis techniques that do not require sensitive data, but which work on anonymized aggregates.
- Limit the granularity of the data. For example, "noise" (deliberately added inaccuracies) should almost certainly be added; see the sketch just after this list. Doing this well requires some statistical background, but it helps to avoid harm to users.
- Make an explicit description of benefits and risks, and argue that the benefits outweigh the risks.
- In order to be sure that risks have been correctly identified, seek external review from domain experts. Frequently there are non-obvious risks.
- Consider auxiliary data when assessing the risk of your research. Data which is not damaging on its own can become dangerous when other data is also available. For example, data from exit traffic can be combined with entry traffic to deanonymize users.
- Respect people's own judgments concerning their privacy interests in their own data.
- It's a warning sign if you can't disclose details of your data collection in advance. If knowing about your study would cause your subjects to object to it, that's a good sign that you're doing something dubious.
- Use a test network when at all possible.
- If you can experiment either on a test network without real users, or on a live network, use the test network.
- If you can experiment either on your own traffic or on the traffic of strangers, use your own traffic.
- "It was easier that way" is not justification for using live user traffic over test network traffic.
4. Examples of unacceptable research activity
- It is not acceptable to run an HSDir, harvest onion addresses, and publish or connect to those onion addresses.
- Don't set up exit relays to sniff or tamper with exit traffic. Some broad measurements (relative frequency of ports; coarse-grained volume) may be acceptable depending on risk/benefit tradeoffs; fine-grained measurements are not.
- Don't set up relays that are deliberately dysfunctional (e.g., terminate connections to specific sites).
Roya, David, Nick, nweaver, Vern, and I just finished a research project in which we revisited the Great Firewall of China's (GFW) active probing system. This system was brought to life several years ago to reactively probe and block circumvention proxies, including Tor. You might remember an earlier blog post that gave us some first insight into how the active probing system works. Several questions, however, remained. For example, we were left wondering what the system's physical infrastructure looked like. Is the GFW using dedicated machines behind their thousands of probing IP addresses? Does the GFW even "own" all these IP addresses? Rumour had it that the GFW was hijacking IP addresses for a short period of time, but there was no conclusive proof. As a result, we teamed up and set out to answer these and other questions.
Because this was a network measurement project, we started by compiling datasets. We created three datasets, comprising hours (a Sybil-like experiment to attract many probes), months (an experiment to measure reachability for clients in China), and even years (log files of a long-established server) worth of active probing data. Together, these datasets allow us to look at the GFW's active probing system from different angles, illuminating aspects we wouldn't be able to observe with just a single dataset. We are able to share two of our datasets, so you are very welcome to reproduce our work, or do your own analysis.
We now want to give you an overview of our most interesting findings.
- Generally, once a bridge is detected and blocked by the GFW, it remains blocked. But does this mean that the bridge is entirely unreachable? We measured the blocking effectiveness by continuously making a set of virtual private servers in China connect to a set of bridges under our control. We found that every 25 hours, for a short period of time, our Tor clients in China were able to connect to our bridges. This is illustrated in the diagram shown below. Every point represents one connection attempt, meaning that our client in China was trying to connect to our bridge outside of China. Note the curious periodic availability pattern for both Unicom and CERNET (the two ISPs in China we measured from). Sometimes, network security equipment goes into "fail open" mode while it updates its rule set, but it is not clear if this is happening here.
- We were able to find patterns in the TCP headers of active probes that suggest that all these thousands of IP addresses are, in fact, controlled by a single source. Check out the initial sequence number (ISN) pattern in the diagram below. It shows the value of ISNs (y-axis) over time (x-axis). Every point in the graph represents the SYN segment of one active probing connection. If the probing connections had come from independent computers, we would expect a random distribution of points, because ISNs are typically chosen randomly to protect against off-path attackers. Instead, we see a clear linear pattern across IP addresses. We believe that active probes derive their ISN from the current time; a sketch of how to test for this pattern follows this list.
- We discovered that Tor is not the only victim of active probing attacks; the GFW is targeting other circumvention systems, namely SoftEther and GoAgent. This highlights the modular nature of the active probing system. It appears to be easy for GFW engineers to add new probing modules to react to emerging, proxy-based circumvention tools.
- Back in 2012, the system worked in 15-minute queues. These days, it seems to be able to scan bridges in real time: on average, it takes only half a second after a bridge connection for an active probe to show up.
- Using a number of traceroute experiments, we were able to show that the GFW's sensor is stateful and seems unable to reassemble TCP streams.
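As promised above, here is a rough sketch of the kind of check that reveals a shared, time-based ISN generator. The input file and its format are assumptions for illustration; the point is that if one clock drives all the probes, a linear fit of ISN against time leaves residuals that are tiny compared to the 2^32 ISN space.

```python
# Sketch: test whether probe ISNs are a linear function of time.
# "probe_syns.csv" is a hypothetical file of (unix timestamp, ISN) rows,
# one per probe SYN, sorted by time.
import numpy as np

syns = np.loadtxt("probe_syns.csv", delimiter=",")
t = syns[:, 0]
isn = np.unwrap(syns[:, 1], period=2**32)   # undo 32-bit wraparound

slope, intercept = np.polyfit(t, isn, 1)
residuals = isn - (slope * t + intercept)
print(f"slope: {slope:.1f} ISN units per second")
print(f"residual std: {residuals.std():.1f}")
# Independent hosts with random ISNs would leave residuals on the order of
# 2**32; a single time-based generator leaves a far tighter spread.
```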
Luckily, we now have several pluggable transports that can defend against active probing. ScrambleSuit and its successor, obfs4, defend against probing attacks by relying on a shared secret that is distributed out of band. Meek tunnels traffic over cloud infrastructure, which does not prevent active probing, but greatly increases collateral damage when blocked. While we keep developing and maintaining circumvention tools, we need to focus more on usability. A powerful and carefully-engineered circumvention tool is of little use if folks find it too hard to use. That's why projects like the UX sprint are so important.
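For concreteness, here is roughly what pointing a Tor client at an obfs4 bridge looks like in its torrc. Every value below is a placeholder: real bridge lines, including the cert parameter that carries the out-of-band shared secret, are handed out through channels like bridges.torproject.org.

```
# Client torrc sketch for an obfs4 bridge; all values are placeholders.
UseBridges 1
ClientTransportPlugin obfs4 exec /usr/bin/obfs4proxy
Bridge obfs4 192.0.2.10:443 0123456789ABCDEF0123456789ABCDEF01234567 cert=EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLE iat-mode=0
```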
Albert Kwon, Mashael AlSabah, and others have a paper entitled Circuit Fingerprinting Attacks: Passive Deanonymization of Tor Hidden Services at the upcoming USENIX Security Symposium in a few weeks. Articles describing the paper are currently making the rounds, so I'm posting a technical summary here, along with explanations of the next research questions that would be good to answer. (I originally wrote this summary for Dan Goodin for his article at Ars Technica.) Also for context, remember that this is another research paper in the great set of literature around anonymous communication systems—you can read many more at http://freehaven.net/anonbib/.
"This is a well-written paper. I enjoyed reading it, and I'm glad the researchers are continuing to work in this space.
First, for background, run (don't walk) to Mike Perry's blog post explaining why website fingerprinting papers have historically overestimated the risks for users:
and then check out Marc Juarez et al's followup paper from last year's ACM CCS that backs up many of Mike's concerns:
To recap, this new paper describes three phases. In the first phase, they hope to get lucky and end up operating the entry guard for the Tor user they're trying to target. In the second phase, the target user loads some web page using Tor, and they use a classifier to guess whether the web page was in onion-space or not. Lastly, if the first classifier said "yes it was", they use a separate classifier to guess which onion site it was.
The first big question comes in phase three: is their website fingerprinting classifier actually accurate in practice? They consider a world of 1000 front pages, but ahmia.fi and other onion-space crawlers have found millions of pages by looking beyond front pages. Their 2.9% false positive rate becomes enormous in the face of this many pages—and the result is that the vast majority of the classification guesses will be mistakes.
For example, if the user loads ten pages, and the classifier outputs a guess for each web page she loads, will it output a stream of "She went to Facebook!" "She went to Riseup!" "She went to Wildleaks!" while actually she was just reading posts in a Bitcoin forum the whole time? Maybe they can design a classifier that works well when faced with many more web pages, but the paper doesn't show one, and Marc Juarez's paper argues convincingly that it's hard to do.
The second big question is whether adding a few padding cells would fool their "is this a connection to an onion service" classifier. We haven't tried to hide that in the current Tor protocol, and the paper presents what looks like a great classifier. It's not surprising that their classifier basically stops working in the face of more padding though: classifiers are notoriously brittle when you change the situation on them. So the next research step is to find out if it's easy or hard to design a classifier that isn't fooled by padding.
I look forward to continued attention by the research community to work toward answers to these two questions. I think it would be especially fruitful to look also at true positive rates and false positives of both classifiers together, which might show more clearly (or not) that a small change in the first classifier has a big impact on foiling the second classifier. That is, if we can make it even a little bit more likely that the "is it an onion site" classifier guesses wrong, we could make the job of the website fingerprinting classifier much harder because it has to consider the billions of pages on the rest of the web too."
Drexel University researchers in Philadelphia, Pennsylvania, are recruiting Tor users for an interview study to see how they use Tor while creating things online—how they write blog posts, edit Wikipedia articles, contribute to open source projects on GitHub, post on discussion forums, comment on news articles, tweet, write reviews, and many other things.
The researchers want to investigate the ways in which various limits, like CAPTCHAs, or even blocking access to sites entirely, inhibit or don’t inhibit Tor users’ ability to create things online. They hope to identify times when people are forced to modify their behavior to achieve the privacy they want. They want to measure the value of anonymous participation and then begin to talk to service providers and others to optimize the participation of Tor users.
“By understanding the contributions that Tor users make, we can help make a case for the value of anonymity online,” said Associate Professor Rachel Greenstadt, an investigator on the study.
The researchers are also interested in hearing from Tor users about other impediments to their anonymous participation that they have encountered while online.
“It’s critical for online projects to support contributions from anyone eager to participate,” said Assistant Professor Andrea Forte, principal investigator.
For more information about joining the study, see: The Tor Study (http://andreaforte.net/tor.html)
People are starting to ask us about a recent tech report from Sambuddho's group about how an attacker with access to many routers around the Internet could gather the netflow logs from these routers and match up Tor flows. It's great to see more research on traffic correlation attacks, especially on attacks that don't need to see the whole flow on each side. But it's also important to realize that traffic correlation attacks are not a new area.
This blog post aims to give you some background to get you up to speed on the topic.
First, you should read the first few paragraphs of the One cell is enough to break Tor's anonymity analysis:
First, remember the basics of how Tor provides anonymity. Tor clients route their traffic over several (usually three) relays, with the goal that no single relay gets to learn both where the user is (call her Alice) and what site she's reaching (call it Bob).
The Tor design doesn't try to protect against an attacker who can see or measure both traffic going into the Tor network and also traffic coming out of the Tor network. That's because if you can see both flows, some simple statistics let you decide whether they match up.
Because we aim to let people browse the web, we can't afford the extra overhead and hours of additional delay that are used in high-latency mix networks like Mixmaster or Mixminion to slow this attack. That's why Tor's security is all about trying to decrease the chances that an adversary will end up in the right positions to see the traffic flows.
The way we generally explain it is that Tor tries to protect against traffic analysis, where an attacker tries to learn whom to investigate, but Tor can't protect against traffic confirmation (also known as end-to-end correlation), where an attacker tries to confirm a hypothesis by monitoring the right locations in the network and then doing the math.
And the math is really effective. There are simple packet counting attacks (Passive Attack Analysis for Connection-Based Anonymity Systems) and moving window averages (Timing Attacks in Low-Latency Mix-Based Systems), but the more recent stuff is downright scary, like Steven Murdoch's PET 2007 paper about achieving high confidence in a correlation attack despite seeing only 1 in 2000 packets on each side (Sampled Traffic Analysis by Internet-Exchange-Level Adversaries).
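To make "the math" a bit more concrete, here is a toy illustration, using simulated flows rather than real traffic, of the simplest packet counting flavor: bucket each flow's packets into time windows and correlate the counts.

```python
# Toy end-to-end confirmation: correlate per-window packet counts of an
# entry-side flow against candidate exit-side flows. Real attacks are much
# fancier; this just shows the core statistic on simulated traffic.
import numpy as np

def window_counts(packet_times, window=1.0):
    edges = np.arange(0, max(packet_times) + window, window)
    counts, _ = np.histogram(packet_times, bins=edges)
    return counts

def flow_correlation(entry_times, exit_times):
    a, b = window_counts(entry_times), window_counts(exit_times)
    n = min(len(a), len(b))
    return np.corrcoef(a[:n], b[:n])[0, 1]

rng = np.random.default_rng(0)
entry = np.sort(rng.uniform(0, 60, 500))                    # 500 packets/minute
matching_exit = entry + rng.normal(0.3, 0.05, entry.size)   # same flow, delayed
unrelated = np.sort(rng.uniform(0, 60, 500))                # another user's flow

print("matching flow:  r =", round(flow_correlation(entry, matching_exit), 3))
print("unrelated flow: r =", round(flow_correlation(entry, unrelated), 3))
```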
Second, there's some further discussion about the efficacy of traffic correlation attacks at scale in the Improving Tor's anonymity by changing guard parameters analysis:
Tariq's paper makes two simplifying assumptions when calling an attack successful [...] 2) He assumes that the end-to-end correlation attack (matching up the incoming flow to the outgoing flow) is instantaneous and perfect. [...] The second one ("how successful is the correlation attack at scale?" or maybe better, "how do the false positives in the correlation attack compare to the false negatives?") remains an open research question.
Researchers generally agree that given a handful of traffic flows, it's easy to match them up. But what about the millions of traffic flows we have now? What levels of false positives (algorithm says "match!" when it's wrong) are acceptable to this attacker? Are there some simple, not too burdensome, tricks we can do to drive up the false positives rates, even if we all agree that those tricks wouldn't work in the "just looking at a handful of flows" case?
More precisely, it's possible that correlation attacks don't scale well because as the number of Tor clients grows, the chance that the exit stream actually came from a different Tor client (not the one you're watching) grows. So the confidence in your match needs to grow along with that or your false positive rate will explode. The people who say that correlation attacks don't scale use phrases like "say your correlation attack is 99.9% accurate" when arguing it. The folks who think it does scale use phrases like "I can easily make my correlation attack arbitrarily accurate." My hope is that the reality is somewhere in between — correlation attacks in the current Tor network can probably be made plenty accurate, but perhaps with some simple design changes we can improve the situation.
The discussion of false positives is key to this new paper too: Sambuddho's paper mentions a false positive rate of 6%. That sounds like it means if you see a traffic flow at one side of the Tor network, and you have a set of 100000 flows on the other side and you're trying to find the match, then 6000 of those flows will look like a match. It's easy to see how at scale, this "base rate fallacy" problem could make the attack effectively useless.
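For intuition, here is that arithmetic spelled out, under the generous assumption that the attack always flags the one true match:

```python
# Base-rate arithmetic for the numbers above: a 6% false positive rate over
# 100000 candidate flows, assuming the one true match is always flagged.
false_positive_rate = 0.06
candidate_flows = 100_000

false_alarms = false_positive_rate * (candidate_flows - 1)
precision = 1 / (1 + false_alarms)
print(f"expected false alarms: {false_alarms:.0f}")                 # ~6000
print(f"chance a flagged flow is the true match: {precision:.5f}")  # ~0.00017
```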
And that high false positive rate is not at all surprising, since he is trying to capture only a summary of the flows at each side and then do the correlation using only those summaries. It would be neat (in a theoretical sense) to learn that it works, but it seems to me that there's a lot of work left here in showing that it would work in practice. It also seems likely that his definition of false positive rate and my use of it above don't line up completely: it would be great if somebody here could work on reconciling them.
For a possibly related case where a series of academic research papers misunderstood the base rate fallacy and came to bad conclusions, see Mike's critique of website fingerprinting attacks plus the follow-up paper from CCS this year confirming that he's right.
I should also emphasize that whether this attack can be performed at all has to do with how much of the Internet the adversary is able to measure or control. This diversity question is a large and important one, with lots of attention already. See more discussion here.
In summary, it's great to see more research on traffic confirmation attacks, but a) traffic confirmation attacks are not a new area, so don't freak out without actually reading the papers, and b) this particular one, while kind of neat, doesn't supersede all the previous papers.
(I should put in an addendum here for the people who are wondering if everything they read on the Internet in a given week is surely all tied together: we don't have any reason to think that this attack, or one like it, is related to the recent arrests of a few dozen people around the world. So far, all indications are that those arrests are best explained by bad opsec for a few of them, and then those few pointed to the others when they were questioned.)
[Edit: be sure to read Sambuddho's comment below, too. -RD]
Today Facebook unveiled its hidden service, which lets users access its website more safely. Users and journalists have been asking for our response; here are some points to help you understand our thinking.
Part one: yes, visiting Facebook over Tor is not a contradiction
I didn't even realize I should include this section, until I heard from a journalist today who hoped to get a quote from me about why Tor users wouldn't ever use Facebook. Putting aside the (still very important) questions of Facebook's privacy habits, their harmful real-name policies, and whether you should or shouldn't tell them anything about you, the key point here is that anonymity isn't just about hiding from your destination.
There's no reason to let your ISP know when or whether you're visiting Facebook. There's no reason for Facebook's upstream ISP, or some agency that surveils the Internet, to learn when and whether you use Facebook. And if you do choose to tell Facebook something about you, there's still no reason to let them automatically discover what city you're in today while you do it.
Also, we should remember that there are some places in the world that can't reach Facebook. Long ago I talked to a Facebook security person who told me a fun story. When he first learned about Tor, he hated and feared it because it "clearly" intended to undermine their business model of learning everything about all their users. Then suddenly Iran blocked Facebook, a good chunk of the Persian Facebook population switched over to reaching Facebook via Tor, and he became a huge Tor fan because otherwise those users would have been cut off. Other countries like China followed a similar pattern after that. This switch in his mind between "Tor as a privacy tool to let users control their own data" to "Tor as a communications tool to give users freedom to choose what sites they visit" is a great example of the diversity of uses for Tor: whatever it is you think Tor is for, I guarantee there's a person out there who uses it for something you haven't considered.
Part two: we're happy to see broader adoption of hidden services
I think it is great for Tor that Facebook has added a .onion address. There are some compelling use cases for hidden services: see for example the ones described at using Tor hidden services for good, as well as upcoming decentralized chat tools like Ricochet where every user is a hidden service, so there's no central point to tap or lean on to retain data. But we haven't really publicized these examples much, especially compared to the publicity that the "I have a website that the man wants to shut down" examples have gotten in recent years.
Hidden services provide a variety of useful security properties. First — and the one that most people think of — because the design uses Tor circuits, it's hard to discover where the service is located in the world. But second, because the address of the service is the hash of its key, they are self-authenticating: if you type in a given .onion address, your Tor client guarantees that it really is talking to the service that knows the private key that corresponds to the address. A third nice feature is that the rendezvous process provides end-to-end encryption, even when the application-level traffic is unencrypted.
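For the curious, the self-authenticating property falls out of how the address is derived: today's .onion name is the base32 encoding of the first 80 bits of the SHA-1 hash of the service's DER-encoded RSA public key. Here is a sketch of that computation, with placeholder bytes standing in for a real key.

```python
# Sketch of deriving a (current-generation) .onion address from a public key.
# The key bytes are a placeholder; a real service hashes the DER encoding of
# its actual 1024-bit RSA public key.
import base64
import hashlib

der_encoded_public_key = b"placeholder DER bytes of the RSA public key"

digest = hashlib.sha1(der_encoded_public_key).digest()
onion = base64.b32encode(digest[:10]).decode().lower() + ".onion"
print(onion)  # sixteen base32 characters followed by ".onion"
```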
So I am excited that this move by Facebook will help to continue opening people's minds about why they might want to offer a hidden service, and help other people think of further novel uses for hidden services.
Another really nice implication here is that Facebook is committing to taking its Tor users seriously. Hundreds of thousands of people have been successfully using Facebook over Tor for years, but in today's era of services like Wikipedia choosing not to accept contributions from users who care about privacy, it is refreshing and heartening to see a large website decide that it's ok for their users to want more safety.
As an addendum to that optimism, I would be really sad if Facebook added a hidden service, had a few problems with trolls, and decided that they should prevent Tor users from using their old https://www.facebook.com/ address. So we should be vigilant in helping Facebook continue to allow Tor users to reach them through either address.
Part three: their vanity address doesn't mean the world has ended
Their hidden service name is "facebookcorewwwi.onion". For a hash of a public key, that sure doesn't look random. Many people have been wondering how they brute forced the entire name.
The short answer is that for the first half of it ("facebook"), which is only 40 bits, they generated keys over and over until they got some keys whose first 40 bits of the hash matched the string they wanted.
Then they had some keys whose name started with "facebook", and they looked at the second half of each of them to pick out the ones with pronounceable and thus memorable syllables. The "corewwwi" one looked best to them — meaning they could come up with a story about why that's a reasonable name for Facebook to use — so they went with it.
So to be clear, they would not be able to produce exactly this name again if they wanted to. They could produce other hashes that start with "facebook" and end with pronounceable syllables, but that's not brute forcing the entire hidden service name (all 80 bits).
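Here is a toy version of that search. It hashes random bytes instead of generating real RSA keys, and hunts for a 2-character prefix so it finishes quickly; since each base32 character pins down 5 bits, the full 8-character "facebook" prefix costs about 2^40 key generations on average.

```python
# Toy vanity search: keep drawing "keys" until the derived name starts with
# the wanted prefix. Hashing random bytes stands in for real key generation.
import base64
import hashlib
import os

def fake_onion(key_bytes: bytes) -> str:
    return base64.b32encode(hashlib.sha1(key_bytes).digest()[:10]).decode().lower()

prefix = "fa"   # 2 characters = 10 bits, ~1024 tries; "facebook" would be 2**40
attempts = 0
while True:
    attempts += 1
    candidate = os.urandom(128)
    if fake_onion(candidate).startswith(prefix):
        break
print(f"found {fake_onion(candidate)} after {attempts} tries")
```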
For those who want to explore the math more, read about the "birthday attack". And for those who want to learn more (please help!) about the improvements we'd like to make for hidden services, including stronger keys and stronger names, see hidden services need some love and Tor proposal 224.
Part four: what do we think about an https cert for a .onion address?
Facebook didn't just set up a hidden service. They also got an https certificate for their hidden service, and it's signed by DigiCert so your browser will accept it. This choice has produced some feisty discussions in the CA/Browser community, which decides what kinds of names can get official certificates. That discussion is still ongoing, but here are my early thoughts on it.
In favor: we, the Internet security community, have taught people that https is necessary and http is scary. So it makes sense that users want to see the string "https" in front of them.
Against: Tor's .onion handshake basically gives you all of that for free, so by encouraging people to pay DigiCert we're reinforcing the CA business model when maybe we should be continuing to demonstrate an alternative.
In favor: Actually https does give you a little bit more, in the case where the service (Facebook's webserver farm) isn't in the same location as the Tor program. Remember that there's no requirement for the webserver and the Tor process to be on the same machine, and in a complicated set-up like Facebook's they probably shouldn't be. One could argue that this last mile is inside their corporate network, so who cares if it's unencrypted, but I think the simple phrase "ssl added and removed here" will kill that argument.
Against: if one site gets a cert, it will further reinforce to users that it's "needed", and then the users will start asking other sites why they don't have one. I worry about starting a trend where you need to pay DigiCert money to have a hidden service or your users think it's sketchy — especially since hidden services that value their anonymity could have a hard time getting a certificate.
One alternative would be to teach Tor Browser that https .onion addresses don't deserve a scary pop-up warning. A more thorough approach in that direction is to have a way for a hidden service to generate its own signed https cert using its onion private key, and to teach Tor Browser how to verify such certs — basically a decentralized CA for .onion addresses, since they are self-authenticating anyway. Then you don't have to go through the nonsense of pretending to check whether somebody can read email at the domain, and of generally furthering the current CA model.
We could also imagine a pet name model where the user can tell her Tor Browser that this .onion address "is" Facebook. Or the more direct approach would be to ship a bookmark list of "known" hidden services in Tor Browser — like being our own CA, using the old-fashioned /etc/hosts model. That approach would raise the political question though of which sites we should endorse in this way.
So I haven't made up my mind yet about which direction I think this discussion should go. I'm sympathetic to "we've taught the users to check for https, so let's not confuse them", but I also worry about the slippery slope where getting a cert becomes a required step to having a reputable service. Let us know if you have other compelling arguments for or against.
Part five: what remains to be done?
In terms of both design and security, hidden services still need some love. We have plans for improved designs (see Tor proposal 224) but we don't have enough funding and developers to make it happen. We've been talking to some Facebook engineers this week about hidden service reliability and scalability, and we're excited that Facebook is thinking of putting development effort into helping improve hidden services.
And finally, speaking of teaching people about the security features of .onion sites, I wonder if "hidden services" is no longer the best phrase here. Originally we called them "location-hidden services", which was quickly shortened in practice to just "hidden services". But protecting the location of the service is just one of the security features you get. Maybe we should hold a contest to come up with a new name for these protected services? Even something like "onion services" might be better if it forces people to learn what it is.
Looking for a way to help the Internet stay open and free? This topic needs some dedicated people to give it more attention — it could easily grow to as large a project as Tor itself. In the short term, OTF's Information Controls Fellowship Program has expressed interest in funding somebody to get this project going, and EFF's Eva Galperin has said she'd be happy to manage the person as an OTF fellow at EFF, with mentorship from Tor people. The first round of those proposals has a deadline in a few days, but if that timeframe doesn't work for you, this problem isn't going away: let us know and we can work with you to help you coordinate other funding.
We used to think there are two main ways that the Tor network can fail. First, legal or policy pressure can make it so nobody is willing to run a relay. Second, pressure on or from Internet Service Providers can reduce the number of places willing to host exit relays, which in turn squeezes down the anonymity that the network can provide. Both of these threats are hard to solve, but they are challenges that we've known about for a decade, and due in large part to strong ongoing collaborations we have a pretty good handle on them.
We missed a third threat to Tor's success: a growing number of websites treat users from anonymity services differently. Slashdot doesn't let you post comments over Tor, Wikipedia won't let you edit over Tor, and Google sometimes gives you a captcha when you try to search (depending on what other activity they've seen from that exit relay lately). Some sites like Yelp go further and refuse to even serve pages to Tor users.
The result is that the Internet as we know it is siloing. Each website operator works by itself to figure out how to handle anonymous users, and generally neither side is happy with the solution. The problem isn't limited to just Tor users, since these websites face basically the same issue with users from open proxies, users from AOL, users from Africa, etc.
Two recent trends make the problem more urgent. First, sites like Cloudflare, Akamai, and Disqus create bottlenecks where their components are used by many websites. This centralization impacts many websites at once when e.g. Cloudflare changes its strategy for how to handle Tor users. Second, services increasingly outsource their blacklisting, such that e.g. Skype refuses connections from IP addresses that run Tor exit relays, not because they worry about abuse via Tor (it's hard to use Skype over Tor), but because their blacklist provider has an incentive to be overbroad in its blocking. (Blacklist providers compete in part by having "the most complete" list, and in many cases it's hard for services to notice that they're losing contributions from now-missing users.)
Technical mechanisms do exist to let anonymous users interact with websites in ways that control abuse better. Simple technical approaches include "you can read but you can't post" or "you have to log in to post". More complex approaches track reputation of users and give them access to site features based on past behavior of the user rather than on past behavior of their network address. Several research designs suggest using anonymous credentials, where users anonymously receive a cryptographic credential and then prove to the website that they possess a credential that hasn't been blacklisted — without revealing their credential, so the website can't link them to their past behavior.
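As a toy illustration of the reputation idea, with invented thresholds and field names, the gate looks at what an account has done rather than where its packets come from:

```python
# Illustrative sketch: gate posting on an account's earned history instead of
# its network address. Thresholds and field names are invented for this example.
from dataclasses import dataclass

@dataclass
class Account:
    approved_posts: int = 0
    flagged_posts: int = 0

def can_post_unmoderated(account: Account) -> bool:
    # New accounts (including ones arriving over Tor) start in a moderation
    # queue; accounts earn direct posting rights through good contributions.
    if account.approved_posts < 5:
        return False
    return account.flagged_posts * 10 < account.approved_posts

print("publish" if can_post_unmoderated(Account(12, 1)) else "queue")  # publish
```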
Social mechanisms have also proven effective in some cases, ranging from community moderation (I hear Wikipedia Germany manually approves edits from users who don't have sufficiently reputable accounts), to flagging behavior from Tor users (even though you don't know *which* Tor user it is) so other participants can choose how to interact with them.
But applying these approaches to real-world websites has not gone well overall. Part of the challenge is that the success stories are not well-publicized, so each website feels like it's dealing with the question in isolation. Some sites do in fact face quite different problems, which require different solutions: Wikipedia doesn't want jerks to change the content of pages, whereas Yelp doesn't want competitors to scrape all its pages. We can also imagine that some companies, like ones that get their revenue from targeted advertising, are fundamentally uninterested in allowing anonymous users at all.
A way forward
The solution I envision is to get a person who is both technical and good at activism to focus on this topic. Step one is to enumerate the set of websites and other Internet services that handle Tor connections differently from normal connections, and look for patterns that help us identify the common (centralized) services that impact many sites. At the same time, we should make a list of solutions — technical and social — that are in use today. There are a few community-led starts on the Tor wiki already, like the DontBlockMe page and a List of Services Blocking Tor.
Step two is to sort the problem websites based on how amenable they would be to our help. Armed with the toolkit of options we found in step one, we should go to the first (most promising) site on the list and work with them to understand their problem. Ideally we can adapt one of the ideas from the toolkit; otherwise we'll need to invent and develop a new approach tailored to their situation and needs. Then we should go to the second site on the list with our (now bigger) toolkit, and so on down the list. Once we have some success stories, we can consider how to scale better, such as holding a conference where we invite the five best success cases plus the next five unsolved sites on our list.
A lot of the work will be building and maintaining social connections with engineers at the services, to help them understand what's possible and to track regressions (for example, every year or so Google gets a new engineer in charge of deciding when to give out Captchas, and they seem to have no institutional memory of how the previous one decided to handle Tor users). It might be that the centralization of Cloudflare et al. can be turned around into an advantage, where making sure they have good practices will scale to help many websites at once.
EFF is the perfect organization to lead this charge, given its community connections, its campaigns like Who has your back?, and its more (at least more than Tor ;) neutral perspective on the topic. And now, when everybody is sympathetic about the topic of surveillance, is a great time to try to take back some ground. We have a wide variety of people who want to help, from scientists and research groups who would help with technical solutions if only they understood the real problems these sites face, to users and activists who can help publicize both the successful cases and the not-yet-successful cases.
Looking ahead to the future, I'm also part of an upcoming research collaboration with Dan Boneh, Andrea Forte, Rachel Greenstadt, Ryan Henry, Benjamin Mako Hill, and Dan Wallach that will look both at the technical side of the problem (building more useful ideas for the toolkit) and at the social side: how can we quantify the loss to Wikipedia, and to society at large, from turning away anonymous contributors? Wikipedians say "we have to blacklist all these IP addresses because of trolls" and "Wikipedia is rotting because nobody wants to edit it anymore" in the same breath, and we believe these points are related. The group is at the "applying for an NSF grant" stage currently, so it will be a year or more before funding appears, but I mention it because we should get somebody to get the ball rolling now, and hopefully we can expect reinforcements to appear as momentum builds.
In summary, if this call to arms catches your eye, your next steps are to think about what you most want to work on to get started, and how you would go about doing it. You can apply for an OTF fellowship, or we can probably help you find other funding sources as needed too.