We are starting a project to study and quantify hidden services traffic. As part of this project, we are collecting data from just a few volunteer relays which only allow us to see a small portion of hidden service activity (between 2% and 5%). Extrapolating from such a small sample is difficult, and our data are preliminary.
We've been working on methods to improve our calculations, but with our current methodology, we estimate that about 30,000 hidden services announce themselves to the Tor network every day, using about 5 terabytes of data daily. We also found that hidden service traffic is about 3.4% of total Tor traffic, which means that, at least according to our early calculations, 96.6% of Tor traffic is *not* hidden services. We invite people to join us in working to research methodologies and develop systems for better understanding Tor hidden services.
Over the past months we've been working on hidden service statistics. Our goal has been to answer the following questions:
- "Approximately how many hidden services are there?"
- "Approximately how much traffic of the Tor network is going to hidden services?"
We chose the above two questions because even though we want to understand hidden services, we really don't want to harm the privacy of Tor users. From a privacy perspective, the above two questions are relatively easy questions to answer because we don't need data from clients or the hidden services themselves; we just need data from hidden service directories and rendezvous points. Furthermore, the measurements reported by each relay cannot be linked back to specific hidden services or their clients.
Our first move was to research various ways we could collect these statistics in a privacy-preserving manner. After days of discussions on obfuscating statistics, we began writing a Tor proposal with our design, as well as code that implements the proposal. The code has since been reviewed and merged to Tor! The statistics are currently disabled by default so we asked volunteer relay operators to explicitly turn them on. Currently there are about 70 relays publishing measurements to us every 24 hours:
We have now been receiving these measurements for over a month, and we have thought a lot about how best to use them to derive interesting results. We finally have some preliminary results we would like to share with you:
How many hidden services are there?
All in all, it seems that every day about 30,000 hidden services announce themselves to the hidden service directories. Graphically:
By counting the number of unique hidden service addresses seen by HSDirs, we can get the approximate number of hidden services. Keep in mind that we can only see between 2% and 5% of the total HSDir space, so the extrapolation is, naturally, messy.
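To make the extrapolation problem concrete, here is a deliberately naive sketch: scale an observed count by the fraction of the HSDir space that reported it. The observed count of 1000 is a made-up number, not one of our measurements, and the function name is only illustrative; the actual method is more involved than this.

```python
def extrapolate_total(observed_count: float, fraction_seen: float) -> float:
    """Scale an observed count up to a network-wide estimate (naive model)."""
    if not 0 < fraction_seen <= 1:
        raise ValueError("fraction_seen must be in (0, 1]")
    return observed_count / fraction_seen


# Hypothetical observation, NOT a measured value from this post.
observed = 1000

# The same observation under the two coverage bounds mentioned in the text.
estimate_at_2_percent = extrapolate_total(observed, 0.02)
estimate_at_5_percent = extrapolate_total(observed, 0.05)

print(f"estimate assuming 2% coverage: {estimate_at_2_percent:,.0f}")
print(f"estimate assuming 5% coverage: {estimate_at_5_percent:,.0f}")
```

The point of the sketch is the spread: not knowing whether coverage is 2% or 5% changes the estimate by a factor of 2.5, before any measurement noise is taken into account.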
How much traffic do hidden services cause?
Our preliminary results show that hidden services cause somewhere between 400 and 600 Mbit of traffic per second, or equivalently about 4.9 terabytes a day. Here is a graph:
We learned this by getting rendezvous points to publish the total number of cells transferred over rendezvous circuits, which allows us to learn the approximate volume of hidden service traffic. Notice that our coverage here is not very good either, with a probability of about 5% that a hidden service circuit will use a relay that reports these statistics as a rendezvous point.
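As a quick sanity check on the units, the following sketch converts a rate in Mbit per second to terabytes per day. It assumes decimal units (1 Mbit = 10^6 bits, 1 TB = 10^12 bytes); the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mbit_per_s_to_tb_per_day(mbit_per_s: float) -> float:
    """Convert Mbit/s to TB/day, assuming decimal (SI) units throughout."""
    bytes_per_second = mbit_per_s * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


low = mbit_per_s_to_tb_per_day(400)
high = mbit_per_s_to_tb_per_day(600)
print(f"400 Mbit/s is about {low:.2f} TB/day")
print(f"600 Mbit/s is about {high:.2f} TB/day")
```

Under these assumptions the 400 to 600 Mbit/s range works out to roughly 4.3 to 6.5 terabytes a day, so the figure of about 4.9 terabytes sits toward the lower end of that range.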
A related statistic here is "How much of the Tor network is actually hidden service usage?". There are two different ways to answer this question, depending on whether we want to understand what clients are doing or what the network is doing. The fraction of hidden-service traffic at Tor clients differs from the fraction at Tor relays because connections to hidden services use 6-hop circuits while connections to the regular Internet use 3-hop circuits. As a result, the fraction of hidden-service traffic entering or leaving Tor is about half of the fraction of hidden-service traffic inside of Tor. Our conclusion is that about 3.4% of client traffic is hidden-service traffic, and 6.1% of traffic seen at a relay is hidden-service traffic.
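The hop-count argument can be written down as a small idealized model. This is only a sketch: it assumes every hidden-service byte is relayed by exactly 6 relays and every other byte by exactly 3, and nothing else. Our reported numbers come from measurements, so they do not have to match this model exactly.

```python
def relay_fraction(client_fraction: float,
                   hs_hops: int = 6,
                   regular_hops: int = 3) -> float:
    """Fraction of relayed traffic that is hidden-service traffic.

    client_fraction is the hidden-service share of traffic at clients.
    Each byte is counted once per relay that forwards it.
    """
    hs_relayed = client_fraction * hs_hops
    regular_relayed = (1 - client_fraction) * regular_hops
    return hs_relayed / (hs_relayed + regular_relayed)


at_relays = relay_fraction(0.034)
print(f"client share 3.4% -> relay share {at_relays:.1%}")
print(f"ratio relay/client: {at_relays / 0.034:.2f}")
```

With a 3.4% client share the model gives roughly 6.6% at relays, a bit under double the client share and in the same ballpark as the measured 6.1%.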
Conclusion and future work
In this blog post we presented some preliminary results that could be extracted from these new hidden service statistics. We hope that this data can help us better gauge the future development and maturity of the onion space as well as detect potential incidents and bugs on the network. To better present our results and methods, we wrote a short technical report that outlines the exact process we followed. We invite you to read it if you are curious about the methodology or the results.
Finally, this project is only a few months old, and there are various plans for the future. For example:
- There are more interesting questions that we could examine in this area. For example: "How many people are using hidden services every day?" and "How many times does someone try to visit a hidden service that does not exist anymore?"
Unfortunately, some of these questions are not easy to answer with the current statistics reporting infrastructure, mainly because collecting them in this way could reveal information about specific hidden services but also because the results of the current system contain too much obfuscating data (each reporting relay randomizes its numbers a little bit before publishing them, so we can learn about totals but not about specific events).
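To illustrate why randomized reports still let us learn totals, here is a minimal sketch. It assumes each relay adds zero-mean Laplace noise to its count; the noise distribution, the scale of 8.0, the true count of 120 and the 5000 reporting relays are all assumptions chosen for illustration, not the parameters of the deployed system.

```python
import math
import random


def laplace_from_uniform(u: float, scale: float) -> float:
    """Map u in (0, 1) to a zero-mean Laplace sample via the inverse CDF."""
    centered = u - 0.5
    return -scale * math.copysign(1.0, centered) * math.log(1 - 2 * abs(centered))


def obfuscate(true_count: float, scale: float, rng: random.Random) -> float:
    """Return the count a relay would publish after adding noise."""
    u = rng.random()
    while u == 0.0:  # the inverse CDF is undefined at exactly 0
        u = rng.random()
    return true_count + laplace_from_uniform(u, scale)


rng = random.Random(2014)
true_counts = [120] * 5000  # hypothetical per-relay counts
reported = [obfuscate(c, 8.0, rng) for c in true_counts]

mean_single_error = sum(abs(r - c) for r, c in zip(reported, true_counts)) / len(reported)
relative_total_error = abs(sum(reported) - sum(true_counts)) / sum(true_counts)

print(f"average error of one report: {mean_single_error:.1f}")
print(f"relative error of the total: {relative_total_error:.4%}")
```

Because the noise is zero-mean, it largely cancels in the sum, while any single report stays too noisy to reveal one specific event.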
For this reason, we've been analyzing various statistics aggregation protocols that could be used in place of the current system, allowing us to safely collect other kinds of statistics.
- We need to incorporate these statistics in our Metrics portal so that they are updated regularly and so that everyone can follow them.
- Currently, these hidden service statistics are not collected in relays by default. Unfortunately, that gives us very small coverage of the network, which in turn makes our extrapolations very noisy. The main reason that these statistics are disabled by default is that similar statistics are also disabled (e.g. CellStatistics). Also, this allows us more time to consider privacy consequences. As we analyze more of these statistics and think more about statistics privacy, we should decide whether to turn these statistics on by default.
It's worth repeating that the current results are preliminary and should be digested with a grain of salt. We invite statistically-inclined people to review our code, methods, and results. If you are a researcher interested in digging into the measurements themselves, you can find them in the extra-info descriptors of Tor relays.
Over the next months, we will also be thinking more about these problems to figure out proper ways to analyze and safely measure private ecosystems like the onion space.
Till then, take care, and enjoy Tor!
Hi! Nick here.
I ought to post my own responses to that Andy Greenberg article, too. (Especially since most everybody else around here is at 31c3 right now, or sick with the flu, or both.)
When I saw the coverage of the hidden services study that was presented at CCC today, I was reminded of the media fallout from that old study from the 1990s that "proved" that a ridiculously high fraction of the internet was pornography...by looking at Usenet*, and by counting newsgroups and bytes. (You might remember it; it was the basis of the delightful TIME Magazine "Cyberporn" cover.)
The 1990s researcher wasn't lying outright, but he and the press *were* conflating one question: "What fraction of Usenet groups are 'alt.sex' or 'alt.binaries' (file posting) groups" with two others: "What fraction of internet traffic is porn?" and "What fraction of internet-user hours are spent on porn?"
These are quite different things.
The presentation today focused on data about hidden service types and usage. Predictably, given the results from Biryukov, Pustogarov, Thill, and Weinmann, the researcher found that hidden services related to child abuse are only a small fraction of the total number of hidden service addresses on the network. And because of the way that hidden services work, traffic does not go through hidden service directories, but instead through rendezvous points (randomly chosen Tor nodes): so no relay that knows the hidden service's address will learn the actual amount of traffic transmitted. But, as previously documented, abusive services represent a disproportionate fraction of usage... if you're measuring usage with hidden service directory requests.
Why might that be?
First, some background. Basically, a Tor client makes a hidden service directory request the first time it visits a hidden service that it has not been to in a while. (If you spend hours at one hidden service, you make about 1 hidden service directory request. But if you spend 1 second each at 100 hidden services, you make about 100 requests.) Therefore, obsessive users who visit many sites in a session account for many more of the requests that this study measures than users who visit a smaller number of sites with equal frequency.
There are other confounding factors as well. Due to bugs in older Tor implementations, a hidden service that is unreliable (or completely unavailable) will get many, many more hidden service directory requests than a reliable one. So if any abuse sites are unusually unreliable, we'd expect their users to create a disproportionately large number of hidden service directory requests.
Also, a very large number of hidden service directory requests are probably not made by humans! See bug 13287: We don't know what's up with that. Could this be caused by some kind of anti-abuse organization running an automated scanning tool?
In any case, a methodology that looks primarily at hidden service directory requests will over-rate services that are frequently accessed from a Tor client that hasn't been there recently, and under-rate services that are used via tor2web, and so on. It also depends a lot on how hidden services are configured, how frequently Tor hidden service directories go up and down, and what times of day they change introduction points in comparison to what time of day their users tend to be awake.
The greater the number of distinct hidden services a person visits, and the less reliable those sites are, the more hidden service directory requests they will trigger.
Suppose 10 people use hidden services to look at conspiracy theories, 100 people use hidden services to buy Cuban cigars, and 1000 people use them for online chat.
But suppose that the average cigar purchaser visits only one or two sites to make purchases, and the average chat user joins one or two networks, whereas the average conspiracy theorist needs to visit several dozen forums and wikis.
Suppose also that the average Cuban cigar purchaser makes about two purchases a month, the average chat user logs in once a day, and the average conspiracy theorist spends 3 hours a day crawling the hidden web.
And suppose that conspiracy theory websites come and go frequently, whereas cigar sites and chat networks are more stable.
In this analysis, even though there are far more people buying cigars, users whose obsessive behavior spans multiple unreliable hidden services will be far more heavily represented in the count of hidden service directory requests than users whose activities are less frequent and span fewer services. So any comparison of hidden service directory request counts will say more about the behavioral differences of different types of users than about their relative numbers, or the amount of traffic they generated.
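The thought experiment above can be put into numbers with a small sketch. Every parameter is made up for illustration: "several dozen" becomes 36 sites, "one or two" becomes 2, a month has 30 days, each site visited in a session costs one directory request, and unreliable sites triple the request count (the `retry_factor`). None of these are measured values.

```python
from dataclasses import dataclass


@dataclass
class UserGroup:
    name: str
    users: int
    sites_per_session: int
    sessions_per_month: int
    retry_factor: int  # extra requests caused by unreliable sites (assumed)

    def monthly_requests(self) -> int:
        """Directory requests per month under the simplified model."""
        return (self.users * self.sites_per_session
                * self.sessions_per_month * self.retry_factor)


groups = [
    UserGroup("conspiracy theories", 10, 36, 30, 3),
    UserGroup("Cuban cigars", 100, 2, 2, 1),
    UserGroup("online chat", 1000, 2, 30, 1),
]

total_users = sum(g.users for g in groups)
total_requests = sum(g.monthly_requests() for g in groups)

for g in groups:
    user_share = g.users / total_users
    request_share = g.monthly_requests() / total_requests
    print(f"{g.name}: {user_share:.1%} of users, {request_share:.1%} of requests")
```

With these invented numbers, the conspiracy theorists are under 1% of the users but account for roughly a third of the directory requests, while the cigar buyers are about 9% of the users and well under 1% of the requests.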
In conclusion, let's spend a minute talking about freedom and philosophy. Any system that provides security on the Internet will inevitably see some use by bad people that we'd rather not help at all. After all, cars are used for getaways, and window shades conceal all kinds of criminality. The only way to make a privacy tool that nobody abuses is to make it so weak that people aren't willing to touch it, or so unusable that nobody can figure it out.
Up till now, many of the early adopters for Tor hidden services have been folks for whom the risk/effort calculations have been quite extreme, since--as I'd certainly acknowledge--the system isn't terribly usable for the average person as it stands. Roger noted earlier that hidden services amount to less than 2% of our total traffic today. Given their privacy potential, I think that's not even close to enough. We've got to work over the next year or more to develop hidden services to the point where their positive impact is felt by the average netizen, whether they're publishing a personal blog for their friends, using a novel communications protocol more secure than email, or reading a news article based on information that a journalist received through an anonymous submission system. Otherwise, they'll remain a target for every kind of speculation, and every misunderstanding about them will lead people to conclude the worst about privacy online. Come lend a hand?
(Also, no offense to Andy on this: he is a fine tech reporter and apparently a fine person. And no offense to Dr. Owen, who explained his results a lot more carefully than they have been re-explained elsewhere. Now please forgive me, I'm off to write some more software and get some sleep. Please direct all media inquiries to the email of "press at torproject dot org".)
* Usenet was sort of like Twitter, only you could write paragraphs on it. ;)
Recently it was announced that a coalition of government agencies took control of many Tor hidden services. We were as surprised as most of you. Unfortunately, we have very little information about how this was accomplished, but we do have some thoughts which we want to share.
Over the last few days, we received and read reports saying that several Tor relays were seized by government officials. We do not know why the systems were seized, nor do we know anything about the methods of investigation which were used. Specifically, there are reports that three systems of Torservers.net disappeared and there is another report by an independent relay operator. If anyone has more details, please get in contact with us. If your relay was seized, please also tell us its identity so that we can request that the directory authorities reject it from the network.
But, more to the point, the recent publications call the targeted hidden services seizures "Operation Onymous" and they say it was coordinated by Europol and other government entities. Early reports say 17 people were arrested, and 400 hidden services were seized. Later reports have clarified that it was hundreds of URLs hosted on roughly 27 web sites offering hidden services. We have not been contacted directly or indirectly by Europol nor any other agency involved.
Tor is most interested in understanding how these services were located, and if this indicates a security weakness in Tor hidden services that could be exploited by criminals or secret police repressing dissidents. We are also interested in learning why the authorities seized Tor relays even though their operation was targeting hidden services. Were these two events related?
How did they locate the hidden services?
So we are left asking "How did they locate the hidden services?". We don't know. In liberal democracies, we should expect that when the time comes to prosecute some of the seventeen people who have been arrested, the police would have to explain to the judge how the suspects came to be suspects, and that as a side benefit of the operation of justice, Tor could learn if there are security flaws in hidden services or other critical internet-facing services. We know through recent leaks that the US DEA and others have constructed a system of organized and sanctioned perjury which they refer to as "parallel construction."
Unfortunately, the authorities did not specify how they managed to locate the hidden services. Here are some plausible scenarios:
The first and most obvious explanation is that the operators of these hidden services failed to use adequate operational security. For example, there are reports of one of the websites being infiltrated by undercover agents and the affidavit states various operational security errors.
Another explanation is exploitation of common web bugs like SQL injections or RFIs (remote file inclusions). Many of those websites were likely quickly-coded e-shops with a big attack surface. Exploitable bugs in web applications are a common problem.
Apparently, there are ways to link transactions and deanonymize Bitcoin clients even if they use Tor. Maybe the seized hidden services were running Bitcoin clients themselves and were victims of similar attacks.
Attacks on the Tor network
The number of takedowns and the fact that Tor relays were seized could also mean that the Tor network was attacked to reveal the location of those hidden services. We received some interesting information from an operator of a now-seized hidden service which may indicate this, as well. Over the past few years, researchers have discovered various attacks on the Tor network. We've implemented some defenses against these attacks, but these defenses do not solve all known issues and there may even be attacks unknown to us.
For example, some months ago, someone was launching non-targeted deanonymization attacks on the live Tor network. People suspect that those attacks were carried out by CERT researchers. While the bug was fixed and the fix quickly deployed in the network, it's possible that as part of their attack, they managed to deanonymize some of those hidden services.
Another possible Tor attack vector could be the Guard Discovery attack. This attack doesn't reveal the identity of the hidden service, but allows an attacker to discover the guard node of a specific hidden service. The guard node is the only node in the whole network that knows the actual IP address of the hidden service. Hence, if the attacker then manages to compromise the guard node or somehow obtain access to it, she can launch a traffic confirmation attack to learn the identity of the hidden service. We've been discussing various solutions to the guard discovery attack for the past many months but it's not an easy problem to fix properly. Help and feedback on the proposed designs is appreciated.
*Similarly, there exists the attack where the hidden service selects the attacker's relay as its guard node. This may happen randomly or this could occur if the hidden service selects another relay as its guard and the attacker renders that node unusable, by a denial of service attack or similar. The hidden service will then be forced to select a new guard. Eventually, the hidden service will select the attacker.
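A rough way to reason about "eventually" is the sketch below. It assumes each newly selected guard is an independent draw and that the attacker's relays are picked with a fixed probability `attacker_share`. Both are simplifications of real guard selection, and the 1% share is a hypothetical value.

```python
import math


def capture_probability(attacker_share: float, guard_picks: int) -> float:
    """Chance that at least one of guard_picks selections is the attacker."""
    return 1 - (1 - attacker_share) ** guard_picks


def picks_needed(attacker_share: float, target_probability: float) -> int:
    """Smallest number of guard selections reaching target_probability."""
    if not (0 < attacker_share < 1 and 0 < target_probability < 1):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target_probability) / math.log(1 - attacker_share))


share = 0.01  # hypothetical: attacker is chosen 1% of the time
for picks in (1, 10, 50, 100):
    print(f"{picks:3d} guard picks -> {capture_probability(share, picks):.1%}")
print("picks for a 50% chance:", picks_needed(share, 0.5))
```

Under these assumptions, the attacker needs to force a few dozen guard changes before her odds pass 50%, which is why making forced guard rotation harder matters.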
Furthermore, denial of service attacks on relays or clients in the Tor network can often be leveraged into full de-anonymization attacks. These techniques go back many years, in research such as "From a Trickle to a Flood", "Denial of Service or Denial of Security?", "Why I'm not an Entropist", and even the more recent Bitcoin attacks above. In the Hidden Service protocol there are more vectors for DoS attacks, such as the set of HSDirs and the Introduction Points of a Hidden Service.
Finally, remote code execution exploits against Tor software are also always a possibility, but we have zero evidence that such exploits exist. Although the Tor source code gets continuously reviewed by our security-minded developers and community members, we would like more focused auditing by experienced bug hunters. Public-interest initiatives like Project Zero could help out a lot here. Funding to launch a bug bounty program of our own could also bring real benefit to our codebase. If you can help, please get in touch.
Advice to concerned hidden service operators
As you can see, we still don't know what happened, and it's hard to give concrete suggestions blindly.
If you are a concerned hidden service operator, we suggest you read the cited resources to get a better understanding of the security that hidden services can offer and of the limitations of the current system. When it comes to anonymity, it's clear that the tighter your threat model is, the more informed you need to be about the technologies you use.
If your hidden service lacks sufficient processor, memory, or network resources, DoS-based de-anonymization attacks may be easy to leverage against your service. Be sure to review the Tor performance tuning guide to optimize your relay or client.
*Another possible suggestion we can provide is manually selecting the guard node of a hidden service. By configuring the EntryNodes option in Tor's configuration file you can select a relay in the Tor network you trust. Keep in mind, however, that a determined attacker will still be able to determine this relay is your guard and all other attacks still apply.
The task of hiding the location of low-latency web services is a very hard problem and we still don't know how to do it correctly. It seems that there are various issues that none of the current anonymous publishing designs have really solved.
In a way, it's even surprising that hidden services have survived so far. The attention they have received is minimal compared to their social value and compared to the size and determination of their adversaries.
It would be great if there were more people reviewing our designs and code. For example, we would really appreciate feedback on the upcoming hidden service revamp or help with the research on guard discovery attacks (see links above).
Also, it's important to note that Tor currently doesn't have funding for improving the security of hidden services. If you are interested in funding hidden services research and development, please get in touch with us. We hope to find time to organize a crowdfunding campaign to acquire independent and focused hidden service funding.
Thanks to Griffin, Matt, Adam, Roger, David, George, Karen, and Jake for contributions to this post.
* Added information about guard node DoS and EntryNodes option - 2014/11/09 18:16 UTC
Today Facebook unveiled its hidden service that lets users access their website more safely. Users and journalists have been asking for our response; here are some points to help you understand our thinking.
Part one: yes, visiting Facebook over Tor is not a contradiction
I didn't even realize I should include this section, until I heard from a journalist today who hoped to get a quote from me about why Tor users wouldn't ever use Facebook. Putting aside the (still very important) questions of Facebook's privacy habits, their harmful real-name policies, and whether you should or shouldn't tell them anything about you, the key point here is that anonymity isn't just about hiding from your destination.
There's no reason to let your ISP know when or whether you're visiting Facebook. There's no reason for Facebook's upstream ISP, or some agency that surveils the Internet, to learn when and whether you use Facebook. And if you do choose to tell Facebook something about you, there's still no reason to let them automatically discover what city you're in today while you do it.
Also, we should remember that there are some places in the world that can't reach Facebook. Long ago I talked to a Facebook security person who told me a fun story. When he first learned about Tor, he hated and feared it because it "clearly" intended to undermine their business model of learning everything about all their users. Then suddenly Iran blocked Facebook, a good chunk of the Persian Facebook population switched over to reaching Facebook via Tor, and he became a huge Tor fan because otherwise those users would have been cut off. Other countries like China followed a similar pattern after that. This switch in his mind from "Tor as a privacy tool to let users control their own data" to "Tor as a communications tool to give users freedom to choose what sites they visit" is a great example of the diversity of uses for Tor: whatever it is you think Tor is for, I guarantee there's a person out there who uses it for something you haven't considered.
Part two: we're happy to see broader adoption of hidden services
I think it is great for Tor that Facebook has added a .onion address. There are some compelling use cases for hidden services: see for example the ones described at using Tor hidden services for good, as well as upcoming decentralized chat tools like Ricochet where every user is a hidden service, so there's no central point to tap or lean on to retain data. But we haven't really publicized these examples much, especially compared to the publicity that the "I have a website that the man wants to shut down" examples have gotten in recent years.
Hidden services provide a variety of useful security properties. First — and the one that most people think of — because the design uses Tor circuits, it's hard to discover where the service is located in the world. But second, because the address of the service is the hash of its key, they are self-authenticating: if you type in a given .onion address, your Tor client guarantees that it really is talking to the service that knows the private key that corresponds to the address. A third nice feature is that the rendezvous process provides end-to-end encryption, even when the application-level traffic is unencrypted.
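To show what "self-authenticating" means in practice, here is a small sketch of deriving a .onion name from a key and checking it. It reflects my understanding of the current design (SHA-1 over the encoded public key, the first 80 bits base32-encoded); the byte strings below are stand-ins rather than real keys, and the function names are illustrative.

```python
import base64
import hashlib


def onion_address_from_key(public_key_bytes: bytes) -> str:
    """Derive a 16-character .onion name from an encoded public key."""
    digest = hashlib.sha1(public_key_bytes).digest()
    # 10 bytes = 80 bits = exactly 16 base32 characters, no padding.
    return base64.b32encode(digest[:10]).decode("ascii").lower() + ".onion"


def key_matches_address(public_key_bytes: bytes, address: str) -> bool:
    """Check that a presented key really corresponds to the typed address."""
    return onion_address_from_key(public_key_bytes) == address.lower()


stand_in_key = b"not a real RSA key, just example bytes"
address = onion_address_from_key(stand_in_key)
print(address)
print(key_matches_address(stand_in_key, address))          # genuine key
print(key_matches_address(b"some other key", address))     # impostor key
```

Since the name is computed from the key, a service that cannot prove knowledge of the matching private key cannot pass for that address, with no certificate authority involved.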
So I am excited that this move by Facebook will help to continue opening people's minds about why they might want to offer a hidden service, and help other people think of further novel uses for hidden services.
Another really nice implication here is that Facebook is committing to taking its Tor users seriously. Hundreds of thousands of people have been successfully using Facebook over Tor for years, but in today's era of services like Wikipedia choosing not to accept contributions from users who care about privacy, it is refreshing and heartening to see a large website decide that it's ok for their users to want more safety.
As an addendum to that optimism, I would be really sad if Facebook added a hidden service, had a few problems with trolls, and decided that they should prevent Tor users from using their old https://www.facebook.com/ address. So we should be vigilant in helping Facebook continue to allow Tor users to reach them through either address.
Part three: their vanity address doesn't mean the world has ended
Their hidden service name is "facebookcorewwwi.onion". For a hash of a public key, that sure doesn't look random. Many people have been wondering how they brute forced the entire name.
The short answer is that for the first half of it ("facebook"), which is only 40 bits, they generated keys over and over until they got some keys whose first 40 bits of the hash matched the string they wanted.
Then they had some keys whose name started with "facebook", and they looked at the second half of each of them to pick out the ones with pronounceable and thus memorable syllables. The "corewwwi" one looked best to them (meaning they could come up with a story about why that's a reasonable name for Facebook to use), so they went with it.
So to be clear, they would not be able to produce exactly this name again if they wanted to. They could produce other hashes that start with "facebook" and end with pronounceable syllables, but that's not brute forcing all of the hidden service name (all 80 bits).
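The gap between matching a prefix and matching a whole name is easy to see in a toy version of the search. This sketch hashes counter values as stand-ins for freshly generated keys (real key generation is much slower) and assumes names are the base32 encoding of the first 80 bits of a SHA-1 hash; the two-character prefix "to" is chosen only so the search finishes quickly.

```python
import base64
import hashlib

BITS_PER_CHAR = 5  # each base32 character encodes 5 bits


def expected_attempts(prefix: str) -> int:
    """Average number of keys to try before a given prefix appears."""
    return 2 ** (BITS_PER_CHAR * len(prefix))


def find_vanity_name(prefix: str, max_attempts: int = 1_000_000):
    """Try stand-in keys until the derived name starts with prefix."""
    for attempt in range(1, max_attempts + 1):
        stand_in_key = attempt.to_bytes(8, "big")  # not a real key
        digest = hashlib.sha1(stand_in_key).digest()
        name = base64.b32encode(digest[:10]).decode("ascii").lower()
        if name.startswith(prefix):
            return name, attempt
    raise RuntimeError("no match within max_attempts")


name, attempts = find_vanity_name("to")
print(f"found {name}.onion after {attempts} attempts")
print(f"expected work for 'facebook': 2^40 = {expected_attempts('facebook'):,}")
print(f"expected work for all 16 characters: 2^80 = {expected_attempts('facebookcorewwwi'):,}")
```

Every extra character multiplies the expected work by 32, which is why an 8-character prefix is feasible with enough hardware while a chosen 16-character name is not.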
For those who want to explore the math more, read about the "birthday attack". And for those who want to learn more (please help!) about the improvements we'd like to make for hidden services, including stronger keys and stronger names, see hidden services need some love and Tor proposal 224.
Part four: what do we think about an https cert for a .onion address?
Facebook didn't just set up a hidden service. They also got an https certificate for their hidden service, and it's signed by Digicert so your browser will accept it. This choice has produced some feisty discussions in the CA/Browser community, which decides what kinds of names can get official certificates. That discussion is still ongoing, but here are my early thoughts on it.
In favor: we, the Internet security community, have taught people that https is necessary and http is scary. So it makes sense that users want to see the string "https" in front of them.
Against: Tor's .onion handshake basically gives you all of that for free, so by encouraging people to pay Digicert we're reinforcing the CA business model when maybe we should be continuing to demonstrate an alternative.
In favor: Actually https does give you a little bit more, in the case where the service (Facebook's webserver farm) isn't in the same location as the Tor program. Remember that there's no requirement for the webserver and the Tor process to be on the same machine, and in a complicated set-up like Facebook's they probably shouldn't be. One could argue that this last mile is inside their corporate network, so who cares if it's unencrypted, but I think the simple phrase "ssl added and removed here" will kill that argument.
Against: if one site gets a cert, it will further reinforce to users that it's "needed", and then the users will start asking other sites why they don't have one. I worry about starting a trend where you need to pay Digicert money to have a hidden service or your users think it's sketchy — especially since hidden services that value their anonymity could have a hard time getting a certificate.
One alternative would be to teach Tor Browser that https .onion addresses don't deserve a scary pop-up warning. A more thorough approach in that direction is to have a way for a hidden service to generate its own signed https cert using its onion private key, and teach Tor Browser how to verify them — basically a decentralized CA for .onion addresses, since they are self-authenticating anyway. Then you don't have to go through the nonsense of pretending to see if they could read email at the domain, and generally furthering the current CA model.
We could also imagine a pet name model where the user can tell her Tor Browser that this .onion address "is" Facebook. Or the more direct approach would be to ship a bookmark list of "known" hidden services in Tor Browser — like being our own CA, using the old-fashioned /etc/hosts model. That approach would raise the political question though of which sites we should endorse in this way.
So I haven't made up my mind yet about which direction I think this discussion should go. I'm sympathetic to "we've taught the users to check for https, so let's not confuse them", but I also worry about the slippery slope where getting a cert becomes a required step to having a reputable service. Let us know if you have other compelling arguments for or against.
Part five: what remains to be done?
In terms of both design and security, hidden services still need some love. We have plans for improved designs (see Tor proposal 224) but we don't have enough funding and developers to make it happen. We've been talking to some Facebook engineers this week about hidden service reliability and scalability, and we're excited that Facebook is thinking of putting development effort into helping improve hidden services.
And finally, speaking of teaching people about the security features of .onion sites, I wonder if "hidden services" is no longer the best phrase here. Originally we called them "location-hidden services", which was quickly shortened in practice to just "hidden services". But protecting the location of the service is just one of the security features you get. Maybe we should hold a contest to come up with a new name for these protected services? Even something like "onion services" might be better if it forces people to learn what it is.
The Google Summer of Code (GSoC) was an excellent opportunity to improve on the Ahmia search engine. With Google's stipend and friendly mentoring from The Tor Project, I was able to concentrate on development of my search engine project. Thank you all!
GSoC 2014 is over, but I am sticking around to continue developing and maintaining Ahmia.
Here is the current status of Ahmia after GSoC development:
Building a search engine for anonymous web sites running inside the Tor network is an interesting problem. Tor enables web servers to hide their location and Tor users can connect to these authenticated hidden services while the server and the user both stay anonymous. However, finding web content is hard without a good search engine and therefore a search engine is needed for the Tor network.
Web search engines are needed to navigate and search the web. There were no search engines for searching hidden service web content, so I decided to build a search engine specially for Tor. I registered ahmia.fi and started development on it as a side project in 2010.
This development involved programming and testing web crawlers, thinking of ways to find hidden service addresses (since the protocol does not allow enumeration), learning about the Tor community, and implementing a filtering policy. Moreover, I implemented an API that empowers other Tor services that publish content to integrate with Ahmia.
As a result, Ahmia is a working search engine that indexes, searches and catalogs content published on Tor Hidden Services. Furthermore, it is an environment to share meaningful statistics, insights and news about the Tor network itself.
Interesting Summer of Code
One of my best memories from the summer is the Tor Project's Summer 2014 Developers meeting that was hosted by Mozilla in Paris, France. I have always admired the people who are working on the Tor Project.
I also loved the coding itself. Finally I had time to improve the Ahmia search engine and its many features. I did a lot of work and liked it.
Some journalists were very interested in my work: Carola Frediani asked if I could analyze the content of hidden services. I wrote a script that fetches the HTML of every front page, gathered all the keywords, headers and description texts, and made a simple word cloud visualization.
It is a simple way to glance at what is published on the hidden websites.
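The counting step behind such a word cloud can be sketched as follows. This is a minimal illustration and not the script used for the analysis; the `word_frequencies` helper and the sample texts are hypothetical.

```python
import re
from collections import Counter

def word_frequencies(texts, min_length=3):
    """Count how often each word appears across a list of text snippets."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on anything that is not a letter or digit.
        words = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts

# Hypothetical keyword and description strings gathered from front pages.
samples = [
    "Anonymous forum and file hosting",
    "Secure anonymous email hosting",
]
frequencies = word_frequencies(samples)
print(frequencies.most_common(2))
```

A word cloud then simply scales each word by its count.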
Carola found this data useful and used it in her presentation at www.sotn.it on June 11th.
Technical design of ahmia
The components of Ahmia are:
- Django front-end site
- PostgreSQL database for the site
- Custom scripts to download data about hidden services
- Django-Haystack connection to Solr database
- Apache Solr for the crawled data
- OnionBot crawler that gathers data to Solr database
The full-text search is implemented using Django-Haystack. It runs over the crawled website data stored in Apache Solr.
OnionDir is a list of known online hidden service addresses. A separate script gathers this list and fetches information fields from the HTML (title, keywords, description etc.). Furthermore, users can freely edit these fields.
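A minimal sketch of how such information fields could be pulled out of a page, using only Python's standard `html.parser` module. The `MetaExtractor` class and the sample page are hypothetical and are not taken from the Ahmia code base.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data.strip()

def extract_fields(html):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.fields

# Hypothetical front page of a hidden service.
page = """<html><head><title>Example Service</title>
<meta name="keywords" content="example, demo">
<meta name="description" content="A sample hidden service page.">
</head><body></body></html>"""
fields = extract_fields(page)
print(fields)
```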
We've also started a convention where hidden service admins can add a file to their website, called description.json, to offer an official description of their site in Ahmia.
As a result, this information is shown in the OnionDir page and over 80 domains are already using this method.
We are gathering three types of popularity data:
- Tor2web nodes share their visiting statistics with Ahmia
- Number of public WWW backlinks to hidden services
- Number of clicks in the search results
The click counter reports the total number of clicks on a search result in ahmia.fi.
We have decided to filter any sites related to child porn from our search results. Ahmia is removing everything related to these websites. These websites may not be actual child porn sites. They are rather sites where users can post content (forums, file and image uploads etc.) and as a result there has been, momentarily at least, some suspicious content that was not moderated in a reasonable period of time. Ahmia.fi does not have the time to monitor these sites carefully, so we ban sites from our public index if we see any evidence of child abuse. Of course, the ban is removed if the site itself contacts us and our review finds the website to be OK.
In practice, Ahmia calculates the MD5 sums of the banned domains for use as a filtering policy. Moreover, we are sharing this list and Tor2web nodes can use the list to filter out pages.
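The hashing step can be sketched with Python's `hashlib`. This is an illustration under assumptions: the normalization (trimming and lowercasing), the helper names, and the example domain are hypothetical, and the exact input format Ahmia hashes is not specified here.

```python
import hashlib

def domain_md5(domain):
    """Return the hex MD5 digest of a normalized domain name."""
    normalized = domain.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_banned(domain, banned_hashes):
    """Check a domain against a shared set of MD5 digests."""
    return domain_md5(domain) in banned_hashes

# Hypothetical filter list: only digests are stored and shared.
banned_hashes = {domain_md5("bannedexample.onion")}

print(is_banned("bannedexample.onion", banned_hashes))
print(is_banned("someothersite.onion", banned_hashes))
```

A node that receives the list can check a requested domain against it without the list itself spelling out the banned addresses.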
At the moment, there seem to be 1228 hidden website domains online, and 7 of them have been filtered because they are possibly sharing child porn content.
OnionBot is a crawler for hidden service websites based on the Scrapy framework. It crawls the Tor network and passes data to the search database. OnionBot requires the Tor software (using Tor2web mode) and Polipo. The results are saved to Apache Solr.
Apache Solr is a popular, open source enterprise search platform. Its major features include powerful full-text search, hit highlighting, faceted search, and near real-time indexing.
The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.
Security measures for privacy
In the software
- We do not log any IP addresses, see Apache configuration
- We gather real-time click counts; however, this data is not displayed with full accuracy
In the host ahmia.fi
- Backend servers are run separately and they do not have any knowledge about the end-users
- All servers are hosted in countries with strong privacy laws. For example, Finland and the Netherlands
- Communication between servers is encrypted
- Only a few trustworthy people know the locations of the back-end servers and are able to access them
GSoC 2014 was fun and productive!
There is a lot more to do. However, I do not have time to do everything myself. Of course, I am coding when I have time and maintaining the search engine.
In addition, I am going to write a scientific article about the implementation.
Is there anyone who would be interested in developing Ahmia.fi?
Is anyone familiar with Solr and would know how to tweak it for full text search?
Furthermore, any kind of help would be most welcome. There are always Linux admin duties, HTML/CSS design, bug fixing, Django development, etc.
For further information, please don't hesitate to contact me by e-mail: email@example.com
This advisory was posted on the tor-announce mailing list.
On July 4 2014 we found a group of relays that we assume were trying to deanonymize users. They appear to have been targeting people who operate or access Tor hidden services. The attack involved modifying Tor protocol headers to do traffic confirmation attacks.
The attacking relays joined the network on January 30 2014, and we removed them from the network on July 4. While we don't know when they started doing the attack, users who operated or accessed hidden services from early February through July 4 should assume they were affected.
Unfortunately, it's still unclear what "affected" includes. We know the attack looked for users who fetched hidden service descriptors, but the attackers likely were not able to see any application-level traffic (e.g. what pages were loaded or even whether users visited the hidden service they looked up). The attack probably also tried to learn who published hidden service descriptors, which would allow the attackers to learn the location of that hidden service. In theory the attack could also be used to link users to their destinations on normal Tor circuits too, but we found no evidence that the attackers operated any exit relays, making this attack less likely. And finally, we don't know how much data the attackers kept, and due to the way the attack was deployed (more details below), their protocol header modifications might have aided other attackers in deanonymizing users too.
Relays should upgrade to a recent Tor release (0.2.4.23 or 0.2.5.6-alpha), to close the particular protocol vulnerability the attackers used — but remember that preventing traffic confirmation in general remains an open research problem. Clients that upgrade (once new Tor Browser releases are ready) will take another step towards limiting the number of entry guards that are in a position to see their traffic, thus reducing the damage from future attacks like this one. Hidden service operators should consider changing the location of their hidden service.
THE TECHNICAL DETAILS:
We believe they used a combination of two classes of attacks: a traffic confirmation attack and a Sybil attack.
A traffic confirmation attack is possible when the attacker controls or observes the relays on both ends of a Tor circuit and then compares traffic timing, volume, or other characteristics to conclude that the two relays are indeed on the same circuit. If the first relay in the circuit (called the "entry guard") knows the IP address of the user, and the last relay in the circuit knows the resource or destination she is accessing, then together they can deanonymize her. You can read more about traffic confirmation attacks, including pointers to many research papers, at this blog post from 2009:
The particular confirmation attack they used was an active attack where the relay on one end injects a signal into the Tor protocol headers, and then the relay on the other end reads the signal. These attacking relays were stable enough to get the HSDir ("suitable for hidden service directory") and Guard ("suitable for being an entry guard") consensus flags. Then they injected the signal whenever they were used as a hidden service directory, and looked for an injected signal whenever they were used as an entry guard.
The way they injected the signal was by sending sequences of "relay" vs "relay early" commands down the circuit, to encode the message they want to send. For background, Tor has two types of cells: link cells, which are intended for the adjacent relay in the circuit, and relay cells, which are passed to the other end of the circuit. In 2008 we added a new kind of relay cell, called a "relay early" cell, which is used to prevent people from building very long paths in the Tor network. (Very long paths can be used to induce congestion and aid in breaking anonymity). But the fix for infinite-length paths introduced a problem with accessing hidden services, and one of the side effects of our fix for bug 1038 was that while we limit the number of outbound (away from the client) "relay early" cells on a circuit, we don't limit the number of inbound (towards the client) relay early cells.
So in summary, when Tor clients contacted an attacking relay in its role as a Hidden Service Directory to publish or retrieve a hidden service descriptor (steps 2 and 3 on the hidden service protocol diagrams), that relay would send the hidden service name (encoded as a pattern of relay and relay-early cells) back down the circuit. Other attacking relays, when they get chosen for the first hop of a circuit, would look for inbound relay-early cells (since nobody else sends them) and would thus learn which clients requested information about a hidden service.
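To make the idea of a signal "encoded as a pattern of relay and relay-early cells" concrete, here is a toy model. The exact encoding the attackers used is not described here, so the one-bit-per-cell scheme below is purely an assumption, and the cells are plain strings rather than real Tor cells.

```python
RELAY, RELAY_EARLY = "RELAY", "RELAY_EARLY"

def encode_signal(message):
    """Map each bit of the message to a cell command (1 -> RELAY_EARLY)."""
    cells = []
    for byte in message.encode("ascii"):
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            cells.append(RELAY_EARLY if bit else RELAY)
    return cells

def decode_signal(cells):
    """Rebuild the message from an observed sequence of cell commands."""
    bits = [1 if cell == RELAY_EARLY else 0 for cell in cells]
    data = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        data.append(value)
    return data.decode("ascii")

# Hypothetical 16-character hidden service name.
cells = encode_signal("abcdefghij234567")
print(len(cells), decode_signal(cells))
```

The point of the toy is that any relay on the path that sees the cell commands can run the decoding step, which is why the signal was readable by parties other than the attackers.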
There are three important points about this attack:
A) The attacker encoded the name of the hidden service in the injected signal (as opposed to, say, sending a random number and keeping a local list mapping random number to hidden service name). The encoded signal is encrypted as it is sent over the TLS channel between relays. However, this signal would be easy to read and interpret by anybody who runs a relay and receives the encoded traffic. And we might also worry about a global adversary (e.g. a large intelligence agency) that records Internet traffic at the entry guards and then tries to break Tor's link encryption. The way this attack was performed weakens Tor's anonymity against these other potential attackers too — either while it was happening or after the fact if they have traffic logs. So if the attack was a research project (i.e. not intentionally malicious), it was deployed in an irresponsible way because it puts users at risk indefinitely into the future.
(This concern is in addition to the general issue that it's probably unwise from a legal perspective for researchers to attack real users by modifying their traffic on one end and wiretapping it on the other. Tools like Shadow are great for testing Tor research ideas out in the lab.)
B) This protocol header signal injection attack is actually pretty neat from a research perspective, in that it's a bit different from previous tagging attacks which targeted the application-level payload. Previous tagging attacks modified the payload at the entry guard, and then looked for a modified payload at the exit relay (which can see the decrypted payload). Those attacks don't work in the other direction (from the exit relay back towards the client), because the payload is still encrypted at the entry guard. But because this new approach modifies ("tags") the cell headers rather than the payload, every relay in the path can see the tag.
C) We should remind readers that while this particular variant of the traffic confirmation attack allows high-confidence and efficient correlation, the general class of passive (statistical) traffic confirmation attacks remains unsolved and would likely have worked just fine here. So the good news is traffic confirmation attacks aren't new or surprising, but the bad news is that they still work. See https://blog.torproject.org/blog/one-cell-enough for more discussion.
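As a rough illustration of the passive (statistical) variant mentioned in point C, an observer at both ends can compare traffic volume over time and look for a strong correlation. The sketch below uses a plain Pearson correlation on made-up per-second cell counts; real attacks in the research literature use more refined statistics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical cells-per-second counts seen at two observation points.
entry_side = [3, 40, 38, 2, 1, 35, 41, 4]
same_flow = [2, 39, 40, 3, 1, 33, 42, 5]             # same circuit, slightly noisy
other_flow = [20, 18, 22, 19, 21, 20, 18, 22]        # unrelated circuit

match = pearson(entry_side, same_flow)
mismatch = pearson(entry_side, other_flow)
print(round(match, 3), round(mismatch, 3))
```

The matching flow scores close to 1 while the unrelated flow does not, which is all a passive observer needs to link the two ends.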
Then the second class of attack they used, in conjunction with their traffic confirmation attack, was a standard Sybil attack — they signed up around 115 fast non-exit relays, all running on 184.108.40.206/16 or 220.127.116.11/16. Together these relays summed to about 6.4% of the Guard capacity in the network. Then, in part because of our current guard rotation parameters, these relays became entry guards for a significant chunk of users over their five months of operation.
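A back-of-the-envelope calculation shows why 6.4% of Guard capacity matters. It assumes that each guard is picked independently with probability equal to the adversary's capacity share, and it ignores guard rotation over time (which only increases exposure), so treat the numbers as a rough lower bound rather than a measurement.

```python
def exposure(adversary_share, num_guards):
    """Chance that at least one of a client's guards belongs to the adversary."""
    return 1 - (1 - adversary_share) ** num_guards

share = 0.064  # fraction of Guard capacity reported above

three_guards = exposure(share, 3)
one_guard = exposure(share, 1)
print(f"three guards: {three_guards:.1%}, one guard: {one_guard:.1%}")
```

Under these assumptions a client with three guards has roughly an 18% chance of having picked at least one attacking relay, compared with 6.4% for a client with a single guard.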
We actually noticed these relays when they joined the network, since the DocTor scanner reported them. We considered the set of new relays at the time, and made a decision that it wasn't that large a fraction of the network. It's clear there's room for improvement in terms of how to let the Tor network grow while also ensuring we maintain social connections with the operators of all large groups of relays. (In general having a widely diverse set of relay locations and relay operators, yet not allowing any bad relays in, seems like a hard problem; on the other hand our detection scripts did notice them in this case, so there's hope for a better solution here.)
In response, we've taken the following short-term steps:
1) Removed the attacking relays from the network.
2) Put out a software update for relays to prevent "relay early" cells from being used this way.
3) Put out a software update that will (once enough clients have upgraded) let us tell clients to move to using one entry guard rather than three, to reduce exposure to relays over time.
4) Clients can tell whether they've received a relay or relay-early cell. For expert users, the new Tor version warns you in your logs if a relay on your path injects any relay-early cells: look for the phrase "Received an inbound RELAY_EARLY cell".
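The client-side check in step 4 amounts to flagging any relay-early cell that travels towards the client. The sketch below models cells as plain strings and is not Tor's actual implementation (Tor is written in C); only the log phrase is taken from the text above.

```python
def check_inbound_cells(cell_commands):
    """Return one log warning per RELAY_EARLY cell arriving at the client."""
    warnings = []
    for index, command in enumerate(cell_commands):
        if command == "RELAY_EARLY":
            warnings.append(
                f"Received an inbound RELAY_EARLY cell (cell #{index})"
            )
    return warnings

# Hypothetical sequence of inbound cell commands on one circuit.
inbound = ["RELAY", "RELAY", "RELAY_EARLY", "RELAY"]
for line in check_inbound_cells(inbound):
    print(line)
```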
The following longer-term research areas remain:
5) Further growing the Tor network and diversity of relay operators, which will reduce the impact from an adversary of a given size.
6) Exploring better mechanisms, e.g. social connections, to limit the impact from a malicious set of relays. We've also formed a group to pay more attention to suspicious relays in the network:
7) Further reducing exposure to guards over time, perhaps by extending the guard rotation lifetime:
8) Better understanding statistical traffic correlation attacks and whether padding or other approaches can mitigate them.
9) Improving the hidden service design, including making it harder for relays serving as hidden service directory points to learn what hidden service address they're handling:
Q1) Was this the Black Hat 2014 talk that got canceled recently?
Q2) Did we find all the malicious relays?
Q3) Did the malicious relays inject the signal at any points besides the HSDir position?
Q4) What data did the attackers keep, and are they going to destroy it? How have they protected the data (if any) while storing it?
Great questions. We spent several months trying to extract information from the researchers who were going to give the Black Hat talk, and eventually we did get some hints from them about how "relay early" cells could be used for traffic confirmation attacks, which is how we started looking for the attacks in the wild. They haven't answered our emails lately, so we don't know for sure, but it seems likely that the answer to Q1 is "yes". In fact, we hope they *were* the ones doing the attacks, since otherwise it means somebody else was. We don't yet know the answers to Q2, Q3, or Q4.
New work on denial of service in Tor will be presented at NDSS '14 on Tuesday, February 25th, 2014:
The Sniper Attack: Anonymously Deanonymizing and Disabling the Tor Network
by Rob Jansen, Florian Tschorsch, Aaron Johnson, and Björn Scheuermann
To appear at the 21st Symposium on Network and Distributed System Security
We found a new vulnerability in the design of Tor's flow control algorithm that can be exploited to remotely crash Tor relays. The attack is an extremely low resource attack in which an adversary's bandwidth may be traded for a target relay's memory (RAM) at an amplification rate of one to two orders of magnitude. Ironically, the adversary can use Tor to protect its identity while attacking Tor without significantly reducing the effectiveness of the attack.
We studied relay availability under the attack using Shadow, a discrete-event network simulator that runs the real Tor software in a safe, private testing environment, and found that we could disable each of the fastest guard and the fastest exit relay in a range of 1-18 minutes (depending on relay RAM capacity). We also found that the entire group of the top 20 exit relays, representing roughly 35% of Tor bandwidth capacity at the time of the analysis, could be disabled in a range of 29 minutes to 3 hours and 50 minutes. We also analyzed how the attack could potentially be used to deanonymize hidden services, and found that it would take between 4 and 278 hours before the attack would succeed (again depending on relay RAM capacity, as well as the bandwidth resources used to launch the attack).
Due to our devastating findings, we also designed three defenses that mitigate our attacks, one of which provably renders the attack ineffective. Defenses have been implemented and deployed into the Tor software to ensure that the Tor network is no longer vulnerable as of Tor version 0.2.4.18-rc and later. Some of that work can be found in Trac tickets #9063, #9072, #9093, and #10169.
In the remainder of this post I will detail the attacks and defenses we analyzed, noting again that this information is presented more completely (and more elegantly) in our paper.
The Tor Network Infrastructure
The Tor network is a distributed system made up of thousands of computers running the Tor software that contribute their bandwidth, memory, and computational resources for the greater good. These machines are called Tor relays, because their main task is to forward or relay network traffic to another entity after performing some cryptographic operations. When a Tor user wants to download some data using Tor, the user's Tor client software will choose three relays from those available (an entry, middle, and exit), form a path or circuit between these relays, and then instruct the third relay (the exit) to fetch the data and send it back through the circuit. The data will get transferred from its source to the exit, from the exit to the middle, and from the middle to the entry before finally making its way to the client.
The client may request the exit to fetch large amounts of data, and so Tor uses a window-based flow control scheme in order to limit the amount of data each relay needs to buffer in memory at once. When a circuit is created, the exit will initialize its circuit package counter to 1000 cells, indicating that it is willing to send 1000 cells into the circuit. The exit decrements the package counter by one for every data cell it sends into the circuit (to the middle relay), and stops sending data when the package counter reaches 0. The client at the other end of the circuit keeps a delivery counter, and initializes it to 0 upon circuit creation. The client increments the delivery counter by 1 for every data cell it receives on that circuit. When the client's delivery counter reaches 100, it sends a special Tor control cell, called a SENDME cell, to the exit to signal that it received 100 cells. Upon receiving the SENDME, the exit adds 100 to its package counter and continues sending data into the circuit.
This flow control scheme limits the amount of outstanding data that may be in flight at any time (between the exit and the client) to 1000 cells, or about 500 KiB, per circuit. The same mechanism is used when data is flowing in the opposite direction (up from the client, through the entry and middle, and to the exit).
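The window mechanism described above can be modeled in a few lines. This is a simplified sketch: cells are delivered instantly, the client's delivery counter is reset after each SENDME (an assumption about bookkeeping), and a cell size of 512 bytes is assumed, which matches the 500 KiB figure.

```python
CELL_SIZE = 512          # bytes per cell (assumed)
WINDOW = 1000            # initial package counter at the exit
SENDME_INCREMENT = 100   # cells acknowledged by each SENDME

class Exit:
    def __init__(self):
        self.package_counter = WINDOW

    def can_send(self):
        return self.package_counter > 0

    def send_cell(self):
        self.package_counter -= 1

    def receive_sendme(self):
        self.package_counter += SENDME_INCREMENT

class Client:
    def __init__(self):
        self.delivery_counter = 0

    def receive_cell(self):
        """Count a delivered cell; return True when a SENDME is due."""
        self.delivery_counter += 1
        if self.delivery_counter == SENDME_INCREMENT:
            self.delivery_counter = 0
            return True
        return False

def transfer(total_cells, client_reads=True):
    """Try to send total_cells; return how many the exit managed to package."""
    exit_relay, client = Exit(), Client()
    sent = 0
    while sent < total_cells and exit_relay.can_send():
        exit_relay.send_cell()
        sent += 1
        if client_reads and client.receive_cell():
            exit_relay.receive_sendme()
    return sent

print(transfer(5000, client_reads=True))    # a reading client keeps the window open
print(transfer(5000, client_reads=False))   # a stalled client caps the exit at the window
print(WINDOW * CELL_SIZE / 1024, "KiB in flight at most")
```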
The Sniper Attack
The new Denial of Service (DoS) attack, which we call "The Sniper Attack", exploits the flow control algorithm to remotely crash a victim Tor relay by depleting its memory resources. The paper presents three attacks that rely on the following two techniques:
- the attacker stops reading from the TCP connection containing the attack circuit, which causes the TCP window on the victim's outgoing connection to close and the victim to buffer up to 1000 cells; and
- the attacker causes cells to be continuously sent to the victim (exceeding the 1000 cell limit and consuming the victim's memory resources) either by ignoring the package window at the packaging end of the circuit, or by continuously sending SENDMEs from the delivery end to the packaging end even though no cells have been read by the delivery end.
Basic Version 1 (attacking an entry relay)
In basic version 1, the adversary controls the client and the exit relay, and chooses a victim for the entry relay position. The adversary builds a circuit through the victim to her own exit, and then the exit continuously generates and sends arbitrary data through the circuit toward the client while ignoring the package window limit. The client stops reading from the TCP connection to the entry relay, and the entry relay buffers all data being sent by the exit relay until it is killed by its OS out-of-memory killer.
Basic Version 2 (attacking an exit relay)
In basic version 2, the adversary controls the client and an Internet destination server (e.g. website), and chooses a victim for the exit relay position. The adversary builds a circuit through the victim exit relay, and then the client continuously generates and sends arbitrary data through the circuit toward the exit relay while ignoring the package window limit. The destination server stops reading from the TCP connection to the exit relay, and the exit relay buffers all data being sent by the client until it is killed by its OS out-of-memory killer.
Both of the basic versions of the attack above require the adversary to generate and send data, consuming roughly the same amount of upstream bandwidth as the victim's available memory. The efficient version reduces this cost by one to two orders of magnitude.
In the efficient version, the adversary controls only a client. She creates a circuit, choosing the victim for the entry position, and then instructs the exit relay to download a large file from some external Internet server. The client stops reading on the TCP connection to the entry relay, causing it to buffer 1000 cells.
At this point, the adversary may "trick" the exit relay into sending more cells by sending it a SENDME cell, even though the client has not actually received any cells from the entry. As long as this SENDME does not increase the exit relay's package counter to greater than 1000 cells, the exit relay will continue to package data from the server and send it into the circuit toward the victim. If the SENDME does cause the exit relay's package counter to exceed the 1000 cell limit, the exit will stop responding on that circuit. However, the entry and middle nodes will hold the circuit open until the client issues another command, meaning the circuit's resources will not be freed.
The bandwidth cost of the attack after circuit creation is simply the bandwidth cost of occasionally sending a SENDME to the exit. The memory consumption speed depends on the bandwidth and congestion of non-victim circuit relays. We describe how to parallelize the attack using multiple circuits and multiple paths with diverse relays in order to draw upon Tor's inherent resources. We found that with roughly 50 KiB/s of upstream bandwidth, an attacker could consume the victim's memory at roughly 1 MiB/s. This is highly dependent on the victim's bandwidth capabilities: relays that use token buckets to restrict bandwidth usage will of course bound the attack's consumption rate.
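The two rates quoted above give the amplification factor directly. The sketch below also estimates how long it would take to fill a given amount of memory at that rate; the 1 GiB figure is a hypothetical example, not a number from the paper.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

attacker_upstream = 50 * KIB   # bytes per second, from the text
victim_fill_rate = 1 * MIB     # bytes per second, from the text

amplification = victim_fill_rate / attacker_upstream

def seconds_to_fill(free_memory_bytes, fill_rate=victim_fill_rate):
    """Time needed to consume the given memory at a constant fill rate."""
    return free_memory_bytes / fill_rate

hypothetical_free_ram = 1 * GIB
print(f"amplification: {amplification:.2f}x")
print(f"time to fill 1 GiB: {seconds_to_fill(hypothetical_free_ram) / 60:.1f} minutes")
```

An amplification of about 20x sits within the "one to two orders of magnitude" range stated earlier.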
Rather than connecting directly to the victim, the adversary may instead launch the attack through a separate Tor circuit using a second client instance and the "Socks4Proxy" or "Socks5Proxy" option. In this case, she may benefit from the anonymity that Tor itself provides in order to evade detection. We found that there is not a significant increase in bandwidth usage when anonymizing the attack in this way.
A simple but naive defense against the Sniper Attack is to have the guard node watch its queue length, and if it ever fills to over 1000 cells, kill the circuit. This defense does not prevent the adversary from parallelizing the attack by using multiple circuits (and then consuming 1000 cells on each), which we have shown to be extremely effective.
Another defense, called "authenticated SENDMEs", tries to protect against receiving a SENDME from a node that didn't actually receive 100 cells. In this approach, a 1 byte nonce is placed in every 100th cell by the packaging end, and that nonce must be included by the delivery end in the SENDME (otherwise the packaging end rejects the SENDME as inauthentic). As above, this does not protect against the parallel attack. It also doesn't defend against either of the basic attacks where the adversary controls the packaging end and ignores the SENDMEs anyway.
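A minimal sketch of the authenticated SENDME idea, assuming the packaging end remembers outstanding nonces in order and accepts a SENDME only if it echoes the oldest one. The class and method names are hypothetical and the cell format is abstracted away.

```python
import secrets

class PackagingEnd:
    def __init__(self):
        self.cells_sent = 0
        self.package_counter = 1000
        self.pending_nonces = []   # nonces awaiting acknowledgement, oldest first

    def next_cell(self):
        """Send one cell; return its nonce, or None for ordinary cells."""
        self.cells_sent += 1
        self.package_counter -= 1
        if self.cells_sent % 100 == 0:
            nonce = secrets.randbits(8)   # 1-byte nonce in every 100th cell
            self.pending_nonces.append(nonce)
            return nonce
        return None

    def receive_sendme(self, nonce):
        """Accept the SENDME only if it echoes the oldest outstanding nonce."""
        if self.pending_nonces and nonce == self.pending_nonces[0]:
            self.pending_nonces.pop(0)
            self.package_counter += 100
            return True
        return False

end = PackagingEnd()
nonces = [end.next_cell() for _ in range(100)]
print(end.receive_sendme(nonces[-1]))   # genuine acknowledgement is accepted
print(end.receive_sendme(nonces[-1]))   # a replayed SENDME is rejected
```

A delivery end that has not read the 100th cell has to guess the nonce, and a single byte allows only 256 possible values.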
The best defense, as we suggested to the Tor developers, is to implement a custom, adaptive out-of-memory circuit killer in application space (i.e. inside Tor). The circuit killer is only activated when memory becomes scarce, and then it chooses the circuit with the oldest front-most cell in its circuit queue. This will prevent the Sniper Attack by killing off all of the attack circuits.
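The selection rule can be sketched as follows. This is an illustration only: it uses the number of queued cells as a stand-in for memory usage, and the `Circuit` class and `kill_until_below` helper are hypothetical rather than Tor's implementation.

```python
from collections import deque

class Circuit:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # arrival timestamps of queued cells, oldest first

    def enqueue(self, timestamp):
        self.queue.append(timestamp)

def total_cells(circuits):
    return sum(len(c.queue) for c in circuits)

def kill_until_below(circuits, max_cells):
    """Kill circuits with the oldest front-most cell until usage fits max_cells."""
    killed = []
    alive = list(circuits)
    while total_cells(alive) > max_cells:
        candidates = [c for c in alive if c.queue]
        victim = min(candidates, key=lambda c: c.queue[0])
        alive.remove(victim)
        killed.append(victim.name)
    return killed, alive

# The attack circuit never drains, so its front-most cell is very old.
attack = Circuit("attack")
for t in range(1000):
    attack.enqueue(t)

# An honest circuit is read continuously, so only recent cells are queued.
honest = Circuit("honest")
for t in range(990, 1000):
    honest.enqueue(t)

killed, alive = kill_until_below([honest, attack], max_cells=500)
print(killed, [c.name for c in alive])
```

Because a Sniper Attack circuit is never read from, its front-most cell keeps aging and it is selected ahead of circuits that drain normally.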
With this new defense in place, the next game is for the adversary to try to cause Tor to kill an honest circuit. In order for an adversary to cause an honest circuit to get killed, it must ensure that the front-most cell on its malicious circuit queue is at least slightly "younger" than the oldest cell on any honest queue. We show that the Sniper Attack is impractical with this defense: due to fairness mechanisms in Tor, the adversary must spend an extraordinary amount of bandwidth keeping its cells young — bandwidth that would likely be better served in a more traditional brute-force DoS attack.
Tor has implemented a version of the out-of-memory killer for circuits, and is currently working on expanding this to channel and connection buffers as well.
Hidden Service Attack and Countermeasures
The paper also shows how the Sniper Attack can be used to deanonymize hidden services:
1) run a malicious entry guard relay;
2) run the attack from Oakland 2013 to learn the current guard relay of the target hidden service;
3) run the Sniper Attack on the guard from step 2, knocking it offline and causing the hidden service to choose a new guard;
4) repeat, until the hidden service chooses the relay from step 1 as its new entry guard.
The technique to verify that the hidden service is using a malicious guard in step 4 is the same technique used in step 2.
In the paper, we compute the expected time to succeed in this attack while running malicious relays of various capacities. It takes longer to succeed against relays that have more RAM, since it relies on the Sniper Attack to consume enough RAM to kill the relay (which itself depends on the bandwidth capacity of the victim relay). For the malicious relay bandwidth capacities and honest relay RAM amounts used in our estimate, we found that deanonymization would involve between 18 and 132 Sniper Attacks and take between ~4 and ~278 hours.
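To get a feel for why the number of required Sniper Attacks depends so strongly on the malicious relay's capacity, one can model each guard replacement as an independent draw that lands on the malicious relay with some fixed probability. This is a deliberate simplification and not the model used in the paper; the probabilities below are hypothetical.

```python
def expected_attacks(p_malicious):
    """Expected number of guards knocked offline before the malicious one is picked."""
    if not 0 < p_malicious <= 1:
        raise ValueError("probability must be in (0, 1]")
    # Geometric distribution: expected failures before the first success.
    return (1 - p_malicious) / p_malicious

for p in (0.05, 0.01):
    print(f"selection probability {p:.0%}: about {expected_attacks(p):.0f} attacks")
```

Under this toy model, a relay that is selected 5% of the time needs about 19 attacks on average, while one selected 1% of the time needs about 99.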
This attack becomes much more difficult if the relay is rebooted soon after it crashes, and the attack is ineffective when Tor relays are properly defending against the Sniper Attack (see the "Defenses" section above).
Strategies to defend hidden services in particular go beyond those suggested here to include entry guard rate-limiting, where you stop building circuits if you notice that your new guards keep going down (failing closed), and middle guards, guard nodes for your guard nodes. Both of these strategies attempt to make it harder to coerce the hidden service into building new circuits or exposing itself to new relays, since that is precisely what is needed for deanonymization.
The main defense implemented in Tor will start killing circuits when memory gets low. Currently, Tor uses a configuration option (MaxMemInCellQueues) that allows a relay operator to configure when the circuit-killer should be activated. There is likely not one single value that makes sense here: if it is too high, then relays with lower memory will not be protected; if it is too low, then there may be more false positives resulting in honest circuits being killed. Can Tor determine this setting in an OS-independent way that allows relays to automatically find the right value for MaxMemInCellQueues?
The defenses against the Sniper Attack prevent the adversary from crashing the victim relay, but the adversary may still consume a large amount of the relay's bandwidth (and memory resources, up to a critical level) at relatively low cost, similar to more traditional bandwidth amplification attacks. More analysis of general bandwidth consumption attacks and defenses remains a useful research problem.
Finally, hidden services also need some love. More work is needed to redesign them in a way that does not allow a client to cause the hidden service to choose new relays on demand.
There are tensions in the Tor protocol design between the anonymity provided by entry guards and the performance improvements from better load balancing. This blog post walks through the research questions I raised in 2011, then summarizes answers from three recent papers written by researchers in the Tor community, and finishes by explaining what Tor design changes we need to make to provide better anonymity, and what we'll be trading off.
Part one: The research questions
In Tor, each client selects a few relays at random, and chooses only from those relays when making the first hop of each circuit. This entry guard design helps in three ways:
First, entry guards protect against the "predecessor attack": if Alice (the user) instead chose new relays for each circuit, eventually an attacker who runs a few relays would be her first and last hop. With entry guards, the risk of end-to-end correlation for any given circuit is the same, but the cumulative risk for all her circuits over time is capped.
Second, they help to protect against the "denial of service as denial of anonymity" attack, where an attacker who runs quite a few relays fails any circuit that he's a part of and that he can't win against, forcing Alice to generate more circuits and thus increasing the overall chance that the attacker wins. Entry guards greatly reduce the risk, since Alice will never choose outside of a few nodes for her first hop.
Third, entry guards raise the startup cost to an adversary who runs relays in order to trace users. Without entry guards, the attacker can sign up some relays and immediately start having chances to observe Alice's circuits. With them, new adversarial relays won't have the Guard flag so won't be chosen as the first hop of any circuit; and even once they earn the Guard flag, users who have already chosen guards won't switch away from their current guards for quite a while.
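To put rough numbers on the first point, here is a minimal sketch of the predecessor attack arithmetic. It assumes a deliberately simplified model that is not how Tor actually weights relays: the adversary holds the same share of entry and exit selection probability, every circuit is an independent draw, and the function names are made up for illustration.

```python
def risk_without_guards(adversary_share: float, circuits: int) -> float:
    """Chance that at least one of `circuits` circuits has an adversarial
    first AND last hop, when both hops are re-drawn for every circuit.

    Simplifying assumption: the adversary is picked with probability
    `adversary_share` in each position, independently.
    """
    per_circuit = adversary_share ** 2
    return 1 - (1 - per_circuit) ** circuits


def risk_cap_with_guards(adversary_share: float, guards: int) -> float:
    """Upper bound on the cumulative risk when the first hop always comes
    from a fixed set of `guards` guards: if none of them is adversarial,
    no number of circuits exposes the user at the entry position.
    """
    return 1 - (1 - adversary_share) ** guards


if __name__ == "__main__":
    share = 0.05  # hypothetical adversary share, chosen for illustration
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} circuits, no guards: {risk_without_guards(share, n):.3f}")
    print(f"cap with 3 fixed guards:    {risk_cap_with_guards(share, 3):.3f}")
```

Under these assumptions the no-guards risk climbs toward certainty as Alice builds more circuits, while the fixed guard set caps the cumulative risk at the chance that one of her few guards is bad.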
In August 2011, I posted these four open research questions around guard rotation parameters:
- Natural churn: For an adversary that controls a given number of relays, if the user only replaces her guards when the current ones become unavailable, how long will it take until she's picked an adversary's guard?
- Artificial churn: How much more risk does she introduce by intentionally switching to new guards before she has to, to load balance better?
- Number of guards: What are the tradeoffs in performance and anonymity from picking three guards vs two or one? By default Tor picks three guards, since if we picked only one then some clients would pick a slow one and be sad forever. On the other hand, picking only one makes users safer.
- Better Guard flag assignment: If we give the Guard flag to more or different relays, how much does it change all these answers?
For reference, Tor 0.2.3's entry guard behavior is "choose three guards, adding another one if two of those three go down but going back to the original ones if they come back up, and also throw out (aka rotate) a guard 4-8 weeks after you chose it." I'll discuss in "Part three" of this post what changes we should make to improve this policy.
Part two: Recent research papers
Tariq Elahi, a grad student in Ian Goldberg's group in Waterloo, began to answer the above research questions in his paper Changing of the Guards: A Framework for Understanding and Improving Entry Guard Selection in Tor (published at WPES 2012). His paper used eight months of real-world historical Tor network data (from April 2011 to December 2011) and simulated various guard rotation policies to see which approaches protect users better.
Tariq's paper considered a quite small adversary: he let all the clients pick honest guards, and then added one new small guard to the 800 or so existing guards. The question is then what fraction of clients use this new guard over time. Here's a graph from the paper, showing (assuming all users pick three guards) the vulnerability due to natural churn ("without guard rotation") vs natural churn plus also intentional guard rotation:
In this graph their tiny guard node, in the "without guard rotation" scenario, ends up getting used by about 3% of the clients in the first few months, and gets up to 10% by the eight-month mark. The more risky scenario — which Tor uses today — sees the risk shoot up to 14% in the first few months. (Note that the y-axis in the graph only goes up to 16%, mostly because the attacking guard is so small.)
The second paper to raise the issue is from Alex Biryukov, Ivan Pustogarov, and Ralf-Philipp Weinmann in Luxembourg. Their paper Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization (published at Oakland 2013) mostly focuses on other attacks (like how to censor or track popularity of hidden services), but their Section VI.C. talks about the "run a relay and wait until the client picks you as her guard" attack. In this case they run the numbers for a much larger adversary: if they run 13.8% of the Tor network for eight months there's more than a 90% chance of a given hidden service using their guard sometime during that period. That's a huge fraction of the network, but it's also a huge chance of success. And since hidden services in this case are basically the same as Tor clients (they choose guards and build circuits the same way), it's reasonable to conclude that their attack works against normal clients too so long as the clients use Tor often enough during that time.
I should clarify three points here.
First clarifying point: Tariq's paper makes two simplifying assumptions when calling an attack successful if the adversary's relay *ever* gets into the user's guard set. 1) He assumes that the adversary is also either watching the user's destination (e.g. the website she's going to), or he's running enough exit relays that he'll for sure be able to see the corresponding flow out of the Tor network. 2) He assumes that the end-to-end correlation attack (matching up the incoming flow to the outgoing flow) is instantaneous and perfect. Alex's paper argues pretty convincingly that these two assumptions are easier to make in the case of attacking a hidden service (since the adversary can dictate how often the hidden service makes a new circuit, as well as what the traffic pattern looks like), and the paper I describe next addresses the first assumption, but the second one ("how successful is the correlation attack at scale?" or maybe better, "how do the false positives in the correlation attack compare to the false negatives?") remains an open research question.
Researchers generally agree that given a handful of traffic flows, it's easy to match them up. But what about the millions of traffic flows we have now? What levels of false positives (algorithm says "match!" when it's wrong) are acceptable to this attacker? Are there some simple, not too burdensome, tricks we can do to drive up the false positives rates, even if we all agree that those tricks wouldn't work in the "just looking at a handful of flows" case?
More precisely, it's possible that correlation attacks don't scale well because as the number of Tor clients grows, the chance that the exit stream actually came from a different Tor client (not the one you're watching) grows. So the confidence in your match needs to grow along with that or your false positive rate will explode. The people who say that correlation attacks don't scale use phrases like "say your correlation attack is 99.9% accurate" when arguing it. The folks who think it does scale use phrases like "I can easily make my correlation attack arbitrarily accurate." My hope is that the reality is somewhere in between — correlation attacks in the current Tor network can probably be made plenty accurate, but perhaps with some simple design changes we can improve the situation. In any case, I'm not going to try to tackle that research question here, except to point out that 1) it's actually unclear in practice whether you're done with the attack if you get your relay into the user's guard set, or if you are now faced with a challenging flow correlation problem that could produce false positives, and 2) the goal of the entry guard design is to make this issue moot: it sure would be nice to have a design where it's hard for adversaries to get into a position to see both sides, since it would make it irrelevant how good they are at traffic correlation.
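The "false positives explode as the network grows" argument is a base-rate calculation, and a small sketch may make it easier to see. The rates below are assumptions picked to mirror the "99.9% accurate" phrase, not measurements of any real attack, and `match_precision` is a hypothetical helper name.

```python
def match_precision(true_positive_rate: float,
                    false_positive_rate: float,
                    candidate_flows: int) -> float:
    """Fraction of reported matches that point at the right flow.

    Exactly one of the `candidate_flows` exit flows really belongs to the
    watched client; every other flow can trigger a false alarm with
    probability `false_positive_rate`.
    """
    expected_true_alarms = true_positive_rate
    expected_false_alarms = false_positive_rate * (candidate_flows - 1)
    return expected_true_alarms / (expected_true_alarms + expected_false_alarms)


if __name__ == "__main__":
    # Assumed rates: 99.9% detection, 0.1% false alarms per unrelated flow.
    for flows in (10, 10_000, 1_000_000):
        precision = match_precision(0.999, 0.001, flows)
        print(f"{flows:>9} candidate flows -> precision {precision:.4f}")
```

With a handful of flows nearly every reported match is correct, but with the same per-flow error rates and a million candidate flows, almost every reported match is wrong. Whether an attacker can push the false positive rate low enough to compensate is exactly the open question.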
Second clarifying point: it's about the probabilities, and that's intentional. Some people might be scared by phrases like "there's an x% chance over y months to be able to get an attacker's relay into the user's guard set." After all, they reason, shouldn't Tor provide absolute anonymity rather than probabilistic anonymity? This point is even trickier in the face of centralized anonymity services that promise "100% guaranteed" anonymity, when what they really mean is "we could watch everything you do, and we might sell or give up your data in some cases, and even if we don't there's still just one point on the network where an eavesdropper can learn everything." Tor's path selection strategy distributes trust over multiple relays to avoid this centralization. The trouble here isn't that there's a chance for the adversary to win — the trouble is that our current parameters make that chance bigger than it needs to be.
To make it even clearer: the entry guard design is doing its job here, just not well enough. Specifically, *without* using the entry guard design, an adversary who runs some relays would very quickly find himself as the first hop of one of the user's circuits.
Third clarifying point: we're considering an attacker who wants to learn if the user *ever* goes to a given destination. There are plenty of reasonable other things an attacker might be trying to learn, like building a profile of many or all of the user's destinations, but in this case Tariq's paper counts a successful attack as one that confirms (subject to the above assumptions) that the user visited a given destination once.
And that brings us to the third paper, by Aaron Johnson et al: Users Get Routed: Traffic Correlation on Tor by Realistic Adversaries (upcoming at CCS 2013). This paper ties together two previous series of research papers: the first is "what if the attacker runs a relay?" which is what the above two papers talked about, and the second is "what if the attacker can watch part of the Internet?"
The first part of the paper should sound pretty familiar by now: they simulated running a few entry guards that together make up 10% of the guard capacity in the Tor network, and they showed that (again using historical Tor network data, but this time from October 2012 to March 2013) the chance that the user has made a circuit using the adversary's relays is more than 80% by the six month mark.
In this case their simulation includes the adversary running a fast exit relay too, and the user performs a set of sessions over time. They observe that the user's traffic passes over pretty much all the exit relays (which makes sense since Tor doesn't use an "exit guard" design). Or summarizing at an even higher level, the conclusion is that so long as the user uses Tor enough, this paper confirms the findings in the earlier two papers.
Where it gets interesting is when they explain that "the adversary could run a relay" is not the only risk to worry about. They build on the series of papers started by "Location Diversity in Anonymity Networks" (WPES 2004), "AS-awareness in Tor path selection" (CCS 2009), and most recently "An Empirical Evaluation of Relay Selection in Tor" (NDSS 2013). These papers look at the chance that traffic from a given Tor circuit will traverse a given set of Internet links.
Their point, which like all good ideas is obvious in retrospect, is that rather than running a guard relay and waiting for the user to switch to it, the attacker should instead monitor as many Internet links as he can, and wait for the user to use a guard such that traffic between the user and the guard passes over one of the links the adversary is watching.
This part of the paper raises as many questions as it answers. In particular, all the users they considered are in or near Germany. There are also quite a few Tor relays in Germany. How much of their results here can be explained by peculiarities of Internet connectivity in Germany? Are their results predictive in any way about how users on other continents would fare? Or said another way, how can we learn whether their conclusion shouldn't instead be "German Tor users are screwed, because look how Germany's Internet topology is set up"? Secondly, their scenario has the adversary control the Autonomous System (AS) or Internet Exchange Point (IXP) that maximally deanonymizes the user (they exclude the AS that contains the user and the AS that contains her destinations). This "best possible point to attack" assumption a) doesn't consider how hard it is to compromise that particular part of the Internet, and b) seems like it will often be part of the Internet topology near the user (and thus vary greatly depending on which user you're looking at). And third, like the previous papers, they think of an AS as a single Internet location that the adversary is either monitoring or not monitoring. Some ASes, like large telecoms, are quite big and spread out.
That said, I think it's clear from this paper that there *do* exist realistic scenarios where Tor users are at high risk from an adversary watching the nearby Internet infrastructure and/or parts of the Internet backbone. Changing the guard rotation parameters as I describe in "Part three" below will help in some of these cases but probably won't help in all of them. The canonical example that I've given in talks about "a person in Syria using Tor to visit a website in Syria" remains a very serious worry.
The paper also makes me think about exit traffic patterns, and how to better protect people who use Tor for only a short period of time: many websites pull in resources from all over, especially resources from centralized ad sites. This risk (that it greatly speeds the rate at which an adversary watching a few exit points — or heck, a few ad sites — will be able to observe a given user's exit traffic) provides the most compelling reason I've heard so far to ship Tor Browser Bundle with an ad blocker — or maybe better, with something like Request Policy that doesn't even touch the sites in the first place. On the other hand, Mike Perry still doesn't want to ship an ad blocker in TBB, since he doesn't want to pick a fight with Google and give them even more of a reason to block/drop all Tor traffic. I can see that perspective too.
Part three: How to fix it
Here are five steps we should take, in rough order of how much impact I think each of them would have on the above attacks.
If you like metaphors, think of each time you pick a new guard as a coin flip (heads you get the adversary's guard, tails you're safe this time), and the ideas here aim to reduce both the number and frequency of coin flips.
Fix 1: Tor clients should use fewer guards.
The primary benefit to moving to fewer guards is that there are fewer coin flips every time you pick your guards.
But there's a second benefit as well: right now your choice of guards acts as a kind of fingerprint for you, since very few other users will have picked the same three guards you did. (This fingerprint is only usable by an attacker who can discover your guard list, but in some scenarios that's a realistic attack.) To be more concrete: if the adversary learns that you have a particular three guards, and later sees an anonymous user with exactly the same guards, how likely is it to be you? Moving to two guards helps the math a lot here, since you'll overlap with many more users when everybody is only picking two.
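Here is a rough sketch of that fingerprint math. It assumes every client picks its guards uniformly at random (real clients weight by bandwidth, so this overstates how unique a guard set is), takes the "800 or so" guards mentioned earlier, and uses a purely hypothetical user count of 500,000.

```python
from math import comb


def expected_other_users_with_same_guards(users: int,
                                          guards_in_network: int,
                                          guards_per_client: int) -> float:
    """Expected number of OTHER users who picked exactly your guard set,
    assuming each client picks uniformly among all possible sets."""
    possible_sets = comb(guards_in_network, guards_per_client)
    return (users - 1) / possible_sets


if __name__ == "__main__":
    USERS = 500_000   # hypothetical, for illustration only
    GUARDS = 800      # "800 or so existing guards"
    for k in (3, 2, 1):
        others = expected_other_users_with_same_guards(USERS, GUARDS, k)
        print(f"{k} guard(s): about {others:.3g} other users share your set")
```

With three guards the expected number of other users sharing your exact set is far below one, so the set is effectively unique. With two guards it is above one, and with a single guard you blend in with hundreds of other users.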
On the other hand, the main downside is increased variation in performance. Here's Figure 10 from Tariq's paper:
"Farther to the right" is better in this graph. When you pick three guards (the red line), the average speed of your guards is pretty good (and pretty predictable), since most guards are pretty fast and it's unlikely you'll pick slow ones for all three. However, when you pick only one guard (the purple line), the odds go up a lot that you get unlucky and pick a slow one. In more concrete numbers, half of the Tor users will see up to 60% worse performance.
The fix of course is to raise the bar for becoming a guard, so every possible guard will be acceptably fast. But then we have fewer guards total, increasing the vulnerability from other attacks! Finding the right balance (as many guards as possible, but all of them fast) is going to be an ongoing challenge. See Brainstorm tradeoffs from moving to 2 (or even 1) guards (ticket 9273) for more discussion.
Switching to just one guard will also preclude deploying Conflux, a recent proposal to improve Tor performance by routing traffic over multiple paths in parallel. The Conflux design is appealing because it not only lets us make better use of lower-bandwidth relays (which we'll need to do if we want to greatly grow the size of the Tor network), but it also lets us dynamically adapt to congestion by shifting traffic to less congested routes. Maybe some sort of "guard family" idea can work, where a single coin flip chooses a pair of guards and then we split our traffic over them. But if we want to avoid doubling the exposure to a network-level adversary, we might want to make sure that these two guards are near each other on the network — I think the analysis of the network-level adversary in Aaron's paper is the strongest argument for restricting the variety of Internet paths that traffic takes between the Tor client and the Tor network.
This discussion about reducing the number of guards also relates to bridges: right now if you configure ten bridges, you round-robin over all of them. It seems wise for us to instead use only the first bridge in our bridge list, to cut down on the set of Internet-level adversaries that get to see the traffic flows going into the Tor network.
Fix 2: Tor clients should keep their guards for longer.
In addition to choosing fewer guards, we should also avoid switching guards so often. I originally picked "one or two months" for guard rotation since it seemed like a very long time. In Tor 0.2.4, we've changed it to "two or three months". But I think changing the guard rotation period to a year or more is probably much wiser, since it will slow down the curves on all the graphs in the above research papers.
I asked Aaron to make a graph comparing the success of an attacker who runs 10% of the guard capacity, in the "choose 3 guards and rotate them every 1-2 months" case and the "choose 1 guard and never rotate" case:
In the "3 guard" case (the blue line), the attacker's success rate rapidly grows to about 25%, and then it steadily grows to over 80% by the six month mark. The "1 guard" case (green line), on the other hand, grows to 10% (which makes sense since the adversary runs 10% of the guards), but then it levels off and grows only slowly as a function of network churn. By the six month mark, even this very large adversary's success rate is still under 25%.
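The shape of those two curves can be approximated with the coin-flip model from earlier. This is only a back-of-the-envelope sketch, not Aaron's simulation: it ignores natural churn and bandwidth weighting, assumes every guard is replaced on a fixed schedule, and the 45-day rotation period is my own stand-in for "every 1-2 months".

```python
from typing import Optional


def compromise_probability(adversary_share: float,
                           guards: int,
                           days: int,
                           rotation_days: Optional[int] = None) -> float:
    """Chance that at least one guard chosen up to day `days` is adversarial.

    Each guard pick is an independent coin flip that lands on the adversary
    with probability `adversary_share`. With rotation, all guards are
    re-picked every `rotation_days` days; without it, only the initial
    picks count (natural churn is ignored).
    """
    picks = guards
    if rotation_days is not None:
        picks += guards * (days // rotation_days)
    return 1 - (1 - adversary_share) ** picks


if __name__ == "__main__":
    for day in (0, 45, 90, 135, 180):
        three = compromise_probability(0.10, 3, day, rotation_days=45)
        one = compromise_probability(0.10, 1, day)
        print(f"day {day:>3}: 3 guards + rotation {three:.2f}, 1 guard fixed {one:.2f}")
```

Under these assumptions the three-guard policy starts near 27% and reaches roughly 79% by day 180, which is in the same ballpark as the blue line. The single fixed guard stays at 10% here only because the sketch leaves out churn, which is what makes the green line creep upward in the real graph.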
So the good news is that by choosing better guard rotation parameters, we can almost entirely resolve the vulnerabilities described in these three papers. Great!
Or to phrase it more as a research question, once we get rid of this known issue, I'm curious how the new graphs over time will look, especially when we have a more sophisticated analysis of the "network observer" adversary. I bet there are some neat other attacks that we'll need to explore and resolve, but that are being masked by the poor guard parameter issue.
However, fixing the guard rotation period issue is alas not as simple as we might hope. The fundamental problem has to do with "load balancing": allocating traffic onto the Tor network so each relay is used the right amount. If Tor clients choose a guard and stick with it for a year or more, then old guards (relays that have been around and stable for a long time) will see a lot of use, and new guards will see very little use.
I wrote a separate blog post to provide background for this issue: "The lifecycle of a new relay". Imagine if the ramp-up period in the graph from that blog post were a year long! People would set up fast relays, they would get the Guard flag, and suddenly they'd see little to no traffic for months. We'd be throwing away easily half of the capacity volunteered by relays.
One approach to resolving the conflict would be for the directory authorities to track how much of the past n months each relay has had the Guard flag, and publish a fraction in the networkstatus consensus. Then we'd teach clients to rebalance their path selection choices so a relay that's been a Guard for only half of the past year only counts 50% as a guard in terms of using that relay in other positions in circuits. See Load balance right when we have higher guard rotation periods (ticket 9321) for more discussion, and see Raise our guard rotation period (ticket 8240) for earlier discussions.
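One way to read that idea is sketched below. This is my own illustrative interpretation, not the design in the tickets: the function name, the simple linear split, and the bandwidth figure are all assumptions.

```python
from typing import Tuple


def position_weights(bandwidth: float, guard_fraction: float) -> Tuple[float, float]:
    """Split a Guard-flagged relay's bandwidth between circuit positions.

    `guard_fraction` is the share of the past n months during which the
    relay held the Guard flag (0.0 to 1.0), as published in a hypothetical
    consensus field. The relay counts as a guard only for that share of its
    capacity; the rest stays available for other positions.
    """
    if not 0.0 <= guard_fraction <= 1.0:
        raise ValueError("guard_fraction must be between 0 and 1")
    guard_weight = bandwidth * guard_fraction
    other_weight = bandwidth - guard_weight
    return guard_weight, other_weight


if __name__ == "__main__":
    # A relay that has been a Guard for half of the past year.
    guard_w, other_w = position_weights(10_000.0, 0.5)
    print(f"as guard: {guard_w:.0f}, in other positions: {other_w:.0f}")
```

The point of a split like this is that a brand-new guard, which few long-lived clients have picked yet, keeps carrying middle-hop traffic instead of sitting idle while it waits for clients to rotate onto it.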
Yet another challenge here is that sticking to the same guard for a year gives plenty of time for an attacker to identify the guard and attack it somehow. It's particularly easy to identify the guard(s) for hidden services currently (since as mentioned above, the adversary can control the rate at which hidden services make new circuits, simply by visiting the hidden service), but similar attacks can probably be made to work against normal Tor clients — see e.g. the http-level refresh tricks in How Much Anonymity does Network Latency Leak? This attack would effectively turn Tor into a network of one-hop proxies, to an attacker who can efficiently enumerate guards. That's not a complete attack, but it sure does make me nervous.
One possible direction for a fix is to a) isolate streams by browser tab, so all the requests from a given browser tab go to the same circuit, but different browser tabs get different circuits, and then b) stick to the same three-hop circuit (i.e. same guard, middle, and exit) for the lifetime of that session (browser tab). How to slow down guard enumeration attacks is a tough and complex topic, and it's too broad for this blog post, but I raise the issue here as a reminder of how interconnected anonymity attacks and defenses are. See Slow Guard Discovery of Hidden Services and Clients (ticket 9001) for more discussion.
Fix 3: The Tor code should better handle edge cases where you can't reach your guard briefly.
If a temporary network hiccup makes your guard unreachable, you switch to another one. But how long is it until you switch back? If the adversary's goal is to learn whether you ever go to a target website, then even a brief switch to a guard that the adversary can control or observe could be enough to mess up your anonymity.
Tor clients fetch a new networkstatus consensus every 2-4 hours, and they are willing to retry non-running guards if the new consensus says they're up again.
But I think there are a series of little bugs and edge cases where the Tor client abandons a guard more quickly than it should. For example, we mark a guard as failed if any of our circuit requests time out before finishing the handshake with the first hop. We should audit both the design and the source code with an eye towards identifying and resolving these issues.
We should also consider whether an adversary can *induce* congestion or resource exhaustion to cause a target user to switch away from her guard. Such an attack could work very nicely coupled with the guard enumeration attacks discussed above.
Most of these problems exist because in the early days we emphasized reachability ("make sure Tor works") over anonymity ("be very sure that your guard is gone before you try another one"). How should we handle this tradeoff between availability and anonymity: should you simply stop working if you've switched guards too many times recently? I imagine different users would choose different answers to that tradeoff, depending on their priorities. It sounds like we should make it easier for users to select "preserve my anonymity even if it means lower availability". But at the same time, we should remember the lessons from Anonymity Loves Company: Usability and the Network Effect about how letting users choose different settings can make them more distinguishable.
Fix 4: We need to make the network bigger.
We've been working hard in recent years to get more relay capacity. The result is a more than four-fold increase in network capacity since 2011:
As the network grows, an attacker with a given set of resources will have less success at the attacks described in this blog post. To put some numbers on it, while the relay adversary in Aaron's paper (who carries 660mbit/s of Tor traffic) represented 10% of the guard capacity in October 2012, that very same attacker would have been 20% of the guard capacity in October 2011. Today that attacker is about 5% of the guard capacity. Growing the size of the network translates directly into better defense against these attacks.
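The figures in that paragraph imply a total guard capacity for each date, which the following small sketch works out. It uses only the numbers quoted above (660 Mbit/s and the 20%, 10%, and 5% shares); the helper name is made up.

```python
ATTACKER_MBIT = 660  # relay adversary from Aaron's paper


def implied_guard_capacity(attacker_mbit: float, attacker_share: float) -> float:
    """Total guard capacity implied by the attacker holding `attacker_share` of it."""
    return attacker_mbit / attacker_share


if __name__ == "__main__":
    shares = {"October 2011": 0.20, "October 2012": 0.10, "today": 0.05}
    for label, share in shares.items():
        total = implied_guard_capacity(ATTACKER_MBIT, share)
        print(f"{label:>12}: attacker share {share:.0%} -> about {total:,.0f} Mbit/s of guard capacity")
```

Halving the attacker's share means the guard capacity doubled, so going from 20% to 5% corresponds to the same four-fold growth since 2011 mentioned above.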
However, the analysis is more complex when it comes to a network adversary. Just adding more relays (and more relay capacity) doesn't always help. For example, adding more relay capacity in a part of the network that the adversary is already observing can actually *decrease* anonymity, because it increases the fraction the adversary can watch. We discussed many of these issues in the thread about turning funding into more exit relays. For more details about the relay distribution in the current Tor network, check out Compass, our tool to explore what fraction of relay capacity is run in each country or AS. Also check out Lunar's relay bubble graphs.
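A toy calculation shows how adding capacity can hurt. All the capacity numbers here are invented for illustration, and the model simply treats "fraction of capacity inside observed networks" as the adversary's view.

```python
def observed_fraction(observed_capacity: float, total_capacity: float) -> float:
    """Share of relay capacity that sits in networks the adversary watches."""
    return observed_capacity / total_capacity


if __name__ == "__main__":
    observed, total, added = 30.0, 100.0, 20.0  # hypothetical units of capacity

    before = observed_fraction(observed, total)
    added_inside = observed_fraction(observed + added, total + added)
    added_outside = observed_fraction(observed, total + added)

    print(f"before:                       {before:.1%}")
    print(f"new relays in observed AS:    {added_inside:.1%}")
    print(f"new relays somewhere else:    {added_outside:.1%}")
```

The same amount of new capacity raises the adversary's share when it lands inside a network they already watch and lowers it when it lands elsewhere, which is why where new relays go matters as much as how many there are.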
Yet another open research question in the field of anonymous communications is how the success rate of a network adversary changes as the Tor network changes. If we were to plot the success rate of the *relay* adversary using historical Tor network data over time, it's pretty clear that the success rate would be going down over time as the network grows. But what's the trend for the success rate of the network adversary over the past few years? Nobody knows. It could be going up or down. And even if it is going down, it could be going down quickly or slowly.
(Read more in Research problem: measuring the safety of the Tor network where I describe some of these issues in more detail.)
Recent papers have gone through enormous effort to get one, very approximate, snapshot of the Internet's topology. Doing that effort retroactively and over long and dynamic time periods seems even more difficult and more likely to introduce errors.
It may be that the realities of Internet topology centralization make it so that there are fundamental limits on how much safety Tor users can have in a given network location. On the other hand, researchers like Aaron Johnson are optimistic that "network topology aware" path selection can improve Tor's protection against this style of attack. Much work remains.
Fix 5: We should assign the guard flag more intelligently.
In point 1 above I talked about why we need to raise the bar for becoming a guard, so all guards can provide adequate bandwidth. On the other hand, having fewer guards is directly at odds with point 4 above.
My original guard rotation parameters blog post ends with this question: what algorithm should we use to assign Guard flags such that a) we assign the flag to as many relays as possible, yet b) we minimize the chance that Alice will use the adversary's node as a guard?
We should use historical Tor network data to pick good values for the parameters that decide which relays become guards. This remains a great thesis topic if somebody wants to pick it up.
Part four: Other thoughts
What does all of this discussion mean for the rest of Tor? I'll close by trying to tie this blog post to the broader Tor world.
First, all three of these papers come from the Tor research community, and it's great that Tor gets such attention. We get this attention because we put so much effort into making it easy for researchers to analyze Tor: we've worked closely with these authors to help them understand Tor and focus on the most pressing research problems.
Second, don't be fooled into thinking that these attacks only apply to Tor: using Tor is still better than using any other tool, at least in quite a few of these scenarios. That said, some other attacks in the research literature might be even easier than the attacks discussed here. These are fast-moving times for anonymity research. "Maybe you shouldn't use the Internet then" is still the best advice for some people.
Third, the network-level adversaries rely on being able to recognize Tor flows. Does that argue that using pluggable transports, with bridges, might change the equation if it stops the attacker from recognizing Tor users?
Fourth, I should clarify that I don't think any of these large relay-level adversaries actually exist, except as a succession of researchers showing that it can be done. (GCHQ apparently ran a small number of relays a while ago, but not in a volume or duration that would have enabled this attack.) Whereas I *do* think that the network-level attackers exist, since they already invested in being able to surveil the Internet for other reasons. So I think it's great that Aaron's paper presents the dual risks of relay adversaries and link adversaries, since most of the time when people are worrying about one of them they're forgetting the other one.
Fifth, there are still some ways to game the bandwidth authority measurements (here's the spec) into giving you more than your fair share of traffic. Ideally we'd adapt a design like EigenSpeed so it can measure fast relays both robustly and accurately. This question also remains a great thesis topic.
And finally, as everybody wants to know: was this attack how "they" busted recent hidden services (Freedom Hosting, Silk Road, the attacks described in the latest Guardian article)? The answer is apparently no in each case, which means the techniques they *did* use were even *lower* hanging fruit. The lesson? Security is hard, and you have to get it right at many different levels.
We've had several requests by the press and others to talk about the Silk Road situation today. We only know what's going on by reading the same news sources everyone else is reading.
In this case we've been watching carefully to try to learn if there are any flaws with Tor that we need to correct. So far, nothing about this case makes us think that there are new ways to compromise Tor (the software or the network). The FBI says that their suspect made mistakes in operational security, and was found through actual detective work. Remember: Tor does not anonymize individuals when they use their legal name on a public forum, use a VPN with logs that are subject to a subpoena, or provide personal information to other services. See also the list of warnings linked from the Tor download page.
Also, while we've seen no evidence that this case involved breaking into the webserver behind the hidden service, we should take this opportunity to emphasize that Tor's hidden service feature (a way to publish and access content anonymously) won't keep someone anonymous when paired with unsafe software or unsafe behavior. It is up to the publisher to choose and configure server software that is resistant to attacks. Mistakes in configuring or maintaining a hidden service website can compromise the publisher's anonymity independent of Tor.
And finally, Tor's design goals include preventing even The Tor Project from tracking users; hidden services are no different. We don't have any special access to or information about this hidden service or any other. Because Tor is open-source and it comes with detailed design documents and research papers, independent researchers can verify its security.
Here are some helpful links to more information on these subjects:
Technical details of hidden services:
Our abuse FAQ:
For those curious about our interactions with law enforcement:
Using Tor hidden services for good:
Regarding the Freedom Hosting incident in August 2013, which is unrelated as far as we can tell:
Some general hints on staying anonymous:
The Tor Project is a nonprofit 501(c)(3) organization dedicated to providing tools to help people manage their privacy on the Internet. Our focus continues to be in helping ordinary citizens, victims of abuse, individuals in dangerous parts of the world, and others stay aware and educated about how to keep themselves secure online.
The global Tor team remains committed to building technology solutions to help keep the doors to freedom of expression open. We will continue to watch as the details of this situation unfold and respond when it is appropriate and useful.
For further press related questions please contact us at firstname.lastname@example.org.
A hidden service is a server – often delivering web pages – that is reachable only through the Tor network. While most people know that the Tor network with its thousands of volunteer-run nodes provides anonymity for users who don't want to be tracked and identified on the internet, the lesser-known hidden service feature of Tor also provides anonymity for the server operator.
Anyone can run hidden services, and many do. We use them internally at The Tor Project to offer our developers anonymous access to services such as SSH, IRC, HTTP, and our bug tracker. Other organizations run hidden services to protect dissidents and activists, and to protect the anonymity of users seeking help with suicide prevention, domestic violence, and abuse recovery. Whistleblowers and journalists use hidden services to exchange information in a secure and anonymous way and publish critical information in a way that is not easily traced back to them. The New Yorker's Strongbox is one public example.
Hidden service addresses, aka the dot onion domain, are cryptographically and automatically generated by the tor software. They look like this: http://idnxcnkne4qt76tg.onion/, which is our torproject.org website as a hidden service.
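For the curious, here is a minimal sketch of how a 16-character address like the one above is derived, as I understand the hidden service design: take the SHA-1 hash of the service's DER-encoded RSA public key, keep the first 80 bits, and base32-encode them. The key bytes below are a placeholder rather than a real key, so the printed address is meaningless; the function name is made up for illustration.

```python
import base64
import hashlib


def onion_address_from_public_key(der_public_key: bytes) -> str:
    """Derive a 16-character .onion name from a DER-encoded public key.

    Steps: SHA-1 the key, keep the first 10 bytes (80 bits), base32-encode,
    lowercase, and append ".onion". Ten bytes encode to exactly 16 base32
    characters, so no padding appears.
    """
    digest = hashlib.sha1(der_public_key).digest()
    label = base64.b32encode(digest[:10]).decode("ascii").lower()
    return label + ".onion"


if __name__ == "__main__":
    # Placeholder bytes standing in for a real DER-encoded RSA public key.
    fake_key = b"placeholder bytes, not a real RSA public key"
    print(onion_address_from_public_key(fake_key))
```

Because the name is just a hash of the service's own key, nobody needs to register or hand out addresses, and a client that knows the address can check that it is talking to the holder of the matching key.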
There is no central repository nor registry of addresses. The dot onion address is both the name and routing address for the services hosted at the dot onion. The Tor network uses the .onion-address to direct requests to the hidden server and route back the data from the hidden server to the anonymous user. The design of the Tor network ensures that the user can not know where the server is located and the server can not find out the IP-address of the user, except by intentional malicious means like hidden tracking code embedded in the web pages delivered by the server. Additionally, the design of the Tor network, which is run by thousands of volunteers, ensures that it is impossible to censor or block certain .onion-addresses.
them if we can.
As of now, one of several hidden service hosting companies appears to be down. There are lots of rumors and speculation as to what's happened. We're reading the same news and threads you are and don't have any insider information. We'll keep you updated as details become available.
EDIT: See our next blog post for more details about the attack.