Traffic correlation using netflows

People are starting to ask us about a recent tech report from Sambuddho's group about how an attacker with access to many routers around the Internet could gather the netflow logs from these routers and match up Tor flows. It's great to see more research on traffic correlation attacks, especially on attacks that don't need to see the whole flow on each side. But it's also important to realize that traffic correlation attacks are not a new area.

This blog post aims to give you some background to get you up to speed on the topic.

First, you should read the first few paragraphs of the One cell is enough to break Tor's anonymity analysis:

First, remember the basics of how Tor provides anonymity. Tor clients route their traffic over several (usually three) relays, with the goal that no single relay gets to learn both where the user is (call her Alice) and what site she's reaching (call it Bob).

The Tor design doesn't try to protect against an attacker who can see or measure both traffic going into the Tor network and also traffic coming out of the Tor network. That's because if you can see both flows, some simple statistics let you decide whether they match up.

Because we aim to let people browse the web, we can't afford the extra overhead and hours of additional delay that are used in high-latency mix networks like Mixmaster or Mixminion to slow this attack. That's why Tor's security is all about trying to decrease the chances that an adversary will end up in the right positions to see the traffic flows.

The way we generally explain it is that Tor tries to protect against traffic analysis, where an attacker tries to learn whom to investigate, but Tor can't protect against traffic confirmation (also known as end-to-end correlation), where an attacker tries to confirm a hypothesis by monitoring the right locations in the network and then doing the math.

And the math is really effective. There are simple packet counting attacks (Passive Attack Analysis for Connection-Based Anonymity Systems) and moving window averages (Timing Attacks in Low-Latency Mix-Based Systems), but the more recent stuff is downright scary, like Steven Murdoch's PET 2007 paper about achieving high confidence in a correlation attack despite seeing only 1 in 2000 packets on each side (Sampled Traffic Analysis by Internet-Exchange-Level Adversaries).
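To give a flavor of what "doing the math" can look like, here is a deliberately naive sketch (it is not the algorithm from any of the papers above): an observer who sees both ends bins each flow's packets into time windows and correlates the resulting count series.

```python
# Toy traffic-confirmation sketch: correlate per-window packet counts of a
# flow entering the network with candidate flows leaving it.
# This is NOT any of the published attacks, just an illustration.
from statistics import mean

def packet_counts(timestamps, window=1.0):
    """Bin packet timestamps (seconds) into fixed-size time windows."""
    if not timestamps:
        return []
    start = min(timestamps)
    n_bins = int((max(timestamps) - start) // window) + 1
    counts = [0] * n_bins
    for t in timestamps:
        counts[int((t - start) // window)] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation of two count series (truncated to equal length)."""
    n = min(len(xs), len(ys))
    if n == 0:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def best_match(entry_flow, exit_flows, window=1.0):
    """Return (score, index) of the exit flow that best matches the entry flow."""
    entry = packet_counts(entry_flow, window)
    return max((pearson(entry, packet_counts(f, window)), i)
               for i, f in enumerate(exit_flows))
```

The published attacks are far more sophisticated (and need far less data, as in the sampled-traffic paper above), but even a toy version like this illustrates why seeing both ends of a flow is so dangerous.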

Second, there's some further discussion about the efficacy of traffic correlation attacks at scale in the Improving Tor's anonymity by changing guard parameters analysis:

Tariq's paper makes two simplifying assumptions when calling an attack successful [...] 2) He assumes that the end-to-end correlation attack (matching up the incoming flow to the outgoing flow) is instantaneous and perfect. [...] The second one ("how successful is the correlation attack at scale?" or maybe better, "how do the false positives in the correlation attack compare to the false negatives?") remains an open research question.

Researchers generally agree that given a handful of traffic flows, it's easy to match them up. But what about the millions of traffic flows we have now? What levels of false positives (algorithm says "match!" when it's wrong) are acceptable to this attacker? Are there some simple, not too burdensome, tricks we can do to drive up the false positives rates, even if we all agree that those tricks wouldn't work in the "just looking at a handful of flows" case?

More precisely, it's possible that correlation attacks don't scale well because as the number of Tor clients grows, the chance that the exit stream actually came from a different Tor client (not the one you're watching) grows. So the confidence in your match needs to grow along with that or your false positive rate will explode. The people who say that correlation attacks don't scale use phrases like "say your correlation attack is 99.9% accurate" when arguing it. The folks who think it does scale use phrases like "I can easily make my correlation attack arbitrarily accurate." My hope is that the reality is somewhere in between — correlation attacks in the current Tor network can probably be made plenty accurate, but perhaps with some simple design changes we can improve the situation.

The discussion of false positives is key to this new paper too: Sambuddho's paper mentions a false positive rate of 6%. That sounds like it means if you see a traffic flow at one side of the Tor network, and you have a set of 100000 flows on the other side and you're trying to find the match, then 6000 of those flows will look like a match. It's easy to see how at scale, this "base rate fallacy" problem could make the attack effectively useless.

And that high false positive rate is not at all surprising, since he is trying to capture only a summary of the flows at each side and then do the correlation using only those summaries. It would be neat (in a theoretical sense) to learn that it works, but it seems to me that there's a lot of work left here in showing that it would work in practice. It also seems likely that his definition of false positive rate and my use of it above don't line up completely: it would be great if somebody here could work on reconciling them.
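To put numbers on the base-rate concern above (the 6% is the paper's figure; the detection rate and the size of the candidate set are purely illustrative assumptions):

```python
# Back-of-the-envelope base-rate arithmetic with illustrative numbers.
false_positive_rate = 0.06   # 6%, the figure reported in the paper
true_positive_rate = 0.8     # hypothetical detection rate, for illustration
candidate_flows = 100_000    # flows on the other side of the network
true_matches = 1             # only one of them is the real match

expected_false_alarms = false_positive_rate * (candidate_flows - true_matches)
expected_hits = true_positive_rate * true_matches

# Precision: of all the flows the attack flags as a "match", how many are real?
precision = expected_hits / (expected_hits + expected_false_alarms)

print(f"flows flagged in error: {expected_false_alarms:.0f}")    # about 6000
print(f"precision of a 'match': {precision:.5f}")                # about 0.00013
```

That is, against a large candidate set, almost every flagged flow is a false alarm unless the attacker can drive the false positive rate way down.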

For a possibly related case where a series of academic research papers misunderstood the base rate fallacy and came to bad conclusions, see Mike's critique of website fingerprinting attacks plus the follow-up paper from CCS this year confirming that he's right.

I should also emphasize that whether this attack can be performed at all has to do with how much of the Internet the adversary is able to measure or control. This diversity question is a large and important one, with lots of attention already. See more discussion here.

In summary, it's great to see more research on traffic confirmation attacks, but a) traffic confirmation attacks are not a new area, so don't freak out without actually reading the papers, and b) this particular one, while kind of neat, doesn't supersede all the previous papers.

(I should put in an addendum here for the people who are wondering if everything they read on the Internet in a given week is surely all tied together: we don't have any reason to think that this attack, or one like it, is related to the recent arrests of a few dozen people around the world. So far, all indications are that those arrests are best explained by bad opsec for a few of them, and then those few pointed to the others when they were questioned.)

[Edit: be sure to read Sambuddho's comment below, too. -RD]

Anonymous

November 16, 2014

Permalink

Does this attack depend on injecting JavaScript into HTML pages? If so, why not just block all JavaScript, as you should when using Tor?

Anonymous

November 16, 2014

Permalink

Would it not be possible to artificially generate some sort of random microlatency between Tor entry and exit nodes, something that is imperceptible to users but increases the overall network noise and decreases the chances of traffic analysis matching?

Anonymous

November 16, 2014

Permalink

hi .. unrelated to this blog... but WHY DOES TOR NEED CORRECT TIME ON MY COMPUTER to RUN????

Calm down. Tor needs correct time so it can see how old the consensus (the list of Tor relay IPs and info related to those relays) is. This allows it to keep up to date with the current set of Tor relays, including which relays have been marked as "bad" so clients don't connect to them, and so on. It also allows Tor to check whether a relay's "certificate" has expired yet. It wouldn't be good if you connected to some relay pretending to be a relay from 2005 whose certificate it might have stolen!
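In other words, the client constantly checks freshness against its own clock. A minimal sketch of that idea (this is not Tor's actual code; the function names and skew tolerance are made up for the example):

```python
# Simplified illustration of why a correct clock matters: the client rejects
# directory information and relay certificates that its own clock says are
# stale or expired. Names and tolerances are invented for this example.
from datetime import datetime, timezone, timedelta

CLOCK_SKEW_TOLERANCE = timedelta(minutes=30)  # illustrative slack

def consensus_is_usable(valid_after: datetime, valid_until: datetime) -> bool:
    """Is the relay list fresh according to the local clock?"""
    now = datetime.now(timezone.utc)
    return (valid_after - CLOCK_SKEW_TOLERANCE) <= now <= (valid_until + CLOCK_SKEW_TOLERANCE)

def certificate_is_valid(not_before: datetime, not_after: datetime) -> bool:
    """Reject a relay certificate that has expired (or isn't valid yet)."""
    now = datetime.now(timezone.utc)
    return not_before <= now <= not_after

# If the local clock were years off, a long-expired certificate (say, a stolen
# one from 2005) could wrongly pass these checks -- which is why Tor insists
# on a roughly correct system time.
```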

Anonymous

November 16, 2014

Permalink

It's time to implement patterned/random packet timing. The impact on latency would be minimal and it would make Tor vastly more secure.

Anonymous

November 17, 2014

Permalink

By far, more articles have overstated the security of Tor than have understated it.
If they hadn't done so, by now there would be a demand for better onion routing solutions, along with the tons of funds that are needed to build them.

I don't follow this one -- there *is* a demand for better anonymity designs, but nobody has one (or more precisely, nobody has one that is convincingly better), and actually the 'tons of funds' are not easy to come by no matter your design.

It's also certainly the case that assessing the security of a design by reading mainstream newspaper articles about it is never going to get you where you want to be.

And finally, check out
http://freehaven.net/anonbib/
for many useful papers on anonymity designs.

Anonymous

November 17, 2014

Permalink

https://en.wikipedia.org/wiki/Traffic_analysis

Traffic flow security:

- causing the circuit to appear busy at all times or much of the time by sending dummy traffic

- sending a continuous encrypted signal, whether or not traffic is being transmitted. This is also called masking or link encryption.

It is difficult to defeat traffic analysis without both encrypting messages and masking the channel. When no actual messages are being sent, the channel can be masked by sending dummy traffic, similar to the encrypted traffic, thereby keeping bandwidth usage constant. "It is very hard to hide information about the size or timing of messages. The known solutions require Alice to send a continuous stream of messages at the maximum bandwidth she will ever use...This might be acceptable for military applications, but it is not for most civilian applications."

Don't forget the word __random__: not maximum but random, not constant but random. That can be quite acceptable for almost any application. And don't buy the word 'military'; in this context it just means standardized and approved. By the way, who said you have to use standards in the inter-relay network!?

Anonymous

November 18, 2014

Permalink

So the problem is that full obfuscation would require sending 100% of the maximum traffic per user/connection all the time. That is very inefficient and expensive. But, with bandwidth costs continually decreasing, won't there come a point in the future when this is practicable? Isn't that the future of Tor?
https://gigaom2.files.wordpress.com/2012/08/news20120802-1.gif

One problem is that you, as a user, would have to let your Tor traffic run at this maximum speed 24/7; if you didn't, then large adversaries could still record when your connection begins and match it to when a certain exit node starts a certain connection to a certain website. There will be a certain delay, and the exit node will have many connections running at the same time, but it is still information; at the least it may allow your adversary to increase its chances of guessing right. Or am I mistaken?

I will see your graph and raise you one:
https://metrics.torproject.org/bandwidth.html?graph=bandwidth&start=201…

So yes, bandwidth is getting cheaper, but also the number of users and load on the network will grow to fill whatever the Tor network has to offer.

Also, you should do the math on millions of users all using their full bandwidth all the time -- it adds up to look very grim indeed. :( Perhaps what we need are approximations (sometimes known by the phrase 'traffic shaping') that get some of the benefits with only some of the costs?
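For a sense of scale, here is that math spelled out with made-up but plausible numbers (they are assumptions, not measurements of the real network):

```python
# Rough cost of constant-rate cover traffic for every client.
# All numbers are illustrative assumptions, not measurements.
clients = 2_000_000            # order of magnitude of daily Tor users
padded_rate_mbps = 1.0         # each client padding at a constant 1 Mbit/s
hops = 3                       # every byte is relayed three times

client_side_gbps = clients * padded_rate_mbps / 1000
relay_side_gbps = client_side_gbps * hops

print(f"aggregate client traffic: {client_side_gbps:,.0f} Gbit/s")  # ~2,000
print(f"relay capacity needed:    {relay_side_gbps:,.0f} Gbit/s")   # ~6,000
```

The exact figures don't matter; the point is that constant-rate padding multiplies the load across every hop in the network.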

See
https://blog.torproject.org/blog/traffic-correlation-using-netflows#com…
for more discussion.

Anonymous

November 18, 2014

Permalink

We have to be careful when discussing traffic correlation attacks as a family of attacks as opposed to a specific attack implementation. Someone implements an attack and generalizes the success or failure of that particular one to the general attack family. But implementations will get better and eventually reach their theoretical limit.

It's not a given that an attack will have false positives. There are algorithms with a zero false positive rate -- not near zero, exactly zero. And the obvious implementation is real-time and distributed. The dilemma is that you can't discuss these things without giving attackers something to go after your friends with, but without a proof of concept researchers aren't convinced...

It seems to me that as the set of relays and flows scales up, achieving an exactly zero false positive rate gets harder and harder.

I mean, feel free to not discuss it, but I think our adversaries do a lot of their discussions in secret already, so our best bet is transparency and openness. Submit your analysis to the PETS symposium:
https://petsymposium.org/2015/
and get some experts to review it. The more we know the more prepared we are!

Anonymous

November 19, 2014

Permalink

How can I control which country my Tor traffic appears to come from? For example, I want others to see that I am browsing from country X. How can I do that?
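For what it's worth, Tor's torrc does have options that restrict which country your exit relays are in, at a real cost to your anonymity. A minimal example, with {xx} standing in for the two-letter country code you want:

```
# torrc snippet -- replace xx with the desired two-letter country code.
# Pinning exits to one country shrinks your anonymity set, so use with care.
ExitNodes {xx}
StrictNodes 1
```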

Anonymous

November 19, 2014

Permalink

Does this affect Tor users who don't visit hidden sites? I don't have a VPN right now and use Tor to hide my IP. I check emails and log into the everyday sites I visit. Nothing illegal. Well known websites.

You should consult your state agencies. You never know whether the sites you visit are legal or illegal until you ask them. Preferably, you should write down a list of the web pages you plan to visit in the next month and get it approved by some agency.

Anonymous

November 20, 2014

Permalink

I think it is a bit disconcerting that the people behind the Tor Project are not willing to accept that traffic correlation is a very important issue right now. The fact that the majority of users come from democratic countries, and that the majority of Tor nodes run in those democratic countries as well, but under governments that implement Orwellian surveillance programs to control traffic at all levels (rogue exit nodes, ISP logs, Internet router surveillance, cable tapping, etc.), ironically makes a person connecting to the Tor network from China more secure than someone connecting from the UK, because China cannot become a global adversary to a Tor user. In fact the Tor network could be, particularly for people living in a country belonging to the Five Eyes alliance for instance, more compromised than governments are willing to recognize or reveal through detentions and police sting operations. I agree that on Tor you need to launch selective attacks to reveal someone's identity, but it seems quite easy to acquire targets if you can profile network flows; in other words, if you know where to look.

I'm sorry if it looks like we didn't think traffic correlation is a big issue. It is.

But this paper that people are talking about doesn't move the field forward much.

And talking in generalities is not helpful either.

What we need are more good research papers to actually answer questions for us. In particular, I'd love to see a graph where the x axis is how much extra overhead (inefficiency, delay, padding) we add, and the y axis is how much protection we can get against traffic correlation attacks with that overhead. Currently we have zero (!) data points for that graph.

Much research remains before we know how to build a system that is safe against such attacks.

Anonymous

November 22, 2014

Permalink

The correlation could be reduced in the future. If Tor established two three-node connections (instead of one), correlation would be harder.

Additionally, Tor could send some dummy background traffic that gets dropped inside the Tor network (without ever reaching a destination).

Additionally, the Tor client (Tor Browser) could send random requests with random payloads to other servers over the same three-node connection, in the background of the main communication. That should hinder attackers.

Anonymous

November 23, 2014

Permalink

My question: as I understood it, the relays change every 10 minutes or so. How can you correlate communication after those 10 minutes if you didn't have the last-hop node?

Anonymous

November 23, 2014

Permalink

I'm from Australia and my ISP has blocked connections to the public Tor network.
It is legal to use Tor here in Australia, so why has my ISP blocked connections to the public Tor network?
I am just a normal, average, law-abiding citizen who wants to have privacy on the internet.
Also, can someone here give me a list of Tor bridges?

Which ISP is it? I don't know of any ISPs in Australia that are censoring Tor relays either by address or by DPI (and it would be good to learn if there are some).

It's more likely that you have some other problem going on, like your clock is wrong or you have some firewall or antivirus thing that's preventing Tor from reaching the network.

Anonymous

December 09, 2014

Permalink

Why doesn't Tor send random data all the time to random nodes to make the connections harder to correlate? They see a 200 kbps burst on your Tor node, and five seconds later see a 200 kbps burst coming from the exit node. So why not send a baseline amount of data all of the time, and when real data needs to exit, back off the fake data by the same amount?
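The idea in that last sentence (keep the total rate constant by swapping dummy cells for real ones) is essentially constant-rate link padding with backfill. A toy sketch of the bookkeeping, just to show the scheme and its cost (the per-interval budget is an arbitrary assumption, not a Tor design parameter):

```python
# Toy sketch of constant-rate padding with backfill: every interval the link
# carries exactly CELLS_PER_INTERVAL cells; real cells displace dummy cells
# one-for-one, so an observer sees the same flat rate either way.
import random

CELLS_PER_INTERVAL = 40   # fixed per-interval budget (arbitrary assumption)

def schedule_interval(real_queue):
    """Send up to the budget in real cells and pad the rest with dummies."""
    real_to_send = min(len(real_queue), CELLS_PER_INTERVAL)
    dummy_to_send = CELLS_PER_INTERVAL - real_to_send
    del real_queue[:real_to_send]          # these cells go out this interval
    return real_to_send, dummy_to_send

# Simulate a few intervals of bursty real traffic.
queue = []
for interval in range(5):
    queue.extend([b"cell"] * random.randint(0, 60))   # bursty arrivals
    real, dummy = schedule_interval(queue)
    print(f"interval {interval}: {real} real + {dummy} dummy = "
          f"{real + dummy} cells on the wire")
```

Note that whenever real traffic exceeds the fixed budget it queues up, which is exactly the latency-versus-overhead tradeoff discussed in the comments above.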

Anonymous

December 13, 2014

Permalink

How do I test Orbot for leaks? And I need a new safe link to download Orbot. The Orbot I get is fake, not from the Tor Project.