Traffic correlation using netflows

People are starting to ask us about a recent tech report from Sambuddho's group about how an attacker with access to many routers around the Internet could gather the netflow logs from these routers and match up Tor flows. It's great to see more research on traffic correlation attacks, especially on attacks that don't need to see the whole flow on each side. But it's also important to realize that traffic correlation attacks are not a new area.

This blog post aims to give you some background to get you up to speed on the topic.

First, you should read the first few paragraphs of the One cell is enough to break Tor's anonymity analysis:

First, remember the basics of how Tor provides anonymity. Tor clients route their traffic over several (usually three) relays, with the goal that no single relay gets to learn both where the user is (call her Alice) and what site she's reaching (call it Bob).

The Tor design doesn't try to protect against an attacker who can see or measure both traffic going into the Tor network and also traffic coming out of the Tor network. That's because if you can see both flows, some simple statistics let you decide whether they match up.

Because we aim to let people browse the web, we can't afford the extra overhead and hours of additional delay that are used in high-latency mix networks like Mixmaster or Mixminion to slow this attack. That's why Tor's security is all about trying to decrease the chances that an adversary will end up in the right positions to see the traffic flows.

The way we generally explain it is that Tor tries to protect against traffic analysis, where an attacker tries to learn whom to investigate, but Tor can't protect against traffic confirmation (also known as end-to-end correlation), where an attacker tries to confirm a hypothesis by monitoring the right locations in the network and then doing the math.

And the math is really effective. There are simple packet counting attacks (Passive Attack Analysis for Connection-Based Anonymity Systems) and moving window averages (Timing Attacks in Low-Latency Mix-Based Systems), but the more recent stuff is downright scary, like Steven Murdoch's PET 2007 paper about achieving high confidence in a correlation attack despite seeing only 1 in 2000 packets on each side (Sampled Traffic Analysis by Internet-Exchange-Level Adversaries).

Second, there's some further discussion about the efficacy of traffic correlation attacks at scale in the Improving Tor's anonymity by changing guard parameters analysis:

Tariq's paper makes two simplifying assumptions when calling an attack successful [...] 2) He assumes that the end-to-end correlation attack (matching up the incoming flow to the outgoing flow) is instantaneous and perfect. [...] The second one ("how successful is the correlation attack at scale?" or maybe better, "how do the false positives in the correlation attack compare to the false negatives?") remains an open research question.

Researchers generally agree that given a handful of traffic flows, it's easy to match them up. But what about the millions of traffic flows we have now? What levels of false positives (algorithm says "match!" when it's wrong) are acceptable to this attacker? Are there some simple, not too burdensome, tricks we can do to drive up the false positives rates, even if we all agree that those tricks wouldn't work in the "just looking at a handful of flows" case?

More precisely, it's possible that correlation attacks don't scale well because as the number of Tor clients grows, the chance that the exit stream actually came from a different Tor client (not the one you're watching) grows. So the confidence in your match needs to grow along with that or your false positive rate will explode. The people who say that correlation attacks don't scale use phrases like "say your correlation attack is 99.9% accurate" when arguing it. The folks who think it does scale use phrases like "I can easily make my correlation attack arbitrarily accurate." My hope is that the reality is somewhere in between — correlation attacks in the current Tor network can probably be made plenty accurate, but perhaps with some simple design changes we can improve the situation.

The discussion of false positives is key to this new paper too: Sambuddho's paper mentions a false positive rate of 6%. That sounds like it means if you see a traffic flow at one side of the Tor network, and you have a set of 100000 flows on the other side and you're trying to find the match, then 6000 of those flows will look like a match. It's easy to see how at scale, this "base rate fallacy" problem could make the attack effectively useless.

And that high false positive rate is not at all surprising, since he is trying to capture only a summary of the flows at each side and then do the correlation using only those summaries. It would be neat (in a theoretical sense) to learn that it works, but it seems to me that there's a lot of work left here in showing that it would work in practice. It also seems likely that his definition of false positive rate and my use of it above don't line up completely: it would be great if somebody here could work on reconciling them.

For a possibly related case where a series of academic research papers misunderstood the base rate fallacy and came to bad conclusions, see Mike's critique of website fingerprinting attacks plus the follow-up paper from CCS this year confirming that he's right.

I should also emphasize that whether this attack can be performed at all has to do with how much of the Internet the adversary is able to measure or control. This diversity question is a large and important one, with lots of attention already. See more discussion here.

In summary, it's great to see more research on traffic confirmation attacks, but a) traffic confirmation attacks are not a new area so don't freak out without actually reading the papers, and b) this particular one, while kind of neat, doesn't supercede all the previous papers.

(I should put in an addendum here for the people who are wondering if everything they read on the Internet in a given week is surely all tied together: we don't have any reason to think that this attack, or one like it, is related to the recent arrests of a few dozen people around the world. So far, all indications are that those arrests are best explained by bad opsec for a few of them, and then those few pointed to the others when they were questioned.)

[Edit: be sure to read Sambuddho's comment below, too. -RD]


November 16, 2014

In reply to by Anonymous (not verified)


Oh.....surprised.......Chinese tor user? C.C.P.??? China Gongan???? Are you investigating the Tor users?????


November 14, 2014


Glad this article has been put up now. There's been a lot of sh*t-stirring and exaggerating done by the media/journalists regarding Tor these past few days.
There's also been a bit of worry over the Tor Projects silence on social media and blogs.

Hopefully everyone will calm down, and stop disrupting things.


November 14, 2014


I am here to myself clarify all misconceptions. Firslty, they have blow it a bit out of proportion by saying that "81% of Tor traffic", which is not true. It was only 81.4% of our experiments, and we have spoken about this upfront in our paper. Secondly, its only a case of experimental validation and the challenges involved in it that is the highlight of the paper. In my thesis I have also tried to address how to solve this particular attack, which might work for other attacks as well...


TOR allows you to safely surf the web? This is total bullshit, and others still believe in him. This project has already collapsed, which can be heard once more as it is broken by governments, and other common Internet robbers who do not have more knowledge. He was a good 10 years ago, not now, its mechanisms do not give advice, often detected only after the publication of defects are patched. And what do the developers? They sleep? They wait until others discover something, because they do not want to, and advertise it so that it is not wonderful - crap. Good advertising for the naive who think that the firing of the TOR and will be anonymous. There is no longer in the dictionary concept of anonymity, yes she was 30 years ago and then ended. Those who propagate the word anonymity - have to move in time, as something they messed up, and that's good.

There are certainly many people who agree with you, especially recently (but mostly because more people know about Tor now than the did in the recent-past). Tor's goals and mission have not changed, so if you, or anyone you know, can help improve the security and trust in Tor, please do. We're an open community, please join and propose improvement (and help implement them, if you can!).

Spreading discontent and fear hurts our users and the people who need this technology the most.

Hi Sambuddho,

Thanks for the note. Indeed it is unfortunate when some journalist grabs at a new paper, misinterprets it, and uses it to produce more ad revenue or whatever it is journalists prioritize these days.

At the same time, it also shows how far we have to go in terms of teaching the general public about how anonymity works -- or heck, how science works.

For any (other) authors of anonymity research papers who are reading this: please consider doing a guest blog post with us to explain your work. We've had good experiences doing that in the past, e.g.………
but there are a lot more interesting research papers out there!

Thanks Roger, for being supportive. Please do understand, that:
a) we have already published this work in PAM 2014 (earlier this year) and its nothing really new. Its an extension or experimental validation of Zelinski and Murdoch's 2007 paper an
b) I do fully support Tor, both ethically as well as through my research. Most of my present research activities look at ways to improve anonymity and privacy.

If at all you see an attack paper, its intention is like several other attacks papers: exploring possible vulerabilities merely as through academic exercise. Truly people misrinterpret the results by giving a cursory glance to the paper and start to spread fear which is absolutely unnaceptable.


November 14, 2014


' So far, all indications are that those arrests are best explained by bad opsec for a few of them, and then those few pointed to the others when they were questioned.'

I was arrested for 'Downloading indecent images of children'. I obviously haven't committed such crimes. But as of now the only evidence I got to see was 'A tor user has downloaded 30 images of children from the tor network'.

My OPSEC was relatively good, never revealed information about myself.

But just to note, unless they had grounds to believe I was using the Tor network then how would they have known I used it for such activity, Even though I haven't.

Obviously I won't be charged because I'm innocent but this somewhat causes privacy issues just because the 'tor devs' don't have any idea at all what has happened.

Sounds like the police in your area really misunderstand Tor.

It might be wise to get somebody to go talk to them about what Tor is and what it's for, who uses it and why, etc. Otherwise they will end up harassing somebody else next.

See e.g.…

Maybe you could contact us in a more private way, to let us know which jurisdiction and make some introductions?


November 14, 2014


Oh, one more thing to add on-top of my previous comment. The dates match the time of the one-cell relay confirmation attack.

They said from 'Jan 30th'.

Wont reveal anymore details because even typing this is bad OPSec

Jan 30th 2014. Also the only thing which partly makes sense is the fact I was using Tor chat.

So in theory its possible to get my IP from the guard of the one-cell relay correlation attack. But then it gets complicated...

Would like to know how they could prove I was 'downloading 30 images between Feb 2014- June 2014.

They can't, certainly not from that particular attack which was HS dirs. Did you ever use p2p? Did you ever use tor on clearnet sites or hidden services? Tor2web?

It's possible another version of the attack or another agency was using similar techniques for some time before the flaw was discovered.

Or perhaps they just picked random tor users in the hope they have something incriminating.

also buggy and unmaintained since 2012.

Perhaps he had a similar user ID to someone else / someone added him to a buddy list / tried to send him files or maybe the feds were simply 'fishing' and exploited torchat in some way.

If that's really all he used tor for I'm at a loss.

Torchat’s security is unknown. It creates a hidden service on your computer leaving you vulnerable to deanonymization attacks that apply to all hidden services. It also seems to be a very basic protocol that looks like netcat over Tor. There is no way to decline a file transfer. It automatically starts the transfer, writing the file to /tmp which is a RAM-mounted tmpfs on Linux. Then you are supposed to save the file somewhere. Theoretically an attacker could transfer /dev/urandom while you are away from your computer until it fills up your RAM and crashes your computer. This would be great for inducing intersection attacks.

Another thing is that once someone learns your Torchat ID there is no way to prevent them from knowing you are online, even if you remove them from your buddy list. The reason is because your Torchat instance is a hidden service that publishes a normal hidden service descriptor which anyone can download. There’s no way to stop that. So you should be very conservative about handing out your Torchat ID and only give it to extremely trusted associates.

Relay early attack? That particular attack only proved you looked up a particular hidden service, not what you did on it or if you visited it.

From my understanding it also was able to find the location of hidden services. Since torchat uses hidden services it might have been possible.

But that doesn't prove that the adversaries would know what exactly is being sent over torchat.

Torchat runs it's own hidden service on your machine. Normally a hidden service keeps it's guards for some months, but you don't run torchat all the time so it will be changing guards very quickly. This is bad.


November 15, 2014


I observe that in the experiments they performed using multiple relays, the true positive rate dropped from 71/90 (=79%, not entirely sure of the origin of 81.4%) to 14/24 (=58%), while false positives grew from 6/90 (=6.7%) to 3/24 (=12.5%), so how well this attack actually scales is not clear.

Yes, I agree.

And to translate for the non-scientists here, "is not clear" means "sure doesn't look like it'll work the way people seem to think it will".

I had a guy running Tor unbeknownst To me. He was using a vpn also. He was a renter and a hacker. My little e2500 linksys log picked him up when i enabled it. I think you are underestimating how truly identifiable you can be.


November 15, 2014


Since ScrambleSuit is capable of changing its network fingerprint (packet length distribution, inter-arrival times, etc.) why not make every relay and client ScrambleSuit by default?

The performance impact of doing such at thing would be extreme for several reasons.

The first is the extra CPU overhead that would be incurred by the protocol itself (HMAC-SHA256 and CTR-AES128 aren't exactly cheap).

The second is that even assuming CPU was unlimited, that would gut network throughput for the busy relays. The ScrambleSuit inter-arrival time obfuscation (which is disabled for precisely this reason), puts a hard limit on the per-TCP connection throughput of ~278 KiB/s (1427 bytes of payload per frame, 200 frames/sec with a 5 ms average delay). Since each TCP connection can carry any number of different circuits worth of traffic, that's not enough bandwidth. Additionally HOL blocking problems in the presence of congestion would be made considerably worse with the bandwidth required for the extra padding traffic.

Both of the network related performance problems could potentially be addressed by having one TCP connection per circuit, but that has both performance and anonymity implications as well so it isn't a no-brainer.

The ScrambleSuit (and also obfs4) stream length and delay obfuscation is designed vs a different set of adversaries as well, so I'm not sure how well it would hold up vs certain active attacks.

max input rate == max output rate, adding internal buffers you get max time delay as buffer size/max rate. in short with right programming and 1Gbps links for max 1 sec delay 128MB memory is enough. it is a common technic for obfuscating physical placement.

So, it's not a memory thing or a buffer size thing at all, and I'm not sure if I just did a bad job of explaining it, so I will make one attempt to further clarify. Examples will use ScrambleSuit but they apply to obfs4 as well.

The way the inter-arrival time obfuscation works is that after each write is done, a variable delay is added between 0 to 10 ms (5 ms on average over a infinite sample size, though this is a gross oversimplification since the way the delay is sampled is a bit more complicated).

Each write is 1 (variable length) ScrambleSuit frame, which can carry up to 1427 bytes of payload. So assuming bulk data is being transmitted, your data rate is limited to 1427 bytes per 5 ms. In some cases there is more delay, in some cases there is less, but the long term average will converge to around 278 KiB/s (1427 * (1000 / 5) / 1024).

This is not a matter of "the right programming", how much bandwidth the link actually has (as long as it's "sufficiently large"), or of bufffer sizes, because the artificial delay added between writes to obfuscate the flow signature limits the maximum theoretical output rate, since you can only drain 1427 bytes from the buffer every 5 ms.

So choose the distribution according to the relay's configured bandwidth limit, e.g. a mean inter-frame delay of 50 microseconds would give a throughput of 20,000 frames per second.

Can't sleep for less than a millisecond on some platform or other? Then sleep for 0 ms with probability 0.95 or 1 ms with probability 0.05.

The lower you make the inter-frame delay the less obfuscation is provided since it's easier to filter out the changes made to the profile of the flow.

Anyway, making every relay PT aware isn't something I'd consider minimal effort by any stretch of the imagination, and trying to retrofit in defenses for this sort of thing in that manner seems questionable, especially using protocols that were designed for other things in mind ("obfuscation good enough to get past DPI boxes").

A better approach to this would be to figure out a clever way to use the PADDING/VPADDING cell types and do this at the Tor wire protocol level, but figuring out what "clever" things to do without massively gutting performance is a open research question.


November 15, 2014


it is years the german police is exploiting people's routers, yet in another way via javascript, to deanonymize Tor users. When the exploit takes place the deanonymization has a 100% success rate.

Freenet isnt a suitable replacement for Tor's darknet functions, the attacks outlined here, basically sybil attacks on the outlier of the network, are still applicable if an adversary gets two nodes connecting to yours where it can observe traffic going in and out and compare the difference.

When it comes to a global adversary who can monitor the traffic of all nodes in a network theres no network currently that can withstand this sort of traffic analysis, hopefully someone will develop something that will solve this problem and we can have true anonymity.


November 15, 2014


There was a project that was started recently, i dont recall the name, but it was a high latency network that moved data once every hour in a uniform manner so that not even a global adversary could tell who requested or input what into the network. I dont think it has been launched, it was really more of a nuts and bolts github project. Anyone know the name of it?

(Its not freenet, sybil attacks are still possible since freenet does not work in a uniform manner, even if it is high latency. )

Anyways this is what we really need, at least for a true anonymous darknet to exist. Tor will always have its place as a pseudo-anonymous whack-a-mole style way to access the clearnet, but when it comes to communications where you 100% have to know no one else is listening in, all traffic has to move uniformly.

Nope, it was an actual anonymity network not just a specific function, it was called something weird like coineffine, but nothing shows up in the search engine.


November 15, 2014


What if Tor did padding and sent something continiously? If you set the bandwidth limit to 100 KB/s then Tor could send random encrypted traffic at 100 KB/s continiously. That way it would not be possible to see when something is actually in the stream.

I would also like Tor to see Tor wrap it's traffic in something like encrypted BitTorrent so that anyone who just looks at the stream can not tell it apart form encrypted BitTorrent (which is a much more common thing on the Internet than https traffic).