A technical summary of the Usenix fingerprinting paper

by arma | July 31, 2015

Albert Kwon, Mashael AlSabah, and others have a paper entitled Circuit Fingerprinting Attacks: Passive Deanonymization of Tor Hidden Services at the upcoming Usenix Security symposium in a few weeks. Articles describing the paper are making the rounds currently, so I'm posting a technical summary here, along with explanations of the next research questions that would be good to answer. (I originally wrote this summary for Dan Goodin for his article at Ars Technica.) Also for context, remember that this is another research paper in the great set of literature around anonymous communication systems—you can read many more at http://freehaven.net/anonbib/.

"This is a well-written paper. I enjoyed reading it, and I'm glad the researchers are continuing to work in this space.

First, for background, run (don't walk) to Mike Perry's blog post explaining why website fingerprinting papers have historically overestimated the risks for users:
https://blog.torproject.org/blog/critique-website-traffic-fingerprintin…
and then check out Marc Juarez et al.'s followup paper from last year's ACM CCS that backs up many of Mike's concerns:
http://freehaven.net/anonbib/#ccs2014-critical

To recap, this new paper describes three phases. In the first phase, they hope to get lucky and end up operating the entry guard for the Tor user they're trying to target. In the second phase, the target user loads some web page using Tor, and they use a classifier to guess whether the web page was in onion-space or not. Lastly, if the first classifier said "yes it was", they use a separate classifier to guess which onion site it was.

The first big question comes in phase three: is their website fingerprinting classifier actually accurate in practice? They consider a world of 1000 front pages, but ahmia.fi and other onion-space crawlers have found millions of pages by looking beyond front pages. Their 2.9% false positive rate becomes enormous in the face of this many pages—and the result is that the vast majority of the classification guesses will be mistakes.

For example, if the user loads ten pages, and the classifier outputs a guess for each web page she loads, will it output a stream of "She went to Facebook!" "She went to Riseup!" "She went to Wildleaks!" while actually she was just reading posts in a Bitcoin forum the whole time? Maybe they can design a classifier that works well when faced with many more web pages, but the paper doesn't show one, and Marc Juarez's paper argues convincingly that it's hard to do.

The second big question is whether adding a few padding cells would fool their "is this a connection to an onion service" classifier. We haven't tried to hide that in the current Tor protocol, and the paper presents what looks like a great classifier. It's not surprising that their classifier basically stops working in the face of more padding though: classifiers are notoriously brittle when you change the situation on them. So the next research step is to find out if it's easy or hard to design a classifier that isn't fooled by padding.

I look forward to continued attention by the research community to work toward answers to these two questions. I think it would be especially fruitful to look also at the true positive and false positive rates of both classifiers together, which might show more clearly (or not) that a small change in the first classifier has a big impact on foiling the second classifier. That is, if we can make it even a little bit more likely that the "is it an onion site" classifier guesses wrong, we could make the job of the website fingerprinting classifier much harder because it has to consider the billions of pages on the rest of the web too."
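The point about looking at both classifiers together can be illustrated with a small calculation. This is only a sketch under stated assumptions: the function name, the traffic shares, the first-stage rates, and the 0.88 true positive rate are illustrative placeholders, and only the 2.9% false positive rate comes from the summary above. It also assumes the second classifier has the same false positive rate on clearnet pages that leak through, which is probably optimistic for the attacker.

```python
def two_stage_precision(onion_share, site_share, tpr1, fpr1, tpr2, fpr2):
    """Fraction of final "she visited the monitored site" guesses that are right.

    onion_share: fraction of page loads that go to onion services (assumed)
    site_share:  fraction of onion loads that go to the monitored site (assumed)
    tpr1, fpr1:  rates of the "is this an onion service" classifier
    tpr2, fpr2:  rates of the website fingerprinting classifier
    """
    # Loads of the monitored site that pass both classifiers.
    true_pos = onion_share * site_share * tpr1 * tpr2
    # Loads of other onion pages that stage two mislabels.
    false_pos_onion = onion_share * (1.0 - site_share) * tpr1 * fpr2
    # Clearnet loads that stage one lets through and stage two mislabels.
    false_pos_clearnet = (1.0 - onion_share) * fpr1 * fpr2
    return true_pos / (true_pos + false_pos_onion + false_pos_clearnet)


if __name__ == "__main__":
    for fpr1 in (0.0, 0.01, 0.05):
        p = two_stage_precision(
            onion_share=0.01, site_share=0.001,
            tpr1=0.99, fpr1=fpr1, tpr2=0.88, fpr2=0.029,
        )
        print(f"stage-one FPR {fpr1:.2f} -> precision {p:.4f}")
```

Under these made-up shares, even a few percent of stage-one false positives lets clearnet traffic dominate the errors of the second stage.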

Comments

Please note that the comment area below has been archived.

The Tor development team has been against adding any sort of padding to the Tor stream despite massive gains in anonymity against global surveillance states. Why would it start now?

For instance, it was originally a problem that packet sizes created an easy-to-follow fingerprint, so they standardized the packet sizes. Then it became a problem that the numbers and timings of packets were also an easy-to-follow fingerprint, but when they addressed the problem they gave the cop-out excuse that adding junk packets that relays could drop, and making relays spit out packets at a uniform tempo, would slow down the network. So now we have a lightning-fast network that is utterly useless against 5-eyes and all of the cronies it likes to feed information to.

The real enemies of personal freedom are exactly those the Tor developers refuse to design their system against, and why can't we have a fully resilient network? It would take a few extra seconds to load a page. Right.

One of the main reasons that Tor does not move fake traffic through its network is that no one has yet figured out how much fake traffic is needed to solve the problem, if it can actually be solved by this technique. Tor uses fixed-sized cells with low-latency onion routing because that approach achieves its security goals with methodology that is simple and well understood. Adding fake traffic to the network is not, however.

Please see the FAQ entry that covers this concern: https://www.torproject.org/docs/faq#SendPadding

is there anybody who does not understand what the signal/noise ratio is? get old clever books and try to guess.
if you add 1% noise will snr be lower?
how expensive is it to extract anything about the signal from the 100% loaded link with the noise like transmission?
to begin with why not add the (experimental) parameter 'Noise' tunable by user?
this noise can be used on the first link to the entry tor router. Make it now and then talk about "how much fake traffic is needed" between tor routers!

> how expensive is it to extract anything about the signal from the 100% loaded link with the noise like transmission?

Trivial. That sort of thing leaks load information via clock skew, which is one of the reasons why the BuFLO website fingerprinting defense exists merely as a theoretical construct and isn't practical to deploy.

Additionally an accurate estimate of link capacity is required to get good network utilization when using this sort of static pad-to-max-capacity scheme, which is a fairly difficult problem to solve.

> this noise can be used on the first link to the entry tor router

This already exists in the form of ScrambleSuit/obfs4's burst obfuscation (Warning: Not designed for this sort of use, but they do add padding), and obfsproxy-wfpadtools/basket's website fingerprinting defenses.

The most effective defense I am aware of is CS-BuFLO, and that incurs a ~3x bandwidth increase if you use application level hinting, and ~6-10x if you do not.

This also won't do anything vs the attack presented in the USENIX paper because the paper assumes the guard is malicious (so the padding must extend a minimum of 2 hops).

The reason link layer padding is not done isn't because we don't like the general idea. It is because no one has come up with an algorithm that won't cause the Tor network to implode from the load, while providing good security/anonymity properties. If we didn't like the idea, we wouldn't be working with researchers on coming up with a suitable algorithm (eg: We did a GSOC project, I wrote basket, we work with researchers doing website fingerprinting/defenses).

Why not, instead of bloating the network with fake traffic, make it so that all packets move synchronously? It's a timing issue, right: clients spit out a stream of packets that cluster together to form a signature, so regulate the flow of packets. When the 'number-of-packets' element of the signature gets to the first relay it will be combined with other streams coming into that relay at that moment in time, forming a mux of all streams that get spit out into the next relay. Even if the client is the only one using that particular entry relay at that moment, it will eventually mux with the 2nd and 3rd relays, which chances are handling more than one stream. Is there more to the problem than timing? This seems like a trivial fix.

What you are describing is a mixnet or an early-generation onion router. Mixnets have their places, but when you start delaying packets in order to mix them and removing timing information, users start complaining and leave, which means that there are fewer users in the mix and thus less indistinguishability for everyone involved. It's commonly said that there are more researchers writing about mixnets than there are mixnet users.

Not that simple.

a) Constant rate is bad because it leaks information regarding CPU load. But this is a solvable problem (and the "correct" thing to do is sort of known, is easy to implement, and I've done so in the past).

b) How do you determine the rate? You need to maintain an accurate estimate of the bandwidth and latency, or you end up either causing your connection to collapse (too aggressive), or you under-utilize the link capacity. There are a lot of Tor users that are limited to ~128 kbit/s (Max residential line speed in Iran).

c) What behavior should each side take when the time has come to send traffic, but there is no traffic to send? If you do nothing, the classifier presented in the paper can be trivially adapted to handle this sort of defense. If you opt to send cover, how much cover do you send, for how long?

Basically, to defend against an evil client guard, some sort of active defense must be run to at least the middle relay. The best known algorithm (CS-BuFLO, see http://arxiv.org/abs/1401.6022) is prohibitively expensive, even with application side support, so I see this as an open research area (that people in academia are actively working on, one being my former GSOC student).

The moment anyone comes up with an answer to this problem that realistically defends against this sort of attack, that won't kill the Tor network, won't make things worse, and won't totally hose people on really bad internet links (who arguably really need Tor), I will be ecstatic, because this is something that needs to be fixed.

Note: Defending against an evil guard as a HS is significantly harder, since the HS Guard can trivially mount a confirmation attack (This is the equivalent of sitting outside someone's house, with wire cutters on the telephone line, and seeing if they drop offline when you cut the cord.).
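To make points a) through c) above concrete, here is a toy sketch of a constant-rate sender that emits exactly one fixed-size cell per tick, using a padding cell whenever no real cell is queued. It is not CS-BuFLO or any deployed Tor defense; the function names and the tick-based model are hypothetical, and it ignores the rate estimation and stopping-time questions the comment raises.

```python
def constant_rate_schedule(arrivals, ticks):
    """Simulate one cell sent per tick, real if queued, padding otherwise.

    arrivals: dict mapping tick number -> count of real cells arriving then
    ticks:    how many ticks the sender keeps transmitting (assumed fixed)
    Returns (sent, backlog): the per-tick cell kinds and unsent real cells.
    """
    backlog = 0
    sent = []
    for tick in range(ticks):
        backlog += arrivals.get(tick, 0)
        if backlog > 0:
            backlog -= 1
            sent.append("real")
        else:
            # Nothing to send: emit cover so the observer sees no gap.
            sent.append("pad")
    return sent, backlog


def overhead(sent):
    """Total cells sent divided by real cells sent (inf if none were real)."""
    real = sent.count("real")
    return float("inf") if real == 0 else len(sent) / real


if __name__ == "__main__":
    sent, backlog = constant_rate_schedule({0: 3, 5: 1}, ticks=8)
    print(sent, "backlog:", backlog, "overhead:", overhead(sent))
```

An observer of this link sees the same cell stream regardless of the burst pattern, but the cost shows up as overhead when the tick rate is too high and as backlog (added latency) when it is too low.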

Not the most eloquently said but the point is valid.

This is why Tor shouldn't rely on central servers. If they relied on their users the same way I2P does, it would be much easier to add padding.

But this would basically only work the same way as I2P (becoming a node), with an option, stating the risks, if you want to become the last node in a circuit too. And since Tor uses TCP only, maybe it will become very complex.

People can then help the network with padding and still use central server(s) and/or peers to connect to clearnet sites. This might help with plausible deniability for people who run nodes too.

I for one don't ever mind becoming an entry node for any user at all but an exit node is an issue due to the legal issues.

I think it's a good concept, but I don't know if it's possible with Tor and the language Tor uses. I always thought I2P and Tor should talk together and think about the issues and advantages of both software. If they both came together we certainly would have a better anonymity world.

Forcing every user to become a node would be a significant burden on those with limited bandwidth and would make the bridges implementation even more complex. If you can afford to let your government/ISP/etc know you're using Tor that last point isn't an issue, but not everyone lives in such an accommodating place.
Also, some websites block ALL tor nodes, not just exit nodes. And if the government really wants to harass the network, it wouldn't be hard to find a way to cause legal difficulties to even non-exit nodes.

"Forcing every user to become a node would be a significant burden on those with limited bandwidth ... "

That's why not everyone should automatically contribute, but if someone has decent bandwidth, they should automatically contribute to the Tor Network. Also, it should not be using all your bandwidth but some of it.

" ... and would make the bridges implementation even more complex."

Why is that? You don't have to be a bridge in order to contribute, a middle relay is enough.

" If you can afford to let your government/ISP/etc know you're using tor that last point isn't an issue, but not everyone lives in such an accommodating place. "

Okay, so maybe automatically isn't the best solution, but making it easier to contribute is what is needed. I mean, there's no option at all in Tor Browser to easily contribute. With the removal of Vidalia, the option to contribute is now more or less gone for most people.

The current centralized Tor implementation doesn't allow for holding back the full list of entry and middle relays from the public. However, if we move to an I2P-like decentralized topology I'm not sure that limitation stands. Furthermore, the number of user IPs, unlike the number of nodes, is a few orders of magnitude larger, so I'm not sure large providers would be so easily inclined toward banning a significant portion of internet users.

> The current centralized Tor implementation doesn't allow for holding back the full list of entry and middle relays from the public.

Every single client should have a full view of the network so that it is confident that path selection is done correctly. Otherwise malicious actors can feed it a partial view of the network (or a different set of nodes entirely). This is more applicable to Guard and Exit selection than middle, naturally, since those are the two logical points to mount certain attacks from.

On another note, NAT implementations in consumer grade routers are absolutely garbage (with a few especially horrific examples), and running relays behind certain devices is not recommended (eg: Routers that crash when more than 100 or so simultaneous TCP connections are established).

Is it safe to modify "layout.css.devPixelsPerPx" in about:config, or would this hurt the anonymity of TBB?

No, it's not safe, it changes your screen resolution.

Though in Tor Browser 5.x and later, screen resolution granularity is reduced, so you can't be tracked that much more with "layout.css.devPixelsPerPx" enabled.

If you need to set "layout.css.devPixelsPerPx" for better accessibility, download the Tor Browser alpha version from here https://www.torproject.org/projects/torbrowser.html.en#downloads-alpha which already has the less granular screen resolution.

> It's not surprising that their classifier basically stops working in the face of more padding though: classifiers are notoriously brittle when you change the situation on them. So the next research step is to find out if it's easy or hard to design a classifier that isn't fooled by padding.

Tor always sounds like apologists for their own project and this is a perfect example.
They know something will break this classifier (and other exploits), but instead of doing anything about it, they sit back and say "Well we have to research this, and research that, and research some other thing, and then we have to go find funding and do some more research... you know, because we have to be sure it will work in 100% of all use cases and attacks before we implement it. And on and on and on..."

Honestly, I think Tor is simply too far gone down its own singular path and groupthink, so that as a design model it's fossilized in its own stone age and the only work being done is on making it slicker... obfs, control, stats, crypto tweaks, some HS love, etc blah blah. That's good and there's lots of great work to do there. And there's actually nothing wrong with that, it's like any other project really.

You just have to realize that if you want the [inherent] weaknesses in Tor fixed, you're going to have to create, or move to, another anonymous overlay network project to do it. And stand firm against their monolith of research if you think you have research that shows your way will work just as well or better than Tor for some particular purposes.

There are so many good ideas out there about how to do new anonymous networks, it's time people break out of the 15 plus years of Tor's fat gravity well and start implementing new things.

So just sink the whole ship because there are a few leaks? Sounds reasonable. You got another far superior network you're hiding under your mattress?

How about you point to one single anonymity system that doesn't completely hinge on the user not picking a hostile entry node. Oh that's right, they all do, which is what almost every attack against Tor completely hinges on.

If there is anything the Tor project needs to be focusing on, it is a better way of validating entry nodes other than "if you stick around long enough you are trusted". That would eliminate 90% of its problems.

Hostile entry nodes (nodes in general, sybil, etc) and network interference are active attacks. They are difficult to defeat unless you are splitting your packets across nodes such that one node (or m:n) can't see enough to add up and correlate your traffic (but you can still be timed as you act, see below). Tor doesn't do this.
Passive attacks are correlation based on observations. Do not underestimate this. With the NSA's fangs on literally every fiber optic cable and in every exchange they can weasel into, they are a GPA (global passive adversary). Such attacks are difficult to defeat unless you are filling the network with fill traffic such that they can't tell wheat from chaff. Tor doesn't do this.

Okay, so why don't you go off and create a new anonymous network? Some people have; it's not like Tor is the only one around. Of course most of the ones I know about (I2P, gnunet, freenet) haven't really focused on connecting to the clearnet...
It's easy to complain about something or how it needs feature X, but implementing something is much harder. Implementing something that works is even harder than that. The reason why Tor is still around, let alone the most popular onion routing network, is that no one else has managed to create software that handles the same (or even most of the) use cases as Tor and does it well.
Everyone always complains about how Tor doesn't do padding or Tor shouldn't use DAs or Tor should use UDP, but no one ever tries to do anything about it. Tor's source is freely available. Download it and try to do it yourself. Sure, it's not easy, but if you manage to do that and then test your patches, maybe you can come back and give hard evidence for your arguments.

I agree with you here.

Also, the Tor Project really needs a stronger community. I mean, just look at the commenting system on this blog, it really sucks. You have to wait hours (days sometimes) before your message gets approved on here.

We need a forum or something. If some third party is going to do it, fine, but the Tor Project needs to link to it then, otherwise it will just die out.

I really care for, and want to help this project, but the Tor Project really doesn't seem to care much at all about Tor community in my opinion.

I heard about a guard gateway attack. Any news about this? And could this be stopped?
Because if this is the case, is there a way y'all can make it so that we pick our own guard and exit, for safety reasons?

Hi, arma, you wrote:

"The first big question comes in phase three: is their website fingerprinting classifier actually accurate in practice? They consider a world of 1000 front pages, but ahmia.fi and other onion-space crawlers have found millions of pages by looking beyond front pages. Their 2.9% false positive rate becomes enormous in the face of this many pages—and the result is that the vast majority of the classification guesses will be mistakes."

I think I can guess what you are thinking of here (the simple argument you sometimes call "the base rate paradox" explicated by a tor-talk user years ago, which compares Bayes's formula in the case of flagging common versus very rare events), but can you confirm that guess?
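For reference, the base rate argument being asked about is usually written with Bayes' rule as below. This is a generic statement of the formula, not something confirmed by the post's author. The symbols are introduced here for illustration: p is the fraction of page loads that really go to the monitored site, and the values TPR = 0.88 and p = 0.001 are assumptions; only the 2.9% false positive rate appears in the post.

```latex
\[
P(\text{site} \mid \text{flagged})
  = \frac{\mathrm{TPR}\cdot p}{\mathrm{TPR}\cdot p + \mathrm{FPR}\cdot(1-p)}
\]
% Illustrative numbers: TPR = 0.88 and p = 0.001 are assumed; FPR = 0.029 is from the post.
\[
\frac{0.88 \times 0.001}{0.88 \times 0.001 + 0.029 \times 0.999}
  = \frac{0.00088}{0.029851} \approx 0.029
\]
```

With these illustrative numbers only about 3 in 100 flagged loads would be correct, and the fraction shrinks further as p gets smaller.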

I worry too about lack of traffic padding, reliance on central servers, and other issues raised above, but I think I understand your explanation of why Tor Project hasn't made major changes yet to address these concerns, and I think I agree. Another argument for going cautiously: Tor is one of a tiny handful of critical tools relied upon by whistleblowers, journalists, human rights researchers, and citizen bloggers around the world, but making changes introduces the risk of inadvertently breaking something else, so it makes sense to be conservative. Especially given strong evidence from the Snowden leaks and more recently the Hacking Team leaks that our enemies may encounter serious problems trying to break Tor in the real world.

I worry about the current paucity of practical/tested tools too, but the answer certainly isn't breaking Tor.

Slightly OT:

1. One of the most desired "goodies" added to Tor Browser or Tails would be a modification to gedit similar to the existing spell-checker, which suggests more likely synonyms for rare words. For example I would be prompted to consider replacing "paucity" with a more common synonym. Stylistic quirks other than vocabulary might be harder to address, but even a "danger bar" which displays the sum of the logarithms of the frequency of each word (assuming "en" language, as per Norvig's data) could help users try to reword their posts to better resist stylometry attacks. Rachel can help, perhaps?

2. Is it bad that in the current version of Tails (and Debian stable), pulseaudio (by Lennart Poettering, author of systemd) seems to keep files in /dev/shm/ which appears to act as a sink for the microphone?

3. Do you know why Brendan Poettering (brother of Lennart) wrote his ECC cryptography utility to be perversely hard to use? Maybe there is a security consideration I don't recognize, but if so the documentation does not explain.

4. Things I'd like to see in Tails include utilities such as steghide, plus TTS utilities, plus ECC, twofish, and some other encryption utilities as a back-up if AES is broken tomorrow.
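The "danger bar" idea in item 1 can be sketched in a few lines. This is only an illustration: the frequency table below is made up and tiny, the function names are hypothetical, and a real tool would load a large corpus-derived table such as the word counts the comment alludes to.

```python
import math
import re

# Made-up relative frequencies for illustration only.
WORD_FREQ = {
    "a": 2e-2, "of": 3e-2, "the": 5e-2,
    "lack": 5e-5, "tools": 1e-5, "paucity": 2e-7,
}
UNKNOWN_FREQ = 1e-9  # assumed floor for words missing from the table


def rarity_score(text):
    """Negated sum of log10 word frequencies; higher means rarer wording."""
    words = re.findall(r"[a-z']+", text.lower())
    return -sum(math.log10(WORD_FREQ.get(w, UNKNOWN_FREQ)) for w in words)


def rarest_words(text, n=3):
    """Return up to n distinct words from text, rarest first."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words, key=lambda w: WORD_FREQ.get(w, UNKNOWN_FREQ))[:n]


if __name__ == "__main__":
    for sentence in ("a paucity of tools", "a lack of tools"):
        print(sentence, round(rarity_score(sentence), 2), rarest_words(sentence, 1))
```

An editor plugin could flag the words returned by `rarest_words` as candidates for rewording. Vocabulary rarity is only one stylometric signal, as the comment itself notes.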

"4. Things I'd like to see in Tails include utilities such as steghide, ..., twofish, and some other encryption utilities as a back-up if AES is broken tomorrow."

XKEYSCORE has steganography detection, and it would be very interesting to read how. Maybe like the silly special jihad crypto tools...
Twofish is integrated in gpg. A modern, more sophisticated crypto algorithm like THREEFISH or better would be... very nice. If Virginia would allow that.

For any purpose and any attack other than a clearnet or HS service (eg: webserver, mail) trying to discern, by itself as the service entity, who is contacting it, Tor must be considered not useful. There are too many academic reports of passive and active attacks.

Tor still does a great job at keeping the end service from knowing your IP address, and has a great network of volunteer relays. That will always be the case.

Just don't expect it to keep you safe from NSA types, or from the LEA who partner with them, or the Corp types who partner with both.

Also, do you have something better than Tor? I really can't find anything that has the same or better anonymity and privacy protections as Tor.
Against the NSA, or other global adversaries, it's "not enough", but it's still always better than using the clearnet :)

As explained many times in this blog, the bad guys (intelligence services, cyberespionage-as-a-service companies, corrupt government officials, corporate nasties) attack Tor users by attacking the browser. For Tor users, that means TBB or Tails with TBB, so hardening the browser is in scope.

The Snowden leaks, and more recently the Gamma and Hacking Team leaks, show that Tor Browser Bundle and Tails (especially) have been more successful than project leaders dared to hope (in public) at making things hard for NSA and even defeating (in many cases, it appears) cyberespionage-as-a-service companies and their clients (including LEAs).

These revelations show that the goal of making it difficult for even NSA to deanonymize Tor users "en masse" is not out of reach. This goal is fully consistent with the goals of the Project. Now that we know it is not unrealistic, the Project should certainly attempt to continue to stay ahead of the bad guys, both by continuing to improve Tor itself and by continuing to develop and improve TBB.

Thanks to the developers, and please keep up the good work!

> Too many academic reports of passive and active attacks.

I think you are too pessimistic here. PoCs (proofs of concept) are not necessarily practical in the real world. This point has been discussed many times in this blog.