A technical summary of the Usenix fingerprinting paper
Albert Kwon, Mashael AlSabah, and others have a paper entitled Circuit Fingerprinting Attacks: Passive Deanonymization of Tor Hidden Services at the upcoming Usenix Security symposium in a few weeks. Articles describing the paper are making the rounds currently, so I'm posting a technical summary here, along with explanations of the next research questions that would be good to answer. (I originally wrote this summary for Dan Goodin for his article at Ars Technica.) Also for context, remember that this is another research paper in the great set of literature around anonymous communication systems—you can read many more at http://freehaven.net/anonbib/.
"This is a well-written paper. I enjoyed reading it, and I'm glad the researchers are continuing to work in this space.
First, for background, run (don't walk) to Mike Perry's blog post explaining why website fingerprinting papers have historically overestimated the risks for users, and then check out Marc Juarez et al's followup paper from last year's ACM CCS that backs up many of Mike's concerns.
To recap, this new paper describes three phases. In the first phase, they hope to get lucky and end up operating the entry guard for the Tor user they're trying to target. In the second phase, the target user loads some web page using Tor, and they use a classifier to guess whether the web page was in onion-space or not. Lastly, if the first classifier said "yes it was", they use a separate classifier to guess which onion site it was.
The first big question comes in phase three: is their website fingerprinting classifier actually accurate in practice? They consider a world of 1000 front pages, but ahmia.fi and other onion-space crawlers have found millions of pages by looking beyond front pages. Their 2.9% false positive rate becomes enormous in the face of this many pages—and the result is that the vast majority of the classification guesses will be mistakes.
For example, if the user loads ten pages, and the classifier outputs a guess for each web page she loads, will it output a stream of "She went to Facebook!" "She went to Riseup!" "She went to Wildleaks!" while actually she was just reading posts in a Bitcoin forum the whole time? Maybe they can design a classifier that works well when faced with many more web pages, but the paper doesn't show one, and Marc Juarez's paper argues convincingly that it's hard to do.
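To make the base-rate problem concrete, here is a rough back-of-the-envelope calculation. The 2.9% false positive rate comes from the paper; the true positive rate and the size of the open world are made-up placeholders for illustration:

```python
# Back-of-the-envelope base-rate calculation. The 2.9% false positive
# rate is from the paper; the 0.9 true positive rate and the 5M-page
# open world are illustrative assumptions, not the paper's numbers.

def precision(tpr, fpr, monitored, unmonitored):
    """Probability that a positive classification is actually correct."""
    true_positives = tpr * monitored
    false_positives = fpr * unmonitored
    return true_positives / (true_positives + false_positives)

# Closed world like the paper evaluates: 1000 front pages, nothing else.
print(precision(tpr=0.9, fpr=0.029, monitored=1000, unmonitored=0))  # → 1.0

# Open world: millions of onion pages the classifier was never trained on.
p = precision(tpr=0.9, fpr=0.029, monitored=1000, unmonitored=5_000_000)
print(round(p, 4))  # → 0.0062, i.e. ~99.4% of positive guesses are wrong
```

In other words, with these (hypothetical but plausible) numbers, almost every "she went to site X!" guess would be a mistake.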
The second big question is whether adding a few padding cells would fool their "is this a connection to an onion service" classifier. We haven't tried to hide that in the current Tor protocol, and the paper presents what looks like a great classifier. It's not surprising that their classifier basically stops working in the face of more padding though: classifiers are notoriously brittle when you change the situation on them. So the next research step is to find out if it's easy or hard to design a classifier that isn't fooled by padding.
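As a toy illustration of why padding perturbs a classifier, consider a sketch where a circuit trace is just a list of cell directions and the "features" are simple counts. The feature set and the padding scheme here are invented for illustration; the paper's real classifier uses richer circuit-level features:

```python
import random

PADDING = 0  # direction-less dummy cell, for illustration only

def features(trace):
    """Toy feature vector: (total cells, outgoing fraction).
    Whatever the real classifier counts, padding changes the counts."""
    total = len(trace)
    outgoing = sum(1 for c in trace if c == +1)
    return total, outgoing / total

def pad(trace, n, seed=42):
    """Insert n dummy cells at random positions (hypothetical defense)."""
    rng = random.Random(seed)
    padded = list(trace)
    for _ in range(n):
        padded.insert(rng.randrange(len(padded) + 1), PADDING)
    return padded

trace = [+1, -1, -1, -1, +1, -1, -1, -1, -1, +1]  # made-up circuit trace
print(features(trace))           # (10, 0.3)
print(features(pad(trace, 5)))   # total jumps to 15; the fractions shift
```

The open question is whether a classifier can be trained to see through this kind of perturbation, rather than being brittle to it.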
I look forward to continued attention by the research community to work toward answers to these two questions. I think it would be especially fruitful to look also at true positive rates and false positives of both classifiers together, which might show more clearly (or not) that a small change in the first classifier has a big impact on foiling the second classifier. That is, if we can make it even a little bit more likely that the "is it an onion site" classifier guesses wrong, we could make the job of the website fingerprinting classifier much harder because it has to consider the billions of pages on the rest of the web too."
One of the main reasons that Tor does not move fake traffic through its network is that no one has yet figured out how much fake traffic is needed to solve the problem, if it can actually be solved by this technique. Tor uses fixed-size cells with low-latency onion routing because that approach achieves its security goals with methodology that is simple and well understood. Adding fake traffic to the network, however, is neither simple nor well understood.
Please see the FAQ entry that covers this concern: https://www.torproject.org/docs/faq#SendPadding
is there anybody who does not understand what the signal/noise ratio is? get old clever books and try to guess.
if you add 1% noise, will the SNR be lower?
how expensive is it to extract anything about the signal from a 100% loaded link with noise-like transmission?
to begin with, why not add an (experimental) parameter 'Noise', tunable by the user?
this noise can be used on the first link to the entry tor router. Make it now, and then talk about "how much fake traffic is needed" between tor routers!
> how expensive is it to extract anything about the signal from a 100% loaded link with noise-like transmission?
Trivial. That sort of thing leaks load information via clock skew, which is one of the reasons why the BuFLO website fingerprinting defense exists merely as a theoretical construct and isn't practical to deploy.
Additionally, an accurate estimate of link capacity is required to get good network utilization when using this sort of static pad-to-max-capacity scheme, and that estimation is a fairly difficult problem to solve.
> this noise can be used on the first link to the entry tor router
This already exists in the form of ScrambleSuit/obfs4's burst obfuscation (Warning: Not designed for this sort of use, but they do add padding), and obfsproxy-wfpadtools/basket's website fingerprinting defenses.
The most effective defense I am aware of is CS-BuFLO, and that incurs a ~3x bandwidth increase if you use application level hinting, and ~6-10x if you do not.
This also won't do anything against the attack presented in the USENIX paper, because the paper assumes the guard is malicious (so the padding must extend at least 2 hops).
The reason link layer padding is not done isn't because we don't like the general idea. It is because no one has come up with an algorithm that won't cause the Tor network to implode from the load, while providing good security/anonymity properties. If we didn't like the idea, we wouldn't be working with researchers on coming up with a suitable algorithm (eg: We did a GSOC project, I wrote basket, we work with researchers doing website fingerprinting/defenses).
Why not, instead of bloating the network with fake traffic, make it so that all packets move synchronously? It's a timing issue, right: clients spit out a stream of packets that cluster together to form a signature, so regulate the flow of packets. When the 'number-of-packets' element of the signature gets to the first relay, it will be combined with the other streams coming into that relay at that moment in time, forming a mux of all streams that gets spit out into the next relay. Even if the client is the only one using that particular entry relay at that moment, it will eventually mux at the 2nd and 3rd relays, which are likely handling more than one stream. Is there more to the problem than timing? This seems like a trivial fix.
What you are describing is a mixnet or an early-generation onion router. Mixnets have their place, but when you start delaying packets in order to mix them and remove timing information, users start complaining and leave, which means there are fewer users in the mix and thus less indistinguishability for everyone involved. It's commonly said that there are more researchers writing about mixnets than there are mixnet users.
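For intuition about where that delay comes from, here is a toy threshold mix: it refuses to forward anything until it has buffered enough messages to shuffle them together. This is an illustrative sketch, not any deployed mixnet design; the threshold and names are made up:

```python
import random

class ThresholdMix:
    """Toy threshold mix: buffer messages, flush a shuffled batch only
    once `threshold` messages have arrived. The waiting is exactly the
    latency that drives users away."""
    def __init__(self, threshold, seed=0):
        self.threshold = threshold
        self.pool = []
        self.rng = random.Random(seed)

    def submit(self, msg):
        self.pool.append(msg)
        if len(self.pool) < self.threshold:
            return []            # not enough cover traffic yet: msg waits
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)  # destroy arrival-order (timing) linkage
        return batch

mix = ThresholdMix(threshold=3)
print(mix.submit("alice"))  # [] -- alice waits for two more users
print(mix.submit("bob"))    # []
print(mix.submit("carol"))  # all three flushed in shuffled order
```

A lone user on a quiet mix waits indefinitely, which is the usability problem in a nutshell.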
Not that simple.
a) Constant rate is bad because it leaks information regarding CPU load. But this is a solvable problem (and the "correct" thing to do is sort of known, is easy to implement, and I've done so in the past).
b) How do you determine the rate? You need to maintain an accurate estimate of the bandwidth and latency, or you end up either causing your connection to collapse (too aggressive), or you under-utilize the link capacity. There are a lot of Tor users that are limited to ~128 kbit (Max residential line speed in Iran).
c) What behavior should each side take when the time has come to send traffic, but there is no traffic to send? If you do nothing, the classifier presented in the paper can be trivially adapted to handle this sort of defense. If you opt to send cover, how much cover do you send, for how long?
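Questions (a) through (c) can be made concrete with a toy constant-rate scheduler. The rate and the cover budget below are placeholder parameters standing in for exactly those open questions, not recommended values:

```python
from collections import deque

def constant_rate_schedule(app_cells, slots, cover_budget):
    """Toy constant-rate link: emit one cell per time slot, every slot.
    `slots` (the rate) stands in for question (b); what to do when the
    application queue runs dry, and how big `cover_budget` should be,
    is question (c). Placeholder values, not a real design."""
    queue = deque(app_cells)
    sent = []
    for _ in range(slots):
        if queue:
            sent.append("DATA")      # real traffic takes priority
            queue.popleft()
        elif cover_budget > 0:
            sent.append("PADDING")   # spend cover budget to hide the gap
            cover_budget -= 1
        else:
            sent.append("SILENT")    # observable gap: classifier fodder
    return sent

print(constant_rate_schedule(["c1", "c2"], slots=5, cover_budget=2))
# → ['DATA', 'DATA', 'PADDING', 'PADDING', 'SILENT']
```

Once the cover budget runs out, the trailing "SILENT" slots are exactly the kind of signal a classifier can be adapted to pick up.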
Basically, to defend against an evil guard on the client side, some sort of active defense must be run to at least the middle relay. The best known algorithm (CS-BuFLO, see http://arxiv.org/abs/1401.6022) is prohibitively expensive, even with application-side support, so I see this as an open research area (one that people in academia are actively working on, including my former GSOC student).
The moment anyone comes up with an answer to this problem that realistically defends against this sort of attack, that won't kill the Tor network, won't make things worse, and won't totally hose people on really bad internet links (who arguably really need Tor), I will be ecstatic, because this is something that needs to be fixed.
Note: Defending against an evil guard as a HS is significantly harder, since the HS Guard can trivially mount a confirmation attack (This is the equivalent of sitting outside someone's house, with wire cutters on the telephone line, and seeing if they drop offline when you cut the cord.).