KAIST freshmen working on bridge distribution strategies
Thanks to a friend who's a professor at KAIST in Korea, several teams of students there are working for their "freshman design class" on designing new bridge (aka bridge relay) distribution strategies. Here's some early brainstorming on what the actual problems are and what needs doing.
Background: Tor uses directory servers to give out information about what relays are in the network. But blocking connections between users and the main Tor network is actually not that hard — after all, there's a public list of relays, which needs to stay public so clients can know where to connect. That's where bridge relays come in. Now that we've deployed bridge relays, we changed the arms race from "how do we prevent the adversary from learning about 2000 IP address that we're openly publishing?" (which is an impossible problem) to "how do we take these several thousand secret IP addresses and give them out one at a time to the good guys, without letting the bad guys learn all of them?" Hopefully that's a more manageable arms race. You can read more background in the blocking-resistance design doc and video.
Here's the problem in a nutshell: what are some creative ways to distribute the IP addresses for bridge relays, such that ordinary users can generally learn a few bridge addresses, but an attacker can't learn all of them?
The key point is that we need diversity of distribution strategies. It's hard to predict the interests and capabilities of the adversary, so it will be hard to predict which strategies are going to be successful.
How do bridges work behind the scenes? Bridge relays anonymously upload their server descriptors to the bridge authority, which aggregates the list and checks which bridges are reachable and working. We've written a set of tools called bridgedb that looks at this list and gives out addresses to users according to various strategies. Each bridge address is only available via one of the strategies, so if one strategy turns out to be weak, the attacker only gets to learn the bridges associated with that strategy.
So far we've deployed three simple strategies: IP-based, email autoresponder, and manual. For the IP-based strategy, we look at which IP address you're coming from and list a few bridges, but we always answer a given IP address with the same list. That way an attacker needs to have a wide variety of IP addresses before he can learn many of the bridges allocated to this strategy. For the email autoresponder, we answer a given email address with a few bridges, but we answer the same way if that email address asks again. And for the manual strategy, we keep some bridge addresses in reserve, a) in case the other two strategies get broken at the same time, and b) to be able to give them out to people who ask on IRC, IM, etc.
Of course, it gets much more complex when we consider all the practical issues that come up. For example, for the IP-based strategy, we need to count nearby addresses (e.g. the whole /24 netblock) as the same address, or an attacker with a single class C could be lots of different addresses. We also need to treat open proxies — and Tor exit relays! — separately, or it's too easy to search for open proxies and beat it. Tor provides an easy-to-use list of exit addresses, but how do we easily collect an up-to-date list of open proxy addresses?
And for the email autoresponder, how do we prevent an attacker who controls a given domain from just creating thousands of email addresses at that domain, and learning lots of bridges? Our current answer is to only answer requests from a few domains — in particular, Gmail — that we know have reasonable ways to slow down creation of lots of accounts. In this way we leverage Gmail's "is a human" checks without having to design our own. Gmail also provides another feature: it signs all its outgoing mail using dkim. If it didn't, an attacker could forge mail from a variety of Gmail addresses and make it look like we're spamming them. After a while, Gmail would mark us as a spam source, and suddenly none of our mails would get through anymore.
How do we handle churn? After all, if we are truly answering the same address with the same set of bridges, then if those bridges go away, we'll be giving out useless addresses. Our current answer is based on consistent hashing. We hash each bridge's identity key, and then when a request comes in, we hash that request's address. Whichever hash(bridge key) is nearest to that hash(address) is the one we give out, along with the two bridges after that for robustness. Now we're always giving out bridges from the same location in the hash table, even if one of those bridges disappears for a while (or a new one shows up).
We could imagine a lot more strategies. For example, the user SMS's us from a given phone number, and we send a few bridge addresses back. Or we wrap the bridge addresses using time-release crypto, such that it takes the user (or the attacker) a few hours of CPU-time to decrypt it. Or we get all the users to subscribe to a mailing list, and we send out a new bridge address every 6 hours, in hopes that the users can use them a while before the attacker gets around to blocking them. Or heck, another strategy would be just collecting $10 from the user for each request for bridge addresses.
Another option is a social networking approach: rather than relying on technical tricks, rely on trust between humans to limit the risk that bridge addresses will fall into the wrong hands, and maybe construct a reputation system to identify participants whose bridge addresses get blocked more often than they "should". This idea gets messy quickly: see "strategy six" in the blocking-resistance design doc for more thoughts.
There are four components to tackling this general design question.
1) Identify some resources such that it's straightforward to get a few of them, but harder to get many of them. The ones I listed above are just a few examples, and I'm sure there are dozens more out there.
2) Speculate about the capabilities of various attackers (ISPs, countries, intelligence agencies, companies, etc), and analyze how hard it would actually be for them to collect lots of the resource. The ideal resource here would be something that's asymmetric, such that the right individuals can get it cheaply but the wrong parties will find it expensive. For example, the strategies based on IP address, email account, or even phone number are brittle: they would not stand up well to an attacker who has access to lots of different IP networks, can solve thousands of captchas, or owns a phone company. The social network based approach is an example of a resource that's more asymmetric: infiltrating a human-based network takes more resources than solving a bunch of captchas. Figure out which attackers your strategy is robust against, and which attackers it isn't.
3) Figure out how you'd actually build it in practice, and how you could make it more robust against attack. Where do we get a list of open proxies for the IP address strategy, and how effective would that actually be at slowing down various attackers? How do you receive SMS's on the Internet, and how do you send them in a sustainable (e.g. free) way? What is the true cost of getting a new phone number? (Gmail sometimes demands a new phone number for creating an account — this approach created an unexpected secondary market, where spammers would buy a new SIM card, use it just for its phone number, then resell it for almost full price since all the minutes were intact.) How hard is it in practice (measured in time, cash, skills, etc) to get 500 Gmail accounts? How do you pick the "seeds" of your social network, and are there unexpected security problems that come up, e.g. keeping sensitive information about users on a central site that the adversary could attack? These questions are only a few examples, and you'll have to extrapolate from them to figure out good questions to ask for your own designs.
4) The best way to figure out if you've thought about all the issues is to actually try to build it. You could build it as a module in the bridgedb tools, or as a separate standalone program. Try to have your friends (or other design teams) attack it, and see if they come up with new approaches or vulnerabilities that you hadn't considered.
While you're at it, there are a variety of other design questions that come up around bridges, and understanding them may help you come up with good approaches.
1) We only have about 500 bridge addresses right now total. Does your distribution strategy work well when it's trying to protect only a few hundred bridge addresses, or does it only start to be effective when there are tens of thousands?
2) Some of these strategies, like the SMS approach, might be able to be monitored by the attacker. Thus a) he can passively enumerate bridge addresses just by monitoring his own users, and b) he can passively enumerate people who ask for bridge addresses. First of all that vulnerability makes the strategy less robust (it's easier than we thought for that class of attacker to gather addresses). But second, it may put the users at higher risk. Strategies like the Gmail auto-responder might be safer, since the users have SSL link encryption when sending and receiving mail. That is, assuming they trust Google.
3) It's a shame that we need to use a centralized bridge authority here. That means an attacker who can break into the bridge authority can learn all of the bridge addresses at once. Further, the bridge authority needs to learn which bridges are reachable (to know which ones to give out), which provides another avenue for an attacker. Are there ways to distribute bridge addresses over more than one bridge authority, in a way that can handle churn and still preserves our rate-limited release properties?
4) How do we test whether bridges are reachable from inside some of these countries? The bridge authority can either test directly (which has problems), or test via Tor (which has different problems). See sections 7.6 and 7.7 of the blocking-resistance document for more discussion.
Overall, a lot of the challenge here comes from not having a good handle on how powerful the attackers might be. In the cryptography world, the standard approach is to assume an extremely powerful attacker, and design a system that can protect against even that. But here, it would seem that convenience and usability are at odds with this traditional approach to strong security — not to mention that we don't even know how to provide strong (crypto-grade) security in this context.
Last, it's worth noting that so far no countries have blocked the public Tor network. Tor played a big role in Iran's organizations and demonstrations in June, and it has even more users in China, Russia, and other repressive countries. So the arms race has already begun, and none of the adversaries have proved to be powerful enough and/or concerned enough to take a step. Building a broad range of strategies — from convenient but breakable ones to very secure ones — will be the best preparation for however the arms race plays out.