Ahmia search after GSoC development

by juha | September 8, 2014

The Google Summer of Code (GSoC) was an excellent opportunity to improve on the Ahmia search engine. With Google's stipend and friendly mentoring from The Tor Project, I was able to concentrate on development of my search engine project. Thank you all!

GSoC 2014 is over, but I am sticking around to continue developing and maintaining Ahmia.

Here is the current status of Ahmia after GSoC development:

Introduction

Ahmia is open-source search engine software for Tor hidden service websites. You can test the running search engine at ahmia.fi.

Building a search engine for anonymous websites running inside the Tor network is an interesting problem. Tor enables web servers to hide their location, and Tor users can connect to these authenticated hidden services while both the server and the user stay anonymous. However, finding web content is hard without a good search engine, so the Tor network needs one.

Web search engines are needed to navigate and search the web. There were no search engines for searching hidden service web content, so I decided to build a search engine specially for Tor. I registered ahmia.fi and started development on it as a side project in 2010.

This development involved programming and testing web crawlers, thinking of ways to find hidden service addresses (since the protocol does not allow enumeration), learning about the Tor community, and implementing a filtering policy. Moreover, I implemented an API that empowers other Tor services that publish content to integrate with Ahmia.

As a result, Ahmia is a working search engine that indexes, searches and catalogs content published on Tor Hidden Services. Furthermore, it is an environment to share meaningful statistics, insights and news about the Tor network itself.

Interesting Summer of Code

One of my best memories from the summer is the Tor Project's Summer 2014 Developers meeting that was hosted by Mozilla in Paris, France. I have always admired the people who are working on the Tor Project.

I also loved the coding itself. Finally I had time to improve the Ahmia search engine and its many features. I did a lot of work and liked it.

Some journalists were very interested in my work: Carola Frediani asked if I could analyze the content of hidden services. I coded a script that fetches the HTML of every front page, gathered all the keywords, headers and description texts, and made a simple word cloud visualization.

Hidden website content visualization.

It is a simple way to glance at what is published on the hidden websites.
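The word counting behind such a visualization can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual script: the class name `FrontPageTextParser`, the function `word_counts`, and the choice of tags (title, h1 to h3, meta keywords and description) are assumptions made for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class FrontPageTextParser(HTMLParser):
    """Collect title and header text plus meta keywords/description."""

    HEADER_TAGS = {"title", "h1", "h2", "h3"}   # assumed set of "headers"
    META_NAMES = {"keywords", "description"}

    def __init__(self):
        super().__init__()
        self.chunks = []   # text fragments worth counting
        self._depth = 0    # > 0 while inside a title/header element

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() in self.META_NAMES:
                self.chunks.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag in self.HEADER_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def word_counts(pages):
    """Count words (3+ ASCII letters) across a list of front-page HTML strings."""
    counts = Counter()
    for html in pages:
        parser = FrontPageTextParser()
        parser.feed(html)
        parser.close()
        for chunk in parser.chunks:
            counts.update(re.findall(r"[a-z]{3,}", chunk.lower()))
    return counts
```

The resulting counter can be passed to any word cloud tool; `word_counts(pages).most_common(50)` gives the fifty most frequent terms.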

Carola found this data useful and used it in her presentation at www.sotn.it on June 11th.

Technical design of Ahmia

The Ahmia web service is written using the Django web framework. As a result, the server-side language is Python. On the client-side, most of the pages are plain HTML. There are some pages that require JavaScript, but the search itself works without client-side JavaScript.

The components of Ahmia are:

  • Django front-end site
  • PostgreSQL database for the site
  • Custom scripts to download data about hidden services
  • Django-Haystack connection to Solr database
  • Apache Solr for the crawled data
  • OnionBot crawler that gathers data to Solr database

Technical architecture.

See the installation and development tutorial.

Search

The full-text search is implemented using Django-Haystack. The search uses crawled website data that is stored in Apache Solr.

OnionDir

OnionDir is a list of known online hidden service addresses. A separate script gathers this list and fetches information fields from the HTML (title, keywords, description etc.). Furthermore, users can freely edit these fields.

We've also started a convention where hidden service admins can add a file to their website, called description.json, to offer an official description of their site in Ahmia.

This information is shown on the OnionDir page, and over 80 domains are already using this method.
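To illustrate how a directory might consume such a file, here is a minimal sketch of parsing a description.json body. The field names in `ALLOWED_FIELDS` and the length cap `MAX_LENGTH` are hypothetical and chosen only for this example; the real file format may use different fields.

```python
import json

# Hypothetical field names; the real description.json format may differ.
ALLOWED_FIELDS = ("title", "description", "keywords", "language")
MAX_LENGTH = 500  # assumed cap on the stored length of each field


def parse_description(raw_text):
    """Return the accepted fields of a description.json body, or {} if invalid."""
    try:
        data = json.loads(raw_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {}
    if not isinstance(data, dict):
        return {}
    cleaned = {}
    for field in ALLOWED_FIELDS:
        value = data.get(field)
        # Keep only non-empty strings and ignore unknown or malformed fields.
        if isinstance(value, str) and value.strip():
            cleaned[field] = value.strip()[:MAX_LENGTH]
    return cleaned
```

Ignoring unknown keys and non-string values keeps a malformed or hostile file from injecting unexpected data into the directory page.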

Statistics

We are gathering statistics from hidden services. As a result, we can represent and share meaningful data about hidden services and visualize it.

We are gathering three types of popularity data:

  1. Tor2web nodes share their visiting statistics with Ahmia
  2. Number of public WWW backlinks to hidden services
  3. Number of clicks in the search results

The click counter shows the total number of clicks on a search result on ahmia.fi.

Filtering

We have decided to filter any sites related to child porn from our search results, and Ahmia removes everything related to these websites. These websites may not be actual child porn sites. Rather, they are sites where users can post content (forums, file and image uploads, etc.) and where, as a result, there has been, at least momentarily, some suspicious content that was not moderated within a reasonable period of time. Ahmia.fi does not have the time to monitor these sites carefully, so we ban sites from our public index if we see any evidence of child abuse. Of course, the ban is removed if the site itself contacts us and our review finds the website to be OK.

In practice, Ahmia calculates the MD5 sums of the banned domains and uses them as the filter list. Moreover, we share this list so that Tor2web nodes can use it to filter out pages.
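A hash-based filter of this kind can be sketched in a few lines. The function names and the normalization step (stripping whitespace and lowercasing before hashing) are assumptions for illustration; the exact input format that Ahmia hashes is not specified here.

```python
import hashlib


def domain_md5(domain):
    """MD5 hex digest of a domain name.

    Assumption: the domain is stripped and lowercased before hashing.
    """
    normalized = domain.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_banned_set(domains):
    """Turn a list of banned domains into a set of MD5 hex digests."""
    return {domain_md5(domain) for domain in domains}


def is_banned(domain, banned_hashes):
    """Check a domain against the shared list of hashes."""
    return domain_md5(domain) in banned_hashes
```

Sharing only the digests lets a Tor2web node test whether a requested domain is on the list without the list itself spelling out the banned addresses.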

At the moment, there seem to be 1228 hidden website domains online, and 7 of them have been filtered because they are possibly sharing child porn content.

OnionBot

OnionBot is a crawler for hidden service websites based on the Scrapy framework. It crawls the Tor network and passes data to the search database. OnionBot requires the Tor software (using Tor2web mode) and Polipo. The results are saved to Apache Solr.
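Because hidden service addresses cannot be enumerated, a crawler has to discover them by following links. The sketch below shows one way to pull onion domains out of fetched text with a regular expression. It is not OnionBot's code and uses the standard library rather than Scrapy; the names `ONION_RE` and `extract_onion_domains` are made up for the example. It relies on current onion addresses being 16 characters from the base32 alphabet (a-z, 2-7).

```python
import re

# 16 base32 characters followed by ".onion", not preceded by another
# base32 character (so longer strings are not matched by their tail).
ONION_RE = re.compile(r"(?<![a-z2-7])([a-z2-7]{16})\.onion\b", re.IGNORECASE)


def extract_onion_domains(text):
    """Return the sorted, de-duplicated onion domains found in text."""
    found = {match.group(1).lower() + ".onion" for match in ONION_RE.finditer(text)}
    return sorted(found)
```

Newly found domains could then be queued for crawling, with the filtering policy applied before anything is added to the public index.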

Apache Solr

Apache Solr is a popular, open source enterprise search platform. Its major features include powerful full-text search, hit highlighting, faceted search, and near real-time indexing.

The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

Security measures for privacy

In the software

  • We do not log any IP addresses (see the Apache configuration)
  • We gather clicks in real time; however, this data is not shown accurately

In the host ahmia.fi

  • Backend servers are run separately and they do not have any knowledge about the end-users
  • All servers are hosted in countries with strong privacy laws, for example Finland and the Netherlands
  • Communication between servers is encrypted
  • Only a few trustworthy people know the locations of the back-end servers and are able to access them

Future work

GSoC 2014 was fun and productive!

There is a lot more to do. However, I do not have time to do everything myself. Of course, I am coding when I have time and maintaining the search engine.

In addition, I am going to write a scientific article about the implementation.

Is there anyone who would be interested in developing Ahmia.fi?

Is anyone familiar with Solr and would know how to tweak it for full text search?

Furthermore, any kind of help would be most welcome. There are always Linux admin duties, HTML/CSS design, bug fixing, Django development, etc.

For further information, please don't hesitate to contact me by e-mail: juha.nurmi@ahmia.fi

Comments

Please note that the comment area below has been archived.

September 09, 2014

Will your crawler parse and honour "robots.txt" from hidden services for general and/or specific interdictions, if any?
If your crawler honours targeted interdictions, how is the robot name to be spelled in "robots.txt"?

September 09, 2014

1. What PT does Orbot support?
2. If I direct my DNS settings to 127.0.0.1:5400 while Tor is running, will my system use a torified DNS?
3. If I install standalone Tor on my computer (via terminal/cmd), can I use it?

September 10, 2014

I understand the importance of Tor, but I also see that using it, as opposed to other browser set-ups, instantly cuts out about 60% of the current Internet. Search engines on web sites that use a Google plug-in search engine suddenly are inaccessible. Comment sections become inaccessible. Videos and audio files become inaccessible. Video and audio web sites become inaccessible. Retail sites become inaccessible. The Internet is suddenly devolved back to 1992 or perhaps 1986. There comes a point at which anonymity begins to blur with the tinfoil hat brigade. It is right that no government, corporation or neighbor should be able to access our private information. It is also right that very little that you do is important enough for anyone else's scrutiny. I value my privacy, but I'm the only one. Not because I have so much at stake but because no one else cares, nor should. Right now, Tor is too anonymous. It stops being useful as a general web browser. Using Tor is not going to force the rest of the world to concede to our desires for anonymity. They're too busy selling smart phones and pet meds to the 99% who don't care about their privacy. Unless you can change the behavior of enough Internet users to have an economic impact on the perpetrators, privacy is a non-issue. That makes Tor just an annoying curiosity, like Dungeons & Dragons or Warhammer 40,000.

Would anyone but a troll start their speech claiming
"I understand the importance of Tor",
and then nowhere in their entire _l_o_n_g_ unbroken paragraph make any attempt to show they might "understand the importance of Tor"??
:-)

September 10, 2014

Is the ahmia.fi site trying to extract canvas data when I click a link on the Statistics Viewer page? NOT COOL.

Yes, there are pages that use JavaScript and even canvas. These pages show some stats visualizations.

I have made sure that the main features of Ahmia are working without JavaScript.

Unfortunately, JavaScript is the easiest way for me to make the visualizations. That's why it is used on some pages.

September 11, 2014

In reply to juha

understood and you are not alone in these respects (e.g. Atlas uses javascript), but my opinion is that applications specifically designed for tbb users should always aim for full functionality without the use of browser features that increase a user's attack surface.

not saying your implementation is any sort of threat per se, but affiliated projects should not be in the business of prompting people to make themselves more vulnerable to use your site and have to remember to change things back again once they're off-site. this is especially true for a search engine, where clicking through to other sites is what people are going to be doing. i think it encourages user behaviors that aren't good for their security.

September 12, 2014

In reply to juha

Do "the main features of Ahmia" include the abuse notifier button? (See my other comment; I am not the person who commented about canvas, but I will agree, NOT COOL.)

I think this is the first example I have seen first-hand that JavaScript helps CP spread (because dedicated privacy activists with a strong security posture will not be able to use the abuse button). I always joked that "JavaScript kicks puppies" and the like, but this is worse and it is not a joke.

September 10, 2014

Why didn't you opt for a decentralized design like yacy.net? It seems like it would be easier to build and a great way to offer mutual aid and solidarity to an open source project already focused on decentralized search..

September 11, 2014

what was broken with YaCY? the link provided doesn't really explain your thought process

September 11, 2014

Hi, everyone.
Tor Browser can successfully connect to the Tor network, but unfortunately it occasionally fails to open. Why?

September 15, 2014

Am I correct that your word cloud visualization also omits any references to CP?

Isn't that sort of lying to yourself? I don't think leaving out one of the biggest parts of hidden services is a very scientific approach just because we may not like to acknowledge it.

There were 7-9 hidden services filtered when the content analysis was made. This probably didn't have any effect on the results because there are almost 1300 hidden service websites.

September 19, 2014

If you discriminate against pedophiles by filtering out their sites, thus censoring the hidden web, what makes you any better than those who would censor the entire internet to discriminate against any minority or prevent people knowing about any number of things? Pedophilia is not the only thing people find objectionable, nor is it the only thing associated with abuse.

Sure, you may find CP to be awful stuff, but hiding it does nothing to prevent child abuse, nor does it help the abused in any meaningful way. It only is a way of "sweeping it under the rug", so that people can feel better in their ignorance. It also endangers freedom of information for all of us. See https://falkvinge.net/2012/09/07/three-reasons-child-porn-must-be-re-le…

October 04, 2014

Two questions I never really expect to be answered:

Why doesn't Ahmia have a .onion url?

Ahmia's main page says as of 4 Oct that there are 1270 hidden services, but according to skunksworkedp2cg.onion there are 1479 hidden sites, and assuming that they also respect robots.txt, while you claim to be only censoring 12 ATM, that leaves almost 200 hidden sites unaccounted for, how do you explain this?