Wednesday, April 04, 2007

A Mega-Crawler for the rest of us

Do you get this urge sometimes, to query the entire web for something?

Do you wish you had your own Mega-Crawler?

I mean, it's not like you can go to Google and type in some box

SELECT TAG
FROM PAGES
WHERE DOMAIN(PAGE-URL) IN ('domain1', 'domain2', 'domain3')
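As a rough sketch, here is what the SQL-style wish above might look like against a hypothetical query API over crawl records (all field names and the record layout are invented for illustration):

```python
# Hypothetical sketch: the SQL-style query above, expressed as a tiny
# in-memory filter over crawl records. "url" and "tag" are made-up fields.

from urllib.parse import urlparse

def query_tags(pages, domains):
    """Return the stored tag of every crawled page whose domain is in
    the given set -- the moral equivalent of the SQL above."""
    return [p["tag"] for p in pages
            if urlparse(p["url"]).hostname in domains]

crawl = [
    {"url": "http://domain1/x", "tag": "<title>one</title>"},
    {"url": "http://other/y",   "tag": "<title>two</title>"},
]
print(query_tags(crawl, {"domain1", "domain2", "domain3"}))
# -> ['<title>one</title>']
```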

Lunch took a long time today, so Eran and I had some time to brainstorm a bit about a crawler-for-the-rest-of-us:

  • Crawler code would be hosted on Amazon's EC2
  • The data would be stored on Amazon's S3
  • Anyone can add "post-crawl-processors" which will post-process crawled pages (build a full text index, extract microformats, calculate rank...). The persistent data generated by the post-processors will also be hosted on S3.
  • Anyone can submit URLs to be crawled. The system will automatically fork from these URLs to any other discovered URL. Eventually, the entire web will be crawled.
  • API for querying the crawl data, or the data generated by the post-crawl-processors.
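To make the "post-crawl-processor" idea concrete, here is one possible shape for that plugin interface: each processor receives every crawled page and returns a record to persist (on S3, in the real system). Everything here — class names, the `process` method, the record format — is a hypothetical sketch, not a spec:

```python
# Minimal sketch of a post-crawl-processor plugin interface.
# All names are invented; a real system would stream pages from S3.

class MicroformatExtractor:
    """Example processor: collects pages containing an hCard microformat."""
    name = "microformat-extractor"

    def process(self, url, html):
        # Return a record to persist, or None to skip this page.
        if 'class="vcard"' in html:
            return {"url": url, "format": "hCard"}
        return None

def run_processors(processors, crawled_pages):
    """Feed every crawled page to every registered processor and
    collect the records each one chose to persist."""
    results = {p.name: [] for p in processors}
    for url, html in crawled_pages:
        for p in processors:
            record = p.process(url, html)
            if record is not None:
                results[p.name].append(record)
    return results

pages = [
    ("http://example.com/a", '<div class="vcard">Jane</div>'),
    ("http://example.com/b", "<p>no microformats here</p>"),
]
print(run_processors([MicroformatExtractor()], pages))
```

The point of the interface is that a full-text indexer, a rank calculator, or a proprietary extractor all plug in the same way — they only differ in what `process` returns.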

Who will pay for this? Companies and organizations who wish to use this data:

  • The basic crawling cost will be divided among the "subscribers". Initially: a small database and low costs. Later: a larger database, but also more subscribers, so costs (hopefully) remain low.
  • The cost of a post-processor (CPU, storage) is divided by the number of subscribers the post-processor has. The more useful it is, the more subscribers will use it, and the less each will pay. If it's a proprietary post-processor, no need to share it, but it will naturally cost more (being used by only 1 subscriber).
  • Retrieving query results will be charged by bandwidth.
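The cost-sharing arithmetic in the list above is simple enough to write down (the numbers are invented, just to show the effect):

```python
def cost_per_subscriber(total_cost, subscribers):
    """Split a post-processor's CPU/storage bill evenly among its subscribers."""
    return total_cost / subscribers

# A proprietary processor with a single subscriber bears its full cost...
print(cost_per_subscriber(100.0, 1))   # 100.0
# ...while the same processor shared by 20 subscribers costs each one far less.
print(cost_per_subscriber(100.0, 20))  # 5.0
```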

The general idea is pay-as-you-use, with prices going down as more subscribers use the service. No one makes money (well, except for Amazon, of course): everyone shares costs, and IP (the post-processors) can be shared or protected. The more you consume, the more you pay. The more you share, the less you pay.

This is very rough of course. But what do you think? Is it feasible? Is it interesting?

6 comments:

John said...

Alexa offers something like this, called the Alexa Web Search Platform

Remi said...

I bet if some really clever programmer came along, they would be able to nearly decentralize everything... making everything P2P, à la BitTorrent, Freenet, Gnutella, BOINC, etc.

A massive distributed search engine...

If enough people ran it...

I believe there are already some efforts in this direction.

The place to work on it might be at en.wikiversity.com, sourceforge.net, or code.google.com.

My blogs are kokyunage.net and kuzushi.us.

spinchange said...

This is a FANTASTIC idea! The AWS platform lends itself perfectly to this kind of project. I have been tinkering with my own & various open source spiders, "memetrackers", etc... To share, bootstrap, and crowdsource on this would be great!

-Chris Duffy
Spinchange at gmail dot com

Daniel said...

Isn't this what Younanimous (aka AfterVote) does?

spinchange said...

Why not subsidize the "cloud costs" with revenue generated by monetizing your SERPS with an AdSense account? That would be hilarious. I'd click those ads for sure!

gdupont said...

When dream becomes reality (or almost)

www.yacy.net/yacy/