Lucky Oyster Find They Can Crawl 3.4 Billion Pages For $100

17 October 2012
bzamayo.com/lucky-oyster-three-billion-pages-for-hundred-dollars

A few weeks ago, while working on prototype search technology for Lucky Oyster, we were able to leverage a few simple components — data from Common Crawl, Spot Instances from AWS, a few hundred lines of Ruby, and assorted Open Source software — to data mine 3.4 billion Web pages, extracting close to a terabyte of structured data, and building a searchable index of close to 400 million entities. The cost? About $100 US. And all work completed, thanks to several hundred worker nodes, in about 14 hours.
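
To put those figures in perspective, here is a rough back-of-the-envelope sketch of the unit economics. It assumes the quoted numbers are exact (3.4 billion pages, $100, 14 hours), although the source only gives them as approximations. The function and key names are made up for illustration, and the node count is left out because the source only says "several hundred".

```python
# Figures quoted by Lucky Oyster, treated as exact for this sketch (assumption).
PAGES = 3_400_000_000
COST_USD = 100.0
HOURS = 14.0


def crawl_economics(pages, cost_usd, hours):
    """Return rough unit costs and throughput for a bulk page-processing job."""
    seconds = hours * 3600
    return {
        "pages_per_dollar": pages / cost_usd,
        "usd_per_million_pages": cost_usd / (pages / 1_000_000),
        "pages_per_second": pages / seconds,
    }


stats = crawl_economics(PAGES, COST_USD, HOURS)
for name, value in stats.items():
    print(f"{name}: {value:,.4f}")
```

On these assumptions the job works out to about 34 million pages per dollar, roughly 3 cents per million pages, and a little over 67,000 pages per second across the whole cluster.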

Businesses that rely on scale as their barrier to entry are under constant pressure, because almost anyone can now rent immense computational resources cheaply. Google, for instance, protects against this by making its sorting and ranking algorithms its USP, not the collection of data.