The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies.
ClueWeb09 is a 25-terabyte dataset of about 1 billion web pages crawled in January and February 2009. Pages were crawled in best-first order, ranked by the OPIC (On-line Page Importance Computation) metric. The crawl was started from about 28 million URLs that either
- had high OPIC values in a web graph produced from an earlier 200 million page crawl, or
- were ranked highly by a commercial search engine for one of 4,000 sample queries in one of 10 languages.
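As a rough illustration of what best-first crawling with an OPIC-style score means, the sketch below runs a simplified version on a small in-memory link graph. It is not the crawler used to build ClueWeb09. The function name `opic_best_first_crawl`, the `graph` dictionary, and the tie-breaking rule are assumptions made for this example. The simplification is that each seed starts with an equal share of "cash", the page holding the most cash is fetched next, and its cash is split equally among its outlinks. Full OPIC also keeps a per-page cash history and revisits pages, which is omitted here.

```python
from typing import Dict, List


def opic_best_first_crawl(
    graph: Dict[str, List[str]], seeds: List[str], budget: int
) -> List[str]:
    """Return the fetch order of a simplified OPIC-driven best-first crawl.

    graph  : hypothetical link graph, mapping each URL to its outlinks
    seeds  : starting URLs, each given an equal share of the initial cash
    budget : maximum number of pages to fetch
    """
    # Cash currently held by discovered pages that have not been fetched yet.
    frontier = {url: 1.0 / len(seeds) for url in seeds}
    crawled: List[str] = []
    seen = set()

    while frontier and len(crawled) < budget:
        # Best-first step: fetch the page with the most cash.
        # Ties are broken alphabetically so the order is deterministic.
        url = min(frontier, key=lambda u: (-frontier[u], u))
        cash = frontier.pop(url)
        crawled.append(url)
        seen.add(url)

        outlinks = graph.get(url, [])
        if not outlinks:
            continue
        share = cash / len(outlinks)
        for target in outlinks:
            # Simplification: cash sent to already-fetched pages is dropped,
            # because this sketch never revisits a page.
            if target not in seen:
                frontier[target] = frontier.get(target, 0.0) + share

    return crawled
```

In this sketch a page that many already-fetched pages link to accumulates cash and moves toward the front of the queue, which is the intuition behind using an importance estimate to order a crawl.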
This dataset covers web content in English, Chinese, Spanish, Japanese, French, German, Arabic, Portuguese, Korean, and Italian. The dataset is used by several tracks of the TREC conference.