Documents in the CLEF-IP Corpus
Format and Content
The documents in the patent collection are stored as XML files. They are derived from European Patent Office (EPO) data and have mixed content in English, German and French.
The files contain bibliographic data as well as descriptive text. The XML files are quite comprehensive, containing detailed information on inventors, assignees, priority dates etc. From the variety of information in the XML files, these are the elements you should start to look at:
Number of Documents
2009: 1.9 million patent documents, corresponding to approximately 1 million individual patents filed between 1985 and 2000.
2010: 2.6 million patent documents, corresponding to approximately 1.3 million individual patents published until 2001.
2011: All EPO documents with an application date earlier than 2002 (more than 2.5 million patent documents, constituting more than 1 million patents). In addition, for Euro-PCT applications, we also added the corresponding patent documents published by the WIPO (more than 400,000 documents).
Tasks and Topics
In 2009 there was only one kind of task: find documents that constitute prior art. 10,000 topics were made available, and participants could choose to submit experiments using subsets of this largest topic set. Accepted subsets had to contain results for the first 500, 1,000, or 5,000 topics of the complete set.
The language of the topic documents was not restricted. The 2009 track also offered optional language tasks for English, German and French, in which the topics had textual content in only one of the three languages.
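The subset rule for the 2009 topic set can be illustrated with a small check. This is only a sketch: the function name, the topic ID strings and the assumption that the full set of 10,000 topics is itself an accepted size are hypothetical and not taken from the track guidelines.

```python
# Accepted result-set sizes: the first 500, 1,000 or 5,000 topics.
# Assumption: submitting the complete set of 10,000 topics is also valid.
ACCEPTED_SUBSET_SIZES = (500, 1000, 5000, 10000)


def is_accepted_subset(all_topics, submitted_topics):
    """Return True if submitted_topics is exactly the first N topics
    of all_topics for one of the accepted values of N."""
    n = len(submitted_topics)
    if n not in ACCEPTED_SUBSET_SIZES:
        return False
    # The subset must be a prefix of the complete, ordered topic list.
    return list(submitted_topics) == list(all_topics[:n])


# Hypothetical topic identifiers, used only for illustration.
all_topics = ["topic-%05d" % i for i in range(1, 10001)]
print(is_accepted_subset(all_topics, all_topics[:500]))    # first 500 topics
print(is_accepted_subset(all_topics, all_topics[1:501]))   # 500 topics, but not the first 500
```

The check treats a subset as a prefix of the ordered topic list, which is one reading of "the first 500, 1,000, or 5,000 topics".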
In 2010, two kinds of tasks were available:
- Prior Art Candidate Search Task: find patent documents that are likely to constitute prior art to a given patent application.
- Classification Task: classify a given patent document according to the IPC.
Both tasks contained 2,000 topics. Participants in the Prior Art task were allowed to submit results for a smaller set of 500 topics.
There are four tasks in the 2011 track:
- Prior Art Candidate Search: Find patent documents that are likely to constitute prior art to a given patent application.
- Classification: Classify a given patent document according to the IPC system, up to the subclass level. A new optional sub-task is to classify a given patent document up to the group/subgroup level, when the subclass is given.
- New: Image-based Patent Retrieval: Find patent documents relevant to a given patent document containing images.
- New: Image-based Classification: Categorize given patent images into pre-defined categories of images (such as graph, flowchart, drawing, etc.).
Obtaining Relevance Judgements
Relevance judgements are produced by an automatic method using patent citations from seed patents.
In 2009, for a small number of queries, (pooled) search results were reviewed by Intellectual Property experts.
Document vs. Patent IDs
In 2009, relevance was measured at the patent level, not at the patent-document level. That is, a relevant item is a patent, not a patent file.
A patent is identified by its patent ID. This means that a valid result is of the form EP0383071 rather than EP0383071-B1.xml or EP-0383071-B1 (which are document IDs). Note that patent-level relevance can be applied to EP patents because the patent number/ID appears in every patent document in the data set and identifies a patent uniquely. This may not be the case for publications from other patent offices, a typical example being the USPTO.
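The mapping from document IDs to patent IDs can be sketched as follows. The regular expression is an assumption based only on the three example forms given above (EP0383071, EP0383071-B1.xml, EP-0383071-B1); the function names are hypothetical, and other ID layouts in the corpus may need a different pattern.

```python
import re

# Assumption: an EP document ID is "EP", an optional hyphen, the patent
# number, an optional kind code such as "-B1", and an optional ".xml".
_EP_DOC_ID = re.compile(r"^(EP)-?(\d+)(?:-?[A-Z]\d?)?(?:\.xml)?$")


def to_patent_id(doc_id):
    """Reduce an EP document ID or file name to its patent ID."""
    match = _EP_DOC_ID.match(doc_id.strip())
    if match is None:
        raise ValueError("not a recognised EP document ID: %r" % doc_id)
    return match.group(1) + match.group(2)


def collapse_to_patents(ranked_doc_ids):
    """Map a ranked list of document IDs to patent IDs, keeping only
    the first (highest-ranked) occurrence of each patent."""
    seen = set()
    patents = []
    for doc_id in ranked_doc_ids:
        patent_id = to_patent_id(doc_id)
        if patent_id not in seen:
            seen.add(patent_id)
            patents.append(patent_id)
    return patents


print(to_patent_id("EP0383071-B1.xml"))
print(collapse_to_patents(["EP-0383071-A2", "EP-0383071-B1", "EP-1000000-A1"]))
```

Collapsing a ranked list this way keeps one entry per patent, which matches a patent-level evaluation where several documents of the same patent count as a single result.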
In 2010, relevance was measured at the document level; the same procedure is planned for 2011.