CLEF-IP '09

Evaluating Information Retrieval techniques in the Intellectual Property domain.

The CLEF-IP track was launched in 2009 to investigate IR techniques for patent retrieval. It is part of the CLEF 2009 evaluation campaign.

The track utilizes a data collection of more than 1M patent documents derived from EPO sources, covering English, French, and German patents with at least 100,000 documents in each language.

There are two kinds of tasks in the track:

  • The main task is to find patent documents that constitute prior art to a given patent.
  • Three optional language subtasks use parallel monolingual queries in English, German, and French; their goal is to evaluate the impact of the query language on retrieval effectiveness.

The CLEF-IP collection, together with the training data, relevance judgements and accompanying documents, is available here.
 

How to access the CLEF-IP'09 data collection

In order to receive access to the CLEF-IP'09 data please send us a signed License Agreement by email to clef-ip-owner@ir-facility.org and by post to:
Information Retrieval Facility
Tech Gate Vienna, Donau City Straße 1
1220 Vienna
Austria

 

Working notes CLEF-IP'09 workshop

The working notes of the CLEF-IP'09 workshop in Corfu are available on-line on the CLEF campaign homepage.

Co-ordinators

John Tait (Information Retrieval Facility, AT)
Giovanna Roda (Matrixware, AT)

Advisory Board

Gianni Amati (Fondazione Ugo Bordoni, IT)
Atsushi Fujii (University of Tsukuba, JP)
Makoto Iwayama (Tokyo Institute of Technology, JP)
Kalervo Jarvelin (University of Tampere, FI)
Noriko Kando (honorary advisor, Keio University, JP)
Mark Sanderson (University of Sheffield, UK)
Henk Thomas (IP services, NL)
Christa Womser-Hacker (University of Hildesheim, DE)

Organization Committee

Florina Piroi (Information Retrieval Facility, AT)
Giovanna Roda (Matrixware, AT)
Veronika Zenz (Matrixware, AT)

Participants

  • Technical University of Darmstadt, Dept. of Computer Science, Ubiquitous Knowledge Processing Lab, DE
  • Université de Neuchâtel - Computer Science, CH
  • Universidad de Santiago de Compostela - Dept. Electronica y Computacion, ES
  • University of Tampere - Info Studies & Interactive Media and Swedish Institute of Computer Science, FI/SE
  • Glasgow University - IR Group, UK
  • Geneva University - Centre Universitaire d'Informatique, CH
  • Centrum Wiskunde & Informatica - Interactive Information Access, NL
  • Geneva University Hospitals - Service of Medical Informatics, CH
  • INRIA & Humboldt University - Dept. of German Language and Linguistics, FR/DE
  • Dublin City University - School of Computing, IE
  • Radboud University Nijmegen - Centre for Language Studies & Speech Technologies, NL
  • Hildesheim University - Institute of Information Systems & Natural Language Processing, DE
  • Technical University Valencia - Natural Language Engineering, ES
  • Al. I. Cuza University of Iasi - Natural Language Processing, RO


Data Collection

Number of documents

The data collection of the CLEF-IP track comprises 1.9 million patent documents in English, French and German from the European Patent Office. The documents in this collection correspond to approximately 1 million individual patents filed between 1985 and 2000.

For an overview of the particularities of patent test collections and an introduction to the patent domain, we recommend reading "A methodology for building a patent test collection for prior art search" by Erik Graf and Leif Azzopardi. The NTCIR online proceedings are also a great resource for patent information retrieval.

Document format and content

The documents in the patent collection are stored as XML files conforming to the "Alexandria XML" DTD.


The files contain bibliographic data as well as descriptive text. The XML files are quite comprehensive, containing detailed information on inventors, assignees, priority dates, etc. Among this variety of information, these are the elements to start with (a short parsing sketch follows the list):

  • invention-title - /patent-document/bibliographic-data/technical-data/invention-title
  • classifications-ipcr - /patent-document/bibliographic-data/technical-data/classifications-ipcr
  • abstract - /patent-document/abstract
  • description - /patent-document/description
  • claims - /patent-document/claims
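To make these paths concrete, here is a minimal parsing sketch in Python using the standard-library ElementTree. The element paths and the country/doc-number attributes follow this page; the file name and the dictionary keys are illustrative only, and the sketch assumes the documents carry no XML namespaces.

import xml.etree.ElementTree as ET

def parse_patent_document(path):
    """Return the main text fields of one Alexandria XML patent document."""
    root = ET.parse(path).getroot()  # the <patent-document> element

    def text_of(relative_path):
        # Concatenate the text of all matching elements (there can be several,
        # e.g. one invention-title per language).
        return " ".join(" ".join(el.itertext()).strip()
                        for el in root.findall(relative_path))

    return {
        # Patent-level id: country + doc-number attributes (see "Document IDs" below).
        "doc_id": (root.get("country") or "") + (root.get("doc-number") or ""),
        "title": text_of("bibliographic-data/technical-data/invention-title"),
        "ipcr": text_of("bibliographic-data/technical-data/classifications-ipcr"),
        "abstract": text_of("abstract"),
        "description": text_of("description"),
        "claims": text_of("claims"),
    }

# Illustrative file name; any document of the collection is read the same way.
fields = parse_patent_document("EP-0383071-B1.xml")
print(fields["doc_id"], fields["title"])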

Document Statistics

Figure: patent documents and patents per year of application filing.

The IPC classification is a multi-classification system, so each document can belong to more than one class.

Figure: distribution of documents per language and per IPC class.

Figure: the 15 most frequent IPC classes in the corpus and the number of documents in these classes.


Task and Topics

The Task

Find documents that constitute prior art.

Query Topics

Each topic of the CLEF-IP track has the following format:

<PATENT>
<NUM>patentNumber</NUM>
<NARR>A relevant document is a patent which constitutes prior art for patent number patentNumber </NARR>
<DESC taskType=task>
filename.xml 
</DESC>
</PATENT>

The patentNumber has the form "EP" followed by seven digits.

The task can be either "Main" (for the mandatory topic set), or one of "EN", "DE", "FR" (for the optional topic set).

Finally, filename.xml is a patent document in the Alexandria XML format which constitutes the content of the topic with number patentNumber. The content and the name of this file depend on the task type and on the patentNumber (a small topic-parsing sketch follows the list below):

  • For the "Main" task it is a full patent document, where missing patent descriptions have been added automatically. It is up to the participants which fields of the patent topic they use to generate a query.
  • For the language tasks ("EN", "DE", or "FR"), filename.xml contains a subset of the elements in the Alexandria XML format: invention-title and claims. For each language a separate topic file is delivered, containing the title and claims of the patent topic in that language only.
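As an illustration, a minimal sketch of reading a topic and building a text query from it, assuming the topic files are well-formed XML in the format shown above. The topic file name and the choice of fields are illustrative only; parse_patent_document refers to the parsing sketch in the Data Collection section.

import xml.etree.ElementTree as ET

def read_topic(topic_path):
    """Read one CLEF-IP topic and return its parts."""
    root = ET.parse(topic_path).getroot()            # the <PATENT> element
    return {
        "number": root.findtext("NUM").strip(),      # EP followed by seven digits
        "narrative": root.findtext("NARR").strip(),
        "task": root.find("DESC").get("taskType"),   # Main, EN, DE or FR
        "doc_file": root.find("DESC").text.strip(),  # filename.xml
    }

# Illustrative topic file name.
topic = read_topic("clefip-topic-0001.xml")
patent = parse_patent_document(topic["doc_file"])

# One possible query: title plus claims. For the language tasks only these
# fields are present anyway; for the Main task any fields may be used.
query_text = " ".join([patent["title"], patent["claims"]])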

Submission format

For all tasks, a submission consists of a single ASCII text file with at most 1000 lines per topic, in the standard format used for most TREC submissions. A relevant item is the document id of a patent, not a patent file: if two files belonging to the same patent are considered relevant, the patent must be returned only once. When two patent documents belonging to the same patent are retrieved, it is up to the participants to decide at which position in the ranked list the patent is returned. See the track guidelines for more details on the format of the submission files.
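As a sketch only, the snippet below writes such a run file in Python, collapsing multiple documents of the same patent to the patent's best-ranked position and truncating to 1000 results per topic. The function name, the run tag and the first-occurrence policy are illustrative choices, not prescriptions of the track.

def write_run(results, out_path, run_tag="group-run1", max_per_topic=1000):
    """Write a TREC-style run file.
    results maps a topic id to a list of (patent_id, score) pairs,
    already sorted by decreasing score."""
    with open(out_path, "w") as out:
        for topic_id, ranked in results.items():
            seen = set()
            rank = 0
            for patent_id, score in ranked:
                if patent_id in seen:      # a second document of the same patent:
                    continue               # keep only the best-ranked occurrence
                seen.add(patent_id)
                rank += 1
                if rank > max_per_topic:   # at most 1000 result lines per topic
                    break
                out.write(f"{topic_id} Q0 {patent_id} {rank} {score:.4f} {run_tag}\n")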


Relevance Assessments

Document IDs

Relevance is measured at the patent level, not at the patent-document level: a relevant item is a patent, not a patent file. A patent is identified by its document id.

This means that a valid result is of the form EP0383071 rather than EP0383071-B1.xml or EP-0383071-B1. Assemble the document id by concatenating the country and doc-number attributes of the patent-document element in the data. This patent number appears in every patent document in the data set and identifies a patent uniquely.
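The country and doc-number attributes are the authoritative source. As a convenience, a small hypothetical helper can also normalize file names of the forms shown above; this is a sketch, assuming all relevant patent numbers are EP followed by seven digits.

import re

def patent_id(name):
    """Normalize names such as 'EP-0383071-B1.xml' or 'EP0383071-B1'
    to the patent-level id expected in result files, e.g. 'EP0383071'."""
    match = re.match(r"EP-?(\d{7})", name)
    if match is None:
        raise ValueError("unrecognized patent document name: " + name)
    return "EP" + match.group(1)

assert patent_id("EP-0383071-B1.xml") == "EP0383071"
assert patent_id("EP0383071-B1") == "EP0383071"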

Obtaining Relevance Judgements

Relevance judgements are produced by two methods. The first is an automatic method that uses patent citations from seed patents. The second uses a small number of queries whose search results will be reviewed by intellectual property experts.

We will primarily report results for retrieval across all three languages. In 2009 we will stick to the Cranfield evaluation model; in subsequent years we expect to offer refined retrieval process models and assessment tools.


Training Data

 

Two sets of training data are available for this track. The small training set contains a list of 5 topic patents together with their relevance assessments, while the large training set contains 500 topics for the main task and 100 topics for each of the three language tasks, together with their relevance assessments.

The criteria for choosing the topic patents in the large training set are that

  • there is a full text description;
  • claims in German, English and French are available;
  • the topic candidate has at least three citations to patents in the corpus;
  • and at least one citation is highly relevant.

Download the guidelines for both training sets.

 

Results

ID | Institution | Tasks | Size | Runs
TUD | Technical University of Darmstadt, Dept. of Computer Science, Ubiquitous Knowledge Processing Lab, Germany | Main, EN, DE, FR | S(4), M(4), L(4), XL(4) | 16
UniNE | Univ. Neuchatel - Computer Science, Switzerland | Main | S(7), XL(1) | 8
uscom | Santiago de Compostela Univ. - Dept. Electronica y Computacion, Spain | Main | S(8) | 8
UTASICS | University of Tampere - Info Studies & Interactive Media and Swedish Institute of Computer Science, Finland/Sweden | Main | XL(8) | 8
clefip-ug | Glasgow Univ. - IR Group, Great Britain | Main | M(4), XL(1) | 5
clefip-unige | Geneva Univ. - Centre Universitaire d'Informatique, Switzerland | Main | XL(5) | 5
cwi | Centrum Wiskunde & Informatica - Interactive Information Access, Netherlands | Main | M(1), XL(4) | 4
hcuge | Geneva Univ. Hospitals - Service of Medical Informatics, Switzerland | Main, EN, DE, FR | M(3), XL(1) | 4
humb | Humboldt Univ. - Dept. of German Language and Linguistics, Germany | Main, EN, DE, FR | XL(4) | 4
clefip-dcu | Dublin City Univ. - School of Computing, Ireland | Main | XL(3) | 3
clefip-run | Radboud Univ. Nijmegen - Centre for Language Studies & Speech Technologies, Netherlands | Main, EN | S(2) | 2
Hildesheim | Hildesheim Univ. - Institute of Information Systems & Natural Language Processing, Germany | Main | S(1) | 1
NLEL | Technical Univ. Valencia - Natural Language Engineering, Spain | Main | S(1) | 1
UAIC | Al. I. Cuza University of Iasi - Natural Language Processing, Romania | EN | S(1) | 1
Total: 70

(The Size column gives the number of runs submitted for each topic set size S, M, L, XL.)


Evaluation

For each experiment we computed 10 standard IR measures:

  • Precision, Precision@5, Precision@10, Precision@100
  • Recall, Recall@5, Recall@10, Recall@100
  • MAP
  • nDCG (with a reduction factor given by a logarithm in base 10)

All computations were done with SOIRE, an IR evaluation software based on a service-oriented architecture, and double-checked against trec_eval, the standard evaluation program used in the TREC evaluation campaigns.
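For orientation, here is a per-topic computation sketch of these measures in Python, assuming binary relevance. The nDCG discount shown is one common log base 10 formulation and may not match SOIRE's exact definition, so official figures should come from SOIRE or trec_eval.

import math

def evaluate_topic(ranked, relevant, cutoffs=(5, 10, 100)):
    """Per-topic measures with binary relevance.
    ranked:   retrieved patent ids in rank order
    relevant: set of relevant patent ids from the assessments"""
    hits = [1 if doc in relevant else 0 for doc in ranked]
    n_rel = len(relevant)

    def precision(k):
        return sum(hits[:k]) / k if k else 0.0

    def recall(k):
        return sum(hits[:k]) / n_rel if n_rel else 0.0

    measures = {"P": precision(len(ranked)), "R": recall(len(ranked))}
    for k in cutoffs:
        measures[f"P@{k}"] = precision(k)
        measures[f"R@{k}"] = recall(k)

    # Average precision for this topic; MAP is its mean over all topics.
    measures["AP"] = (sum(precision(i + 1) for i, h in enumerate(hits) if h) / n_rel
                      if n_rel else 0.0)

    # nDCG with a log base 10 discount (one common formulation).
    def dcg(gains):
        return sum(g / math.log10(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal = [1] * min(n_rel, len(ranked))
    measures["nDCG"] = dcg(hits) / dcg(ideal) if ideal else 0.0
    return measures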

All evaluation results can be found in the CLEF-IP 2009 Evaluation Summary.

Methodology

Relevant Bibliography

  • CLEFIP @ CiteULike - a comprehensive collaborative collection of literature related to CLEF-IP

  • Knowledge Base - a tutorial on "IR 4 IP" and corresponding glossaries for IP and IR

  • CLEF Campaign - main website of the Cross Language Evaluation Forum

Selected Resources

  • J. Michel. Considerations, challenges and methodologies for implementing best practices in patent office and like patent information departments. World Patent Information 28:132-135, 2006.

  • Sougata Mukherjea, Bhuvan Bamba. BioPatentMiner: an information retrieval system for biomedical patents. In VLDB '04: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 1066-1077, 2004.

  • Jae-Ho Kim, Key-Sun Choi. Patent document categorization based on semantic structural information. Information Processing & Management 43, 2007.

  • T. Takaki, A. Fujii, T. Ishikawa. Associative Document Retrieval by Query Subtopic Analysis and its Application to Invalidity Patent Search. In Proc. of CIKM, 2004.

  • H. Mase, T. Matsubayashi, Y. Ogawa, M. Iwayama, T. Oshio. Proposal for Two-Stage Patent Retrieval Method Considering the Claim Structure. ACM Transactions on Asian Language Information Processing 4(2), 2005.