Classification Task

This year there are two classification tasks on a set of 3000 topic documents: 1000 each in English, German and French. The topic files are named CLS<n>_<lang>_topics.txt, where <n> is the task number and <lang> is en, de or fr.
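The naming scheme above can be expanded mechanically. The following is a minimal sketch that lists the expected topic file names; the function name `topic_file_names` is an assumption made for illustration, and the sketch assumes that every task/language combination has its own file.

```python
from itertools import product

TASKS = (1, 2)                # task numbers <n>
LANGS = ("en", "de", "fr")    # language codes <lang>


def topic_file_names():
    """Return the topic file names implied by CLS<n>_<lang>_topics.txt."""
    return [f"CLS{n}_{lang}_topics.txt" for n, lang in product(TASKS, LANGS)]


if __name__ == "__main__":
    for name in topic_file_names():
        print(name)
```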

Important note: The classification system used by the whole CLEF-IP collection is IPC-R and later. The results, too, should be IPC-R codes.


Classification Task 1

The goal of Classification Task 1 (CLS1) is to classify a given topic document according to the International Patent Classification (IPC) system at subclass level.

The topic structure is as follows (example):

   <narr>Classify patent document EP-1469052-A1 according
         to the IPC system.</narr>

where <num> contains the unique topic identifier consisting of the prefix CLS1_ and the patent number, which itself contains a country code (always EP in this data set), a seven-digit number and the kind code (A1, A2). The <file> tag contains the name of the XML file.
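As an illustration of the identifier layout just described, here is a minimal sketch that splits a CLS1 topic identifier into its parts. The function name and the returned dictionary keys are assumptions made for this example; the pattern accepts only the country code EP and the kind codes A1 and A2 mentioned above.

```python
import re

# Prefix CLS1_, country code EP, seven-digit number, kind code A1 or A2.
CLS1_TOPIC_RE = re.compile(
    r"^CLS1_(?P<country>EP)-(?P<number>\d{7})-(?P<kind>A1|A2)$"
)


def parse_cls1_topic_id(topic_id):
    """Split a CLS1 topic identifier into country, number and kind code."""
    match = CLS1_TOPIC_RE.match(topic_id)
    if match is None:
        raise ValueError(f"not a CLS1 topic identifier: {topic_id!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_cls1_topic_id("CLS1_EP-1469052-A1"))
```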


Refined Classification Task 2

The goal of Refined Classification Task 2 (CLS2) is to classify a given topic document, for which the subclass is already given, at subgroup level.

The topic structure is similar to Task 1 (example):

   <narr>Classify patent document EP-1674081-A1 classified in
         subclass A61K into subgroup.</narr>

Here, the topic identifier additionally contains the given Subclass, since one document can have multiple classifications, i.e. it can fall into multiple subclasses.
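A minimal sketch of how a CLS2 topic identifier could be split into the patent number and the given subclass. It assumes the layout CLS2_<patent>_<subclass> seen in the submission examples further down, and it assumes that a subclass symbol is a section letter A to H, two digits and one letter; the function name is hypothetical.

```python
import re

# Assumed layout: CLS2_<patent number>_<subclass>, e.g. CLS2_EP-9999999-A1_A61K.
CLS2_TOPIC_RE = re.compile(
    r"^CLS2_(?P<patent>EP-\d{7}-A[12])_(?P<subclass>[A-H]\d{2}[A-Z])$"
)


def split_cls2_topic_id(topic_id):
    """Return (patent number, subclass) for a CLS2 topic identifier."""
    match = CLS2_TOPIC_RE.match(topic_id)
    if match is None:
        raise ValueError(f"not a CLS2 topic identifier: {topic_id!r}")
    return match.group("patent"), match.group("subclass")


if __name__ == "__main__":
    print(split_cls2_topic_id("CLS2_EP-9999999-A1_A61K"))
```

Because one patent can fall into several subclasses, the same patent number may appear in several topic identifiers, each with a different subclass suffix.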

The total number of topics for the second classification task is 4934.

Training Set

For both classification tasks there is no specially created training set. The participants can use the whole data corpus for training their classifiers.

Optionally, participants can submit runs for which only patents in the corpus with a publication date after 1995 (1995 itself excluded) were used to train their classifiers.

Submission Formats

A submission to these tasks consists of an ASCII file similar to the TREC submission format. For each of the two classification tasks we require two submission formats per run, one with the extension runP and the other with the extension runC.

Example of submission files:

Classification Task 1, file extension runP, maximum 5 entries per topic

CLS1_EP-9999999-A1  Q0  A20K  1 3010
CLS1_EP-9999999-A1  Q0  A20K  2 3008
CLS1_EP-9999999-A1  Q0  A20K  3 2985
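The runP lines above can be produced from a list of scored predictions. The following is a minimal sketch, assuming the columns are topic identifier, the literal Q0, classification code, rank and score, and assuming that a higher score means a better match (as the decreasing scores in the example suggest). The function name `format_runp` and the code B01D in the usage are illustrative only.

```python
def format_runp(topic_id, scored_codes, max_entries=5):
    """Format (code, score) pairs as runP lines for one topic.

    Entries are sorted by descending score, cut off at max_entries,
    and ranked starting from 1.
    """
    ranked = sorted(scored_codes, key=lambda cs: cs[1], reverse=True)
    ranked = ranked[:max_entries]
    return [
        " ".join([topic_id, "Q0", code, str(rank), str(score)])
        for rank, (code, score) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    predictions = [("B01D", 2985), ("A20K", 3010)]
    for line in format_runp("CLS1_EP-9999999-A1", predictions):
        print(line)
```

For Task 2 the same function could be called with `max_entries=20`.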

Classification Task 1, file extension runC

A20K   CLS1_EP-9999999-A1 3010
A20K   CLS1_EP-8888888-A1 3008
A20K   CLS1_EP-7777777-A1 2956

Refined Classification Task 2, file extension runP, maximum 20 entries per topic

CLS2_EP-9999999-A1_A61K  Q0  A61K9/16  1  3000
CLS2_EP-9999999-A1_A61K  Q0  A61K9/50  2  2980
CLS2_EP-9999999-A1_A61K  Q0  A61K9/38  3  2910

Refined Classification Task 2, file extension runC

A61K9/16  CLS2_EP-9999999-A1  3000
A61K9/16  CLS2_EP-8888888-A1  2905
A61K9/16  CLS2_EP-7777777-A1  2810
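The examples suggest that runP lists classification codes per topic, while runC lists topics per classification code; the text does not state this explicitly, so the following is only a sketch under that assumption. It regroups runP lines by code and orders the topics within each code by descending score. The function name is hypothetical, and topic identifiers are copied unchanged (whether a CLS2 identifier keeps its subclass suffix in runC is not settled by the examples).

```python
from collections import defaultdict


def runp_to_runc(runp_lines):
    """Regroup runP lines (topic Q0 code rank score) as runC lines (code topic score)."""
    by_code = defaultdict(list)
    for line in runp_lines:
        if not line.strip():
            continue  # skip blank lines
        topic_id, _q0, code, _rank, score = line.split()
        by_code[code].append((topic_id, score))

    runc_lines = []
    for code in sorted(by_code):
        # Sort numerically by score, but keep the original score text.
        entries = sorted(by_code[code], key=lambda ts: float(ts[1]), reverse=True)
        for topic_id, score in entries:
            runc_lines.append(f"{code} {topic_id} {score}")
    return runc_lines


if __name__ == "__main__":
    runp = [
        "CLS1_EP-8888888-A1 Q0 A20K 1 3008",
        "CLS1_EP-9999999-A1 Q0 A20K 1 3010",
    ]
    print("\n".join(runp_to_runc(runp)))
```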

A run is uniquely identified by the participant id, the method used, the run id and the task type.



Important notes:

  • the IPC system used is post IPC-R (inclusive)
  • the number of entries in the submission files for CLS1.runP is 5 per topic
  • the number of entries in the submission files for CLS2.runP is 20 per topic
  • the maximum number of runs per task per participant is 8.
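Before submitting, the per-topic entry limits listed above can be checked automatically. This is a minimal sketch that treats the limits as upper bounds (5 for CLS1, 20 for CLS2) and assumes the first whitespace-separated field of each runP line is the topic identifier, starting with CLS1_ or CLS2_. The names `LIMITS` and `topics_over_limit` are made up for this example.

```python
from collections import Counter

# Entries allowed per topic in a runP file, keyed by task prefix.
LIMITS = {"CLS1": 5, "CLS2": 20}


def topics_over_limit(runp_lines):
    """Return the sorted topic identifiers that exceed their task's entry limit."""
    counts = Counter(line.split()[0] for line in runp_lines if line.strip())
    return sorted(
        topic_id
        for topic_id, n in counts.items()
        if n > LIMITS[topic_id.split("_", 1)[0]]
    )


if __name__ == "__main__":
    lines = [f"CLS1_EP-9999999-A1 Q0 A20K {r} {3000 - r}" for r in range(1, 7)]
    print(topics_over_limit(lines))
```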

The file submission procedure will be communicated at a later time.

CLEF-IP is supported by the
PROMISE Network of Excellence
(co-funded by the 7th Framework Programme of the European Commission)


The image tasks are supported by the IMPEx project
(funded by the Austrian Research Promotion Agency - FFG)

