This year there are two classification tasks on a set of 3000 topic documents: 1000 documents each in English, German and French. The topic files are named CLS<n>_<lang>_topics.txt, where <n> is the task number and <lang> is en, de or fr.
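As a small illustration of the naming scheme above, the following sketch lists the topic file names it implies. It assumes that every task has one file per language; the function name topic_file_names is hypothetical and not part of the task definition.

```python
from itertools import product

TASKS = (1, 2)               # CLS1 and CLS2
LANGS = ("en", "de", "fr")   # language codes named in the text


def topic_file_names():
    """Return the topic file names implied by CLS<n>_<lang>_topics.txt.

    Assumption: each task has one topic file per language.
    """
    return [f"CLS{n}_{lang}_topics.txt" for n, lang in product(TASKS, LANGS)]


if __name__ == "__main__":
    for name in topic_file_names():
        print(name)
```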
Important note: The whole CLEF-IP collection is classified according to IPC-R or later versions. Results should likewise be given as IPC-R codes.
Classification Task 1
The goal of Classification Task 1 (CLS1) is to classify a given topic document according to the International Patent Classification (IPC) system at subclass level.
The topic structure is as follows (example):
<narr>Classify patent document EP-1469052-A1 according
to the IPC system.</narr>
where <num> contains the unique topic identifier consisting of the prefix CLS1_ and the patent number, which itself contains a country code (always EP in this data set), a seven-digit number and the kind code (A1, A2). The <file> tag contains the name of the XML file.
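The identifier layout described above can be checked mechanically. The following is a minimal sketch, assuming the identifier has the form CLS1_EP-<seven digits>-<kind code> as in the submission examples further down; the names Cls1Topic and parse_cls1_id are hypothetical, and only the kind codes A1 and A2 mentioned in the text are accepted.

```python
import re
from typing import NamedTuple


class Cls1Topic(NamedTuple):
    country: str  # always "EP" in this data set
    number: str   # seven-digit patent number, kept as text
    kind: str     # kind code, "A1" or "A2"


# Assumed layout: CLS1_<country>-<7 digits>-<kind code>
_CLS1_ID = re.compile(r"CLS1_(EP)-(\d{7})-(A[12])")


def parse_cls1_id(topic_id: str) -> Cls1Topic:
    """Split a CLS1 topic identifier into its parts, or raise ValueError."""
    match = _CLS1_ID.fullmatch(topic_id)
    if match is None:
        raise ValueError(f"not a CLS1 topic identifier: {topic_id!r}")
    return Cls1Topic(*match.groups())


if __name__ == "__main__":
    print(parse_cls1_id("CLS1_EP-1469052-A1"))
```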
Refined Classification Task 2
The goal of Refined Classification Task 2 (CLS2) is to classify a given topic document, whose subclass is already given, at subgroup level.
The topic structure is similar to Task 1 (example):
<narr>Classify patent document EP-1674081-A1 classified in
subclass A61K into subgroup.</narr>
Here, the topic identifier additionally contains the given Subclass, since one document can have multiple classifications, i.e. it can fall into multiple subclasses.
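A CLS2 identifier can be split in the same way as a CLS1 identifier, with the subclass as an extra trailing part. This sketch assumes the form CLS2_EP-<seven digits>-<kind code>_<subclass>, as in the submission examples further down, and an IPC subclass written as a section letter A to H, two digits and a letter. The names Cls2Topic and parse_cls2_id are hypothetical.

```python
import re
from typing import NamedTuple


class Cls2Topic(NamedTuple):
    country: str   # always "EP" in this data set
    number: str    # seven-digit patent number, kept as text
    kind: str      # kind code, "A1" or "A2"
    subclass: str  # given IPC subclass, e.g. "A61K"


# Assumed layout: CLS2_<country>-<7 digits>-<kind code>_<subclass>
_CLS2_ID = re.compile(r"CLS2_(EP)-(\d{7})-(A[12])_([A-H]\d{2}[A-Z])")


def parse_cls2_id(topic_id: str) -> Cls2Topic:
    """Split a CLS2 topic identifier into its parts, or raise ValueError."""
    match = _CLS2_ID.fullmatch(topic_id)
    if match is None:
        raise ValueError(f"not a CLS2 topic identifier: {topic_id!r}")
    return Cls2Topic(*match.groups())


if __name__ == "__main__":
    print(parse_cls2_id("CLS2_EP-1674081-A1_A61K"))
```

Because one document can fall into several subclasses, the same patent number may appear in several CLS2 topics that differ only in the subclass part.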
The total number of topics for the second classification task is 4934.
For both classification tasks there is no specially created training set. The participants can use the whole data corpus for training their classifiers.
Optionally, participants can submit runs in which only patents in the corpus with a publication date after 1995 (exclusive) were used to train the classifiers.
A submission to these tasks consists of an ASCII file similar to the TREC submission format. For each of the two classification tasks we require two submission formats per run, one with the extension runP and one with the extension runC.
Example of submission files:
Classification Task 1, file extension runP, maximum 5 entries per topic
CLS1_EP-9999999-A1 Q0 A20K 1 3010
CLS1_EP-9999999-A1 Q0 A20K 2 3008
CLS1_EP-9999999-A1 Q0 A20K 3 2985
Classification Task 1, file extension runC
A20K CLS1_EP-9999999-A1 3010
A20K CLS1_EP-8888888-A1 3008
A20K CLS1_EP-7777777-A1 2956
Refined Classification Task 2, file extension runP, maximum 20 entries per topic
CLS2_EP-9999999-A1_A61K Q0 A61K9/16 1 3000
CLS2_EP-9999999-A1_A61K Q0 A61K9/50 2 2980
CLS2_EP-9999999-A1_A61K Q0 A61K9/38 3 2910
Refined Classification Task 2, file extension runC
A61K9/16 CLS2_EP-9999999-A1 3000
A61K9/16 CLS2_EP-8888888-A1 2905
A61K9/16 CLS2_EP-7777777-A1 2810
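The line layouts in the examples above can be produced with a few lines of code. This is a minimal sketch, not an official tool: the column meanings (topic, the literal Q0, IPC code, rank, score for runP; IPC code, topic, score for runC) are inferred from the examples, sorting by descending score is an assumption, and the function names runp_lines and runc_lines are hypothetical.

```python
def runp_lines(topic_id, scored_codes, max_entries):
    """Format runP lines for one topic: '<topic> Q0 <code> <rank> <score>'.

    scored_codes is a list of (ipc_code, score) pairs. Entries are ranked
    by descending score (an assumption) and cut off at max_entries.
    """
    ranked = sorted(scored_codes, key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:max_entries]
    return [
        f"{topic_id} Q0 {code} {rank} {score}"
        for rank, (code, score) in enumerate(ranked, start=1)
    ]


def runc_lines(ipc_code, scored_topics):
    """Format runC lines for one IPC code: '<code> <topic> <score>'.

    scored_topics is a list of (topic_id, score) pairs, listed by
    descending score (an assumption).
    """
    ranked = sorted(scored_topics, key=lambda pair: pair[1], reverse=True)
    return [f"{ipc_code} {topic} {score}" for topic, score in ranked]


if __name__ == "__main__":
    codes = [("A61K9/50", 2980), ("A61K9/16", 3000), ("A61K9/38", 2910)]
    print("\n".join(runp_lines("CLS2_EP-9999999-A1_A61K", codes, 20)))
```

The topic identifier is passed through unchanged, so the caller decides which form of the identifier goes into each file.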
A run is uniquely identified by the participant id, the method used, the run id and the task type:
- the IPC system used is IPC-R or later
- the number of entries in the submission files for CLS1.runP is at most 5 per topic
- the number of entries in the submission files for CLS2.runP is at most 20 per topic
- the maximum number of runs per task for a participant is 8.
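The per-topic limits above lend themselves to a quick check before submitting. The sketch below assumes whitespace-separated runP lines with five fields, as in the examples, and treats 5 and 20 as upper bounds. The function name check_runp and the wording of the messages are hypothetical, and this is not an official validator.

```python
from collections import Counter

# Maximum number of runP entries per topic, as stated for each task.
LIMITS = {"CLS1": 5, "CLS2": 20}


def check_runp(lines, task):
    """Return a list of problems found in runP lines for the given task."""
    problems = []
    per_topic = Counter()
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 5:
            problems.append(f"line {lineno}: expected 5 fields, got {len(fields)}")
            continue
        topic = fields[0]
        if not topic.startswith(task + "_"):
            problems.append(f"line {lineno}: topic {topic} does not belong to {task}")
        per_topic[topic] += 1
    for topic, count in per_topic.items():
        if count > LIMITS[task]:
            problems.append(f"{topic}: {count} entries, limit is {LIMITS[task]}")
    return problems


if __name__ == "__main__":
    sample = [
        "CLS1_EP-9999999-A1 Q0 A20K 1 3010",
        "CLS1_EP-9999999-A1 Q0 A20K 2 3008",
    ]
    print(check_runp(sample, "CLS1"))
```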
The file submission procedure will be communicated at a later time.
CLEF-IP is supported by the PROMISE Network of Excellence (co-funded by the 7th Framework Programme of the European Commission).
The image tasks are supported by the IMPEx project (funded by the Austrian Research Promotion Agency - FFG).