How to construct a test collection

Author: Erik Graf
Date: 2008/11/20

Introduction:

Following discussions at the IRFS 2008 I was asked to write a short summary of the work that has been done here at the University of Glasgow on creating a patent test collection based on inferring relevance assessments from references found on patents. In the following I will provide a short analysis of the viability of this approach, based on information derived from the corpus of patents issued by the European Patent Office. The first section discusses the challenge of creating relevance assessments for a patent test collection. The second section outlines key properties of the EP patent corpus. The third section outlines potential tasks that can be realized using inferred relevance assessments, followed by a short summary.

Relevance Assessments:

The creation of relevance assessments usually forms the most time-consuming part of building a test collection. This is especially so for patents, since judging a document's relevance requires judicial as well as technological expertise. There are essentially three ways of generating relevance assessments for test collections: human assessment, inferring relevance assessments, and simulating relevance assessments. I will briefly outline the first and second, based on the work we conducted and on discussions led at IRFS.

Human Assessment: The only publicly available large-scale patent test collections known to me that applied human assessment are the NTCIR 3 and 4 test collections. NTCIR 3 conducted human assessment for the technology survey task, where participants were asked to retrieve relevant patents with respect to a newspaper article. Details are available in “Evaluation of Patent Retrieval at NTCIR and its Implications for better IP Services.pdf”, pages 10-15, available at: http://www.ir-facility.org/symposium/irf-symposium-2008/videos-and-presentations . In summary, they applied pooling with a cut-off at 30 documents. It is noteworthy, and was remarked by Henk Tomas during the symposium, that the number of relevant documents for this task is comparatively high when compared to other patent retrieval tasks. This is favourable for applying pooled assessments. Applying human assessment to core patent tasks such as “invalidity search” or “prior art search” is potentially more time-intensive, mainly due to the very low number of relevant documents that can be expected in the corpus.
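The pooling procedure mentioned above can be sketched as follows. This is a minimal illustration of merging the top-ranked documents of each submitted run into a single de-duplicated pool for human assessment; the function and data names are hypothetical, not NTCIR's actual tooling.

```python
def build_pool(runs, cutoff=30):
    """Merge the top-`cutoff` documents of every submitted run into
    a single de-duplicated pool handed to the human assessors."""
    pool = set()
    for ranked_docs in runs:          # each run: doc IDs sorted by score
        pool.update(ranked_docs[:cutoff])
    return sorted(pool)

# Illustrative example: three systems' rankings for one topic.
run_a = ["EP001", "EP002", "EP003"]
run_b = ["EP002", "EP004", "EP005"]
run_c = ["EP001", "EP006", "EP007"]

# Only documents ranked in some system's top 2 enter the pool.
pool = build_pool([run_a, run_b, run_c], cutoff=2)
```

Documents outside every system's top ranks are never assessed, which is why a task with many relevant documents per topic (as in the technology survey task) is more forgiving of a low cut-off than one with very few.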

The NTCIR organizers mention (see: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/PATENT/NTCIR5-OV-PATENT-FujiiA.pdf) that for the assessment of the invalidity task participants were supplied with training topics using relevance assessments inferred from references. Thirty submitted runs utilizing these training topics were pooled and assessed. It is noteworthy that NTCIR abandoned human assessment for the subsequent invalidity tasks run at NTCIR 5 and 6. The creation of relevance assessments for these tasks was based on the inference technique described in the next subsection.

Inferred Relevance Assessment: For our work we analysed the viability of building a test collection based on EP documents by utilizing the references found on patents as the basis for relevance assessments for the task of prior art search. The technique of interpreting references on patents as relevance assessments is not new and has been applied by NTCIR for their NTCIR 5 and NTCIR 6 patent test collections (see: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/PATENT/NTCIR5-OV-PATENT-FujiiA.pdf ). The justification for inferring relevance assessments from EP references is based on the following:

  • The patent references found on patent documents issued by a patent office are set by its patent examiners. The subject and legal expertise of the examiner allows for a qualified assessment of relevance with respect to the prior art search task.
  • The legal specification setting the criteria for valid reference matter can be interpreted as a definition of relevance for an information need (e.g. European Patent Convention Rule 44, Article 92(1), and Article 54).

  • Additional guidance provided by examination manuals (e.g. the USPTO Manual of Patent Examining Procedure) offers a further, more precise description of the nature of the stated form of relevance. This is exemplified by the following excerpt: 'All documents cited in the search report are identified by placing a particular letter in the first column of the citation sheets. ... Where a document cited in the European search report is particularly relevant, it should be indicated by the letter ’X’ or ’Y’' (Guidelines for Examination in the EPO, B X 9.2).
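As a small illustration of how such citation categories could be exploited, the following sketch maps EPO citation categories to graded relevance levels. The particular grade values chosen here are an assumption for illustration only, not part of the EPO guidelines.

```python
# EPO search reports label each citation with a category letter
# (e.g. 'X', 'Y' = particularly relevant; 'A' = technological background).
# The grade values below are an illustrative assumption, not an EPO rule.
CATEGORY_GRADE = {"X": 2, "Y": 2, "A": 1}

def grade(category):
    """Return a graded relevance level for a citation category,
    defaulting to 1 (relevant) for categories not listed above."""
    return CATEGORY_GRADE.get(category, 1)
```

Such a mapping would allow graded rather than binary inferred assessments, so that evaluation measures sensitive to relevance grades could be applied.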

In conclusion it can be said that, compared to other domains such as the Web or scientific documents, where the motivation for setting a reference can be varied (commercial, navigational, a higher probability of citing work at one's own university, etc.), the setting of patent references is governed by rules and performed by experts, and their utilization as relevance assessments is therefore reasonable. It is noteworthy that the EPO does not apply the concept of “duty of candour” (i.e. asking applicants to provide references to prior art). The USPTO does apply this concept, which leads to significantly larger lists of prior art references. However, it has been noted that references provided by the applicant are less likely to be highly relevant (this party lacks the motivation to provide such references). EPO references can therefore be considered less noisy. (Note: only since 2006 have USPTO references been labelled with respect to their origin (examiner versus applicant).) For our work we analysed the distribution of references found on EP documents pointing to other EP documents. This restriction allows us to limit the corpus that has to be provided to participants to those patents issued by the EPO (approximately 200-300 GB of raw data). The following table shows the distribution of the frequency of these references.

# of cit.   1       2       >2      >3     >4     >5    >6    >7    >8    >9  >10
# of docs.  357,387 168,896 111,397 43,306 17,279 7,430 3,486 1,798 1,043 664 435

As can be seen from the table, while the number of references (potential relevance assessments) per document is low, the number of potential topics that could be formed by using a technique based on inferring relevance assessments is very high. Furthermore, Henk Tomas explained to me that another, and even more accurate, source of references are those mentioned in opposition procedures (which challenge the validity of a patent), as a party opposing a specific patent is very strongly motivated to supply highly accurate references to relevant prior art.
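A minimal sketch of the inference technique, and of how a distribution like the one tabulated above could be computed, assuming examiner citations are available as a mapping from citing documents to cited EP documents. The data layout and the TREC-style qrels output are illustrative assumptions, not the format of any actual EPO data set.

```python
from collections import Counter

def infer_qrels(citations):
    """Turn examiner citations (citing doc -> list of cited EP docs)
    into TREC-style qrels lines: one inferred judgement per citation."""
    qrels = []
    for topic_doc, cited in citations.items():
        for doc in cited:
            qrels.append(f"{topic_doc} 0 {doc} 1")
    return qrels

def reference_histogram(citations):
    """Histogram over the number of EP->EP references each document
    carries (the quantity tabulated above)."""
    return Counter(len(cited) for cited in citations.values())

# Hypothetical sample of examiner citations on three EP applications.
citations = {
    "EP-A-100": ["EP-B-7", "EP-B-9"],
    "EP-A-101": ["EP-B-7"],
    "EP-A-102": ["EP-B-3", "EP-B-4", "EP-B-5"],
}

qrels = infer_qrels(citations)
hist = reference_histogram(citations)
```

Each citing application becomes a topic, and its cited documents become the relevant set, which is why the per-document reference counts directly bound the number of assessments available per topic.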

Corpus:

The corpus of documents refers to the actual test collection that will be provided to the participants of the task. The test collection we have created is focused on the body of documents issued by the European Patent Office. The EPO corpus contains documents from 1978 to today. The following figures have been derived from a subset of this corpus. The table below shows the distribution of languages within these documents.

Total number ENG GER FR
3,631,954 2,549,633 848,471 232,950

The distribution of kind codes within these documents is the following.

Kind code   A1        A2      A3      A4      B1      B2     Other
# of docs.  1,226,849 678,434 686,075 157,957 890,436 13,286 10,032

 

Potential tasks:

Topics and relevance assessments for the following tasks can easily be created following the methodology described in our attached publication.

Cross-language Prior Art Search

Since the references set on EP documents refer to prior art relevant to the application in question, utilizing those references as relevance assessments for the prior art task is the most precise interpretation. In a cross-language environment, topics based on the EP corpus can be defined as:

  • A document-to-document task: the query of the topic consists of the application in question, and the task consists of identifying prior art with respect to this application.
  • Based on the EP corpus the following cross-language topics can be defined: ENG → FR, FR → ENG, ENG → DE, DE → ENG, etc.
  • The granularity of the task can be altered by (a) defining a subset of the application as the query (e.g. a specific claim), or (b) utilizing the different categories assigned to references (e.g. “X”, “Y”, which denote the type of relevance of the reference).
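A sketch of how such cross-language document-to-document topics could be selected, assuming per-document language labels and examiner citations are available; the record layout and function name are hypothetical.

```python
def cross_language_topics(doc_lang, citations, source_lang, target_lang):
    """Select document-to-document topics whose query document is in
    `source_lang` and which have at least one relevant (cited)
    document in `target_lang` -- e.g. ENG -> FR prior art topics."""
    topics = []
    for query_doc, cited in citations.items():
        if doc_lang.get(query_doc) != source_lang:
            continue
        relevant = [d for d in cited if doc_lang.get(d) == target_lang]
        if relevant:
            topics.append((query_doc, relevant))
    return topics

# Hypothetical language labels and examiner citations.
doc_lang = {"EP-1": "ENG", "EP-2": "FR", "EP-3": "GER", "EP-4": "FR"}
citations = {"EP-1": ["EP-2", "EP-3"], "EP-3": ["EP-4"]}

eng_fr = cross_language_topics(doc_lang, citations, "ENG", "FR")
# -> [("EP-1", ["EP-2"])]
```

Running the same selection for each language pair would yield the distribution of potential cross-language topics mentioned in the summary below.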

Cross-language Invalidity Search

NTCIR utilized patent references to create the NTCIR 5 and 6 test collections. Earlier patent test collections were based on human assessment; a brief description of this is provided in http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings4/PATENT/NTCIR4-OV-PATENT-FujiiA.pdf . For the NTCIR 5 and 6 collections, references of denied Japanese patent applications were utilized in the invalidity task. A brief description can be found here: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/PATENT/NTCIR5-OV-PATENT-FujiiA.pdf

The organizers do not provide much detail as to why they abandoned manual assessment: “In NTCIR-4, a number of issues remained open questions. First, in the invalidity search subtask, the number of relevant documents was small and the evaluation result was perhaps less reliable compared with the conventional ad-hoc retrieval tasks.” However, at IRFS 2008 Noriko Kando did state that 34 topics were considered a maximum for performing human assessment on the invalidity search task. As with the prior art search task described above, topics based on the EP corpus can be defined as:

  • A document-to-document task: the query of the topic consists of the application in question, and the task consists of identifying prior art with respect to this application.
  • Based on the EP corpus the following cross-language topics can be defined: ENG → FR, FR → ENG, ENG → DE, DE → ENG, etc.
  • The granularity of the task can be altered by (a) defining a subset of the application as the query (e.g. a specific claim), or (b) utilizing the different categories assigned to references (e.g. “X”, “Y”, which denote the type of relevance of the reference).

Summary:

Concerning the application of the technique described in our paper, the following conclusions can be drawn from our side.

  • Applying pooled human assessment for prior art or invalidity search without providing training topics does not seem a viable option. Both search types are very hard problems, and experts with years of patent search experience spend days identifying a small number of relevant documents (at least that is my interpretation of the situation; please correct me if this is wrong). It can therefore be expected that first-time participating systems might not be able to provide relevant results within a sensible cut-off rank (e.g. 100).
  • An inferred assessment technique could be utilized to create both training topics and task topics for the tasks described above with a comparatively low amount of work. If this is considered a viable approach, I can evaluate the distribution of potential cross-language topics within the EP corpus.
  • Mihai Lupu suggested that human assessment could be applied after having received runs from participating groups. This could be a very good opportunity to enrich certain topics with more relevance assessments, and to explore the viability of applying pooling (e.g. identifying a sensible cut-off value for pooling).

Methodology