Anatomy of the ASPIRE'10 Collection

Published Mar 9, 2010 by Veronika Zenz

In this article we present an analysis of the anatomy of the ASPIRE patent corpus. We report the number of unique terms in this corpus, the average term frequency, the distribution of unique terms over document frequencies and an analysis over different term types. These statistics are then compared to the New York Times Annotated Corpus, a collection of newspaper articles, and the differences between the two corpora are highlighted.

 

When trying to improve the quality of information retrieval, information access or translation mechanisms, a first step is to investigate the characteristics of the vocabulary used in the target data. This background knowledge can help to understand why an algorithm that was developed for one domain (e.g. web documents) does not behave as expected when applied to another domain. It also offers the opportunity to fine-tune existing methods or to apply specialised techniques in order to address the specific characteristics of a given data set.



Corpora

This study operates on corpora stemming from two different domains: the ASPIRE corpus of 400,000 patent documents and the New York Times Annotated Corpus of 1.9 million newspaper articles.

The ASPIRE patent corpus is a subset of 400,000 documents of the MAtrixware REsearch Collection (MAREC), a large corpus of patent documents provided by the Information Retrieval Facility. ASPIRE consists of patent documents published by the European Patent Office (EPO), the United States Patent and Trademark Office (USPTO), the Japan Patent Office (JPO) and the World Intellectual Property Organization (WIPO), with a total size of 14GB. 100,000 patent documents were drawn from each patent authority. The documents are a random selection with respect to technology domains and were published between January 1976 and June 2008. Thus, a wide variety of topics is covered, from areas including chemistry, medical sciences and physics. The majority of the text is written in English, but some documents also contain sections in French and German. The research presented in this article focuses on the English-language text parts of the documents, i.e. title, abstract, description and claims.

The New York Times Annotated Corpus, a collection of nearly 1.9 million articles published in the New York Times between January 1987 and June 2007, is distributed by the Linguistic Data Consortium for research purposes. Alongside the newspaper articles themselves, the collection contains annotations written by library scientists in the form of summaries and different kinds of tags. For this study, however, the annotations were not used; only the bodies of the newspaper articles were analysed.

Terminology

When analysing the anatomy of a collection, the Information Retrieval scientist commonly speaks of tokens, terms, term frequencies, document frequencies and dictionaries. A token is "an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing" (taken from Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html). A term is the (possibly normalized) class of all tokens containing the same character sequence. All terms together make up the dictionary. For example, if the document to be indexed is "to sleep perchance to dream", then there are 5 tokens, but only 4 terms (since there are 2 instances of "to"): to, sleep, perchance, and dream. The document frequency of a term is the number of documents it occurs in, while the term frequency, as used in this article, is the number of occurrences of the term in the whole collection (not in a single document).
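The definitions above can be illustrated with a minimal Python sketch. It is not the tooling used for this study: the whitespace tokeniser, the lower-casing and the second example document are assumptions made purely for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lower-casing and whitespace splitting stand in for
    # whatever tokenisation and normalisation a real indexer applies.
    return text.lower().split()


def collection_stats(documents):
    """Return (number of tokens, term frequencies, document frequencies)."""
    term_freq = Counter()  # occurrences of each term in the whole collection
    doc_freq = Counter()   # number of documents each term occurs in
    n_tokens = 0
    for doc in documents:
        tokens = tokenize(doc)
        n_tokens += len(tokens)
        term_freq.update(tokens)      # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts at most once
    return n_tokens, term_freq, doc_freq


# The first document is the example from the text; the second is hypothetical.
docs = ["to sleep perchance to dream", "to dream of sleep"]
n_tokens, term_freq, doc_freq = collection_stats(docs)

print("tokens:", n_tokens)
print("terms in dictionary:", len(term_freq))
print("term frequency of 'to':", term_freq["to"])
print("document frequency of 'to':", doc_freq["to"])
```

The dictionary is simply the set of keys of term_freq, so its size is the number of unique terms.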

 

Anatomy

          ASPIRE         NYT
Tokens    998,613,477    1,023,832,590
Terms     4,191,381      2,320,382

 

 

While the number of tokens in the two corpora differs by only about 2.5%, the ASPIRE dictionary contains about 81% more terms than the NYT dictionary (4,191,381 versus 2,320,382; put differently, the NYT dictionary is only about 55% the size of the ASPIRE dictionary). The number of terms can be inflated by artefacts such as OCR or spelling errors, but in general the larger number of unique terms indicates that the language used in the patent corpus is more diverse. This diversity is also reflected in the comparatively low mean term frequency (tokens divided by terms), which is 238 for the ASPIRE collection and 441 for the NYT.
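The derived figures can be recomputed directly from the token and term counts in the table. The following short Python sketch does only that arithmetic; the variable names are chosen for illustration.

```python
# Counts taken from the table above.
tokens = {"ASPIRE": 998_613_477, "NYT": 1_023_832_590}
terms = {"ASPIRE": 4_191_381, "NYT": 2_320_382}

# Mean term frequency: how often a dictionary term occurs on average.
mean_tf = {name: tokens[name] / terms[name] for name in tokens}

# Relative difference in corpus size (tokens), measured against ASPIRE.
token_diff = (tokens["NYT"] - tokens["ASPIRE"]) / tokens["ASPIRE"]

# Dictionary size of ASPIRE relative to NYT.
term_ratio = terms["ASPIRE"] / terms["NYT"]

for name in tokens:
    print(f"{name}: mean term frequency = {mean_tf[name]:.1f}")
print(f"token difference = {token_diff:.1%}")
print(f"ASPIRE terms / NYT terms = {term_ratio:.2f}")
print(f"NYT terms / ASPIRE terms = {1 / term_ratio:.2f}")
```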

In order to understand how terms are distributed across documents, we have classified the terms into categories depending on the number of documents they appear in (i.e. their document frequency). We use a logarithmic document frequency scale in which terms that appear in more than 2^(x-1) and at most 2^x documents are grouped together. Thus there is one group for terms occurring in exactly one document (2^0), one group for terms occurring in two documents (2^1), one group for terms occurring in 3 or 4 documents (2^1 to 2^2), another group for terms occurring in 5 to 8 documents (2^2 to 2^3) and so on. The graph "Document Frequency of Unique Terms" shows that the number of terms decreases with growing document frequency. Most terms, i.e. 74% of ASPIRE terms and 53% of NYT terms, occur in only one document. This means that if a user searches for such a term in the collection, the result list contains exactly one item. 11% of the ASPIRE terms appear in two documents and only 6% in three or four. The graph has been cut off after a document frequency of 512, as the number of terms that fall into the remaining categories is too small to be visualized.

The lines in the graph show the accumulated percentage of terms, which for ASPIRE reaches 99% at 128 documents. In other words, 99% of the terms have a document frequency of 128 or less, and only 1% occur in more documents. Still, even the last category, which covers terms that occur in between half and all of the documents, is not empty. It contains terms like "and", "of", "the" and "a", which are highly common but carry little significance on their own. Such terms are referred to as stopwords.
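The logarithmic grouping described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used to produce the graph: the function names and the toy document frequencies are hypothetical.

```python
from collections import Counter


def df_bin(df):
    """Return x such that 2**(x-1) < df <= 2**x (x = 0 for df == 1)."""
    if df < 1:
        raise ValueError("document frequency must be at least 1")
    # For positive integers, (df - 1).bit_length() equals ceil(log2(df))
    # without any floating-point rounding issues.
    return (df - 1).bit_length()


def df_histogram(doc_freq):
    """Return rows (upper bound 2**x, share of terms, accumulated share)."""
    counts = Counter(df_bin(df) for df in doc_freq.values())
    total = sum(counts.values())
    rows, accumulated = [], 0.0
    for x in sorted(counts):
        share = counts[x] / total
        accumulated += share
        rows.append((2 ** x, share, accumulated))
    return rows


# Hypothetical toy dictionary mapping each term to its document frequency.
toy_doc_freq = {"the": 8, "patent": 5, "claim": 3, "baz": 2,
                "foo": 1, "bar": 1, "qux": 1, "quux": 1}

rows = df_histogram(toy_doc_freq)
for upper, share, accumulated in rows:
    print(f"df <= {upper:>3}: {share:6.1%} of terms, {accumulated:6.1%} accumulated")
```

The share column corresponds to the height of the bars in such a graph and the accumulated column to the line.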

The general tendency for most words to be infrequent and very few words to be very frequent has been observed in many corpora and languages and is formalised as Zipf's law. While both corpora used in this study show this tendency, the curve for the ASPIRE corpus is much steeper.

Another way to analyze the dictionary of a corpus is to investigate the distribution of its terms over different term types. For this purpose we divided the terms into 4 types: letters, numbers, DNA/RNA and mixed number-letter-punctuation. Regular expressions were used to assign each term to one of these categories. The first category, letters, unites all terms that one would normally expect in a natural language dictionary, like "effects", "method" and "and". Any other combination of letters, such as abbreviations or misspelled words, is also allocated here. The category numbers contains all natural and floating point numbers in the dictionary. DNA/RNA sequences consist of combinations of the letters "a, c, g, t" or "a, c, g, u" with a minimum term length of 4. The type mixed consists of terms that combine digits and letters (e.g. 40mm) or letters, digits and special characters (e.g. 3,5-Dinitrobenzoyl).

In the dictionary of the ASPIRE corpus only 29% of the terms fall into the category letters. 14% are numbers, 9% are DNA/RNA and more than 46% fall into the mixed category. By contrast, letter-words make up nearly half of the NYT dictionary. Not surprisingly, the NYT lacks terms in the category DNA/RNA. It also has far fewer terms that fall into the mixed class.
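A classification of this kind might look like the following Python sketch. The regular expressions actually used in the study are not given in this article, so the patterns below are assumptions written only to match the description: letter-only terms, natural or floating point numbers, sequences of at least 4 letters over the alphabets a, c, g, t or a, c, g, u, and everything else as mixed.

```python
import re

# Assumed patterns; they approximate the four categories described above.
DNA_RNA = re.compile(r"(?:[acgt]{4,}|[acgu]{4,})")
NUMBER = re.compile(r"\d+(?:\.\d+)?")  # natural or floating point numbers
LETTERS = re.compile(r"[^\W\d_]+")     # letters only (Unicode-aware)


def term_type(term):
    """Assign a dictionary term to one of the four term types."""
    t = term.lower()
    # DNA/RNA is tested before letters, because every such sequence
    # would otherwise also match the letters pattern.
    if DNA_RNA.fullmatch(t):
        return "DNA/RNA"
    if NUMBER.fullmatch(t):
        return "numbers"
    if LETTERS.fullmatch(t):
        return "letters"
    return "mixed"


for example in ["effects", "and", "42", "3.14", "gattaca", "40mm",
                "3,5-Dinitrobenzoyl"]:
    print(f"{example!r}: {term_type(example)}")
```

One consequence of such a purely pattern-based approach is that ordinary words which happen to use only these letters (for example "tact") would be counted as DNA/RNA, so the category shares are approximations.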

Conclusions

We have analyzed the anatomies of a newspaper corpus (NYT) and a patent corpus (ASPIRE). We have shown how these two collections differ by comparing the dictionaries obtained from them. In particular, we have investigated the distribution of terms over document frequencies and the distribution of terms over different term categories. It has become clear that the reference collection of patents features a very high percentage of infrequent terms. The fact that "normal letter words" make up less than a third of the terms in the patent dictionary calls for specialised treatment of numbers, formulas, DNA sequences etc. These contribute to the vocabulary to a much larger extent than in traditional IR domains like Web documents or newspaper articles.

Further Readings


V. Zenz, S. Wurzer, M. Dittenbach, E. Ambrosi. On the effects of indexing and retrieval models in patent search and the potential of result set merging. 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe'10).

C. D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html