Beyond the document

Published Jan 11, 2010 by Erik Graf

What is a document? Ordinarily the word 'document' describes a textual record, a piece of writing that conveys information. In this sense we quite naturally think of documents first and foremost as sheets of paper covered with sequences of characters: static, self-contained entities that we encounter in daily life in the form of scientific papers, tax records, blog entries, and shopping lists. Drawing on research in Information Retrieval, this article aims to shed light on a steadily advancing trend that has blurred the boundaries of documents, is slowly redefining how they are perceived, and keeps information scientists busy trying to answer the very question raised at the beginning of this article: What is a document?

Information scientists identify the emergence of electronic publishing as the starting point of the confusion about the identity of documents, since it raised questions such as “Do linked portions of several documents constitute a single document?” and “Is a thread of E-mail messages a single document?” (Schamber, 1996). The advent of digital documents, and specifically the birth of the World Wide Web, also proved to have a disruptive effect on the domain of Information Retrieval. The exponential growth of hyperlinked documents acted as a catalyst for the development of several novel key techniques within the field.

Apart from massive advances in scalability, two further developments played key roles in the transformation of documents: Google's PageRank algorithm (Brin & Page, 1998) and the establishment of the concept of virtual documents. The underlying idea of PageRank, to measure the importance of a document by analysing the relations between documents as stated explicitly through hyperlinks, presented an approach to Web Information Retrieval that differed distinctly from the prevalent document-content-centred approaches. Instead of determining the similarity of two textual artefacts (i.e. a user query and a web document) solely on the basis of their content, a third element was not only introduced into the equations but came to dominate them: inter-document relations. Certainly in no small part due to the success of Google, this heralded a vast variety of additional approaches aimed at exploiting explicitly stated or inferred relationships between documents. Most notable for Web retrieval in this sense is the large body of research on inferring document relations from graphs built from user queries and the subsequent clicks of users within search results (Joachims, 2002).
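
To make the link-based idea concrete, the following minimal sketch computes PageRank-style scores by power iteration over a toy link graph. It is an illustration only, not the implementation described by Brin and Page: the function name `pagerank`, the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the four-page graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by link structure alone, using simple power iteration.

    `links` maps each page to the list of pages it links to.
    All parameter defaults are assumptions chosen for illustration.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution of importance.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: page "c" is linked to by three other pages.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Note that the content of the pages never enters the computation: the ordering is determined entirely by who links to whom, which is exactly what set this approach apart from content-centred ranking.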

The second major event in the revision of the document was the introduction of the virtual document. The term virtual document was coined in the course of attempts to integrate anchor text into Web retrieval (Zaragoza, Craswell, Taylor, Saria, & Robertson, 2004). Anchor text, the text of a clickable hyperlink, usually gives the user relevant descriptive or contextual information about the content of the link's destination. In this sense it can often be interpreted as a good summary of important aspects of the target document. The basic idea for utilizing these informative text snippets is simply to add them to the target document. To account for the fact that the text did not actually occur in the document itself, but rather in another document, the term virtual document was introduced. Based on this concept, a document in a state-of-the-art Web search engine will therefore consist not only of its actual text, but also of the anchor text fragments that appear in all other Web documents linking to it. This development quickly led to the exploration of extending documents in a similar fashion with a vast range of other external text sources such as thesauri (Jing & Croft, 1994), named entity graphs (Kumaran & Allan, 2004), and descriptive user tags (Heymann, Koutrika, & Garcia-Molina, 2008).
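
The construction of a virtual document can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the method of any particular search engine: the function name `build_virtual_documents`, the input layout, the example.org URLs, and the decision to keep anchor text in a separate field rather than appending it to the body are all hypothetical choices made for this sketch.

```python
from collections import defaultdict


def build_virtual_documents(pages):
    """Extend each page with the anchor text of links pointing to it.

    `pages` maps a URL to {"text": str, "links": [(target_url, anchor_text), ...]}.
    This input layout is an assumption made for the example.
    """
    # Gather anchor text per link target, ignoring links a page makes to itself.
    incoming = defaultdict(list)
    for source_url, page in pages.items():
        for target_url, anchor_text in page["links"]:
            if target_url != source_url:
                incoming[target_url].append(anchor_text)

    # A virtual document is the page's own text plus text found on other pages.
    virtual = {}
    for url, page in pages.items():
        virtual[url] = {
            "body": page["text"],
            "anchor": " ".join(incoming.get(url, [])),
        }
    return virtual


# Hypothetical three-page collection.
pages = {
    "http://example.org/patents": {
        "text": "Search the collection by classification code.",
        "links": [],
    },
    "http://example.org/blog": {
        "text": "Notes on retrieval research.",
        "links": [("http://example.org/patents", "patent search guide")],
    },
    "http://example.org/faq": {
        "text": "Frequently asked questions.",
        "links": [("http://example.org/patents", "prior art lookup")],
    },
}

virtual = build_virtual_documents(pages)
print(virtual["http://example.org/patents"])
```

In this toy collection the target page never uses the words "patent" or "prior art" in its own text, yet its virtual document contains both, so a query for either term could now match it. Keeping the anchor text in its own field leaves a ranking function free to weight it differently from the body.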

In light of these techniques, documents in the context of the Web bear only a limited resemblance to the ordinary definition of the word. It is therefore quite understandable that information science researchers find it difficult to derive concise definitions for these entities, which are set within ever denser networks of relations and which, in their virtually expanded form, undergo frequent changes of content. However, the state described above does not constitute a static point in the transformation of the document. Owing to their huge potential benefit for a variety of Information Retrieval tasks, these techniques have seen constant expansion and have been applied to a growing number of domains. Against this background, the patent domain seems to have witnessed only limited application of such techniques. This is especially surprising in view of some of the characteristics of its documents, such as the high density of named entities, the existence of references and classifications, its expert user base, and the association of patent documents with other legal documents.

From the point of view of a member of the Information Retrieval community, it seems plausible that this has been caused by a lack of knowledge and understanding of patent-domain-specific concepts on the side of IR. In light of this, it seems likely that only a joint effort of the patent community and the IR community will be able to pave the way for widespread application of these techniques. In this respect the provision of the MAREC patent document collection to the scientific community, and the ongoing enrichment of Alexandria data with named entity annotations by the Gate team, can certainly be interpreted as big preliminary steps towards exploring the potential benefit of the vast space beyond the boundaries of documents as perceived in the ordinary meaning of the word 'document'.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. Elsevier. Retrieved from http://linkinghub.elsevier.com/retrieve/pii/S016975529800110X

Heymann, P., Koutrika, G., & Garcia-Molina, H. (2008). Can social bookmarking improve web search? In Proceedings of the international conference on Web search and web data mining - WSDM '08 (p. 195). New York, New York, USA: ACM Press. doi: 10.1145/1341531.1341558.

Jing, Y., & Croft, W. (1994). An association thesaurus for information retrieval. In Proceedings of RIAO (Vol. 94, pp. 146–160). Citeseer. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.2421&rep=rep1&type=pdf

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 133–142). New York, NY, USA: ACM. Retrieved from http://portal.acm.org/citation.cfm?id=775047.775067

Kumaran, G., & Allan, J. (2004). Text classification and named entities for new event detection. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 297–304). New York, NY, USA: ACM. doi: 10.1145/1008992.1009044.

Schamber, L. (1996). What Is a Document? Rethinking the concept in uneasy times. Journal of the American Society for Information Science, 47(9), 669–671. Retrieved from http://www3.interscience.wiley.com/journal/57816/abstract

Zaragoza, H., Craswell, N., Taylor, M., Saria, S., & Robertson, S. (2004). Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC 2004. Citeseer. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.965&rep=rep1&type=pdf

IRF
