Detection of OCR Quality on Patents
Overview
Optical Character Recognition (OCR) works by scanning source documents and performing character analysis on the resulting images, producing ASCII text that can then be stored and manipulated electronically. The character recognition process is not perfect, and errors often occur. These errors reduce the effectiveness of information retrieval algorithms that rely on exact matches between query terms and document terms.
Goals
The goal of this project is to identify a strategy for assessing the quality of a document obtained via an OCR process, and to assign a score (a quality coefficient) to each patent document. This coefficient indicates the quality of an OCR result. It is obtained by computing statistical models for a gold standard of manually pre-processed documents and for the document in question, and then comparing the two.
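As a rough illustration of the idea, the sketch below scores a document against a model built from gold-standard text. It is not the project's actual method: the page does not say which statistical model is used, so the character trigram model, the add-one smoothing, and the names `char_ngrams`, `train_model` and `quality_coefficient` are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_model(gold_texts, n=3):
    """Count character n-grams over the gold-standard documents.

    Assumption: a plain n-gram count table stands in for the
    'statistical model' of the gold standard.
    """
    counts = Counter()
    for text in gold_texts:
        counts.update(char_ngrams(text, n))
    return counts


def quality_coefficient(model, document, n=3):
    """Mean log-probability per n-gram of the document under the gold model.

    Add-one smoothing gives unseen n-grams a small non-zero probability.
    Higher (less negative) values mean the document looks more like the
    gold standard. This scoring rule is hypothetical.
    """
    grams = char_ngrams(document, n)
    if not grams:
        return float("-inf")
    total = sum(model.values())
    vocab = len(model) + 1  # one extra bucket for unseen n-grams
    log_prob = sum(math.log((model[g] + 1) / (total + vocab)) for g in grams)
    return log_prob / len(grams)


# Tiny made-up gold standard, for illustration only.
gold = [
    "the invention relates to a method for producing a compound",
    "a method according to the invention for producing the compound",
]
model = train_model(gold)

clean_score = quality_coefficient(model, "the method relates to the invention")
noisy_score = quality_coefficient(model, "tbe rnethod re1ates t0 tlie inventi0n")
```

With this toy data, the clean sentence receives a higher score than the one containing typical OCR confusions (such as "rn" for "m" or "0" for "o"). In practice, a threshold on such a score could flag documents for a further OCR pass or manual inspection.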
Expected outcome for IP experts
- Identify (old) OCRed patent documents that have a high likelihood of containing errors and therefore need to undergo an additional workflow (such as another OCR pass or a manual inspection).
- Obtain means to reduce the amount of manual inspection.
Timeline
This project started in May 2008. First results were presented at IRFS2008. A gold standard has been delivered. Updates are expected by the third quarter of 2009.
Project Partners
- University of Massachusetts Amherst (Research & Development)
- Matrixware Information Services GmbH (showcase, funding)
- Information Retrieval Facility (infrastructure, data)
Links
Matrixware.net/OCRQ (for more information about methods and findings, as well as publications and related works)
Contact
Besides patent search, OCR processes can be applied to various aspects of information management and retrieval. The IRF can provide you with more details about how OCR processes can help address your specific needs. Please send your inquiry to: science@ir-facility.org.

