Detection of OCR Quality on Patents

Overview

Optical Character Recognition (OCR) works by scanning source documents and performing character analysis on the resulting images, producing ASCII text that can then be stored and manipulated electronically. The character recognition process is not perfect, and errors often occur. These errors reduce the effectiveness of information retrieval algorithms that rely on exact matches between query terms and document terms.
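The effect on exact-match retrieval can be illustrated with a minimal sketch. The function name `exact_match` and the sample strings are hypothetical and only serve as an illustration; they are not taken from the project. The sketch assumes whitespace tokenisation and a query that requires every term to occur in the document.

```python
def exact_match(query_terms, document_text):
    """Return True only if every query term occurs verbatim in the document."""
    doc_terms = set(document_text.lower().split())
    return all(term.lower() in doc_terms for term in query_terms)


# Hypothetical example: the OCR step misreads "m" as "rn".
original = "a rotary engine with a modern piston"
ocr_output = "a rotary engine with a rnodern piston"

query = ["modern", "piston"]
found_in_original = exact_match(query, original)
found_in_ocr = exact_match(query, ocr_output)
```

A single misrecognised character is enough for the OCRed document to be missed by the query, even though the source document is relevant.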
 

Goals

The goal of this project is to identify a strategy for assessing the quality of a document obtained via an OCR process, and to assign a score (a quality coefficient) to each patent document. This coefficient indicates the quality of an OCR result. It is obtained by computing statistical models for a gold standard of manually pre-processed documents and for the document in question, and then comparing the two.
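One way such a coefficient could be computed is sketched below. This is an illustration only: the page does not say which statistical model the project uses, so the character bigram model, the add-one smoothing, the `vocab_size` value, and all function names and sample texts are assumptions made for this sketch.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the overlapping character bigrams of the lower-cased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train_model(gold_texts):
    """Count character bigrams over the gold-standard texts."""
    counts = Counter()
    for text in gold_texts:
        counts.update(bigrams(text))
    return counts


def quality_score(document, model, vocab_size=256 * 256):
    """Average log-probability per bigram under the gold-standard model.

    Add-one smoothing keeps unseen bigrams from producing log(0).
    Higher (less negative) scores mean the text looks more like the gold standard.
    """
    grams = bigrams(document)
    if not grams:
        return float("-inf")
    total = sum(model.values())
    log_prob = sum(math.log((model[g] + 1) / (total + vocab_size)) for g in grams)
    return log_prob / len(grams)


# Hypothetical gold-standard snippets and two candidate documents.
gold = [
    "a method for manufacturing a semiconductor device",
    "the device comprises a housing and a lever",
]
model = train_model(gold)

clean = "the method comprises a device"
noisy = "tbe rnethod cornprises a dev1ce"

clean_score = quality_score(clean, model)
noisy_score = quality_score(noisy, model)
```

Under these assumptions the noisy text receives a lower score than the clean one, because it contains bigrams such as "tb" or "v1" that never occur in the gold standard. A threshold on the score could then be used to flag documents for further processing.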

 

Expected outcome for IP experts

  • Identify (old) OCRed patent documents that have a high likelihood of containing errors and therefore need to undergo an additional workflow (such as another OCR pass or a manual inspection).
  • Obtain means to reduce the amount of manual inspection.

 

Timeline

This project started in May 2008. First results were presented at IRFS2008. A gold standard has been delivered. Updates are expected by the 3rd quarter of 2009.
 

Project Partners

 

Links

Matrixware.net/OCRQ (for more information about methods and findings, as well as publications and related works)
 

Contact

Besides patent search, OCR processes can be applied to various aspects of information management/retrieval. The IRF can provide you with more details about how OCR processes can help in addressing your concrete needs. Please send your inquiry to: science@ir-facility.org.

 
