Text Mining for Intellectual Property

Overview

Patents are long, complex technical texts that are very difficult for computers to analyse thoroughly.

Text Mining for Intellectual Property (TM4IP) aims to provide better means of modelling complex dependencies in patent texts and of searching patents using these dependencies.

 

Goals

The goal of this project is to generate linguistic resources for accurate dependency parsing of patent documents and to apply these resources in a new kind of search engine that uses dependency triples as index terms. The resulting system will allow for sophisticated searching of patents, using both thesaurus information and feedback from the index to achieve high precision and recall. Although the project focuses on the development of concrete tools and resources, it will also contribute to the state of the art in natural language processing and information retrieval through research and publications.
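To illustrate what using dependency triples as terms could look like, here is a minimal sketch of an inverted index keyed by triples instead of single words. It is not the project's implementation: the `TripleIndex` class, the (head, relation, dependent) layout, the relation labels, and the sample documents are all assumptions made for this example, and the triples are written by hand rather than produced by a parser.

```python
from collections import defaultdict

# A dependency triple is assumed to be (head, relation, dependent).
# The relation labels below are illustrative, not the project's label set.
Triple = tuple[str, str, str]


class TripleIndex:
    """Toy inverted index whose terms are dependency triples."""

    def __init__(self) -> None:
        # Maps each normalised triple to the set of documents containing it.
        self._postings: dict[Triple, set[str]] = defaultdict(set)

    def add(self, doc_id: str, triples: list[Triple]) -> None:
        """Index a document given its (already parsed) triples."""
        for head, rel, dep in triples:
            self._postings[(head.lower(), rel, dep.lower())].add(doc_id)

    def search(self, query: list[Triple]) -> set[str]:
        """Return the ids of documents that contain every query triple."""
        result = None
        for head, rel, dep in query:
            docs = self._postings.get((head.lower(), rel, dep.lower()), set())
            result = set(docs) if result is None else result & docs
        return result or set()


index = TripleIndex()
# Hypothetical document: "the valve controls the flow"
index.add("patent-A", [("control", "subj", "valve"), ("control", "obj", "flow")])
# Hypothetical document: "the pump controls the flow through the valve"
index.add("patent-B", [("control", "subj", "pump"), ("control", "obj", "flow"),
                       ("flow", "through", "valve")])

# Both documents mention "valve" and "control", but only one has the valve
# as the subject of "control".
print(sorted(index.search([("control", "subj", "valve")])))  # ['patent-A']
```

A word-based index would return both sample documents for the query "valve control"; matching on the triple keeps only the one in which the valve does the controlling, which is the kind of precision gain the approach is after.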

 

Expected outcome for IP experts

  • An IP search engine based on deep linguistic techniques and suitable for professional search in patent documents.
  • Accurate, reusable linguistic resources (parsers and lexica) for the IP domain.

 

Timeline

The project started in 2008 and will run for three years:

  • End of 2009: parser and search engine prototypes
  • End of 2010: beta versions
  • End of 2011: final versions

 

Project Partners

Links

Matixware.net/Text Mining (for more information about methods and findings, as well as publications and related work)

 

Contact

Text mining can be applied to various aspects of information management/retrieval. The IRF can provide you with more details about how text mining can help in addressing your concrete needs. Please send your inquiry to: science@ir-facility.org.

MAREC

IRF Scientific Members now have access to the first standardised patent data corpus for research purposes.