MAREC is a static collection of over 19 million patent applications and granted patents from EP, WO, US, and JP sources, normalized to a unified file format and covering the period from 1976 to June 2008. MAREC is intended as raw material for research and evaluation in areas such as information retrieval, natural language processing, and machine translation, which require large numbers of complex documents. It allows experiments with real data on a realistic scale.
The collection contains documents in 19 languages, the majority being English, German and French, and about half of the documents include full text. Many projects exhibited on the IRF Portal use MAREC.
In MAREC, the documents from different countries and sources are normalized to a common XML format with a uniform patent numbering scheme and citation format. The standardized fields include dates, countries, languages, references, person names, and companies, as well as rich subject classifications. MAREC is a comparable corpus: many documents are available in similar versions in other languages.
The 19,386,697 XML files total 621 GB. An overview of the structure is shown on the statistics page.
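To check a downloaded copy against these figures, a minimal sketch along the following lines could tally the XML files and their combined size. The function names and the directory path are hypothetical, the on-disk layout of the collection is not assumed beyond files ending in .xml, and 1 GB is taken as 10**9 bytes, which is an assumption about how the 621 GB figure was measured.

```python
import os
from pathlib import Path


def summarize_collection(root):
    """Return (file_count, total_bytes) for all .xml files below root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively; other files are skipped.
            if name.lower().endswith(".xml"):
                file_count += 1
                total_bytes += (Path(dirpath) / name).stat().st_size
    return file_count, total_bytes


def average_size_kb(file_count, total_bytes):
    """Average file size in kilobytes (1 KB = 1000 bytes); 0.0 if no files."""
    if file_count == 0:
        return 0.0
    return total_bytes / file_count / 1000


# Figures quoted in the text, with 1 GB assumed to be 10**9 bytes.
print(round(average_size_kb(19_386_697, 621 * 10**9), 1))
```

With the quoted figures this works out to roughly 32 KB per document. On a local copy, summarize_collection("path/to/marec") (a placeholder path) would return the two numbers to compare against.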
Access to the MAREC data collection
MAREC by IRF is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. You can download it here: http://www.ifs.tuwien.ac.at/imp/marec.shtml
Permissions beyond the scope of this license may be available at mailto:firstname.lastname@example.org.
MAREC at a glance
- Over 19 million XML documents
- All patent applications and granted patents between 1976 and June 2008
- From EPO, WIPO, USPTO, JPO
- Unified fields, numbering scheme and citation format
- Comparable corpus