Dataset

The MAREC 400.000 collection consists of 100,000 randomly selected patents from each of the four sub-collections of the MAREC dataset (EPO, JPO, USPTO, WIPO). It was intended for authors submitting papers to the AsPIRe'10 workshop at ECIR.
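The per-office sampling described above can be illustrated with a minimal sketch. The source only says the patents were "randomly picked", so the procedure below (uniform sampling without replacement, with a fixed seed) is an assumption, and the function name, the toy document IDs, and the scaled-down sizes are hypothetical stand-ins rather than the actual MAREC tooling or identifiers.

```python
import random

# The four sub-collections named in the dataset description.
OFFICES = ("EPO", "JPO", "USPTO", "WIPO")


def sample_per_office(ids_by_office, per_office, seed=0):
    """Draw `per_office` IDs uniformly at random, without replacement,
    from each office's list of document IDs (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = {}
    for office in OFFICES:
        ids = ids_by_office[office]
        if len(ids) < per_office:
            raise ValueError(f"{office} has fewer than {per_office} documents")
        sample[office] = rng.sample(ids, per_office)
    return sample


# Toy stand-in for the real collection: 1,000 made-up IDs per office,
# sampled down to 100 each (the real subset uses 100,000 per office).
toy_ids = {office: [f"{office}-{i}" for i in range(1000)] for office in OFFICES}
subset = sample_per_office(toy_ids, per_office=100)
total = sum(len(ids) for ids in subset.values())
```

Sampling the same number of documents from every office yields a subset that is balanced across sources, which is not the same as a subset proportional to each office's share of the full collection.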

Participants were encouraged to apply the techniques they develop to this dataset where possible. Because the techniques are then evaluated on the same data, their results are easier to compare. Furthermore, the MAREC 400.000 collection allows initial patent processing experiments to be run on a representative dataset of manageable size before they are scaled up to the 19 million patents of the full MAREC collection.

How to access MAREC

If you are interested in accessing the MAREC data please contact membership@ir-facility.org.

MAREC at a glance

  • 19 million XML documents
  • All patent applications and granted patents between 1976 and June 2008
  • From EPO, WIPO, USPTO, JPO
  • Unified fields, numbering scheme and citation format
  • Comparable corpus