Documentation

 

LCS is a machine learning framework. Besides multiple learning algorithms, it offers various options for pre-processing, term selection and term strength calculation, as well as the two standard evaluation methods, train-test split and cross-validation.

The so-called categorization pipeline outlines the sequence and the interdependencies of the components involved in the categorization process. LCS allows each pipeline step to be adjusted in order to find the configuration that best suits the underlying categorization problem. The pipeline consists of the following steps:

  • Pre-processing of documents
    In this step, a bag-of-words model is built for every patent. It incorporates the entire content of the patent as defined by the chosen document representation. The initial bag-of-words representation is produced by an indexing process that pre-processes the documents. Besides standard pre-processing such as decapitalisation (lower-casing) and special character removal, LCS also supports language-dependent pre-processing, namely stop word removal and lemmatisation.
  • Term selection
    The purpose of term selection is to reduce the usually large number of distinct terms extracted from the training documents (e.g. from a million distinct terms to a few tens of thousands) to a number that the subsequent categorization can handle.

    LCS supports two kinds of term selection: i) global selection, which is based on term and document frequencies, and ii) local selection, a category-specific selection that identifies, for every category, the terms with high discriminative power. To estimate the discriminative power of a term with respect to the categories, LCS can apply various common heuristics, such as chi-square, information gain and mutual information.
     
  • Term strength calculation
    This step transforms the bag-of-words model into a numeric representation that serves as the input for the learning algorithm (e.g. SVM). Term strength is an estimate of how important a term is for a document within a collection. LCS offers various heuristics for term strength calculation, such as Boolean, term frequency and tf-idf.
     
  • Categorization
    This step comprises all facilities for categorizing patent documents with the chosen machine learning algorithm. In particular, it provides the two methods training and testing (i.e. categorizing unseen patents). Training delivers a learned model of the machine learning algorithm with respect to a given taxonomy, which is the IPC in the case of the categorization prototype.
    LCS offers three quite different machine learning algorithms: Rocchio, Balanced Winnow and SVM (using the SVMLight implementation). To handle multi-categorization, LCS supports both flat learning, i.e. a set of partial categorizers, each trained on a single category, and hierarchical categorization.

    The testing method applies categorization measures, such as precision, recall and their harmonic mean (the F1 measure), based on one of the two evaluation methods, train-test split or cross-validation.
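
The pipeline steps above can be illustrated with a minimal Python sketch. It is not the LCS implementation and does not use any LCS interface: all function names, the stop word list, the toy documents and their labels are assumptions made for illustration only. The tf-idf variant shown (term frequency times log(N/df)) and the chi-square formula over a 2x2 term/category table are common textbook forms; the exact variants used by LCS are not stated in this document.

```python
import math
import re
from collections import Counter

# Hypothetical stop word list and toy data, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "is"}
DOCS = ["The engine of a car", "Car engine cooling",
        "A drug for the heart", "Heart drug dosage"]
LABELS = ["F02", "F02", "A61", "A61"]


def preprocess(text):
    """Lower-case, strip special characters, drop stop words; return a bag of words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def global_selection(bags, min_df=2):
    """Global selection: keep terms that occur in at least min_df documents."""
    df = Counter(term for bag in bags for term in bag)
    return {term for term, n in df.items() if n >= min_df}


def chi_square(term, category, bags, labels):
    """Local selection score: chi-square of the 2x2 term/category table."""
    a = b = c = d = 0
    for bag, label in zip(bags, labels):
        has_term, in_cat = term in bag, label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document not in category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document not in category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def tf_idf(bag, bags, vocabulary):
    """Term strength: tf * log(N / df) for every selected term of one document.

    Terms that do not occur in any bag of the collection are skipped.
    """
    n = len(bags)
    weights = {}
    for term, tf in bag.items():
        df = sum(1 for other in bags if term in other)
        if term in vocabulary and df:
            weights[term] = tf * math.log(n / df)
    return weights


def precision_recall_f1(predicted, actual, category):
    """Precision, recall and their harmonic mean (F1) for one category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


bags = [preprocess(doc) for doc in DOCS]
vocabulary = global_selection(bags)
vectors = [tf_idf(bag, bags, vocabulary) for bag in bags]
```

In this sketch, `vectors` would be the numeric input handed to a learning algorithm, and `precision_recall_f1` would be applied to the predictions on a held-out test split or on each cross-validation fold. The learning algorithms themselves (Rocchio, Balanced Winnow, SVM) are deliberately left out.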