Documentation

Clustering Algorithms

In the current configuration, two different clustering algorithms can be used:

  • Suffix Tree Clustering (STC)

  • Lingo
     

Suffix Tree Clustering (STC)

Suffix Tree Clustering (STC) is a purpose-built search results clustering algorithm that groups the input texts according to the identical phrases they share. A phrase here is any sequence of words, regardless of grammatical correctness or function.

The rationale behind this approach is that phrases, compared to sets of single keywords, have greater descriptive power. The main reason for this is that phrases retain the relationships of proximity and order between words.
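The phrase-sharing idea can be illustrated with a small Python sketch. This is not the STC implementation: it swaps the suffix tree for a brute-force enumeration of word n-grams, and the function name `shared_phrase_groups` and the parameters `max_len` and `min_docs` are hypothetical names chosen for this example. It only shows how texts end up grouped under the phrases they have in common, and that word order is preserved in those phrases.

```python
from collections import defaultdict


def shared_phrase_groups(texts, max_len=3, min_docs=2):
    """Map each word phrase (1..max_len words) to the indices of texts containing it.

    Naive stand-in for a suffix tree: every contiguous word sequence is
    enumerated explicitly. Only phrases shared by at least min_docs texts
    are kept.
    """
    phrase_to_docs = defaultdict(set)
    for idx, text in enumerate(texts):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase = " ".join(words[start:start + n])
                phrase_to_docs[phrase].add(idx)
    return {p: docs for p, docs in phrase_to_docs.items() if len(docs) >= min_docs}


# Illustrative input texts (made up for this sketch).
texts = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]
groups = shared_phrase_groups(texts)

for phrase, docs in sorted(groups.items()):
    print(f"{phrase!r}: {sorted(docs)}")
```

In this toy input, the phrase "ate cheese" groups texts 0 and 1, and "cat ate" groups texts 0 and 2. The words "mouse" and "too" each occur in texts 1 and 2, but the phrase "mouse too" occurs only in text 2, so it forms no group. This is the kind of order and proximity information that a set of single keywords would lose.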

More details can be found in "Stanisław Osiński: Dimensionality Reduction Techniques for Search Results Clustering. Master's thesis, Department of Computer Science, The University of Sheffield, UK, 2004"
(Download from http://project.carrot2.org/publications.html).
 

Lingo

Lingo is based on the description-comes-first approach, in which the usual clustering process is reversed: the algorithm first finds meaningful cluster labels and only then assigns snippets to them to form the groups. The main goal of Lingo is to ensure good-quality cluster labels. Labels give users an overview of the topics covered in the results and help them identify the specific group of documents they are looking for. Therefore, the quality of the entire search results clustering process depends crucially on the readability of the group descriptions.
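The reversed order of steps can be sketched in Python as follows. This is a simplified illustration and not the Lingo algorithm itself: Lingo derives its labels using singular value decomposition, whereas this sketch simply takes the most frequent word phrases as labels. The function name `label_first_clustering` and the parameters `num_labels` and `phrase_len` are assumptions made for this example.

```python
from collections import Counter


def label_first_clustering(snippets, num_labels=2, phrase_len=2):
    """Pick cluster labels first, then assign snippets to the chosen labels."""
    # Collect the set of word phrases (of phrase_len words) in each snippet.
    phrase_sets = []
    for snippet in snippets:
        words = snippet.lower().split()
        phrase_sets.append({
            " ".join(words[i:i + phrase_len])
            for i in range(len(words) - phrase_len + 1)
        })

    # Step 1: find labels. Simplification: rank phrases by the number of
    # snippets they occur in (ties broken alphabetically), instead of SVD.
    doc_freq = Counter()
    for phrases in phrase_sets:
        doc_freq.update(phrases)
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    labels = [phrase for phrase, _ in ranked[:num_labels]]

    # Step 2: only now assign snippets to the labels they contain.
    return {
        label: [idx for idx, phrases in enumerate(phrase_sets) if label in phrases]
        for label in labels
    }


# Illustrative snippets (made up for this sketch).
snippets = [
    "data mining software tools",
    "open source data mining",
    "data mining course notes",
    "coal mining industry news",
    "mining industry news today",
]
clusters = label_first_clustering(snippets)
print(clusters)
# {'data mining': [0, 1, 2], 'industry news': [3, 4]}
```

Because the labels are fixed before any snippet is assigned, every group is guaranteed to have a readable description, which is the property the description-comes-first approach aims for.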

A detailed description of the algorithm is given in "Stanisław Osiński, Jerzy Stefanowski, Dawid Weiss: Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition. Advances in Soft Computing, Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM'04 Conference, Zakopane, Poland, 2004, pp. 359-368"
(Download from http://project.carrot2.org/publications.html).