PHRASE-BASED KNOWLEDGEABLE DOCUMENT INDEX MODEL FOR WEB DOCUMENT CLUSTERING
The system consists of four components:
1. A Web document-restructuring scheme that identifies the different parts of a document and assigns each part a level of significance according to its importance.
2. A novel phrase-based document indexing model, the Document Index Graph (DIG), that captures the structure of sentences in the document set rather than single words only. The DIG model is based on graph theory and uses graph properties to match phrases of any length from a document against any number of previously seen documents, in time nearly proportional to the number of words in the document.
3. A phrase-based similarity measure for scoring the similarity between two documents according to the matching phrases and their significance.
4. An incremental document clustering method based on maintaining high cluster cohesiveness.
Integrating these four components yielded better performance than traditional document clustering methods.
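To make the phrase-matching idea of component 2 concrete, the following is a minimal sketch of a word-adjacency index in the spirit of the DIG. It is not the model's actual data structure or algorithm: the class name `DocumentIndexGraph`, the method `add_document`, the edge layout, and the sample sentences are all assumptions made for illustration. Nodes are words, edges are adjacent word pairs, and each edge records where it occurred in earlier documents, so a new document can be matched against everything already indexed while it is being read. Significance levels (component 1), the similarity score (component 3), and clustering (component 4) are left out.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Toy phrase index (illustrative only): words are nodes, adjacent pairs are edges."""

    def __init__(self):
        # edges[(w1, w2)][doc_id] -> set of (sentence_no, position of w1)
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, sentences):
        """Index a document and return the phrases it shares with earlier ones.

        `sentences` is a list of word lists. The result maps each earlier
        doc_id to a set of maximal shared phrases (tuples of 2+ words).
        """
        matches = defaultdict(set)
        for words in sentences:
            # open_runs[(other_doc, other_sentence, other_pos)] = start index
            # in `words` of a shared run whose last edge sits at other_pos.
            open_runs = {}
            for i in range(len(words) - 1):
                edge = (words[i], words[i + 1])
                next_runs = {}
                for other, occurrences in self.edges.get(edge, {}).items():
                    for o_sent, o_pos in occurrences:
                        # Extend a run that ended on the preceding edge of the
                        # same sentence in the other document, else start anew.
                        start = open_runs.get((other, o_sent, o_pos - 1), i)
                        next_runs[(other, o_sent, o_pos)] = start
                # Runs that were not extended by this edge are complete.
                for (other, o_sent, o_pos), start in open_runs.items():
                    if (other, o_sent, o_pos + 1) not in next_runs:
                        matches[other].add(tuple(words[start:i + 1]))
                open_runs = next_runs
            # Runs still open at the end of the sentence reach its last word.
            for (other, _, _), start in open_runs.items():
                matches[other].add(tuple(words[start:]))

        # Index the new document only after matching, so it never matches itself.
        for s_no, words in enumerate(sentences):
            for i in range(len(words) - 1):
                self.edges[(words[i], words[i + 1])][doc_id].add((s_no, i))
        return dict(matches)


dig = DocumentIndexGraph()
dig.add_document("d1", [["river", "rafting", "trips"],
                        ["wild", "river", "adventures"]])
dig.add_document("d2", [["mild", "river", "rafting"],
                        ["river", "rafting", "vacation", "plans"]])
shared = dig.add_document("d3", [["wild", "river", "rafting", "trips"]])
for other_doc, phrases in sorted(shared.items()):
    print(other_doc, sorted(phrases))
```

In this sketch, a phrase-based similarity measure like the one in component 3 would consume the returned phrase sets, for example by weighting each shared phrase by its length and by the significance of the document part it came from. The exact weighting is not specified here and is left open.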