# Document clustering
If you have a set of documents and want an overview of their contents, clustering can help. Documents that are similar to each other are grouped into clusters, and the clustering algorithm tries to label each cluster with the topic its documents have in common.

This is a statistical approach that can help in several circumstances. As with all statistical approaches, however, the results can be unexpected and may confuse new users.


|
|
|
|
|
|
Labeled clusters are shown as bubbles in different colors; the more documents a cluster contains, the bigger its bubble (1). A document can appear in more than one cluster. DynaQ shows this as relationships between clusters: clusters that have documents in common are connected. The set of documents shared by two clusters is represented as a rectangle. If you click on a bubble or a rectangle, the underlying documents are shown in a list (5).

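The relationship view described above can be pictured with a small sketch. This is not DynaQ's code: the cluster labels, the document IDs, and the function name `shared_documents` are hypothetical, chosen only to illustrate how overlapping clusters lead to connections. Each connected pair corresponds to one rectangle, and the shared documents are what the list would show after clicking it. In the sample data only 'contracts' and 'invoices' share a document, so they are the only connected pair.

```python
from itertools import combinations


def shared_documents(clusters):
    """Map each pair of overlapping clusters to the documents they share.

    `clusters` maps a (hypothetical) cluster label to a set of document IDs.
    Pairs without common documents are left out, i.e. they are not connected.
    """
    edges = {}
    for (label_a, docs_a), (label_b, docs_b) in combinations(sorted(clusters.items()), 2):
        common = set(docs_a) & set(docs_b)
        if common:
            edges[(label_a, label_b)] = sorted(common)
    return edges


# Hypothetical sample data: doc3 belongs to two clusters.
clusters = {
    "invoices": {"doc1", "doc2", "doc3"},
    "contracts": {"doc3", "doc4"},
    "travel": {"doc5"},
}

edges = shared_documents(clusters)
print(edges)
```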
There are several ways to pick a set of documents for clustering:

* Open a clustering view and load the current result list with the corresponding button (2). Then cluster the result list with the 'Cluster' button.
* Cluster the result documents of a [topic specification](https://git.opendfki.de/reuschling/dynaq/-/wikis/Classification) (3). A corresponding cluster view will be opened.
* Cluster a list of selected documents saved as a [document pool](https://git.opendfki.de/reuschling/dynaq/-/wikis/Document%20pool).

You can adjust some parameters of the cluster calculation: a rough hint for the expected number of clusters, and whether the documents' body text, title text, or both should be considered (4). To cluster or recluster, click the 'Cluster' button.

Because of the statistical nature of clustering, the resulting clusters and their labels can be confusing or simply not match what you expected. To mitigate this, DynaQ offers a so-called 'Term blacklist for clustering'. This is a simple approach: all terms in the blacklist are removed from the document text before the clusters are calculated. If you add unwanted cluster labels to the blacklist (6), these clusters will disappear. This lets you dig deeper into the content and reach other semantic 'cluster layers', or just eliminate noise. Be careful with small documents (e.g. if you cluster titles only): a big blacklist can remove so much text that the clustering no longer has enough content to work with.

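The effect of the blacklist, and the caveat for short texts, can be illustrated with a minimal sketch. It assumes single-word blacklist terms and a simple lowercase word tokenizer; DynaQ's actual text processing may differ, and the function name `remove_blacklisted_terms` and the sample title are hypothetical. With one blacklisted term most of the title survives, while a large blacklist leaves nothing for the clustering to work with.

```python
import re


def remove_blacklisted_terms(text, blacklist):
    """Return the lowercase word tokens of `text` that are not blacklisted.

    Assumption: terms are compared case-insensitively as whole words.
    """
    blocked = {term.lower() for term in blacklist}
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in blocked]


# Hypothetical short document, e.g. when clustering titles only.
title = "Quarterly report: invoice summary"

# A small blacklist removes just the unwanted label term.
small = remove_blacklisted_terms(title, ["invoice"])
print(small)

# A big blacklist can erase the whole title.
big = remove_blacklisted_terms(title, ["quarterly", "report", "invoice", "summary"])
print(big)
```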
 |
|
|
\ No newline at end of file |