... | ... | @@ -2,8 +2,164 @@ |
## [Home](https://git.opendfki.de/reuschling/dynaq4solr/wikis/Home) | [How to start](https://git.opendfki.de/reuschling/dynaq4solr/wikis/how-to-start) | [Modules](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules) | [Code snippets / Examples](https://git.opendfki.de/reuschling/dynaq4solr/wikis/examples) | [People / Contact](https://git.opendfki.de/reuschling/dynaq4solr/wikis/people) | [Supporters](https://git.opendfki.de/reuschling/dynaq4solr/wikis/supporters)##
***
[Clustering Summarizer](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules#clustering-summarizer)

[Contextualization](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules#contextualization)

[Document group summarizer](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules#document-group-summarizer)

[EmptyFlResponseCleaner](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules#emptyflresponsecleaner)

[LastQueriesWarmer](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules#lastquerieswarmer)

[Trend analysis](https://git.opendfki.de/reuschling/dynaq4solr/wikis/modules#trend-analysis)

### Clustering Summarizer ###
Solr implements result list clustering with the help of the [carrot2 Search Results Clustering Engine](http://project.carrot2.org/). By default, each cluster comes with a single label that describes it. DynaQ takes all clusters together with their documents and enriches them with further descriptive labels, for scenarios where the default label is not enough.

Entry inside ___solrconfig.xml___:

```
<!-- clustering search component -->
<searchComponent name="clusteringSTC" enable="true" class="org.apache.solr.handler.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

<!-- our ClusteringComponentSummarizer -->
<searchComponent name="clusteringComponentSummarizer" enable="true" class="de.dfki.km.dynaq.docgroups.ClusteringComponentSummarizer"></searchComponent>

<!-- clustering request handler -->
<requestHandler name="/clusteringSTC" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- .... -->
  </lst>

  <lst name="invariants">
    <!-- these are the attributes the label should be generated from -->
    <str name="mlt.fl">title,body</str>
    <bool name="mlt.boost">true</bool>
  </lst>

  <lst name="appends">
    <!-- the id should be loaded in any case -->
    <str name="fl">id</str>
  </lst>

  <arr name="last-components">
    <str>clusteringSTC</str>
    <str>clusteringComponentSummarizer</str>
  </arr>
</requestHandler>
```

Parameters:

* rows: the top N documents of the result list that are picked up for clustering
* relevantTermsSummarizer=[true|false]: enables cluster summarization
* maxIntTerms: the number of label terms to extract per cluster
* maxDocsPerCluster: limits the number of documents per cluster that are considered for cluster summarization. Use it to tune performance; the default is 100

Examples:
```
http://earlytrendradarservice.kl.dfki.de/solr/etrCollection/clusteringSTC?q=%2BdynaqCategory:brandwatch+%2Btitle%3Ascreen+%2Btitle%3Anews+%2Bmodified%3A[20130301000000000+TO+20140201000000000]&rows=100&fl=&relevantTermsSummarizer=true&maxIntTerms=42&maxDocsPerCluster=50
```
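
A request like the one above can also be assembled programmatically. The following is a minimal Python sketch, assuming the `/clusteringSTC` handler from the configuration above; the base URL `http://localhost:8983/solr/myCollection`, the helper name `clustering_url`, and its default values are hypothetical and only serve as illustration.

```python
from urllib.parse import urlencode


def clustering_url(base_url, query, rows=100, max_terms=10, max_docs_per_cluster=100):
    """Build a request URL for the /clusteringSTC handler (hypothetical helper)."""
    params = [
        ("q", query),
        ("rows", rows),                        # top N result documents to cluster
        ("fl", ""),                            # left empty, as in the example URL above
        ("relevantTermsSummarizer", "true"),   # enable cluster summarization
        ("maxIntTerms", max_terms),            # label terms to extract per cluster
        ("maxDocsPerCluster", max_docs_per_cluster),
    ]
    # urlencode percent-encodes '+' and ':' inside the query string
    return base_url.rstrip("/") + "/clusteringSTC?" + urlencode(params)


# Hypothetical collection URL; replace it with your own Solr core/collection.
url = clustering_url(
    "http://localhost:8983/solr/myCollection",
    "+title:screen +title:news",
    rows=100,
    max_terms=42,
    max_docs_per_cluster=50,
)
print(url)
```

Letting `urlencode` do the escaping avoids hand-written sequences such as `%2B` and `%3A` in the query.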

### Contextualization ###
The DynaQ module __ContextDocsSearchComponent__ lets you contextualize your queries with documents that describe the topic or context you want to search in. For example, suppose you want to search within the domain of fish, and you have a huge index of pet forenames. You search for 'harry' and receive birds, cats, and fish. Instead of adding a new search term 'fish', you can set one or more (possibly preconfigured) fish documents as context. The fish named 'harry' will then appear at the top of your result list. If you omit 'harry' altogether, you will receive any fish documents in your corpus (fuzzy matching), because a statistical document similarity search is performed.

___solrconfig.xml___ entry:

```
<searchComponent name="contextualization" enable="true" class="de.dfki.km.dynaq.context.ContextDocsSearchComponent"></searchComponent>

<requestHandler name="/dynaq" class="solr.SearchHandler">
  <arr name="first-components">
    <str>contextualization</str>
  </arr>
</requestHandler>
```

Parameters:

* contextDocId: the id of a document that specifies the context. Can be specified several times
* contextDocsField: the attribute that is considered for the similarity search. Normally the full body text of a document
* contextDocsBoost=[number]: boosts the context docs by applying a multiplication factor to the scores
* includeContextResults=[true|false]: set it to true if the final result should also include all documents that are similar to the context docs but don't match the query

Examples:
```
http://earlytrendradarservice.kl.dfki.de/solr/etrCollection/dynaq?q=%2B%28dynaqCategory:brandwatch%29&contextDocIds=http://www.usatoday.com/story/news/nation/2013/02/14/drought-farmers-midwest/1920577/&contextDocsField=body&rows=10&fl=dataEntityId,title,creator,score
```
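
Because a context document id can be given several times, the parameter has to be repeated in the query string. The sketch below shows one way to do this in Python. It assumes the `/dynaq` handler from the configuration above and uses the parameter name `contextDocId` from the parameter list (the example URL above spells it `contextDocIds`, so check which form your installation accepts). The base URL, the helper name `context_query_url`, and the document ids are hypothetical.

```python
from urllib.parse import urlencode


def context_query_url(base_url, query, context_doc_ids, field="body",
                      boost=None, include_context_results=False):
    """Build a /dynaq request with one contextDocId entry per context document."""
    params = [("q", query)]
    # Repeat the parameter once per context document.
    params += [("contextDocId", doc_id) for doc_id in context_doc_ids]
    params.append(("contextDocsField", field))
    if boost is not None:
        params.append(("contextDocsBoost", boost))
    params.append(("includeContextResults",
                   "true" if include_context_results else "false"))
    return base_url.rstrip("/") + "/dynaq?" + urlencode(params)


# Hypothetical ids of two preconfigured "fish" context documents.
url = context_query_url(
    "http://localhost:8983/solr/myCollection",
    "harry",
    ["fish-doc-1", "fish-doc-2"],
    boost=2,
)
print(url)
```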

### EmptyFlResponseCleaner ###
By default, Solr loads all attributes of a document into the response if you specify nothing for the fl parameter (&fl=). Sometimes this is not what you want. If you want to load nothing in this case, use the EmptyFlResponseCleaner module. It simply strips all document attribute entries from the response before the server sends it back to the client.

EmptyFlResponseCleaner is configured as a last-component:

```
<searchComponent name="emptyFlResponseCleaner" enable="true" class="de.dfki.km.dynaq.util.EmptyFlResponseCleaner"></searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="last-components">
    <str>emptyFlResponseCleaner</str>
  </arr>
</requestHandler>
```
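
The effect of the module can be pictured with a small Python sketch. This is not the module's implementation, only an illustration of the behaviour described above. It assumes that an explicitly empty `fl` triggers the cleaning, that an absent `fl` leaves the response untouched, and that the documents stay in the list as empty entries; the function name and the simplified response layout are hypothetical.

```python
def strip_fields_if_fl_empty(params, response):
    """Return the response without document attributes when fl is given but empty."""
    fl = params.get("fl")
    # Assumption: only an explicit, empty fl (&fl=) triggers the cleaning.
    if fl is None or fl.strip() != "":
        return response
    cleaned = dict(response)
    # Keep one (now empty) entry per document so counts stay consistent.
    cleaned["docs"] = [{} for _ in response.get("docs", [])]
    return cleaned


# Simplified, hypothetical response structure.
response = {
    "numFound": 2,
    "docs": [{"id": "1", "title": "harry the fish"}, {"id": "2", "title": "harry the cat"}],
}

print(strip_fields_if_fl_empty({"q": "harry", "fl": ""}, response))
print(strip_fields_if_fl_empty({"q": "harry", "fl": "id"}, response))
```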

Parameters:

*

Examples:
```

```

### Document group summarizer ###
Looking inside one or more documents to find out what their topics are is sometimes too time-consuming; often a rough overview is enough. The document summarization module serves this purpose. You specify one or more documents by means of a search query. The summarizer takes the result list, looks into the documents, and extracts characteristic terms from them. The resulting list of buzzwords can be used as a summary.

Enable the module in your ___solrconfig.xml___:

```
<searchComponent name="relevantTermsSummarizer" enable="true" class="de.dfki.km.dynaq.docgroups.RelevantTermsSummarizer"></searchComponent>

<requestHandler name="/docgroups/relevantTerms" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="rows">10000</str>
    <str name="fl">dataEntityId</str>
  </lst>

  <lst name="invariants">
    <!-- specify the attributes the buzzwords should be extracted from -->
    <str name="mlt.fl">title,body</str>
    <bool name="mlt.boost">true</bool>
  </lst>

  <arr name="last-components">
    <str>relevantTermsSummarizer</str>
  </arr>
</requestHandler>
```
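
To give an intuition for what "characteristic terms" of a document group can mean, here is a small Python sketch that ranks terms with a plain TF-IDF score. This is a swapped-in textbook technique for illustration only; how the RelevantTermsSummarizer actually scores terms is not described here (the configuration merely shows that it is driven by the `mlt.*` parameters). All function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def characteristic_terms(group, background, top_n=3):
    """Rank the terms of a document group by a simple TF-IDF score."""
    all_docs = group + background
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(tokenize(doc)))
    # Term frequency inside the group that should be summarized.
    term_freq = Counter()
    for doc in group:
        term_freq.update(tokenize(doc))
    scores = {
        term: tf * math.log(len(all_docs) / doc_freq[term])
        for term, tf in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]


group = ["drought hits midwest farmers", "farmers fear drought losses"]
background = ["the cat sat", "fish swim in the pond", "farmers market opens"]
terms = characteristic_terms(group, background)
print(terms)
```

Terms that are frequent in the group but rare in the rest of the corpus end up at the top of the list, which is the general idea behind a buzzword summary.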

Parameters:

*

Examples:
```

```

### LastQueriesWarmer ###
The LastQueriesWarmer module lets you warm up your Solr caches after a server restart. The module remembers the most recent frequently used queries and saves them at server shutdown under \<yourHomeDir\>/.dynaq4solr/warming. After starting the server, you can replay them to warm up the caches. To do so, invoke the script ___\<zip\>/bin/cachewarming.sh___.

... | ... | @@ -25,3 +181,74 @@ Enable and configure the module inside ___solrconfig.xml___ as follows: |
</listener>
```

Furthermore, you have to add it as a first-component to each SearchHandler for which you want to remember queries, like this:

```
<searchComponent name="lastQueriesWarmer" enable="true" class="de.dfki.km.dynaq.warming.LastQueriesWarmer"></searchComponent>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="first-components">
    <str>lastQueriesWarmer</str>
  </arr>
</requestHandler>
```

Parameters:

*

Examples:
```

```

### Trend Analysis ###

The DynaQ Solr modules offer a component for trend mining and forecasting. This module provides a more abstract view of the corpus via trend mining analysis. It takes a trend definition specified by a regular query, optionally contextualized by documents via the contextualization module. In addition, it offers parameters for the time range to be analyzed.

The result is a data container (e.g. JSON) that holds the results of the time series analysis. It comprises calculation results for every time segment considered inside the analysis range, together with aggregated calculations for the whole range.

Data for each single time range segment:
* Amplitude at the given time: number of results, share of results (percentage), sum of relevancies, average relevancy
* Ids of the documents found similar to the trend, together with their score and further metadata
* Growth (slope, first derivative) in this segment, for both result count and relevancies
* Momentum (second derivative) in this segment

Aggregated data for the whole time series analysis:
* Overall amplitude of the trend: number of results, share of results (percentage), sum of relevancies, average relevancy
* Average slope, for both result count and relevancies
* Average momentum

Entry in ___solrconfig.xml___:

```
<searchComponent name="dynaQTrendsComponent" enable="true" class="de.dfki.km.dynaq.trends.DynaQTrendsComponent"></searchComponent>

<requestHandler name="/trends" class="de.dfki.km.dynaq.util.SearchHandlerWithoutComponents">
  <arr name="first-components">
    <str>contextualization</str>
    <str>dynaQTrendsComponent</str>
  </arr>

  <lst name="defaults">
    <str name="rows">7</str>
    <str name="wt">json</str>
    <str name="indent">false</str>
  </lst>

  <lst name="invariants">
    <!-- enter your core/collection id -->
    <str name="df">id</str>
  </lst>
</requestHandler>
```
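
The growth and momentum values listed for each time segment can be understood as discrete first and second derivatives of the time series. The Python sketch below illustrates this with finite differences over the result counts per segment. It is only an approximation of the idea under that assumption; the exact formulas used by DynaQTrendsComponent are not specified here, and the function names, result keys, and sample counts are hypothetical.

```python
def finite_differences(values):
    """Differences between consecutive values (discrete derivative)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def average(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def analyze_trend(counts):
    """Compute per-segment and aggregated trend figures from result counts."""
    total = sum(counts)
    growth = finite_differences(counts)      # first derivative (slope)
    momentum = finite_differences(growth)    # second derivative
    return {
        "share": [count / total if total else 0.0 for count in counts],
        "growth": growth,
        "momentum": momentum,
        "average_growth": average(growth),
        "average_momentum": average(momentum),
    }


# Hypothetical number of matching documents in four consecutive time segments.
result = analyze_trend([2, 4, 9, 11])
print(result)
```

The same calculation could be applied to the summed relevancies per segment instead of the result counts.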

Parameters:

*

Examples:
```

```