Solr implements result list clustering with the help of the Carrot2 Search Results Clustering Engine. By default, each cluster comes with one label that describes it. DynaQ takes all clusters together with their documents and enriches them with further descriptive labels, in case the default label is not enough in your scenario.
Entry inside solrconfig.xml:
<!-- clustering search component -->
<searchComponent name="clusteringSTC" enable="true" class="org.apache.solr.handler.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

<!-- our ClusteringComponentSummarizer -->
<searchComponent name="clusteringComponentSummarizer" enable="true" class="de.dfki.km.dynaq.docgroups.ClusteringComponentSummarizer"></searchComponent>

<!-- clustering request handler -->
<requestHandler name="/clusteringSTC" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- .... -->
  </lst>
  <lst name="invariants">
    <!-- these are the attributes the label should be generated from -->
    <str name="mlt.fl">title,body</str>
    <bool name="mlt.boost">true</bool>
  </lst>
  <lst name="appends">
    <!-- the id should be loaded in any case -->
    <str name="fl">id</str>
  </lst>
  <arr name="last-components">
    <str>clusteringSTC</str>
    <str>clusteringComponentSummarizer</str>
  </arr>
</requestHandler>
rows: the number of top result list documents (top N) that are picked up for clustering
The collaborative filtering module lets you perform arbitrary CF queries and is designed to be independent of the data structures inside the index.
The module doesn't distinguish between classical, pre-defined 'item' and 'user' roles. Instead, the attributes that should be treated as ids and as references between entities are defined as part of the query. Furthermore, the query syntax doesn't force you to choose between pre-defined query forms, e.g. item-user-item or user-item-user. The CF module receives a 'chain of id attributes' that can be arbitrarily long, where each chain link defines one hop between two entities. Thus, you can perform much more flexible queries, such as user-itemType1-itemType2-usergroup-itemType3-user-....etc.
querySuffix is optional; [searchIn,extractFrom] is a valid chain link
Abbreviation for the first chain link: [extractFrom]
Abbreviation for the last chain link: [searchIn]
1. The system searches for the specified query (&q) or, if this is not the first chain link, for the values extracted in the previous chain link. These values are searched in the index under the attribute specified with 'searchIn' and are boosted according to their counts. You can append a query specified with 'querySuffix' to this search, e.g. to filter out unwanted documents in this hop. An example: search in the general 'id' field of the index, but consider only documents of type 'user' (querySuffix could be '+Content-Type:user'). The parameter '&chainRows' is also applied to prune the result list. If there is a succeeding chain link, processing continues with point 2. Otherwise, the current result list is the final result, subject to the &fl, &fq and &rows parameters.
2. Next, the system takes the result list from point 1 and extracts all values from its documents under the attribute specified with 'extractFrom'. The values are counted; the counts act as scores and thus give the values a meaningful order. Normally, these values are ids of other entities, e.g. userIds or itemIds. The parameter '&chainRows' is applied to prune the number of extracted values. In any case, the system then moves on to the next chain link and processes point 1 again.
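The alternation between point 1 (search) and point 2 (extract and count) can be illustrated with a small in-memory sketch. It is not the module's implementation: the function `run_chain`, the sample documents and the field names `id` and `likes` are hypothetical, the index is replaced by a list of dictionaries, and 'querySuffix' filtering is left out for brevity.

```python
from collections import Counter

# Hypothetical mini "index": every field holds a list of values.
DOCS = [
    {"id": ["u1"], "likes": ["i1", "i2"]},
    {"id": ["u2"], "likes": ["i1", "i2"]},
    {"id": ["u3"], "likes": ["i2"]},
    {"id": ["u4"], "likes": ["i9"]},
]

def run_chain(docs, query_values, chain, chain_rows=10):
    """Follow a chain of (search_in, extract_from) links over in-memory docs.

    extract_from is None for the last chain link.
    """
    # Values to search for, mapped to their boost (count).
    weights = Counter({value: 1 for value in query_values})
    results = []
    for search_in, extract_from in chain:
        # Point 1: search the current values under 'search_in';
        # a document scores the sum of the boosts of its matching values.
        scored = []
        for doc in docs:
            score = sum(weights[value] for value in doc.get(search_in, []))
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results = [doc for _, doc in scored[:chain_rows]]  # prune like &chainRows
        if extract_from is None:
            break  # last chain link: the current result list is final
        # Point 2: extract and count the values under 'extract_from';
        # the counts become the boosts for the next hop.
        counts = Counter(v for doc in results for v in doc.get(extract_from, []))
        weights = Counter(dict(counts.most_common(chain_rows)))
    return results

# user -> item -> user: who likes the items that u1 likes?
similar_users = run_chain(DOCS, ["u1"], [("id", "likes"), ("likes", None)])
print([doc["id"][0] for doc in similar_users])
```

In this sketch the querying user is part of the final list; in a real request, a filter such as &fq would be the place to exclude it.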
The DynaQ module ContextDocsSearchComponent lets you contextualize your queries with certain documents that describe the topic/context you want to search for. For example, you want to search inside the domain of fish, and you have a huge index of pet forenames. You search for 'harry' and receive birds, cats, and fish. Instead of adding a new search term 'fish', you can set one or more (possibly preconfigured) fish documents as context. The fish named 'harry' will then appear at the top of your result list. Or, if you no longer specify 'harry', you will receive any fish documents in your corpus (fuzzy), by way of a statistical document similarity search.
Sometimes you don't want Solr's default behaviour of loading all attributes of a document into the response when you specify nothing for the fl parameter (&fl=). If you want to load nothing in this case, you can use the EmptyFlResponseCleaner module. It simply strips all document attribute entries from the response before the server sends it back to the client.
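The effect of the cleaner can be pictured on a JSON-style Solr response. The following is only a sketch of the described behaviour, not the module's code: the function name `clean_response` is hypothetical, and it assumes that the documents stay in the list while their attribute entries are removed.

```python
def clean_response(params, response):
    """Strip all fields from the result documents when fl is missing or empty."""
    fl = params.get("fl", "").strip()
    if fl:
        return response  # an explicit field list was given: leave untouched
    cleaned = dict(response)
    body = dict(response.get("response", {}))
    # Assumption: keep one (now empty) entry per document.
    body["docs"] = [{} for _ in body.get("docs", [])]
    cleaned["response"] = body
    return cleaned

sample = {
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [{"id": "a", "title": "Harry"}, {"id": "b", "title": "Sally"}],
    },
}
print(clean_response({"q": "harry", "fl": ""}, sample)["response"]["docs"])
```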
EmptyFlResponseCleaner is configured as last-component:
Looking inside one or more documents to find out what their topics are is sometimes too time-consuming; a rough overview is often enough. The document summarization module serves this purpose. You specify one or more documents as a search query. The summarizer takes the result list, looks into the documents and extracts characteristic terms from them. The resulting list of buzzwords can be used as a summary.
Enable the module in your solrconfig.xml:
<searchComponent name="relevantTermsSummarizer" enable="true" class="de.dfki.km.dynaq.docgroups.RelevantTermsSummarizer"></searchComponent>

<requestHandler name="/docgroups/relevantTerms" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="rows">10000</str>
    <str name="fl">dataEntityId</str>
  </lst>
  <lst name="invariants">
    <!-- specify the attributes the buzzwords should be extracted from -->
    <str name="mlt.fl">title,body</str>
    <bool name="mlt.boost">true</bool>
  </lst>
  <arr name="last-components">
    <str>relevantTermsSummarizer</str>
  </arr>
</requestHandler>
maxIntTerms: the maximum number of potentially interesting terms the system should extract. Default: 42
rows: the length of the result list, i.e. the number of top documents that are considered for extraction
Furthermore, all parameters of the Solr MoreLikeThis component can be specified
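A request to the handler configured above could be assembled as in the following sketch. The host, port and core name in `BASE_URL`, the helper `relevant_terms_url` and the sample query `dataEntityId:doc42` are assumptions for illustration; only the handler path and the parameters `rows` and `maxIntTerms` come from this section, and `wt=json` is the standard Solr response-format parameter.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and core to your installation.
BASE_URL = "http://localhost:8983/solr/mycore"

def relevant_terms_url(query, rows=100, max_int_terms=42):
    """Build a request URL for the /docgroups/relevantTerms handler."""
    params = {
        "q": query,                    # the document(s) to summarize
        "rows": rows,                  # number of top documents to consider
        "maxIntTerms": max_int_terms,  # maximum number of extracted terms
        "wt": "json",
    }
    return f"{BASE_URL}/docgroups/relevantTerms?{urlencode(params)}"

url = relevant_terms_url("dataEntityId:doc42", rows=1)
print(url)
```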
The LastQueriesWarmer module offers a way to warm up your Solr caches after a server restart. The module remembers the most recent frequently used queries and saves them at server shutdown under <yourHomeDir>/.dynaq4solr/warming. After starting the server, you can replay them for cache warmup by invoking the script <zip>/bin/cachewarming.sh.
LastQueriesWarmer does not remember all query parameters, only some basic default ones, i.e. q, df, fl, fq, rows, rq, sort, start. This avoids accidentally remembering internal Solr parameters. If you need to store further parameters, you can configure them. You can also adjust the cache size, i.e. how many queries the module remembers.
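The parameter filtering described above amounts to a whitelist. The following sketch shows the idea only; the function `params_to_remember` and its `extra_params` argument are hypothetical names and say nothing about how the module is actually configured.

```python
# Parameters remembered by default, as listed above.
DEFAULT_PARAMS = ("q", "df", "fl", "fq", "rows", "rq", "sort", "start")

def params_to_remember(request_params, extra_params=()):
    """Keep only whitelisted parameters of a request."""
    allowed = set(DEFAULT_PARAMS) | set(extra_params)
    return {name: value for name, value in request_params.items() if name in allowed}

request = {"q": "harry", "rows": "10", "shards.purpose": "4", "debug": "true"}
print(params_to_remember(request))
print(params_to_remember(request, extra_params=["debug"]))
```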
Enable and configure the module inside solrconfig.xml as follows:
The DynaQ Solr modules offer a component for trend mining and forecasting. It provides a more abstract view of the corpus through trend mining analysis. The module takes a trend definition specified by a regular query, potentially contextualized by documents from the contextualization module. In addition, it offers parameters for the time range that should be analyzed.
The result is a data container (e.g. JSON) that contains the results of the time series analysis. These comprise calculation results for every time segment considered inside the analysis range, together with aggregated calculations for the whole range.
Data for each single time range segment:
Amplitude at the given time: number of results, share of results (percentage), sum of relevancies, average relevancy
Ids of the documents found similar to the trend, together with their score and further metadata
Growth (slope, first derivative) in this segment, for both result count and relevancies
Momentum (second derivative) in this segment
Aggregated data for the whole time series analysis:
Overall amplitude of the trend: number of results, share of results (percentage), sum of relevancies, average relevancy
Slope average, for both result count and relevancies
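To make the terms growth, momentum and slope average concrete, the sketch below computes them from a list of per-segment result counts using plain finite differences. This is an assumption for illustration; the module's exact formulas are not given here, and the sample counts are made up.

```python
def differences(values):
    """Differences between consecutive values (discrete derivative)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Hypothetical number of results in four consecutive time segments.
counts = [2, 4, 9, 11]

growth = differences(counts)       # first derivative: change per segment
momentum = differences(growth)     # second derivative: change of the growth
slope_average = sum(growth) / len(growth)
total = sum(counts)
shares = [100 * count / total for count in counts]  # share of results in percent

print(growth, momentum, slope_average)
```

The same calculation applies to the relevancy sums when they are requested along with the result counts.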
range: the time range inside the corpus that should be analyzed. You can use standard Solr date syntax or DynaQ date syntax, depending on which one you use in the index for this field. DynaQ syntax is [-]yyyyMMddhhmmssSSS, where y can be repeated as often as you want.
granularity=: the overall time range is split into segments for calculation, and you get analysis results for each segment. This parameter specifies the length of the time segments. Possible values for the time unit: w(eeks), d(ays), h(ours), m(inutes), s(econds), [ms or S] (milliseconds), p(ercentage of the whole range), M(onths), e.g. 5w => five weeks
slicedata=[true|false]: if true, the data for all slices/time segments is returned. If false, only the overall information is added to the response. Default is true, so you can plot a time series graph.
relevanciesAndDocs=[true|false]: by default, the document ids and the relevancy data are not considered in the calculation, for performance reasons. If you want them anyway, specify relevanciesAndDocs=true
predictionLength=: enables forecasting. The response gets additional slices/time segments for the time after the specified analysis range. This parameter specifies the number of additional time segments the system extrapolates into the future
timeSeries=: instead of triggering a trend analysis and then extrapolating the resulting time series into the future, you can also specify a time series directly with this parameter, skipping the trend analysis step
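The granularity units are case-sensitive (m for minutes versus M for months, s for seconds versus S for milliseconds). The following sketch shows how such a value could be interpreted on the client side, for example to label the axis of a time series graph. The function `parse_granularity` is hypothetical, it assumes integer amounts, and it deliberately rejects months because they have no fixed length.

```python
import re
from datetime import timedelta

# Units with a fixed length; 'p' and 'M' are handled separately.
_FIXED_UNITS = {
    "w": timedelta(weeks=1),
    "d": timedelta(days=1),
    "h": timedelta(hours=1),
    "m": timedelta(minutes=1),
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
    "S": timedelta(milliseconds=1),
}

def parse_granularity(value, whole_range=None):
    """Turn a value such as '5w' into a timedelta.

    whole_range (a timedelta) is needed for the percentage unit 'p'.
    """
    match = re.fullmatch(r"(\d+)(ms|[wdhmsSpM])", value)
    if not match:
        raise ValueError(f"unsupported granularity: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit in _FIXED_UNITS:
        return amount * _FIXED_UNITS[unit]
    if unit == "p":
        if whole_range is None:
            raise ValueError("percentage granularity needs the whole range")
        return whole_range * amount / 100
    # 'M': calendar months depend on the start date, so no fixed timedelta.
    raise ValueError("months have no fixed length; resolve them against a calendar")

print(parse_granularity("5w"))
print(parse_granularity("10p", whole_range=timedelta(days=10)))
```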