Changes

Christian Reuschling · f8e42c59
--- a/Documentation.md
+++ b/Documentation.md
@@ -4,7 +4,7 @@

 GenIe can be used standalone or programmatically as a java library. For both possibilities, the genetic run is parameterized inside the [config file](https://git.opendfki.de/reuschling/genie/-/blob/main/geneticOptimization.conf). Have a look inside the config file for detailed documentation of all that can be adjusted.

-### Standalone use
+## Standalone use

 For using GenIe standalone, you have to specify an exec call for the fitness function. This executable will be called for each candidate vector evaluation, receiving the candidate vector values as invocation arguments. The last invocation argument will be the parents metadata as Json string.

@@ -27,7 +27,7 @@ Be aware that starting a whole process for a vector fitness evaluation can lead
 An example fitness function shell script can be shown [here](https://git.opendfki.de/reuschling/genie/-/blob/main/sum23TestFitnessFunction.sh).


-### Programmatic use
+## Programmatic use

 You can use GenIe as java library to optimize parameters programmatically. For this, you have the possibility to start the optimization with a method call. Per default, the settings inside the specified config file are used, even including the exec cost function call. Further, you can also set the cost function and adjust the config file settings by code.

@@ -78,8 +78,224 @@ During the genetic call so called ['gods'](https://git.opendfki.de/reuschling/ge
 genIe.addEvolutionGod(GeneticParamOptimizerGod evolutionGod)
 ```

-### Result with entropy analysis
+## Result interpretation with entropy analysis

-TODO like in dq4es and maybe pwcSmartSearch
+During optimization, the software reports further information about the parameters being optimized. This includes hints on how strongly a single parameter affects the final search result quality. It gives you an overview of the relevance of each parameter, e.g. for deciding whether to remove unnecessary parameters altogether.

-TODO auch noch mal den entropyImpact und StdDevImpact erklären
\ No newline at end of file
+Further, the system reports on the optimization process itself and on how meaningful the analysis results for single parameters are.
+
+### Convergence status
+You can check whether the optimization converges by enabling the monitor GUI for the maximum and average generation fitness in the config file with `showMonitorGui=true`. Here is an example:
+
+![populationFitnessCurve](uploads/84187460c39fc1a76340a41abb4e4588/populationFitnessCurve.png)
+
+In this view, you can see whether the optimization works in general, and when to stop. Kudos to the great underlying evolutionary [watchmaker library](https://watchmaker.uncommons.org/).
+
+
+
+### Independence analysis
+
+The so-called 'independence analysis' optimizes each parameter on its own under the assumption (wrong in most cases) that the parameters are independent. Nevertheless, it can answer basic questions such as the minimum and maximum relevance of a parameter, and it can highlight parameters that are potentially broken.
+
+The 'independence analysis' is more or less a naive loop over the range values of a single parameter, while every other parameter is fixed to a static default value (the first value of its specified value range). The resulting fitness values are collected and visualized:
+
+![independenceAnalysis](uploads/149e08a25131cc9988bab3bbd897634b/independenceAnalysis.png)
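
The loop described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not GenIe's actual implementation: the class `IndependenceSketch`, the method `analyze`, the toy fitness function and the value range `0, 1, 3, 6` are all hypothetical and only introduced here.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch, not part of GenIe.
class IndependenceSketch {

    /**
     * Varies one parameter over its value range while all other parameters
     * stay at their default (the first value of their range).
     * Returns fitness[p][v] for parameter p and its v-th range value.
     */
    static double[][] analyze(double[][] ranges, ToDoubleFunction<double[]> fitness) {
        double[][] result = new double[ranges.length][];
        for (int p = 0; p < ranges.length; p++) {
            result[p] = new double[ranges[p].length];
            for (int v = 0; v < ranges[p].length; v++) {
                // Start from the defaults of all parameters ...
                double[] candidate = new double[ranges.length];
                for (int i = 0; i < ranges.length; i++) {
                    candidate[i] = ranges[i][0];
                }
                // ... and override only the parameter under examination.
                candidate[p] = ranges[p][v];
                result[p][v] = fitness.applyAsDouble(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example ranges for two weighting parameters.
        double[][] ranges = { {0, 1, 3, 6}, {0, 1, 3, 6} };
        // Toy fitness standing in for e.g. an nDCG evaluation: it depends only
        // on which weight is non-zero, not on the size of the weight.
        ToDoubleFunction<double[]> fitness =
                c -> c[0] > 0 ? 0.7 : (c[1] > 0 ? 0.2 : 0.0);
        for (double[] row : analyze(ranges, fitness)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With this toy fitness, every non-zero weight of a parameter yields the same value, which mirrors the constant histogram rows discussed for the nDCG example.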
+
+The attributes are sorted by their best fitness value; parameters that achieved better results are at the top. The fitness diversity columns 'EntropyImpact', 'StdDeviation' and 'DeviationImpact' show entropy-based and standard-deviation-based values calculated from the fitness heatmap histograms on the right. The contents of the histograms depend on the configured value ranges of the parameters. While one parameter is examined independently, all other parameters are set to their default value (here always zero).
+
+In this example, we optimized the attribute weightings of a full-text search; the fitness values are nDCG values calculated from the result lists. These values show the quality of the document order inside the result lists against a given ground truth.
+
+If you change the attribute weighting within the value range of one parameter and leave all others at 0, the absolute scores of the documents in the result lists change, but their sort order remains identical. The nDCG values derived from that order thus remain constant, which is reflected in the identical values for a parameter within its histogram. The entropy derived from the histogram is independent of the absolute numbers in it, so the entropies are largely constant across the parameters. The resulting entropy impact, which expresses the inverted entropy as a percentage (1 - entropy, normalized to [0, 1]), is therefore constant as well.
+
+The standard deviation and the deviation impact (the percentage equivalent of the entropy impact) are based on the absolute fitness values and are thus ordered like the best fitness values.
+
+The Independence Analysis can give hints on different facts, which we present as a checklist:
+
+**1. Identify potentially relevant attributes**
+
+Attributes that lead to high quality when controlled individually most probably also contribute to high quality in combination with other aspects. The ranking by maximum relevance of the individually controlled aspects can be found in the column 'BestFitness'.
+
+
+**2. Order of cumulative quality of aspects of groups with the same content base**
+
+If aspects are in groups with the same content base, a statement can be made about the average quality of the groups in relation to each other. The order of average relevance determined in this way is thus as follows:
+
+   1. title (average nDCG: 0.6708)
+   2. body (average nDCG: 0.58286)
+   3. description (average nDCG: 0.4663)
+   4. attachmentBody (average nDCG: 0.2434)
+   5. attachmentName (average nDCG: 0.128)
+
+
+**3. Direct correlations between certain values and their quality**
+
+In the heat map, those values of an aspect stand out which, when viewed individually, have led to high quality. However, these are constant in the current analysis example, since the nDCG as fitness is independent of the weighting set for an individual parameter.
+
+If there is an order in the configured value ranges, it can also be determined here whether this order correlates with the achieved quality. For example, a high weight could always lead to a high quality and vice versa.
+
+
+**4. Qualitative equivalence of attributes of a content group**
+
+In our analysis, all aspects within a content group are also grouped together in their values for bestFitness. Since these attributes interact directly, it is to be expected that the system will switch the top weightings between these attributes during genetic optimization. For example, the almost identical influence of all 'title' parameters can lead to a situation where a different title attribute is weighted up in each of the optimized vectors.
+
+
+**5. Identify potentially irrelevant attributes**
+
+The two aspects 'title.raw' and 'attachments.attachmentName' each return fitness values (here: nDCGs) of 0 - i.e. they do not deliver (correct) result documents at all. This is probably an error in their configuration inside the underlying system that should be optimized. In any case, they do not contribute to quality and can be safely removed from the system.
+
+'attachmentBody' and 'attachmentName', in all their variants, obtain only small fitness values. These aspects are potentially not needed to obtain good results. However, their relevance may only become apparent in combination with other aspects, so it seems worthwhile to examine the optimized fitness values without these aspects.
+
+
+**6. Maximum quality of the individual examination**
+
+The maximum fitness value (here: nDCG) of the genetic optimization was ~0.9 in this case, which no single parameter comes close to reaching on its own. A clever combination of several aspects is required. If this were not the case and one or a few parameters led to maximum quality on their own, all other attributes could potentially be deleted.
+
+**7. Identifying further correlations**
+
+Special circumstances of the current parameters can also be reflected in the Independence Analysis. In this case we have implemented the same dedicated searching techniques `stemming (en/de), word trigrams and nGrams` for different attributes to improve the search.
+
+From the arrangement of the average fitness of these implemented techniques within the content groups you can see a slight tendency:
+
+* title: de=>en=>original=>trigram=>ngram
+* body: en=>en=>trigram=>original=>ngram
+* description: en=>en=>trigram=>original=>ngram
+* attachmentBody: trigram=>original=>ngram
+
+All techniques used achieve approximately the same quality as the simple full-text search, in addition to the other advantages they offer, e.g. the matching of partial word terms in nGrams.
+
+
+### Analysis of the most significant candidate vectors
+
+The genetic optimization result is a vector that consists of the best parameter values determined. Each index in the vector represents an optimized attribute-value pair. During the optimization, several candidate vectors are created and evaluated; later generations contain, on average, better candidate vectors than the first ones.
+
+There are two different sets of high-quality candidate vectors that are considered in this analysis:
+
+1. The winners of the last generations. As the population fitness curve shows, once the optimization has converged, all of these vectors lead to good quality. Nevertheless, they normally differ slightly, and these differences are analyzed.
+2. The last generation. The average fitness has converged, and all vectors tend to represent decent, stable results. The assumption here is similar to the one for the winners of the last generations. The results should more or less coincide and can verify the findings obtained from the last generations. Nevertheless, the vector generation introduces some randomness, so the analysis result from the last generation is less significant than the one from the generation winners.
+
+Here is an example where we consider the last 23 winner vectors:
+
+![last23GenerationWinners_entropyAnalysis_highGranularity](uploads/ddcfa084a642ef131590c4d524c3cd45/last23GenerationWinners_entropyAnalysis_highGranularity.png)
+
+
+The best vector of the last generation is highlighted. As a final, optimized vector, it represents the final result of the genetic optimization.
+
+
+The winning vectors show a rather clear picture, with a tendency that is also reflected in the last generation vectors (analysis not shown). The weightings selected for the upper attributes are quite homogeneous across most of the high-quality vectors (low entropy), so the statement about the selected values of these attributes seems well founded. For attributes whose values are widely spread over the value range, the selected weighting appears less relevant for achieving quality. These attributes can potentially be removed.
+
+Also for this entropy analysis, a checklist was developed for the interpretation of the results:
+
+
+**1. Potential clustering of adjacent values in the histogram heatmap**
+
+For some aspects, clusters can be seen in the histogram: neighbouring values occur with similar frequency, while more distant values do not. Here the weights only jump within a small range, and within this range the values appear to be largely equivalent. However, these jumps influence the calculated entropy. Since the entropy is computed over discrete values, it increases, although the heat map shows a different tendency.
+
+For a sharper and clearer picture, the granularity of the discrete values should be changed - the distance between the weighting values should be increased, to a new discrete value range. Care should be taken to ensure that the resulting optimized quality does not deteriorate.
+
+
+**2. Identifying the meaningful attributes**
+
+Most high-quality vectors contain the identified optimal value for these attributes. This is reflected in a low entropy of the frequency distribution of their values. If the entropy impact is large, the identified values appear to be relevant for achieving quality.
+
+The following aspects have been identified as stably relevant in the last winning vectors:
+`description.ngram, body.ngram, title, tagValue, title.stemmed_de, attachments.attachmentBody.ngram, attachments.attachmentName.tokenized, title.ngram, description`.
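
To make the relation between the value frequencies and the entropy impact concrete, here is a minimal Java sketch. It assumes that the entropy is the Shannon entropy of the frequency distribution, normalized by the logarithm of the value range size, and that the impact is one minus that value. The exact formula GenIe uses is not stated here, so the class `EntropyImpactSketch`, its methods and the range size of 4 are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of GenIe.
class EntropyImpactSketch {

    /**
     * Shannon entropy of the frequency distribution of the chosen values,
     * divided by log(rangeSize) so that the result lies in [0, 1].
     * (Assumed normalization.)
     */
    static double normalizedEntropy(int[] chosenValues, int rangeSize) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int value : chosenValues) {
            counts.merge(value, 1, Integer::sum);
        }
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / chosenValues.length;
            entropy -= p * Math.log(p);
        }
        return rangeSize > 1 ? entropy / Math.log(rangeSize) : 0.0;
    }

    /** Inverted entropy: 1 means one stable value, 0 means evenly spread values. */
    static double entropyImpact(int[] chosenValues, int rangeSize) {
        return 1.0 - normalizedEntropy(chosenValues, rangeSize);
    }

    public static void main(String[] args) {
        // Values of one attribute across four high-quality vectors (made-up data).
        int[] stable = {6, 6, 6, 6};
        int[] jumping = {0, 1, 3, 6};
        System.out.println("stable:  " + entropyImpact(stable, 4));
        System.out.println("jumping: " + entropyImpact(jumping, 4));
    }
}
```

An attribute that takes the same value in every high-quality vector gets the maximum impact, while an attribute whose values are spread evenly over the range gets an impact close to zero.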
+
+
+
+**3. Identifying attributes that are not very meaningful**
+
+No tendency towards an optimal value can be identified. In the high-quality vectors, the values for these attributes tend to jump arbitrarily across their value range. If the entropy impact is small, i.e. the entropy is large, the identified values appear to be of little relevance for achieving the optimized quality.
+
+These attributes can potentially be removed. However, this should be verified in a further optimization run to exclude possible but relevant interrelationships between the attributes.
+
+The following aspects have been identified as tending to be unstable in recent winning vectors: `description.word_trigram, body.word_trigram, title.raw, body, body.stemmed_en, description.stemmed_de, body.stemmed_de, attachments.attachmentName`.
+
+
+
+**4. Potential order in the identified values**
+
+Here the attributes are sorted according to their identified optimal value. There is an order in the identified values if there is an order of the values in the given value ranges. If different aspects share the same value ranges, the values can be compared between the aspects. This applies, for example, to the optimization of weighting vectors, which is the case here. In this case, the identified weight of a parameter can be used to make a statement about its overall relevance. Parameters that have been weighted low or even with 0 can potentially be removed while maintaining the same quality. This should also be verified with a subsequent optimization run.
+
+The identified order of the final optimized parameter set - the last winning vector - is as follows:
+
+```
+1.  title.stemmed_de: 6
+2.  title.stemmed_en: 6
+3.  description.word_trigram: 6
+4.  body.stemmed_de: 6
+5.  title.raw: 6
+6.  attachments.attachmentBody.word_trigram: 6
+7.  attachments.attachmentName: 6
+8.  title: 3
+9.  title.ngram: 3
+10. description.ngram: 3
+11. description.stemmed_en: 3
+12. body.ngram: 3
+13. tagValue: 3
+14. title.word_trigram: 1
+15. description: 1
+16. body.stemmed_en: 1
+17. description.stemmed_de: 0
+18. body: 0
+19. body.word_trigram: 0
+20. attachments.attachmentBody: 0
+21. attachments.attachmentBody.ngram: 0
+22. attachments.attachmentName.tokenized: 0
+```
+
+If one looks only at the accumulations of the weightings, which appear as hotspots in the heat map, the following order results from the last winning vectors. It takes the significance of the entropy into account:
+
+```
+g  1. title: 6
+g  2. tagValue: 6
+v  3. title.stemmed_de: 6
+   4. description: 6
+gv 5. description.ngram: 3
+gv 6. body.ngram: 3
+gv 7. title.ngram: 3
+g  8. attachments.attachmentBody.ngram: 1
+   9. attachments.attachmentName.tokenized: 1
+```
+
+Values marked with 'g' correspond to entropy values from the last generation. All attributes with a strong entropy impact in the last generation are represented, and their identified values match as well.
+
+Values marked with 'v' correspond to values from the final winning vector. The coverage is quite high; 100% coverage cannot be expected due to interrelations between the attributes. Furthermore, when no number of elite vectors is configured, the winner vectors tend to differ between generations, which should counteract convergence to local maxima.
+
+These are the most robust statements of the analysis.
+
+**5. Potential order in the cumulated values of groups with the same content base**
+
+Aspects that belong to the same group implicitly share their weighting to a certain extent, because their content is similar. If, for example, two variants of the group 'title' are weighted highly, this doubles the relevance of this group compared to the other groups. This applies generally to attributes with interrelationships: the decisive relevance for achieving quality can be a shared one. If the weightings of all aspects of a group are accumulated, we get an order for the relevance of the groups. This is a statement similar to the one from the Independence Analysis, but a more profound one, because the converged vectors take the interrelations among all attributes into account.
+
+The optimized order of the relevance of groups with identical content bases is as follows:
+
+Final winner vector:
+```
+1. title: 25
+2. description: 13
+3. body: 10
+4. attachmentBody: 6
+5. attachmentName: 6
+6. tagValue: 3
+```
+
+Heatmap-Hotspots:
+```
+1. title: 15
+2. description: 9
+3. tagValue: 6
+4. body: 3
+5. attachmentBody: 1
+6. attachmentName: 1
+```
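
The accumulation per content group can be reproduced with a few lines of Java. This is only a sketch: the class `GroupWeightSketch` and the rule for deriving a group from an attribute name (drop a leading `attachments.`, then keep the first name segment) are assumptions made for this example, while the weights are the ones of the final winner vector listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of GenIe.
class GroupWeightSketch {

    /** Assumed grouping rule: drop an "attachments." prefix, keep the first segment. */
    static String groupOf(String attribute) {
        String name = attribute.startsWith("attachments.")
                ? attribute.substring("attachments.".length())
                : attribute;
        int dot = name.indexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    /** Sums the weights of all attributes that belong to the same content group. */
    static Map<String, Integer> accumulate(Map<String, Integer> weights) {
        Map<String, Integer> sums = new TreeMap<>();
        weights.forEach((attribute, weight) ->
                sums.merge(groupOf(attribute), weight, Integer::sum));
        return sums;
    }

    /** Weights of the final winner vector as listed in this document. */
    static Map<String, Integer> finalWinnerVector() {
        Map<String, Integer> w = new LinkedHashMap<>();
        w.put("title.stemmed_de", 6);
        w.put("title.stemmed_en", 6);
        w.put("description.word_trigram", 6);
        w.put("body.stemmed_de", 6);
        w.put("title.raw", 6);
        w.put("attachments.attachmentBody.word_trigram", 6);
        w.put("attachments.attachmentName", 6);
        w.put("title", 3);
        w.put("title.ngram", 3);
        w.put("description.ngram", 3);
        w.put("description.stemmed_en", 3);
        w.put("body.ngram", 3);
        w.put("tagValue", 3);
        w.put("title.word_trigram", 1);
        w.put("description", 1);
        w.put("body.stemmed_en", 1);
        w.put("description.stemmed_de", 0);
        w.put("body", 0);
        w.put("body.word_trigram", 0);
        w.put("attachments.attachmentBody", 0);
        w.put("attachments.attachmentBody.ngram", 0);
        w.put("attachments.attachmentName.tokenized", 0);
        return w;
    }

    public static void main(String[] args) {
        // Print the groups ordered by descending accumulated weight.
        accumulate(finalWinnerVector()).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
```

Applied to the final winner vector, the sums match the first listing (title 25, description 13, body 10, attachmentBody 6, attachmentName 6, tagValue 3).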
+
+
+The order is the same in both cases except for 'tagValue', although the significance of this aspect should be treated with caution due to the low entropy in the result list scores. Compared to the order determined in the Independence Analysis, which does not consider interactions, 'description' and 'body' are swapped.
+
+**6. Potentially multiple deflections for values of one aspect, possibly with correlation to attributes of the same content based group**
+
+This indicates a correlation. For example, attributes that are in the same content group potentially alternate in their weighting. This is especially the case if they have achieved similarly good values in the Independence Analysis.
+
+Of course, more detailed findings here would require a closer analysis of the interrelations between the attributes.
+
+If two or more values are largely equivalent, potential redundancies can be eliminated by removing redundant aspects. Here too, unrecognized effects of interrelationships should be excluded by a subsequent optimization run.
\ No newline at end of file