Consensus Ranking for Increasing Mean Average Precision in Keyword Spotting

Anders Hast
Department of Information Technology, Division of Visual Information and Interaction, Uppsala University, Uppsala, Sweden. anders.hast@it.uu.se

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Amelio, G. Borgefors, A. Hast (eds.): Proceedings of the 2nd International Workshop on Visual Pattern Extraction and Recognition for Cultural Heritage Understanding, Bari, Italy, 29-Jan-2020, published at http://ceur-ws.org

Abstract

Word spotting uses a query word image to find any instances of that word among document images. The obtained list of words is ranked according to similarity to the query word. Ideally, any false positives should only occur at the end of that list. In reality, however, they often occur higher up, which decreases the so-called mean average precision. It is shown how creating new ranked lists by re-scoring using the top n occurrences in the original list, and then fusing the scores, can increase the mean average precision.

1 Introduction

Transcription of large collections of handwritten material is a tedious and costly task. Nevertheless, images of documents can still be made searchable by a technique called keyword spotting, or simply word spotting [GSGN17, LRF+12]. Word spotting can be considered a special case of image retrieval, where the goal is to find all instances of the query word in the document image collection in question. When searching in several document images, using a query word in the form of an example image, there are often some false positives in the retrieved list. Therefore, ranking is performed so that the words most similar to the query word appear at the top of the list, while the most dissimilar ones, still similar enough to be presumed correct, appear at the bottom. In this paper, methods for improving the ranked list are evaluated and discussed. The main idea is to take the top n occurrences in the ranked list and rank the list again against each one of them, thus creating n rankings, which are then fused to create a better overall ranking.

2 Background

Generally, in both information retrieval and object-image retrieval, the retrieved result is ranked according to its relevance. Similarly, the retrieved regions supposedly containing the searched word on all pages are collected and ranked according to their similarity score. This procedure involves pairwise comparison, in which the query word image is compared to all retrieved word images using some scoring function (i.e. computing the distance between feature vectors obtained from the word images) that gives a score, also known as a confidence value, for each comparison, i.e. for each word image. Re-ranking usually refers to the procedure of ranking all over again using a more exact method for computing the score. However, in this work only the situation where the same scoring function is used will be considered, even though there is nothing that prohibits the use of different scoring functions. The use of machine learning techniques to train the ranking model is often referred to as "learning to rank" [LLZ09], which includes methods such as LambdaMART and LambdaRank, the latter being based on RankNet [Bur10]. Relevance feedback [Har92] could also be used with the proposed method, by letting the user choose the relevant words for ranking and fusion.
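To make the pairwise scoring step concrete, the following is a minimal sketch (not the actual engine used in this paper) of how a query word image, represented by a feature vector, could be compared against candidate word images using cosine similarity, producing the kind of ranked list with confidence values that the rest of the paper operates on. The function and variable names are illustrative assumptions.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Rank candidate word images by cosine similarity to the query.

    query_vec:      1-D feature vector of the query word image.
    candidate_vecs: 2-D array with one feature vector per candidate word image.
    Returns a list of (candidate_index, score) sorted by descending score.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                  # cosine similarity acts as the confidence value
    order = np.argsort(-scores)     # highest score (most similar) first
    return [(int(i), float(scores[i])) for i in order]
```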
Normally, some score normalisation procedure is necessary in order to make scores obtained using different scoring functions, e.g. from different information retrieval systems, comparable to each other [WCB06]. This is not necessary here, since the proposed algorithm uses the same scoring function. In information retrieval, data fusion [Wu12] is a technique for combining the ranked lists from several retrieval methods [SF95]. It has been noted that fusing document rankings together can yield a higher level of relevance than any of the individual methods [Sme98, MS05]. It has also been noted that it is in general better to fuse scores rather than ranks [BP09]. One common technique for score-based fusion in information retrieval is CombSUM [BKFS95], where the scores from several lists are simply added together to obtain a new score. By multiplying this score by the number of lists in which the document occurs, i.e. its frequency, CombMNZ is obtained [BC17, WH18]. Herein, the average of the scores for each instance of the word will be used, since the new score will be used for purging those instances that fall under a certain threshold τ. The relevance of documents in an information retrieval system can vary quite a lot and can therefore be considered fuzzy, while a word found in word spotting is either the correct word or not, and can therefore be considered binary. Of course, there are some exceptions, such as differences in spelling or the same word with different semantics. Nevertheless, it is possible to utilise this fact by taking the top n occurrences in the first ranking and ranking the whole list against each one of them, creating n rankings, which are then fused to create a better ranking.

2.1 Fusion for keyword spotting

Rusiñol et al. [RnL14] propose three different data fusion techniques, where variability in writing style is handled by performing multiple queries and then combining the results. Louloudis et al. [LKG12] evaluate three techniques based on the rank position. A systematic evaluation of different fusion techniques was presented by Peña et al. [PnSMVG10]. Wei et al. [WGS15] discuss how query expansion (QE) [AZ12], which produces one ranked list per expanded search, can be used and how the resulting lists can subsequently be merged to obtain the final ranked list. Later on, this method will be compared to the proposed method.

3 Consensus Ranking and Query Expansion

A learning-free and segmentation-free word spotting engine [HF16, HCV18] was used that searches for an example image, referred to as a query word, in one or several document images [HCVA19]. Tentative words that fall below a certain threshold τ are discarded as garbage (i.e. so-called negatives). It is obviously important to set this threshold so that the negatives are mostly true negatives and the positives are mostly true positives. However, due to the nature of the problem there will often be some false positives as well as false negatives. Moreover, the matching of the query word and the found tentative words cannot be too precise, as some words would then be missed, i.e. the recall would be lower. At the same time it cannot be too relaxed, so that many false positives are found, i.e. the precision would be low (these terms are explained in Section 4). In the tests presented, a rather tight threshold was used to find mostly true positives, and QE was then performed on these: new searches are carried out using each found word as a query on the page where it was found, as well as on the previous and following pages. This ensures that QE has enough instances of the word to query on, but not too many, since more queries would increase the search time and also increase the risk of finding false positives.
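As an illustration of the query expansion step described above, the following sketch shows one way it could be organised. The underlying page-level search is treated as a black box: `search_page` is a hypothetical function standing in for the word spotting engine of [HF16, HCV18], and details such as duplicate handling are deliberately simplified.

```python
def query_expansion(query_img, pages, search_page):
    """Word spotting with query expansion over a list of page images.

    search_page(query, page) is a hypothetical interface assumed to return a
    list of (word_image, score) pairs whose scores already exceed the word
    spotting threshold tau.
    """
    # Initial search: run the original query word over every page and
    # remember on which page each tentative word was found.
    found = []
    for i, page in enumerate(pages):
        for word_img, score in search_page(query_img, page):
            found.append((word_img, score, i))

    # Query expansion: use each found word as a new query on its own page
    # and on the previous and following pages.
    pool = {id(w): (w, s) for w, s, _ in found}
    for word_img, _, i in found:
        for j in (i - 1, i, i + 1):
            if 0 <= j < len(pages):
                for new_word, new_score in search_page(word_img, pages[j]):
                    # Simplification: duplicates are detected by object
                    # identity; in practice overlapping detections on a page
                    # would be merged via their bounding boxes.
                    key = id(new_word)
                    if key not in pool or new_score > pool[key][1]:
                        pool[key] = (new_word, new_score)

    # QE ends by ranking the pooled words; here simply by the best score
    # seen so far (the paper ranks against the original query word).
    return sorted(pool.values(), key=lambda ws: -ws[1])
```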
Hence, even if this approach uses the same underlying architecture as a previously published approach [VHF19], it is still different, as that one only used words found on the same page. Nevertheless, QE will increase the recall substantially while still keeping the precision high. The resulting list is ranked depending on the similarity to the original query word only. Therefore, the score for certain words at the end of the ranked list can be lower than the threshold τ used in each individual search. This fact can be utilised to purge from the end of the list all words having a score below τ, or some other threshold chosen depending on the quality of the found words.

The idea behind QE is that the found words contain some variation that helps finding even more words that the single query word could not find, i.e. the score will be quite different depending on the variation. Similarly, the presented consensus ranking utilises this fact: it takes the top n words from the list resulting from the QE search and scores the whole list against each of them, creating n new ranked lists. Hence, each list is a bit different, and by fusing them a better ranked list is obtained.

The above approach is therefore different from the one by Wei et al. [WGS15], where QE produces several lists, i.e. one ranked list per search. These lists are ranked on the similarity to each expanded query word and are subsequently merged to obtain the final ranked list. Each expanded query will generally not find exactly the same words, and therefore each ranked list will be a bit different. When merged, the list will be as complete as possible, but the fusion will depend on how many expanded query searches find each word. In general, the more queries that end up finding a word the better, since the final score will be computed from more lists. However, by taking the top n words in the final list resulting from QE and making n new lists by re-scoring, this problem does not occur. This approach will be referred to as consensus ranking (CR). Nonetheless, one possible drawback is that the QE list is ordered depending on the similarity to the original query word. If this word, for some reason, is not very representative of the words in the document, then the approach by Wei et al. might be better. Therefore, the effect of using their approach to merge the lists resulting from QE, and then using the top n words in that merged list to perform the proposed CR, is also investigated herein. This approach is later referred to as ML-CR.
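The consensus ranking just described can be summarised in a short sketch. It assumes that a pairwise scoring function `score(a, b)` between two word images is available (the same one used by the word spotting engine); the fusion is a simple average of the scores from the n re-scored lists, and entries falling below the threshold τ can afterwards be purged. The interface and names are illustrative, not the exact implementation used in the experiments.

```python
def consensus_ranking(ranked, score, n=12, top_threshold=0.8, tau=0.7):
    """Consensus ranking (CR) of a ranked list from word spotting / QE.

    ranked: list of (word_image, score_vs_query), best first.
    score:  pairwise scoring function between two word images.
    Returns a new list of (word_image, fused_score), purged below tau.
    """
    # Use at most n of the top words, and only those above the stricter
    # threshold, as reference words for re-scoring.
    references = [w for w, s in ranked[:n] if s >= top_threshold]
    if not references:
        return list(ranked)  # nothing reliable to re-score against

    fused = []
    for word, _ in ranked:
        # Score this word against every reference word; fusing the n new
        # rankings then amounts to averaging these scores.
        scores = [score(word, ref) for ref in references]
        fused.append((word, sum(scores) / len(scores)))

    # Re-rank on the fused score and purge likely false positives.
    fused.sort(key=lambda ws: -ws[1])
    return [(w, s) for w, s in fused if s >= tau]
```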
4 Method

First, as a proof of concept for the proposed algorithm, the ranking of digits in the MNIST data set [LCB10] is performed, which showed good performance even when the top n digit images contain some false positives (see Section 5). Then, tests were performed on both the Esposalles data set [RFS+13] and the Bentham data set from the Transcribe Bentham project [MTW11]. Four different words were chosen as query words and the initial search was performed on 10 pages for each data set. For each word, 6 instances were chosen in order to capture some variance in the handwriting. By comparing all words in the ground truth with every other word, a confusion matrix containing scores (similarities) was obtained. For both data sets, the 3 words having the highest mean value in each column of the matrix, i.e. being as similar to the others as possible, were considered representative query words. The 3 words having the lowest mean were also used, to see how the algorithms involved cope with the situation when the query word is less representative, i.e. more difficult, as it might contain strokes from adjacent lines, be written differently or suffer from some kind of degradation. Examples of both variants are shown in Figure 1.

Figure 1: Example of two query words, (a) "Barca" and (b) "reber", and their images used in the tests. The top row shows word images that give higher mAP for word spotting and the bottom row shows words that give lower mAP, since they contain strokes from surrounding lines or are written differently.

Figure 2 shows two examples of word spotting: the word "Barca", which is the abbreviation for "Barcelona" in the Esposalles data set (Figure 2a), and the word "aforesaid" in the Bentham data set (Figure 2b). Here, a background removal technique has been used so that the strokes appear with less disturbing background noise [VHS17]. One can note how the bounding boxes capture the words well even if they vary quite a lot in size and appearance [VH17].

Figure 2: Example of how the word spotting system finds words in (a) the Esposalles data set (the word "Barca" found on one page) and (b) the Bentham data set (the word "aforesaid" found on one page). (Figure best viewed in colour)

4.1 Mean Average Precision

The word spotting algorithm can be evaluated by computing the Precision-Recall curve. Precision is the number of relevant (true positive) objects found divided by the number of all retrieved objects, and Recall is the number of relevant objects found divided by the number of all relevant objects in the set [Pow07]. They are computed as:

Precision = |retrieved ∩ relevant| / |retrieved|    (1)

Recall = |retrieved ∩ relevant| / |relevant|    (2)

Many prefer using a single value for easily comparing different approaches, such as the mean Average Precision (mAP), which corresponds to the area below the Precision-Recall curve and is computed using the following equation:

mAP = ( Σ_{n=1}^{|retrieved|} P@n × r(n) ) / |relevant|    (3)

where P@n is the precision at the n top-most returned results, and r(n) is a binary function indicating whether the n-th item in the returned ranked list is a relevant object (true positive).

5 Results

The results from both the proof of concept and from applying QE and CR on the Esposalles and the Bentham data sets are reported in this section.

5.1 Proof of Concept

The proposed CR algorithm was used for ranking digits in the MNIST data set [LCB10] as a proof of concept. Feature vectors were computed (as in [HLV19]) for 2000 images, containing 200 images for each digit. The set was divided into two disjoint sets with 100 occurrences of each digit in each set. The idea is to rank the images in the second set according to how similar they are to each digit in the first set. This was done by computing the cosine distance (i.e. a score value), so the list contains the nearest neighbours (NN) in descending order of score. The results are shown in Table 1. Here, it was chosen to use a maximum of 12 nearest neighbours having a score equal to or above 0.8. However, it should be noted that for MNIST this did not prevent CR from being performed using some false positives among the top n images. In fact, this was quite common, with an average of 17.14% for the reported tests, which the low mAP also confirms. Despite this, CR still increases the mAP by an average of 12.30%, which demonstrates the efficiency and stability of the proposed algorithm.
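As a concrete illustration of Eq. (3), and of how mAP values such as those in Table 1 could be obtained from a ranked list, the following is a minimal sketch. It takes the relevance indicators r(n) of one ranked list plus the total number of relevant items, returns the average precision for that query, and averages over queries to obtain the mAP. Function names are illustrative.

```python
def average_precision(relevance, num_relevant):
    """Average precision of one ranked list, following Eq. (3).

    relevance:    iterable of 0/1 flags r(n), in ranked order.
    num_relevant: total number of relevant items in the collection.
    """
    hits, total = 0, 0.0
    for n, r in enumerate(relevance, start=1):
        if r:                      # r(n) = 1: the n-th item is a true positive
            hits += 1
            total += hits / n      # precision at n, counted at relevant positions
    return total / num_relevant


def mean_average_precision(ranked_relevances, relevant_counts):
    """Mean of the average precisions over several queries."""
    aps = [average_precision(r, k)
           for r, k in zip(ranked_relevances, relevant_counts)]
    return sum(aps) / len(aps)
```

For example, `average_precision([1, 1, 0, 1], 4)` gives (1/1 + 2/2 + 3/4)/4 ≈ 0.69.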
Table 1: Comparison of mAP for the digits in MNIST between nearest neighbour search (NN) and applying CR on that result. The increase in percentage is reported.

Digit   NN       CR       Increase
0       0.5915   0.6873   16.20%
1       0.5983   0.6540    9.31%
2       0.4514   0.5195   15.10%
3       0.4416   0.5164   16.94%
4       0.3993   0.4008    0.38%
5       0.3430   0.3809   11.07%
6       0.5646   0.6395   13.27%
7       0.4208   0.4443    5.59%
8       0.5400   0.6421   18.92%
9       0.4369   0.5076   16.20%
Mean    0.4787   0.5393   12.30%

5.2 Performance on real data sets

Three different approaches are reported in this section, all of which produce a new ranking after fusion of the results obtained from a word spotting algorithm (subsequently referred to in the tables as WS). Query expansion is performed on the result obtained from WS and scored against the original query word; this approach will be referred to as QE in both figures and tables. The fusion approaches are as follows:

• CR: Consensus ranking: re-score the QE list using the top n words and perform fusion.
• ML: Merge all lists from QE and perform fusion.
• ML-CR: Perform CR on the result from ML.

In the tests, n was chosen to be 12, with the condition that the score for each and every word was higher than a specified threshold, in this case 0.8. Hence, n might be lower than 12 in some cases. This threshold was set a bit higher than the threshold τ = 0.7 for regarding a word as a correct word, in order to be on the safe side when re-scoring. Both conditions assure that the words being used for re-scoring are true positives with high probability. Note that if the QE does a bad job in the first place, this will not hold. However, as shown in the proof of concept, CR can usually still improve the ranking, even if some false positives are found among the top n candidates for re-scoring.

The plot in Figure 3 shows the confidence score (or similarity value) for the ranked lists from QE (magenta), CR (blue), ML (red) and ML-CR (yellow). The y-axis shows the score and the x-axis the position in the list. One can note that the query word used was taken from the same page images, since the highest score is 1.0 (which means it finds itself) and then decays for QE. One can also note that the score for ML is generally higher at the end of the ranked list, since it by definition cannot go below the threshold τ used both for WS and QE. Nevertheless, the score for CR can be lower, since the re-scoring between words can result in low scores even if both are true positives, as they might be rather dissimilar.

Figure 3: Comparison of the ranked lists from QE (magenta), CR (blue), ML (red) and ML-CR (yellow) for (a) "reber" and (b) "Barca". The y-axis shows the score and the x-axis the position in the list (rank). (Figure best viewed in colour)

One advantage with CR and ML-CR is therefore that it is possible to purge the list by removing the entries under a certain threshold. This makes it easier for a user to investigate the result, since obvious false positives are removed. Therefore, the user of a search tool or a transcription tool using word spotting [HV18] only needs to concentrate on a few items (true positives) at the end of the purged list, which might be misranked and end up scattered among the false positives.
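For completeness, the merged-list (ML) fusion used for comparison, in the spirit of Wei et al. [WGS15], can be sketched as follows: each expanded query produces its own ranked list, and the lists are merged by averaging the scores of every word over the lists in which it occurs, after which CR can optionally be applied to the merged result (ML-CR). The dictionary-based merging and the word keys are illustrative simplifications; in practice, detections would be matched by their position on the page.

```python
def merge_lists(qe_lists):
    """ML fusion: merge the ranked lists produced by each expanded query.

    qe_lists: list of ranked lists, each a list of (word_key, score).
    The fused score of a word is the average of its scores over the lists
    in which it occurs.
    Returns a single list of (word_key, fused_score), best first.
    """
    sums, counts = {}, {}
    for ranked in qe_lists:
        for word, score in ranked:
            sums[word] = sums.get(word, 0.0) + score
            counts[word] = counts.get(word, 0) + 1
    merged = [(w, sums[w] / counts[w]) for w in sums]
    merged.sort(key=lambda ws: -ws[1])
    return merged

# ML-CR would then simply apply the consensus_ranking sketch from Section 3
# to the merged list:
#   ml_cr = consensus_ranking(merge_lists(qe_lists), score)
```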
Figures 4 and 5 show comparisons of the output from the word spotting system (WS) (panel a) and the result of applying fusion through consensus ranking (CR) (panel b). The true positives are depicted in green and the false positives in red. Once again, the background removal technique has been used. One can note how the misranked words are all pushed to the bottom of the list. The ranked lists obtained by merging lists (ML) (panel c) and by performing CR on ML (panel d) are also shown. One can note how ML generally performs worse than CR. This is due to the fact that τ was set rather tight in order to avoid false positives, which causes the algorithm to find fewer correct words both in WS and in each single search in the QE. However, in general this method leads to both high recall and high precision. By lowering the threshold τ, ML will perform better, with the drawback of having many more false positives in the final lists. One can note how CR improves the position of misranked words quite a lot, even if the mAP improvement is modest. The results are not always perfect, since the same score function is used both for QE and CR, and it cannot be too precise if it is to still capture the variation in handwriting.

Figure 4: Example of a comparison of the output from the word spotting system using query expansion, searching in 10 pages, where the true positives are depicted in green and the false positives in red. (a) The list ranked using the original query word. (b) The result of applying fusion using the proposed consensus ranking scheme. (c) Fusion by merging lists from each query expansion. (d) Consensus ranking performed on the merged list. (Figure best viewed in colour)

Figure 5: Example of a comparison of the output from the word spotting system using query expansion, searching in 10 pages, where the true positives are depicted in green and the false positives in red. (a) The list ranked using the original query word. (b) The result of applying fusion using the proposed consensus ranking scheme. (c) Fusion by merging lists from each query expansion. (d) Consensus ranking performed on the merged list. (Figure best viewed in colour)

Table 2 shows the result (in mAP) of running the different algorithms on several different words, as explained earlier, for the Esposalles data set. Each row shows the average mAP from using three different query words. One can note how QE improves over WS, but also how CR pushes up the mAP. The increase might seem small, but in reality the effect is that the ranked list will contain a longer sequence of true positives and the false positives will be aggregated at the end of the list, instead of being scattered in the middle of the list. Table 3 shows the effect of choosing words that give worse results, as explained earlier. One can note how WS performs worse, but that QE still deals with that problem in an elegant way. Generally, CR pushes the mAP up higher. Furthermore, Tables 4 and 5 show some results for the Bentham data set.
Table 2: Comparison between different methods, where the average over three representative occurrences of each of four different words from the Esposalles data set is reported. The increase of QE compared with WS, as well as of CR, ML and ML-CR compared with QE, is reported.

Word       WS        QE        CR        ML        ML-CR
Bara       0.844631  0.960884  0.982386  0.869193  0.982389
donsella   0.933588  0.971495  0.980789  0.978913  0.980245
pages      0.952578  0.995402  0.998977  0.999088  0.999105
rebere     0.990066  0.999624  1.000000  0.999674  1.000000
Mean       0.930216  0.981851  0.990538  0.961717  0.990435
Increase             5.55%     0.88%     -2.05%    0.87%

Table 3: Comparison between different methods, where the average over three difficult occurrences of each of four different words from the Esposalles data set is reported. The increase of QE compared with WS, as well as of CR, ML and ML-CR compared with QE, is reported.

Word       WS        QE        CR        ML        ML-CR
Bara       0.116402  0.639233  0.649848  0.656740  0.655781
donsella   0.401411  0.906396  0.911237  0.887814  0.895160
pages      0.293333  0.969244  0.981201  0.982882  0.982378
rebere     0.389439  0.965221  0.970122  0.967734  0.970262
Mean       0.300146  0.870024  0.878102  0.873792  0.875895
Increase             189.87%   0.93%     0.43%     0.67%

Table 4: Comparison between different methods, where the average over three representative occurrences of each of four different words from the Bentham data set is reported. The increase of QE compared with WS, as well as of CR, ML and ML-CR compared with QE, is reported.

Word       WS        QE        CR        ML        ML-CR
aforesaid  0.900281  0.942199  0.956681  0.453054  0.841440
amount     0.750419  0.843464  0.935143  0.644368  0.898775
intended   0.833333  0.900000  0.900000  0.836233  0.900000
Jeremy     0.865481  0.957196  0.971777  0.801528  0.941335
Mean       0.837378  0.910715  0.940900  0.683796  0.895388
Increase             8.76%     3.31%     -24.92%   -1.68%

Table 5: Comparison between different methods, where the average over three difficult occurrences of each of four different words from the Bentham data set is reported. The increase of QE compared with WS, as well as of CR, ML and ML-CR compared with QE, is reported.

Word       WS        QE        CR        ML        ML-CR
aforesaid  0.643265  0.960291  0.969982  0.520125  0.571602
amount     0.387363  0.654970  0.716165  0.518282  0.713422
intended   0.596667  0.633333  0.633333  0.574102  0.633333
Jeremy     0.333333  0.760000  0.760000  0.760000  0.760000
Mean       0.490157  0.752148  0.769870  0.593127  0.669589
Increase             53.45%    2.36%     -21.14%   -10.98%

6 Discussion

The proof of concept showed that the idea generally works very well (the mAP improved by 12.30% on average), even for heavily contaminated sets, such as in the MNIST experiment where only 10% of the ranked list are true positives. Furthermore, the experiment showed that the proposed consensus ranking still works rather well even when some of the items used for the CR are false positives (in this case, 17.14% on average). Hence, a few false positives do not, in general, make the algorithm fail. The increases presented in the tables for the Esposalles and the Bentham data sets in Section 5, using CR after QE, might seem rather low. However, they are important, as they reorder the ranking so that it becomes easier for a human to quickly distinguish true positives from false positives by visual inspection, as shown in the presented images in Section 5. This will be a useful feature both for semi-automatic transcription [HCV18] and for searching non-transcribed document images [HCVA19]. One advantage with CR is that it helps to purge what must be false positives with a high degree of certainty. By inspecting Figure 3 and the ranked lists in Figures 4 and 5, one can see how CR (the blue curve) suddenly drops rapidly, and at the end of that drop there are no true positives anymore. Just a few false positives are found in the drop or above it. This drop can be quite obvious, as in the case of "reber", or more subtle, as for "Barca".
Therefore, it is probably better to set a fixed threshold for purging than to try to locate the "drop". It seems that the threshold generally could be set equal to τ or a bit below. When used in a transcription or search tool, the threshold could be a parameter set by the user with a slider. CR will not substantially increase the mAP when there are very few true positives. This is generally not a big problem, since the retrieved ranked list will consequently be very short anyway. For instance, "intended" has only 10 occurrences in the Bentham data set used herein (the other words have between 12 and 25 occurrences) and the mAP does not improve, as seen in Table 4. In fact, it is already good and could not be improved further, since QE found 9 out of 10 occurrences and ranked them on top.

Overall, the time it takes to perform QE is many times longer than the time it takes to perform CR, due to the word spotting search per page and the fact that QE performs such a search for several words per page. Nevertheless, the time it takes to perform CR is not negligible, especially not when the final list produced by QE is long, i.e. several hundred words. QE always ends by performing one ranking using the original query word to produce a final ranked list; CR then performs up to n such rankings and thus adds up to n times more execution time. The fusion itself is almost negligible, since it is just averaging scores, compared to the work it takes to match images to compute the scores in the first place.

One disappointment is that ML-CR generally did not perform better than CR for the difficult word images. Of course, the tests performed here are quite limited, and running on larger data sets, and also using a different number of top n words, might give different results. Nevertheless, ML generally performs much worse than CR because of the tightly set thresholds in QE, which lower the number of false positives and hence decrease the visual load for the user.

7 Conclusions

The ranking of words found by word spotting is important for tasks like fast transcription or simple searches. The user does not want to be bothered with the many false positives that are often retrieved. Therefore, ranking is very important, pushing the false positives to the end of the list. Moreover, automatically purging the list of the obvious false positives also helps the user. The proposed consensus ranking helps the user with both of these aspects. The idea is simply to take the top n words in the list ranked by matching to the query word, and re-score the whole list using each of these words, giving n new lists that can be fused together to obtain a better ranked list. By averaging the scores, a majority of the false positives will generally obtain lower scores than the threshold used for the word spotting itself, and can therefore be purged.

References

[AZ12] Relja Arandjelović and Andrew Zisserman. Multiple queries for large scale specific object retrieval. In British Machine Vision Conference, 2012.

[BC17] Rodger Benham and J. Shane Culpepper. Risk-reward trade-offs in rank fusion. In Proceedings of the 22nd Australasian Document Computing Symposium, ADCS 2017, pages 1:1–1:8, New York, NY, USA, 2017. ACM.

[BKFS95] N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Information Processing & Management, 31(3):431–448, May 1995.

[BP09] Narayan L. Bhamidipati and Sankar K. Pal. Comparing scores intended for ranking. IEEE Transactions on Knowledge and Data Engineering, 21(1):21–34, January 2009.
[Bur10] Chris J.C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical Report MSR-TR-2010-82, Microsoft Research, June 2010.

[GSGN17] Angelos P. Giotis, Giorgos Sfikas, Basilis Gatos, and Christophoros Nikou. A survey of document image word spotting techniques. Pattern Recognition, 68:310–332, 2017.

[Har92] Donna Harman. Relevance feedback and other query modification techniques. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval, pages 241–263. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.

[HCV18] Anders Hast, Per Cullhed, and Ekta Vats. TexT - text extractor tool for handwritten document transcription and annotation. In Digital Libraries and Multimedia Archives - 14th Italian Research Conference on Digital Libraries, IRCDL 2018, Udine, Italy, January 25-26, 2018, Proceedings, pages 81–92, 2018.

[HCVA19] Anders Hast, Per Cullhed, Ekta Vats, and Matteo Abrate. Making large collections of handwritten material easily accessible and searchable. In Digital Libraries: Supporting Open Science - 15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, January 31 - February 1, 2019, Proceedings, pages 18–28, 2019.

[HF16] Anders Hast and Alicia Fornés. A segmentation-free handwritten word spotting approach by relaxed feature matching. In Document Analysis Systems (DAS), 2016 12th IAPR Workshop on, pages 150–155. IEEE, 2016.

[HLV19] Anders Hast, Mats Lind, and Ekta Vats. Embedded prototype subspace classification: A subspace learning framework. In M. Vento and G. Percannella, editors, Computer Analysis of Images and Patterns: CAIP 2019, Lecture Notes in Computer Science, 2019.

[HV18] Anders Hast and Ekta Vats. An intelligent user interface for efficient semi-automatic transcription of historical handwritten documents. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, IUI '18 Companion, pages 48:1–48:2, New York, NY, USA, 2018. ACM.

[LCB10] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

[LKG12] G. Louloudis, A. L. Kesidis, and B. Gatos. Efficient word retrieval using a multiple ranking combination scheme. In Proceedings of the 2012 10th IAPR International Workshop on Document Analysis Systems, DAS '12, pages 379–383, Washington, DC, USA, 2012. IEEE Computer Society.

[LLZ09] Hang Li, Tie-Yan Liu, and ChengXiang Zhai. Learning to rank for information retrieval (LR4IR 2009). SIGIR Forum, 43(2):41–45, December 2009.

[LRF+12] J. Lladós, M. Rusiñol, A. Fornés, D. Fernández, and A. Dutta. On the influence of word representations for handwritten word spotting in historical documents. International Journal of Pattern Recognition and Artificial Intelligence, 26(05), 2012.

[MS05] Kieran McDonald and Alan F. Smeaton. A comparison of score, rank and probability-based fusion methods for video shot retrieval. In Proceedings of the 4th International Conference on Image and Video Retrieval, CIVR'05, pages 61–70, Berlin, Heidelberg, 2005. Springer-Verlag.

[MTW11] Martin Moyle, Justin Tonra, and Valerie Wallace. Manuscript transcription by crowdsourcing: Transcribe Bentham. Liber Quarterly, 20(3-4), 2011.

[PnSMVG10] Sebastián Peña Saldarriaga, Emmanuel Morin, and Christian Viard-Gaudin. Ranking fusion methods applied to on-line handwriting information retrieval.
In Proceedings of the 32nd European Conference on Advances in Information Retrieval, ECIR 2010, pages 253–264, Berlin, Heidelberg, 2010. Springer-Verlag.

[Pow07] David Powers. Evaluation: From precision, recall and F-factor to ROC, informedness, markedness & correlation. Technical Report SIE-07-001, School of Informatics and Engineering, Flinders University, Adelaide, Australia, December 2007.

[RFS+13] Verónica Romero, Alicia Fornés, Nicolás Serrano, Joan Andreu Sánchez, Alejandro H. Toselli, Volkmar Frinken, Enrique Vidal, and Josep Lladós. The ESPOSALLES database: An ancient marriage license corpus for off-line handwriting recognition. Pattern Recognition, 46(6):1658–1669, 2013.

[RnL14] Marçal Rusiñol and Josep Lladós. Boosting the handwritten word spotting experience by including the user in the loop. Pattern Recognition, 47(3):1063–1072, March 2014.

[SF95] Joseph A. Shaw and Edward A. Fox. Combination of multiple searches. In Donna K. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 105–108, 1995.

[Sme98] Alan F. Smeaton. Independence of contributing retrieval strategies in data fusion for effective information retrieval. In Proceedings of the 20th Annual BCS-IRSG Conference on Information Retrieval Research, IRSG'98, pages 12–12, Swindon, UK, 1998. BCS Learning & Development Ltd.

[VH17] Ekta Vats and Anders Hast. On-the-fly historical handwritten text annotation. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 8, pages 10–14. IEEE, 2017.

[VHF19] Ekta Vats, Anders Hast, and Alicia Fornés. Training-free and segmentation-free word spotting using feature matching and query expansion. In Proceedings of the 15th International Conference on Document Analysis and Recognition, 2019.

[VHS17] Ekta Vats, Anders Hast, and Prashant Singh. Automatic document image binarization using Bayesian optimization. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, pages 89–94. ACM, 2017.

[WCB06] Shengli Wu, Fabio Crestani, and Yaxin Bi. Evaluating score normalization methods in data fusion. In Hwee Tou Ng, Mun-Kew Leong, Min-Yen Kan, and Donghong Ji, editors, Information Retrieval Technology, pages 642–648, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[WGS15] H. Wei, G. Gao, and X. Su. A multiple instances approach to improving keyword spotting on historical Mongolian document images. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 121–125, August 2015.

[WH18] Markus Wegmann and Andreas Henrich. Search for an appropriate journal - in depth evaluation of data fusion techniques. In Rainer Gemulla, Simone Paolo Ponzetto, Christian Bizer, Margret Keuper, and Heiner Stuckenschmidt, editors, Proceedings of the Conference "Lernen, Wissen, Daten, Analysen", LWDA 2018, Mannheim, Germany, August 22-24, 2018, volume 2191 of CEUR Workshop Proceedings, pages 343–354. CEUR-WS.org, 2018.

[Wu12] Shengli Wu. Ranking-Based Fusion, pages 135–147. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.