UAEMex at ImageCLEF 2016: Handwritten Scanned Document Retrieval Task

Miguel Ángel García Calderón, René Arnulfo García Hernández (renearnulfo@hotmail.com), Yulia Ledeneva (yledeneva@yahoo.com)

Autonomous University of the State of Mexico (UAEMex), Mexico

Keywords: Information Retrieval, Longest Common Subsequence, Free Text Search

This paper describes the participation of the Autonomous University of the State of Mexico (UAEMex) in the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task. We propose a skip-character text search method based on the Longest Common Subsequence. Our system splits each query into characters in order to find every Longest Common Subsequence within one line of text.

Introduction

This paper describes the free text search method used by UAEMex at the ImageCLEF 2016 [3] handwritten retrieval task [4]. The first edition of the handwritten retrieval challenge has a single task, targeted at free text search. Working on the transcript text character by character, we use a skip-character text search method based on the Longest Common Subsequence (LCS) problem.

Fixed Gap Longest Common Subsequence

The LCS problem is stated as follows: given two sequences, find the length of the longest subsequence present in both of them. A subsequence of a string is obtained by deleting zero or more symbols (not necessarily consecutive ones) from the string [2]. To extract non-consecutive subsequences, Iliopoulos [1] proposes a variant of LCS called the Fixed Gap Longest Common Subsequence (FGLCS) problem, in which k is the fixed gap constraint and the distance between two consecutive matches is limited to at most k+1. Figure 1 shows an example of LCS and FGLCS searching.
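The difference between the two problems can be illustrated with a minimal Python sketch. It assumes that the gap constraint applies to both strings, and it uses a straightforward dynamic program rather than the optimized algorithm of [1]; the function names lcs_length and fglcs_length are hypothetical and are not taken from the system described here.

```python
def lcs_length(x: str, y: str) -> int:
    """Classic LCS length: matches may be arbitrarily far apart."""
    prev = [0] * (len(y) + 1)
    for xc in x:
        cur = [0]
        for j, yc in enumerate(y, start=1):
            if xc == yc:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def fglcs_length(x: str, y: str, k: int) -> int:
    """Fixed-gap LCS length (naive sketch).

    Assumption: two consecutive matches must be at most k+1 positions
    apart in BOTH x and y.
    """
    # t[i][j] = length of the best fixed-gap common subsequence that
    # ends with the match x[i] == y[j] (0 if the characters differ).
    t = [[0] * len(y) for _ in x]
    best = 0
    for i, xc in enumerate(x):
        for j, yc in enumerate(y):
            if xc != yc:
                continue
            # Look back at most k+1 positions in each string.
            prev = max(
                (t[pi][pj]
                 for pi in range(max(0, i - k - 1), i)
                 for pj in range(max(0, j - k - 1), j)),
                default=0,
            )
            t[i][j] = prev + 1
            best = max(best, t[i][j])
    return best


if __name__ == "__main__":
    print(lcs_length("abc", "axbxc"))       # unconstrained
    print(fglcs_length("abc", "axbxc", 0))  # gap 0: contiguous matches only
    print(fglcs_length("abc", "axbxc", 1))  # gap 1: one skipped character allowed
```

Under this assumption, a gap of 0 reduces the search to the longest common substring, while larger gaps tolerate skipped characters between matches.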

Free Text Search

The proposed method is based on FGLCS search and is divided into three phases. The system is designed to cope with transcriptions that contain incomplete or nonexistent words.

Preprocessing Phase

1. Delete non-alphabetic characters in the transcript file.

2. Delete the line breaks in every segment so that each segment becomes a single line.

3. Split the line into individual characters.
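The three preprocessing steps can be sketched in Python as follows. This is only an illustration: the function name preprocess_segment is hypothetical, and keeping single spaces as word separators is an assumption, since the steps above do not say how whitespace is treated.

```python
import re
from typing import List


def preprocess_segment(segment_text: str) -> List[str]:
    """Turn one multi-line transcript segment into a list of characters."""
    # Step 2: remove line breaks so the segment becomes a single line.
    one_line = " ".join(segment_text.splitlines())
    # Step 1: drop non-alphabetic characters.
    # Assumption: spaces are kept so that word boundaries survive.
    cleaned = "".join(ch for ch in one_line if ch.isalpha() or ch == " ")
    cleaned = re.sub(r" +", " ", cleaned).strip()
    # Step 3: split the line into individual characters.
    return list(cleaned)


if __name__ == "__main__":
    chars = preprocess_segment("the 2 cats,\nsat")
    print(chars)
```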

String Matching Phase

In the first step, each query is split on spaces into words, and the FGLCS is then searched in the current segment for every word in the query.
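A minimal Python sketch of this phase is given below. It assumes that the gap constraint applies to both the query word and the segment text, and that the end position of the best match is recorded for later use; the names best_gapped_match and match_query are hypothetical.

```python
from typing import List, Tuple


def best_gapped_match(word: str, text: str, k: int) -> Tuple[int, int]:
    """Return (length, end index in text) of the best fixed-gap match.

    Assumption: consecutive matched characters are at most k+1 positions
    apart in both strings. The end index is -1 when nothing matches.
    """
    t = [[0] * len(text) for _ in word]
    best_len, best_end = 0, -1
    for i, wc in enumerate(word):
        for j, tc in enumerate(text):
            if wc != tc:
                continue
            prev = max(
                (t[pi][pj]
                 for pi in range(max(0, i - k - 1), i)
                 for pj in range(max(0, j - k - 1), j)),
                default=0,
            )
            t[i][j] = prev + 1
            if t[i][j] > best_len:
                best_len, best_end = t[i][j], j
    return best_len, best_end


def match_query(query: str, segment: str, k: int) -> List[Tuple[str, int, int]]:
    """Split the query on spaces and search every word in the segment."""
    return [(w, *best_gapped_match(w, segment, k))
            for w in query.split(" ") if w]


if __name__ == "__main__":
    # A badly transcribed segment: "grat hose" instead of "great house".
    for word, length, end in match_query("great house", "the grat hose", k=1):
        print(word, length, end)
```

With a gap of 1, both query words in the usage example recover four of their five characters despite the missing letters in the transcription.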

Ranking Phase

Every FGLCS is checked to confirm that its words appear in the same order as in the query; when they do, the confidence score is calculated using equation (1). The system considers a result relevant if its confidence is greater than 0.5.

• q = number of characters in the query

• s = number of characters in the longest sequence found

• c = confidence

c = \frac{s}{q}    (1)

We tested confidence thresholds of 0.9, 0.8, 0.7, 0.6 and 0.5. The best-performing threshold was 0.5.
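The ranking step can be sketched in Python under several assumptions: the confidence is taken to be the ratio of matched characters to query characters, spaces are not counted in the query length, and each query word comes with a hypothetical (word, matched length, end position) tuple produced by the matching phase. The name score_segment is illustrative only.

```python
from typing import List, Tuple

# (query word, number of matched characters, end position in the segment)
WordMatch = Tuple[str, int, int]


def score_segment(matches: List[WordMatch],
                  threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (confidence, is_relevant) for one query on one segment."""
    # Order check: matched words must appear in the same order as in the query.
    ends = [end for _, length, end in matches if length > 0]
    if any(a >= b for a, b in zip(ends, ends[1:])):
        return 0.0, False
    q = sum(len(word) for word, _, _ in matches)  # characters in the query
    s = sum(length for _, length, _ in matches)   # characters matched
    c = s / q if q else 0.0
    # "More than 0.5" is read as a strict inequality.
    return c, c > threshold


if __name__ == "__main__":
    print(score_segment([("great", 4, 7), ("house", 4, 12)]))
    print(score_segment([("great", 4, 12), ("house", 4, 7)]))
```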

Submitted Runs

In this section, the nine free text search runs submitted by UAEMex are presented.

To account for badly transcribed words, we varied the gap value in order to retrieve more words; however, retrieval performance decreased.

Run1: FGLCS search with gap = 0.
Run2: FGLCS search with gap = 1.
Run3: FGLCS search with gap = 2.
Run4: FGLCS search with gap = 3.
Run5: Union of Run1 + Run2.
Run6: Union of Run1 + Run2 + Run3.
Run7: Union of Run1 + Run3.
Run8: Union of Run1 + Run2 + Run3 + Run4.
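How the union runs combine their inputs is not detailed here. One possible reading, shown as a hedged Python sketch, treats each run as a mapping from a (query id, segment id) pair to a confidence and keeps the highest confidence when the same pair occurs in several runs. The data layout, the identifiers and the max rule are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical layout: (query id, segment id) -> confidence
Run = Dict[Tuple[str, str], float]


def union_runs(*runs: Run) -> Run:
    """Merge runs, keeping the highest confidence per (query, segment)."""
    merged: Run = {}
    for run in runs:
        for key, conf in run.items():
            if conf > merged.get(key, float("-inf")):
                merged[key] = conf
    return merged


if __name__ == "__main__":
    run1 = {("q1", "s1"): 0.9}
    run2 = {("q1", "s1"): 0.6, ("q1", "s2"): 0.7}
    print(union_runs(run1, run2))
```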

Results

In this section, the results of the runs submitted by UAEMex are presented. Results marked with '-' could not be analyzed. Only the segment-based measures are included; those for bounding boxes are omitted. The presented results were obtained using only the n-best No. 20 of the n-best files provided by the organizers.

The following four metrics were used to evaluate the accuracy of the submissions: Global Average Precision (Segm_gAP), Mean Average Precision (Segm_mAP), Global Normalized Discounted Cumulative Gain (Segm_gNDCG) and Mean Normalized Discounted Cumulative Gain (Segm_mNDCG) (see Table 1 and Table 2).
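To make the distinction between the mean and global variants concrete, the following Python sketch computes average precision both ways. It assumes that the mean variant averages per-query scores and that the global variant pools the results of all queries into a single list ranked by confidence. The official evaluation is defined by the task organizers [4] and may differ in its details; the function names and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(ranked_relevance: List[bool], total_relevant: int) -> float:
    """Standard average precision over one ranked list."""
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


def mean_and_global_ap(results: Dict[str, List[Tuple[float, bool]]],
                       relevant_counts: Dict[str, int]) -> Tuple[float, float]:
    """results maps query id -> [(confidence, is_relevant), ...]."""
    per_query, pooled = [], []
    for query_id, items in results.items():
        ordered = sorted(items, key=lambda it: it[0], reverse=True)
        per_query.append(
            average_precision([rel for _, rel in ordered],
                              relevant_counts[query_id]))
        pooled.extend(items)
    # Global variant: one ranked list across all queries (assumption).
    pooled.sort(key=lambda it: it[0], reverse=True)
    global_ap = average_precision([rel for _, rel in pooled],
                                  sum(relevant_counts.values()))
    mean_ap = sum(per_query) / len(per_query) if per_query else 0.0
    return mean_ap, global_ap


if __name__ == "__main__":
    results = {"q1": [(0.9, True), (0.6, False)],
               "q2": [(0.8, False), (0.7, True)]}
    print(mean_and_global_ap(results, {"q1": 1, "q2": 1}))
```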

Conclusions

This paper presents results in free text search using LCS and describes the participation of UAEMex in the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task. The proposed method works with both dictionary words and nonexistent words. There are large differences between the results on the development set (Table 1) and on the test set (Table 2). We assume that the poor test results are due to our use of only one of the n-best files provided by the organizers.

Fig. 1. Example of FGLCS with gap 1 and FGLCS with gap 0.

Fig. 2. Example of text segment.

Fig. 3. Example of text segment after line break deletion.

Table 1. The results of the development set.

Run    Segm_gAP  Segm_mAP  Segm_gNDCG  Segm_mNDCG
RUN1   61.11     38.55     69.08       41.69
RUN2   47.61     32.33     59.39       37.56
RUN3   30.22     20.32     43.64       27.11
RUN4   -         -         -           -
RUN5   51.21     36.92     64.55       40.70
RUN6   27.62     19.82     53.82       28.90
RUN7   0.15      1.69      1.93        2.82
RUN8   26.24     19.64     53.37       28.81

Table 2. The results of the test set.

Run    Segm_gAP  Segm_mAP  Segm_gNDCG  Segm_mNDCG
RUN1   0.26      0.39      1.22        0.39
RUN2   -         -         -           -
RUN3   -         -         -           -
RUN4   -         -         -           -
RUN5   3.51      0.94      10.15       1.52
RUN6   -         -         -           -
RUN7   -         -         -           -
RUN8   -         -         -           -

References

[1] Iliopoulos, C.S., Rahman, M.S.: Algorithms for computing variants of the longest common subsequence problem. Theoretical Computer Science 395 (2008)

[2] Lin, H., Lu, M., Fang, J.: An optimal algorithm for the longest common subsequence problem. In: Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, Dallas, TX (1991). doi:10.1109/SPDP.1991.21820

[3] Villegas, M., Müller, H., Seco De Herrera, A., Schaer, R., Bromuri, S., Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A., Dellandrea, E., Gaizauskas, R., Mikolajczyk, K., Puigcerver, J., Toselli, A.H., Sánchez, J.A., Vidal, E.: General Overview of ImageCLEF at the CLEF 2016 Labs. Lecture Notes in Computer Science. Springer International Publishing (2016)

[4] Villegas, M., Puigcerver, J., Toselli, A.H., Sánchez, J.A., Vidal, E.: Overview of the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task. In: CLEF2016 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Évora, Portugal (September 5-8, 2016)