<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INAOE's participation at ImageCLEF 2016: Text Illustration Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Pellegrin</string-name>
          <email>pellegrin@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Pastor Lopez-Monroy</string-name>
          <email>pastor@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugo Jair Escalante</string-name>
          <email>hugojair@inaoep.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Montes-y-Gomez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Instituto Nacional de Astrofísica</institution>
          ,
<addr-line>Óptica y Electrónica (INAOE)</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe the participation of the Language Technologies Lab of INAOE at ImageCLEF 2016 teaser 1: Text Illustration (TI). The goal of the TI task is to find the best image to describe a given document query. The task is evaluated on a dataset of web pages containing both text and images. We address TI as a purely Information Retrieval (IR) task: for a given document query, we search for the most similar web pages and use the images associated with them as illustrations. In this way, queries are used to retrieve related images from web pages, but only the associated images constitute the retrieval result. To this end, we represent web pages and queries using state-of-the-art text representations that allow us to exploit textual or semantic aspects. According to the ImageCLEF 2016 evaluation, the proposed approach achieves the best performance in the TI task.</p>
      </abstract>
      <kwd-group>
        <kwd>text illustration</kwd>
        <kwd>image retrieval</kwd>
        <kwd>document representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Since 2010, ImageCLEF has promoted research into the annotation of images using noisy
web page data. Following the same path, for the 2016 edition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] two new tasks
were introduced as teasers: Text Illustration and Geolocation; this paper focuses
on the former. The goal of the Text Illustration (TI) task is to find the best
illustration, from a set of reference images, for a given text document. Unlike
the problem of illustrating a sentence formed by a few words, TI is a much
more challenging task, because the aim is to illustrate a whole
document (i.e., a web page) covering a number of different topics. In this regard,
the dataset used consists of images embedded in web pages.
      </p>
      <p>
        We address the TI problem as an Information Retrieval (IR) task. The
hypothesis is that related web pages have related images. Thus, the document
queries to be illustrated are matched against a target set of web pages, and
the embedded images of the retrieved web pages serve as illustrations. For this, we bring in two
popular representations from the IR field that do not take into account the visual
characteristics of images. On the one hand, the bag-of-words representation
defines each document as a histogram of word occurrences. On the other hand,
the Word2vec representation incorporates distributional semantics into text
documents through learned word vectors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Finally, as work in progress, we experiment
with a third, novel multimodal representation, where textual and visual
information are combined to produce a multimodal representation of the queries. Such a
representation allows us to retrieve images directly from a reference image dataset.
The official results of the evaluation are encouraging and lay the groundwork
for future avenues of inquiry.
      </p>
      <p>The remainder of the paper is organized as follows. Section 2 describes our
method; Section 3 shows the obtained experimental results; finally, Section 4
presents some conclusions of this work.</p>
    </sec>
    <sec id="sec-2">
      <title>Text illustration using an IR-based approach</title>
      <p>To approach the TI task we consider the following elements in our strategy.
Let Q = {q1, ..., qm} be the set of document queries to be illustrated. Also,
let D = {(d1, I1), ..., (dn, In)} be the collection of web pages, where each pair
consists of a document di and its embedded images Ii. Finally, let V = {w1, ..., wr} be the textual
features extracted from the documents in the reference collection D. The general
process of the proposed approach has two stages. The first consists in representing
each query qj and each document di in the same space R^r. In the second
stage, each query qj ∈ Q is used to retrieve the k most similar web pages
{(dh, Ih) : (dh, Ih) ∈ D} to qj. The final result considers only the Ih elements
as the resultant illustration set. The rest of this section explains the stages in
detail.</p>
      <sec id="sec-2-1">
        <title>Representing documents</title>
        <p>The first stage in our strategy requires computing query vectors qj = ⟨w1, ..., wr⟩
and document vectors di = ⟨w1, ..., wr⟩ in a space R^r. For this, we relied on two
different textual representations exploiting word occurrences (i.e., BoW) and
co-occurrences (i.e., Word2vec) in documents, as described below. Note that
r is defined according to each representation: for BoW, r = |V|;
for Word2vec, r is the number of hidden units used to represent the
learned word vectors.</p>
        <p>
          Bag-of-Words (BoW) Under BoW, each document is represented by
taking each word in the vocabulary as an attribute to build document vectors
di = ⟨w1, ..., wr⟩. Intuitively, the BoW is a histogram representing word
frequencies in each document. The BoW representation was built by filtering out
high-frequency terms and using the TF-IDF (Term Frequency-Inverse Document
Frequency) weighting scheme [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
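        <p>A minimal sketch of this kind of representation, assuming scikit-learn; the max_df threshold below is only a proxy for the filtering of the most frequent terms described in Section 3, not the exact pipeline used for run1:</p>
        <preformat>
# Minimal sketch of a BoW + TF-IDF representation (assumes scikit-learn).
# max_df=0.95 drops terms occurring in more than 95% of the documents,
# a proxy for filtering out the highest-frequency terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["text of web page 1 ...", "text of web page 2 ..."]  # toy corpus

vectorizer = TfidfVectorizer(max_df=0.95)
D = vectorizer.fit_transform(docs)                # document vectors in R^|V|
Q = vectorizer.transform(["document query ..."])  # queries in the same space
</preformat>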
        <p>
          Word2vec Adaptation The purpose of Word2vec is to build accurate
representations of words in a space R^r. The main goal is that semantically related
words should have similar word vectors in R^r [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For instance, the vector for Paris is
close to the vector for Berlin, since both are capitals. Surprisingly, Mikolov et al. (2013)
also showed other generalizations using specific linear operations; for example,
France - Paris + Berlin results in a vector very close to Germany. In this paper,
we exploit word vectors learned from Wikipedia using Word2vec [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
For our experiments, the learned word vectors of each document are used to
compute the average document vector, as in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The idea is that the average of
these word representations should capture rich notions of semantic relatedness
and compositionality of the whole document.
        </p>
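        <p>A minimal sketch of this averaging, assuming gensim and a Word2vec model pretrained on Wikipedia (the file name is hypothetical):</p>
        <preformat>
# Sketch of the averaged-Word2vec document representation (assumes gensim).
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to word vectors pretrained on Wikipedia.
wv = KeyedVectors.load_word2vec_format("wiki_vectors.bin", binary=True)

def doc_vector(tokens):
    """Average the learned vectors of the in-vocabulary words."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)
</preformat>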
      </sec>
      <sec id="sec-2-2">
        <title>Retrieval stage</title>
        <p>In this stage, a document query qj under a specific representation is used to
retrieve a set of relevant items {(dh, Ih) : (dh, Ih) ∈ D}. Note that only the textual
information from web pages and textual queries is used in the retrieval stage,
but the reported results correspond to the images embedded in the retrieved
items. For the retrieval stage we used the cosine similarity measure, which is
defined in Equation 1.</p>
        <p>similarity(qj, di) = cosine(qj, di) = (qj · di) / (||qj|| ||di||)   (1)</p>
        <p>where qj and di are the representations of the document query qj and the i-th
document di from the collection, respectively. The similarity is computed against
all documents in D, and the images associated with the k most similar documents
to qj are then used to illustrate it.</p>
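        <p>A minimal sketch of this retrieval stage (Equation 1 applied against all documents in D), written with numpy only; the data layout is our assumption:</p>
        <preformat>
# Retrieval stage sketch: cosine similarity of a query against every document,
# keeping the images embedded in the k most similar web pages.
import numpy as np

def retrieve_images(q, doc_matrix, image_lists, k=10):
    """q: query vector in R^r; doc_matrix: one document vector per row;
    image_lists: image identifiers embedded in each web page."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q)
    sims = doc_matrix @ q / np.maximum(norms, 1e-12)  # Equation 1
    top = np.argsort(-sims)[:k]                       # k most similar pages
    return [img for i in top for img in image_lists[i]]
</preformat>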
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results</title>
      <p>In this section we present quantitative and qualitative results of the proposed
approach on the TI task.</p>
      <sec id="sec-3-1">
        <title>Quantitative results</title>
        <p>
          Table 1 shows the performance of the proposed representations for
TI. The table reports scores using the metric proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which evaluates the
recall of the ground-truth images at the k-th rank position (R@K).
Several values of k are reported; the scores in Table 1 correspond
to the test set.
        </p>
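        <p>As an illustration of the metric (a simplified sketch; the official evaluation follows [5]), R@K is the fraction of queries whose ground-truth image appears within the first k retrieved images:</p>
        <preformat>
# Simplified R@K: fraction of queries whose ground-truth image
# is ranked within the first k retrieved images.
def recall_at_k(rankings, ground_truth, k):
    """rankings: query -> ranked image list; ground_truth: query -> image."""
    hits = sum(1 for q, ranked in rankings.items()
               if ground_truth[q] in ranked[:k])
    return hits / len(rankings)
</preformat>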
        <p>Our best score is reported by run1, which uses the BoW under a TF-IDF
weighting scheme, filtering out the 5% most frequent terms. The results obtained
by run1 validate our hypothesis that related images appear in related web pages.
On the other hand, run2 and run3 report the scores obtained by the Word2vec
representation (denoted as d2v); both runs also use a filtering of 5%, with and
without TF-IDF weighting, respectively. In these latter results, we consider that
the representation is affected by noise as the number of words used to build it
increases. Although the Word2vec representation helps to retrieve similar
documents (as shown in Figure 1), we have found that this representation is more
reliable on short documents or in specific domains. However, on documents
with a diversity of topics its performance decreases (see Figure 2) because of the
great variety of different words involved.</p>
        <p>[Table 1: R@K scores on the test set for the baseline, CEA, and INAOE runs.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Qualitative results</title>
        <p>In this subsection we compare the proposed representations. Figure 1 shows the
top retrieved images that illustrate a document query under two representations.
In this case the document query consists of a short text, and both
representations retrieve relevant images to illustrate it. An interesting
output is obtained by run3, which shows diversity in the retrieved images.</p>
        <p>On the other hand, Figure 2 shows a long document used as query. Again,
the outputs of run1 and run3 are compared. Although the document covers a
great number of topics, the image retrieval of run1 is effective, whereas
the image retrieval of run3 includes few relevant images. Taking
Figures 1 and 2 as examples, we can see that the number of terms and the richness
of the vocabulary contained in a document are important factors when selecting
the representation. While the Word2vec representation seems to be robust on short
documents or documents in a specific domain, the BoW representation with TF-IDF
weighting shows itself to be a better option in the case of long documents.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Work in progress: representing documents in a visual space</title>
        <p>We have also worked with a visual representation, but it is not reported in Table 1:
unfortunately, we were not able to submit a run because of the tight
submission deadline. Nevertheless, we present an in-house evaluation showing
qualitative results.</p>
        <p>
          For representing documents in a visual space, we used a multimodal
representation M composed of visual prototypes. The construction of M is performed
in an unsupervised way by using the images embedded in web pages. The idea is that
images can be represented in two different modalities: a visual representation
extracted from the image I, and a textual representation extracted from the web
pages D. In M, a visual prototype is formed for every word in D, where each
prototype is a distribution over visual representations (more detail on this approach
in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]). We used a reference image dataset (the training set of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]) for the construction
of M.
        </p>
        <p>The aim of this representation is to include visual information in the text
illustration. Under this representation, the words of a given document query
are seen as a function of their visual representations. First, the visual
prototypes of the words extracted from the query are averaged to form an average
visual prototype. Second, this average visual prototype is used as a query to
retrieve related images. In other words, the document query is translated into a
visual document, which is then used to retrieve images, as in a CBIR
(Content-Based Image Retrieval) task.</p>
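        <p>A minimal sketch of this two-step query translation, assuming the prototype table and image features are precomputed (the names and data layout are hypothetical, not the exact method of [6]):</p>
        <preformat>
# Sketch of querying in the visual space: average the visual prototypes of
# the query words, then retrieve the nearest images (a CBIR-style query).
import numpy as np

def visual_query(query_words, prototypes, image_features, k=10):
    """prototypes: dict word -> prototype vector; image_features: one row per image."""
    known = [prototypes[w] for w in query_words if w in prototypes]
    avg = np.mean(known, axis=0)               # average visual prototype
    norms = np.linalg.norm(image_features, axis=1) * np.linalg.norm(avg)
    sims = image_features @ avg / np.maximum(norms, 1e-12)
    return np.argsort(-sims)[:k]               # indices of the retrieved images
</preformat>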
        <p>In Figure 3, we show one favorable case for the visual representation. In this
case, the average visual prototype is formed from the three words of the document
query with the highest weight. For this kind of representation, we have observed
that the more terms in the document query, the noisier the visual representation
is. We conclude that a visual document representation should be formed from only
a few words, so a keyword extraction step on the document query is necessary.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper we presented an IR-based approach to address the Text Illustration
task. Documents are defined by textual, semantic, or visual representations.
The experiments performed under these different representations give an initial
point of comparison for future approaches. According to the performed evaluation
we conclude that related web pages have related images, and thus it is possible
to retrieve highly relevant elements using IR techniques. On the one hand, the
BoW obtained an outstanding performance because of the filtering of high-frequency
terms and the discriminative information captured by the TF-IDF weighting
scheme. On the other hand, Word2vec did not yield reliable representations
because of the great diversity of words involved in web pages. Such diversity
makes it difficult to build accurate document representations using a simple
average of word vectors. Our perspectives for future work include exploring
relationships between representations to incorporate mixed (textual-visual)
information and adding keyword extraction for the document query.</p>
      <p>Acknowledgments. This work was supported by CONACyT under project
grant CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas
de minería de textos). The first author was supported by CONACyT under
scholarship No. 214764, and Lopez-Monroy thanks CONACyT-Mexico for doctoral
scholarship 243957.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramisa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dellandrea</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolajczyk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task</article-title>
          .
          <source>In: CLEF2016 Working Notes. CEUR Workshop Proceedings</source>
          , Évora, Portugal (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          :
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>28</volume>
          (
          <year>1972</year>
          )
          <fpage>11</fpage>
          –
          <lpage>21</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>CoRR abs/1405.4053</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hodosh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hockenmaier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Framing image description as a ranking task: Data, models and evaluation metrics</article-title>
          .
          <source>J. Artif. Int. Res</source>
          .
          <volume>47</volume>
          (
          <year>2013</year>
          )
          <fpage>853</fpage>
          –
          <lpage>899</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pellegrin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanegas</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arevalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltran</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-YGomez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>INAOE-UNAL at ImageCLEF 2015: Scalable Concept Image Annotation</article-title>
          .
          <source>In: CLEF2015 Working Notes. CEUR Workshop Proceedings</source>
          , Toulouse, France (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>