
Wikification: Mining Structured Queries from Unstructured Information Needs using Wikipedia-based Semantic Analysis

Amir Hossein Jadidinejad
amir@jadidi.info

Fariborz Mahmoudi
mahmoudi@itrc.ac.ir

Islamic Azad University of Qazvin

General Terms: Measurement, Performance, Experimentation

Combining the language model and the inference network, as implemented in the Indri search engine, is an efficient and well-verified approach to retrieval. In this retrieval model, the user's information need is expressed in Indri's Structured Query Language (SQL). Although this language lets expert users represent their information needs richly, its complexity makes it unpopular with ordinary users on the Web. Automatically detecting the concepts in a user's information need and generating an equivalent, richly structured query is a good solution. It requires a concept repository and a way of extracting the appropriate concepts from the user's information need. We use Wikipedia, a large, multilingual, free-content encyclopedia, as our knowledge base, together with state-of-the-art algorithms for extracting Wikipedia concepts from the user's information need. We call this process “Query Wikification”. Mining the Wikipedia concept repository lets us propose a solution that supports multilingual environments, cross-language retrieval and scalability, and that covers misspellings, variant forms and synonyms of a concept. Experimental results show that our automatic structured query construction is an efficient and scalable method with good potential for use on the Web. In our experiments over the TEL corpus in CLEF 2009, it achieves a +23% improvement in Mean Average Precision and retrieves more than 600 relevant documents beyond the Indri baselines. In the Persian track, we evaluated “Perstem”, a stemmer and light morphological analyzer for the Persian language. Our experimental results show that using this stemmer in the indexing and retrieval phases significantly improves both precision (+91%) and recall (+43%).

Introduction and Motivation

Representing the user's information need is a fundamental part of an information retrieval system. Most systems accept a list of keywords for each information need. For example, a user interested in the therapeutic use of colour might formulate the query “colour therapy”. Not only is it hard for ordinary users to express an information need as a set of keywords, but much of the semantics is also lost in the transcription. Such a query may retrieve documents about “color” or “therapy” that are completely irrelevant. The user's knowledge about the query is also neglected when it is encoded as a list of keywords: the user may well know, for example, that “color” and “colour” are synonymous.

Structured queries can represent the user's information need more accurately. A Structured Query Language (SQL) allows term weighting, the use of proximity information among terms, field restriction and various ways of combining concepts. Because structured queries are more expressive than keywords, retrieval models that can evaluate them [8], such as Indri [12] and InQuery [3], have more potential to retrieve accurate results.

Although structured queries and the related retrieval models have achieved very good results in different experiments [12, 8], they suffer from a drawback that makes them unusable on the Web: constructing a structured query requires knowledge of the concepts related to the query. Even if we presume that users know their information need well, it is unrealistic to expect Web users to learn a complicated structured query language. Understanding the user's information need and generating a richly structured query automatically is a better solution. It requires a large concept repository that covers all query concepts and a way of extracting the appropriate concepts from the user's information need. Wikipedia is a multilingual, web-based, free-content encyclopedia that covers most important concepts in the world. We call the process of extracting a list of Wikipedia concepts from a natural language information need “Query Wikification”. We mine Wikipedia and use state-of-the-art wikification algorithms [9, 10] to generate a rich, efficient structured query.

The contributions of this paper are the following:

∙ A new method for converting a simple natural language information need into a well-formed, rich and efficient structured query. The conversion relies on state-of-the-art algorithms in both wikification [9, 10] and structured retrieval models [8]. It can replace keyword-based search on the Web with powerful structured queries and the related retrieval models.
∙ Usability in multilingual environments and cross-language retrieval. The proposed model turns Indri [12] into a meta-language search engine that can be applied efficiently to multilingual environments such as the Web. Our experiments in the CLEF 2009 campaign are good evidence of this feature.
∙ Scalability. The proposed approach is based on the Indri search engine¹, a scalable language modeling search engine that supports structured queries. In addition, new projects such as Galago², which supports the Indri Structured Query Language in a distributed computation framework, make it even more scalable and suitable for the Web.
∙ Our model extracts a vocabulary for each query concept by mining Wikipedia. The vocabulary contains misspellings, stemmed equivalents and synonyms of the concept, all of which are embedded in the structured query. It can also act as a semi-stemming algorithm, which is very helpful in multilingual environments and for morphologically complex languages such as Persian (Sec. 4.2.2). This feature also suits the Web.

¹ http://lemurproject.org/indri/
² http://www.galagosearch.org/

The process of automatically recognizing the topics mentioned in unstructured text and linking them to the appropriate Wikipedia articles is known as wikification [9]. A user's information need is a short, informative text, so we can apply wikification to it in order to map the unstructured query to a weighted list of Wikipedia concepts. We call this process “Query Wikification”. To our knowledge, there is no prior publication on this topic.
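The weighted concept list that Query Wikification produces can be pictured with a small sketch. Everything here is illustrative: the `Concept` type, the weights and the 0.5 threshold are assumptions made for the example, and no Wikipedia-Miner API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A Wikipedia article detected in the query, with a confidence weight."""
    title: str
    weight: float


def select_concepts(candidates, min_weight=0.5):
    """Keep concepts at or above min_weight, strongest first."""
    kept = [c for c in candidates if c.weight >= min_weight]
    return sorted(kept, key=lambda c: c.weight, reverse=True)


# Hypothetical wikifier output for the "colour therapy" topic.
candidates = [
    Concept("Therapy", 0.6),
    Concept("Chromotherapy", 0.9),
    Concept("Use", 0.1),
]
selected = select_concepts(candidates)
print([c.title for c in selected])
```

Filtering by weight is one plausible way to discard weak links before a structured query is built; the text does not state which threshold, if any, was used.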

Two wikification methods have been proposed so far: Wikify! [9] and WM-Wikifier [10]. WM-Wikifier is distinctive in that it uses Wikipedia articles not only as a source of information to point to, but also as training data for how best to create links. We use this algorithm for Query Wikification; more details can be found in [10].

For example, Figure 1 shows a sample user information need from CLEF 2009, and Figure 2 shows the result of Query Wikification. The important topics are extracted and the original query is annotated with Wikipedia concepts. We use the Wikipedia-Miner³ toolkit [14] in our experiments.

Structured Query Construction

Once an unstructured information need is mapped to a weighted list of Wikipedia concepts, what can we do with these concepts? They let us move from unstructured, limited and noisy text to structured, well-known and accurate concepts, which is a breakthrough step in information retrieval. Wikification algorithms do exactly that.

In our experiments, we use the WM-Wikifier [10] algorithm to extract a weighted list of Wikipedia concepts, and we mine the translations and synonyms of these concepts from the Wikipedia knowledge base to construct an equivalent structured query. For example, Figure 1 shows a sample topic from CLEF 2009, in which the user is looking for all relevant information about colour therapy and the therapeutic use of colours. After removing redundant words and stop words, the equivalent Indri [12, 8] structured query is:

#combine(colour therapy therapeutic)

³ http://wikipedia-miner.sourceforge.net/

Table 1. Valuable fields kept in the preprocessing step.

Title                Distribution  Description
dc:title             80%           The record's title. All records contain this field, and it is a valuable field.
dcterms:alternative  little        In some records, this field contains relevant information.
dc:subject           210%          Manually assigned subject heading.
dc:abstract          little        The record's abstract.
dc:description       42%           The record's description. Mostly contains copyright and related information.
dc:contributor       little        The record's contributor.

The following structured query is generated by our approach⁴. It contains some professional expressions (“chromotherapy”) and all translations and synonyms of each concept:

#combine(colour therapy therapeutic #syn(chromotherapy farbtherapi colourology #1(color therapy)) #syn(color couleur farb colour colors colours couleur) #syn(therapi thrap therapi treatment therapie therapy))

There are various approaches to constructing an equivalent structured query. In the next section, we describe our experiments.

Experiments

TEL@CLEF2009

Meta-Language Field Index Construction

TEL is an inherently multilingual corpus. It contains not only records in different languages but also records whose fields are in several languages. Detecting a record's language is a prerequisite for applying stemming and stop word removal. However, detecting the different languages in each record is not only hard but also leads to poor results. Previous experiments used different language identification approaches to detect each field's language and then applied the appropriate stemmer and stop word list [1]. We use a meta-language index instead: rather than distinguishing languages, we index all fields without stemming or stop word removal, so that all valuable content is indexed together regardless of the underlying language. Such an indexing strategy is clearly not appropriate in general, but our experiments show that it works well in tandem with Query Wikification and the Indri Structured Query Language.

In the preprocessing step, we delete all noisy and low-value fields from the TEL corpus. After analyzing the TEL records, we extracted a list of fields that contain important information; Table 1 shows these valuable fields. For example, Figure 3 shows a sample record from the TEL corpus and Figure 4 the equivalent record after preprocessing. We skip all low-value fields and store the remaining ones in TREC format. We do not apply any stemming or stop word removal in the indexing phase; instead, we apply stop word removal in the retrieval phase, using a list of stop words provided by UNINE⁵.
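The preprocessing step can be sketched as a small filter that keeps only the fields of Table 1 and writes them inside a TREC-style `<DOC>` wrapper. This is a minimal sketch under stated assumptions: the input is taken to be a list of (field name, value) pairs, the tag names are obtained by dropping the namespace prefix, and the sample record and its identifier are invented. The actual layout of Figures 3 and 4 is not reproduced here.

```python
from xml.sax.saxutils import escape

# Fields kept after preprocessing (Table 1).
VALUABLE_FIELDS = (
    "dc:title", "dcterms:alternative", "dc:subject",
    "dc:abstract", "dc:description", "dc:contributor",
)


def to_trec(doc_id, fields):
    """Render (name, value) pairs as one TREC-style document.

    Fields outside VALUABLE_FIELDS are dropped. The text is neither
    stemmed nor stop-word filtered, matching the meta-language index.
    """
    lines = ["<DOC>", f"<DOCNO>{escape(doc_id)}</DOCNO>"]
    for name, value in fields:
        if name in VALUABLE_FIELDS and value.strip():
            tag = name.split(":", 1)[1]  # assumed tag naming: drop the prefix
            lines.append(f"<{tag}>{escape(value.strip())}</{tag}>")
    lines.append("</DOC>")
    return "\n".join(lines)


# Hypothetical record; the identifier and values are invented.
record = [
    ("dc:title", "Colour therapy"),
    ("dc:language", "en"),
    ("dc:subject", "Chromotherapy"),
]
trec_doc = to_trec("rec-0001", record)
print(trec_doc)
```

Keeping one tag per field is what allows a field index to be built over the same names later on.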

We use the Indri [12, 8] field index for indexing because it not only constructs a powerful field index but also supports the indexed fields in its query language. Each valuable field (Table 1) is configured as a backward field index. Finally, the indexing is done by the “indribuildindex” application in the Lemur toolkit [11].

⁴ The generation procedure is discussed in Sec. 4.
⁵ http://members.unine.ch/jacques.savoy/clef/englishST.txt

Indri Baseline

To compare our results, we apply the Indri retrieval model [12, 8] to the title and description of each topic. The query model is as follows:

#combine( <title> <description> )

Before topics are passed to the Indri retrieval engine, all common and redundant words are removed. For example, the query shown in Figure 1 becomes, after removing common and redundant words:

#combine(colour therapy therapeutic)

This run is referred to as “SIM” in our experiments. Table 3 and Figures 5 and 6 compare this baseline with the proposed approaches.
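The construction of the baseline query can be sketched in a few lines. The stop list below is a tiny illustrative stand-in for the UNINE list, and the topic wording passed to the function is hypothetical; only the resulting query string comes from the text.

```python
import re

# Tiny illustrative stop list; the experiments use the UNINE list instead.
STOP_WORDS = {"the", "of", "and", "or", "on", "in", "a", "an", "find",
              "documents", "about", "use"}


def baseline_query(title, description, stop_words=STOP_WORDS):
    """Build #combine(...) from title and description terms.

    Terms are lower-cased, stop words are removed, and repeated terms
    are kept only once, in order of first appearance.
    """
    terms = []
    for token in re.findall(r"\w+", f"{title} {description}".lower()):
        if token not in stop_words and token not in terms:
            terms.append(token)
    return "#combine(" + " ".join(terms) + ")"


# Hypothetical wording of the topic in Figure 1.
query = baseline_query("Colour therapy",
                       "Find documents on the therapeutic use of colour")
print(query)  # -> #combine(colour therapy therapeutic)
```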

Concept Translation

Wikipedia contains articles in more than 250 natural languages, and each article links to its equivalents in other languages. After extracting concepts from the user's unstructured information need, we can use these translation links to translate each concept. The following model is applied:

#combine( <title> <description> #syn(#1(EN) #1(FR) #1(GE)) )

For example, for the previous sample query:

This run is referred to as “SIMTR” in our experiments. Table 3 and Figures 5 and 6 compare it with the other approaches and the baseline, and Table 2 compares the proposed approaches and the baseline for the previous example (“colour therapy”). The evaluation results show that translating concepts using Wikipedia significantly improves both precision (+18%) and recall (+8%). For the example query (Table 2), Mean Average Precision improves by 62% and one more relevant document (+4%) is retrieved.
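The SIMTR query model can be illustrated with a short sketch. It assumes that the translated labels of each concept are already available as plain strings (how they are read from Wikipedia's translation links is not shown), and it writes single-word labels bare, as in the sample queries of this paper, wrapping only multi-word labels in #1(...). The labels in the example are borrowed from the sample query and stand in for real translation links.

```python
def phrase(label):
    """Return a bare term, or #1(...) so that several words must be adjacent."""
    words = label.lower().split()
    if len(words) == 1:
        return words[0]
    return "#1(" + " ".join(words) + ")"


def simtr_query(terms, concept_translations):
    """Append one #syn(...) group of translated labels per concept."""
    parts = list(terms)
    for labels in concept_translations:
        group = " ".join(phrase(label) for label in labels if label.strip())
        parts.append(f"#syn({group})")
    return "#combine(" + " ".join(parts) + ")"


simtr = simtr_query(
    ["colour", "therapy", "therapeutic"],
    [["chromotherapy", "color therapy", "farbtherapi"]],
)
print(simtr)
```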

Concept Translation and Synonyms Extraction

Most retrieval systems are simple pattern matchers, so co-occurring terms play an important role in the ranking algorithm. We are therefore eager to know as many synonyms and related concepts as possible for each concept. Given a Wikipedia article, we can mine all other articles to find a list of synonyms for it. There are two distinct sources: redirect pages⁶ and anchors. We prefer anchor texts, since they let us rank the vocabulary of each concept, whereas ranking is not possible for redirect pages⁷. We assume that all anchors pointing to one article are synonyms. This assumption yields the following structured query:

#combine( <title> <description> #syn(#1(EN) #1(FR) #1(GE) <Anchors List>))

For example, the previous sample query is defined as:

#combine(colour therapy therapeutic #syn(chromotherapy farbtherapi colourology #1(color therapy)) #syn(color couleur farb colour colors colours couleur) #syn(therapi thrap therapi treatment therapie therapy))

⁶ Redirects are standalone pages in Wikipedia that have only a title and refer to an article. They cover various equivalents, misspellings, and so on.
⁷ Redirect pages can be ranked using Wikipedia query logs.

This run is referred to as “SIMEXT” in our experiments. Table 3 and Figures 5 and 6 compare it with the other approaches and the baseline, and Table 2 compares the proposed approaches and the baseline for the previous example (“colour therapy”). The evaluation results show that translating concepts in tandem with extracting synonyms and variant forms from Wikipedia significantly improves both precision (+22%) and recall (+13%). For the example query (Table 2), Mean Average Precision improves by 66% and two more relevant documents (+8%) are retrieved. Our experimental results over the TEL corpus also show that SIMEXT outperforms SIMTR in both precision and recall.
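The anchor-based vocabulary of SIMEXT can be sketched as follows. The text says only that anchors allow the vocabulary to be ranked; ranking by how often each anchor text links to the article, and cutting the list at `top_k`, are assumptions made for this example, as are the anchor counts.

```python
from collections import Counter


def rank_anchors(anchor_occurrences, top_k=5):
    """Rank anchor texts by how often they link to the concept's article."""
    counts = Counter(a.lower().strip() for a in anchor_occurrences)
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda a: (-counts[a], a))
    return ranked[:top_k]


def syn_group(labels):
    """Build #syn(...) from labels, skipping duplicates and empty strings."""
    seen, parts = set(), []
    for label in labels:
        words = label.lower().split()
        key = " ".join(words)
        if not words or key in seen:
            continue
        seen.add(key)
        parts.append(key if len(words) == 1 else f"#1({key})")
    return "#syn(" + " ".join(parts) + ")"


# Hypothetical data for the concept "Color".
translations = ["color", "couleur", "farb"]
anchors = ["color"] * 3 + ["colour"] * 2 + ["colors"]
group = syn_group(translations + rank_anchors(anchors))
print(group)  # -> #syn(color couleur farb colour colors)
```

Removing duplicates before emitting the group keeps the query short; the sample query in the text repeats some labels, so this is a design choice of the sketch rather than a description of the original system.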

Persian@CLEF

Bilingual

Persian is an Indo-European language spoken in Iran, Afghanistan and Tajikistan. It is also known as Farsi [1, 6]. In this section we summarize our experiments in the Persian track of CLEF 2009. Bilingual retrieval in the Persian track follows the same approach as discussed in Sec. 4.1. Unlike the TEL experiments, the results are very poor because of the limited coverage of the Farsi edition of Wikipedia. For example, most topics are extracted from the query, but they cannot be translated because the Farsi Wikipedia has no equivalent article. Table 4 shows our different runs.

Table 4. Bilingual runs in the Persian track.

RUN        Relevant-Retrieved  MAP     NDCG    R-PREC
IAUPEREN1  650/4330            0.0195  0.0975  0.0433
IAUPEREN2  659/4330            0.0202  0.0984  0.0427
IAUPEREN3  773/4330            0.0277  0.1223  0.0477
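Assuming that the percentage figures quoted in this paper are relative changes over a reference run, the differences between the bilingual runs above can be put on the same scale. The sketch below compares IAUPEREN3 with IAUPEREN1; treating IAUPEREN1 as the reference is an assumption made only for illustration.

```python
def relative_change(baseline, value):
    """Relative change of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# MAP and relevant-retrieved counts of IAUPEREN1 and IAUPEREN3.
map_gain = relative_change(0.0195, 0.0277)
rel_gain = relative_change(650, 773)
print(f"MAP: {map_gain:+.1f}%  relevant retrieved: {rel_gain:+.1f}%")
```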

Perstem⁹ is a stemmer and light morphological analyzer for Persian by Jon Dehdari¹⁰. It is written in Perl and uses regular expression substitutions to separate inflectional morphemes and remove affixes. The stemmer currently has 76 substitution rules, each of which replaces one pattern of text with another [4]. It performs well and accurately in stemming and morphological analysis of Persian text: on a sample dataset, Perstem correctly and efficiently analyzed 97% of the words [4].

Inconsistent stemming results were reported in CLEF 2008 [2, 7], so we decided to evaluate Perstem in our CLEF 2009 experiments. Unlike [13], our evaluation is based on overall performance (precision and recall) on the Hamshahri corpus with the CLEF 2009 benchmark queries. In other words, we investigate the application of Perstem to Persian retrieval over a large news corpus. Table 5 shows our official runs¹¹. The experimental results show that the stemming algorithm significantly improves both precision (+91%) and recall (+43%).

Conclusion and Future Works

In this paper we propose an efficient approach for extracting relevant concepts and a vocabulary of synonyms, translations, variant forms and so on, all of which are embedded in a structured query. We use Wikipedia as our knowledge base and Indri as the structured query language and retrieval model. Query modification techniques such as query expansion suffer from a problem known as “query drift”: modifying a query may retrieve more relevant documents, but it may also hurt precision. Our experiments over the TEL corpus show that our method is an efficient and robust approach that significantly improves both precision and recall. We believe that it has good potential for use on the Web. For example, consider the following query¹²:

Title: Modern Persian Language
Desc: Retrieve publications providing instructions on learning or teaching modern/contemporary Persian.

The structured query generated by SIMEXT is:

#weight(0.3 #combine(modern teaching instructions persian contemporary learning language) 0.7 #syn(farsi #1(persian languages) #1(farsi salis language) #1(modern perisan) persian #1(modern persian language) #1(parsi language) #1(farsi language) #1(modern persian) #1(persian language) #1(persische sprache)))

“Farsi” and “Parsi” are informal equivalents of “Modern Persian Language” that cannot be inferred from the original query. Such informal equivalents are very important evidence on the Web. As another example, consider the structured query generated for Figure 1: without applying a complicated stemmer in our multilingual environment (the TEL corpus), the vocabulary extracted from anchor texts covers most variant forms efficiently. For example, in that structured query, “color” and “colour” are synonyms. This is very promising for highly multilingual environments such as the Web. The per-query evaluation comparison is shown in Table 6.

⁹ http://sourceforge.net/projects/perstem/
¹⁰ http://www.ling.ohio-state.edu/~jonsafari/
¹¹ DOI: 10.2415/AH-PERSIAN-MONO-FA-CLEF2009.QAZVINIAU.IAUPERFA<X>
¹² 10.2452/733AH
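The #weight form of the query for “Modern Persian Language” can be assembled with a small helper. The 0.3 and 0.7 weights are taken from that query; the function name, its parameters and the shortened term lists are illustrative assumptions.

```python
def weighted_query(terms, synonyms, term_weight=0.3, syn_weight=0.7):
    """Combine the original terms and the mined vocabulary with #weight."""
    combine = "#combine(" + " ".join(terms) + ")"
    syn = "#syn(" + " ".join(synonyms) + ")"
    return f"#weight({term_weight} {combine} {syn_weight} {syn})"


# Shortened version of the query for "Modern Persian Language".
persian_query = weighted_query(
    ["modern", "persian", "language"],
    ["farsi", "#1(persian language)"],
)
print(persian_query)
```

Giving the mined vocabulary the larger weight, as that query does, lets documents that use only an informal equivalent such as “farsi” still rank well.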

Acknowledgements

We would like to thank Donald Metzler¹³, one of the main developers of the Indri Structured Query Language, for his ideas and advice, and the Lemur community¹⁴ for supporting and sharing an excellent resource. We would also like to thank Jon Dehdari for sharing Perstem, and DBRG¹⁵ for the Hamshahri corpus. Finally, we must of course acknowledge the tireless efforts of the Wikipedia community, which has built a valuable knowledge base over the years. We are also indebted to the CLEF organizers.

¹³ http://research.yahoo.com/Don_Metzler
¹⁴ http://sourceforge.net/forum/?group_id=161383
¹⁵ http://ece.ut.ac.ir/DBRG/

[1] Eneko Agirre, Giorgio M. Di Nunzio, Nicola Ferro, Thomas Mandl, and Carol Peters. CLEF 2008: Ad hoc track overview. In Proceedings of the CLEF 2008: Workshop on Cross-Language Information Retrieval and Evaluation, Aarhus, Denmark, 2008.

[2] AleAhmad, E. Kamalloo, Zareh, Rahgozar, and Oroumchian. Cross language experiments at Persian@CLEF 2008. In Proceedings of the CLEF 2008: Workshop on Cross-Language Information Retrieval and Evaluation, Aarhus, Denmark, 2008. CLEF 2008 Organizing Committee.

[3] James Callan, W. Bruce Croft, and John Broglio. TREC and TIPSTER experiments with InQuery. Inf. Process. Manage., 31(3):327-343, 1995.

[4] Jon Dehdari and Deryle Lonsdale. A link grammar parser for Persian. In Simin Karimi, Vida Samiian, and Don Stilo, editors, Aspects of Iranian Linguistics, volume 1. Cambridge Scholars Press, 2008.

[5] Amir Hossein Jadidinejad and Fariborz Mahmoudi. Qiau at CLEF2009: Persian track. In Proceedings of the CLEF 2009: Workshop on Cross-Language Information Retrieval and Evaluation, Corfu, Greece, September 2009. CLEF 2009 Organizing Committee.

[6] Simin Karimi. Persian or Farsi? 20Farsi.pdf.

[7] Karimpour, Ghorbani, Pishdad, Mohtarami, A. AleAhmad, and Amiri A. Using part of speech tagging in Persian information retrieval. In Proceedings of the CLEF 2008: Workshop on Cross-Language Information Retrieval and Evaluation, Aarhus, Denmark, 2008. CLEF 2008 Organizing Committee.

[8] Donald Metzler and W. Bruce Croft. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735-750, 2004.

[9] Rada Mihalcea and Andras Csomai. Wikify!: linking documents to encyclopedic knowledge. In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 233-242, New York, NY, USA, 2007. ACM.

[10] David Milne and Ian H. Witten. Learning to link with Wikipedia. In CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 509-518, New York, NY, USA, 2008. ACM.

[11] Paul Ogilvie and Jamie Callan. Experiments using the Lemur toolkit. In Proceedings of the Tenth Text Retrieval Conference (TREC-10), pages 103-108, 2002.

[12] Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce Croft. Indri: A language-model based search engine for complex queries (extended version). IR 407, University of Massachusetts, 2005.

[13] Masoud Tashakori, Mohammad Reza Meybodi, and Farhad Oroumchian. Bon: The Persian stemmer. In EurAsia-ICT, pages 487-494, 2002.

[14] Ian Witten and David Milne. An open-source toolkit for mining Wikipedia. In (to appear), 2009.