<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Monolingual and Cross-lingual information retrieval in cultural Microblog at CLEF 2018</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chedi</forename><surname>Bechikh</surname></persName>
							<email>chedi.bechikh@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Institut supérieur de gestion</orgName>
								<orgName type="institution">Université de Tunis</orgName>
								<address>
									<country key="TN">Tunisia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hatem</forename><surname>Haddad</surname></persName>
							<email>hatem.haddad@ulb.ac.be</email>
							<affiliation key="aff1">
								<orgName type="institution">Université libre de Bruxelles</orgName>
								<address>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Monolingual and Cross-lingual information retrieval in cultural Microblog at CLEF 2018</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">32F9274BDB4C708ABF844D6F2D4A6838</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Microblog search</term>
					<term>Query expansion</term>
					<term>Query translation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For CLEF 2018, we focus on cultural microblog search. The aim of this work is to find relevant microcritics in a monolingual and cross lingual context about films. This task is challenging due to the short length of the query and of the documents. For the monolingual context we propose to expand the query using a probalistic weighting scheme. For the french-english cross language task, we used a state of the art approach based on query transation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The Cross Language cultural microblog search is the first task<ref type="foot" target="#foot_0">3</ref> from the lab multilingual cultural mining and retrieval (MC2).</p><p>The goal of this task is to find relevant microblogs in different languages from MC2 corpus using 75 topics related to films from a dataset of 70 000 000 microblogs. This corpus is collected between May and September 2015 and is about the keyword Festival. The topics used are selected by the task organizers from VodKaster website and represent a selection of microcrtitcs in french about film and cinema festivals. The topics are composed by a film title, a narrative field containing a microcritic about the film and a third field containing a list of expressions extracted manually from the microcritics. Figure <ref type="figure" target="#fig_0">1</ref> shows an example of the query structure.</p><p>In this work we choose to intervene at the topics level, because it is simpler than to change the indexing shceme. For the monolingual search we based our work on a query expansion approach, for the cross lingual search we used query translation.</p><p>This paper is organized as follow, in section 2 we describe related work on query expansion and cross language information retrieval. Then in section 3 we describe the proposed approach used for query expansion and the query translation techniques used. We conclude this work in Section 4. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>In this paper we propose to use query expansion for monolingual microblog search, so we present a brief state of the art about query expansion. We present also a state of the art about the different approaches used for cross language information retrieval.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Query expansion</head><p>Given that user queries are usually short and that some words can be ambiguous, the use of the simple retrieval model based on the matching between query and document is prone to errors and omissions. Also the users of an information retrieval system can use other words than those present in relevant documents, this lead to the issue of term mismatch. Researchers proposed to use query expansion to resolve the problems caused by short queries and term mismatch.</p><p>Different approach were proposed <ref type="bibr" target="#b1">[2]</ref>:</p><formula xml:id="formula_0">-Interactive Query Refinement -Relevance feedback -Word Sense Disambiguation in IR -Search Results Clustering</formula><p>In our work we rely on relevance feedback. For Relevance feedback, an initial retrieval run is performed using the initial formulation of the query. In addition, some number of top-ranked documents are mined for additional query terms to be added to the initial query. The content of the assessed documents is used to adjust the weights of terms in the original query and/or to add words to the query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Cross language information retrieval</head><p>For cross language information retrieval (CLIR), given a query in language A, the goal is to retrieve documents in language B. CLIR can be useful in different contexts. For example, relevant documents may not exist in the query's language, so the system must be able to retrieve relevant documents in other languages. Many approaches were used for CLIR: query translation, document translation, or the translation of both documents and query. Most researchers rely on query translation because it is easier than the translation of all the documents. For this purpose different approaches were used <ref type="bibr" target="#b0">[1]</ref>:</p><p>-Machine translation -Dictionary -Parallel or comparable corpora -Inter-language representation These different approaches can be used for cross language microblog search and adapted to our task context. The translated queries are then executed against the target collection in a monolingual way.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>The documents search in microblogs presents many difficulties: the queries are short, so the retrieval system doesn't have enough contexts to understand user's needs. This can lead to different problems: word ambiguity and word mismatch between queries and documents. Also, we can notice that the documents (microcritics) are also short, so they lack of context.</p><p>We submitted 3 runs:</p><p>-A French monolingual run: "Be-Ha-submission-fr-fr" -Tow french-english Cross lingual runs:</p><p>• "Be-Ha-Submission3Fr-English-dictionary"</p><p>• "Be-HaSubmussion2-FR-English"</p><p>For the monolingual run we used query expansion based on the probalistic BM25 model <ref type="bibr" target="#b3">[4]</ref>. This process is based on many steps:</p><p>-Step 1: Original queries are used to retrieve the top 50 microcritics for each query. These retrieved microcrtics are used as a set of pseudo-relevant documents. -Step 2: For each query, we merge the retrieved documents set in a single document. -Step 3: Since expansion terms are selected from this set of supposed relevant documents, the new set of documents is used to compute the weight of each expansion term. The top 30 terms with the highest score are selected as expansion terms and added to the original query. -Step 4: The new query is used to match the collection of microblogs.</p><p>We choose the BM25 model to attribute the score for each terms, because it is one of most accurate model for information retrieval. This model is based on a weighting scheme defined by this formula <ref type="bibr" target="#b3">[4]</ref>:</p><formula xml:id="formula_1">Score(R|D) = t∈Q w (1) (k 1 + 1) tf k1 (1 − b) + b |d| |davg| + tf (k3 + 1)qtf k3 + qtf<label>(1)</label></formula><p>where qtf is the number of times that the term t is present in the query, tf is the frequency of the term t in the document, |d| is the number of terms in the document, |d avg | is the average length of a document, k1 is a parameter that controls the saturation of tf , k3 is a parameter that controls the saturation of qtf and b is the frequency normalization parameter.</p><p>w (1) is similar to the inverse document frequency (idf) and is defined by [?]:</p><formula xml:id="formula_2">w (1) = log N − df + 0.5 df + 0.5 (<label>2</label></formula><formula xml:id="formula_3">)</formula><p>Where N is the number of documents in the collection and df is the number of documents where the term t occurs in the collection.</p><p>For cross language microblog search, we translate the query using tow techniques:</p><p>-A french-english dictionary -Bilingual Wordnet.</p><p>For the run "Be-Ha-Submission3Fr-English-dictionary" we used a dictionary translation approach to translate the queries from the french to the english language.</p><p>A bilingual dictionary was used for query translation and it is constructed from an online dictionary. It consists of 33k distinct English words and 28k distinct French words, which constitutes 76k translation pairs. It contains lemmatized forms of content words (nouns, verbs, adjectives, adverbs).</p><p>For the run "Be-HaSubmussion2-FR-English" we used an inter-lingual representation based on english and french Wordnet. WordNet is a large lexical database of English that was extended to many languages. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept <ref type="bibr" target="#b2">[3]</ref>. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.</p><p>In wordnet each word is mapped to an indentifier, so words with the same meaning have the same number. For exemple the french word 'acteur' (actor) has as identifier "09765278-n", wich is the samed identifier for the word "actor", "actress", "performer", etc.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>This paper describes our first participation on the first task Cross language Microblog search of the MC2 lab at CLEF 2018. The aim of this work is to find relevant microblogs given title and microcritic about films. In this work we submitted 3 runs, one french monolingual run and two cross lingual runs based on query translation. This work is still in progress and needs more investigations for future work, so other expansion approaches must be proposed, also the indexing process must be studied.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. topic number 201800</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://mc2.talne.eu/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Cross-language information retrieval based on bilingual formal concept mining</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bechikh-Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Haddad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Slimani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">14 IACS/IEEE International Conference on Computer Systems and Application (AICSSA 2017)</title>
				<meeting><address><addrLine>Hammamet, Tunisia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-11-03">October 30-November 3, 2017. 2017</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of automatic query expansion in information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Carpineto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Romano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">50</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">WordNet: an electronic lexical database</title>
		<editor>Fellbaum, C.</editor>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Okapi at TREC-3</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Third Text REtrieval Conference, TREC 1994</title>
				<meeting>The Third Text REtrieval Conference, TREC 1994<address><addrLine>Gaithersburg, Maryland, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1994">November 2-4, 1994. 1994</date>
			<biblScope unit="page" from="109" to="126" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
