<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Two-Stage Refinement of Transitive Query Translation with English Disambiguation for Cross-Language Information Retrieval: A Trial at CLEF 2004</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Kazuaki</forename><surname>Kishida</surname></persName>
							<email>kishida@surugadai.ac.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">Surugadai University</orgName>
								<address>
									<addrLine>698 Azu</addrLine>
									<postCode>357-8555</postCode>
									<settlement>Hanno</settlement>
									<region>Saitama</region>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Noriko</forename><surname>Kando</surname></persName>
							<email>kando@nii.ac.jp</email>
							<affiliation key="aff1">
								<orgName type="institution">National Institute of Informatics (NII)</orgName>
								<address>
									<postCode>101-8430</postCode>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kuang-Hua</forename><surname>Chen</surname></persName>
							<email>khchen@ntu.edu.tw</email>
							<affiliation key="aff2">
								<orgName type="institution">National Taiwan University</orgName>
								<address>
									<postCode>10617</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Two-Stage Refinement of Transitive Query Translation with English Disambiguation for Cross-Language Information Retrieval: A Trial at CLEF 2004</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">AB8217D13096B540C639FD37CA773E2B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper reports experimental results of cross-language information retrieval (CLIR) from German to French. The authors are concerned with CLIR in cases where available language resources are very limited. Thus, transitive translation of queries using English as a pivot language was used to search French document collections for German queries without any direct bilingual dictionary or MT system for these two languages. The two-stage refinement of query translations that we proposed at the previous CLEF 2003 campaign is again used to enhance the performance of the pivot language approach. In particular, disambiguation of English terms in the middle stage of transitive translation was attempted as a new trial. Our experimental results show that the two-stage refinement method significantly improves the search performance of bilingual IR using a pivot language, but unfortunately, the English disambiguation has almost no effect.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This paper reports our experiment on cross-language IR (CLIR) from German to French in CLEF 2004. In the previous CLEF 2003, the authors proposed the "two-stage refinement technique" for enhancing the search performance of the pivot language approach in situations where only limited language resources are available. There, German to Italian search runs were executed using only three resources: (1) a German to English dictionary, (2) an English to Italian dictionary, and (3) the target document collection <ref type="bibr" target="#b0">[1]</ref>. The target document collection was employed as a language resource for both translation disambiguation and query expansion by applying a kind of pseudo-relevance feedback (PRF) <ref type="bibr" target="#b0">[1]</ref>.</p><p>In CLEF 2004, we added an English document collection as a language resource for executing German to French search runs via English as a pivot. That is, unlike in CLEF 2003, a disambiguation procedure using a document collection is applied to the English term set in the middle position of transitive query translation. Removing inappropriate English translations is expected to reduce the number of irrelevant French words.</p><p>This paper is organized as follows. Section 2 introduces the two-stage refinement technique and the English disambiguation method. Section 3 describes the system used in the CLEF 2004 experiment. Section 4 reports the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Two-Stage Refinement of Query Translation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Basic Procedure</head><p>The purpose of the "two-stage refinement technique" is to modify the result of query translation in order to improve CLIR performance. The modification consists of two steps: (1) disambiguation and (2) expansion. In our approach, "disambiguation" means selecting a single translation for each search term in the source language, and "expansion" means executing a standard PRF technique using the set of translations selected in the disambiguation stage as an initial query. Although many researchers have performed the two processes together for CLIR, in our method, both processes are based on a PRF technique using the target document collection. That is, under the assumption that only limited language resources are available, we use the target collection as a language resource for disambiguation.</p><p>We define mathematical notations such that: $s_j$: the $j$-th term in the source query ($j = 1, 2, \ldots, m$), $T'_j$: a set of translations in the target language for term $s_j$, and</p><formula xml:id="formula_0">T = T'_1 \cup T'_2 \cup \cdots \cup T'_m .</formula><p>First, the target document collection is searched for the set of terms $T$. Second, the most frequently appearing term in the top-ranked documents is selected from each set $T'_j$ ($j = 1, 2, \ldots, m$). That is, for each $T'_j$ we choose the term</p><formula>\tilde{t}_j = \arg\max_{t \in T'_j} r_t ,<label>(1)</label></formula><p>where $r_t$ is the number of top-ranked documents including term $t$. The disambiguation stage thus yields a set of $m$ translations,</p><formula xml:id="formula_1">\tilde{T} = \{ \tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_m \} .<label>(2)</label></formula><p>The disambiguation technique is clearly based on PRF, in which some top-ranked documents are assumed to be relevant. The most frequently appearing term in the relevant document set is considered to be the correct translation in the context of a given query.</p><p>In the next stage, following Ballesteros and Croft <ref type="bibr" target="#b1">[2]</ref>, a standard post-translation query expansion by a PRF technique is executed using $\tilde{T}$ in (2) as a query. In this study, we use a standard formula based on the probabilistic model for estimating term weights as follows:</p><formula xml:id="formula_2">w_t = r_t \times \log \frac{(r_t + 0.5)(N - R - n_t + r_t + 0.5)}{(N - n_t + 0.5)(R - r_t + 0.5)} ,<label>(3)</label></formula><p>where $N$ is the total number of documents, $R$ is the number of relevant documents, $n_t$ is the number of documents including term $t$, and $r_t$ is defined as before (see Equation (<ref type="formula">1</ref>)). The expanded term set is used as the final query for obtaining a ranked list of documents.</p></div>
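<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two stages described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the representation of documents as token lists, and the tie-breaking rule are all hypothetical. The weight function follows one reading of Equation (3), with N - n_t + 0.5 and R - r_t + 0.5 in the denominator; that reading is itself an assumption about the printed formula.</p>
```python
import math
from collections import Counter


def disambiguate(translation_sets, top_docs):
    """Pick one translation per source term: the candidate that occurs
    in the largest number of top-ranked documents (r_t)."""
    # r_t: number of top-ranked documents that contain term t
    doc_freq = Counter()
    for doc in top_docs:
        doc_freq.update(set(doc))
    selected = []
    for candidates in translation_sets:
        # Assumption: ties are broken by candidate order.
        selected.append(max(candidates, key=lambda t: doc_freq[t]))
    return selected


def term_weight(r_t, n_t, big_r, big_n):
    """Expansion weight for one term, following the assumed reading of
    Equation (3). big_r is R (documents assumed relevant) and big_n is
    N (collection size)."""
    numerator = (r_t + 0.5) * (big_n - big_r - n_t + r_t + 0.5)
    denominator = (big_n - n_t + 0.5) * (big_r - r_t + 0.5)
    return r_t * math.log(numerator / denominator)
```
<p>In this sketch the top-ranked documents are passed in by the caller, so any retrieval function can supply them. Counting each document once per term (via a set) matches the definition of r_t as a document count rather than a raw term count.</p></div>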
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Disambiguation during Transitive Query Translation</head><p>The pivot language approach is adopted in this paper, i.e., a search term in the source language is translated into a set of English terms, and each English term is in turn translated into terms in the target language. As many researchers have pointed out, if the set of English terms includes erroneous translations, they yield many more irrelevant terms in the target language.</p><p>A solution is to apply a disambiguation technique to the set of English translations (see Fig. <ref type="figure" target="#fig_1">1</ref>). If an English document collection is available, we can easily apply the disambiguation method described in the previous section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">System Description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Text Processing</head><p>Both German and French texts (in documents and queries) were basically processed by the following steps: (1) identifying tokens, (2) removing stopwords, (3) lemmatization, and (4) stemming. In addition, for German text, decomposition of compound words was attempted using a longest-matching algorithm against the headwords of the machine-readable German to English dictionary. For example, the German word "Briefbombe" is broken down into two headwords listed in the German to English dictionary, "Brief" and "Bombe," according to the rule that only the longest headwords included in the original compound word are extracted from it. If a substring of "Brief" or "Bombe" is also listed in the dictionary, that substring is not used as a separate word.</p><p>We downloaded free dictionaries (German to English and English to French) from the Internet<ref type="foot" target="#foot_0">1</ref>. Stemmers and stopword lists for German and French were available through the Snowball project<ref type="foot" target="#foot_1">2</ref>. Stemming for English was conducted by the original Porter algorithm <ref type="bibr" target="#b2">[3]</ref>.</p></div>
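<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of compound decomposition by longest matching, the following Python sketch splits a word greedily from left to right, always taking the longest dictionary headword that starts at the current position. The paper does not spell out its exact algorithm, so the greedy strategy, the function name decompose, the min_len parameter, and the fallback of leaving a word unsplit are all assumptions made for this example.</p>
```python
def decompose(word, headwords, min_len=3):
    """Split a compound greedily, always taking the longest headword
    that starts at the current position."""
    lower = word.lower()
    parts = []
    pos = 0
    while pos != len(lower):
        match = None
        # Try the longest remaining substring first.
        for end in range(len(lower), pos + min_len - 1, -1):
            if lower[pos:end] in headwords:
                match = lower[pos:end]
                break
        if match is None:
            # The headwords do not cover the word: keep it unsplit.
            return [lower]
        parts.append(match)
        pos += len(match)
    return parts


# Hypothetical toy headword list, for illustration only.
headwords = {"brief", "bombe", "brie", "bom"}
print(decompose("Briefbombe", headwords))
# prints ['brief', 'bombe']
```
<p>Because the longest headword is tried first, the shorter entries "brie" and "bom" are never extracted, which mirrors the rule stated above. The sketch does not backtrack and does not handle German linking elements, so it is weaker than a production decompounder.</p></div>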
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Transitive Translation Procedure</head><p>Before executing transitive translation with the two bilingual dictionaries, all terms included in the dictionaries were normalized through stemming and lemmatization, using the same procedure that was applied to the texts of documents and queries. The actual translation process is a simple replacement, i.e., each normalized German term in a query (after compound decomposition) was replaced with the set of corresponding normalized English words, and similarly, each English word was replaced with the corresponding French words. As a result, a set of normalized French words was obtained for each query. If no corresponding headword was included in the dictionaries (German-English or English-French), the unknown word was sent directly to the next step without any change.</p><p>Next, the translations were refined by the two-stage technique described in the previous section. The number of top-ranked documents was set to 100 in both stages, and in the query expansion stage, the top 30 terms were selected from the ranked list in decreasing order of term weights (Equation (<ref type="formula" target="#formula_2">3</ref>)).</p><p>Let $y_t$ be the frequency of a given term in the query. If the top-ranked term was already included in the set of search terms, the term frequency in the query was changed into ).</p></div>
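<div xmlns="http://www.tei-c.org/ns/1.0"><p>The replacement-based transitive translation described above can be sketched as follows. This is a minimal Python illustration, assuming each dictionary is a mapping from a normalized headword to a list of normalized translations; the function names and the toy dictionaries are hypothetical and are not taken from the authors' system.</p>
```python
def translate(terms, dictionary):
    """Replace each term with its dictionary translations; a term with
    no entry is passed through unchanged."""
    return [dictionary.get(term, [term]) for term in terms]


def transitive_translate(german_terms, de_en, en_fr):
    """Return one list of French candidates per German query term."""
    french_sets = []
    for english_terms in translate(german_terms, de_en):
        candidates = []
        for french_terms in translate(english_terms, en_fr):
            for term in french_terms:
                if term not in candidates:  # keep first-seen order
                    candidates.append(term)
        french_sets.append(candidates)
    return french_sets


# Hypothetical toy dictionaries, for illustration only.
de_en = {"brief": ["letter", "brief"], "bombe": ["bomb"]}
en_fr = {"letter": ["lettre"], "bomb": ["bombe"]}
print(transitive_translate(["brief", "bombe", "xyz"], de_en, en_fr))
# prints [['lettre', 'brief'], ['bombe'], ['xyz']]
```
<p>Keeping one candidate list per German term, rather than a single flat set, preserves the grouping that a later disambiguation step needs in order to select one translation per source term. The unknown word "xyz" passes through both dictionaries unchanged, as does the English word "brief", which has no entry in the toy English to French dictionary.</p></div>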
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Type of Search Runs</head><p>For dictionary-based transitive query translation via a pivot language, we executed three types of runs: (a) two-stage refinement of translation with English disambiguation, (b) two-stage refinement of translation without English disambiguation (the same as in CLEF 2003), and (c) no refinement.</p><p>In order to comparatively evaluate the performance of our two-stage refinement method, we decided to use commercial MT software produced by a Japanese company<ref type="foot" target="#foot_2">3</ref>. In this case, the original German query was first entered into the software, which automatically executes German to English translation and then English to French translation (i.e., a kind of transitive translation). The resulting French text was processed according to the procedure described in section 3.1, and finally, a set of normalized French words was obtained for each query. In the case of MT translation, only post-translation query expansion was executed, with the same procedure and parameters as in the case of dictionary-based translation.</p><p>Similarly, for comparison, we executed French monolingual runs with post-translation query expansion.</p><p>The well-known Okapi BM25 formula <ref type="bibr" target="#b3">[4]</ref> was employed for computing each document score in all searches of this study. We executed five runs in which the &lt;TITLE&gt; and &lt;DESCRIPTION&gt; fields of each query were used, and submitted the results to the organizers of CLEF 2004. All runs were executed on the information retrieval system ADOMAS (Advanced Document Management System), developed at Surugadai University in Japan.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Basic Statistics</head><p>The target French collections include 90,261 documents in total. The average document length is 227.14 words. We also used the Glasgow Herald 1995 as the document set for English disambiguation. The English collection includes 56,742 documents, and its average document length is 231.56 words.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results</head><p>Scores of average precision and R-precision are shown in Table <ref type="table" target="#tab_1">1</ref>, and the recall-precision curves of each run are presented in Fig. <ref type="figure" target="#fig_4">2</ref>. Note that each value in Table <ref type="table" target="#tab_1">1</ref> and Fig. <ref type="figure" target="#fig_4">2</ref> is calculated for 49 topics.</p><p>As shown in Table <ref type="table" target="#tab_1">1</ref>, MT significantly outperforms the dictionary-based translations, and its mean average precision (MAP) is .3368, which is 85.4% of that of the monolingual run (.3944). Although the performance of the dictionary-based approach using free dictionaries downloaded from the Internet is lower than that of the MT approach, Table <ref type="table" target="#tab_1">1</ref> shows that two-stage refinement improves the effectiveness of the dictionary-based translation method, as in our CLEF 2003 experiment. That is, the MAP score of NiiDic05 with no refinement is .1015, and NiiDic03 (with English disambiguation) and NiiDic04 (without English disambiguation) significantly outperform NiiDic05.</p><p>However, the English disambiguation appears to have almost no effect. The MAP score of NiiDic03 is .2690, which is slightly inferior to that of NiiDic04 (.2740), and there is clearly no statistically significant difference between them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Concluding Remarks</head><p>This paper reported the results of our experiment on CLIR from German to French, in which English was used as a pivot language. Two-stage refinement of query translation was employed to remove irrelevant terms in the target language produced by transitive translation using two bilingual dictionaries successively, and to expand the set of translations. In particular, in CLEF 2004, disambiguation of English terms in the middle stage of transitive translation was tried.</p><p>As a result, it turned out that (1) our two-stage refinement method significantly improves the retrieval performance of bilingual IR using a pivot language, and (2) English disambiguation has almost no effect.</p><p>Intuitively, English disambiguation is promising because removing erroneous English terms should, in theory, prevent irrelevant terms from spreading into the final set of search terms in the target language. Further research is needed.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Two-stage refinement of translation with English disambiguation</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) Two-stage refinement of translation with English disambiguation; (b) Two-stage refinement of translation without English disambiguation (same as in CLEF 2003); (c) No refinement</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Recall-precision curves</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Average precision and R-precision (49 topics)</figDesc><table><row><cell>Run</cell><cell>ID</cell><cell>Average Precision</cell><cell>R-Precision</cell></row><row><cell>French Monolingual</cell><cell>NiiFF01</cell><cell>.3944</cell><cell>.3783</cell></row><row><cell>MT</cell><cell>NiiMt02</cell><cell>.3368</cell><cell>.3125</cell></row><row><cell>Dictionary 1: Two-stage refinement with English disambiguation</cell><cell>NiiDic03</cell><cell>.2690</cell><cell>.2549</cell></row><row><cell>Dictionary 2: Two-stage refinement without English disambiguation</cell><cell>NiiDic04</cell><cell>.2746</cell><cell>.2542</cell></row><row><cell>Dictionary 3: No refinement</cell><cell>NiiDic05</cell><cell>.1015</cell><cell>.1014</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.freelang.net/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://snowball.tartarus.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.crosslanguage.co.jp/english/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Two stages refinement of query translation for pivot language approach to cross lingual information retrieval: a trial at CLEF 2003</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kishida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kando</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes for the CLEF 2003 Workshop</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="129" to="136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Resolving ambiguity for cross-language retrieval</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ballesteros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st ACM SIGIR conference on Research and Development in Information Retrieval</title>
				<meeting>the 21st ACM SIGIR conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="64" to="71" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An algorithm for suffix stripping</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Porter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Program</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="130" to="137" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Okapi at TREC-3</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
		<ptr target="http://trec.nist.gov/pubs/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of TREC-3</title>
				<meeting>TREC-3<address><addrLine>Gaithersburg</addrLine></address></meeting>
		<imprint>
			<publisher>National Institute of Standards and Technology</publisher>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
