<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">From terms to concepts: a revisited approach to Local Context Analysis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Annalina</forename><surname>Caputo</surname></persName>
							<email>acaputo@di.uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari</orgName>
								<address>
									<postCode>70126</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>basilepp@di.uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari</orgName>
								<address>
									<postCode>70126</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Semeraro</surname></persName>
							<email>semeraro@di.uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari</orgName>
								<address>
									<postCode>70126</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">From terms to concepts: a revisited approach to Local Context Analysis</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">35A9E6739DD4593B54E0FC025924AC17</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Pseudo-Relevance Feedback (PRF) is a widely used technique that aims to improve the query representation by assuming that the top-ranked documents are relevant. This should result in better performance because, after expansion and re-weighting of the original query, the resulting vector should contain the features that best express the user's information need. This paper presents the application of a pseudo-relevance feedback technique, called Local Context Analysis (LCA), to SENSE (SEmantic N-levels Search Engine). SENSE is an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels which integrate (and do not simply replace) the lexical level represented by keywords. The evaluation shows that this PRF technique works effectively on both the lexical level represented by keywords and the semantic level represented by WordNet synsets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction and Background</head><p>LCA <ref type="bibr" target="#b5">[6]</ref> is a PRF technique which exploits the context of query words in a collection of documents by analyzing which words in the top-ranked documents co-occur with most of the query terms. This paper presents an extension of LCA in SENSE <ref type="bibr" target="#b1">[2]</ref>, an IR system which aims to be a step beyond traditional keyword-based systems. The main idea underlying SENSE is the definition of an open framework to model different semantic aspects (or levels) pertaining to document content. Two basic levels are available in the framework: the keyword level, the entry level in which the document is represented by the words occurring in the text, and the word meaning level, represented through synsets obtained from WordNet, a semantic lexicon for the English language. A synset is a set of synonymous words. Word Sense Disambiguation algorithms are adopted to assign synsets to words. Analogously, several different levels of representation are needed for representing queries. In this model, the notion of relevance of a document d in the collection for the user query q is also extended to several levels of representation. A local similarity function computes the document relevance for each level, according to feature weights defined by the corresponding local scoring function. Then, a global ranking function merges the result lists that come from each level into a single list of documents ranked in decreasing order of relevance. In the same way, the PRF technique should be able to work over all the levels involved in our model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">nLCA</head><p>LCA has proved its effectiveness on several test collections. This technique combines the strength of a global relevance feedback method like PhraseFinder <ref type="bibr" target="#b3">[4]</ref> while avoiding its drawbacks. LCA selects the expansion terms directly from the collection on the basis of their co-occurrences with query terms. Unlike PhraseFinder, this method computes these statistics only on the top-ranked documents, which are assumed to be the relevant ones, with a considerable gain in efficiency. Thus, LCA combines the advantage of a global technique with the efficiency of a local one. This technique is grounded on the hypothesis that terms frequently occurring in the top-ranked documents also frequently co-occur with all query terms in those documents. Our work exploits the idea of LCA in the N-levels model. In that model, LCA is integrated into two representation levels: keyword and word meaning. The challenge lies in the idea that the LCA hypothesis could also be applied to the word meaning level, in which meanings are involved instead of terms. 
The original measure of co-occurrence degree is extended to encompass the weight of a generic feature (keyword or word meaning) rather than just a term.</p><p>We modify the original formula by introducing two new factors, θ and γ (in bold in the following formulae):</p><formula xml:id="formula_0">\mathrm{codegree}(f, q_i) = \frac{\log_{10}(\mathrm{co}(f, q_i) + 1) \cdot \mathrm{idf}(f)}{\log_{10}(n)}<label>(1)</label></formula><p>codegree is computed starting from the degree of co-occurrence of the feature f and the query feature q_i (co(f, q_i)), but it also takes into account how frequent f is in the whole collection (idf(f)) and normalizes this value with respect to n, the number of documents in the top-ranked set.</p><formula xml:id="formula_1">\mathrm{co}(f, q_i) = \sum_{d \in S} \mathrm{tf}(f, d) \cdot \mathrm{tf}(q_i, d) \cdot \boldsymbol{\theta} \quad (2) \qquad \mathrm{idf}(f) = \min\left(1.0, \frac{\log_{10}(N / N_f)}{5.0}\right)<label>(3)</label></formula><p>where tf(f, d) and tf(q_i, d) are the frequencies in d of f and q_i respectively, S is the set of top-ranked documents, N is the number of documents in the collection and N_f is the number of documents containing the feature f. For each level, we retrieve the n top-ranked documents for a query q and then we rank the features belonging to those documents by computing the function lca, as follows:</p><formula xml:id="formula_2">\mathrm{lca}(f, q) = \prod_{q_i \in q} \left(\delta + \boldsymbol{\gamma} \cdot \mathrm{codegree}(f, q_i)\right)^{\mathrm{idf}(q_i)}<label>(4)</label></formula><p>θ and γ transfer the importance of a query term into the weight of the words it co-occurs with. In fact, θ takes into account the frequency of a query term (qf) in the original query (θ = 1 + log(qf(q_i))), while γ takes into account a boost factor associated with a specific query term (γ = 1 + log(boost(q_i))). lca is used to rank the list of features that occur in the top-ranked documents, δ is a smoothing factor, while the exponent idf(q_i) is used to raise the impact of rare features. 
The new query q* is given by the sum of the original query q and the expanded query q′, where q′ = (w_{f_1}, ..., w_{f_k}) and w_{f_i} = 1.0 − 0.9 · i / k is the weight of the i-th feature f_i. The new query is then re-executed to obtain the final list of ranked documents for each level. Unlike the original work, we applied LCA to the top-ranked documents rather than to passages<ref type="foot" target="#foot_0">1</ref>.</p></div>
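<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of Eqs. (1)-(4) and of the rank-based expansion weights, the feature scoring can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the SENSE implementation: the function names, the bag-of-features document representation, the toy data, and the use of the natural logarithm for θ and γ (the text only writes "log") are all illustrative choices.</p>
<p><code lang="python">
```python
import math
from collections import Counter


def idf(n_collection, n_with_feature):
    """Eq. (3): inverse document frequency, capped at 1.0."""
    return min(1.0, math.log10(n_collection / n_with_feature) / 5.0)


def co(f, q_i, top_docs, theta):
    """Eq. (2): co-occurrence of f and q_i over the top-ranked set S."""
    return sum(d[f] * d[q_i] * theta for d in top_docs)


def codegree(f, q_i, top_docs, theta, idf_f):
    """Eq. (1): log-damped co-occurrence, scaled by idf(f) and log10(n)."""
    n = len(top_docs)
    if not n > 1:
        raise ValueError("need at least two top-ranked documents")
    return math.log10(co(f, q_i, top_docs, theta) + 1) * idf_f / math.log10(n)


def lca(f, query_freq, boost, top_docs, doc_freq, n_collection, delta=0.1):
    """Eq. (4): product over query features of
    (delta + gamma * codegree) ** idf(q_i)."""
    idf_f = idf(n_collection, doc_freq[f])
    score = 1.0
    for q_i, qf in query_freq.items():
        theta = 1 + math.log(qf)          # natural log is an assumption
        gamma = 1 + math.log(boost[q_i])  # natural log is an assumption
        cd = codegree(f, q_i, top_docs, theta, idf_f)
        score *= (delta + gamma * cd) ** idf(n_collection, doc_freq[q_i])
    return score


def expansion_weights(ranked_features, k):
    """w_{f_i} = 1.0 - 0.9 * i / k for the k best features, i = 1..k."""
    return {f: 1.0 - 0.9 * i / k
            for i, f in enumerate(ranked_features[:k], start=1)}


# Toy data (entirely made up): three top-ranked documents as bags of features.
top_docs = [
    Counter({"jaguar": 2, "car": 3, "engine": 1}),
    Counter({"jaguar": 1, "car": 1, "cat": 2}),
    Counter({"car": 2, "engine": 2}),
]
query_freq = {"jaguar": 1, "car": 1}  # qf(q_i)
boost = {"jaguar": 8, "car": 8}       # hypothetical per-term boost factors
doc_freq = {"jaguar": 50, "car": 500, "engine": 200, "cat": 300}
candidates = ["engine", "cat"]
ranked = sorted(
    candidates,
    key=lambda f: lca(f, query_freq, boost, top_docs, doc_freq, 166717),
    reverse=True,
)
print(ranked, expansion_weights(ranked, k=2))
```
</code></p>
<p>In this sketch θ is applied inside the sum of Eq. (2), although it depends only on q_i and could equally be factored out. A feature that never co-occurs with a query feature has codegree 0, so its factor in Eq. (4) reduces to δ raised to idf(q_i), which is why δ acts as a smoothing term rather than zeroing the whole product.</p></div>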
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Setting the scene</head><p>We evaluate our technique on the CLEF Ad-Hoc Robust Task collection <ref type="bibr" target="#b0">[1]</ref>. The CLEF collection is composed of 166,717 documents and 160 topics. In this collection both documents and topics are disambiguated by the task organizers. Topics are structured in three fields: Title, Description and Narrative. All query fields are exploited in the search phase with a different boost factor: Title = 8, Description = 2 and Narrative = 1. We use Okapi BM25 <ref type="bibr" target="#b4">[5]</ref> as the local similarity function for both the meaning and keyword levels. In particular, we adopt the BM25-based strategy which takes into account multi-field documents. Documents in the CLEF collection are represented by two fields: HEADLINE and TEXT. The multi-field representation reflects this structure. We set the BM25 parameters as follows: b = 0.7 in both levels, and k₁ = 3.25 and 3.50 in the keyword and meaning levels respectively. We tested several n, k, and δ values, and we set n = k = 10 and δ = 0.1. To compute the global ranking function we adopt the CombSUM <ref type="bibr" target="#b2">[3]</ref> strategy, giving a weight of 0.8 to the keyword level and 0.2 to the meaning level. All parameters (boosting factors, BM25 and global ranking function) are set after a tuning phase over a set of training topics provided by the organizers. To evaluate our approach we consider the Mean Average Precision (MAP) and the Geometric Mean Average Precision (GMAP).</p></div>
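<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighted CombSUM merge of the two levels described above can be sketched as follows. This is only an illustrative sketch: the function name and the toy scores are hypothetical, and the text does not say whether per-level scores are normalized before being summed, so the sketch simply combines the scores it is given.</p>
<p><code lang="python">
```python
def weighted_combsum(level_results, level_weights):
    """Merge per-level result lists into one ranking.

    level_results maps a level name to a {doc_id: score} dictionary;
    level_weights maps the same level names to their weights.
    A document missing from a level contributes nothing for that level.
    """
    combined = {}
    for level, scores in level_results.items():
        weight = level_weights[level]
        for doc_id, score in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy scores (made up), with the level weights reported in the text.
results = {
    "keyword": {"d1": 1.0, "d2": 0.5},
    "meaning": {"d2": 1.0, "d3": 0.5},
}
weights = {"keyword": 0.8, "meaning": 0.2}
for doc_id, score in weighted_combsum(results, weights):
    print(doc_id, round(score, 3))
```
</code></p>
<p>With these toy inputs a document retrieved only by the keyword level can still outrank one retrieved by both levels, because the keyword level carries four times the weight of the meaning level.</p></div>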
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results and Remarks</head><p>We first performed two experiments in which one level at a time is considered; the two result lists are then merged to produce a single list of ranked documents. We explored three strategies involving LCA. The first strategy (lca) is based on the formula proposed in <ref type="bibr" target="#b5">[6]</ref>. In the second strategy (lca-n), we also took into account the meaning level and decided to expand only synsets referring to nouns. The second strategy tries to overcome a limit of Word Sense Disambiguation algorithms which, in general, perform better on nouns. The third strategy (lca-n-θγ) is based on lca-n, but with the introduction of the θ and γ factors. The results of our evaluation are reported in Table <ref type="table" target="#tab_0">1</ref>. While the synset level alone is not able to reach the performance of the keyword level, the combination of these two levels without expansion strategies (no-expansion) improves performance in both MAP and GMAP. All lca strategies exploited in this paper outperform our baseline (no-expansion). However, it is worth highlighting that expansion on the synset level produces slightly better results than the standard lca method when it involves only nouns (lca-n). The introduction of the θ and γ parameters results in the best performance. This result supports the claim that the weight of query terms is also important for weighting the expansion terms. 
Future work will include the comparison, in the N-levels model, of the proposed approach with other PRF techniques, such as Rocchio, Divergence from Randomness and Kullback-Leibler language modeling.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Results on CLEF Ad-Hoc Robust collection</figDesc><table><row><cell></cell><cell>Run</cell><cell>MAP</cell><cell>GMAP</cell></row><row><cell>one-level (no-expansion)</cell><cell>keyword</cell><cell>.4207</cell><cell>.1900</cell></row><row><cell>one-level (no-expansion)</cell><cell>synset</cell><cell>.3119</cell><cell>.1197</cell></row><row><cell>n-levels</cell><cell>no-expansion</cell><cell>.4253</cell><cell>.1973</cell></row><row><cell>n-levels</cell><cell>lca-n</cell><cell>.4304</cell><cell>.1945</cell></row><row><cell>n-levels</cell><cell>lca-n-θγ</cell><cell>.4532</cell><cell>.2114</cell></row></table></figure>
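<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers less familiar with the two measures reported in Table 1, the following sketch shows how MAP and GMAP are typically derived from per-topic average precision. It is an illustration under assumptions rather than the evaluation code used for the experiments: the function names are hypothetical, and the small floor applied before taking logarithms (here 1e-5) is a common convention that the text does not specify.</p>
<p><code lang="python">
```python
import math


def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant hit."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(ap_values):
    """Arithmetic mean of the per-topic average precision values."""
    return sum(ap_values) / len(ap_values)


def geometric_map(ap_values, floor=1e-5):
    """Geometric mean of per-topic AP; the floor avoids log(0) (assumption)."""
    logs = [math.log(max(ap, floor)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


# Two made-up topics: one poorly served, one perfectly served.
ap_per_topic = [0.25, 1.0]
print(mean_average_precision(ap_per_topic), geometric_map(ap_per_topic))
```
</code></p>
<p>For the two toy topics MAP is 0.625 while GMAP is about 0.5: the geometric mean is pulled down by the poorly performing topic, which is why GMAP is used alongside MAP when robustness across topics matters.</p></div>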
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">In the original work, passages are parts of the document text of about 300 words.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">CLEF 2009 Ad Hoc Track Overview: Robust-WSD Task</title>
		<author>
			<persName><forename type="first">E</forename><surname>Agirre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Di Nunzio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Otegi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Multilingual Information Access Evaluation</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Peters</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Di Nunzio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Kurimo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Mostefa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Peñas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Roda</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">I</biblScope>
		</imprint>
	</monogr>
	<note>Text Retrieval Experiments</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Enhancing semantic search using N-levels document representation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Caputo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Gentile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Degemmis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lops</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Semantic Search (SemSearch 2008) at the 5th European Semantic Web Conference (ESWC 2008)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Bloehdorn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Tran</surname></persName>
		</editor>
		<meeting>the Workshop on Semantic Search (SemSearch 2008) at the 5th European Semantic Web Conference (ESWC 2008)<address><addrLine>Tenerife, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008-06-02">June 2nd, 2008</date>
			<biblScope unit="volume">334</biblScope>
			<biblScope unit="page" from="29" to="43" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Combination of Multiple Searches</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Shaw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">TREC</title>
		<imprint>
			<biblScope unit="page" from="243" to="252" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An association thesaurus for information retrieval</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Jing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">RIAO 94 Conference Proceedings</title>
				<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="146" to="160" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Simple BM25 extension to multiple weighted fields</title>
		<author>
			<persName><forename type="first">S</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the thirteenth ACM international conference on Information and knowledge management</title>
				<meeting>the thirteenth ACM international conference on Information and knowledge management<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="42" to="49" />
		</imprint>
	</monogr>
	<note>CIKM &apos;04</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improving the effectiveness of information retrieval with local context analysis</title>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="79" to="112" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
