<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">KISTI at CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Improving Medical Document Retrieval with Query Expansion</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Heung-Seon</forename><surname>Oh</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Science and Technology Information</orgName>
								<orgName type="institution">Korea</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yuchul</forename><surname>Jung</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Science and Technology Information</orgName>
								<orgName type="institution">Korea</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">KISTI at CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Improving Medical Document Retrieval with Query Expansion</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C9F8FC6F271674685C8499A5A2B50E3B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>language model</term>
					<term>feedback model</term>
					<term>query expansion</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this report, we describe our retrieval framework for participating in CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Ad-hoc Search. Our retrieval framework is a query expansion approach which adopts relevance and pseudo relevance feedback to improve retrieval performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>This report summarizes our approaches to the CLEF eHealth 2017 <ref type="bibr" target="#b1">[2]</ref> Patient-Centered Information Retrieval Task-1, a standard ad-hoc search task <ref type="bibr" target="#b6">[7]</ref>. As in 2016, this task utilizes a large web corpus (ClueWeb12 B13) and topics developed by mining health web forums where users were seeking advice about specific symptoms, diagnoses, conditions, or treatments.</p><p>The main goal of the task is to improve the relevance assessment pool and the reusability of the collection. To meet this year's evaluation requirements, we explicitly exclude documents that were already assessed in 2016 from our search results. Meanwhile, to enhance the relevance of the search results, we utilize the already assessed documents in our proposed approaches, following the suggested guidelines.</p><p>Based on the above considerations, we designed a medical information retrieval framework characterized by relevance feedback for the initial search and query expansion for re-ranking.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Retrieval framework</head><p>Our proposed framework performs selective query expansion in the initial retrieval and then re-ranks the retrieved results with more accurate query expansion methods. Figure <ref type="figure">1</ref> shows an overview of our retrieval framework. First, we employ relevance feedback (RF) based on the relevance judgements built last year, since doing so is encouraged in order to improve both retrieval performance and the relevance assessment pool. For a query, a feedback model is constructed and combined with the original query model to produce a new query model. Second, an initial search is performed with the new query model and produces a set of documents from the collection. For the retrieved documents, we perform re-ranking with new queries built via two different query expansion methods.</p></div>
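The interpolation of the original query model with a feedback model described above can be sketched as follows. This is a minimal illustration, not our actual implementation: it assumes maximum-likelihood unigram query models and the 0.5 mixing weight from our evaluation settings, and all function names are illustrative.

```python
from collections import Counter

def query_model(query_terms):
    """Maximum-likelihood unigram model of the original query."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_query, p_feedback, alpha=0.5):
    """RM3-style mixture: alpha * P(w|Q) + (1 - alpha) * P(w|F).

    Both inputs are word -> probability dicts; the result is again a
    proper distribution when both inputs sum to one.
    """
    vocab = set(p_query) | set(p_feedback)
    return {w: alpha * p_query.get(w, 0.0) + (1 - alpha) * p_feedback.get(w, 0.0)
            for w in vocab}
```

Feedback words absent from the original query (e.g. terms contributed by the judged relevant documents) enter the new query model through the second mixture component.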
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1. Overview of retrieval framework</head><p>As summarized above, our framework starts with relevance feedback to improve both retrieval performance and the relevance assessment pool. Given the set of documents judged relevant to a query, a relevance model, i.e. RM1 <ref type="bibr" target="#b3">[4]</ref>, is constructed with the documents scored by the KL-divergence method (KLD) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b8">9]</ref>. There exist two differences compared to standard RM1, since our model is built from the relevance judgements. First, all judged relevant documents are involved in the feedback model because they are explicitly relevant. Second, the relevance scores are employed as document priors. Because of these differences, the resulting query model is expected to capture all of the relevant information in the judged set. Finally, a new query is constructed via RM3 <ref type="bibr" target="#b0">[1]</ref>. After that, the initial search is performed with the KLD method on the entire collection to obtain a set of retrieved documents, which are the target for re-ranking. Before re-ranking, two different query expansion techniques are considered based on this document set. The first query expansion approach adopts random-walk-based centrality scores <ref type="bibr" target="#b4">[5]</ref> with a different transition matrix. This strategy estimates the query model by considering the associations of words in a query. The major difference is that the association between two words w and u is computed using the two corresponding word vectors rather than co-occurrences. The word vectors are an accurate representation obtained through GloVe <ref type="bibr" target="#b7">[8]</ref>, an unsupervised learning algorithm for obtaining vector representations of words, i.e. so-called word embeddings. GloVe is known to outperform word2vec models on word similarity and named entity recognition tasks. The word vectors were computed on the TREC CDS 2016 collection <ref type="bibr" target="#b7">[8]</ref>, which contains about 1.2M biomedical journal articles; we expect these vectors to be more representative of the medical domain than vectors trained on other domains. Then, centrality scores are computed via a random walk on the transition matrix and regarded as a query model. Similar to RM3 above, a new query model is generated by combining the original query model and the centrality scores. Finally, the retrieved documents are re-ranked according to the new query model with the KLD method. The second query expansion approach follows the cluster-based external expansion model (CBEEM) <ref type="bibr" target="#b5">[6]</ref>, an advanced method for using external collections in pseudo-relevance feedback (PRF). The key idea of CBEEM is to estimate an accurate feedback model using not only the original collection but also other benchmark collections. Again, the TREC CDS 2016 collection was employed as the external collection. As a result, re-ranking is performed with the resulting new query.</p></div>
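The random-walk centrality step above can be sketched as follows. This is a simplified illustration, not our actual implementation: it assumes cosine similarities of the word vectors as transition weights, a PageRank-style damping factor of 0.85 (the paper does not specify one), and that every word has at least one positively similar neighbour; all names are illustrative.

```python
import numpy as np

def centrality_scores(vectors, damping=0.85, iters=100, tol=1e-10):
    """Random-walk (PageRank-style) centrality over a word graph whose
    transition probabilities come from cosine similarities of word vectors.

    vectors: dict mapping word -> embedding (list or array of floats).
    Returns a dict mapping word -> stationary probability.
    """
    words = list(vectors)
    V = np.array([vectors[w] for w in words], dtype=float)
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-length rows
    sim = np.clip(V @ V.T, 0.0, None)               # keep non-negative weights
    np.fill_diagonal(sim, 0.0)                      # no self-transitions
    # Row-normalize into a stochastic transition matrix
    # (assumes no row is all zeros, i.e. every word has a similar neighbour).
    P = sim / sim.sum(axis=1, keepdims=True)
    n = len(words)
    r = np.full(n, 1.0 / n)                         # uniform start
    for _ in range(iters):
        r_new = damping * (P.T @ r) + (1.0 - damping) / n
        if np.abs(r_new - r).sum() < tol:
            r = r_new
            break
        r = r_new
    r /= r.sum()
    return dict(zip(words, r))
```

Words tightly connected to many similar words accumulate probability mass, so domain-central query terms receive higher weights than isolated ones.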
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>Two different collections are used as the target and external collections, respectively. The target collection is ClueWeb12-Disk-B (ClueWeb12B), which includes about 52M web pages, while the external collection is TREC CDS 2016, which includes about 1.2M biomedical journal articles. In both collections, the text of the pages was extracted by removing HTML and XML tags with the JSOUP<ref type="foot" target="#foot_0">1</ref> parser. Table 1 shows the summary of the data statistics; the average document length is 850.9 tokens for ClueWeb12B and 4,511.9 tokens for TREC CDS 2016.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Evaluation Settings</head><p>All mixture weights for combining the query and feedback models are set to 0.5, and the Dirichlet prior is set to 2500. In relevance feedback (RF), the number of feedback words is set to 50, while the number of feedback documents corresponds to the number of relevant documents. In the two query expansion approaches, they are fixed at 5 and 50, respectively. Word vectors are estimated using GloVe with the ADAM optimizer, with a vector size of 200.</p></div>
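The Dirichlet prior above enters the KLD retrieval score used throughout the framework. A minimal, rank-equivalent sketch under our settings (mu = 2500), assuming unigram models; the function names are illustrative, not from our actual implementation:

```python
import math

def dirichlet_prob(word, doc_counts, doc_len, coll_prob, mu=2500):
    """P(w|D) with Dirichlet prior smoothing:
    (c(w; D) + mu * P(w|C)) / (|D| + mu)."""
    return (doc_counts.get(word, 0) + mu * coll_prob.get(word, 1e-12)) / (doc_len + mu)

def kld_score(query_model, doc_counts, coll_prob, mu=2500):
    """Rank-equivalent KL-divergence score: sum_w P(w|Q) * log P(w|D).

    query_model: word -> probability; doc_counts: word -> term frequency;
    coll_prob: word -> collection language-model probability.
    """
    doc_len = sum(doc_counts.values())
    return sum(p * math.log(dirichlet_prob(w, doc_counts, doc_len, coll_prob, mu))
               for w, p in query_model.items() if p > 0)
```

A document that actually contains the weighted query terms receives a higher score than one that relies only on the smoothed collection probabilities.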
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Submitted Runs</head><p>We submitted three runs for this task. Run1, considered our baseline, is the result of applying RF. Run2 and Run3 employ centrality scores and CBEEM, respectively. Table <ref type="table" target="#tab_2">2</ref> summarizes the three runs.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Data Statistics</figDesc><table><row><cell></cell><cell>ClueWeb12B</cell><cell>TREC CDS 2016</cell></row><row><cell>#Docs</cell><cell>52,051,844</cell><cell>1,255,260</cell></row><row><cell>Voc. Size</cell><cell>20,139,450</cell><cell>2,938,617</cell></row><row><cell>Tokens</cell><cell>44,291,018,290</cell><cell>5,663,660,754</cell></row><row><cell>Avg. Doc. Len</cell><cell>850.9</cell><cell>4,511.9</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Descriptions of our Submitted Runs</figDesc><table><row><cell>Run</cell><cell>Description</cell></row><row><cell>1</cell><cell>Relevance feedback (RF)</cell></row><row><cell>2</cell><cell>RF + Random-walk based centrality scores</cell></row><row><cell>3</cell><cell>RF + Cluster-based external expansion model</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://jsoup.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">UMass at TREC 2004: Novelty and HARD</title>
		<author>
			<persName><forename type="first">N</forename><surname>Abdul-Jaleel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Text REtrieval Conference (TREC)</title>
				<meeting>Text REtrieval Conference (TREC)</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">CLEF 2017 eHealth Evaluation Lab Overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF 2017 -8th Conference and Labs of the Evaluation Forum</title>
		<title level="s">Lecture Notes in Computer Science (LNCS</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">PageRank without hyperlinks: Structural re-ranking using links induced by language models</title>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval -SIGIR &apos;05</title>
				<meeting>the 28th annual international ACM SIGIR conference on Research and development in information retrieval -SIGIR &apos;05<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="306" to="313" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Conceptual language models for domain-specific retrieval</title>
		<author>
			<persName><forename type="first">E</forename><surname>Meij</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Process. Manag</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="448" to="469" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Multiple-Stage Approach to Re-ranking Medical Documents</title>
		<author>
			<persName><forename type="first">H.-S</forename><surname>Oh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF</title>
				<meeting>CLEF</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="166" to="177" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Cluster-based query expansion using external collections in medical information retrieval</title>
		<author>
			<persName><forename type="first">H.-S</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Biomed. Inform</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="70" to="79" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab</title>
		<author>
			<persName><forename type="first">J</forename><surname>Palotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of the TREC 2016 Clinical Decision Support Track</title>
		<author>
			<persName><forename type="first">K</forename><surname>Roberts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Twenty-Fifth Text REtrieval Conference</title>
				<meeting>The Twenty-Fifth Text REtrieval Conference<address><addrLine>TREC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Model-based feedback in the language modeling approach to information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the tenth international conference on Information and knowledge management</title>
				<meeting>the tenth international conference on Information and knowledge management<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="403" to="410" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
