<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Semantic Annotations for Entity-based Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lei Zhang</string-name>
          <email>l.zhang@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Farber</string-name>
          <email>michael.faerber@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh Tran</string-name>
          <email>ducthanh.tran@sjsu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achim Rettinger</string-name>
          <email>rettinger@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute AIFB, Karlsruhe Institute of Technology</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>San Jose State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we propose a new approach to entity-based information retrieval by exploiting semantic annotations of documents. With the increased availability of structured knowledge bases and semantic annotation techniques, we can capture documents and queries at their semantic level to avoid the high semantic ambiguity of terms and to bridge the language barrier between queries and documents. Based on various semantic interpretations, users can refine the queries to match their intents. By exploiting the semantics of entities and their relations in knowledge bases, we propose a novel ranking scheme to address the information needs of users.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The ever-increasing amount of semantic data on the Web poses new challenges
but at the same time opens up new opportunities for information access. With
the advancement of semantic annotation technologies, semantic data can be
employed to significantly enhance information access by increasing the depth
of analysis of current systems. Traditional document search, by contrast, excels
at the shallow information needs expressed by keyword queries, where meaningful
semantic annotations contribute very little. There is a pressing need to
exploit the currently emerging knowledge bases (KBs), such as DBpedia and
Freebase, as the underlying semantic model and to make use of semantic annotations
that contain vital cues for matching the specific information needs of users.</p>
      <p>A large body of work automatically analyzes documents and leverages
the analysis results, such as part-of-speech tags, syntactic parses, word senses,
and named entity and relation information, to improve search
performance. A study [1] investigates the impact of named entity and relation
recognition on search performance. However, this kind of work relies on natural
language processing (NLP) techniques to extract linguistic information from
documents and does not utilize the rich semantic data on the Web. In [2],
an ontology-based scheme for semi-automatic annotation of documents and a
retrieval system are presented, where the ranking is based on an adaptation of the
traditional vector space model that takes into account adapted TF-IDF weights.</p>
      <p>
        Our work belongs to this line of research. Nevertheless, it provides a
significantly new search paradigm. The main contributions are: (1) The rich
semantics in KBs are used to yield the semantic representations of documents
and queries. Based on the various semantic interpretations of queries, users
can refine them to match their intents. (2) Given our emphasis on the semantics
of entities and relations, we introduce a novel scoring mechanism that lets users
influence document ranking through manual selection of entities and weighting
of relations. (3) Another important feature is the support of cross-linguality, which
is crucial when queries and documents are in different languages.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Document Retrieval Process</title>
      <p>In this section, we present our document retrieval process, which consists of five
steps. While lexica extraction and text annotation are performed offline, entity
matching, query refinement and document ranking are handled online based on
the index generated by offline processing.</p>
      <p>Lexica Extraction. In this step, we construct the cross-lingual lexica by
exploiting the multilingual Wikipedia to extract the cross-lingual groundings of
entities in KBs, also called surface forms, i.e., words and phrases in different
languages that can be used to refer to entities [3]. Besides the extracted surface
forms, we also exploit statistics of the cross-lingual groundings to measure the
association strength between the surface forms and the referent entities.</p>
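      <p>As a minimal sketch of such a lexicon, the Python snippet below estimates the association strength as the relative frequency with which a surface form refers to an entity. This estimator, the function names <monospace>build_lexicon</monospace> and <monospace>match_entities</monospace>, and the toy observations are assumptions made for illustration; the statistics actually used are described in [3].</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def build_lexicon(observations):
    """Map each surface form to {entity: relative frequency of that entity}.

    `observations` is an iterable of (surface form, entity) pairs, e.g. taken
    from link anchors. Relative frequency is an assumed stand-in for the
    association strength mentioned in the text.
    """
    counts = defaultdict(Counter)
    for surface_form, entity in observations:
        counts[surface_form.lower()][entity] += 1
    lexicon = {}
    for surface_form, entity_counts in counts.items():
        total = sum(entity_counts.values())
        lexicon[surface_form] = {e: c / total for e, c in entity_counts.items()}
    return lexicon


def match_entities(lexicon, query):
    """Return candidate entities for a query, strongest association first."""
    candidates = lexicon.get(query.lower(), {})
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)


# Invented toy observations in two languages (hypothetical data).
observations = [
    ("Apple", "dbr:Apple_Inc."),
    ("Apple", "dbr:Apple_Inc."),
    ("Apple", "dbr:Apple_Inc."),
    ("Apple", "dbr:Apple"),
    ("苹果", "dbr:Apple_Inc."),
    ("苹果", "dbr:Apple"),
]
lexicon = build_lexicon(observations)
```
]]></preformat>
      <p>With these toy counts, the query "apple" returns dbr:Apple_Inc. with strength 0.75 ahead of dbr:Apple with 0.25, and the Chinese surface form resolves to the same two language-independent entities.</p>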
      <p>Text Annotation. The next step enriches documents with
entities in KBs. This helps to bridge the gap between ambiguous natural language text and
the precise formal semantics captured by KBs, and it transforms documents in
different languages into a language-independent representation. For this purpose,
we employ our cross-lingual semantic annotation system [4], and the resulting
annotated documents are indexed to make them searchable with KB entities.</p>
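      <p>The index over annotated documents can be pictured as an inverted index from KB entities to document identifiers. The following Python sketch is only an illustration under that assumption; the function name <monospace>build_entity_index</monospace>, the document identifiers, and the annotations are hypothetical, and the actual index structure is not specified in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def build_entity_index(annotated_docs):
    """Build an inverted index: entity -> set of ids of documents annotated with it."""
    index = defaultdict(set)
    for doc_id, entities in annotated_docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


# Hypothetical annotation output: documents in different languages share entity ids.
annotated_docs = {
    "en-001": {"dbr:Apple_Inc.", "dbr:Steve_Jobs"},
    "zh-042": {"dbr:Apple_Inc.", "dbr:IPhone"},
    "de-007": {"dbr:Apple"},
}
index = build_entity_index(annotated_docs)
```
]]></preformat>
      <p>Because the keys are entities rather than terms, a lookup for dbr:Apple_Inc. retrieves both the English and the Chinese document, regardless of the query language.</p>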
      <p>Entity Matching. Our online search process starts with the keyword query
in a specific language. Instead of retrieving documents, our approach first finds
entities from KBs matching the query based on the index constructed in the
lexica extraction step. These entities represent different semantic interpretations
of the query and thus are employed in the following steps to help users to refine
the search and influence document ranking according to their intents.</p>
      <p>Query Refinement. Different interpretations of the query are presented for
users to select the intended ones. Since interpretations correspond to entities in
this step, users can choose the intended entity for refinement of their information
needs. We also enable users to adjust the weights of entity relations to influence
the document ranking for a personalized document retrieval. For this, the chosen
entity is shown and extended with relations to other entities retrieved from KBs.</p>
      <p>Document Ranking. After query refinement by users, the documents in
different languages containing the chosen entity are retrieved from the index
constructed by text annotation. Then, we exploit the semantics of entities and
relations for ranking. We observe that annotated documents generally share the
following structure pattern: every document is linked to a set of entities, where
a subset (several subsets) of these entities are connected via relations in the
KB, forming a graph (graphs). In this regard, a document can be conceived as
a graph containing several connected components. Leveraging this pattern, we
propose a novel ranking scheme based on the focus on the chosen entity and the
relevance to the weighted relations.</p>
      <p>Focus-Based Ranking: Intuitively, given two documents d<sub>1</sub> and d<sub>2</sub> retrieved
for the chosen entity e, d<sub>1</sub> is more relevant than d<sub>2</sub> if it focuses more on e than
d<sub>2</sub> does, i.e., when the largest connected component of d<sub>1</sub> containing e is larger
than that of d<sub>2</sub>. Based on this rationale, we propose ScoreFocus(d, e) between
document d and entity e to capture the focus of d on e as follows:</p>
      <p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\mathit{ScoreFocus}(d, e) = |LCC_d^e|</tex-math>
        </disp-formula>
where LCC<sub>d</sub><sup>e</sup> is the largest connected component of d containing e and |LCC<sub>d</sub><sup>e</sup>|
represents the number of entities in LCC<sub>d</sub><sup>e</sup>.
      </p>
      <p>Relation-Based Ranking: Given the chosen entity e, the users can weight
both the existence and the occurrence frequency of its relations to influence the
document ranking. This differentiation separates the one scenario where users
are interested in obtaining more detailed information about the relationship
(qualitative information) from the other, where users are interested in the
quantity. Let R<sub>e</sub> be the set of relations of chosen entity e. We define
<inline-formula><tex-math>x_r = 1</tex-math></inline-formula> if r ∈ R<sub>e</sub>, otherwise 0, and
<inline-formula><tex-math>y_r = \frac{|r_d|}{\log(avg_r)}</tex-math></inline-formula>, where |r<sub>d</sub>| denotes the occurrence
frequency of r in d and avg<sub>r</sub> is the average occurrence frequency of r. Then,
we propose ScoreRelation(d, e) between document d and entity e to capture the
relevance of d to the weighted relations in R<sub>e</sub> as follows:</p>
      <p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>\mathit{ScoreRelation}(d, e) = \sum_{r \in R_e} \left( x_r \cdot w_r^{\mathit{existence}} + y_r \cdot w_r^{\mathit{frequency}} \right)</tex-math>
        </disp-formula>
where w<sub>r</sub><sup>existence</sup> and w<sub>r</sub><sup>frequency</sup> are weights given by users for the existence and
the occurrence frequency of relation r, respectively.
      </p>
      <p>By taking into account both focus-based and relation-based ranking, we
present the final function for scoring the documents as given in Eq. 3.</p>
      <p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>\mathit{Score}(d, e) = \frac{\mathit{ScoreFocus}(d, e) \cdot \mathit{ScoreRelation}(d, e)}{ndl_d^e}</tex-math>
        </disp-formula>
where ndl<sub>d</sub><sup>e</sup> is the normalized document length of d w.r.t. annotations, i.e. the
number of entities contained in d, which is used to penalize documents in
accordance with their lengths because a document containing more entities has
a higher likelihood to be retrieved. The effect of this component is similar to
that of normalized document length w.r.t. terms in IR. We can compute it as
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>ndl_d^e = (1 - s) + s \cdot \frac{ef_d}{avg_{ef}}</tex-math>
        </disp-formula>
where ef<sub>d</sub> denotes the total number of entities in d, avg<sub>ef</sub> is the average number
of entities in the document collection, and s is a parameter taken from IR
literature, which has been typically set to 0.2.
      </p>
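      <p>The focus-based score, the relation-based score, the length normalization and the final score described in this section can be sketched in Python as follows. This is an illustrative sketch under several assumptions that the text does not spell out: the existence indicator x<sub>r</sub> is taken to be 1 when relation r occurs at least once in d, the logarithm is the natural logarithm, avg<sub>r</sub> is assumed to be greater than 1 so that its logarithm is positive, and all function and parameter names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import deque


def largest_component_size(doc_entities, kb_edges, entity):
    """Number of entities in the connected component of the document graph containing `entity`."""
    if entity not in doc_entities:
        return 0
    # Keep only KB relations whose two endpoints are both annotated in the document.
    neighbours = {e: set() for e in doc_entities}
    for source, _relation, target in kb_edges:
        if source in neighbours and target in neighbours:
            neighbours[source].add(target)
            neighbours[target].add(source)
    # Breadth-first search from the chosen entity.
    seen, queue = {entity}, deque([entity])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


def score_relation(relation_counts, avg_counts, weights):
    """Sum over the weighted relations: x_r * w_existence + y_r * w_frequency.

    `weights` maps each relation of the chosen entity to a pair
    (w_existence, w_frequency); `relation_counts` holds |r_d| for document d.
    """
    total = 0.0
    for relation, (w_existence, w_frequency) in weights.items():
        count = relation_counts.get(relation, 0)
        avg = avg_counts[relation]
        if avg <= 1:
            raise ValueError("this sketch assumes avg_r > 1 so that log(avg_r) > 0")
        x_r = 1.0 if count > 0 else 0.0  # assumed reading of the existence indicator
        y_r = count / math.log(avg)      # natural logarithm is an assumption
        total += x_r * w_existence + y_r * w_frequency
    return total


def normalized_length(entity_count, avg_entity_count, s=0.2):
    """Normalized document length w.r.t. annotations: (1 - s) + s * ef_d / avg_ef."""
    return (1.0 - s) + s * entity_count / avg_entity_count


def score(doc_entities, kb_edges, entity, relation_counts, avg_counts,
          weights, avg_entity_count, s=0.2):
    """Final score: focus score times relation score, divided by normalized length."""
    focus = largest_component_size(doc_entities, kb_edges, entity)
    relation = score_relation(relation_counts, avg_counts, weights)
    return focus * relation / normalized_length(len(doc_entities), avg_entity_count, s)


# Toy document: A, B and C are connected via KB relations, D is isolated.
doc_entities = {"A", "B", "C", "D"}
kb_edges = [("A", "r1", "B"), ("B", "r2", "C")]
example_score = score(doc_entities, kb_edges, "A",
                      relation_counts={"r1": 2}, avg_counts={"r1": math.e},
                      weights={"r1": (1.0, 0.5)}, avg_entity_count=4.0)
```
]]></preformat>
      <p>In this toy setting the component containing A has three entities, the relation score is 1 · 1.0 + 2 · 0.5 = 2, and the normalized length is 1, so the document receives a score of about 6. Raising the frequency weight favors documents that mention the relation often, while raising the existence weight favors documents that mention it at all.</p>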
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We now discuss our preliminary evaluation results. In the experiment, we use
DBpedia [5] as the KB and Reuters Corpus Volume 1 (RCV1), which contains about
810,000 English news articles, as the document corpus. To assess the effectiveness
of our approach, we use the normalized discounted cumulative gain
(nDCG) of the top-k results instead of common measures like
precision and recall, which are not suitable for our scenario because the results
can differ in relevance for each query and differ for each facet or weight
used. We asked volunteers to provide keyword queries in Chinese (17 in total)
along with descriptions of their intents, which were used to set the weights for the relations.
This yields an average nDCG of 0.87 and an average number of results of 612.</p>
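      <p>For reference, nDCG over the top-k results can be computed as in the Python sketch below. The text does not state which nDCG variant was used; this sketch assumes the common formulation with graded gains and a logarithmic (base 2) rank discount, and it computes the ideal ranking from the judged results of the same list. The function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k graded relevance values."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```
]]></preformat>
      <p>A result list that is already sorted by relevance obtains an nDCG of 1, and placing a highly relevant document below a less relevant one lowers the value.</p>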
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we show that the semantics captured in KBs can be exploited
to allow the information needs to be specified and addressed on the semantic
level, resulting in the semantic representations of documents and queries, which
are language independent. The user feedback on our demo system [6] suggests
that the proposed approach enables more precise refinement of the queries and
is also valuable in terms of the cross-linguality. In the future, we plan to advance
the query capability to support keyword queries involving several entities and
conduct more comprehensive experiments to evaluate our system.</p>
      <p>Acknowledgments. This work is supported by the European Community's
Seventh Framework Programme FP7-ICT-2011-7 (XLike, Grant 288342) and
FP7-ICT-2013-10 (XLiMe, Grant 611346). It is also partially supported by
the German Federal Ministry of Education and Research (BMBF) within the
SyncTech project (Grant 02PJ1002) and the Software-Campus project "SUITE"
(Grant 01IS12051).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Chu-Carroll</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Prager</surname>, <given-names>J.M.</given-names></string-name>:
          <article-title>An experimental study of the impact of information extraction accuracy on semantic search performance</article-title>.
          In: <source>CIKM</source>. (<year>2007</year>) <fpage>505</fpage>–<lpage>514</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Castells</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Fernández</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Vallet</surname>, <given-names>D.</given-names></string-name>:
          <article-title>An adaptation of the vector-space model for ontology-based information retrieval</article-title>.
          <source>IEEE Trans. Knowl. Data Eng.</source> <volume>19</volume>(<issue>2</issue>) (<year>2007</year>) <fpage>261</fpage>–<lpage>272</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Zhang</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Färber</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Rettinger</surname>, <given-names>A.</given-names></string-name>:
          <article-title>xLiD-Lexica: Cross-lingual linked data lexica</article-title>.
          In: <source>LREC</source>. (<year>2014</year>) <fpage>2101</fpage>–<lpage>2105</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Zhang</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Rettinger</surname>, <given-names>A.</given-names></string-name>:
          <article-title>X-LiSA: Cross-lingual semantic annotation</article-title>.
          <source>PVLDB</source> <volume>7</volume>(<issue>13</issue>) (<year>2014</year>) <fpage>1693</fpage>–<lpage>1696</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Bizer</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Lehmann</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Kobilarov</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Auer</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Becker</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Cyganiak</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Hellmann</surname>, <given-names>S.</given-names></string-name>:
          <article-title>DBpedia - A crystallization point for the Web of Data</article-title>.
          <source>J. Web Sem.</source> <volume>7</volume>(<issue>3</issue>) (<year>2009</year>) <fpage>154</fpage>–<lpage>165</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Farber,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Rettinger</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Kuphi - an investigation tool for searching for and via semantic relations</article-title>
          .
          <source>In: ESWC</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>