<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Methodology for Searching Entities on the Web</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Enterprise Research Institute National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <title>From a Web of Documents to a Web of Entities</title>
        <p>The Semantic Web is driven by the idea of moving from a Web of documents, designed for human consumption, to a Web of data in order to “create a universal medium for the exchange of data where data can be shared and processed by automated tools as well as by people”<sup>1</sup>. Nowadays, more and more machine-readable annotations and metadata are available on the Web. This data, typically encoded using the Resource Description Framework (RDF) or Microformats, is accessible directly via HTTP. Microformats enable the annotation of an entity in a web page, whereas RDF enables the description of anything that can be named using a Uniform Resource Identifier (URI). By describing the relationships between resources, the Web moves from a Web of documents to a semantically interconnected Web of entities.</p>
        <p>Although data are available, data consumers face a challenge due to the decentralised publishing infrastructure of the Web: they need to locate information about an entity and handle multiple, possibly conflicting, views of the entity. Search engines are the primary method for accessing information on the Web, i.e. finding relevant documents given a keyword-based query. By leveraging the Web of entities, we can imagine an entity-centric search engine which, given a query, would support the user in obtaining an aggregated and balanced view of the data available on the Semantic Web. Given that Semantic Web data are machine processable, the most interesting use of such an engine could be made by machines themselves: any application could use such an engine directly to find, interconnect and enrich information.</p>
        <p>Searching for information about a particular entity on the Web raises new challenges: (i) how to efficiently locate and retrieve Semantic Web data and (ii) how to integrate data in a decentralised and heterogeneous information space. We aim to propose a comprehensive methodology for searching entities on the Web that addresses these requirements.</p>
        <p><sup>1</sup> Semantic Web Activity Statement: http://www.w3.org/2001/sw/Activity.html</p>
        <p>This material is based upon works supported by the European FP7 project Okkam - Enabling a Web of Entities (contract no. ICT-215032), and by Science Foundation Ireland under Grant No. SFI/02/CE1/I131.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>We plan to tackle these problems by exploring how existing, proven and robust
technologies can be advanced and specialised to address the needs of an
entity-centric search engine. In particular, our work will focus on the topics described
in the following sections.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Adapting Information Retrieval Engines for Semantic Web Data</title>
      <p>
        Standard Web search engines make intensive use of Information Retrieval (IR)
techniques for locating relevant information on the Web. Information Retrieval is a
well-studied field [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and many optimisations have been developed for efficiently storing
and querying large amounts of information. Techniques such as inverted indexes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
have proved to scale to the size of the Web (e.g. Google). The shortcoming of such
systems is that they can only answer simple queries, e.g. a boolean combination of words,
and are not designed to query relationships between entities, e.g. a graph pattern.
      </p>
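      <p>As a minimal sketch of the inverted index technique mentioned above, the following Python fragment builds an index over a toy corpus and answers a boolean AND query. The corpus, the whitespace tokenisation and the function names are assumptions made for illustration; they are not taken from any of the cited systems.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    # Intersect the smallest posting sets first to keep intermediate results small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical toy corpus, for illustration only.
DOCS = {
    "d1": "Galway is a city in Ireland",
    "d2": "DERI is a research institute in Galway",
    "d3": "Dublin is the capital of Ireland",
}
INDEX = build_inverted_index(DOCS)
print(sorted(boolean_and(INDEX, ["galway", "ireland"])))
```
]]></preformat>
      <p>A query of this kind is cheap because it only intersects precomputed posting sets, but it cannot express how two terms are related, which is the limitation noted above.</p>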
      <p>
        In contrast, entity-centric search engines such as SWSE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are built on a data
structure that is closer to a relational database than to an IR engine. SWSE
relies on YARS [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a distributed RDF store, for storing and querying large amounts of
graph-structured data. Such systems can typically answer complex conjunctive queries
involving large joins, but they are difficult to scale: query efficiency depends on
sophisticated indexes that are computationally expensive to update.
      </p>
      <p>Our intuition is that it is possible to construct a fast and scalable entity-centric search
engine based on a two-tier architecture: a modified IR engine that efficiently performs a
preliminary semantic document selection, and an optimised triple-level post-processing step
that answers complex queries. Our research will therefore focus on how to employ existing
IR engines to perform useful queries over semantically structured documents.</p>
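      <p>The two-tier idea can be sketched as follows, under simplifying assumptions: documents are small in-memory sets of triples, the first tier is a plain term index over the constants of the query, and the second tier is a naive nested-loop join over the triples of each candidate document. All names, prefixes and data below are hypothetical and only illustrate the division of labour, not the actual design of the engine.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# Hypothetical documents: each one is a small set of (subject, predicate, object) triples.
DOCUMENTS = {
    "doc1": {("ex:alice", "foaf:knows", "ex:bob"),
             ("ex:bob", "foaf:name", "Bob")},
    "doc2": {("ex:carol", "foaf:knows", "ex:dave"),
             ("ex:dave", "foaf:name", "Dave")},
    "doc3": {("ex:bob", "foaf:name", "Bob")},
}


def build_term_index(documents):
    """Tier 1 index: map every term used in a triple to the documents mentioning it."""
    index = defaultdict(set)
    for doc_id, triples in documents.items():
        for triple in triples:
            for term in triple:
                index[term].add(doc_id)
    return index


def select_documents(index, terms):
    """Tier 1: cheap conjunctive lookup returning candidate documents."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def match_pattern(triples, patterns, binding=None):
    """Tier 2: yield variable bindings that satisfy all triple patterns (a join).

    Pattern terms starting with '?' are variables; everything else is a constant.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for p_term, t_term in zip(first, triple):
            if p_term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if candidate.setdefault(p_term, t_term) != t_term:
                    ok = False
                    break
            elif p_term != t_term:
                ok = False
                break
        if ok:
            yield from match_pattern(triples, rest, candidate)


def search(documents, index, patterns):
    """Run tier 1 on the constants of the query, then tier 2 on each candidate."""
    constants = [t for p in patterns for t in p if not t.startswith("?")]
    results = {}
    for doc_id in sorted(select_documents(index, constants)):
        matches = list(match_pattern(documents[doc_id], patterns))
        if matches:
            results[doc_id] = matches
    return results


INDEX = build_term_index(DOCUMENTS)
# "Who knows somebody named Bob?" expressed as a two-pattern join.
QUERY = [("?x", "foaf:knows", "?y"), ("?y", "foaf:name", "Bob")]
print(search(DOCUMENTS, INDEX, QUERY))
```
]]></preformat>
      <p>In this sketch only doc1 survives the first tier, so the comparatively expensive join runs over two triples instead of the whole collection.</p>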
      <p>
        Information Retrieval engines, however, are primarily designed for unstructured text
and not for graph-structured information such as RDF. Notable previous work on
Information Retrieval engines for Semantic Web data includes Semplore [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and ESTER [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which were, however, developed with goals different from those we
consider. The developers of Swoogle [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have also discussed the problem of introducing a
new search paradigm for Semantic Web resources and emphasized the importance of
combining knowledge inference with information retrieval methods.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.2 Optimising Inference at Web Scale</title>
      <p>
        Reasoning over semantically structured documents makes explicit what would
otherwise be implicit knowledge: it adds value to the information and enables an
entity-centric search engine to ultimately be much more competitive in terms of precision
and recall [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The drawback is that inference can be computationally expensive and can
therefore prevent efficient indexing.
      </p>
      <p>The novel aspect that our work covers is how to reason over semantically structured
documents that have been harvested from the Web. To reason on documents, we assume
that ontologies, which are referenced explicitly with owl:imports or implicitly by
using properties and classes of a certain namespace, are also part of the Semantic Web
as dereferenceable data, in accordance with the W3C Best Practice<sup>2</sup>. As ontologies might
refer to other ontologies, the web fetching process is recursive and should, in theory, be
repeated for each harvested document independently.</p>
      <p><sup>2</sup> Best Practice Recipes for Publishing RDF Vocabularies:
http://www.w3.org/TR/swbp-vocabpub/</p>
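      <p>The recursive fetching of referenced ontologies can be illustrated with a short sketch. It assumes a hypothetical in-memory table in place of real HTTP dereferencing, and expresses the recursion with an explicit worklist so that import cycles terminate. The URIs and function names are invented for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical stand-in for the Web: ontology URI -> URIs it imports or references.
IMPORTS = {
    "http://example.org/onto/a": ["http://example.org/onto/b",
                                  "http://example.org/onto/c"],
    "http://example.org/onto/b": ["http://example.org/onto/c"],
    "http://example.org/onto/c": ["http://example.org/onto/a"],  # import cycle
    "http://example.org/onto/d": [],
}


def fetch_imports(uri):
    """Placeholder for dereferencing `uri` over HTTP and extracting its imports."""
    return IMPORTS.get(uri, [])


def import_closure(seed_uris, fetch=fetch_imports):
    """Return every ontology reachable from the seeds, visiting each one once."""
    closure = set()
    pending = list(seed_uris)
    while pending:
        uri = pending.pop()
        if uri in closure:
            continue  # already fetched; this also protects against import cycles
        closure.add(uri)
        pending.extend(fetch(uri))
    return closure


print(sorted(import_closure(["http://example.org/onto/b"])))
```
]]></preformat>
      <p>Repeating this traversal independently for each harvested document is what makes the naive approach costly, since popular ontologies are fetched and processed again and again.</p>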
      <p>The proposed research will focus on how to maximally reuse the results of such
“web closure reasoning”, i.e. finding and exploiting the referenced ontologies, when it has
already been performed over previously indexed semantically structured documents, in order to
minimise the computational cost of indexing. We will also consider how to “keep in
quarantine” reasoning tasks and inference results in order to prevent maliciously crafted
web ontologies from altering, on a global level, the semantics of agreed ontologies
published by third parties. For example, if an ontology states that foaf:name is an inverse
functional property, an inferencing agent should not consider this axiom outside the
scope of the document that references this particular ontology.</p>
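      <p>One way to picture the combination of reuse and quarantine is the sketch below. It is an assumption-laden illustration, not the formalised engine: each ontology's axioms (here reduced to the set of properties it declares inverse functional) are computed once and cached, while the axioms applied to a document are restricted to the ontologies that this document references. The ontology URIs, document names and data structures are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical axioms published by each ontology: the properties it declares
# to be inverse functional. The second ontology redefines foaf:name.
ONTOLOGY_IFPS = {
    "http://xmlns.com/foaf/0.1/": {"foaf:mbox"},
    "http://example.org/evil": {"foaf:name"},
}

# Hypothetical harvested documents and the ontologies each one references.
DOCUMENT_ONTOLOGIES = {
    "doc_good": ["http://xmlns.com/foaf/0.1/"],
    "doc_evil": ["http://xmlns.com/foaf/0.1/", "http://example.org/evil"],
}

_cache = {}
load_count = {"n": 0}  # counts how often an ontology is actually processed


def load_ifps(ontology_uri):
    """Return the IFPs of one ontology, computing them once and caching the result."""
    if ontology_uri not in _cache:
        load_count["n"] += 1
        _cache[ontology_uri] = frozenset(ONTOLOGY_IFPS.get(ontology_uri, ()))
    return _cache[ontology_uri]


def ifps_in_scope(doc_id):
    """Axioms visible to one document: only those of ontologies it references."""
    scope = set()
    for uri in DOCUMENT_ONTOLOGIES[doc_id]:
        scope |= load_ifps(uri)
    return scope


print(sorted(ifps_in_scope("doc_good")))
print(sorted(ifps_in_scope("doc_evil")))
print(load_count["n"])
```
]]></preformat>
      <p>The cached result for the FOAF namespace is shared by both documents, yet the foaf:name axiom introduced by the second ontology never leaks into the scope of doc_good.</p>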
      <p>The coordinated use of the features offered by the IR and inference engines will be
demonstrated in the applications described in the following sections.</p>
    </sec>
    <sec id="sec-4">
      <title>2.3 Identification, Coreference Resolution and Information Merging</title>
      <p>Due to the decentralised publishing infrastructure of the Web, information about an entity is
generally spread across many sources. The identification of an entity is fundamental for
discovering complementary data sources. The use of URIs makes the identification of
an entity easier, but the Unique Name Assumption (UNA) does not hold. In theory, a single
URI uniquely identifies a resource, but it is unrealistic to assume that data publishers
can universally agree on a single identifier for each resource. Therefore, the
identification of an entity on the Semantic Web becomes uncertain, since two apparently
distinct identifiers can refer to the same entity.</p>
      <p>
        The coreference problem is well known across various research communities with
a variety of different names, such as record linkage [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], entity resolution [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
reference reconciliation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or object consolidation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A wide variety of algorithms has
been developed for resolving the coreference problem, but these are generally not
designed for Web-scale, semi-structured data. Recent initiatives within the Semantic
Web community have addressed the problem of resource identification: [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] described the
phenomenon of the proliferation and coreference of URIs, and the OKKAM project<sup>3</sup>
proposed to research an infrastructure for assigning global identifiers at Web scale.
      </p>
      <p>
        The problem of identification and coreference resolution will be a natural testbed
for the IR and inference engines that we described previously. The IR engine will make it
possible to perform a blocking pass [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] before executing complex coreference resolution and,
coupled with the inference engine, will permit more advanced reasoning than what was
possible in the Semantic Web object consolidation work described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
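      <p>A blocking pass of the kind referred to above can be sketched as follows. The sketch assumes a deliberately crude blocking key (the last token of a name) and uses a dictionary where the proposal would use an IR index lookup. The entity URIs, names and key function are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from itertools import combinations

# Hypothetical entity descriptions keyed by URI.
ENTITIES = {
    "ex:p1": {"name": "Alice Smith"},
    "ex:p2": {"name": "A. Smith"},
    "ex:p3": {"name": "Bob Jones"},
    "ex:p4": {"name": "Robert Jones"},
    "ex:p5": {"name": "Carol White"},
}


def blocking_key(description):
    """Cheap key: lower-cased last token of the name (an illustrative choice)."""
    return description["name"].lower().split()[-1]


def candidate_pairs(entities, key=blocking_key):
    """Group entities by blocking key and pair up only entities in the same block."""
    blocks = defaultdict(list)
    for uri, description in entities.items():
        blocks[key(description)].append(uri)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


all_pairs = len(list(combinations(ENTITIES, 2)))
print(all_pairs, sorted(candidate_pairs(ENTITIES)))
```
]]></preformat>
      <p>Only the pairs that share a block are handed to the more expensive coreference resolution step, which is what keeps a pairwise comparison tractable on large collections.</p>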
      <p>Clearly, coreference resolution is an important enabler for information merging.
More factors, however, have to be taken into consideration before aggregating diverse
information sources. Entity descriptions are generally produced under a certain context
(provenance, time, etc.). The descriptive information is usually a subjective view of the
entity with a certain level of reliability. Merging these descriptions can result in
inconsistent and contradictory information. In order to enable proper data integration, we
have to keep information in its context, which is naturally supported by the
document-centric storage we adopt.</p>
      <p><sup>3</sup> http://www.okkam.org/</p>
      <sec id="sec-4-1">
        <title>3 Methodology</title>
        <p>We will evaluate the methodology for searching entities by implementing a solution for
each identified problem, by integrating them into a single software platform and by
performing a qualitative evaluation of the resulting platform. In addition, we will
evaluate each solution with a dedicated corpus, as described below.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Information Retrieval Engine for Semantic Web Data</title>
      <p>A benchmark against other systems, covering index size and query response time, is planned.</p>
      <p><bold>Semantic Web Inference Engine.</bold> The evaluation of the inference engine will include
an analysis of its complexity in terms of size and response time.</p>
    </sec>
    <sec id="sec-6">
      <title>Entity Identification and Coreference Resolution</title>
      <p>The evaluation of the coreference
resolution system requires a gold standard dataset for analysing the precision of the
different algorithms.</p>
      <p>4 Achievements and Work Plan<sup>4</sup></p>
    </sec>
    <sec id="sec-7">
      <title>Information Retrieval Engine for Semantic Web Data</title>
      <p>
        We achieved, as part of the
        Sindice project [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a first prototype of the Information Retrieval engine. The
system has currently indexed more than 2 billion triples. It
enables fast lookup of URIs, keywords and Inverse Functional Properties (IFPs)
through a human interface or an HTTP API for machine access. We are currently
finishing a second prototype that enables queries of increased complexity and
semantic meaning, i.e. combining URIs and keywords and adding triple structure. We
foresee a third and final prototype capable of answering more complex queries
involving simple joins.
      </p>
      <p>
        <bold>Semantic Web Inference Engine.</bold> We have developed a prototype of an optimised
inference engine that supports inference over a subset of OWL at indexing time.
Preliminary results of this work have been published in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We have formalised an
advanced inference engine that prevents malicious users from “infecting” cached data
on a global scale. Its development is in progress.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Entity Identification and Coreference Resolution</title>
      <p>As a next step we will tackle a
coreference resolution system, based on the IR and inference engines. The first
task will be to implement a prototype for identifying entities with the help of
owl:sameAs statements and IFPs, e.g. the e-mail address of a person. The prototype will
be able to return an aggregated view of the entity information available on the Web.
The second task will be to improve the system with “pair-wise” matching
algorithms.</p>
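      <p>The first task can be illustrated with a small sketch that merges identifiers connected by owl:sameAs statements or by a shared value of an inverse functional property, using a union-find structure, and then collects an aggregated view of one merged entity. The triples, prefixes and the choice of foaf:mbox as the only IFP are assumptions made for the example, not the planned implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# Hypothetical triples harvested from different documents.
TRIPLES = [
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex2:asmith", "foaf:mbox", "mailto:alice@example.org"),
    ("ex2:asmith", "owl:sameAs", "ex3:person42"),
    ("ex:bob", "foaf:mbox", "mailto:bob@example.org"),
    ("ex3:person42", "foaf:name", "Alice Smith"),
]
# Properties assumed to be inverse functional for this illustration.
IFPS = {"foaf:mbox"}


class UnionFind:
    """Minimal disjoint-set structure over identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]  # path halving
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def consolidate(triples, ifps):
    """Group identifiers that owl:sameAs or a shared IFP value declare equivalent."""
    uf = UnionFind()
    seen = {}  # (property, value) -> first subject seen with that value
    for s, p, o in triples:
        uf.find(s)  # register every subject, even those never merged
        if p == "owl:sameAs":
            uf.union(s, o)
        elif p in ifps:
            first = seen.setdefault((p, o), s)
            uf.union(first, s)
    groups = defaultdict(set)
    for item in list(uf.parent):
        groups[uf.find(item)].add(item)
    return [sorted(group) for group in groups.values()]


def aggregated_view(triples, group):
    """Collect the property/value pairs of all identifiers in one group."""
    members = set(group)
    return sorted({(p, o) for s, p, o in triples
                   if s in members and p != "owl:sameAs"})


GROUPS = consolidate(TRIPLES, IFPS)
print(GROUPS)
print(aggregated_view(TRIPLES, GROUPS[0]))
```
]]></preformat>
      <p>Here three identifiers collapse into one entity whose aggregated view carries both the mailbox and the name, while ex:bob stays separate because it shares no IFP value with the others.</p>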
      <p><sup>4</sup> The work on the thesis formally started in February 2007.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          :
          <article-title>Modern Information Retrieval</article-title>
          . ACM Press / Addison-Wesley (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Inverted files for text search engines</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>38</volume>
          (
          <year>2006</year>
          )
          <fpage>6</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SWSE: Answers before links!</article-title>
          <source>In: Proceedings of the Semantic Web Challenge, 6th International Semantic Web Conference</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>YARS2: A federated repository for querying graph structured data from the web</article-title>
          .
          <source>In: Proceedings of the 6th International Semantic Web Conference</source>
          . (
          <year>2007</year>
          )
          <fpage>211</fpage>
          -
          <lpage>224</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Semplore: An IR approach to scalable hybrid query of semantic web data</article-title>
          .
          <source>In: Proceedings of the 6th International Semantic Web Conference</source>
          . (
          <year>2007</year>
          )
          <fpage>652</fpage>
          -
          <lpage>665</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bast</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chitea</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>ESTER: efficient search on text, entities, and relations</article-title>
          .
          <source>In: SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , New York, NY, USA, ACM (
          <year>2007</year>
          )
          <fpage>671</fpage>
          -
          <lpage>678</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mayfield</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Information retrieval on the Semantic Web: Integrating inference and retrieval</article-title>
          .
          <source>In: Proceedings of the SIGIR Workshop on the Semantic Web</source>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fellegi</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sunter</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>64</volume>
          (
          <year>1969</year>
          )
          <fpage>1183</fpage>
          -
          <lpage>1210</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Benjelloun</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Swoosh: A generic approach to entity resolution</article-title>
          .
          <source>Technical report</source>
          , Stanford University (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Reference reconciliation in complex information spaces</article-title>
          . In Özcan, F., ed.:
          <source>SIGMOD Conference</source>
          , ACM (
          <year>2005</year>
          )
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Performing object consolidation on the semantic web data graph</article-title>
          .
          <source>In: Proceedings of the WWW2007 Workshop I3: Identity, Identifiers, Identification, Entity-Centric Approaches to Information and Knowledge Management on the Web</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Jaffri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glaser</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Millard</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>URI identity management for semantic web data integration and linkage</article-title>
          .
          <source>In: 3rd International Workshop On Scalable Semantic Web Knowledge Base Systems</source>
          , Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ipeirotis</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verykios</surname>
            ,
            <given-names>V.S.</given-names>
          </string-name>
          :
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>19</volume>
          (
          <year>2007</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catasta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stenzhorn</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tummarello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Sindice.com: A document-oriented lookup index for open linked data</article-title>
          .
          <source>International Journal of Metadata, Semantics and Ontologies</source>
          <volume>3</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>