<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COLINA: A Method for Ranking SPARQL Query Results through Content and Link Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Azam Feyznia</string-name>
          <email>azam.feyznia@stu-mail.um.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohsen Kahani</string-name>
          <email>kahani@um.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fattane Zarrinkalam</string-name>
          <email>fattane.zarrinkalam@stu-mail.um.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering Department, Ferdowsi University of Mashhad</institution>
          ,
          <addr-line>Mashhad</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The growing amount of Linked Data increases the importance of semantic search engines for retrieving information. Users often examine the first few results among all returned results. Therefore, using an appropriate ranking algorithm has a great effect on user satisfaction. To the best of our knowledge, all previous methods for ranking SPARQL query results are based on popularity calculation and currently there isn't any method for calculating the relevance of results with SPARQL query. However, the proposed ranking method of this paper calculates both relevancy and popularity ranks for SPARQL query results through content and link analysis respectively. It calculates the popularity rank by generalizing PageRank method on a graph with two layers, data sources and semantic documents. It also assigns weights automatically to different semantic links. Further, the relevancy rank is based on the relevance of semantic documents with SPARQL query.</p>
      </abstract>
      <kwd-group>
        <kwd>Ranking</kwd>
        <kwd>SPARQL</kwd>
        <kwd>Semantic Search Engine</kwd>
        <kwd>Link Analysis</kwd>
        <kwd>Content Analysis</kwd>
        <kwd>Semantic Web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Structured data has enabled users to search semantic web, based on SPARQL queries.
The increasing amount of structured data on the web has led to many results to be
returned by a SPARQL query [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Further, since in most cases, all returned results equally
satisfy query conditions, checking all of them and finding the best answers takes too
much time. Therefore, the semantic web search engines whose have provided a
SPARQL endpoint for processing and running SPARQL queries on their indexed data,
require some mechanisms for ranking SPARQL query results besides the ranking
methods applied to keyword queries, to help users find their desired answers in less time.
      </p>
      <p>
        In search engines, ranking are usually done by content and link analysis and the final
rank for each result is calculated by combining scores obtained from each analysis
algorithm [
        <xref ref-type="bibr" rid="ref2 ref3">2-3</xref>
        ]. The content analysis ranking algorithms calculate the relevancy between
each result with the user query in online mode. In the link analysis ranking algorithms,
popularity calculation is done in offline mode, before the user query is received, by
constructing data graph and analyzing the existing links in it.
      </p>
      <p>
        To the best of our knowledge, all previous methods for ranking SPARQL query
results are based on popularity calculation and currently there is no method for calculating
the relevance of sub-graph results with SPARQL query. The ranking methods which
are based on link analysis, compute rank for entities of result graphs by utilizing
entitycentric data models. It is worth noting that, the results of a SPARQL query in addition
to the entities, may be made up of predicates and constant values. As a result, the
proposed algorithms by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] which are only based on entity ranking, cannot rank all
results of SPARQL queries. One of the cornerstones in ranking SPARQL query results
are language model based ranking methods [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Providing an approach for analyzing
content of structured queries such as SPARQL queries, is a significant advance which
is obtained by these methods.
      </p>
      <p>Therefore, by studying the limitations presented in existing researches and
considering specific features of SPARQL queries and results, this paper proposed a ranking
method which calculates relevancy and popularity scores through content and link
analysis respectively.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Method: COLINA</title>
      <p>We are interested in measuring how valuable the result graph is, for a given query. Our
method ranks SPARQL query results by combining content and link analysis scores of
semantic documents which results are retrieved from. In the next subsections, we
briefly describe two key components of our method.
2.1</p>
      <sec id="sec-2-1">
        <title>Offline Ranker</title>
        <p>The offline ranker calculates data popularity by applying weighted PageRank algorithm
on data graph. We first explain our data model and then reveal our scheme for weighting
semantic links.</p>
        <p>
          Data Model. In order to consider the provenance of data in our link analysis ranking,
we choose a two-layer graph including data source and semantic document layers. Data
source layer is made up of a collection of inter-connected data sources. A data source
is the source which has authority to assign URI identifier and is defined as a pay-level
domain similar to [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The semantic document layer is composed of independent graphs
of semantic documents. Each graph contains a set of internal nodes and edges.
        </p>
        <p>
          Our explanation for using document-centric data model instead of entity-centric data
model is that in response to a SPARQL query, the sub-graphs that meet query
conditions are returned as results. Depending on the number of triple patterns in query, each
sub-graph constitutes several triples. Hence, we can estimate the rank score of triples
by the rank score of documents which are appeared in them. The document graph was
constructed by extracting explicit and implicit links between semantic documents
according to [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Weighting mechanism. We categorized links in two classes based on their labels, but
not their frequency: specific and general links. In semantic web, links are semantically
different and so they have different importance. Our method for measuring the
importance of link labels goes beyond just measuring the frequency of labels by also
taking these categories into account. We first determine which category the link label
belongs to, then we use different frequency based measurements. The intuition behind
this idea is that general and common link labels such as owl:sameAs, which convey
high importance, get high weight. On the other hand, specific link labels, that hold much
information based on information-theory, get high weight too. This way we can
consider the importance of common link labels and also maintain the importance of specific
link labels. In this paper, we exploit a hierarchical approach to separate the link labels
that are between data sources. From this point of view, the link label that is defined for
a particular class is considered general for all of its subclasses. Hence each data source
is a subclass of owl:thing, we can derive general labels through extracting link labels
which rdfs:domain of them is defined owl:thing by Virtuoso1.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Online Ranker</title>
        <p>Unlike keyword-based queries which are collection of words that are specified by users,
each triple pattern in SPARQL queries has two arguments: the bound arguments which
are labeled by users and the unbound arguments which are variable. We can measure
the relevancy of document, based on bound and unbound query arguments as follows:
Sq (doc) = β rq (doc) + (1 − β ) rr (doc)
(1)
where and denotes the relevancy score of a document with respect
to unlabeled arguments in query and produced answer, respectively. Parameter set
empirically to a calibrated global value.</p>
        <p>For example, assuming that “Bob a Physicist” is an answer for “?x a Physicist”. If
this triple appears in a document which is exclusively about physicist or Bob, it is more
relevant than when it is included in a document which is about anything else. This
example highlights our justification for using both bound and unbound arguments in the
relevance calculation for documents.</p>
        <p>Since the computing value for and depends on query formulation,
we need to deal with possible forms of triple patterns. For this, we define ACDT and
QCDT functions for estimating and respectively.</p>
        <p>
          The ACDT is Answer Container Document’s Triples. In short, it computes
frequency of a result in semantic documents with respect to the position of unbound
arguments in intended triple pattern. The QCDT is Query Container Document’s Triples.
Similarly, it computes frequency of a query in semantic documents with respect to the
position of bound arguments in intended triple pattern. The basic idea for ACDT and
QCDT is derived from TF Scheme in information retrieval.
1 http://lod.openlinksw.com/sparql
We combine relevancy score and popularity score in order to compute final score
for document. Since the foundation of our ranking algorithms is similar to algorithms
presented in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we use his method for combining scores of our algorithms.
4
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper we presented a method for ranking SPARQL query results based on
content and link analysis, which can be used as ranking component in semantic web search
engines. In our method, the rank of triples that constitute the result graphs are
approximated by the rank score of semantic documents which expressed them. We introduced
a two-layer data model and proposed a novel link weighting mechanism based on
separation of link labels incorporating the notion frequency of labels in a convenient
manner. Our content analysis ranking algorithm provides an approach to compute the
relevancy of results with respect to the bound and unbound arguments in intended SPARQL
query. We believe that using content analysis ranking in combination with link analysis
ranking which is powered by our data model and weighting mechanism, can improve
accuracy of ranking algorithm for SPARQL query results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khamis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Biedert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdennadher</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          .
          <article-title>Collecting links between entities ranked by human association strengths</article-title>
          .
          <source>In Proceedings of ESWC-13</source>
          , pages
          <fpage>517</fpage>
          -
          <lpage>531</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Delbru</surname>
          </string-name>
          .
          <article-title>Searching Web Data: an Entity Retrieval Model</article-title>
          .
          <source>Ph.D. Thesis</source>
          , National University of Ireland, Ireland,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kinsella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine</article-title>
          .
          <source>Journal of web semantics</source>
          , pages
          <fpage>365</fpage>
          -
          <lpage>401</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mulay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          . SPRING:
          <article-title>Ranking the results of SPARQL queries on Linked Data</article-title>
          .
          <source>17th International Conference on Management of Data COMAD</source>
          , Bangalore, India,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Buikstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Neth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schooler</surname>
          </string-name>
          , A. ten
          <string-name>
            <surname>Teije</surname>
            ,
            <given-names>F. van. Harmelen.</given-names>
          </string-name>
          <article-title>Ranking query results from linked open data using a simple cognitive heuristic</article-title>
          .
          <source>In Workshop on discovering meaning on the go in large heterogeneous data 2011 (LHD-11)</source>
          , Twenty-second
          <source>International Joint Conference on Articial Intelligence (IJCAI-11)</source>
          , Barcelona, Spain,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , G. Ifrim,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ramanath</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Weikum.</surname>
          </string-name>
          <article-title>NAGA: Searching and Ranking Knowledge</article-title>
          .
          <source>In 24th International Conference on Data Engineering (ICDE</source>
          <year>2008</year>
          ). IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Feyznia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kahani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramezani</surname>
          </string-name>
          .
          <article-title>A Link Analysis Based Ranking Algorithm for Semantic Web Documents</article-title>
          .
          <source>In 6th Conference on Information and Knowledge (IKT</source>
          <year>2014</year>
          ), Shahrood, Iran,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>