<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating Structured Profiles of Linked Data Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Besnik Fetahu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Dietze</string-name>
          <email>dietze@L3S.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Pereira Nunes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Taibi</string-name>
          <email>davide.taibi@itd.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Antonio Casanova</string-name>
          <email>casanova@inf.puc-rio.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics - PUC-Rio - Rio de Janeiro</institution>
          ,
          <addr-line>RJ -</addr-line>
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Italian National Research Council, Institute for Educational Technologies</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>L3S Research Center, Leibniz University Hanover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>While an increasingly large amount of Linked Data is available, metadata about the content covered by individual datasets is sparse. In this paper, we introduce a processing pipeline to automatically assess, annotate and index available linked datasets. Given a minimal description of a dataset from the DataHub, the process produces a structured RDF-based description that includes information about its main topics. Additionally, the generated descriptions embed datasets into an interlinked graph of datasets based on shared topic vocabularies. We adopt and integrate techniques for Named Entity Recognition and automated data validation, providing a consistent workflow for dataset profiling and annotation. Finally, we validate the results obtained with our tool.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Annotation</kwd>
        <kwd>Datasets</kwd>
        <kwd>Metadata</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The emergence of the Web of Data, in particular Linked Data [<xref ref-type="bibr" rid="ref1">1</xref>], has led to
a vast amount of data being available on the Web. The DataHub<sup>1</sup>, which serves
as the central registry for open Web data, currently contains over 6000 datasets,
338 of which are (at the time of writing) part of the Linked Open Data group<sup>2</sup>.
      </p>
      <p>While datasets are highly heterogeneous with respect to represented resource
types, currentness, quality or topic coverage, only brief and insufficient
structured information about datasets is available. In the case of the DataHub, only
simple tags, a few structured metadata fields about the size, endpoints or used schemas,
and a brief textual description are available. This makes it significantly harder
for data consumers (e.g. educational service providers or developers) to identify
useful and trustworthy data for different scenarios.</p>
      <p>
        Earlier works address related issues [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>], such as schema
alignment and extraction of shared resource annotations across datasets. However,
they do not yet facilitate the extraction of reliable dataset metadata with respect
to represented topics. To address these limitations, we present an
approach that automatically and incrementally indexes datasets by interlinking and
annotating arbitrary datasets with relevant topics in the form of DBpedia entities
and categories. By incrementally computing topic relevance scores for individual
datasets, we gradually create a knowledge base of dataset meta-information. To
improve scalability, the process exploits representative sample sets of resources.
Moreover, to ensure high annotation accuracy, a semi-automated evaluation
approach is proposed.
      </p>
      <p><sup>1</sup> http://www.datahub.io</p>
      <p><sup>2</sup> http://datahub.io/group/lodcloud</p>
    </sec>
    <sec id="sec-2">
      <title>2 Semi-Automatic Dataset Annotation</title>
      <p>Our dataset profiling platform automatically extracts top-ranked topic
annotations (DBpedia categories) and captures these together with a relevance score for
each dataset description. All dataset descriptions are captured using the VoID
schema<sup>3</sup>.</p>
      <sec id="sec-2-1">
        <title>2.1 Entity Recognition</title>
        <p>The analysis of sampled resources for a set of datasets consists of an
annotation process using Named Entity Recognition (NER) and
disambiguation tools (DBpedia Spotlight<sup>4</sup>). From each resource we extract the textual
content assigned to the following properties: {rdfs:label, rdfs:comment,
teach:courseTitle, teach:courseDescription, skos:prefLabel,
dcterms:description, dcterms:alternative, dcterms:title, bibo:abstract,
bibo:body, cnrb:titolo, cnrd:descrizione, foaf:name, rdf:value}; and perform
contextual, that is resource-wise, NER. This establishes a common descriptive
layer of top-ranked DBpedia entities for each dataset.</p>
        <p>As the NER process can pose a bottleneck, we introduce an incremental
annotation extraction process to alleviate this issue. This process avoids annotating
resources similar to previously annotated ones by reusing already obtained
annotations. Thus, for a predefined similarity threshold τ, from a pool of existing
annotations A, we assign an annotation to a resource if the resource-annotation
similarity computed by Jaccard's index is above the threshold τ:
          <disp-formula id="eq-1"><label>(1)</label><tex-math>\forall a \in A : J(r, a) = \frac{|r \cap a|}{|r \cup a|}</tex-math></disp-formula>
where a ∈ A represents already extracted annotations, while r is a resource
instance which is analysed using the incremental annotation process.</p>
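        <p>The reuse step can be sketched as follows. This is a minimal illustration, not the platform's implementation: it assumes that resources and pooled annotations are compared as sets of lower-cased word tokens, and the names <monospace>tokens</monospace>, <monospace>jaccard</monospace> and <monospace>reuse_annotations</monospace>, the pool layout and the example threshold of 0.5 are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens of a text (an assumed representation)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(r: Set[str], a: Set[str]) -> float:
    """Jaccard index: size of the intersection over size of the union.
    Defined here as 0.0 when both sets are empty."""
    union = r | a
    return len(r & a) / len(union) if union else 0.0


def reuse_annotations(resource_text: str,
                      pool: Dict[str, Set[str]],
                      threshold: float) -> List[str]:
    """Return pooled annotations whose similarity to the resource exceeds
    the threshold; an empty result means the resource still needs NER."""
    r = tokens(resource_text)
    return [uri for uri, a in pool.items() if jaccard(r, a) > threshold]


# Hypothetical pool: annotation URI -> tokens of the text it came from.
pool = {
    "dbpedia:Linked_data": tokens("linked data on the web"),
    "dbpedia:Chemistry": tokens("organic chemistry course"),
}
print(reuse_annotations("Linked Data on the Semantic Web", pool, 0.5))
```
]]></preformat>
        <p>Only resources for which no pooled annotation passes the threshold would then be sent to the NER service, which is where the saving comes from.</p>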
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Category Annotation</title>
        <p>From the extracted annotations (DBpedia entities) A, we analyse the set of
categories assigned to each annotation. Such information is extracted from the
DBpedia graph via the property dcterms:subject, which represents the topic
covered by an entity. Furthermore, we leverage the hierarchical category
organisation (as defined by the SKOS schema: skos:broader and skos:related) assigned
to entities within DBpedia.</p>
        <p><sup>3</sup> http://www.w3.org/TR/void/</p>
        <p><sup>4</sup> http://spotlight.dbpedia.org</p>
        <p>However, the extracted category information is only useful when
categories are ranked according to their relevance for each dataset. Hence, we compute a
normalised relevance score for each category assigned to a dataset by taking into account
(i) the entities assigned to a category within a dataset and across datasets; and (ii) the
number of entities assigned to a dataset and over all datasets, see Equation 2:
          <disp-formula id="eq-2"><label>(2)</label><tex-math>\mathrm{score}(t) = \frac{\Phi(t, D)}{\Phi(\cdot, D)} + \frac{\Phi(t, \cdot)}{\Phi(\cdot, \cdot)}, \quad \forall t \in T \wedge D \in \mathbb{D}</tex-math></disp-formula>
where Φ(t, D) represents the number of entities associated with a topic t in
a dataset D; with a void argument (·), it returns the number of entities in a
dataset or over all datasets.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Automated Annotation Validation &amp; Filtering Approach</title>
        <p>Validation and filtering of extracted annotations is necessary due to noise
inherited from the NER and named entity disambiguation (NED) results. The approach
we propose for filtering out noisy annotations takes into account the contextual support
given to an annotation by the resource instance it is extracted from. Therefore, we compute
a confidence score which measures the similarity between an annotation and a resource
using Jaccard's index, similar to Equation 1. The score is based on the values of the
properties dbpedia-owl:abstract and rdfs:comment for the annotation, and on the values
of the analysed properties listed in Section 2.1 for the resource.</p>
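        <p>A minimal sketch of such a confidence filter is given below. It assumes a token-set representation of the texts; the function names, the candidate layout and the sample texts are hypothetical, and the threshold of 0.15 is the value reported in the evaluation.</p>
        <preformat><![CDATA[
```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens of a text (an assumed representation)."""
    return set(re.findall(r"\w+", text.lower()))


def confidence(resource_text: str, entity_text: str) -> float:
    """Jaccard overlap between a resource's text and an entity description."""
    r, e = tokens(resource_text), tokens(entity_text)
    union = r | e
    return len(r & e) / len(union) if union else 0.0


def filter_annotations(resource_text: str,
                       candidates: Dict[str, str],
                       threshold: float) -> List[Tuple[str, float]]:
    """Keep (entity, score) pairs scoring above the threshold, best first."""
    scored = [(uri, confidence(resource_text, text))
              for uri, text in candidates.items()]
    kept = [(uri, s) for uri, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates: entity URI -> its abstract or comment text.
candidates = {
    "dbpedia:Semantic_Web": "The Semantic Web is an extension of the Web",
    "dbpedia:Java": "Java is an island of Indonesia",
}
resource = "An introduction to the Semantic Web and linked data"
for uri, score in filter_annotations(resource, candidates, 0.15):
    print(uri, round(score, 2))
```
]]></preformat>
        <p>In this toy input the second candidate shares only a single common word with the resource text and is dropped, which is the kind of noisy annotation the filter is meant to remove.</p>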
        <p>In the validation phase, we consider only entities that have a confidence
score above a predefined threshold and use human evaluators to assess
the relevance of an extracted annotation with respect to the resource context.</p>
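        <p>Returning to the category ranking of Section 2.2, the relevance score of Equation 2 can be sketched as below. The sketch follows the textual description of the equation (entity counts per category and per dataset, with a void argument meaning "no restriction"); the in-memory layout, the names <monospace>entity_count</monospace> and <monospace>score</monospace>, and the sample data are assumptions, and every dataset is assumed to contain at least one entity.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Optional, Set

# dataset -> entity -> categories; an assumed in-memory layout.
Annotations = Dict[str, Dict[str, Set[str]]]


def entity_count(data: Annotations,
                 topic: Optional[str] = None,
                 dataset: Optional[str] = None) -> int:
    """Count entities, optionally restricted to a topic and/or a dataset.
    A None argument plays the role of the void argument in Equation 2."""
    selected = [dataset] if dataset is not None else list(data)
    return sum(1
               for d in selected
               for cats in data[d].values()
               if topic is None or topic in cats)


def score(data: Annotations, topic: str, dataset: str) -> float:
    """Share of the topic within the dataset plus its share overall."""
    local = entity_count(data, topic, dataset) / entity_count(data, None, dataset)
    overall = entity_count(data, topic, None) / entity_count(data, None, None)
    return local + overall


# Hypothetical annotations for two datasets.
data = {
    "dataset_a": {"e1": {"Category:Education"}, "e2": {"Category:Physics"}},
    "dataset_b": {"e3": {"Category:Education"}, "e4": {"Category:Education"}},
}
print(score(data, "Category:Education", "dataset_a"))  # 1/2 + 3/4 = 1.25
```
]]></preformat>
        <p>Sorting the categories of a dataset by this score and keeping the first k yields a ranked category list for that dataset.</p>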
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results and Evaluation</title>
      <p>Our current implementation focuses on educationally relevant datasets as
collected in a dedicated group on the DataHub<sup>5</sup>, from which we selected a subset of
17 datasets based on their accessibility. Our topic annotation used representative,
randomly selected samples of resources from each dataset, with approximately
100 instances for each resource type. Steps included NER, category extraction
and threshold-based filtering using our relevance and confidence scores.</p>
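      <p>The sampling step could look like the following sketch, which draws up to 100 random instances per resource type. The function name, the (resource, type) pair input and the fixed seed are assumptions made for illustration; how resources are actually retrieved from each dataset is not covered here.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_per_type(resources: Iterable[Tuple[str, str]],
                    per_type: int = 100,
                    seed: int = 0) -> Dict[str, List[str]]:
    """Randomly pick up to `per_type` resource URIs for each resource type.

    `resources` yields (resource URI, type URI) pairs; the fixed seed only
    makes the sketch reproducible."""
    by_type = defaultdict(list)
    for uri, rdf_type in resources:
        by_type[rdf_type].append(uri)
    rng = random.Random(seed)
    return {t: rng.sample(uris, min(per_type, len(uris)))
            for t, uris in by_type.items()}


# Hypothetical dataset with two resource types.
resources = [(f"ex:course{i}", "ex:Course") for i in range(250)]
resources += [(f"ex:person{i}", "ex:Person") for i in range(40)]
sample = sample_per_type(resources)
print({t: len(uris) for t, uris in sample.items()})
```
]]></preformat>
      <p>Types with fewer instances than the sample size are kept in full, so small resource types are not lost from the profile.</p>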
      <p>From the categories extracted for the resulting annotations, we
retained only the top-50 categories per dataset, that is, the most representative ones
according to the computed normalised score. The results of this processing
are stored in a VoID<sup>6</sup>-based dataset catalogue, currently provided as
part of the LinkedUp project<sup>7</sup>; the catalogue can be accessed online<sup>8</sup>.</p>
      <p>Annotation accuracy was evaluated on two sets of annotations:
(a) annotations without any filtering (see Section 2.3); and (b)
annotations after filtering, where only annotations with scores above a
threshold (in our case 0.15) are considered. The accuracy was measured for
1000 extracted annotations, picked randomly from A. For (a) the accuracy was
71%, whereas for (b), after filtering out annotations below the threshold of 0.15, we
observed an increase in accuracy of almost 10%.</p>
      <p><sup>5</sup> http://datahub.io/groups/linkededucation</p>
      <p><sup>6</sup> http://www.w3.org/TR/void/</p>
      <p><sup>7</sup> http://www.linkedup-project.eu</p>
      <p><sup>8</sup> http://data.linkededucation.org</p>
      <p>Our demo application<sup>9</sup> focuses mainly on representation, profiling and search
functionalities for the analysed datasets based on the structured descriptions.
Figure 1 shows a screenshot of the exploratory search functionality for datasets using
extracted annotations and categories. The user interface provides the following:
        <list list-type="bullet">
          <list-item><p>Exploratory search of datasets based on extracted annotations &amp; categories</p></list-item>
          <list-item><p>Interlinking of datasets based on most representative categories</p></list-item>
          <list-item><p>List of ranked categories for each dataset</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Future Work</title>
      <p>Our current processing pipeline is able to extract topic annotations for arbitrary
Linked Data with only minimal manual intervention. Having applied it to a small
subset of available datasets, we aim in future work at the automatic profiling of all
available LOD datasets, towards providing a more descriptive catalogue of Linked
Datasets.</p>
      <p>Acknowledgements. This work was partly funded by the LinkedUp (GA
No:317620) and DURAARK (GA No:600908) projects under the FP7
programme of the European Commission.</p>
      <p><sup>9</sup> http://l3s.de/~fetahu/iswc_demo/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Adamou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          .
          <article-title>Assessing the educational linked data landscape</article-title>
          .
          <source>In WebSci</source>
          , pages
          <fpage>43</fpage>
          –
          <lpage>46</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fetahu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          .
          <article-title>Towards integration of web data into a coherent educational data graph</article-title>
          .
          <source>In WWW (Companion Volume)</source>
          , pages
          <fpage>419</fpage>
          –
          <lpage>424</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>