<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Open Data to Linked Open Data with ODMiner</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Poggi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Giovanni Nuzzolese</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Cigna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DASPLab, Department of Computer Science and Engineering, University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>STLab, Institute of Cognitive Science and Technologies, National Research Council</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we introduce ODMiner, an automatic tool that enhances open datasets provided in heterogeneous structured formats (e.g. JSON, CSV, XML) to Linked Open Data. ODMiner mines Open Data (OD) by recognising well-known data types and formats (e.g. dates, emails, currencies) and by exploiting well-known open linked datasets and vocabularies (e.g. DBpedia, WordNet) in order to extract named entities and relations between the open dataset elements. ODMiner is designed as a modular and extensible software architecture, and its process can be customised to address specific needs of final data representation and modelling. Finally, an evaluation of ODMiner on heterogeneous multi-language OD datasets is provided to give evidence of its practical effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Over the last few years, the volume of Open Data (OD) published on the Web
has grown considerably, attracting the joint interest of public institutions, private
companies and citizens. Unfortunately, OD consumers still face a
variety of challenges when trying to access, understand or use OD in
order to develop innovative services that solve real-world problems on top
of them. In particular, manual effort is still required for analysing and
mining open datasets. For example, in most cases it is hard to
understand their content and context, to extract information and transform
it into a machine-understandable format, or to filter the elements of interest.
This limits the exploitation of OD and seriously threatens their actual
reuse. In this paper we introduce ODMiner, an automatic tool that enhances
open datasets provided in heterogeneous structured formats (e.g. JSON, CSV,
XML) to Linked Open Data. ODMiner mines OD by recognising well-known
data types and formats (e.g. dates, emails, currencies) and by
exploiting well-known open linked datasets and vocabularies (e.g. DBpedia,
WordNet) in order to extract named entities and relations between the
open dataset elements.</p>
      <p>We mention only two of the possible applications of these results. First
of all, the linking of OD datasets to popular open linked datasets and
vocabularies can be used to automate the process of OD triplification and
semantic enrichment, and their consequent use in Semantic Web-based
applications. Another useful application we envision is using this information
for automatic OD visualization. By leveraging rich OD descriptions we could
automatically generate, for instance, visual analytics tools and
panels that users can employ to explore, analyse and make sense of OD
datasets.</p>
      <p>ODMiner is designed as a modular and extensible software architecture,
and its process can be customised to address specific needs of
final data representation and modelling. Finally, an evaluation of ODMiner
on heterogeneous multi-language OD datasets is provided to give
evidence of its practical effectiveness. The paper is structured as
follows: (i) Section 2 reports the related work; (ii) Section 3 describes our
system, i.e. ODMiner; (iii) Section 4 describes the evaluation we carried out
to assess the effectiveness of ODMiner in the task of data type recognition;
(iv) finally, Section 5 provides the conclusions and possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
        In recent years a lot of research has been carried out to populate the
Semantic Web with Linked Open Data. Most approaches rely on fixed solutions,
or solutions with limited customisability, to convert structured sources to
Linked Open Data. Tools worth mentioning are the D2R Server [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
Krextor [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. D2R is a tool for publishing relational databases on the
Semantic Web. It enables RDF and HTML browsers to navigate the content of a
database, and allows other applications to query a database through SPARQL.
D2R Server uses the D2RQ Mapping Language to map the content of a
relational database to RDF. A D2RQ mapping rule specifies how to assign URIs
to resources, and which properties are used to describe them. Krextor is an
extensible XSLT-based framework for extracting RDF from XML, supporting
multiple input languages as well as multiple output RDF notations.
      </p>
      <p>
        On the other hand, mining meaningful information from structured as
well as semi-structured data sources (e.g. CSV, XML or JSON) remains a
challenging task. In addition to OD quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], one of the main problems
that limit OD reuse is the lack of tools offering a nearly one-click process
to mine relevant information from OD formats and effectively enhance them to
Linked Open Data. As a matter of fact, most state-of-the-art
systems rely on machine learning platforms such as Weka [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or
RapidMiner (https://rapidminer.com/). A relevant example is [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which introduces the RapidMiner Linked
Open Data extension. This extension hooks into the RapidMiner data mining
platform and offers operators for accessing Linked Open Data
in RapidMiner, so that it can be used in sophisticated data analysis workflows
without the need to know SPARQL or RDF.
      </p>
      <p>
        Another approach is using a tool such as Wrangler [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or OpenRefine [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
to mine, clean and transform semi-structured (e.g. tabular) OD datasets
to Linked Data. The goal of these tools is to help users
analyse OD and enhance them to LOD. Though very effective, the
conversion provided by such tools is not completely automatic and still requires
human intervention.
      </p>
      <p>To the best of our knowledge, ODMiner is a novel and extensible solution
to support the generation of Linked Open Data. It provides an HTTP REST
interface that makes it possible to automatically mine the data types from Open Data
available in heterogeneous formats such as XML, JSON and CSV in a nearly
one-click process.</p>
    </sec>
    <sec id="sec-3">
      <title>3 ODMiner</title>
      <p>ODMiner is a web-based tool aimed at analysing and making sense of
heterogeneous OD datasets. In particular, ODMiner takes as input the URL of
an OD dataset (JSON, XML, CSV and TSV are the data formats currently
supported), analyses its content, and infers which kind of data each field of the
dataset is an instance of. For example, given the tabular dataset in Figure 1,
ODMiner maps each column to a meaningful class (the red text at the top of
the table) of a well-known ontology.</p>
      <p>ODMiner has a modular and extensible architecture. It provides a set of
software modules, each of which is responsible for recognising whether the data
fields are instances of a specific class. Some of the basic classes
identified by ODMiner are listed in Table 1; the list was derived from
the analysis of the set of Italian datasets described in Table 2. Other software
modules can be implemented and loaded into the ODMiner engine,
extending its analysis capabilities (e.g. extending the set of managed ontologies to
which input data can be mapped).</p>
      <p>
        ODMiner's main engine is responsible for coordinating the analyses
performed by its software modules. The overall process can be summarised as
follows. First, ODMiner splits the dataset fields into homogeneous partitions.
For example, the fields of tabular datasets are grouped by columns. Then,
each module analyses the fields of each group separately. The kind of
analysis differs from module to module: for example, the module responsible for
recognising dates performs a simple pattern matching test, while a more
sophisticated one such as the place analyser computes the Levenshtein distance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
to estimate the similarity of data with DBpedia labels. Finally, the main
engine uses confidence thresholds to ensure the statistical relevance of
each analysis and, consequently, the accuracy of the results. The principle
underlying ODMiner's logic is to leverage the homogeneity of structured
or semi-structured datasets (e.g. of all the elements of a column in a tabular
dataset) to infer meaningful information about their content. ODMiner is
available online at http://eelst.cs.unibo.it:8099/detector/api/url.
      </p>
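      <p>The analysis loop described above can be sketched as follows. This is only an illustrative sketch, not ODMiner's actual implementation: the function names, the two date patterns and the 0.8 threshold are assumptions introduced for the example, and the Levenshtein helper is the textbook dynamic-programming formulation rather than the place analyser itself.</p>
      <preformat><![CDATA[
```python
import re
from typing import Dict, List

# Hypothetical date patterns (yyyy-mm-dd or dd/mm/yyyy); the patterns
# actually used by ODMiner are not given in the paper.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})$")


def partition_by_column(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group the fields of a tabular dataset into one partition per column."""
    columns: Dict[str, List[str]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def date_confidence(values: List[str]) -> float:
    """Fraction of non-empty values that match the date pattern."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if DATE_RE.match(v))
    return hits / len(cleaned)


def detect_date_columns(rows: List[Dict[str, str]], threshold: float = 0.8) -> List[str]:
    """Names of the columns whose date confidence reaches the threshold."""
    columns = partition_by_column(rows)
    return [name for name, values in columns.items()
            if date_confidence(values) >= threshold]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def label_similarity(value: str, label: str) -> float:
    """Case-insensitive similarity in [0, 1] derived from the edit distance."""
    longest = max(len(value), len(label))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(value.lower(), label.lower()) / longest
```
]]></preformat>
      <p>In this sketch a misspelt value such as "Bolonga" compared with the label "Bologna" has an edit distance of 2 and a similarity of about 0.71, so whether it counts as a match depends entirely on the threshold chosen for the module.</p>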
      <p>ODMiner also provides users with a means to configure the analysis modules
and define custom data analysers. For example, a user may be interested in
recognising some specific types of time-related data (e.g. dates, times,
durations, intervals). To do so, the user invokes ODMiner with a
configuration file stating that an analysis based on a
regular expression search should be included, together with the regular expression to test and
the ontological class that should be associated with the matching data.</p>
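      <p>A regular-expression analyser driven by a configuration file could look like the sketch below. The configuration keys (<monospace>analysers</monospace>, <monospace>pattern</monospace>, <monospace>class</monospace>, <monospace>threshold</monospace>), the example.org class URI and the function names are all hypothetical: they are invented for illustration and do not reproduce ODMiner's real configuration format.</p>
      <preformat><![CDATA[
```python
import json
import re
from typing import Dict, List

# Hypothetical configuration: key names and the class URI are invented
# for this sketch and are not ODMiner's documented format.
CONFIG_JSON = """
{
  "analysers": [
    {
      "type": "regex",
      "pattern": "^([01][0-9]|2[0-3]):[0-5][0-9]$",
      "class": "http://example.org/odm#TimeOfDay",
      "threshold": 0.9
    }
  ]
}
"""


def load_regex_analysers(config_text: str) -> List[Dict]:
    """Compile every regex analyser declared in the configuration."""
    config = json.loads(config_text)
    analysers = []
    for entry in config["analysers"]:
        if entry["type"] != "regex":
            continue  # other analyser types are out of scope here
        analysers.append({
            "regex": re.compile(entry["pattern"]),
            "class": entry["class"],
            "threshold": entry["threshold"],
        })
    return analysers


def classify_column(values: List[str], analysers: List[Dict]) -> List[str]:
    """Classes whose regex matches a large enough share of the values."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return []
    matched_classes = []
    for analyser in analysers:
        hits = sum(1 for v in cleaned if analyser["regex"].search(v))
        if hits / len(cleaned) >= analyser["threshold"]:
            matched_classes.append(analyser["class"])
    return matched_classes
```
]]></preformat>
      <p>Keeping the pattern, the target class and the threshold together in one entry means that a new kind of time-related data can be supported by adding a configuration entry rather than by writing a new module.</p>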
      <p>Another configurable ODMiner module makes it possible to identify data values that
are instances of specific DBpedia classes. The basic idea is that ODMiner
compares each field of the dataset (e.g. the cells of a column in a table) to
the labels associated with DBpedia entities. When a match is found, ODMiner
collects all the classes those entities are instances of. Each match is
considered by ODMiner as a vote for that class. At the end of the analysis, all
the classes that have not obtained a minimum number of votes (i.e. a
configurable quorum threshold set by the user) are discarded. Among the
remaining classes, the majority principle is applied, and the class with the highest number
of matches is chosen. Users can specify the DBpedia classes they are interested in,
together with the number of records to analyse and the
confidence threshold (i.e. the quorum). As discussed in Section 4, this
principle has proven to be effective for the datasets and the classes taken into
account in our evaluation. The description of the configuration format, together with all the other
information needed to download, compile and run ODMiner, is available at
https://bitbucket.org/Cigna/semanticdetector.</p>
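      <p>The quorum-and-majority principle can be illustrated with the sketch below. It makes several assumptions that are not stated in the paper: DBpedia is replaced by a tiny in-memory label index, the quorum is interpreted as a fraction of the analysed records, and ties are broken alphabetically. The names <monospace>LABEL_INDEX</monospace> and <monospace>vote_for_class</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List, Optional, Set

# Toy stand-in for a DBpedia label lookup (lower-cased label to classes).
# A real analyser would query DBpedia; these entries are illustrative only.
LABEL_INDEX: Dict[str, Set[str]] = {
    "bologna": {"dbo:City", "dbo:Place"},
    "rome": {"dbo:City", "dbo:Place"},
    "milan": {"dbo:City", "dbo:Place"},
    "po": {"dbo:River", "dbo:Place"},
}


def vote_for_class(values: List[str],
                   label_index: Dict[str, Set[str]],
                   candidate_classes: Set[str],
                   quorum: float,
                   max_records: Optional[int] = None) -> Optional[str]:
    """Pick the candidate class with the most votes, if it reaches the quorum.

    quorum is assumed to be a fraction (0..1) of the analysed records.
    """
    sample = values if max_records is None else values[:max_records]
    if not sample:
        return None

    # Each label match is one vote for every candidate class of the entity.
    votes: Counter = Counter()
    for value in sample:
        for cls in label_index.get(value.strip().lower(), set()):
            if cls in candidate_classes:
                votes[cls] += 1

    # Discard the classes that do not reach the minimum number of votes.
    min_votes = math.ceil(quorum * len(sample))
    eligible = {cls: n for cls, n in votes.items() if n >= min_votes}
    if not eligible:
        return None

    # Majority principle; sorting first makes tie-breaking deterministic.
    return max(sorted(eligible), key=eligible.get)
```
]]></preformat>
      <p>With the five values "Bologna", "Rome", "Po", "Milan" and "n/a" and the candidate classes dbo:City and dbo:River, a quorum of 0.5 selects dbo:City (three votes out of five), whereas a quorum of 0.8 leaves the column unclassified.</p>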
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation</title>
      <p>We designed an experiment to assess the capability of ODMiner to enhance
Open Datasets by effectively recognising the correct types associated with
data. We randomly selected 16 datasets from Italian Open Data archives
and from some US data portals. Table 2 reports the full list of datasets
used for the evaluation of ODMiner. Each row of the table identifies an open
dataset and specifies the topic of the dataset and the URL that can be used
to retrieve it. The datasets were selected according to the
following rationale: (i) homogeneous distribution of heterogeneous formats (e.g.
CSV, XML); (ii) multilinguality of datasets, i.e. we did not focus on a
single language (e.g. Italian); (iii) non-negligible size of source data; (iv)
heterogeneity of topics.</p>
      <p>Table 2. Topics, source URLs, sizes (KB) and formats of the open datasets used for the
evaluation of ODMiner. [Table body not recoverable from the source.]</p>
      <p>We constructed a ground truth by manually annotating the correct data
type for each field (e.g. a column of a CSV table) of the datasets presented
in Table 2 (the Italian Open Data archives used are http://www.dati.gov.it/
and http://85.18.173.117/). The ground truth is available online as CSV for consultation
at http://eelst.cs.unibo.it/ld4ie/goldenstandard.
We then used ODMiner to mine data types from the same datasets in
order to compare its output with the ground truth in
terms of precision, recall and F-measure. We set up ODMiner with the
configuration provided in Table 3. This configuration was provided to ODMiner
as a JSON file and allowed us to specify the confidence threshold and the
matching criteria for data type recognition. The confidence thresholds
reported in Table 3 are expressed in a range between 0 and 1, and were
selected through an empirical analysis, looking for a good trade-off
between the accuracy of the results and the analysis duration. The meaning
associated with the matching criteria is the following: (i) RegEx is a regular
expression that is used to perform the matching; (ii) pattern matching
identifies a set of cases or patterns from a list that are used to test whether a set of
data matches those cases; (iii) Exact and Contain-based matches on DBpedia
allow ODMiner to perform the classic “exact match” and “contains”
operations on the values (either literals or URIs) coming from a Linked Dataset
(e.g. DBpedia), which is used as background knowledge.</p>
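      <p>The comparison against the ground truth can be sketched with the standard definitions of precision, recall and F-measure computed over fields. The paper does not state exactly how the counts were taken, so the choices below (a field counts as a true positive only when the predicted class equals the annotated one, and fields left unclassified lower recall but not precision) are assumptions, as is the function name.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Optional, Tuple


def precision_recall_f1(predicted: Dict[str, Optional[str]],
                        gold: Dict[str, str]) -> Tuple[float, float, float]:
    """Field-level precision, recall and F-measure.

    predicted maps a field name to the class chosen by the tool, or None
    when the field was left unclassified; gold maps it to the annotated class.
    """
    # A prediction is correct only if it equals the gold annotation.
    true_positives = sum(
        1 for field, cls in predicted.items()
        if cls is not None and gold.get(field) == cls
    )
    n_predicted = sum(1 for cls in predicted.values() if cls is not None)
    n_gold = len(gold)

    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
]]></preformat>
      <p>For instance, with four annotated fields of which three receive a prediction and two of those are correct, precision is 2/3, recall is 1/2 and the F-measure is 4/7.</p>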
      <p>Finally, we ran the experiment by configuring ODMiner to recognise data
types by using only a small slice (i.e. 20%) of each dataset. The
resources forming the slice were picked at random. Figure 2 shows the final
results we recorded.</p>
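      <p>Drawing a random slice of a dataset can be sketched as below. The function name, the rounding rule and the optional seed are assumptions made for the example; the paper only states that 20% of the resources were picked at random.</p>
      <preformat><![CDATA[
```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def sample_slice(records: List[T], fraction: float = 0.2,
                 seed: Optional[int] = None) -> List[T]:
    """Return a random subset holding roughly `fraction` of the records."""
    if not records:
        return []
    # Keep at least one record and never more than the dataset holds.
    size = min(len(records), max(1, round(len(records) * fraction)))
    rng = random.Random(seed)  # a fixed seed makes the slice reproducible
    return rng.sample(records, size)
```
]]></preformat>
      <p>Sampling without replacement keeps every record at most once in the slice, and passing a seed allows the same slice to be drawn again when an experiment has to be repeated.</p>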
      <p>Although ODMiner requires a further and more detailed evaluation,
to be performed on a larger set of OD datasets and a broader set of
classes to recognise, the preliminary results of these tests are encouraging
and show that ODMiner achieves good accuracy in performing the linking task.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>In this paper we presented ODMiner, a novel tool to support the process of
generating Linked Open Data from Open Data formats such as XML, JSON and
CSV. It provides an HTTP REST interface and is easy to extend by adding
new algorithms and supported formats. The evaluation that we carried out
demonstrates the effectiveness of our tool, though further investigation is
needed. Possible future work includes the extension of ODMiner in order to
support more formats, such as XLSX or OpenOffice spreadsheets, and the study
of solutions to mine meaningful information that is distributed across
multiple fields. For example, the latter is the case of dates represented as
multi-column values in a CSV having a column for the day, another for the month
and another for the year. The JSON configuration file used in the evaluation is available online at
http://eelst.cs.unibo.it/ld4ie/config.json.</p>
      <p>[Table or figure labels: dbpedia:Person, odm:TaxCode, sioc:Site, odm:Gender, foaf:mbox]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          .
          <article-title>D2R server - publishing relational databases on the semantic web</article-title>
          .
          <source>In Proceedings of ISWC2006 posters</source>
          , volume
          <volume>175</volume>
          .
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ciancarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Poggi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <article-title>Big data quality: a roadmap for open data</article-title>
          .
          <source>In 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService)</source>
          , pages
          <fpage>210</fpage>
          -
          <lpage>215</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>ACM SIGKDD explorations newsletter</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <article-title>Wrangler: Interactive visual specification of data transformation scripts</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</source>
          , pages
          <fpage>3363</fpage>
          -
          <lpage>3372</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          .
          <article-title>Krextor - an extensible XML-&gt;RDF extraction framework</article-title>
          .
          <source>In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web at ESWC 2009</source>
          , volume
          <volume>449</volume>
          .
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <article-title>Binary codes capable of correcting deletions, insertions and reversals</article-title>
          .
          <source>Cybernetics and Control Theory</source>
          ,
          <volume>8</volume>
          (
          <issue>10</issue>
          ):
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ristoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Mining the Web of Linked Data with RapidMiner</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>35</volume>
          :
          <fpage>142</fpage>
          -
          <lpage>151</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Verlic</surname>
          </string-name>
          .
          <article-title>LODGrefine - LOD-enabled Google Refine in action</article-title>
          .
          <source>In I-SEMANTICS (Posters &amp; Demos)</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>37</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>