<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ASIA: a Tool for Assisted Semantic Interpretation and Annotation of Tabular Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincenzo Cutrona</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Ciavotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio De Paoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Palmonari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano - Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Enriching datasets with additional information to build robust models is an essential task in many data science applications. Moreover, the wide availability of Linked Data encourages the reuse and integration of such high-quality information. The ASIA tool assists users in annotating tabular data at both schema- and instance-level, in such a way as to enable data extension. This demo paper presents its core capabilities.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic annotation</kwd>
        <kwd>Data enrichment</kwd>
        <kwd>Linked Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Table interpretation and annotation is the process whereby a table, e.g., a CSV file or an HTML &lt;table&gt;, is annotated with semantic pieces of information such as types (ontology classes or data types), properties, and resource identifiers. Consider, for instance, the case where the columns of the table are annotated with types specifying the class of entities or literal values contained in the column. Columns can also be associated with ontology properties, which specify a relation that is implicitly represented in the column; in this case, the column can be interpreted as a source of RDF triples &lt;subject, predicate, object&gt;, one per row, such that the values of the annotated column, i.e., the target column, are interpreted as objects of the triple, the values contained in a different column, specified as the source column (of the relation), are interpreted as subjects, and the property specified in the annotation defines the predicate of the triples. In addition to these schema-level annotations, instance-level annotations match values in the columns (interpreted as mentions of entities) to identifiers in a Knowledge Base (KB), e.g., identifiers of DBpedia resources. Several approaches have been proposed to automate this interpretation and annotation process; we suggest two recent papers for a review of the techniques proposed in these approaches [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ].
Among these approaches, we also mention semantic labeling approaches, where the distinction mentioned above between class-based and property-based annotations of columns is less strict than in our definition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The automatic table interpretation and annotation approaches discussed above target two main kinds of applications: mapping tables to known vocabularies and instances so as to generate RDF data from the tables, and executing structured queries on the large amounts of data available in web tables.
      </p>
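The triple-based reading of a property-annotated column described above can be sketched in a few lines; the table rows, column names, and DBpedia URIs below are illustrative assumptions, not data produced by ASIA.

```python
# Sketch: generating one RDF triple per row from a property-annotated column.
# The "country" column (target) is annotated with the property dbo:country;
# its subjects come from the "city" column (source). URIs are illustrative.

rows = [
    {"city": "http://dbpedia.org/resource/Milan",
     "country": "http://dbpedia.org/resource/Italy"},
    {"city": "http://dbpedia.org/resource/Lyon",
     "country": "http://dbpedia.org/resource/France"},
]

PREDICATE = "http://dbpedia.org/ontology/country"

def triples(rows, source_col, target_col, predicate):
    """Interpret target-column values as objects, source-column values as
    subjects, and the annotated property as the predicate."""
    return [(r[source_col], predicate, r[target_col]) for r in rows]

# Emit the triples in N-Triples syntax, one per row.
for s, p, o in triples(rows, "city", "country", PREDICATE):
    print(f"<{s}> <{p}> <{o}> .")
```

The same reading underlies any property annotation: the annotated column contributes the objects, the designated source column the subjects.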
      <p>In this paper, we showcase ASIA (Assisted Semantic Interpretation and Annotation Tool)1, a tool designed to support users in annotating data by providing assistance with three main tasks: i) schema-level annotation, to map tabular data to existing vocabularies and generate RDF data; ii) instance-level annotation, to perform data linking while generating the new data; iii) data extension, to use the links established with instance-level annotations to fetch additional data from third-party sources (e.g., after linking a column to DBpedia cities, additional data about these cities can be fetched from DBpedia). Thanks to the combination of instance-level annotations and data extension features, both implemented to work with third-party reconciliation and extension services2, ASIA targets a new type of application that is crucial to support analytics workflows at scale: semantic enrichment of tabular data to help users analyze their proprietary data once they are enriched with third-party data sources. Applications of this semantic enrichment task can be found in real-world data analytics projects in domains such as Digital Marketing3 and eCommerce.</p>
      <p>
        ASIA is built on top of the Data-as-a-Service (DaaS) application DataGraft and its data manipulation tool Grafterizer [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. From the latter, ASIA borrows the capability to transform the annotations into full-fledged data transformation scripts, which can be applied in batch mode to transform data into RDF or enrich large volumes of data. Moreover, ASIA provides features to streamline the annotation task by supplying cross-lingual vocabulary suggestion services based on data profiling systems, which provide information about the usage of vocabularies in existing data (currently, ABSTAT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Linked Open Vocabularies (LOV) are supported). As a result, ASIA table interpretation and annotation is offered as part of an end-to-end solution for semantic data preparation.
      </p>
      <p>We can summarize the novelties of ASIA by comparing it with other table annotation tools (the comparison with table interpretation tools or techniques that do not offer a UI is out of the scope of this paper).4 Compared to Karma, ASIA also provides reconciliation of column values as well as data extension; in addition, (cross-lingual) schema-level annotation is implemented as a service and is currently performed using vocabulary usage statistics rather than one full-fledged ontology (on the other hand, Karma uses more sophisticated schema-level annotation techniques). Compared to OpenRefine, ASIA supports more sophisticated schema-level annotations and RDF data generation; it also natively supports batch execution of data transformations. Odalic and MantisTable support schema-level annotation but, to the best of our understanding, do not support data enrichment. RMLEditor supports the editing of rules to generate RDF data, but does not perform table annotation and data enrichment.</p>
      <sec id="sec-1-1">
        <title>1 http://inside.disco.unimib.it/index.php/asia/</title>
        <p>2 The latest release of ASIA includes several reconciliation services: GeoNames, Google GeoTargets, Wikifier, and Google ProductsServices Categories.</p>
        <p>
          3 Examples of ASIA-supported enrichment pipelines in this domain can be found in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. 4 A more complete comparison can be found as a resource at https://ew-shopp.github.io/eswc2019-tutorial/, the tutorial's page where ASIA has been presented and compared with several other tools. This is the first work that illustrates the tool.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Demonstration</title>
      <p>[Fig. 1. Overview of the enrichment pipeline: Toponym → GN city (step 1); GN city → hasLabel (step 2), population (step 3), latitude (step 4); lat/long → coordinates (step 7); GN region → population (step 8), temperature (step 9), wind (step 10).]</p>
      <p>ASIA's prime objective is to support users in semantically annotating and extending datasets in a tabular format. In the following, we consider a scenario where a user is interested in running analyses requiring information about cities and their regions (such as population and coordinates), and weather forecasts about those regions. The dataset used for this demo has been provided by the JOT Internet Media company5 and contains data about digital marketing campaign performance. In particular, it comes with a column "CityStr", featuring city toponyms. We demonstrate how ASIA can help the user in extending the working dataset. First, the user relies on ASIA's matching functionalities to disambiguate the toponyms with non-ambiguous identifiers (URIs) from a reference KB, e.g., GeoNames in the example. These identifiers are then used to query the reference KB to retrieve additional information.</p>
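The retrieval step can be sketched as a SPARQL query built from a reconciled URI; the GeoNames URI and property below are illustrative assumptions, and ASIA wraps such queries behind its extension services rather than exposing them directly.

```python
# Sketch: once a toponym is reconciled to a KB URI, additional properties can
# be fetched with a SPARQL query against the reference KB. The URI and the
# property list here are illustrative examples.

def build_extension_query(entity_uri, properties):
    """Build a SPARQL query retrieving the given properties of one entity;
    OPTIONAL keeps a row even when a property value is missing in the KB."""
    selects = " ".join(f"?v{i}" for i in range(len(properties)))
    patterns = " ".join(
        f"OPTIONAL {{ <{entity_uri}> <{p}> ?v{i} . }}"
        for i, p in enumerate(properties)
    )
    return f"SELECT {selects} WHERE {{ {patterns} }}"

query = build_extension_query(
    "https://sws.geonames.org/3173435/",           # example GeoNames URI
    ["http://www.geonames.org/ontology#population"],
)
print(query)
```

Issuing one such query per reconciled value (or a batched variant) yields the property objects that populate the new extension columns.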
      <p>Figure 1 depicts the whole enrichment pipeline. The blocks refer to: a reconciliation step (blue), an annotation step (orange), a transformation step (purple), and extension steps, namely, KB-based extensions (green) and weather extensions (yellow). The first reconciliation step includes an important user validation step mediated by an interface (wrong reconciliations lead to wrong extensions). Statistics help the user understand the quality of the results returned by automatic reconciliation; the user can modify the results by i) choosing an alternative URI, or ii) manually inserting the URI. Consequently, a new reconciled column (named "GN city") is appended to the working dataset and is automatically annotated with the type of the entities listed therein.6 In step 2, with the support of the schema-level annotation form, the user specifies that toponyms are associated as labels with the GeoNames entities. Subsequently, the user exploits some KB-based extensions to extend the working dataset with information from GeoNames: the extension form allows the user to select as many properties as needed, and then retrieves all the properties' objects from the KB. In the pipeline depicted in Figure 1, the user applies four consecutive KB-based extension steps starting from the "GN city" column (steps 3 to 6); all these steps can be accomplished at once by selecting four properties in the extension form. The sixth step, "GN city → GN region", adds a new reconciled column to the dataset, which contains the region entity wherein the city entity is located. At
this point, the user may want to slightly modify the extension results, for example by merging the latitude and longitude columns into a new "coordinates" column (step 7). Starting from the "GN region" column, the user applies new KB-based extensions and appends the population of each region to the dataset (step 8). Lastly, the user retrieves information about weather (temperature and wind) at region level.</p>
      <sec id="sec-2-1">
        <title>5 https://www.jot-im.com</title>
        <p>6 Since GeoNames uses the type gn:Feature for all its instances, we adopted the gn:featureCode property as type, which is more significant.</p>
        <p>Weather extensions become available in ASIA when i) the dataset contains one column annotated as xsd:date, or ii) the dataset contains a column reconciled to GeoNames. Thus, the user obtains weather data by extending the "GN region" column. In the Weather extension form, the user selects the observation dates (which can be taken from another column - ASIA can recognize the most common date formats) and the day offset, i.e., the weather forecast for the next x days using the observation date as base. The user also has to select which aggregation function to apply to the daily weather observations (avg, min, max, cumulative). In the example pipeline, the user chooses to add information about temperature and wind (steps 9 and 10); as a result, the Weather extension appends n × m × p new columns, where n is the number of selected parameters, m is the number of selected offsets, and p the number of selected aggregation functions. Finally, the user downloads the enriched dataset in CSV format. Alternatively, she can generate a KB in RDF, or download the whole pipeline as an executable JAR to perform the same manipulations locally on larger volumes of data compared to those that can be managed from the UI.</p>
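The n × m × p column count can be checked with a tiny sketch; the parameter, offset, and aggregation choices below match the example pipeline, while the column naming scheme is an assumption for illustration only.

```python
# Sketch: counting the columns a Weather extension appends
# (n parameters x m day offsets x p aggregation functions).
from itertools import product

parameters = ["temperature", "wind"]   # n = 2 (steps 9 and 10)
offsets = [0, 1, 2]                    # m = 3: forecast for +0, +1, +2 days
aggregations = ["avg", "max"]          # p = 2

# One new column per (parameter, offset, aggregation) combination;
# the naming scheme below is purely illustrative.
new_columns = [f"{par}_{off}d_{agg}"
               for par, off, agg in product(parameters, offsets, aggregations)]
print(len(new_columns))  # n * m * p = 12
```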
        <p>A video demonstration of ASIA for building an enrichment pipeline that extends the one described above can be found at https://youtu.be/Z7M2_SjN2xo7. The demonstration can be replicated using the online version of DataGraft at https://datagraft.io/.</p>
        <p>Acknowledgment. This research has been partly supported by EU H2020
projects EW-Shopp - Grant n. 732590, and EuBusinessGraph - Grant n. 732003.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Colnet: Embedding the semantics of web tables for column type prediction</article-title>
          .
          <source>In: AAAI</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cutrona</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Paoli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosmerlj</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perales</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Semantically-enabled optimization of digital marketing campaigns</article-title>
          (
          <year>2019</year>
          ),
          <article-title>accepted for the ISWC 2019 In-Use track</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alse</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>Semantic labeling: A domain-independent approach</article-title>
          . In: ISWC. pp.
          <volume>446</volume>
          -
          <issue>462</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ritze</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Matching web tables to dbpedia - A feature utility study</article-title>
          .
          <source>In: EDBT</source>
          . pp.
          <volume>210</volume>
          -
          <issue>221</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Spahiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porrini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>ABSTAT: Ontology-driven linked data summaries with pattern minimalization</article-title>
          .
          <source>In: The Semantic Web</source>
          . pp.
          <volume>381</volume>
          -
          <issue>395</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sukhobok</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pultier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moynihan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elvesæter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahasivam</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Tabular data cleaning and linked data generation with grafterizer</article-title>
          .
          <source>In: ESWC (Posters &amp; Demos)</source>
          . pp.
          <volume>134</volume>
          -
          <issue>139</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7 Other videos are available at https://www.youtube.com/playlist?list=PLy7SznldqqmezwdL4QcxQYy2Fz1HV0wMS.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>