<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Translation Service for Open Data Portals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Urbanek</string-name>
          <email>sebastian.urbanek@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonja Schimmler</string-name>
          <email>sonja.schimmler@fokus.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Open Communication Systems</institution>
          ,
          <addr-line>Kaiserin-Augusta-Allee 31, 10589 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>113</fpage>
      <lpage>119</lpage>
      <abstract>
        <p>Open Data portals provide tens of thousands of datasets, which are described by metadata. This metadata is typically available in one or two languages only and, if translations exist, they are usually added manually. To build an inclusive data infrastructure, metadata should be available in as many languages as possible. The paper presents an approach for automatic translation of metadata within Open Data portals that are based on Semantic Web technologies and use the metadata standard DCAT-AP. Based on this approach, new functionalities are possible, such as enabling users to search for datasets in their native language. The approach was implemented for and tested within a practical application in a production environment. Acknowledgement: This work has been partially supported by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 16DII117 (“Deutsches Internet-Institut”). The implementation and provision of the European Data Portal has been funded by the European Commission under contracts DG CONNECT SMART 2017/1123.</p>
      </abstract>
      <kwd-group>
        <kwd>translation service</kwd>
        <kwd>open data</kwd>
        <kwd>semantic web</kwd>
        <kwd>linked data</kwd>
        <kwd>DCAT</kwd>
        <kwd>DCAT-AP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A central task of Open Data portals is the provision of data and the possibility to reuse it for other
purposes. To make this possible, the data must be findable and accessible by the users. This is
ensured by metadata, which describes the properties of the original data. For interoperability,
several metadata standards have been established. The Data Catalogue Vocabulary (DCAT)
        <xref ref-type="bibr" rid="ref1">(Albertoni et al., 2020)</xref>
        is a metadata standard for the provision of Open Data. In combination with
the Resource Description Framework (RDF) as part of the Semantic Web technology stack,
knowledge graph is enabled, to provide context and foster interoperability.
      </p>
      <p>Metadata is not always published in English; many national Open Data portals only list them in
the local language. To reduce language barriers and to make data globally accessible to foreign
speakers, metadata should be made available in as many languages as possible. Integrating this
information into a knowledge graph allows for additional search and filtering options. Making a
SPARQL endpoint available to access such a graph gives access to the translations in addition to the
original metadata. This opens up the possibility of developing further applications.</p>
      <p>This paper presents an approach for automatic translation of metadata, tailored to Open Data
portals that are based on Semantic Web technologies and use the metadata standard DCAT-AP.
Translations are performed on top of a knowledge graph representation, which serves as the input
and is expanded and updated accordingly. A software component was developed within a data
management ecosystem, which uses automatic translation and knowledge graph embedding.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Today, it is possible to translate texts without extensive knowledge of a target language. Machine
translation engines do this without the assistance of a single person. Modern translation engines
are based on statistical methods (SMT), which analyze the frequency and the similarity of words or
phrases in two different languages. If the similarity is high enough, one phrase is considered to be
the translation of the other
        <xref ref-type="bibr" rid="ref2">(Koehn, 2010)</xref>
        . Neural machine translation (NMT) is also a special type
of statistical translation. An artificial neural network analyzes the texts, and the network is trained
for the translation with some better results in contrast to SMT
        <xref ref-type="bibr" rid="ref3">(Bahdanau, Cho, Bengio, 2016)</xref>
        . This
approach is used by the European Commission's translation service, called eTranslation. It emerged
in 2017 from a research project called MT@EC and can translate text snippets as well as full
documents.
      </p>
      <p>
        There exist several standards for describing data, which build on each other. One widely used
metadata standard is the Data Catalog Vocabulary (DCAT), developed by the World Wide Web
Consortium (W3C). DCAT uses the Resource Description Framework (RDF) to ensure
interoperability between different datasets and portals. An extension for data portals in Europe is
the DCAT Application Profile (DCAT-AP), which the European Commission now publishes in
version 2.0. It describes the metadata of a dataset and the format in which they are stored and enables
Open Data portals to harvest datasets and catalogues from other ones with little effort.
        <xref ref-type="bibr" rid="ref4">Nuffelen
(2019)</xref>
        describes a dataset as a collection of data available for access or download in one or more
formats. Properties describe each dataset as metadata. Mandatory properties are title and
description. The actual data is linked as a distribution to a dataset. Distributions have an access URL
as a mandatory property, and a title and a description as optional properties. The datasets and
distributions each contain a description and a title that can be translated as literal.
      </p>
      <p>
        As far as we know, very few Open Data portals have translations of their metadata into other
languages, and if so, they are not managed automatically by a middleware. For instance, the EU
Open Data portal1 has translated metadata, but these translations are done manually. Within
research data portals, as provided by universities, metadata is usually provided in a native language
and/or in English, and translations are done manually by data curators. In Wikidata
        <xref ref-type="bibr" rid="ref6">(Wikidata,
2020)</xref>
        , metadata and data are provided in multiple languages. These translations are contributed by
tens of thousands of volunteers.
      </p>
      <p>1 https://data.europa.eu/euodp/en/about?</p>
    </sec>
    <sec id="sec-3">
      <title>3. Design of the Translation Service</title>
      <p>3.1</p>
      <sec id="sec-3-1">
        <title>Global Architecture of the Translation Service</title>
        <p>Many tools for the machine translation of text snippets or entire documents exist on the market. Very
well-known are Google Translate, Bing Translate or DeepL, which are commercial or private
services. This is not suitable for every application. In the context of the Connecting Europe Facility
(CEF) funding programme2, the EU supports a machine translation project, eTranslation, focusing
on translating documents for EU members that do not necessarily have to be done by a qualified
translator as a person and are secure on servers in Europe. eTranslation provides a Web interface to
upload documents in Microsoft Word or PDF format. Developers can use its API that requires text
snippets and some meta-information. This includes the domain, authorisation, source language, and
target languages.</p>
        <p>
          Our Translation Service is part of Piveau
          <xref ref-type="bibr" rid="ref7">(Kirstein et al., 2020)</xref>
          , a data management ecosystem for
the public sector. It is designed for Open Data portals based on Semantic Web technologies and uses
the metadata standard DCAT-AP. For the translation of text snippets, it integrates eTranslation.
Piveau is divided into several microservices that perform different tasks. The most important ones
are Piveau Hub, Piveau Consus, Piveau Metrics and Piveau UI. Piveau Hub is the heart of the entire
application, managing a data catalogue and providing metadata. It includes an indexing component,
enabling search and filter functions. Internal storage is realised with a Virtuoso triplestore3. Piveau
Consus is importing and processing data and metadata, also from external sources. Piveau Metrics
checks the data against various data quality criteria. Piveau UI is a customisable web interface for
interacting with the data.
        </p>
        <p>As shown in Figure 1, our Translation Service is integrated into Piveau as follows: Being
triggered by a scheduler, data is collected by Piveau Harvester. Piveau Hub checks within the
internal knowledge graph if the retrieved datasets contain new or changed texts and sends
translation requests to the Translation Service. When translations are finished, the Translation
Service sends them back to Piveau Hub, which adds them to the internal knowledge graph.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Components of the Translation Service</title>
        <p>Internally, our Translation Service consists of independently operating components that control all
translations simultaneously. It works asynchronously and uses the reactive framework Vert.x,
which is well suited for handling large amounts of text snippets. As shown in Figure 2, it consists of
four core components, which are triggered one after another: translation request handler, translation
request partitioner, translation receiver and translation allocator. It further provides a database for
internal storage for fail-safe purposes and integrates eTranslation for the actual translation of text
snippets.</p>
        <p>2 https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+Digital+Home
3 https://virtuoso.openlinksw.com</p>
        <sec id="sec-3-2-1">
          <title>Translation Request Handler</title>
          <p>As described in the previous subsection, new translation requests are sent from Piveau Hub to the
Translation Service. The Translation Request Handler formally checks the requests. If errors occur,
a request is rejected. For instance, text snippets might be empty, or the original language is not
supported by eTranslation4. Valid translation requests are placed in a queue and temporarily stored
in a database for fail-safe purposes.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Translation Request Partitioner</title>
          <p>The Translation Request Partitioner checks periodically if there are new translation requests in the
queue. Once translation requests are waiting to be translated, the corresponding datasets are
prepared for translation, and text snippets are sent to eTranslation.</p>
          <p>It is not possible to send a record in the form of composed RDF triples to a translation engine
since such an engine simply translates text into another language. The engine might destroy the
syntax or might not be able to translate it at all. Thus, all data records need to be prepared. The
individual texts that are meant to be translated are extracted and always remain assignable to the
respective datasets by adept ID management. These text snippets are submitted to the translation
engine used.</p>
          <p>eTranslation provides a REST API that accepts a POST request and JSON as a body. To ensure a
successful translation of all texts from the datasets, they are brought into a form that conforms to
eTranslation's API. As eTranslation has a character limit, the texts to be translated are divided into
several subtexts, translated individually, and reassembled afterwards.</p>
          <p>Although the translation requests were already formally checked by the Translation Request
Handler, further pre-processing steps are required. Often, the title and description of a dataset are
4 en, bg, hr, cs, da, nl, et, fi, fr, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, nb, de
not correctly entered by the data provider. The actual plain text is sometimes an XML or HTML
document, sometimes a string of characters resulting from a binary coded file. In the case of XML or
HTML documents, a particular form of document translation is used that does not translate tags and
alike. In the case of a binary file, no translation is performed. Often, text snippets include URLs and
hashtags. These are extracted and saved separately before translation and inserted again afterwards.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Translation Receiver</title>
          <p>The Translation Service must also provide an interface for the finished translations, as
eTranslation works asynchronously. For each language, the Translation Service receives a separate
request from eTranslation. On average, each text is translated into 26 languages. Additionally, each
dataset consists of at least four texts - title and description of a dataset and title and description of
each distribution. Datasets often have multiple distributions, on average 8 in our usage scenarios,
and sometimes hundreds, thus creating a vast number of requests. The Translation
Receiver manages the finished translations. It checks the properties and language and performs
error handling. All translations are stored in a database for fail-safe purposes.</p>
        </sec>
        <sec id="sec-3-2-4">
          <title>Translation Allocator</title>
          <p>
            The Translation Allocator finally prepares the translated texts for integration into the internal
knowledge graph. All text snippets are assigned to the corresponding dataset or distribution and
provided with semantic features. This includes information about the languages and whether it is a
manual translation or a machine translation. A text to be translated according to the DCAT-AP
standard is a literal and has an optional language label. The XML data type lang specifies this. "This
is an example text."@en. The language tag is based on the ISO 639-1 language encoding. If a text is
translated, the tag is extended by: en-t-de-t0-etranslation. The first language code stands for the actual
language, followed by a t for the translation from the original language (in this example: de).
The t0 signals a machine translation, and the etranslation stands for the engine used
            <xref ref-type="bibr" rid="ref9">(Davis et al.,
2012)</xref>
            . New translations are integrated into the internal knowledge graph together with the
generated language tags. The Apache Jena framework5 is used for interaction between the software
and the knowledge graph.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Use Case: The European Data Portal</title>
      <p>Generally, each EU member has one or more national Open Data portals, separated by topic such as
geographic or administrative data. Due to the historical development of Europe, different languages
are spoken in these countries. Nevertheless, the EU is trying to foster trans-European exchange and
minimise language barriers.</p>
      <p>
        This requirement also exists for the European Data Portal
        <xref ref-type="bibr" rid="ref8">(EDP, 2019)</xref>
        . Its central task is making
public sector information available as Open Data. It currently provides approximately 1.1 million
datasets from 36 European states, harvested from 81 national Open Data portals. These national
5 https://jena.apache.org/index.html
portals mostly contain datasets in English or the national language. For the goal of being an inclusive
portal, all datasets need to be translated, possibly into all other European languages.
      </p>
      <p>In August 2019, we successfully integrated our Translation Service into the EDP. It fully meets
the formulated requirements, and apart from some minor adaptions, it is fully functioning since
then.</p>
      <p>Due to the daily harvesting of national Open Data portals, there are a lot of datasets that need to
be translated every day within the EDP. This is because the descriptions or titles change or because
new distributions or even entire datasets are added. Usually, between 1000 and 130,000 new data
records or translation requests are sent to the Translation Service every day. This requires the
Translation Service to produce the translations in a timely manner.</p>
      <p>At some point, the metadata has been integrated into a national Open Data Portal by either
humans or machines. Thus, the input is very heterogeneous and does not always follow the rules
defined by DCAT-AP. This is a major challenge and requires metadata pre-processing.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion &amp; Outlook</title>
      <p>The Translation Service has been developed to translate large amounts of metadata automatically.
It is designed for Open Data portals that are based on Semantic Web technologies and use the
metadata standard DCAT-AP. It allows for the use of different translation engines, which take care
of the actual translation. Thanks to integrating the translations into an internal knowledge graph,
new functionalities are enabled, e.g., users can search for datasets in their native language. Also,
translations are available via a SPARQL endpoint, which can be used for future applications.</p>
      <p>Future work on metadata pre-processing can further improve the service. One example is the
source language, which is not always stated correctly. Consequently, the text snippet cannot be
translated by eTranslation. Automatic recognition of the source language could resolve this issue.
Further pre-processing steps of the texts are also required. One example are problems with special
characters and character encodings in some languages - including text sections that should not be
translated.</p>
      <sec id="sec-5-1">
        <title>About the Author</title>
        <p>Sebastian Urbanek
Sebastian Urbanek is a researcher at the Fraunhofer FOKUS. Here, he is working on the core topics of Open
Data and data management, including the European Data Portal. As a member of the Weizenbaum Institute,
he is researching the effects of digitalisation on science. He focuses on databases, data structures and data
quality.</p>
        <p>Sonja Schimmler</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Albertoni</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Browning</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Beltran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Perego</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Winstanley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Data Catalog Vocabulary (DCAT) - Version 2</article-title>
          .
          <string-name>
            <given-names>W3C</given-names>
            <surname>Recommendation</surname>
          </string-name>
          .
          <source>Retrieved March</source>
          <volume>19</volume>
          ,
          <year>2021</year>
          , from https://www.w3.org/TR/2020/REC-vocab-dcat-2-20200204/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <source>Statistical Machine Translation</source>
          . New York: Cambridge University Press, 3rd edition,
          <fpage>81</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Nuffelen</surname>
            ,
            <given-names>B. v.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>DCAT Application Profile for data portals in Europe Version 2.0.0</article-title>
          .
          <string-name>
            <given-names>EU</given-names>
            <surname>Commission</surname>
          </string-name>
          .
          <source>Retrieved March</source>
          <volume>19</volume>
          ,
          <year>2021</year>
          , from
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Tags for Identifying Languages</article-title>
          .
          <source>BCP47. Retrieved March</source>
          <volume>19</volume>
          ,
          <year>2021</year>
          from https://tools.ietf.org/html/bcp47.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Wikidata</surname>
          </string-name>
          (
          <year>2020</year>
          ,
          <year>August 10</year>
          ).
          <source>Help:Multilingual. Retrieved March</source>
          <volume>19</volume>
          ,
          <year>2021</year>
          from https://www.wikidata.org/wiki/Help:Multilingual.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Kirstein</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dittwald</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dutkowski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Urbanek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hauswirth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , (
          <year>2020</year>
          ).
          <article-title>Piveau: A Large-Scale Open Data Management Platform Based on Semantic Web Technologies</article-title>
          . In: Harth,
          <string-name>
            <surname>A.</surname>
          </string-name>
          et al. (
          <article-title>eds) The Semantic Web</article-title>
          .
          <source>ESWC 2020. Lecture Notes in Computer Science</source>
          , vol
          <volume>12123</volume>
          . Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>EDP</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>European Data Portal User Manual</article-title>
          .
          <source>Portal Version 4.3. Retrieved March</source>
          <volume>19</volume>
          ,
          <year>2021</year>
          from https://www.europeandataportal.eu/sites/default/files/edp_s1
          <source>_man_portal-version_4.3-usermanual_v1.0</source>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Umaoka</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Falk</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2012</year>
          ). Extension T - Transformed
          <string-name>
            <surname>Content</surname>
          </string-name>
          .
          <source>BCP47. Retrieved March</source>
          <volume>19</volume>
          ,
          <year>2021</year>
          from https://tools.ietf.org/html/rfc6497.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>