<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Semantic Technologies to Manage a Data Lake: Data Catalog, Provenance and Access Control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Henrik Dibowski</string-name>
          <email>henrik.dibowski@de.bosch.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Schmid</string-name>
          <email>stefan.schmid@de.bosch.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yulia Svetashova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cory Henson</string-name>
          <email>cory.henson@us.bosch.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuan Tran</string-name>
          <email>anhtuan.tran2@de.bosch.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Research and Technology Center</institution>
          ,
          <addr-line>PA 15222 Pittsburgh</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology</institution>
          ,
          <addr-line>76133 Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Robert Bosch GmbH</institution>
          ,
          <addr-line>Chassis Systems Control, 74232 Abstatt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Robert Bosch GmbH, Corporate Research</institution>
          ,
          <addr-line>71272 Renningen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>65</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>Data lake architectures enable the storage and retrieval of large amounts of data across an enterprise. At Robert Bosch GmbH, we have deployed a data lake for this express purpose, focused on managing automotive sensor data. Simply centralizing and storing data in a data lake, however, does not magically solve critical data management challenges such as data findability, accessibility, interoperability, and re-use. In this paper, we discuss how semantic technologies can help to resolve such challenges. More specifically, we demonstrate the use of ontologies and knowledge graphs to provide vital data lake functions including the cataloging of data, tracking provenance, access control, and of course semantic search. Of particular importance is the development of the DCPAC Ontology (Data Catalog, Provenance, and Access Control) along with its deployment and use within a large enterprise setting to manage the huge volume and variety of data generated by current and future vehicles.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Semantic Data Lake</kwd>
        <kwd>Semantic Search</kwd>
        <kwd>Semantic Layer</kwd>
        <kwd>Provenance</kwd>
        <kwd>Access Control</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Robert Bosch GmbH is a large enterprise company that designs and manufactures
automotive components, ensuring the agility, comfort, function and safety of vehicles
and driver assistance systems. Such components range from classical safety products
including airbags and electronic stability control to next generation automated driving
systems. Both the volume and variety of data generated by these systems have been
growing dramatically in the past few years. More specifically, the types of data range
from sensor data – including video, RADAR, LIDAR, and CAN bus signals – to
textual data and metadata about the various projects collecting and using the data within
the company. To handle this complexity, we have developed a holistic architecture for
managing our data within the enterprise – the Bosch Automotive Data Lake.</p>
      <p>Simply centralizing and storing data in a data lake, however, does not immediately
solve all data management challenges. Specifically, issues of findability, accessibility,
interoperability, and re-use – the four principles of FAIR data<sup>1</sup> – remain unresolved.
To facilitate these principles of FAIR data, we have extended our data lake
architecture with a semantic layer. This semantic layer consists of an ontology and knowledge
graph (KG) that provides meaningful, semantic description of all resources in the data
lake. The resources include a heterogeneous assortment of documents, datasets, and
databases. Semantic description of these resources, represented as a knowledge graph,
includes information about the content of the resources, the provenance, and access
control permissions. The ability to perform semantic search of all data in the data lake
provides enhanced findability, access, interoperability, and re-use.</p>
      <p>The three primary contributions of this paper are the creation of a DCPAC
Ontology (Data Catalog, Provenance, and Access Control), the development of the
Semantic Data Lake Catalog KG that is conformant to DCPAC, and the application of
the ontology and KG for semantic search and retrieval. In Section 2, we discuss
related work and then introduce the development and structure of the DCPAC Ontology in
Section 3. The creation of a conformant KG and its use within an enterprise setting is
explained in Section 4. Finally, in Section 5 we conclude with an overall summary
and directions for the future.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In the era of big data, data catalogs emerged as the standard for metadata
management. In the last few years, however, new application areas have appeared and the
volume and richness of metadata required has grown significantly. Data lakes
constitute one such important new application for data catalogs, besides warehouses, master
data repositories, etc. According to Gartner, a data catalog “… maintains an
inventory of data assets through the discovery, description, and organization of datasets<sup>2</sup>. The
catalog provides context to enable data analysts, data scientists, data stewards, and
other data consumers to find and understand a relevant dataset for the purpose of
extracting business value.” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Current vendors offer a wide range of commercial data catalog software. Examples
include Alation Data Catalog, Atlan Enterprise Data Catalog, Talend
Data Catalog, Collibra Data Catalog, Informatica Enterprise Data Catalog, Microsoft
Azure Data Catalog, Oracle Cloud Infrastructure Data Catalog, and even Google is
joining the market with its Google Data Catalog. To our knowledge, however, none
of these data catalogs uses or supports standard semantic technologies, nor do they
allow for using existing ontology vocabularies. Rather, they are closed, proprietary
systems with their own metadata languages and glossaries.</p>
      <p>1 https://www.go-fair.org/fair-principles/</p>
      <p>2 Datasets are the files, tables, graphs etc. that applications or data engineers need to find and
access.</p>
      <p>Anzo Cambridge Semantics<sup>3</sup> is one of the few exceptions, as it is built from the open
data standards OWL, RDF and SPARQL, which makes it simple to leverage rapidly
evolving vocabularies in multiple industries. Anzo has a built-in smart data catalog
functionality that is able to automatically extract the schemas of databases in a data
lake and support the mapping of the schemas to ontology terms. But the integration of
this data catalog functionality with existing ETL pipelines, as well as extensibility of
the built-in data catalog ontology based on domain specific needs, is limited.</p>
      <p>
        Adding a semantic layer to a data lake is a common approach to developing a
semantic data lake, and has been described in the literature. The use of data catalogs in
this context, however, is still rare. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a data lake using semantic technologies is
presented that can manage datasets produced by sensors or simulation programs in the
manufacturing domain. It comprises a data catalog that provides inventory services
and also implements security mechanisms. Different from our approach, however, this
data catalog is not built using standard semantic technologies, but rather as a simple
file system.
      </p>
      <p>
        A semantic data lake architecture for autonomous fault management in
software-defined networking environments, with clear similarities to ours (Section 4), is
described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Another comparable semantic data lake architecture called “Squerall”
is proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This solution proposes distributed query execution techniques and
strategies for querying heterogeneous big data. Both approaches, however, lack a data
catalog and other means of handling provenance or access control.
      </p>
      <p>Our solution differs from existing solutions by proposing a semantic data lake
architecture that incorporates a semantic data catalog, built with standard semantic
technologies, and that addresses provenance and access control for resources in the
data lake. This solution is described in detail in the following sections.</p>
    </sec>
    <sec id="sec-4">
      <title>Semantic Data Catalog, Provenance and Access Control Layer for Data Lakes</title>
      <p>As one of the three primary contributions of this paper, this section describes the
DCPAC ontology (Data Catalog, Provenance, and Access Control). The DCPAC
ontology can be applied for adding a semantic layer to a data lake, which provides
semantic description of the content, provenance, and access control permissions of the
resources in a data lake. This ontology was created by combining several common,
(predominantly) standardized ontology vocabularies and by aligning and extending
them where necessary.</p>
      <sec id="sec-4-1">
        <title>Ontology Layer Architecture</title>
        <p>The DCPAC ontology is located at the bottom of the layer architecture shown in Fig. 1, and recursively imports all other ontologies. Additionally, it defines
SHACL constraints for validating instance data (ABox).</p>
        <p>In the following subsections, the primary ontologies utilized by DCPAC are
described.</p>
        <p>Fig. 1. Layer architecture of the data catalog, provenance and access control (DCPAC)
ontology for data lakes. [Figure labels: SKOS Ontology; DCMI Metadata Terms Ontologies; Data Catalog Ontology (DCAT); Provenance Ontology (PROV-O); SKOS Tags Ontology (STO); ODRL Ontology; DCAT – PROV-O Alignment Ontology (DPA); FOAF Ontology; Data Catalog – Provenance – Access Control Ontology for Data Lakes (DCPAC); SHACL Shapes; Imports]</p>
        <sec id="sec-4-1-16">
          <title>Data Catalog (DCAT) Ontology [Prefix: dcat]</title>
          <p>
            The Data Catalog (DCAT) ontology
“… is an RDF vocabulary designed to facilitate interoperability between data catalogs
published on the Web. … DCAT enables a publisher to describe datasets and data
services in a catalog using a standard model and vocabulary that facilitates the
consumption and aggregation of metadata from multiple catalogs. This can increase the
discoverability of datasets and data services. It also makes it possible to have a
decentralized approach to publishing data catalogs and makes federated search for datasets
across catalogs in multiple sites possible using the same query mechanism and
structure.” [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. DCAT is standardized as a W3C recommendation, with the latest version
from February 2020, and is being developed further by an active community.
          </p>
          <p>
            The DCAT ontology imports and uses the widely recognized SKOS [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] and DCMI
Metadata Terms [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] ontologies. Its primary purpose in the context of the DCPAC
ontology is the semantic description of the content of resources in a data lake.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Provenance Ontology (PROV-O) [Prefix: prov]</title>
        <p>
          The Provenance Ontology (PROV-O) “… expresses the PROV Data Model using the OWL2 Web Ontology
Language. It provides a set of classes, properties, and restrictions that can be used to
represent and interchange provenance information generated in different systems and
under different contexts. It can also be specialized to create new classes and
properties to model provenance information for different applications and domains.” [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
PROV-O is a W3C recommendation from April 2013. Its purpose in the context of
DCPAC is to describe the provenance of the data lake resources. Such provenance
information may include the ownership of resources, how they were created, by which
activity and agent, and from what data they were derived.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Open Digital Rights Language (ODRL) Ontology [Prefix: odrl]</title>
        <p>
          The Open Digital Rights Language (ODRL) ontology “… is a policy expression language that provides
a flexible and interoperable information model, vocabulary, and encoding
mechanisms for representing statements about the usage of content and services. The ODRL
Vocabulary and Expression describes the terms used in ODRL policies and how to
encode them.” [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The latest version 2.2 was published by the W3C in September
2017. In our data lake scenario, ODRL is applied to defining access control
permissions for the data lake resources, including who can access a resource and which
actions are permitted, e.g. display, read, modify, delete.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>DCAT – PROV-O Alignment (DPA) Ontology [Prefix: dpa]</title>
        <p>
          The DCAT – PROV-O Alignment (DPA) ontology [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was created by the W3C Dataset Exchange
Working Group (DXWG) and contains alignment axioms between DCAT ontology and
PROV-O. Thereby, it enhances the DCAT ontology with the ability to use PROV-O
for expressing advanced provenance information.
        </p>
        <p>The most relevant alignments defined in the DPA ontology are shown in Fig. 2. It
aligns the DCAT ontology classes dcat:CatalogRecord, dcat:Resource
and dcat:Distribution as subclasses of the PROV-O class prov:Entity by
adding corresponding rdfs:subClassOf statements. Thus, all instances of these
classes and their subclasses become instances of prov:Entity, which allows the
usage of all associated PROV-O object properties and classes for modeling
provenance information. This makes the provenance and authorship of data, along with its
evolution over time, trackable in detail.</p>
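        <p>The effect of these rdfs:subClassOf alignments can be illustrated with a minimal Python sketch. It is not part of the DPA ontology or of our deployment; it only mimics, with a plain dictionary, how a reasoner would conclude that an instance of dcat:Dataset is also an instance of prov:Entity. The axiom that dcat:Dataset is a subclass of dcat:Resource is taken from DCAT itself, and the function names are hypothetical.</p>
        <preformat>
```python
# Illustrative subclass axioms. The alignment of dcat:CatalogRecord,
# dcat:Resource and dcat:Distribution to prov:Entity follows the text;
# dcat:Dataset as a subclass of dcat:Resource comes from DCAT itself.
SUBCLASS_OF = {
    "dcat:CatalogRecord": {"prov:Entity"},
    "dcat:Resource": {"prov:Entity"},
    "dcat:Distribution": {"prov:Entity"},
    "dcat:Dataset": {"dcat:Resource"},
}


def superclasses(cls):
    """Return all direct and indirect superclasses of cls."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_instance_of(asserted_type, target_class):
    """True if an instance typed as asserted_type is also a target_class."""
    return asserted_type == target_class or target_class in superclasses(asserted_type)
```
</preformat>
        <p>With these axioms, is_instance_of("dcat:Dataset", "prov:Entity") holds, which is what allows PROV-O properties to be attached to any dataset in the catalog.</p>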
        <p>
          SKOS Tags Ontology (STO). The Simple Knowledge Organization System (SKOS)
is “a common data model for sharing and linking knowledge organization systems”
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We design separate SKOS vocabularies for different domains and use them to
specify the semantics of resources in a data lake, depending on their subject. In
particular, we assign each dataset a set of skos:Concept instances as tags that provide a semantic
description of the content of a data lake resource.
        </p>
        <p>The SKOS vocabularies are domain specific. While defining these vocabularies,
we often reuse terms from existing or newly developed domain ontologies. From the
domain ontologies, we select subsets of classes and individuals that are relevant for
the tasks of retrieval, and define them as instances of skos:Concept. Domain
specific SKOS vocabularies are iteratively added to the SKOS Tags ontology (STO),
which serves as a generic component of the domain-agnostic architecture of the
DCPAC ontology (see Fig. 1), bridging it with domain specific ontologies. In the
following sub-section, we describe one such domain ontology (ASO), developed for
the Bosch Automotive Data Lake, and show its relationship to the STO.</p>
        <p>[Fig. 2 labels: prov:Entity; dcat:CatalogRecord; dcat:Resource; dcat:Distribution; dcat:Dataset; dcat:DataService; dcat:Catalog; dcat:DataDistributionService]</p>
        <p>
          Automotive Signal Ontology (ASO) [Prefix: aso]. The primary goal of the
Automotive Signal Ontology [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is to represent manifold signal types in automotive datasets
and to enable non-trivial queries spanning over datasets of different types, formats
and modalities (including radar signals, onboard diagnostics and video data). The use
of this ontology allows non-domain experts to understand and query the data, as well
as to automate the integration of signals from different sources in support of a wide
range of applications and use-cases of interest to the automotive industry.
        </p>
        <p>
          The ASO is an OWL 2 ontology. It borrows concepts from several standard
ontologies and vocabularies, namely the W3C Semantic Sensor Network Ontology (SSN)
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the Quantities, Units, Dimensions, and Data Types Ontologies (QUDT) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and
the Vehicle Signal and Attribute Ontology (VSSo) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], generated from the
automotive standard VSS [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The ASO conceptualizes a signal by defining several
meaningful relations, including the signal type (e.g. aso:WindowPosition as a
subclass of aso:ObservableSignal), the associated vehicle component (e.g.
aso:Window), the sensor(s) and actuator(s) involved in generating signal data, as
well as the measured physical quantities and units-of-measure. It also provides terms
to describe the specific details of automotive data collection, e.g. CAN bus data, CAN
frames, messages and signals.
        </p>
        <p>The ASO also defines an associated SKOS vocabulary, where all signals are
defined as instances of skos:Concept. This vocabulary is a part of STO.</p>
        <p>Consequently, the ASO has a dual role in our Automotive Data Lake. The typing
of ASO signals as skos:Concepts provides the means to tag resources in the data
lake in a consistent way and enriches the semantic search capabilities provided by our
DCPAC ontology. In addition, the formal semantics of the ASO itself enables
expressive queries, which go beyond the hierarchical SKOS tag search and make the data
lake truly semantic. For example, one can find all datasets that are tagged with signals of a
certain type (e.g. aso:ObservableSignal) and that are associated with a specific
vehicle component (e.g. aso:Window).</p>
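        <p>The kind of query described above can be sketched in a few lines of Python over in-memory dictionaries. This is only an illustration of the idea, not a query against the actual knowledge graph: the dataset identifiers, the tag aso:VehicleSpeed and the signal-to-component mapping are hypothetical, while aso:WindowPosition, aso:ObservableSignal and aso:Window are the examples used in the text.</p>
        <preformat>
```python
# Hypothetical ASO excerpt: signal type hierarchy and signal-to-component links.
SIGNAL_SUPERCLASS = {"aso:WindowPosition": "aso:ObservableSignal"}
SIGNAL_COMPONENT = {"aso:WindowPosition": "aso:Window"}

# Hypothetical catalog excerpt: dataset identifier -> signal tags.
DATASET_TAGS = {
    "drive-001": {"aso:WindowPosition"},
    "drive-002": {"aso:VehicleSpeed"},
}


def has_type(signal, signal_type):
    """Walk up the (single-parent) hierarchy to test the signal type."""
    current = signal
    while current is not None:
        if current == signal_type:
            return True
        current = SIGNAL_SUPERCLASS.get(current)
    return False


def find_datasets(signal_type, component):
    """Datasets with a tag of the given type linked to the given component."""
    return sorted(
        dataset
        for dataset, tags in DATASET_TAGS.items()
        if any(
            has_type(tag, signal_type) and SIGNAL_COMPONENT.get(tag) == component
            for tag in tags
        )
    )
```
</preformat>
        <p>A flat tag search for aso:ObservableSignal would miss the dataset tagged with aso:WindowPosition; walking the class hierarchy is what makes the match possible.</p>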
      </sec>
      <sec id="sec-4-5">
        <title>Data Catalog – Provenance – Access Control (DCPAC) Ontology [Prefix: dcpac]</title>
        <p>
          The DCPAC ontology is our primary contribution to the ontology layer architecture
shown in Fig. 1. It combines, aligns and extends the ontology vocabularies described
in the previous section. The ontology directly imports the ODRL ontology, the DPA
ontology, the FOAF (“Friend of a Friend”) ontology [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and optionally one or more
STO ontologies, and recursively imports all other shown ontologies. We chose to
reuse properties defined by the FOAF ontology – such as foaf:givenName,
foaf:name, and foaf:mbox – to extend the existing definitions of prov:Agent
and odrl:PartyCollection.
        </p>
        <p>Alignments and Extensions to the Upper Layer Ontologies. The DCPAC ontology
aligns the DCAT ontology with the ODRL ontology by declaring the classes
dcat:Distribution and dcat:Resource to be subclasses of odrl:Asset,
as can be seen in the upper part of Fig. 3. With odrl:Asset representing a resource
or a collection of resources that are the subject of access authorization rules, this
enables the definition of access control permissions for these DCAT classes and
subclasses with the ODRL vocabulary. Furthermore, the DCPAC ontology extends the
DCAT ontology by defining various types of dcat:Dataset subclasses (see Fig.
3), which allows for distinguishing different types of datasets in a data lake, such as
raw data files, tabular data files, relational database and graph database resources.</p>
        <p>Another contribution is the alignment of PROV-O with the ODRL ontology, as
shown in Fig. 4. The PROV-O class prov:Agent is declared as a subclass of
odrl:Party, hence enabling all instances of prov:Agent to undertake roles in
access control permissions. Additionally, the DCPAC ontology defines new
subclasses of prov:Activity, which allow for distinguishing different types of
activities that created (dcpac:GenerationActivity) or modified
(dcpac:ModificationActivity) a data lake resource.</p>
        <p>
          SHACL Constraints. The DCPAC ontology is associated with a SHACL shapes
definition file that defines a comprehensive set of SHACL constraints of type SHACL
Shapes (Node Shapes, Property Shapes) and SPARQL-based constraints [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
SHACL shapes define cardinalities and type restrictions on properties, and regular
expressions on the allowed values of string datatype properties. One such SHACL
shape, for example, requires each dcat:Dataset instance to have exactly
one value of type string defined for the property dct:identifier, and the string
must match the regular expression “^[a-z0-9][a-z0-9_\-]{2,59}$”.
SPARQL-based constraints have a higher expressivity and can capture complex
dependencies as graph patterns. For the class dcat:Dataset, for example, we
defined a constraint that requires each instance to have at least one semantic tag
(skos:Concept) attached, and the tags must be members of a
skos:ConceptScheme that is associated with (i.e. enabled for) the catalog the
dataset belongs to (see also next section and Fig. 5).
        </p>
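        <p>To make the two example constraints concrete, the following Python sketch re-implements their logic outside of SHACL. It is an assumption-laden illustration, not the shapes file itself: the regular expression is the one quoted above, whereas the function names and the dictionary that maps tags to concept schemes are hypothetical stand-ins for the graph patterns a SHACL engine would evaluate.</p>
        <preformat>
```python
import re

# Regular expression quoted in the text for dct:identifier values.
IDENTIFIER_PATTERN = re.compile(r"^[a-z0-9][a-z0-9_\-]{2,59}$")


def check_identifier(identifiers):
    """Exactly one string value, and that value matches the pattern."""
    if len(identifiers) != 1:
        return False
    value = identifiers[0]
    return isinstance(value, str) and IDENTIFIER_PATTERN.match(value) is not None


def check_tags(tags, tag_scheme, enabled_schemes):
    """At least one tag, and every tag belongs to an enabled concept scheme.

    tag_scheme maps a tag to its concept scheme (hypothetical in-memory
    stand-in for the graph pattern of a SPARQL-based constraint).
    """
    if not tags:
        return False
    return all(tag_scheme.get(tag) in enabled_schemes for tag in tags)
```
</preformat>
        <p>The pattern accepts identifiers of 3 to 60 characters that start with a lower-case letter or digit, so "drive_001" passes while "ab" and "Drive-001" are rejected.</p>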
        <p>A SHACL engine can process the constraints and validate the consistency of the
KG (ABox)<sup>4</sup>. This improves the integrity and quality of the KG and helps prevent data inconsistencies.</p>
        <p>[Fig. 3 labels: odrl:Asset; odrl:AssetCollection; dcat:Resource; dcat:Distribution]</p>
        <p>4 We use Stardog as a highly scalable triple store for our Bosch data lake. Stardog supports
SHACL and has a built-in SHACL engine. https://www.stardog.com/platform/</p>
      </sec>
      <sec id="sec-4-6">
        <title>The Core Vocabulary</title>
        <p>This section provides an overview and explanation of the core vocabulary of the
DCPAC ontology and the primary imported vocabularies, which were explained in the
previous sections. For the explanation, we refer to Fig. 5, which shows the main
ontology classes as well as the most important object properties and datatype properties.
The stereotypes shown for some of the classes in Fig. 5 contain their superclasses and
hence their alignment to the other ontologies described in the previous sections. We
abstain from showing and explaining additional classes and properties that are
specific for the Bosch Automotive Data Lake in order to maintain comprehensibility and
domain-independence.</p>
        <p>DCAT Entities. Let us start with the DCAT ontology classes shown in the center and
bottom left of Fig. 5. The overall data catalog of the data lake is represented by one
instance of class dcat:Catalog. It can contain many dcat:Dataset instances,
one per resource in the data lake, e.g. raw data files, HBase or Hive tables, or
RDF-based knowledge graphs. An instance of class dcat:Distribution models a
specific representation of a dataset, comprising a specific serialization or schematic
arrangement. Different distributions can exist for the same dataset, and are accessible
via a URL (dcat:downloadURL). The data catalog and the datasets can each have
several data distribution services (dcat:DataDistributionService), which
are end-points that provide access. They are accessible via an endpoint URL
(dcat:endpointURL).</p>
        <p>PROV-O Entities. The PROV-O classes and properties shown in the top right part of
Fig. 5 are used for modeling the provenance of the data catalog and its datasets (both
declared as subclasses of prov:Entity, see Fig. 2), and for defining agents (e.g.
person, software agent) they are attributed to (prov:wasAttributedTo) or that
were involved in the activity of creating the dataset. Activities (prov:Activity)
are initiated by agents (prov:wasAssociatedWith), create new datasets
(prov:wasGeneratedBy), have a start and end time, and can use other datasets
as input (prov:used).</p>
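        <p>As a sketch of how these provenance links can be followed, the Python fragment below derives the lineage of a dataset by alternating between prov:wasGeneratedBy and prov:used. The dataset and activity identifiers are hypothetical, and the dictionaries merely stand in for the corresponding triples in the knowledge graph.</p>
        <preformat>
```python
# Hypothetical provenance statements using the PROV-O properties named above.
WAS_GENERATED_BY = {          # dataset -> activity (prov:wasGeneratedBy)
    "ds:labels": "act:labeling",
    "ds:frames": "act:extraction",
}
USED = {                      # activity -> input datasets (prov:used)
    "act:labeling": ["ds:frames"],
    "act:extraction": ["ds:raw-video"],
}


def lineage(dataset):
    """Return every dataset that the given dataset was derived from."""
    ancestors = []
    pending = [dataset]
    while pending:
        current = pending.pop()
        activity = WAS_GENERATED_BY.get(current)
        for source in USED.get(activity, []):
            if source not in ancestors:
                ancestors.append(source)
                pending.append(source)
    return ancestors
```
</preformat>
        <p>In a triple store, the same traversal would typically be expressed as a property path over the two properties rather than as an explicit loop.</p>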
        <p>ODRL Entities. Access control is defined by classes and properties from the ODRL
ontology. An odrl:Permission can define an access rule for groups of agents
(odrl:PartyCollection) to datasets, their distributions and/or data distribution
services (odrl:target). The allowed actions (odrl:Action), such as display,
read, modify, delete, are defined as skos:Concept instances and attached via the
odrl:action object property.</p>
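        <p>The structure of such a permission can be mimicked with a small Python sketch. It is a simplified stand-in, not an ODRL evaluator: the tuple fields loosely mirror the party collection, odrl:target and odrl:action described above, and all group, agent and dataset names are hypothetical.</p>
        <preformat>
```python
from collections import namedtuple

# Simplified stand-in for an odrl:Permission:
#   assignee - name of an odrl:PartyCollection (group of agents)
#   target   - dataset, distribution or data distribution service
#   action   - e.g. "read", "modify", "delete"
Permission = namedtuple("Permission", ["assignee", "target", "action"])

# Hypothetical group memberships and permissions.
GROUP_MEMBERS = {"group:perception-team": {"agent:alice", "agent:bob"}}
PERMISSIONS = [
    Permission("group:perception-team", "ds:frames", "read"),
    Permission("group:perception-team", "ds:frames", "modify"),
]


def is_allowed(agent, action, target):
    """True if a permission grants the action on the target to a group of the agent."""
    return any(
        p.target == target
        and p.action == action
        and agent in GROUP_MEMBERS.get(p.assignee, set())
        for p in PERMISSIONS
    )
```
</preformat>
        <p>Access is denied by default in this sketch: an agent outside the group, or an action without a matching permission, yields False.</p>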
        <p>Fig. 5. The main classes and properties of the DCPAC ontology (TBox).</p>
        <p>SKOS Entities. Finally, SKOS is applied for defining the semantics of the content of a
dataset. To this end, the catalog refers to one or more sets of SKOS concepts
(skos:ConceptScheme) that can be used for semantically tagging datasets. The
defined SKOS tags can be either directly linked to a dataset (dcat:theme), or they
can be bundled and linked as a collection (skos:Collection), which enables the
definition and reuse of (large) sets of SKOS tags. For the Bosch Automotive Data
Lake we use the ASO ontology, as described in Section 3.1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Semantic Data Lake Catalog</title>
      <p>
        At Bosch, we have built an Automotive Data Lake as a centralized platform for the
engineering and testing of our autonomous driving applications [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To handle and
manage the complexity and enormous volume of data from all our test drives, we
have developed a holistic architecture, which is shown in Fig. 6 and explained in this
section. Resources collected and stored in the Bosch Automotive Data Lake include a
heterogeneous assortment of documents, datasets and databases. We have created a
semantic layer, the Semantic Data Lake Catalog, which provides meaningful
semantic description of resources in the data lake and enables semantic search. The
Semantic Data Lake Catalog comprises a knowledge graph that is built with the
vocabulary defined in the DCPAC ontology (see Section 3). The semantic description of
the resources includes information about their content, provenance and access control
permissions. The ability to perform semantic search of all data in the data lake
provides enhanced findability, access, interoperability, and re-use.
      </p>
      <p>Fig. 6. Data lake architecture and role of Semantic Data Lake Catalog. [Figure labels: Stream Data; External Databases; External Files; Data Ingestion Process; Data Processing Tasks; Data Lake Catalog Population Service; Applications; AI, Business Applications; Data Analytics Applications; Data Lake Catalog Search; Data Lake Catalog User Interface; Data Access Engine; Data Stores; Model Store; Semantic Data Lake Catalog; Scene KG; Data Lake]</p>
      <p>In the sub-sections below, we explain the other components of the Automotive Data
Lake shown in Fig. 6, and clarify the process by which the Semantic Data Lake
Catalog KG is populated and how it is used to query, find and access data assets.</p>
      <p>Data Ingestion Process. As illustrated in Fig. 6, external data from different sources
(test fleet vehicles, test benches, data warehouse, etc.) are ingested into the data lake,
either continuously in streams or driven by users and applications. The ingestion
process is responsible for extracting, transforming and loading new data assets into the
data stores. The implementation of this ingestion process was containerized using our
in-house DevOps tool in order to allow scaling-out based on the load. It is important
to note that this tool does not only provide a mechanism to deploy the ingestion
processes on our on-premises infrastructure, it also wraps the ingestion with a list of
standard operators that are automatically called to report the process information as
well as input &amp; output data to a Kafka<sup>5</sup> cluster. These reports, published as
standardized Kafka messages, are consumed by the Data Lake Catalog Population Service.</p>
      <sec id="sec-5-1">
        <title>Data Lake Catalog Population Service</title>
        <p>Triggered by Kafka messages, the Data Lake Catalog Population Service reads the available metadata on the ingested data
assets and constructs the relevant semantic data as input for our Semantic Data Lake
Catalog. The Data Lake Catalog Population Service aligns, annotates and enriches the
input data from the Data Ingestion Process with DCPAC concepts before populating
the Semantic Data Lake Catalog<sup>6</sup>. Besides dictionary based mappings (i.e. input data
schema terms or tags are mapped to dedicated SKOS concepts of our Semantic Data
Lake Catalog taxonomies), the population service also links signal name strings to
relevant automotive signal concepts from the Automotive Signal Ontology (based on
Levenshtein distance). This is a critical part of the knowledge construction process, as
it enables us to search, integrate and process the various data assets based on a shared
conceptualization. The Data Lake Catalog Population Service will also record
relevant provenance information; e.g. the activity that has created or modified a data
asset, including information about the source asset as well as begin and end time.
Data Access Engine. The Data Access Engine (DAE) provides applications with a
uniform query interface and access to resources (e.g. files, tables, knowledge graphs)
in the data lake based on a common HTTP-based API and endpoint. At this stage, the
DAE supports data-type queries (i.e. in HBase<sup>7</sup> or Hive<sup>8</sup>/Impala<sup>9</sup> tables), knowledge
queries (i.e. SPARQL queries of knowledge graphs) and task requests (i.e. Oozie<sup>10</sup>
jobs in the Hadoop<sup>11</sup> cluster). The DAE secures and hides the details of the underlying
storage system and enables transparent re-direction of requests based on stable and
global identifiers. The DAE uses the Semantic Data Lake Catalog to control access to
individual data assets based on common access operations (e.g. read, modify, delete).
<fn><label>5</label><p>Apache Kafka: A distributed streaming platform, https://kafka.apache.org/</p></fn>
<fn><label>6</label><p>We use Stardog for storing and processing the semantic layer as a knowledge graph.</p></fn>
<fn><label>7</label><p>Apache HBase: Distributed big data store that runs on Hadoop, https://hbase.apache.org/</p></fn>
<fn><label>8</label><p>Apache Hive: Data warehouse software for large distributed datasets, https://hive.apache.org/</p></fn>
<fn><label>9</label><p>Apache Impala: Native analytic database for Apache Hadoop, https://impala.apache.org/</p></fn>
<fn><label>10</label><p>Apache Oozie: Workflow scheduler for Hadoop, https://oozie.apache.org/</p></fn>
<fn><label>11</label><p>Apache Hadoop: Scalable, distributed computing software, https://hadoop.apache.org/</p></fn>
In particular, the Semantic Data Lake Catalog provides a list of authorized user
groups for a given dataset according to its content type, security class and assigned
project. For authorization, the DAE supports Kerberos<sup>12</sup>, for easy integration with
enterprise IT systems, as well as JSON Web Token<sup>13</sup> based authorization. Besides
access control, the DAE also supports template-based requests for the different types
of resources. The DAE fetches the respective template from the Semantic Data Lake
Catalog and fills the template with parameters provided in the request. Once the
template is complete, the DAE fetches/queries the respective resources in the data lake
and returns the results to the client. This template-based request feature has proven
especially useful in the case of knowledge graph queries, as it enables frontend
developers without knowledge of RDF and SPARQL to query relevant data for their
applications.</p>
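        <p>The template-based request mechanism described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the template text, the <monospace>ex:</monospace> namespace, the property <monospace>ex:containsSignal</monospace> and the function <monospace>fill_template</monospace> are hypothetical and are not taken from the DCPAC ontology or the DAE implementation.</p>
        <preformat><![CDATA[
```python
from string import Template
from typing import Dict

# Hypothetical stored template; the prefix IRI and the "signal" parameter are
# made up for illustration and are not taken from the DCPAC ontology.
DATASETS_BY_SIGNAL = Template(
    "PREFIX ex: <http://example.org/catalog#>\n"
    "SELECT ?dataset WHERE { ?dataset ex:containsSignal ex:$signal }"
)


def fill_template(template: Template, params: Dict[str, str]) -> str:
    """Fill a request template with client-supplied parameters.

    Values are restricted to identifier-like strings so that a parameter
    cannot inject additional query syntax. A missing parameter raises KeyError.
    """
    for name, value in params.items():
        # Allow only letters, digits and underscores in parameter values.
        if not value.replace("_", "").isalnum():
            raise ValueError(f"unsafe value for parameter {name!r}: {value!r}")
    return template.substitute(params)
```
]]></preformat>
        <p>A client would then send only the template name and a parameter such as <monospace>{"signal": "VehicleSpeed"}</monospace>; the completed SPARQL text never has to be written by the frontend developer.</p>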
        <p>Besides using templates, administrators and authorized data engineers are also able
to query the data lake and Semantic Data Lake Catalog with native SPARQL queries
via the DAE. This enables privileged users to carry out complex semantic search and
analytics on the Semantic Data Lake Catalog (e.g. find all data assets that have been
derived from a given asset, or find all datasets that contain signals associated with a
particular sub-system of a car), or perform advanced data management operations
(e.g. delete all data assets computed by a given provenance activity or task). The DAE
also provides a means for data management agents to query and update the Semantic
Data Lake Catalog programmatically. This is the basis for implementing sophisticated
data lake management agents that can leverage the full semantic query capabilities of
our Semantic Data Lake Catalog.</p>
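        <p>As an illustration of the first example query above (finding all data assets derived from a given asset), the following Python sketch shows a possible SPARQL formulation and an equivalent traversal over a toy in-memory provenance graph. The property <monospace>prov:wasDerivedFrom</monospace> is part of PROV-O; the asset IRI, the asset names and the function <monospace>derived_assets</monospace> are hypothetical, and the actual catalog queries may be modelled differently.</p>
        <preformat><![CDATA[
```python
from collections import deque
from typing import Dict, List, Set

# The same question in SPARQL, using the PROV-O property prov:wasDerivedFrom
# with a transitive property path. The asset IRI is a placeholder.
DERIVED_ASSETS_QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?derived WHERE {
  ?derived prov:wasDerivedFrom+ <http://example.org/asset/raw-1> .
}
"""


def derived_assets(derived_from: Dict[str, List[str]], source: str) -> Set[str]:
    """Return every asset derived, directly or indirectly, from `source`.

    `derived_from` maps an asset to the assets it was derived from, i.e. it
    holds the wasDerivedFrom edges of a toy provenance graph.
    """
    # Invert the edges once so we can walk from a source to its derivatives.
    children: Dict[str, List[str]] = {}
    for asset, parents in derived_from.items():
        for parent in parents:
            children.setdefault(parent, []).append(asset)

    # Breadth-first traversal over the inverted edges.
    found: Set[str] = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found
```
]]></preformat>
        <p>The same traversal is what a management agent would rely on when, for instance, all assets downstream of a corrected source asset need to be reprocessed.</p>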
        <p>Data Processing Tasks. Similar to the Data Ingestion Process, our run-time
environment also supports containerized data processing tasks, such as data enrichment,
knowledge construction, and data analytics. These tasks use the Semantic Data Lake
Catalog to find data resources and the DAE to access data or knowledge in the data
lake. Such processing tasks typically create new data resources (e.g. knowledge
graphs or tables) or curate existing ones to persist the results. Whenever new data
resources are created or further metadata about resources is discovered, the Data Lake
Catalog Population Service is triggered via corresponding Kafka messages. Consequently,
the Semantic Data Lake Catalog is automatically updated and the relevant provenance
information is recorded.</p>
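        <p>One knowledge construction step performed when the Data Lake Catalog Population Service is triggered is the linking of signal name strings to signal concepts based on Levenshtein distance, as described earlier. A minimal Python sketch of such a linker is given below. The normalisation (lower-casing, dropping underscores), the distance threshold and the concept identifiers are assumptions for illustration; the service's actual matching rules are not specified here.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def link_signal(name: str, concepts: Dict[str, str],
                max_distance: int = 3) -> Optional[str]:
    """Return the identifier of the closest concept label, or None.

    `concepts` maps a label to a concept identifier. Comparison is
    case-insensitive and ignores underscores; both the normalisation and the
    threshold are assumptions of this sketch.
    """
    def norm(s: str) -> str:
        return s.replace("_", "").lower()

    best_label = min(concepts,
                     key=lambda label: levenshtein(norm(name), norm(label)),
                     default=None)
    if best_label is None:
        return None
    if levenshtein(norm(name), norm(best_label)) > max_distance:
        return None
    return concepts[best_label]
```
]]></preformat>
        <p>Rejecting matches above a threshold keeps unrelated signal names unlinked rather than attaching them to the nearest, possibly wrong, concept.</p>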
        <p>Data Lake Catalog User Interface. For data lake administrators and data engineers
to manage and search data resources available in the data lake, we developed a Web
application that allows the search and selection of resources based on the available
metadata (e.g., type of resource, content type, signals, recording date, associated
SKOS tags). Selected data resources can then be batch processed, e.g. curated with
relevant SKOS tags or keywords, assigned new ownership or permissions, or deleted from
the data lake. Fig. 7 shows a screenshot of our Data Lake Catalog Web
application.
<fn><label>12</label><p>Kerberos: The network authentication protocol, http://web.mit.edu/kerberos/</p></fn>
<fn><label>13</label><p>JSON Web Token, https://tools.ietf.org/html/rfc7519</p></fn></p>
        <p>Our Semantic Data Lake Catalog has been in use since the end of 2019. The Data
Lake Catalog Population Service is now continuously populating the Semantic Data
Lake Catalog as new data resources are ingested into the data lake. Since the data lake
was already in use prior to the implementation of the automated population process
described above, we are still batch-processing data ingested in the past based on
various metadata sources.</p>
        <p>Table 1 shows the number of data resources registered in the Semantic Data Lake
Catalog, the number of facts in the knowledge graph, the data volume of the
registered resources as well as the number of registered users.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper presents the design and implementation of a semantic layer for data lakes
in general and reports on its realization and use for managing data in the Bosch
Automotive Data Lake. In particular, we demonstrate how we use our DCPAC ontology
(Data Catalog, Provenance, Access Control) for managing data resources in a
comprehensive manner, enabling findability, accessibility, interoperability, and re-use –
the four principles of FAIR data (Section 3). We report on the concrete application of
the DCPAC ontology in conjunction with our Automotive Signal Ontology (ASO) in
the implementation of the Semantic Data Lake Catalog at Bosch (Section 4). At the
core of our Semantic Data Lake Catalog is a knowledge graph that includes instances
for all data resources in our data lake (~1.8M), comprising the available metadata
(id, name, date created, date modified, size, format, etc.). The knowledge graph also
stores references to the associated automotive signals (ASO), which are defined as
SKOS tags, defines access control permissions, and documents provenance
information such as activities and associated agents. This comprehensive knowledge graph
offers our data scientists and engineers sophisticated semantic search and data
management capabilities, by combining typical metadata search with semantic search
based on content-related (SKOS tags and formal semantics of the ASO) as well as
provenance-related (entities, activities, agents) information.</p>
      <p>Several important lessons learnt from the design and usage of the system so far
include: (1) The DCPAC ontology in conjunction with ASO has proven sufficiently
expressive to find and manage the automotive data resources in our Semantic Data
Lake Catalog; (2) The population of the Data Lake Catalog must be fully automatic
and triggered by the data ingestion process; (3) Semantic search and management of
data resources based on SKOS tags of automotive signals and their conceptualizations
are critical, as equivalent signals occur with different names across the enterprise; (4)
Provenance-related information is critical to manage data and to enable traceability and
automatic reprocessing/updating of derived data; (5) Access control can be seamlessly
integrated into the KG.</p>
      <p>One of the limitations of our current system is that the ASO covers only a small
fraction of the overall signals used in our data. This is because the ASO has been
manually populated so far. As future work, we target a processing pipeline that
populates the ASO with missing signals in a semi-automatic manner by involving domain
experts as needed.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Edjlali</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duncan</surname>
            ,
            <given-names>A. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Simoni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaidi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Data Catalogs Are the New Black in Data Management and Analytics</article-title>
          .
          <source>Gartner Research</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kasrin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qureshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steuer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicklas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Semantic Data Management for Experimental Manufacturing Technologies</article-title>
          .
          <source>Datenbank-Spektrum</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>27</fpage>
          -
          <lpage>37</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Benayas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amado</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iglesias</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          :
          <article-title>A semantic data lake framework for autonomous fault management in SDN environments</article-title>
          .
          <source>Transactions on Emerging Telecommunications Technologies, Special Issue on Machine Learning/AI for IoT, M2M, and Computer Communication</source>
          <volume>30</volume>
          (
          <issue>9</issue>
          ) (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mami</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Uniform Access to Multiform Data Lakes using Semantic Technologies</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on Information Integration and Web-based Applications &amp; Services (iiWAS2019)</source>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>322</lpage>
          , Munich, Germany (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Albertoni</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Browning</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltran</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perego</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winstanley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Data Catalog Vocabulary (DCAT) - Version 2</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2020</year>
          ), https://www.w3.org/TR/vocab-dcat-2/, last accessed
          <year>2020</year>
          /03/26.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Miles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bechhofer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SKOS Simple Knowledge Organization System Reference</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2009</year>
          ), https://www.w3.org/TR/skos-reference/, last accessed
          <year>2020</year>
          /03/31.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Dublin Core Metadata Initiative Board:
          <article-title>DCMI Metadata Terms</article-title>
          .
          <source>DCMI Recommendation</source>
          (
          <year>2020</year>
          ), https://www.dublincore.org/specifications/dublin-core/dcmi-terms/, last accessed
          <year>2020</year>
          /03/26.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lebo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>PROV-O: The PROV ontology</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2013</year>
          ), http://www.w3.org/TR/prov-o/, last accessed
          <year>2020</year>
          /03/12.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Iannella</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <source>ODRL Version 2.2 Ontology. W3C</source>
          (
          <year>2017</year>
          ), https://www.w3.org/ns/odrl/2/, last accessed
          <year>2020</year>
          /03/26.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. W3C Dataset Exchange Working Group (DXWG):
          <article-title>DCAT-PROV alignment ontology</article-title>
          .
          <source>W3C</source>
          (
          <year>2019</year>
          ), https://github.com/w3c/dxwg/blob/gh-pages/dcat/rdf/dcat-prov.ttl, last accessed
          <year>2020</year>
          /03/26.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Using Knowledge Graphs to Search an Enterprise Data Lake</article-title>
          . In: Hitzler, P. et al. (eds.)
          <source>The Semantic Web: ESWC 2019 Satellite Events. ESWC 2019. Lecture Notes in Computer Science</source>
          , vol.
          <volume>11762</volume>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>266</lpage>
          . Springer, Cham (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Haller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>The modular SSN ontology: A joint W3C and OGC standard specifying the semantics of sensors, observations, sampling, and actuation</article-title>
          .
          <source>Semantic Web</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hodgson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keller</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodges</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spivak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>QUDT: quantities, units, dimensions and data types ontologies</article-title>
          .
          <source>QUDT.org</source>
          (
          <year>2014</year>
          ), http://qudt.org/, last accessed
          <year>2020</year>
          /03/31.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Klotz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilms</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>VSSo: A Vehicle Signal and Attribute Ontology</article-title>
          . In: Lefrançois, M., García-Castro, R., Gyrard, A., Taylor, K. (eds.)
          <source>Proceedings of the 9th International Semantic Sensor Networks Workshop co-located with 17th International Semantic Web Conference (SSN@ISWC)</source>
          ,
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2213</volume>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. GENIVI Alliance:
          <article-title>Vehicle Signal Specification</article-title>
          . (
          <year>2017</year>
          ), https://github.com/GENIVI/vehicle_signal_specification, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Brickley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <source>FOAF Vocabulary Specification 0.99</source>
          (
          <year>2014</year>
          ), http://xmlns.com/foaf/spec/, last accessed
          <year>2020</year>
          /03/26.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Knublauch</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Shapes Constraint Language (SHACL)</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>2017</year>
          ), https://www.w3.org/TR/shacl/, last accessed
          <year>2020</year>
          /03/26.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>