<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Storage Management in Smart Data Lake</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haoqiong Bian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bikash Chandra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Mytilinis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Ailamaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EPFL</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data lakes are complex ecosystems where heterogeneity prevails. Raw data of diverse formats are stored and processed, while long and expensive ETL processes are avoided. Apart from data heterogeneity, data lakes also entail hardware heterogeneity. Typical installations involve distributed infrastructures, where each node is possibly equipped with hardware of different characteristics. Especially for the case of storage, the various devices a node possesses can be organized in a hierarchy that defines a spectrum of performance-capacity-cost configurations. Given the various configurations and the volatile workload landscape, taking optimal placement decisions is a cumbersome task. In this work, we propose a storage management solution for the Smart Data Lake [12] platform. The proposed system takes advantage of the available storage devices, while it abstracts away data/hardware characteristics and provides a unified interface for data accesses. This way, performance is improved while tiering complexity is hidden from the application layer.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Data warehouse systems such as Teradata [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Oracle [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
IBM DB2 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been extensively used for processing analytical
queries at scale. Such systems first collect data from various
sources (e.g., sensors, logs, etc.) and then apply an expensive ETL
process to make it suitable for ingestion. However, ETL increases
the time from query to answer and results in unnecessary work when
queries target only a small fraction of the data set.
      </p>
      <p>In contrast to warehouses, data lakes are raw data ecosystems
that manage data from multiple sources and process it in-situ,
avoiding the long and expensive ETL. Each data set preserves its
own format and execution is optimized to deal with heterogeneity.
Thus, there may be, for example, queries that combine CSV and
JSON data without having to transform and persist them first into
a common representation form. This allows for faster response
times and easier integration of new sources.</p>
      <p>
        The volume of data and the inherently distributed nature of
a data lake require a decentralized architecture that involves
multiple compute and storage nodes. Conceptually, data is stored
in a shared pool, that is accessed over the network and is exposed
to the various applications. Large cloud vendors, such as Amazon
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Microsoft [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] follow this approach and offer data access
on top of S3 or HDFS respectively.
      </p>
      <p>Nevertheless, network access introduces additional latency
and misses optimization opportunities in cases where there is
temporal or spatial locality in the data access pattern. Thus, the
first challenge we need to deal with is to avoid unnecessary data
copies while keeping the flexibility of a decoupled architecture.
Moreover, we would like to cache intermediate results without
having to tune every individual engine involved in the data lake.</p>
      <p>The two arguments above call for more sophisticated
storage management in data lakes. We envision a system
that tracks user workloads and automatically identifies caching
opportunities independently of the source of origin of each data
set. Storage management should not be tightly coupled to a
specific engine but equally serve them all. However, data placement
has not only performance-related implications. As main
memory is still a more expensive and scarce resource than hard disk
drives, placement also affects available capacity and monetary
cost. With the hardware advancements in storage technologies,
the placement optimization problem becomes even more
difficult; the choice is no longer binary (memory or disk), but
several tiers are involved (e.g., SSD, NVM), each with its own
performance-capacity-cost offering. A properly designed storage
manager should be aware of the trade-offs the various tiers offer
and transparently move data across them based on a given policy.</p>
      <p>
        Apart from being independent from the underlying data stores,
storage management should be also decoupled from the
application layer of the data lake. Existing storage systems expose
tiering information and decision making to the user. For example,
HDFS supports tiering by using the Archival Storage [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and lets
the user select the right tier. We argue that in a data lake, where
a multitude of different users and query engines coexist,
permitting each of them to define its own policy would result in an
overly complicated design where conflicting decisions would undermine
opportunities for data-sharing and multi-engine optimization.
      </p>
      <p>
        This work addresses the aforementioned problems and presents
the preliminary design of a storage manager specifically tailored
for data lakes. The proposed system is being developed as part of
the Smart Data Lake (SDL) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] project. SDL is a scalable, elastic
and hardware-conscious data lake that supports analytical tasks.
      </p>
      <p>The proposed storage manager is a self-tuning and elastic
component that enables efficient data placement and access. We
can think of it as a middleware between storage and compute
nodes. It supports direct access to raw data but it can also create
on-demand intermediate representations of various formats (e.g.,
CSV, columnar binary, etc.) and persist them in the appropriate
tier. SDL considers several different storage alternatives, each
featuring a different set of characteristics. Our storage hierarchy
includes remote cloud storage, local spinning disks, SSDs, NVMs
and DRAM, which, depending on the use-case, can be either shared
between different processes or private.</p>
      <p>In addition, the proposed architecture offloads the tiering
policy from the application layer to the storage manager. This way,
we enable shared optimizations and relieve the user from the
burden of fine-tuning storage configurations. To simplify the
interaction with both storage and compute nodes, the storage
manager exposes an object store-like API. Data is seamlessly
exchanged on both sides (storage/compute nodes) through simple
primitives, while all the complexity of the different data formats
and locations is abstracted away.</p>
      <p>The remainder of the paper is organized as follows: Section 2
gives an overview of the SDL architecture, Section 3 discusses the
design decisions for the storage manager and Section 4 presents
a preliminary evaluation. Finally, Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 SDL ARCHITECTURE</title>
      <p>SDL features a data lake system that supports the full stack of
operations required in typical analytics tasks. Namely, it provides
efficient access methods for both on-premise and cloud storage,
in-situ processing of diverse data, a rich set of data mining
algorithms and a visualization layer that enables comprehensive data
exploration. More specifically, as Figure 1 shows, SDL comprises
three main components: SDL-Virt, SDL-HIN and SDL-Vis.</p>
      <p>
        SDL-Virt lies at the core of the data lake. It provides all the
necessary mechanisms for virtualizing data and processing it
over heterogeneous hardware. Since data access always goes
through SDL-Virt, it acts as a storage and query interface for
the other components, abstracting away the complexity of
the various data formats and locations. For managing the
ever-increasing data volume, a distributed architecture is used, while
the employed resources are elastically allocated to meet demand.
SDL-HIN is an extensible suite of algorithms for the scalable
analysis and mining of heterogeneous information networks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
It is offered as a library and runs on top of SDL-Virt. SDL-Vis
comprises a set of tools for visually exploring unknown data sets
and interpreting the results returned by SDL-HIN and SDL-Virt.
      </p>
      <p>
        Since the SDL data lake engine is empowered by SDL-Virt, we
delve into its internals and give further details about the design
and the architectural choices. SDL-Virt features a distributed
architecture that consists of multiple nodes that collaboratively
process large data sets in a data-parallel manner. Each node runs
two collocated instances of a storage and a compute node: (i)
RAW [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and (ii) Proteus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] respectively.
      </p>
      <p>RAW is a commercial system able to access and process data
sets of various formats. No matter whether data resides in
external infrastructures (e.g., Amazon S3, Dropbox, etc.) or in an
in-house database, RAW can seamlessly access and load it through
a proprietary query language.</p>
      <p>Although RAW also provides basic processing capabilities
itself, it does not take advantage of modern hardware. For the
efficient execution of analytical operators, SDL-Virt employs
Proteus: our just-in-time (JIT) compiled engine for fast, in-memory
analytics. Performance in Proteus comes through customization
and hardware heterogeneity. By JIT-ing code, we customize data
access for the query/data set at-hand, and by using hybrid
execution plans that involve both CPUs and GPUs, we increase
parallelism and reduce end-to-end execution time.</p>
      <p>
        To coordinate the execution of the RAW-Proteus nodes,
SDL-Virt includes a Query Planner and a Resource Manager. As in
any database system, the Query Planner parses SQL queries and
generates first a logical and then a physical plan of execution. In
order to do so, it takes into account hardware availability, data
location and data set characteristics. Similar to YARN [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or
Mesos [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], our Resource Manager is responsible for allocating
resources (e.g., CPU cores, GPUs, memory) upon query execution.
      </p>
      <p>The description above implies that for every submitted query,
RAW accesses a remote data set and retrieves it over the network.
Such a process would severely harm performance, as it does not
take into account the temporal and spatial locality of data access
patterns. To deal with this issue, our design also includes a
Storage Manager. Storage Manager detects recurring patterns in data
accesses and caches data accordingly. Moreover, the plethora
of available storage devices (e.g., HDD, SSD, NVM, DRAM)
offers the opportunity to explore the performance-capacity-cost
continuum at a finer granularity. Each node of the
consortium organizes its devices in a hierarchy, and data sets are moved
across different storage tiers in order to maximize throughput.</p>
      <p>Although storage management plays a key role in the
performance of data lakes, it has not been extensively investigated yet.
Existing solutions either ignore the underlying storage hierarchy
or expose it to the applications and put the burden of tiering to
the user. To remedy this, we opt for a transparent solution where
data/hardware complexity is hidden from the user.
</p>
    </sec>
    <sec id="sec-3">
      <title>3 STORAGE MANAGEMENT</title>
      <p>Storage Manager acts as a middleware between the storage and
compute layers of a data lake and can serve read/write requests
from both sides. In the SDL architecture, we install a storage
manager instance in every node along with the Proteus and
RAW processes. This way, we favor locality and reduce network
traffic. Although in this work we focus on Proteus and RAW,
Storage Manager is not coupled to them and can work with any
storage/compute engine that implements its interface.</p>
      <p>Upon query execution, a compute node contacts its local
Storage Manager instance to get information on data location: if data
is locally available, the node does not need to contact RAW and can
directly access the data from the corresponding tier. As storage devices,
we consider both local (e.g., SHM, SSD, HDD) and remote (e.g.,
HDFS, Amazon S3) resources.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Architecture</title>
      <p>Figure 2 illustrates an overview of the architecture of a single
Storage Manager instance. It comprises (i) an interface that
exposes basic data access primitives, (ii) a lock-free message queue
for the communication with Proteus and RAW, (iii) a local catalog
and (iv) a tiering policy. Yellow arrows reflect data flow, while
blue ones reflect metadata flow and the interaction between the various
components. Next, we elaborate on each of these separately.</p>
      <p>Object store interface. For accessing data, we expose object
store-like reader and writer interfaces. Object stores provide
generic get/put primitives that are compatible with a wide range
of data formats and query engines. This way, we offer a unified
interface that minimizes the complexity between data access and
computation. Each dataset can be stored as a set of objects. The
semantics and size of an object are arbitrary and are defined and
managed by the corresponding query engines. In our architecture,
objects are grouped into segments and each segment is placed in
specific storage layers/tiers (we use the terms interchangeably).
For allocating segments and grouping objects into segments, we
have implemented special storage allocators, one for each tier.
(For simplicity, we use “query engines” to refer to both Proteus and RAW at once.)</p>
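      <p>To make the get/put abstraction concrete, the following is a minimal sketch of such an object store-style interface. The names (ObjectStore, put, get) and the flat in-memory map that stands in for per-tier segments are our own illustrative assumptions, not SDL's actual API:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative sketch of an object store-like interface; names are ours,
// and a flat in-memory map stands in for the per-tier segment allocators.
using ObjectId = std::string;

struct Object {
    std::vector<uint8_t> bytes;  // opaque payload; semantics and size are
                                 // defined by the query engine, not the store
};

class ObjectStore {
public:
    // put: in the real system, the tiering policy (not the caller) would
    // pick the target tier and a segment within it.
    void put(const ObjectId& id, Object obj) {
        store_[id] = std::move(obj);
    }
    // get: returns the object regardless of which tier holds it.
    const Object* get(const ObjectId& id) const {
        auto it = store_.find(id);
        return it == store_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<ObjectId, Object> store_;
};
```

      <p>Because the payload is opaque, any engine that serializes its objects to bytes can use the same two primitives, which is what keeps the interface engine-agnostic.</p>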
      <p>Lock-free queue. As Storage Manager physically lives in its
own process, the read/write requests for the various objects need
to be exchanged via an inter-process communication (IPC)
mechanism. To minimize latency, for IPC we use a queue allocated
in shared memory. However, even when using such a fast IPC
mechanism, the shared queue defines a critical section in the
communication between the query engines and the storage
manager. To reduce the synchronization cost, we design a lock-free
protocol based on GCC atomics. Moreover, while the requests
queue is shared, to further reduce contention, we use separate
response queues for each query engine.</p>
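      <p>As a rough illustration of such a queue, the sketch below shows a single-producer/single-consumer lock-free ring buffer built on C++ std::atomic. The paper's implementation uses GCC atomics and lives in shared memory; this stand-alone, process-local version omits the shared-memory allocation:</p>

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Minimal single-producer/single-consumer lock-free ring buffer, sketching
// the request queue described above. The real queue is allocated in shared
// memory; this simplified version is process-local.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
public:
    bool push(const T& v) {                     // called by the producer only
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;     // queue full
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {                    // called by the consumer only
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return std::nullopt;  // queue empty
        T v = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return v;
    }
private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};          // advanced by the producer
    std::atomic<std::size_t> tail_{0};          // advanced by the consumer
};
```

      <p>With one such queue per direction (a shared request queue plus a response queue per engine), producer and consumer never contend on the same index, which is the contention reduction described above.</p>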
      <p>
        Another issue we need to deal with stems from the
unpredictability of the arrival rate of I/O requests. Existing storage
systems follow two common approaches: polled I/O and
interrupt-driven I/O. The former has lower latency and higher throughput
when processing high-frequency requests, while the latter
consumes fewer CPU cycles in the case of low-frequency requests [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>To get the best of both worlds, we support adaptive switching
between the two I/O modes. Normally, Storage Manager works in
the interrupt-driven mode, i.e., it sleeps and waits for requests. We
implement interrupts with signals. After pushing a request to the
queue, the corresponding reader/writer sends a signal to notify
Storage Manager. After it has been notified, Storage Manager
works in polled mode: it loops and checks for new requests. This
happens for a configurable amount of time (e.g., 10) before
it returns to interrupt mode. Thus, we use polled mode
for high-frequency arrivals and interrupts for low-frequency.</p>
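      <p>The adaptive switch can be sketched as follows; a condition variable stands in for the signal-based notification, and all names (AdaptiveQueue, next, the polling window) are illustrative:</p>

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

// Sketch of the adaptive I/O loop described above: spin-check (polled mode)
// for a configurable window, then fall back to blocking (interrupt mode).
// The real system wakes up on POSIX signals; a condition variable stands in
// here, and all names are illustrative.
class AdaptiveQueue {
public:
    void push(int req) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(req);
        }
        cv_.notify_one();  // the "signal" that wakes Storage Manager
    }
    // Returns the next request: polls for `window`, then sleeps until pushed.
    int next(std::chrono::microseconds window) {
        auto deadline = std::chrono::steady_clock::now() + window;
        while (std::chrono::steady_clock::now() < deadline) {  // polled mode
            if (auto r = try_pop()) return *r;
        }
        std::unique_lock<std::mutex> lk(m_);                   // interrupt mode
        cv_.wait(lk, [&] { return !q_.empty(); });
        int v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::optional<int> try_pop() {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return std::nullopt;
        int v = q_.front();
        q_.pop();
        return v;
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};
```

      <p>Requests that arrive while the loop is still polling are served with spin-check latency; only after the window expires does the consumer pay the cost of a blocking wakeup.</p>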
      <p>Local catalog. For managing data placement, Storage
Manager uses a local catalog. The catalog maps each object identifier
to a unified logical address space that spans all available tiers
and facilitates indexing. Each object is assigned a logical address
that consists of three parts: a storage layer id (8 bits), a segment
id (24 bits), and a relative offset (32 bits) within the segment. For
indexing layers, we use a bit per tier and hence, we currently
support up to 8 tiers. A segment id can be a pointer in case of
memory-based tiers or a file descriptor for disk-based ones. The
translation process between logical and physical addresses is
depicted in Figure 3 and is based on a two-level lookup table.
In the first step, we search by layer id in order to identify the
segments that exist in the specific layer. Then, in a second step
we look up the segment id. The result of this search is the physical
address of the desired segment. By adding the relative offset, we
reconstruct the full physical address of the requested object.</p>
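      <p>Under the stated layout (8-bit layer id, 24-bit segment id, 32-bit offset), a logical address fits in a single 64-bit word. A sketch of the packing arithmetic, with field and function names of our own choosing:</p>

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the 64-bit logical address described above:
// [ 8-bit layer id | 24-bit segment id | 32-bit in-segment offset ].
// The layer field uses one bit per tier, hence up to 8 tiers.
struct LogicalAddr {
    uint8_t  layer;    // one-hot bit identifying the storage tier
    uint32_t segment;  // only the low 24 bits are used
    uint32_t offset;   // byte offset within the segment
};

inline uint64_t pack(LogicalAddr a) {
    return (uint64_t(a.layer) << 56)
         | (uint64_t(a.segment & 0xFFFFFF) << 32)
         |  uint64_t(a.offset);
}

inline LogicalAddr unpack(uint64_t v) {
    return {uint8_t(v >> 56),
            uint32_t((v >> 32) & 0xFFFFFF),
            uint32_t(v & 0xFFFFFFFF)};
}
```

      <p>Translation then resolves the (layer, segment) pair to a physical base through the two-level lookup table, a pointer for memory tiers or a file descriptor for disk tiers, and adds the offset.</p>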
      <p>While it is the catalog that maintains logical addresses, the
translation takes place within the query engines and not the
Storage Manager. Each query engine has all the layer and segment
ids cached. For every I/O request, Storage Manager responds with
the logical address, and the engine translates it to a physical one
locally using the cached mappings.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Read/Write Workflow</title>
      <p>Having described all the components of Storage Manager’s
architecture, we now walk through the write and read access
paths to show how these components work together.</p>
      <p>Write. Let us assume that a query/storage engine (e.g., RAW)
sends a write request to the storage manager for persisting an
object with a unique identifier. The request is sent through the
lock-free shared queue. Storage Manager receives the request
and, based on the tiering policy, decides on the tier in which the
object should be placed (e.g., SSD). Then, the manager contacts the
storage allocator of the specific tier in order to get a segment id
and the position of the new object within the chosen segment.
Having a tier identifier, a segment id and an offset within the
segment, we can form a logical address, and Storage Manager
persists the mapping from object identifier to logical address in its
catalog. Finally, the manager returns the logical address to the engine that issued the request.</p>
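      <p>A condensed sketch of this write path follows; the policy and allocator are stubbed out (always one tier, bump allocation in a single segment), and every name is illustrative rather than SDL's API:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Condensed sketch of the write path: choose a tier, obtain a slot from that
// tier's allocator, record the id -> logical-address mapping in the catalog,
// and return the address. Policy and allocator are stubs.
struct Placement {
    uint8_t  layer;
    uint32_t segment;
    uint32_t offset;
};

class StorageManager {
public:
    Placement write(const std::string& object_id, uint32_t size) {
        uint8_t tier = choose_tier();          // tiering policy decision
        Placement p{tier, current_segment_, next_offset_};
        next_offset_ += size;                  // per-tier allocator (stub)
        catalog_[object_id] = p;               // persist mapping in catalog
        return p;                              // logical address for the engine
    }
    const Placement* lookup(const std::string& object_id) const {
        auto it = catalog_.find(object_id);
        return it == catalog_.end() ? nullptr : &it->second;
    }
private:
    uint8_t choose_tier() const { return 1; }  // stub: e.g., always SSD
    uint32_t current_segment_ = 0;
    uint32_t next_offset_ = 0;
    std::unordered_map<std::string, Placement> catalog_;
};
```

      <p>The read path described next is then a pure catalog lookup: the stored address is returned to the engine, which accesses the tier directly.</p>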
      <p>Read. Upon a read request, Storage Manager checks the
catalog, retrieves the logical address for the specific object id,
and returns it to the query engine, which then reads the object
directly from the corresponding tier without waiting for data to be
copied across tiers. Hardware-conscious systems (e.g., Proteus) need
to prefetch data into a privately managed CPU/GPU memory in
order to unleash their full potential. To enable this behavior, our
Storage Manager provides two additional operations: load and
unload. The first one prefetches data, while the second one evicts
objects from the private memories, which are not managed by
Storage Manager. Thus, these APIs do not conflict with our overall
goal of storage virtualization.</p>
    </sec>
    <sec id="sec-6">
      <title>4 EXPERIMENTAL EVALUATION</title>
      <p>We perform a set of microbenchmarks to evaluate our design
decisions for serving I/O requests and accessing data. More
specifically, we assess the proposed IPC mechanism (lock-free queue)
and the importance of zero-copy data access. As this
is a primitive version of our system, and tiering policies are not
yet in place, we leave end-to-end evaluation for future work.</p>
      <p>The experimental setup is a 2-socket server with Intel Xeon
E5-2650L v3 CPU 1.80GHz, 24 threads/socket and 256GB DRAM.
In these experiments, all data objects are cached in memory.</p>
      <p>
        Lock-free shared queue. We compare our lock-free message
queue with other IPC methods in terms of throughput by
sending 10 million 8-byte packets. The results are shown in
Table 1. In the case of IPC between two processes, we achieve 6.6×
the throughput of a System V mq [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We also assess the impact of adaptive I/O when processing
queue requests. We send read requests to the queue as fast
as possible and measure response latency and CPU utilization.
When in polled I/O mode, the response latency is 6.4 µs and
CPU usage is 100%; we need an entire hardware thread just to
poll the queue. When using interrupts, CPU utilization drops
near 0%, but latency grows 5.7× (36.5 µs). By switching
between the two modes, we ensure both low latency and resource
efficiency.</p>
      <p>Zero-copy. IPC mechanisms involve from zero to multiple
data copies. Techniques that buffer data in kernel space (e.g.,
pipes, sockets) usually need two copies, while shared memory-based
approaches require one copy or no copies at all. Reading data
from Storage Manager follows the zero-copy approach. To
quantify the potential gain in data access latency, we conduct the
following experiment: a process reads 10 of data from Storage
Manager and we measure the elapsed time between sending the
request and starting to consume data at the reader’s side. To
simulate the various IPC methods, we artificially add extra rounds
of copying on top of our zero-copy mechanism. Moreover, we
compare against grpc, a widely used protocol based on HTTP and
protobufs. The results are shown in Table 2 (average latency:
0-copy 6.6 µs; 1 copy 1996.8 µs; 2 copies 3478.0 µs; grpc 20790.2 µs).
We observe that even a single copy increases access latency by
3 orders of magnitude.</p>
    </sec>
    <sec id="sec-7">
      <title>5 CONCLUSION</title>
      <p>In this paper, we present a storage management solution
especially tailored for data lakes and build the proposed system
on top of the Smart Data Lake platform. Our design hides the
data/hardware complexity of a data lake and provides unified
and transparent access to the different tiers of the storage
hierarchy. Although all the mechanisms that enable the proposed idea
are in place, we still lack sophisticated tiering policies. Our
future plans focus on this aspect, i.e., algorithms that exploit workload
characteristics in order to make better use of the underlying storage.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work has been funded by the European Union’s Horizon
2020 research and innovation programme under the grant
agreement No 825041 (SmartDataLake).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Amazon</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Data Lake on AWS</article-title>
          .
          <source>Retrieved Dec20</source>
          ,
          <year>2020</year>
          from https://aws. amazon.com/solutions/implementations/data-lake-solution
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Periklis</given-names>
            <surname>Chrysogelos</surname>
          </string-name>
          , Manos Karpathiotakis, Raja Appuswamy, and
          <string-name>
            <given-names>Anastasia</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HetExchange: Encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines</article-title>
          .
          <source>Technical Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>The</given-names>
            <surname>Apache Software Foundation</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Archival Storage, SSD and Memory on HDFS</article-title>
          .
          <source>Retrieved Dec20</source>
          ,
          <year>2020</year>
          from http://hadoop.apache.org/docs/current/ hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Herodotos</given-names>
            <surname>Herodotou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elena</given-names>
            <surname>Kakoulli</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Automating Distributed Tiered Storage Management in Cluster Computing</article-title>
          .
          <source>In MSST.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>IBM.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>IBM Db2 Warehouse</article-title>
          .
          <source>Retrieved Dec20</source>
          ,
          <year>2020</year>
          from https://www. ibm.com/products/db2-warehouse
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Mesos</surname>
          </string-name>
          . [n.d.]. . http://mesos.apache.org/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Oracle</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Oracle Autonomous Data Warehouse</article-title>
          . https://www.oracle.com/ in/database/technologies/datawarehouse-bigdata.html
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Linux</given-names>
            <surname>Manual Page</surname>
          </string-name>
          .
          <year>2020</year>
          . MQ Overview. https://man7.org/linux/man-pages/ man7/mq_overview.7.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. L. Narasimha</given-names>
            <surname>Reddy</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>NVMFS: A hybrid file system for improving random write in nand-flash SSD</article-title>
          .
          <source>In MSST.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Raghu</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , Baskar Sridharan, John R Douceur, Pavan Kasturi, Balaji Krishnamachari-Sampath, Karthick Krishnamoorthy,
          <string-name>
            <given-names>Peng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mitica</given-names>
            <surname>Manu</surname>
          </string-name>
          , Spiro Michaylov,
          <string-name>
            <given-names>Rogério</given-names>
            <surname>Ramos</surname>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Azure data lake store: a hyperscale distributed file service for big data analytics</article-title>
          .
          <source>In Proceedings of the 2017 ACM International Conference on Management of Data</source>
          .
          <volume>51</volume>
          -
          <fpage>63</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>RAW-labs</article-title>
          . [n.d.]. . https://www.raw-labs.com/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>SmartDataLake</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Sustainable DataLakes for Extreme-Scale Analytics</article-title>
          .
          <source>Retrieved Dec20</source>
          ,
          <year>2020</year>
          from https://smartdatalake.eu
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Yizhou</given-names>
            <surname>Sun</surname>
          </string-name>
          and Jiawei Han.
          <year>2013</year>
          .
          <article-title>Mining heterogeneous information networks: a structural analysis approach</article-title>
          .
          <source>Acm Sigkdd Explorations Newsletter</source>
          <volume>14</volume>
          ,
          <issue>2</issue>
          (
          <year>2013</year>
          ),
          <fpage>20</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Teradata</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Teradata Integrated Data Warehouses</article-title>
          .
          <source>Retrieved Dec20</source>
          ,
          <year>2020</year>
          from https://www.teradata.com/Products/Software/ Integrated-Data-Warehouses
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Jisoo</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dave B.</given-names>
            <surname>Minturn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hady</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>When Poll is Better than Interrupt</article-title>
          .
          <source>In FAST.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Apache</surname>
            <given-names>YARN.</given-names>
          </string-name>
          [n.d.]. . https://hadoop.apache.org/docs/current/hadoop-yarn/ hadoop-yarn-site/YARN.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>