=Paper= {{Paper |id=Vol-3184/Text2KG_short1 |storemode=property |title=Enriching Data Lakes with Knowledge Graphs (short paper) |pdfUrl=https://ceur-ws.org/Vol-3184/TEXT2KG_Short_1.pdf |volume=Vol-3184 |authors=Alessandro Chessa, Gianni Fenu, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino, Luca Secchi |dblpUrl=https://dblp.org/rec/conf/esws/ChessaFMORSS22 }} ==Enriching Data Lakes with Knowledge Graphs (short paper)== https://ceur-ws.org/Vol-3184/TEXT2KG_Short_1.pdf
Enriching Data Lakes with Knowledge Graphs
Alessandro Chessa¹,², Gianni Fenu³, Enrico Motta⁴, Francesco Osborne⁴,
Diego Reforgiato Recupero³, Angelo Salatino⁴ and Luca Secchi¹,³,*
¹ Linkalab s.r.l., Cagliari, Italy
² Luiss Data Lab, Rome, Italy
³ University of Cagliari, Cagliari, Italy
⁴ Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom


                                         Abstract
                                         Data lakes are repositories of data stored in their natural/raw format. A data lake may include structured
                                         data from relational databases, semi-structured data (e.g., JSON, CSV), unstructured data (e.g., text), or
                                         binary data (e.g., images, audio, video). It is usually built on top of cost-efficient infrastructures such as
                                         Hadoop, Amazon S3, MongoDB, ElasticSearch, etc. Several organisations rely on big data lakes for crucial
                                         tasks such as reporting, visualisation, advanced analytics, machine learning, and business intelligence.
                                         A major limitation of this solution is that, without descriptive metadata and a mechanism to maintain
                                         it, such data tend to be noisy, making their management and analysis complex and time-consuming.
                                         Therefore, there is a need to add a semantic layer based on a formal ontology to describe the data, and an
                                         efficient mechanism to represent them as a knowledge graph. In this paper, we present a methodology
                                         to add a semantic layer to a data lake and thus obtain a knowledge graph that can support structured
                                         queries and advanced data exploration. We describe a practical implementation of the methodology, applied
                                         to a data lake consisting of text data describing the online marketplace for lodging and tourism activities.
                                         We report statistics about the data lake and the resulting knowledge graph.

                                         Keywords
                                         Semantic Data Lake, Knowledge Graphs, Information Extraction




1. Introduction
The term “data lake” was introduced by James Dixon, Chief Technology Officer of Pentaho, in a
blog post in 2010¹. Data lakes are data repositories for storing large and heterogeneous sets of
raw data. They have quickly become a common data management solution for organizations
that desire to own a holistic and large repository for their data. Data lakes allow users to access
and explore data without the need to move them into another system. Insights and reporting
performed from a data lake typically occur on an ad-hoc basis. However, users might apply a

Text2KG 2022: International Workshop on Knowledge Graph Generation from Text, co-located with ESWC 2022,
May 05–30, 2022, Hersonissos, Crete, Greece
* Corresponding author.
alessandro.chessa@linkalab.it (A. Chessa); fenu@unica.it (G. Fenu); enrico.motta@open.ac.uk (E. Motta);
francesco.osborne@open.ac.uk (F. Osborne); diego.reforgiato@unica.it (D. R. Recupero);
angelo.salatino@open.ac.uk (A. Salatino); luca.secchi@linkalab.it (L. Secchi)
 0000-0003-4668-2476 (G. Fenu); 0000-0003-0015-1952 (E. Motta); 0000-0001-6557-3131 (F. Osborne);
0000-0001-8646-6183 (D. R. Recupero); 0000-0001-6557-3131 (A. Salatino); 0000-0002-4518-1429 (L. Secchi)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




¹ https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
schema and a certain degree of automation to the data to make it possible to reproduce a report
when needed.
    Data in a data lake are stored in their raw format and are not transformed until they are
needed for analysis. Even then, a schema must be applied in some way so that they can be
analyzed. This way of working is called “schema on read” [1, 2], because data are kept raw until
they are ready to be used. Although it is always possible to use a schema-on-read approach,
it is not optimal in terms of performance and cost; therefore, data are sometimes transformed
and stored using specific file formats (e.g., Parquet, Avro, ORC) that can also carry schema
information. Data lakes require governance to establish continual maintenance and to keep the
data usable and accessible. Otherwise, the risk is to end up with data that become inaccessible,
unwieldy, expensive, and useless, culminating in what is often referred to as “data swamps”².
In order to address this limitation, it is useful to rely on a semantic layer: a representation of data
based on semantic technologies and a formal ontology that can offer a unified, consolidated
view of data across the organisation.
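The schema-on-read idea can be illustrated with a minimal sketch (the records and field names below are hypothetical): raw records are stored untouched, and a schema of casting functions is applied only when they are read for analysis.

```python
import json

# Raw records as ingested: no schema enforced at write time.
RAW_RECORDS = [
    '{"name": "The Grand Hotel", "rating": "8.7", "city": "London"}',
    '{"name": "Riverside Flat", "rating": null, "city": "London"}',
]

# A read-time schema: field name -> casting function applied on read.
READ_SCHEMA = {"name": str, "rating": float, "city": str}

def read_with_schema(raw_line, schema):
    """Parse a raw JSON record and cast each field per the read-time schema."""
    record = json.loads(raw_line)
    return {field: (cast(record[field]) if record.get(field) is not None else None)
            for field, cast in schema.items()}

rows = [read_with_schema(line, READ_SCHEMA) for line in RAW_RECORDS]
print(rows[0]["rating"])  # prints 8.7
```

Columnar formats such as Parquet make this cheaper by storing the schema alongside the data, so the casting step above is no longer needed at every read.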
    Several attempts have been made to provide data lakes with a semantic layer, each targeting
a particular application domain [3, 4, 5, 6, 7, 8, 9, 10]. However, to the best of our
knowledge, this solution has never been applied to the domain of tourism. In this paper, we
propose a practical implementation for the creation of a semantic layer to generate a knowledge
graph from a data lake consisting of text data. We applied this solution in the tourism domain,
developing a knowledge graph of accommodation facilities in London, leveraging the Data Lake
Turismo platform. Our solution takes advantage of entity linking approaches for extracting and
interlinking several entities (e.g., places, food, amenities) from reviews and other textual fields,
allowing a much more comprehensive representation of accommodations and touristic locations.
This Data Lake Turismo was developed by Linkalab s.r.l.³, capitalising on a previous research
project promoted by the Digital Innovation Hub of Sardinia⁴ and Fondazione di Sardegna⁵.
    The remainder of this paper is organised as follows. Section 2 focuses on previous
work on semantic layers for data lakes. Section 3 describes our methodology and presents its
implementation in the tourism domain. We also provide statistics and information about the
resulting knowledge graph. Section 4 ends the paper with conclusions and future work.


2. Related Work
A knowledge graph [11, 12, 13, 14, 15, 16, 17] is a knowledge base that uses a graph-structured
data model to integrate data. It represents a network of real-world entities, i.e., objects, events,
situations, or concepts, and illustrates the relationships among them.
   Dibowski et al. [6] discussed how to address data findability, accessibility, interoperability, and
re-use for data stored in a data lake. They showed the benefits that ontologies and knowledge
graphs provide to a data lake, such as data cataloguing, provenance tracking, access control,
and semantic search. In particular, they built the DCPAC ontology
² https://developer.ibm.com/articles/ba-data-becomes-knowledge-2/
³ Linkalab s.r.l. is a small Italian enterprise specialised in data science and data engineering. Home page: https://www.linkalab.it/
⁴ https://www.dihsardegna.eu/
⁵ https://www.fondazionedisardegna.it/
(Data Catalog, Provenance, and Access Control) related to the management of data produced by
vehicles. Similarly, Diamantini et al. [4] presented a semantic model for the proper use of data
stored in a data lake. They mapped indicators of interest, dimensions of analysis, and formulas
into a knowledge graph to support the correct identification of data. Pomp et al. [10]
faced similar problems related to collecting, finding, understanding, and accessing large
data sources, with the goal of ensuring their real-time availability. To reduce the time from
data collection to analysis, they centralised the data in a data lake. Instead of populating
the data lake with unstructured data, they proposed a semantic data platform, called ESKAPE, for
the semantic annotation of the ingested data. Furthermore, they defined a knowledge graph that
acts as an index and evolves over time according to the data that are included. In this way,
users can easily identify and analyse data coming from different sources. Bagozi et al. [3]
proposed a semantics-based approach for the personalised exploration of data lakes within the
domain of smart cities. First, they provided the data lake with a semantic model using domain
ontologies. Then, another ontology was adopted to describe indicators and analysis dimensions.
Finally, personalised exploration graphs were generated for different types of users. Another
work worth mentioning is by Ansari et al. [5], who proposed a semantic profiling tool for
metadata extension in data lake systems, aimed at understanding the meaning of data. Their
tool recognised the meaning of data at both schema and instance level using domain vocabularies
and ontologies. Finally, Mami [9] proposed both physical and logical data integration approaches
whose goal was to query large and heterogeneous data sources. For the physical data integration,
they defined an ontology to transform the data into RDF.
   In contrast to the approaches above, we propose a methodology to extend a data lake
containing data extracted from tourism platforms with a semantic layer and produce a knowl-
edge graph. To this end, we engineered an ontology for the tourism domain, integrating existing
ontologies and extending them with our own classes. However, the focus of this manuscript
is not on the ontology but on the resulting knowledge graph and the steps we performed to
transform the data from the data lake into the knowledge graph.


3. The Proposed Methodology
In this section, we describe our methodology for enriching a data lake by creating a domain
ontology and generating a knowledge graph that will extend the data lake with a sophisticated
representation of knowledge. This approach is articulated in five steps: i) analysis of the data
sources; ii) definition of the use cases; iii) creation of the ontology; iv) data transformation; and
v) generation of the knowledge graph. In the following, we briefly describe each phase.

3.1. Analysis of the data sources
The data lake we have used comes as a result of the Data Lake Turismo⁶ project, whose aim
was to create a digital platform for tourism data. The data lake was developed by Linkalab⁷
through Amazon Web Services (AWS) cloud computing technologies including S3 where the

⁶ Turismo means tourism in Italian.
⁷ Linkalab - https://www.linkalab.it/
Table 1
Entities stored in the data lake referring to the London Region (UK).
                  Source            Zone       Entity type                Total number
                  Booking.com       London     Lodging facility                     2,092
                  Booking.com       London     Accommodation offer                 22,154
                  Booking.com       London     Review                             443,675
                  AirBnB            London     Lodging facility                     5,975
                  AirBnB            London     Accommodation offer                  5,975
                  AirBnB            London     Review                             142,500


data was stored. The data lake collected data from various sources using an Extract, Load,
Transform (ELT) approach. A crawling system was developed to identify and extract
data related to the London Region⁸ area from various sources including Booking.com⁹ and
AirBnB¹⁰.
   The data lake is organised in three tiers: i) intake tier, where the raw data is collected, ii)
curated tier, where each transformed/cleaned version of the data is stored, and iii) consumption
tier, where the data is exposed to business analysts in many formats such as reports, dashboards,
APIs. In our specific scenario, the intake tier contains HTML files extracted from Booking.com,
and JSON files extracted using APIs exposed by AirBnB systems; the curated tier contains JSON
data extracted from the Booking.com HTML files; the consumption tier contains the knowledge
graph as a set of RDF triples.
   The data lake is built on AWS serverless technologies: Amazon S3¹¹ object storage is used to
store the files, AWS Lambda¹² and AWS Fargate¹³ are used to execute the crawling and the data
processing, and Amazon Athena¹⁴ is used to query the data stored in JSON files using SQL,
while all technical metadata is managed using the AWS Glue catalog¹⁵.
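For illustration, a query of this shape can be run in Athena over the JSON documents of the curated tier; the table name and JSON paths below are hypothetical, while `json_extract_scalar` is a standard Presto function:

```sql
-- Hypothetical table and JSON paths; not the project's actual schema.
SELECT json_extract_scalar(doc, '$.name') AS facility_name,
       CAST(json_extract_scalar(doc, '$.avg_rating') AS double) AS avg_rating
FROM curated_booking_lodging
WHERE json_extract_scalar(doc, '$.address.city') = 'London'
LIMIT 10;
```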
   The data lake describes three kinds of entities:

    • lodging facilities, i.e., any hotel, holiday house or other quarters that provide temporary
      sleeping facilities open to the public¹⁶, which are described by specific properties like
      name, address, geolocation, average user rating, textual description, pictures, and related
      amenities;
    • accommodation offers, i.e., a specific place that can accommodate persons (e.g., a hotel
      room, a camping pitch or an entire apartment) that is part of a lodging facility and is
      offered for lease under specific conditions; these offers are characterised by specific
      properties like number and type of beds, maximum and minimum occupancy, related
      amenities, and price;
⁸ The London Region area is an administrative area including the 32 London boroughs and the City of London.
⁹ https://www.booking.com/
¹⁰ https://www.airbnb.com/
¹¹ See https://aws.amazon.com/s3/
¹² See https://aws.amazon.com/lambda/
¹³ See https://aws.amazon.com/fargate/
¹⁴ Athena is a query engine based on PrestoDB. See https://aws.amazon.com/athena/ and https://prestodb.io/
¹⁵ See https://aws.amazon.com/glue/
¹⁶ Source: Law Insider, see https://www.lawinsider.com/dictionary/lodging-facilities
Table 2
Storage space in the data lake.
                    Source           Zone       Total size (HTML)    Total size (JSON)
                    Booking.com      London               13.6 GB             325.4 MB
                    AirBnB           London                     -              31.6 MB


     • user reviews about the lodging facilities, which are characterised by a rating value and
       a text.

   Table 1 reports an overview of the number of business entities stored for both sources
(Booking.com and AirBnB). For AirBnB we have the same number of lodging facilities and
accommodation offers, because AirBnB associates each offer with a unique lodging facility. Con-
versely, in Booking.com a lodging facility (e.g., a hotel) can offer multiple accommodations (e.g.,
rooms).
   Table 2 summarises the storage space used in the data lake. The main difference between
Booking.com and AirBnB is that the former is crawled by exporting HTML pages from which
the data is then extracted, whereas the latter is accessed through APIs that return the data
directly in JSON format.

3.2. Definition of the use cases
The purpose of the Data Lake Turismo project was to analyse the supply and demand sides of
tourist destinations. During the development of the project, the following use cases were
identified in collaboration with the analysts of Linkalab:
   1. Identify the topics of interest in the tourists’ reviews;
   2. Identify the topics of interest in the textual presentations of lodging business offers;
   3. Detect the sentiment [18] of tourists toward a certain lodging business or destination;
   4. Classify tourist destinations according to what they offer and according to the tourists’
      opinions.
  To better support these use cases, the data lake has been extended with a semantic layer
based on an ontology. The resulting knowledge graph includes both data and metadata,
thus enhancing the support for developing dedicated services.

3.3. Creation of the ontology
A crucial step is the creation of a domain ontology that can support the use cases. For
this purpose, it is possible to rely on standard ontology engineering frameworks and evaluation
methodologies [19].
   In our implementation, the ontology has to satisfy both functional and non-functional
requirements. As far as functional requirements are concerned, the ontology has to include
classes for lodging businesses (e.g., hotels, hostels, apartments), accommodations offered by them
(e.g., rooms, suites), amenities for tourists, tourist attractions and points of interest, inter-relations
among entities (e.g., geographic relations, composition/inclusion), tourist reviews, tourist
destinations, and taxonomies to support all of them. As far as the non-functional requirements
are concerned, the ontology must be defined in OWL and be based on Schema.org¹⁷ and
GoodRelations¹⁸.
   To drive the creation of the ontology, we designed a set of competency questions and identified
a set of existing ontologies to be used as support. The entire ontology creation process is
not discussed in this manuscript because it is out of the scope of the paper, which focuses on
the methodology for the creation of a knowledge graph to support a data lake.

3.4. Data transformation
The data transformation depends on the source data structures and on the desired output. The
steps needed to transform the data are: i) extraction of relevant structured data and texts from
the original sources; ii) data cleaning; iii) ontology mapping, to represent the entities in the
structured data according to the ontology; iv) language detection, to identify the language of
the texts; and v) identification and extraction of entities within the texts.
   The last step is crucial to obtain a good representation of the data, since much important
information is expressed only in natural language, especially in the texts describing the
lodging facilities and in the reviews. To this purpose, we used the DBpedia Spotlight entity
linking approach to extract common entities such as activities, events, places, and food.
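A minimal sketch of querying the public DBpedia Spotlight REST endpoint and collecting the linked entities; the endpoint URL and parameters follow the public service, while the sample response below is abridged and illustrative, and error handling is omitted:

```python
import json
import urllib.parse
import urllib.request

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """POST text to DBpedia Spotlight and return the parsed JSON response."""
    data = urllib.parse.urlencode({"text": text, "confidence": confidence}).encode()
    request = urllib.request.Request(
        SPOTLIGHT_URL, data=data, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def linked_entities(spotlight_json):
    """Collect (surface form, DBpedia URI) pairs from a Spotlight response."""
    return [(r["@surfaceForm"], r["@URI"])
            for r in spotlight_json.get("Resources", [])]

# Abridged, illustrative shape of a Spotlight response for a review sentence.
sample = {"Resources": [{"@surfaceForm": "Tower of London",
                         "@URI": "http://dbpedia.org/resource/Tower_of_London"}]}
print(linked_entities(sample))
```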
   We then integrated this information into the knowledge graph by linking DBpedia entities
with the relevant lodging facilities. This allows our system to support advanced queries, such
as retrieving all the accommodations that are close to tourist attractions, or those that offer a
specific amenity or a particular kind of food, as well as finding the places or events that users
cite most frequently in their reviews.
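For example, such a query could be expressed in SPARQL along these lines (the prefixes and the `ex:mentionsEntity` property are illustrative assumptions, not the project's actual vocabulary):

```sparql
# Hypothetical query: accommodations whose reviews mention the Tower of London.
PREFIX schema: <https://schema.org/>
PREFIX ex:     <http://example.org/ontology/>

SELECT DISTINCT ?facility ?name
WHERE {
  ?facility a schema:LodgingBusiness ;
            schema:name ?name ;
            schema:review ?review .
  ?review ex:mentionsEntity <http://dbpedia.org/resource/Tower_of_London> .
}
```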

3.5. Generation of the knowledge graph
The last step takes as input the refined data and the ontology, and produces the knowledge
graph. For this purpose, it is possible to rely on several languages and tools for the automatic
generation of triples [20, 21]. In our implementation, we adopted the RDF Mapping Language
(RML) [7], which is one of the most well-known solutions in this space, to build specific data
pipelines for the creation of RDF triples. The RML language specifies how linked data are
produced from the corresponding data sources. To perform an RML transformation¹⁹ we need
three things: i) an RML processor; ii) an input data source; and iii) a mapping from any (structured)
data in the input data source to RDF.
  Triples are generated for each of the triples maps of the RML mapping. In our prototype, we
used RMLMapper [8]²⁰ for this purpose. The triples representing the knowledge graph are
generated as N-Quads files²¹, which are stored in the consumption tier of the data lake.
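As a concrete illustration, a minimal RML triples map of the kind used in such pipelines; the source path, iterator, and ontology terms below are hypothetical, not the project's actual mapping:

```turtle
@prefix rr:     <http://www.w3.org/ns/r2rml#> .
@prefix rml:    <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:     <http://semweb.mmlab.be/ns/ql#> .
@prefix schema: <https://schema.org/> .

<#LodgingMap>
  rml:logicalSource [
    rml:source "curated/booking/lodging.json" ;   # hypothetical curated-tier file
    rml:referenceFormulation ql:JSONPath ;
    rml:iterator "$.facilities[*]"
  ] ;
  rr:subjectMap [
    rr:template "http://example.org/lodging/{id}" ;
    rr:class schema:LodgingBusiness
  ] ;
  rr:predicateObjectMap [
    rr:predicate schema:name ;
    rr:objectMap [ rml:reference "name" ]
  ] .
```

Fed to an RML processor such as RMLMapper together with the JSON source, a map of this shape yields one typed subject with a `schema:name` triple per facility.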

¹⁷ https://schema.org/
¹⁸ http://www.heppnetz.de/projects/goodrelations/
¹⁹ https://rml.io/specs/rml/
²⁰ https://github.com/RMLio/rmlmapper-java
²¹ See https://www.w3.org/TR/n-quads/
Table 3
Knowledge graph metrics.
                     Metric                                             Value
                     Total statements                               10,299,471
                     Explicit statements                             5,148,987
                     Inferred statements                             5,150,484
                     Expansion ratio                                         2
                     Number of distinct relations                           50
                     Number of DBpedia entities linked                  91,284
                     Number of unique DBpedia entities linked            2,644
                     Number of AirBnB reviews entities                 142,500
                     Number of Booking.com reviews entities            435,276
                     Total number of reviews entities                  577,776
                     Total time for triple generation             ∼19 minutes


    We ingest new data from the original sources into the data lake every two months. We then
recreate the knowledge graph from scratch by repeating all the data transformation steps
described in Section 3.4.
    Table 3 reports some metrics about the latest version of the knowledge graph: i) total state-
ments refers to the overall number of triples stored in the triplestore (both explicit and inferred);
ii) explicit statements refers to the number of raw triples created in the triplestore; iii) in-
ferred statements refers to the number of triples inferred by the reasoner from the explicit
statements; iv) expansion ratio is the ratio between the total and the explicit statements.
The other metrics are self-explanatory.
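As a sanity check, the figures in Table 3 are internally consistent: the explicit and inferred statements sum to the total, and the expansion ratio is the total divided by the explicit statements:

```python
# Figures from Table 3 (explicit, inferred, and total statements).
explicit = 5_148_987
inferred = 5_150_484
total = 10_299_471

# The reasoner adds the inferred statements on top of the explicit ones.
assert explicit + inferred == total

# Expansion ratio: how much inference grows the explicit graph.
expansion_ratio = total / explicit
print(round(expansion_ratio, 2))  # prints 2.0
```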


4. Conclusions
In this paper, we have presented a general methodology for extending a data lake with a
knowledge graph. In particular, we have focused our analysis on the tourism domain by
considering a data lake containing structured and unstructured data crawled from Booking.com
and AirBnB. The knowledge graph thus obtained has been stored in a triplestore which can
be accessed online.
   We can conclude that the semantic layer provided by the knowledge graph brought many
advantages to Linkalab’s data lake platform: i) it treats data and metadata in a unified way;
ii) it has a flexible schema that can support data variety and evolution; iii) it supports
algorithm and application development and data science activities based on the data lake; iv)
it embeds information in its graph structure that can be leveraged by graph analytics [22, 23]
and representation learning [24] algorithms; v) it incorporates knowledge extracted from texts
along with the structured and semi-structured data typically found in the data lake; and vi) it
can be used to expand the information context of the data lake through connections with open
knowledge graphs like DBpedia.
   In future work, we aim to expand the pipeline for producing the knowledge graph by devel-
oping new solutions for entity extraction and by further improving the ontology. We also plan to
develop a tool that will take advantage of the knowledge graph for analysing and comparing
accommodations and generating explainable recommendations.


References
 [1] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, I. Stoica, Shark: SQL and rich
     analytics at scale, in: Proceedings of the 2013 ACM SIGMOD International Conference on
     Management of Data, 2013, pp. 13–24.
 [2] C. Mathis, Data lakes, Datenbank-Spektrum 17 (2017) 289–293.
 [3] A. Bagozi, D. Bianchini, V. De Antonellis, M. Garda, M. Melchiori, Personalised exploration
     graphs on semantic data lakes, in: H. Panetto, C. Debruyne, M. Hepp, D. Lewis, C. A.
     Ardagna, R. Meersman (Eds.), On the Move to Meaningful Internet Systems: OTM 2019
     Conferences, Springer International Publishing, Cham, 2019, pp. 22–39.
 [4] C. Diamantini, D. Potena, E. Storti, A semantic data lake model for analytic query-driven
     discovery, in: The 23rd International Conference on Information Integration and Web
     Intelligence, iiWAS2021, Association for Computing Machinery, New York, NY, USA,
     2021, p. 183–186. URL: https://doi.org/10.1145/3487664.3487783. doi:10.1145/3487664.
     3487783.
 [5] J. W. Ansari, N. Karim, W. Ansari, O. D. Beyan, M. Cochez, Semantic profiling in data lake,
     2018.
 [6] H. Dibowski, S. Schmid, Y. Svetashova, C. Henson, T. Tran, Using semantic technologies
     to manage a data lake: Data catalog, provenance and access control, 2020.
 [7] A. Dimou, M. V. Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van De Walle, RML: A
     generic language for integrated RDF mappings of heterogeneous data, in: CEUR Workshop
     Proceedings, volume 1184, 2014.
 [8] A. Dimou, T. De Nies, R. Verborgh, E. Mannens, R. de Walle, Automated Metadata
     Generation for Linked Data Generation and Publishing Workflows, Proceedings of the 9th
     Workshop on Linked Data on the Web 1593 (2016).
 [9] Mohamed Nadjib Mami, Strategies for a Semantified Uniform Access to Large and Het-
     erogeneous Data Sources, Ph.D. thesis, Rheinische Friedrich-Wilhelms-Universität Bonn,
     2021. URL: https://hdl.handle.net/20.500.11811/8925.
[10] A. Pomp, A. Paulus, A. Kirmse, V. Kraus, T. Meisen, Applying semantics to reduce the
     time to analytics within complex heterogeneous infrastructures, Technologies 6 (2018).
     URL: https://www.mdpi.com/2227-7080/6/3/86. doi:10.3390/technologies6030086.
[11] D. Dessì, F. Osborne, D. Reforgiato Recupero, D. Buscaldi, E. Motta, H. Sack, AI-KG:
     an automatically generated knowledge graph of artificial intelligence, in: International
     Semantic Web Conference, Springer, 2020, pp. 127–143.
[12] A. Meloni, S. Angioni, A. A. Salatino, F. Osborne, D. R. Recupero, E. Motta, Aida-bot: A
     conversational agent to explore scholarly knowledge graphs, in: O. Seneviratne, C. Pesquita,
     J. Sequeda, L. Etcheverry (Eds.), Proceedings of the ISWC 2021 Posters, Demos and Industry
     Tracks: From Novel Ideas to Industrial Practice co-located with 20th International Semantic
     Web Conference (ISWC 2021), Virtual Conference, October 24-28, 2021, volume 2980
     of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: http://ceur-ws.org/Vol-2980/
     paper310.pdf.
[13] S. Angioni, A. Salatino, F. Osborne, D. R. Recupero, E. Motta, Aida: A knowledge graph
     about research dynamics in academia and industry, Quantitative Science Studies (2021)
     1–43.
[14] M. Alam, A. Gangemi, V. Presutti, D. R. Recupero, Semantic role labeling for knowledge
     graph extraction from text, Prog. Artif. Intell. 10 (2021) 309–320. URL: https://doi.org/10.
     1007/s13748-021-00241-7. doi:10.1007/s13748-021-00241-7.
[15] M. Nayyeri, G. M. Cil, S. Vahdati, F. Osborne, M. Rahman, S. Angioni, A. A. Salatino, D. R.
     Recupero, N. Vassilyeva, E. Motta, J. Lehmann, Trans4e: Link prediction on scholarly
     knowledge graphs, Neurocomputing 461 (2021) 530–542. URL: https://doi.org/10.1016/j.
     neucom.2021.02.100. doi:10.1016/j.neucom.2021.02.100.
[16] M. Nayyeri, G. M. Cil, S. Vahdati, F. Osborne, A. Kravchenko, S. Angioni, A. A. Salatino,
     D. R. Recupero, E. Motta, J. Lehmann, Link prediction of weighted triples for knowledge
     graph completion within the scholarly domain, IEEE Access 9 (2021) 116002–116014. URL:
     https://doi.org/10.1109/ACCESS.2021.3105183. doi:10.1109/ACCESS.2021.3105183.
[17] M. Alam, A. Fensel, J. M. Gil, B. Moser, D. R. Recupero, H. Sack, Special issue on machine
     learning and knowledge graphs, Future Gener. Comput. Syst. 129 (2022) 50–53. URL:
     https://doi.org/10.1016/j.future.2021.11.022. doi:10.1016/j.future.2021.11.022.
[18] D. Reforgiato Recupero, E. Cambria, Eswc’14 challenge on concept-level sentiment analysis,
     Communications in Computer and Information Science 475 (2014) 3–20. doi:10.1007/
     978-3-319-12024-9\_1, cited By 23.
[19] V. A. Carriero, A. Gangemi, M. L. Mancinelli, A. G. Nuzzolese, V. Presutti, C. Veninata,
     Pattern-based design applied to cultural heritage knowledge graphs, Semantic Web 12
     (2021) 313–357. doi:10.3233/SW-200422.
[20] D. Dessì, F. Osborne, D. R. Recupero, D. Buscaldi, E. Motta, Generating knowledge graphs
     by employing natural language processing and machine learning techniques within the
     scholarly domain, Future Gener. Comput. Syst. 116 (2021) 253–264. URL: https://doi.org/
     10.1016/j.future.2020.10.026. doi:10.1016/j.future.2020.10.026.
[21] J. Arenas-Guerrero, M. Scrocca, A. Iglesias-Molina, J. Toledo, L. P. Gilo, D. Dona, O. Corcho,
     D. Chaves-Fraga, Knowledge graph construction with r2rml and rml: an etl system-based
     overview (2021).
[22] A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. P. Pérez, T. Manhardt, H. Chafi, M. Capotă,
     N. Sundaram, M. Anderson, I. G. Tănase, Y. Xia, L. Nai, P. Boncz, LDBC graphalytics: A
     benchmark for large scale graph analysis on parallel and distributed platforms, Proceedings
     of the VLDB Endowment 9 (2015) 1317–1328. doi:10.14778/3007263.3007270.
[23] A. Cuzzocrea, I. Y. Song, Big graph analytics: The state of the art and future research agenda,
     DOLAP 2014 - Proceedings of the ACM 17th International Workshop on Data Warehousing
     and OLAP, co-located with CIKM 2014 (2014) 99–101. doi:10.1145/2666158.2668454.
[24] S. Ji, S. Pan, E. Cambria, P. Marttinen, P. S. Yu, A Survey on Knowledge Graphs: Represen-
     tation, Acquisition, and Applications, IEEE Transactions on Neural Networks and Learning
     Systems (2021) 1–26. doi:10.1109/TNNLS.2021.3070843. arXiv:2002.00388.