=Paper=
{{Paper
|id=Vol-3632/ISWC2023_paper_479
|storemode=property
|title=ADA: Automatic Data Annotation for Data Ecosystems
|pdfUrl=https://ceur-ws.org/Vol-3632/ISWC2023_paper_479.pdf
|volume=Vol-3632
|authors=Natalie Gdanitz,Sabine Janzen,Hannah Stein,Amin Harig,Wolfgang Maaß
|dblpUrl=https://dblp.org/rec/conf/semweb/GdanitzJSH023
}}
==ADA: Automatic Data Annotation for Data Ecosystems==
Natalie Gdanitz¹, Sabine Janzen¹, Hannah Stein¹,², Amin Harig¹ and Wolfgang Maass¹,²
¹ German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
² Saarland University, Saarbrücken, Germany
Abstract
Data ecosystems have emerged as versatile platforms for managing and analyzing data from diverse
sources, facilitating integration, collaboration and governance across organizations and systems.
Annotated data are crucial for efficient and effective large-scale data ecosystems. However, there is a lack of
full-fledged automatic annotation approaches for data ecosystems, with manual annotation by experts
being the current requirement. Addressing specific annotation requirements of data ecosystems, we
introduce ADA, an approach for automatic data annotation. ADA applies a semantic representation
model called Data Product Description Object (DPDO) in JSON-LD and combines state-of-the-art models
for metadata embeddings within an annotation pipeline. The approach extends technical metadata
by essential concepts for data ecosystems, such as data provenance, quality, and accessibility. The
effectiveness of ADA was evaluated using competency questions and data sets from diverse domains
within the GAIA-X data ecosystem.
Keywords
Data ecosystems, Automatic data annotation, Metadata, Ontology
1. Introduction
Data ecosystems consist of centralized or decentralized platforms for managing and analyzing
data from various sources, e.g., structured data, text, or images [1]. They are designed to facilitate
data integration, sharing, collaboration, and governance across different systems, applications,
and organizations, e.g., International Data Spaces¹, GAIA-X², and Manufacturing-X³. Data ecosystems
intend to help organizations overcome data silos, enhance data-driven decision-making,
and foster collaboration among data stakeholders [1]. Annotated data represent a fundamental
requirement for efficient and effective large-scale data ecosystems [2, 3], referring to embedded
metadata about structure, content, quality, and meaning of data. Automatically annotated
data create a basis for high-quality curated data and allow data consumers to understand and
interpret data without relying on external documentation or knowledge. This enables seamless
integration, sharing, data governance and trust, exploration, and large-scale data analysis within
data ecosystems [4, 5]. While there exist conceptual ideas of generic cross-domain data
ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece
Natalie.Gdanitz@dfki.de (N. Gdanitz); Sabine.Janzen@dfki.de (S. Janzen); Hannah.Stein@dfki.de (H. Stein);
Amin.Harig@dfki.de (A. Harig); Wolfgang.Maass@dfki.de (W. Maass)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
1 https://internationaldataspaces.org
2 https://gaia-x.eu/
3 https://www.plattform-i40.de/IP/Navigation/EN/Manufacturing-X/Manufacturing-X.html
annotations for data ecosystems [6, 7, 8], there is a lack of full-fledged automatic annotation
approaches for data ecosystems. So far, manual annotation by experts is required [9] (e.g.,
labeling training data, assigning data to developed concepts). Tackling the specific annotation
requirements of data ecosystems with respect to data provenance, quality and context,
accessibility, availability, and contractual information of open domain data is beyond the scope of
existing research on automatic data annotation [2], e.g., [10, 11, 12]. In this work, we introduce
ADA – an approach for automatic data annotation for data ecosystems. ADA works with a
semantic representation model called Data Product Description Object (DPDO) operationalized
in JSON-LD and combines multiple state-of-the-art models for metadata embeddings within an
annotation pipeline (e.g., ontology development and knowledge graph population [13], metadata
harvesting and extraction [14]). ADA builds upon existing ontological standards (e.g., Data
Catalog Vocabulary⁴) and extends technical metadata with essential concepts for data
ecosystems, e.g., data provenance, quality, and accessibility. ADA supports open domain structured
data sets, i.e., tabular data (CSV format). The approach was demonstrated in an annotator
service for automatic data annotation in data ecosystems. We evaluated ADA using
competency questions and data sets extracted from Kaggle⁵ that cover diverse domains listed
by the GAIA-X data ecosystem², e.g., agriculture, construction, energy, geoinformation,
or culture.
2. Automatic Data Annotation for Data Ecosystems (ADA)
In order to satisfy the requirements of data ecosystems, we first developed the DPDO, which
serves as the semantic foundation for the annotation process. Our automatic annotation pipeline
consists of three components: analyzer, controller, and provision engine (see Figure 1).
Figure 1: Sequence of automatic data annotation shown in a pipeline
Semantic representation model: Based on existing literature [15, 16, 17, 18, 19] and
publicly available vocabularies (schema.org⁶, Data Quality Vocabulary⁷, Open Vocab⁸,
4 https://www.w3.org/TR/vocab-dcat-3/
5 https://www.kaggle.com/datasets
6 https://schema.org/
7 https://www.w3.org/TR/vocab-dqv/
8 https://vocab.org/open/
Data Catalog Vocabulary⁹, DCMI Metadata Terms¹⁰, GAIA-X ontology¹¹,¹²), the semantic
representation model organizes information on data in data ecosystems into five facets¹³ (see Figure 2).
Potential data consumers require a product description in terms of context and metadata (e.g.,
topic, datatypes, data size). Data quality (e.g., referencing existing data quality standards such as
ISO 8000¹⁴, metrics for the calculation of a quality score or accuracy) lays the foundation for the
usability of data. Information on accessibility (i.e., access URL, technical support) and timeliness
(i.e., last data modification, historical information on data versioning) enables the actual usage
of the data. A data business transaction between ecosystem participants requires contractual
information (i.e., contract description, price specifications, sanctions) and information about
usage rights (e.g., license). Furthermore, trust between participants can be established by
having information on the data provenance (i.e., name, locality, and contact of the data provider).
Figure 2: Simplified overview of the semantic representation model - Data Product Description Object
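As an illustration of the five facets, the following minimal Python sketch builds a JSON-LD-shaped skeleton with one entry per facet. It is a sketch under stated assumptions, not the DPDO as published in the repository: the `dpdo` prefix, its example.org namespace, the names of the quality and access facets, and all `dpdo:` property names are hypothetical. Only the `dcat` and `dct` namespace IRIs and the terms `dct:title`, `dct:modified`, `dct:license`, `dcat:byteSize`, and `dcat:accessURL` come from the published vocabularies.

```python
import json

# Hypothetical DPDO skeleton. The facet grouping follows the text; every key
# under the placeholder "dpdo" prefix is an illustrative assumption.
DPDO_SKELETON = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
        "dpdo": "https://example.org/dpdo#",  # placeholder namespace
    },
    "@id": "https://example.org/data-products/sample",  # placeholder identifier
    "@type": "dpdo:DataProduct",
    # Context and metadata: topic, datatypes, data size.
    "dpdo:productDescription": {
        "dct:title": None,
        "dpdo:topic": None,
        "dpdo:dataTypes": None,
        "dcat:byteSize": None,
    },
    # Data quality: referenced standard and a quality score.
    "dpdo:qualityDescription": {
        "dpdo:qualityStandard": None,
        "dpdo:qualityScore": None,
    },
    # Accessibility and timeliness: access URL, support, last modification.
    "dpdo:accessDescription": {
        "dcat:accessURL": None,
        "dpdo:technicalSupport": None,
        "dct:modified": None,
    },
    # Contractual information and usage rights.
    "dpdo:businessAndUsageDescription": {
        "dpdo:contractDescription": None,
        "dpdo:priceSpecification": None,
        "dct:license": None,
    },
    # Provenance: name, locality, and contact of the data provider.
    "dpdo:trustDescription": {
        "dpdo:providerName": None,
        "dpdo:providerLocality": None,
        "dpdo:providerContact": None,
    },
}


def facets(doc):
    """Return the facet keys, i.e. every non-keyword entry holding a dict."""
    return [k for k, v in doc.items() if not k.startswith("@") and isinstance(v, dict)]


def open_slots(doc):
    """List (facet, property) pairs that still have no value."""
    return [(f, p) for f in facets(doc) for p, v in doc[f].items() if v is None]


print(json.dumps(DPDO_SKELETON, indent=2))
```

Keeping unfilled slots as explicit `None` values makes it easy to see which parts of the description still depend on additional user input.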
Analyzer component: Using the GUI, users upload the data to be annotated.
Within our demonstrator, they are additionally provided with sample JSON files that must be
uploaded alongside the data as additional input on the provider, contractual details, and data
quality in order to fill the DPDO. In a real-life scenario, while participating in a data ecosystem,
this information is expected to be entered once and then stored within the respective platforms
as described by existing concepts¹,². The analyzer (see Figure 1) then reads all files in order to
extract the information required by the DPDO. Embedded metadata within the data file are
harvested as described in [14] (i.e., title of the file, file size) or extracted directly from the file’s
content as in [14] (i.e., column names, data types). Column names and the title of the file are used
to generate a thematic allocation and to disambiguate the content of the data file by using them
as lemmas to search for synsets and hypernyms in BabelNet¹⁵. Furthermore, the additional
input JSON files are parsed and passed, alongside all other extracted information, to the controller.
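The harvesting and extraction step described above can be sketched with the Python standard library. This is a minimal illustration, not the analyzer from the ADA repository: the function names `infer_type` and `analyze_csv`, the coarse type labels, and the sample file are assumptions made for this example. The BabelNet lookup is deliberately left out; the sketch only collects the lemmas that such a lookup would receive.

```python
import csv
import os
import tempfile


def infer_type(values):
    """Guess a coarse data type for one column from its string cells."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def analyze_csv(path):
    """Harvest file-level metadata and extract column names and types."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    columns = {
        name: infer_type([row[i] for row in body if i < len(row)])
        for i, name in enumerate(header)
    }
    title = os.path.splitext(os.path.basename(path))[0]
    return {
        "title": title,                             # harvested from the file name
        "file_size_bytes": os.path.getsize(path),   # harvested from the file system
        "columns": columns,                         # extracted from the content
        # Lemmas a lexical lookup (BabelNet in the paper) would be queried with.
        "lemmas": sorted({title.lower(), *(c.lower() for c in header)}),
    }


# Made-up sample data, written to a temporary file for demonstration only.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "crop_yield.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("region,year,yield_t_ha\nSaarland,2021,7.4\nHesse,2022,6.9\n")
    report = analyze_csv(sample_path)

print(report["columns"])
print(report["lemmas"])
```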
Controller component: The controller (see Figure 1) assigns all information gathered by
the analyzer to a semantic description. The DPDO is operationalized in the form of JSON-LD,
serving as a marker template that is filled in a rule-based manner with the results of the processed
input files. The controller automatically maps extracted metadata, synsets, and hypernyms to entities
9 https://www.w3.org/TR/vocab-dcat-3/
10 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
11 https://gaia-x.gitlab.io/gaia-x-community/gaia-x-self-descriptions/core/core.html
12 https://gaia-x.gitlab.io/technical-committee/service-characteristics/widoco/participant/participant.html
13 A complete overview of modeled entities is given in our repository https://github.com/InformationServiceSystems/pairs-project/tree/main/Modules/ADA
14 https://www.iso.org/standard/81745.html
15 https://babelnet.org/
of the product description facet (see Figure 2). Information on the user registry is mapped to
the trust description facet and contractual information to the business and usage description
facet. The additional user input is mapped onto remaining entities of all facets of the specified
semantic scheme of the DPDO.
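The rule-based filling of the template can be sketched as a table of mapping rules applied to the analyzer output and the additional user input. The sketch below is an assumption-laden illustration: the template keys, the rule table, the source names `analyzer` and `user_input`, and all sample values are hypothetical and do not reproduce the controller in the ADA repository.

```python
import copy
import json

# Hypothetical template with empty markers (None) for each slot.
TEMPLATE = {
    "@context": {"dpdo": "https://example.org/dpdo#"},  # placeholder namespace
    "@type": "dpdo:DataProduct",
    "dpdo:productDescription": {
        "dpdo:title": None,
        "dpdo:columns": None,
        "dpdo:topics": None,
    },
    "dpdo:trustDescription": {"dpdo:providerName": None},
    "dpdo:businessAndUsageDescription": {"dpdo:license": None},
}

# Each rule reads: take `key` from `source` and write it to `slot` of `facet`.
RULES = [
    ("analyzer", "title", "dpdo:productDescription", "dpdo:title"),
    ("analyzer", "columns", "dpdo:productDescription", "dpdo:columns"),
    ("analyzer", "hypernyms", "dpdo:productDescription", "dpdo:topics"),
    ("user_input", "provider", "dpdo:trustDescription", "dpdo:providerName"),
    ("user_input", "license", "dpdo:businessAndUsageDescription", "dpdo:license"),
]


def fill_template(template, sources, rules):
    """Return a filled copy of the template; missing inputs leave slots empty."""
    filled = copy.deepcopy(template)
    for source, key, facet, slot in rules:
        value = sources.get(source, {}).get(key)
        if value is not None:
            filled[facet][slot] = value
    return filled


# Made-up inputs standing in for the analyzer result and the parsed JSON files.
sources = {
    "analyzer": {
        "title": "crop_yield",
        "columns": ["region", "year", "yield_t_ha"],
        "hypernyms": ["agriculture"],
    },
    "user_input": {"provider": "Example Provider", "license": "CC BY 4.0"},
}

description = fill_template(TEMPLATE, sources, RULES)
print(json.dumps(description, indent=2))
```

Keeping the rules in a plain table separates the mapping from the filling logic, so a new metadata field only requires one additional rule.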
Provision engine: The provision engine (see Figure 1) transforms the generated JSON-LD
file into a graph to be stored within the knowledge graph database Neo4j¹⁶ using a Cypher
script. While we could have used a triplestore for storing our knowledge graph, one reason
we chose Neo4j is that information on entities and relationships can be stored without creating
extra nodes, leading to a more condensed representation of the data, which is particularly
relevant in the case of large-scale data ecosystems. The resulting knowledge graph then serves as a
data catalogue or knowledge base within a data ecosystem¹,²,³ that can be explored and queried.
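To make the transformation step concrete, the following Python sketch turns a JSON-LD-shaped dictionary into parameterized Cypher statements. It only generates the statement text and parameters and does not connect to a database. The node labels, the `HAS_FACET` relationship, the one-node-per-facet layout, and the function names are assumptions for illustration; the Cypher script actually used by ADA is in the project repository and may be structured differently.

```python
def local_name(term):
    """Strip a prefix such as 'dpdo:' and return the remaining identifier."""
    name = term.split(":")[-1]
    if not name.isidentifier():
        # Labels are interpolated into the statement, so reject unsafe names.
        raise ValueError(f"unsafe name for Cypher: {term!r}")
    return name


def to_cypher(doc):
    """Return (statement, parameters) pairs for a JSON-LD-shaped dict."""
    product_id = doc["@id"]
    statements = [("MERGE (p:DataProduct {id: $id})", {"id": product_id})]
    for key, value in doc.items():
        if key.startswith("@") or not isinstance(value, dict):
            continue
        name = local_name(key)
        label = name[0].upper() + name[1:]
        # Facet values become node properties; empty slots are skipped.
        # Nested maps are not valid property values and would need own nodes.
        props = {local_name(k): v for k, v in value.items() if v is not None}
        statements.append((
            "MATCH (p:DataProduct {id: $id}) "
            f"MERGE (p)-[:HAS_FACET]->(f:{label}) "
            "SET f += $props",
            {"id": product_id, "props": props},
        ))
    return statements


# Made-up description used only to show the generated statements.
doc = {
    "@id": "https://example.org/data-products/sample",
    "dpdo:productDescription": {"dpdo:title": "crop_yield", "dpdo:byteSize": None},
    "dpdo:trustDescription": {"dpdo:providerName": "Example Provider"},
}

for statement, params in to_cypher(doc):
    print(statement)
    print("  ", params)
```

Passing values as parameters rather than formatting them into the statement text avoids quoting problems with metadata values that contain special characters.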
Evaluation: We evaluated our approach using 16 competency questions (CQs) extracted from
related work within the domain of semantic annotations and data ecosystems. In deriving these
competency questions, we specifically focused on information required by stakeholders
within data ecosystems. Examples are ’Which context is the focus of the dataset?’, ’Which
datatypes are used?’, and ’What’s the size of the dataset?’¹⁷. Furthermore, we annotated publicly
available data sets¹⁷ that we extracted from Kaggle and that match the domains listed by the
GAIA-X ecosystem (i.e., agriculture, energy, construction, finances, geo data, industry, culture,
education, mobility, public sector, smart living). We investigated the resulting knowledge graph
against the defined CQs using Cypher queries. We were able to answer all 16 CQs.
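The idea of checking competency questions against an annotation can be sketched as follows. Note that this sketch swaps the technique: instead of Cypher queries against Neo4j, it looks up a path in an in-memory dictionary and counts a question as answerable when the value is non-empty. The three questions are quoted from the examples above; the property paths and sample values are hypothetical.

```python
# Each competency question is paired with a hypothetical path into the description.
COMPETENCY_QUESTIONS = {
    "Which context is the focus of the dataset?": ("dpdo:productDescription", "dpdo:topics"),
    "Which datatypes are used?": ("dpdo:productDescription", "dpdo:dataTypes"),
    "What's the size of the dataset?": ("dpdo:productDescription", "dpdo:byteSize"),
}


def answer(doc, path):
    """Follow the path through nested dicts; return None if nothing is found."""
    node = doc
    for step in path:
        if not isinstance(node, dict) or step not in node:
            return None
        node = node[step]
    # Treat empty containers and empty strings as "no answer".
    return None if node in (None, "", [], {}) else node


def evaluate(doc, questions):
    """Map every question to its answer (or None when it cannot be answered)."""
    return {question: answer(doc, path) for question, path in questions.items()}


# Made-up annotation result for demonstration only.
annotation = {
    "dpdo:productDescription": {
        "dpdo:topics": ["agriculture"],
        "dpdo:dataTypes": ["string", "integer", "float"],
        "dpdo:byteSize": 55,
    }
}

results = evaluate(annotation, COMPETENCY_QUESTIONS)
answered = sum(value is not None for value in results.values())
print(f"{answered} of {len(results)} competency questions answered")
```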
3. Conclusion
In this paper, we proposed ADA, an approach for automatic data annotation in data ecosystems.
ADA leverages the semantic representation model DPDO (JSON-LD) and integrates
state-of-the-art models for metadata embeddings, enabling seamless integration, sharing, governance,
exploration, and large-scale data analysis within data ecosystems. ADA supports open domain
structured data sets, i.e., tabular data in CSV format, and can be used by annotating experts and
non-experts. By extending technical metadata with essential concepts such as data provenance,
quality, and accessibility, ADA addresses the specific annotation requirements of data ecosystems
and their stakeholders. The evaluation of ADA through competency questions and diverse
tabular data sets demonstrates its effectiveness in supporting open domain structured data sets
within the GAIA-X data ecosystem and beyond¹⁸.
16 https://neo4j.com/
17 All competency questions, references of related work, used data sets, executed queries, and query results are listed within our repository.
18 Demonstration is given within a screencast https://youtu.be/2af0_IButIA; Code of the service and evaluation results can be found within our GitHub repository https://github.com/InformationServiceSystems/pairs-project/tree/main/Modules/ADA.
4. Acknowledgement
This work was partially funded by the German Federal Ministry of Economics and Climate
Protection (BMWK) under the contracts 01MK21008D and 01MK20015A.
References
[1] F. Tocco, L. Lafaye, Data platform solutions, Designing Data Spaces (2022) 383.
[2] M. Fassnacht, C. Benz, D. Heinz, J. Leimstoll, et al., Barriers to data sharing among private
sector organizations, Proc. of the 56th HICSS (2023).
[3] C. Mertens, J. Alonso, O. Lázaro, C. Palansuriya, et al., A framework for big data sovereignty:
The european industrial data space (eids), in: Data Spaces: Design, Deployment and Future
Directions, Springer International Publishing Cham, 2022, pp. 201–226.
[4] W. Maass, Contract-based data-driven decision making in federated data ecosystems, Proc.
of the 55th HICSS (2022).
[5] M. Jarke, B. Otto, S. Ram, Data sovereignty and data space ecosystems, Bus Inf Syst 61
(2019) 549–550.
[6] G. Solmaz, F. Cirillo, J. Fürst, T. Jacobs, et al., Enabling data spaces: Existing developments
and challenges, in: Proc. of the International Workshop on Data Economy, 2022, pp. 42–48.
[7] GAIA-X, GAIA-X Core Ontology, https://gaia-x.gitlab.io/gaia-x-community/gaia-x-self-descriptions/core/core.html,
Accessed: 2023-07-10, 2022.
[8] D. R. Firdausy, P. de Alencar Silva, M. van Sinderen, M. E. Iacob, Semantic discovery and
selection of data connectors in international data spaces, Proc. of I-ESA 1613 (2022) 0073.
[9] S. Sharma, S. Jain, Comprehensive study of semantic annotation: Variant and praxis, Int J
Comput Intell Appl (ACI 2021) 2823 (2021) 102–116.
[10] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, Mtab4wikidata at semtab
2020: Tabular data annotation with wikidata., SemTab@ ISWC 2775 (2020) 86–95.
[11] V. Janev, M. E. Vidal, K. Endris, D. Pujic, Managing knowledge in energy data spaces, in:
Companion Proc. of the Web Conf. 2021, 2021, pp. 7–15.
[12] H. Drees, D. O. Kubitza, J. Lipp, S. Pretzsch, et al., Mobility data space–first implementation
and business opportunities, in: Proc. of the 27th ITS World Congress, 2021, pp. 11–15.
[13] N. Abdelmageed, B. König-Ries, Meta2kg: transforming metadata to knowledge graphs,
in: Proc. of the 17th OM, volume 3324, 2022, pp. 226–228.
[14] S. Patankar, M. Phadke, S. Devane, Wiki sense bag creation using multilingual word sense
disambiguation, IAES Int 11 (2022) 319.
[15] S. Janzen, W. Maass, Smart product description object (spdo), in: Poster Proc. of the 5th
FOIS, Citeseer, 2008.
[16] M. Abramovici, Smart products, CIRP Encyclopedia of Prod Eng 59 (2014) 1–5.
[17] A. Oberweis, V. Pankratius, W. Stucky, Product lines for digital information products, Inf
Syst 32 (2007) 909–939.
[18] K. L. Hui, P. Y. Chau, Classifying digital products, Commun ACM 45 (2002) 73–79.
[19] S. Neumaier, J. Umbrich, A. Polleres, Automated quality assessment of metadata across
open data portals, ACM J Data Inf Qual 8 (2016) 1–29.