=Paper=
{{Paper
|id=Vol-3346/Paper4
|storemode=property
|title=A Multi-level Ontology-based Approach for Descriptors of Catalogued
Resources
|pdfUrl=https://ceur-ws.org/Vol-3346/Paper4.pdf
|volume=Vol-3346
|authors=Vânia Borges,Natália Queiroz de Oliveira,Maria Luiza Machado Campos
|dblpUrl=https://dblp.org/rec/conf/ontobras/BorgesOC22
}}
==A Multi-level Ontology-based Approach for Descriptors of Catalogued
Resources==
Vânia Borges, Natália Queiroz de Oliveira and Maria Luiza Machado Campos
Programa de Pós-Graduação em Informática (PPGI), Universidade Federal do Rio de Janeiro (UFRJ), CEP 21941-901, Rio de Janeiro, RJ, Brazil

Abstract
Technological advances and open data initiatives have increased data availability on the Web. These data, published in heterogeneous repositories, represent valuable sources for exploration and decision-making processes. The FAIR principles brought a special focus on the importance of metadata to effectively interoperate and reuse these data (and metadata). However, interoperating metadata associated with the resources catalogued in repositories is still a challenge, as each repository has its own complex architecture. Vocabularies such as DCAT take a step further in this direction by establishing a common set of elements for catalogue entries. This paper proposes an ontological analysis of DCAT, applying multi-level conceptual modeling to treat ambiguities and enrich semantics. The proposed model aims to make the DCAT structures adhere to the aspects of reality that need to be represented, establishing ontological commitments explicitly, which helps those who will use the model to understand it.

Keywords: Catalogued Resource Descriptors, Multi-Level Modeling, Repositories Interoperability, DCAT.

1. Introduction
Technological advances and open data initiatives have increased the availability on the Web of data from scientific research and government agencies. These data are organized into datasets and catalogued in institutional or thematic repositories, supported by a diversity of platforms. These platforms employ independent components, in large and increasingly complex architectures, with little or no easy interaction among themselves, generating data silos [1].
The publication of the FAIR principles in 2016 highlighted the importance of metadata to promote Findability, Accessibility, Interoperability, and Reuse of data on the Web [2]. In this regard, it is worth emphasizing the evolution of these principles to promote machine-actionable and AI-ready (meta)data, which opens up unprecedented research opportunities and increases reproducibility [3]. These FAIR metadata allow the development of software agents able to act on large volumes of data and accelerate search engines and analysis mechanisms [4]. Aligning metadata elements used by different repositories and software agents is not a trivial task. For interoperability, many repositories adopt protocols such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH; footnote 2) for open exchange and collection of data and metadata, as well as alternative approaches via Application Programming Interfaces (API) for online access and exchange [5]. In general, these protocols and mechanisms for interoperability make use of vocabularies such as Dublin Core (DC; footnote 3) to harvest standardized metadata about datasets.

Proceedings of the 15th Seminar on Ontology Research in Brazil (ONTOBRAS) and 6th Doctoral and Masters Consortium on Ontologies (WTDO), November 22-25, 2022. EMAIL: vjborges30@ufrj.br (Vânia Borges); natalia.oliveira@ppgi.com.br (Natália Q. Oliveira); mluiza@ppgi.com.br (Maria Luiza M. Campos) ORCID: 0000-0002-6717-1168 (Vânia Borges); 0000-0001-8371-142X (Natália Q. Oliveira); 0000-0002-7930-612X (Maria Luiza M. Campos) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Footnote 2: https://www.openarchives.org/pmh/
To improve interoperability between data catalogues on the Web, the Data Catalog Vocabulary (DCAT) has established a lightweight ontology in OWL2 (footnote 4) to describe catalogued datasets and data services, using the Resource Description Framework (RDF; footnote 5) standard [6]. Aligned with other interoperability protocols, DCAT defines the properties of its entities (attributes and relations) with DC terms. In addition, following the FAIR principles, it adds PROV ontology (PROV-O; footnote 6) properties to handle provenance. The DCAT representation, however, can be improved by establishing a well-defined conceptualization of the domain to be treated, in this case, the domain of the catalogued resource descriptors in Web catalogues. To promote this conceptualization, domain ontologies built on foundational ontologies play a special role. They support semantic interoperability at the conceptual level by establishing "contracts" that capture conceptualizations and representations in models or other tools employed to harmonize knowledge [1]. This paper aims to present a well-founded ontological analysis of DCAT, extending its semantics and treating ambiguities. The main contribution of this work is to propose a DCAT-based core ontology grounded on a multi-level foundational ontology. This ontology contributes to a consistent conceptualization for resource descriptors on the Web, comprising structured and standardized metadata. It can provide a canonical model for specific implementations, to be adopted by repositories and by Web resource search engines and analysis mechanisms.
This paper is structured as follows: Section 2 addresses relevant concepts regarding storage spaces for catalogued resources on the Web and introduces the DCAT vocabulary; Section 3 discusses multi-level conceptual modeling and presents the Multi-Level Theory (MLT); Section 4 describes the ontological analysis performed and the resulting model; Section 5 presents an application of the model, based on a specific repository platform; and Section 6 concludes the paper, discussing the results obtained and future work.

2. Background: storage spaces and DCAT vocabulary
In the context of open data management infrastructure on the Web, there is a lack of consensus on the meaning and scope of some terms. This section discusses some of them, aiming for a common understanding before delving deeper into the modeling discussions. In addition, the second subsection introduces the DCAT vocabulary entities and details some relevant aspects.

2.1. Relevant concepts on open data management on the Web
The terms "catalogue" and "repository" have different definitions as Web data management spaces. For this paper, a data catalogue is a curated collection of structured metadata for describing resources and pointing to data resources of interest [7]. It allows users to browse, search and organize these descriptors (metadata) and access the resource [8]. A catalogue usually holds metadata for resources stored elsewhere. In this case, metadata for the same resource may be stored in several catalogues [9]. A repository, on the other hand, organizes and manages research data (resources) [9]. To do this, besides operating as a catalogue, it supports different forms of storage and organization structures, facilitating the curation of resources, and providing mechanisms to access and manage these resources over time.
Digital repositories built with DSpace [10], a platform for creating open-access digital repositories, exemplify what this paper designates as "repository". DSpace establishes mechanisms for storing, curating, and preserving resources associated with metadata collections that enable their discovery and access. In this case, catalogue functions describe the resources stored in the repository. Examples of catalogues include those created with the FAIR Data Point (FAIR DP) solution. FAIR DP is one of the main components of a FAIR Ecosystem. It is software that works as an infrastructure to provide access to resources through standardized metadata, in strict compliance with the FAIR principles, for humans and machines [4]. Its current version is based on DCAT and includes descriptors for the FAIR DP itself, catalogues, datasets, and distributions [11]. Importantly, there are platforms, such as CKAN (footnote 7), that can be used as a repository, as a catalogue, or both at the same time.

Footnotes: 3 https://www.dublincore.org/; 4 https://www.w3.org/TR/owl2-primer/; 5 https://www.w3.org/RDF/; 6 https://www.w3.org/TR/prov-o/

Another distinction that deserves attention is that between "catalogued resource" and "catalogued resource descriptor". The term "catalogued resource" refers to a catalogued object of interest, physical or digital, stored and curated in a repository [12]. The term "catalogued resource descriptor" or "descriptor" refers to a digital object that describes a catalogued resource, contributing to its dissemination. The catalogued resource takes precedence over its descriptor, i.e., the descriptor arises because of it without becoming existentially dependent on it. Hence, a catalogue can allow access to the metadata even when the resource is no longer available. In order to describe a resource, the descriptor employs properties that can be attributes such as "Title" or relations such as "Publisher", which establishes the agent responsible for the publication.
These properties, converted into structured and standardized metadata, constitute the descriptors. Thus, descriptors provide information that facilitates locating, accessing, and understanding the resource, fundamental aspects contributing to its reuse and interoperability. Therefore, considering the FAIR principles, we can assume that an adequately conceptualized set of descriptors, with structured properties associated with standardized vocabularies and ontologies, is essential to describe a FAIR catalogued resource.

Figure 1: Schematic View for Catalogued Resources and their Descriptors (Resources Metadata), employing an adaptation of the DSpace data model and DCAT Entities.

Figure 1 presents a schematic view illustrating catalogued resources and their potential descriptors. The catalogued resources are represented based on the DSpace data model [10]. These resources are related to their respective descriptors, represented by DCAT entities [6], through the "isDescribedBy" relationship. On the left, we observe the base elements of the DSpace model, augmented by the representation of the repository installation designated "DSpace Repository". In this model, a repository is organized in communities that can aggregate different sub-communities. Each "Community" can have one or more collections, and a "Collection" can be part of several communities. The central element of this model is the "Item". The "Item" belongs to a "Collection", which is responsible for it. Furthermore, an "Item" can be included in different collections. Each item consists of one or more "Bundle" elements, each constituted by "Bitstream" elements. The "Bitstream", corresponding to a data file, refers to the actual data stored in their various formats, according to the "BitstreamFormat".

Footnote 7: https://ckan.org/

The right side of Figure 1 depicts a subset of DCAT entities, using the prefix "dcat", to describe DSpace catalogued resources.
Among them, the dcat:Resource entity represents the descriptor of catalogued resources, i.e., a descriptor of resource published or curated by a single agent; the dcat:Catalog defines a descriptor for a catalogue, which describes the DSpace repository itself and the different digital organizational structures for items storage; the dcat:Dataset describes a dataset, which is defined as a "collection of data, published or curated by an agent, available for access or download in one or more formats or serializations" [6]; and the dcat:Distribution describes an accessible form of the dataset, such as a downloadable file. Particularly, in DCAT, a catalogued resource of dataset type is described by the entities dcat:Dataset and dcat:Distribution. The dcat:Dataset entity describes the dataset as a conceptual entity, while dcat:Distribution represents its different serializations for access. Because of this definition, the model in Figure 1 employs dcat:Dataset to describe the “Item”, while dcat:Distribution describes the set of elements enclosed by the blue square, formed by "Bundle", "Bitstream", and "BitstreamFormat". The dcat:Distribution can provide information for these elements, according to the repository rules. 2.2. DCAT Vocabulary To present the DCAT vocabulary, it is important to understand the term "vocabulary" associated with it. "On the Semantic Web, vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern" [13]. Vocabularies that classify terms, characterize relationships, and define some constraints for these terms, implemented with a major focus on computational performance, are called lightweight ontologies [14]. DCAT vocabulary is a lightweight ontology implemented in OWL2 to facilitate interoperability between catalogues. It describes catalogued resources, that is, resources published or curated in a repository, and their relationships. 
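The mapping described for Figure 1, in which an "Item" is described by a dcat:Dataset and its bitstreams by dcat:Distribution descriptors, can be illustrated with a minimal sketch. It is not part of DCAT or DSpace: the classes `Item` and `Bitstream`, the function `describe_item`, and the base URL `https://repo.example.org/` are hypothetical names chosen for illustration, and triples are kept as plain Python tuples rather than a real RDF graph.

```python
from dataclasses import dataclass, field

# Namespace IRIs of the vocabularies mentioned in the text.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Bitstream:
    """Hypothetical stand-in for a DSpace Bitstream and its BitstreamFormat."""
    name: str
    media_type: str


@dataclass
class Item:
    """Hypothetical stand-in for a DSpace Item."""
    handle: str
    title: str
    bitstreams: list = field(default_factory=list)


def describe_item(item, base="https://repo.example.org/"):
    """Return (subject, predicate, object) triples describing one Item.

    The Item is described by one dcat:Dataset; each Bitstream is
    described by one dcat:Distribution linked via dcat:distribution.
    The media type is kept as a plain string for simplicity.
    """
    dataset = f"{base}dataset/{item.handle}"
    triples = [
        (dataset, RDF_TYPE, DCAT + "Dataset"),
        (dataset, DCT + "title", item.title),
    ]
    for bitstream in item.bitstreams:
        dist = f"{dataset}/distribution/{bitstream.name}"
        triples.append((dataset, DCAT + "distribution", dist))
        triples.append((dist, RDF_TYPE, DCAT + "Distribution"))
        triples.append((dist, DCAT + "mediaType", bitstream.media_type))
    return triples
```

Calling `describe_item(Item("123", "Survey data", [Bitstream("data.csv", "text/csv")]))` yields one dataset descriptor with a single distribution descriptor. A real catalogue would serialize such triples with an RDF library and follow the repository's own rules for what a distribution covers.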
Initially created for interoperability among government data repositories [15], it defines concepts using entities and their properties in RDF, describing datasets and data services. In order to be DCAT-compliant, a catalogue must meet the following requirements [6]: (i) organize access to data into datasets, distributions, and data services; (ii) present a descriptor in RDF for the catalogue itself, the catalogued resources and their available distributions, using DCAT entities and properties; (iii) present consistency with the semantics declared in its specifications. In addition to the concepts introduced in 2.1, DCAT includes the concepts that follow. The dcat:DataService is a descriptor for a data service, which is defined as a "collection of operations accessible to provide access to one or more datasets or data processing functions" [6]. It allows the description of service catalogues, i.e., a catalogue of data service descriptors, providing access to the data assets. The dcat:CatalogRecord entity registers the resource descriptor entry in the catalogue, capturing provenance information explicitly, and its use is optional. Finally, there are the entities dcat:Relationship and dcat:Role. Those entities were defined in version 2.0, allowing users to establish relationships between dataset descriptors and catalogued resource descriptors when these are known but not standardized between DC terms or PROV-O properties. Given the purpose associated with the concepts dcat:Relationship and dcat:Role, they can be considered what in multi-level modeling is addressed as a type of types. They allow the creation of types employed in the domain ontologies models, making explicit meaningful relationships between their descriptors. These types are adequately represented by models that adopt more than two levels of stratification, i.e., multi-level models. 3. 
Multi-level conceptual modeling and Multi-Level Theory (MLT) Conceptual modeling represents physical and social world aspects, aiming at their understanding and communication among humans [16]. Conceptual models are usually developed by techniques using a strict stratification of entities into two classification levels: a level of types (or classes) and a level of instances (objects). However, there are specific domains where this kind of conventional stratification does not work, and the representation of types of types (or categories of categories) is necessary to model their conceptualization; for these domains, multi-level conceptual modeling is adopted [17]. Different approaches exist to treat Multi-level Modeling (MLM). In this paper we adopt the MLT developed from the Unified Foundational Ontology (UFO) and employed for conceptual modeling of multi-level types [18]. According to Carvalho and Almeida [19], the MLT is a theory that can be considered a foundation ontology for the conceptual modeling of multi-level types. It formally characterizes the nature of the classification levels and defines the relationships between the elements of those levels [19]. To allow the treatment of types at different levels, it admits entities that simultaneously represent a class and an object. In MLM approaches, these entities are called "Clabject" because they present two facets, one acting as an instance to some entity on the level above and another as a class for entities of a level below [20]. Similar to UFO, MLT distinguishes between universals (types) and individuals. "Types are predicative entities that can be applied to a multitude of entities, including types themselves, while individuals are particular entities" [19]. To address the distinction between types, MLT employs the notion of "Type Order" [19] [20]. These orders, equivalent to levels in other approaches, are organized in a stratified manner and represented by logical constants called "Base Types". 
Relationships are established between the base types and extended to the domain types. MLT defines the primitive "instance of" relation to associate entities with their respective types [19]. The dependency relation "instance of" is employed, as in UML, to indicate that a base type is an instance of another. Based on this structuring, types whose instances are individuals are called first-order types (1stOT). Types whose instances are first-order types are called second-order types (2ndOT), and so on. Special attention is given to entities that are individuals. Under the theory, to be an individual, an entity must be an instance of the base type <Individual> and have no instances of its own. It presents only the object facet, as adopted by the UML. The top of Figure 2 depicts the organization of the base types of the MLT, represented by orange rectangles. The representation adopts UML notation. Thus, associations are used to represent relationships between instances and related types. Dashed arrows are used to define dependency relationships between types. The labels of the relations refer to the names of the applied predicates. In this theory, each type in the represented domain is an instance of precisely one of the higher-order base types (<1stOT>, <2ndOT>, and <3rdOT>) and, at the same time, proper specializes the base type at the next level down. Roughly speaking, the proper specialization relation, defined by MLT, guarantees that if a type t1 specializes a type t2, then t1 and t2 are different types, i.e., the instances of the specialized type are a proper subset of the instances of the general type. The pattern adopted by MLT is depicted in Figure 2. In this case, "Person" is an instance of <1stOT> and proper specializes <Individual>. The instances of "PersonTypeByGender", i.e., "Woman" and "Man", are specializations of "Person". As it applies to "Person", an instance of <1stOT>, "PersonTypeByGender" is an instance of <2ndOT> and proper specializes <1stOT>.
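The notion of type order can be made concrete with a small sketch that derives the order of each entity from the "instance of" relation alone. This is an illustration under simplifying assumptions, not the MLT formalization: the individuals "Ana" and "Bruno" are hypothetical, the dictionary `INSTANCES` and the function `order` are names introduced here, and a type with no declared instances is treated as an individual.

```python
# Extension of each type: the entities that are its direct instances.
# "Ana" and "Bruno" are hypothetical individuals added for illustration.
INSTANCES = {
    "Person": {"Ana", "Bruno"},
    "Woman": {"Ana"},
    "Man": {"Bruno"},
    "PersonTypeByGender": {"Woman", "Man"},
}

# Base type that an entity of a given order instantiates.
BASE_TYPE = {0: "Individual", 1: "1stOT", 2: "2ndOT", 3: "3rdOT"}


def order(entity):
    """Return 0 for individuals, 1 for first-order types, and so on.

    An entity with no instances is an individual. Otherwise all its
    instances must share one order n, and the entity has order n + 1
    (strict stratification).
    """
    members = INSTANCES.get(entity, set())
    if not members:
        return 0
    orders = {order(member) for member in members}
    if len(orders) != 1:
        raise ValueError(f"{entity} has instances of different orders")
    return orders.pop() + 1


def base_type(entity):
    """Name the MLT base type that the entity is an instance of."""
    return BASE_TYPE[order(entity)]
```

With these definitions, `base_type("Person")` gives "1stOT" and `base_type("PersonTypeByGender")` gives "2ndOT", matching the pattern described for Figure 2.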
Figure 2: Example of Instantiation and Specialization between domain elements (adapted from [21]).

MLT defines cross-level structural relations between types of adjacent orders. These relations provide support for the analysis of the powertype notions in the literature. It is important to note that, for MLT, the instances of a powertype are clabjects. The relation named "is powertype of" adopts the concept proposed by Cardelli [22], i.e., a powertype establishes sets associated with the base type. The categorization relation is based on Odell's definition [23], which states that a base type can have more than one powertype. In addition to these notions from the literature, MLT establishes other valuable relations for capturing constraints in multi-level models. This paper uses the powertype and partition relations for interaction between adjacent order types. According to MLT, a type t1 is powertype of a base type t2 if all instances of t1 specialize t2 and all possible specializations of t2 are instances of t1. In this case, instances of t1 are applicable to instances of t2, but t1 does not define classification criteria. Thus, all specializations of t2, including t2 itself, are instances of t1. A type t1 categorizes a type t2 if all instances of t1 are proper specializations of t2. To specialize t2, in this case, the instances of t1 apply a classification criterion. Thus, t2 is not an instance of t1. Other specializations of t2 may exist, based on classification criteria distinct from the one established by the type t1. A type t1 partitions a type t2 if t1 categorizes t2 and each instance of t2 is an instance of exactly one instance of t1. In Figure 2, "PersonTypeByGender" partitions "Person" into "Woman" and "Man". According to the rule, each instance of "Person" is either a "Woman" or a "Man", but not both simultaneously, which establishes that this specialization is disjoint and complete. For more details on MLT, other readings are recommended [18] [19].
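The three cross-level relations can be checked mechanically on a finite example. The sketch below is only an extensional approximation of the MLT definitions: types are compared by their sets of instances, "all possible specializations" is read as "all declared types", and the names `EXT`, `is_powertype_of`, `categorizes`, and `partitions`, the type "Student", and the individuals are hypothetical.

```python
# Instances of each first-order type (hypothetical individuals).
EXT = {
    "Person": {"Ana", "Bruno", "Carla"},
    "Woman": {"Ana", "Carla"},
    "Man": {"Bruno"},
    "Student": {"Ana"},  # hypothetical extra specialization of Person
}


def specializes(t1, t2):
    """t1 specializes t2 if every instance of t1 is an instance of t2."""
    return EXT[t1] <= EXT[t2]


def proper_specializes(t1, t2):
    """Proper specialization: the instances of t1 are a proper subset."""
    return EXT[t1] < EXT[t2]


def is_powertype_of(higher, base):
    """higher (a set of type names) holds exactly the specializations of base."""
    return higher == {t for t in EXT if specializes(t, base)}


def categorizes(higher, base):
    """Every instance of higher proper specializes base."""
    return all(proper_specializes(t, base) for t in higher)


def partitions(higher, base):
    """higher categorizes base, and each instance of base falls under
    exactly one instance of higher (disjoint and complete)."""
    return categorizes(higher, base) and all(
        sum(x in EXT[t] for t in higher) == 1 for x in EXT[base]
    )


# Second-order types, given by their instances (first-order type names).
PERSON_TYPE = {"Person", "Woman", "Man", "Student"}
PERSON_TYPE_BY_GENDER = {"Woman", "Man"}
```

Here `partitions(PERSON_TYPE_BY_GENDER, "Person")` holds, `is_powertype_of(PERSON_TYPE, "Person")` holds because the base type itself is among the instances, and `{"Student"}` categorizes "Person" without partitioning it.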
In addition to organizing types into orders and defining intra-level and cross-level relations, MLT addresses the properties of higher-order types [24]. For MLT, not all properties of higher-order types behave identically. Thus, it employs regularity properties to handle what MLM calls deep instantiation. In deep instantiation, the properties of a higher-order type affect entities at lower levels [22]. According to Almeida et al. [24], the values assigned to a regularity property of a higher-order type (such as second- and third-order types) affect the intents of instances of these types. Through this property type, invariant aspects are established for instances of a type. For example, you can determine a value or constrain the type of assignment for a property by determining its type(s) or a set of allowed types [22]. Another type of property is the direct property [24]. These properties are specific to the higher-order type and are not inherited by its instances. For more details on properties in MLT, refer to [21] [24]. The UFO concepts can be associated with the MLT elements. UFO is a foundational ontology built upon theories from the areas of formal ontology, philosophical logic, philosophy of language, linguistics, and cognitive psychology [25][26]. It establishes a taxonomy to categorize concepts of a domain based on the principles of individuality, rigidity, and dependency, reducing ambiguity, and facilitating understanding. UFO-MLT is obtained when UFO concepts are associated with MLT elements, respecting axioms, structural relations, and patterns [18][27]. It is an approach to develop domain conceptual models that represent types and types of types, adhering to the rules of a foundational ontology [27]. Thus, the UFO taxonomy provides patterns for types of types in the model. These patterns guide the modeler in defining higher-order types and their relationships. 4. 
Ontological foundations for descriptors of catalogued resources in repositories In the context of Information Science, "represent" consists of describing a given information resource according to its features. This description should be sufficient to allow its identification among other resources and its comprehension without requiring access [28]. To provide such a description, DCAT proposes an ontology of descriptors for catalogued resources. Its entities are constituted by metadata from standardized vocabularies that establish the minimum characteristics to describe the resources. This section is responsible for an ontological analysis of these entities, i.e., the associated types/concepts. This analysis emphasizes their ontological nature and establishes the categories of types. Furthermore, as previously mentioned, it provides tools for understanding the parts behavior, employing formal theories such as the theory of essence and identity, parts (mereology), unity and plurality, and dependency [29]. Applying the multi-level foundational ontology to DCAT aims to: (i) allow an analysis of the conceptual nature of terms, identifying their different categories [30]; and (ii) obtain an increase in expressiveness with the treatment of the multiple classification levels, reducing complexity and facilitating understanding [31]. The analysis of the DCAT entities is detailed below. Initially, the study was carried out using the UFO categories. After that, we adopted UFO-MLT for the appropriate treatment of the entities identified as higher-order types. The models presented in this paper were developed using the Visual Paradigm tool, version 16.3, with the OntoUML plugin8. This plugin supports OntoUML modeling by installing the necessary stereotypes and tagged values. In addition, it adds features like a custom user interface, smart diagram coloring, partial and complete model checking, and model transformation [25]. 
It is worth mentioning that OntoUML is a conceptual modeling language whose primitives reflect the ontological micro-theories that comprise UFO [1][25].

4.1. An ontological analysis of DCAT
Analyzing the DCAT entities with UFO allowed categorizing and differentiating them according to their ontological nature. Figure 3 presents a conceptual model of DCAT with the UFO categorization. Due to space limitations, it covers the DCAT entities but not their properties. In DCAT, dcat:Resource represents the catalogued resource descriptor. It gathers the common properties of different resource descriptors. Because it defines properties that are essential to some instances, mainly dataset descriptors, but incidental to others, it is categorized as a «mixin». However, when applying DCAT in a specific catalogue context, it is necessary to configure the concepts for that domain. Thus, we consider that dcat:Resource should be configured with the common (mandatory) properties of the catalogue descriptors. In this case, it is categorized as a «category». A dataset, either a digital or physical object, is described using dcat:Dataset and dcat:Distribution. The dcat:Dataset describes the general characteristics of the object, i.e., a more conceptual view, while dcat:Distribution describes the material part, i.e., its different serializations. A dcat:Dataset is an independent entity that provides the principle of identity, individuation, and persistence for all its instances, in this case, the descriptors of the conceptual features of a dataset. Considering these characteristics, a dcat:Dataset is categorized as a «kind».

Figure 3: Representation of a DCAT excerpt applying UFO.

A distribution refers to the serialization of the dataset for access or transfer. The dcat:Distribution describes these different serializations. In a catalogue, the dcat:Distribution instances are existentially dependent on exactly one dcat:Dataset instance.
Due to this dependency, if a dcat:Dataset instance is removed, all its dcat:Distribution instances are also removed. Furthermore, without the dcat:Dataset instance that describes it, the dcat:Distribution instances can be neither accessed nor understood. Conversely, without a dcat:Distribution instance, essential dataset information is not available. Thus, it is part of the characterization of the dataset. Given these characteristics, it is categorized in UFO as a «mode». Therefore, the dcat:Distribution is existentially dependent on a dcat:Dataset while having its own properties that characterize it. Furthermore, this ontological nature in UFO [25] provides an identification principle to distinguish instances of distribution descriptors representing different serialization formats or even versions of the catalogued dataset.

Footnote 8: https://github.com/OntoUML/ontouml-vp-plugin

According to DCAT, dcat:Catalog is a specialization of dcat:Dataset. Each instance of dcat:Catalog is a set of metadata records that describe catalogued resources, i.e., its instances represent collections of descriptors of datasets, data services, and even other catalogues. The latter are understood here as organizational structures for datasets in a repository. As a specialization responsible for organizing the repository resource descriptors, dcat:Catalog inherits characteristics from dcat:Dataset and presents the rigidity property, i.e., it is essential for all its instances. It is then categorized as a «subkind». The dcat:DataService describes the data services of a repository. Similar to the analysis performed on dcat:Dataset, a dcat:DataService is an independent entity that describes related operations such as discovery, access, or processing functions on data or related resources. It is categorized as a «kind» that provides an identity principle for its instances. The dcat:CatalogRecord describes the registration of a specific dataset or data service descriptor in a catalogue.
This entity is not mandatory and, when used, works as a log for resource descriptor publications. Analysis of its characteristics allows us to identify that this entity is existentially dependent on the catalogued resource descriptor (dcat:Resource) and the catalogue descriptor (dcat:Catalog) where the resource was inserted. Based on this analysis, it is categorized as a «relator», representing the mereological sum of the two entities [32]. Moreover, it originates in the insertion event of a descriptor in the catalogue. Unlike the other entities that describe the resources catalogued in a repository, the dcat:Relationship allows describing qualified relationships between dataset descriptors or from a dataset descriptor to another resource descriptor. It is used when the relationship is known but is not standardized among DC terms or PROV-O properties. The entity employs the dcat:hadRole property to define the role of the relationship. In dcat:Resource, the dcat:qualifiedRelation property associates the source descriptor with a dcat:Relationship. In turn, the dct:relation property in dcat:Relationship identifies the target descriptor of the relation. This entity defines new types of relationship descriptors and is therefore categorized as a higher-order type. The dcat:Role is the entity that establishes the role of one resource descriptor concerning another when it is associated with dcat:Relationship. In order to qualify relationships, DCAT recommends the adoption of controlled vocabularies such as Geographic Information Metadata (ISO-19115-1; footnote 9), Link Relations by the Internet Assigned Numbers Authority (IANA-RELATIONS; footnote 10), DataCite Metadata Schema (footnote 11), and others. Since it establishes the role of an additional relationship to be described and handled in a specific catalogue, similarly to dcat:Relationship, we categorize dcat:Role as a higher-order type.
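A qualified relationship between two descriptors can be sketched as a set of triples. The example follows the DCAT recommendation, in which the resource carrying dcat:qualifiedRelation is the source, the intermediate node is a dcat:Relationship, dct:relation points to the related (target) resource, and dcat:hadRole gives the role. The function `qualified_relation`, all `example.org` IRIs, and the role IRI are hypothetical and introduced only for illustration.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def qualified_relation(source, target, role, node):
    """Return triples linking source to target through a dcat:Relationship.

    source: IRI of the descriptor that carries dcat:qualifiedRelation
    target: IRI of the related descriptor (object of dct:relation)
    role:   IRI of a term from a controlled vocabulary (dcat:hadRole)
    node:   IRI identifying the relationship descriptor itself
    """
    return [
        (source, DCAT + "qualifiedRelation", node),
        (node, RDF_TYPE, DCAT + "Relationship"),
        (node, DCT + "relation", target),
        (node, DCAT + "hadRole", role),
    ]


# Hypothetical descriptors and role term, for illustration only.
triples = qualified_relation(
    source="https://repo.example.org/dataset/report-2022",
    target="https://repo.example.org/dataset/survey-2021",
    role="https://vocab.example.org/role/isDerivedFrom",
    node="https://repo.example.org/relationship/1",
)
```

The role value would normally come from one of the controlled vocabularies mentioned in the text rather than from an ad hoc IRI.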
It is important to note that, associated with prov:Attribution, dcat:Role is also used to describe the function of an agent concerning a resource descriptor, establishing attributions not foreseen in DCAT. This paper focuses on the aspects related to creating new relationship descriptors.

Footnotes: 9 https://www.iso.org/standard/53798.html; 10 https://www.iana.org/assignments/link-relations/link-relations.xhtml; 11 https://schema.datacite.org/

4.2. Categorizing higher-order types
The ontological analysis of DCAT identified the existence of higher-order types. These types provide flexibility to the model, allowing new relationships between types of resource descriptors to be described. As mentioned above, the entity dcat:Relationship holds relationships that cannot be described according to the vocabularies and ontologies standardized by DCAT. When configuring a catalogue for a specific domain, the relationship descriptors (instances) must be represented at the same level that details the different descriptors and their properties. Therefore, the entities dcat:Relationship and dcat:Role should be relocated according to their order as specializations of <1stOT> and instances of <2ndOT>. The appropriate allocation of entities promotes a better understanding of the model. In order to represent the concept associated with dcat:Relationship and establish the relationships between descriptor types, the catalogued resource descriptor type (:CataloguedResourceDescriptorType) and the dataset descriptor type (:DatasetDescriptorType) were defined. According to MLT, they are also proper specializations of <1stOT> and instances of <2ndOT>. The former is powertype of the entity dcat:Resource. As a result, all the specializations of this entity become instances of the :CataloguedResourceDescriptorType, including dcat:Resource. It is classified as . The catalogued data descriptor type (:CataloguedDataDescriptorType) partitions dcat:Resource and has dcat:Dataset and dcat:DataService as instances.
By partitioning an entity categorized as and having entities as instances, it is classified as . According to MLT [19], if a type t1 is powertype of a type t2 and a type t3 also categorizes the type t2, then t3 proper specializes t1. Based on this rule, :CataloguedDataDescriptorType proper specializes :CataloguedResourceDescriptorType. Specialization between types indicates that all instances of the subtype will also be instances of the supertype [21]. The :DatasetDescriptorType proper specializes :CataloguedDataDescriptorType and is the powertype of the dcat:Dataset type, with dcat:Catalog as its instance. For establishing a subset of catalogued data descriptor types, it is classified as . dcat:Relationship is existentially dependent on the catalogued resource descriptor type (:CataloguedResourceDescriptorType) and the catalogued dataset descriptor type (:DatasetDescriptorType). This dependency, which binds entities together through mediation relationships, characterizes it as a relator, and it is twofold. The first dependency refers to the source descriptor type of the relationship. The second refers to the descriptor type related to the first, the target. Importantly, not all resource descriptor types have descriptors for additional relationships. To improve semantics, we add the source descriptor type (:SourceDescriptorType) and the target dataset descriptor type (:TargetDatasetDescriptorType) as specializations of the catalogued resource descriptor type and the dataset descriptor type, respectively. Furthermore, they identify accidental types that have relationship descriptors. Therefore, :SourceDescriptorType is categorized as , enabling any descriptor type as a relationship source, and :TargetDatasetDescriptorType is categorized as . Based on the relation, it establishes as the relationship target the types it instantiates, i.e., dcat:Dataset and any of its specializations.
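The MLT rule quoted above (a powertype t1 of t2 and a type t3 that categorizes t2 imply that t3 proper specializes t1) can be illustrated with a minimal Python sketch. It is not the authors' formalization: it assumes a purely extensional reading in which a higher-order type is represented by the set of its instances, and the helper names (`ancestors`, `powertype_instances`, `categorizes`, `proper_specializes`) are hypothetical.

```python
# Simplified, extensional encoding of the MLT relations discussed in the text.
# All function names are hypothetical; this is an illustration, not MLT itself.

# Direct proper specializations between first-order DCAT types.
SPECIALIZES = {
    "dcat:Dataset": {"dcat:Resource"},
    "dcat:DataService": {"dcat:Resource"},
    "dcat:Catalog": {"dcat:Dataset"},
}
ALL_TYPES = set(SPECIALIZES) | {s for sup in SPECIALIZES.values() for s in sup}


def ancestors(t):
    """Return every type that t properly specializes (transitive closure)."""
    seen, stack = set(), list(SPECIALIZES.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(SPECIALIZES.get(s, ()))
    return seen


def powertype_instances(base):
    """Instances of the powertype of base: base itself plus all its specializations."""
    return {t for t in ALL_TYPES if t == base or base in ancestors(t)}


def categorizes(instances, base):
    """A higher-order type categorizes base if all its instances properly specialize base."""
    return all(base in ancestors(t) for t in instances)


def proper_specializes(sub_instances, super_instances):
    """Extensional reading: the subtype's instances form a strict subset of the supertype's."""
    return sub_instances < super_instances


# :CataloguedResourceDescriptorType is the powertype of dcat:Resource.
resource_descriptor_type = powertype_instances("dcat:Resource")
# :CataloguedDataDescriptorType has dcat:Dataset and dcat:DataService as instances.
data_descriptor_type = {"dcat:Dataset", "dcat:DataService"}

rule_premise = categorizes(data_descriptor_type, "dcat:Resource")
rule_conclusion = proper_specializes(data_descriptor_type, resource_descriptor_type)
```

Under these assumptions, the premise holds for :CataloguedDataDescriptorType and the conclusion follows, because the powertype also contains dcat:Resource itself and dcat:Catalog.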
With these source and target types, dcat:Relationship maintains DCAT compliance, describing the relationship between dataset descriptors or between a resource descriptor and a dataset descriptor. The dcat:Role establishes the role of one resource descriptor in relation to another when it is associated with dcat:Relationship. Its analysis categorizes it as . As a , it can be projected into value spaces, establishing domains of standardized values to be adopted. To represent the distinct functions of dcat:Role, it was specialized into :RelationRole and :AttributionRole, each associated with its own value space. These specializations are categorized as . In this context, the :RelationRole specifies a characterization relationship with dcat:Relationship. Thus, instances of dcat:Relationship will bear the standardized term associated with it. Importantly, when expressing in the model that a quality characterizes a type, the instance of the characterized type will be the bearer of this quality instance [32]. Thus, a dcat:Relationship instance will have a dcat:Role instance associated with it, using the dcat:hadRole property. This instance will express the role of a specific relationship type between catalogued resources of a particular domain. When we establish types to be assigned to this property, we define what is called a regularity property in MLT [21]. That is, the property establishes the values to be accepted for its instances.
Figure 4 shows an extract of the MLT-UFO-based DCAT model, employing MLT base types to handle type orders, and UFO categorization. The “instanceOf” edges between domain types and the <1stOT> and <2ndOT> base types have been omitted for legibility reasons. Specializing the <Individual> base type we have dcat:Resource, dcat:Dataset, dcat:Distribution, dcat:DataService, dcat:Catalog, and dcat:CatalogRecord.
Specializing the <1stOT> base type we have :CataloguedResourceDescriptorType, :CataloguedDataDescriptorType, :DatasetDescriptorType, dcat:Role and their specializations, and dcat:Relationship. Entities with highlighted borders were included in the model to enhance semantics. We observe two important contributions by employing a model that handles types of types. The first is for the catalogue manager, who will have a model for the governance of the descriptor types, as well as their relationships. Through this model, it is possible to define conformity restrictions that users must respect when publishing descriptors in the catalogue. The second refers to the availability of the RDF model associated with the data, which gives agents (humans or machines) knowledge of the catalogue structure and the standardized terms adopted, contributing to interoperability mechanisms. Therefore, the second-order domain types establish, together with DCAT concepts, a language structure for understanding the catalogue organization. As an example, we can cite the :DatasetDescriptorType. This type registers the different types of dataset descriptors associated with the catalogue. Thus, besides dcat:Catalog, a specific domain may find it relevant to define descriptor types that differentiate datasets that handle databases from those that describe data files. The defined types are instances of :DatasetDescriptorType that specialize the dcat:Dataset type. The value spaces for dcat:Role also deserve attention. By defining these spaces, the manager can establish vocabularies and ontologies for its catalogue, providing compliance rules for new publications. In addition, these standards are available for agent access, contributing to understanding the catalogue structure.
Figure 4: Extract from the Multi-level Model for DCAT.
It should be noted that the associated modeling profile provides additional metaproperties to the DCAT model.
Using languages like OntoUML, axioms are generated from these metaproperties and incorporated into the specification results, restricting the interpretation of the terms to the desired reality.
5. Employing MLT-DCAT for a repository
This section provides an overview of applying the MLT-DCAT model to define resource descriptors for a catalogue representing a DSpace repository. Due to space limitations, we exemplify only the treatment of an existing relationship between resources in the repository. In the model in Figure 5, some types and relations have been omitted to emphasize the new domain-specific elements added to the model. These elements have highlighted borders. Figure 5 presents a relationship descriptor instantiated from dcat:Relationship to fit a specific domain, in this example, to describe the association of an item (dataset) with its owner catalogue. This association is illustrated in the DSpace model in Figure 1. Thus, the model addresses the :Relation_isOwner between the dataset descriptor and the catalogue descriptor that owns it, in compliance with a catalogue that represents catalogued resources in a DSpace repository. This relationship is the bearer of the IANA_relator:own term, conforming to the value spaces defined for :RelationRole, i.e., terms from the IANA relators and ISO-19115-1 vocabularies. At the bottom of Figure 5, we find :Relation_isOwner as the instance of dcat:Relationship. It has the IANA_relator:own term associated through the dcat:hadRole property. In addition, it defines instances of :SourceDescriptorType and :TargetDatasetType as valid types for the range of the dct:relation and dcat:qualifiedRelation properties, respectively. Finally, for the management of relationship types, dct:issued and dct:modified have been added as direct properties of the type. This type is instantiated in the model as :Relation_isOwner, a relator that associates a dcat:Dataset with its owner collection (dcat:Catalog).
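As a rough illustration of how a relationship type such as :Relation_isOwner could constrain what gets published, the following Python sketch checks a role term against a value space and a descriptor pair against the declared source and target types. The class and function names are hypothetical, the value space lists only the one term named in the text (a real one would hold the adopted IANA and ISO-19115-1 terms), and the choice of dcat:Dataset as source and dcat:Catalog as target is an assumption made for the example.

```python
from dataclasses import dataclass

# Assumed value space for :RelationRole; only IANA_relator:own is named in the text.
RELATION_ROLE_VALUE_SPACE = frozenset({"IANA_relator:own"})


@dataclass(frozen=True)
class RelationshipType:
    """Hypothetical stand-in for an instance of dcat:Relationship at the type level."""
    name: str
    had_role: str      # term carried through dcat:hadRole
    source_type: str   # assumed instance of :SourceDescriptorType
    target_type: str   # assumed instance of :TargetDatasetType

    def __post_init__(self):
        # Regularity check: the role must come from the declared value space.
        if self.had_role not in RELATION_ROLE_VALUE_SPACE:
            raise ValueError(f"{self.had_role!r} is outside the :RelationRole value space")


@dataclass(frozen=True)
class Descriptor:
    """A catalogued resource descriptor with its (single) descriptor type."""
    iri: str
    descriptor_type: str


def conforms(rel_type, source, target):
    """True when the descriptor pair matches the relationship's source and target types."""
    return (source.descriptor_type == rel_type.source_type
            and target.descriptor_type == rel_type.target_type)


is_owner = RelationshipType(":Relation_isOwner", "IANA_relator:own",
                            source_type="dcat:Dataset", target_type="dcat:Catalog")
item = Descriptor("ex:COVID_19_Hospital1_Item_Descriptor", "dcat:Dataset")
collection = Descriptor("ex:COVID_19_Collection_Descriptor", "dcat:Catalog")

ok = conforms(is_owner, item, collection)
```

A catalogue manager could run a check of this kind at submission time, rejecting relationship descriptors whose role term or descriptor types fall outside what the model declares.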
The relationship is represented between the COVID-19 data collection descriptor (ex:COVID_19_Collection_Descriptor) and the COVID-19 item descriptor from Hospital1 (ex:COVID_19_Hospital1_Item_Descriptor).
Figure 5: Example of the MLT-UFO-DCAT model representing the Owner relationship for a DSpace repository.
To illustrate the benefits of modeling, based on the contributions presented in the previous section, we provide some competency questions (CQ) answered with the model in Figure 5.
CQ1. Which types of catalogued resource descriptors are handled in the catalogue? It considers the instances of :CataloguedResourceDescriptorType.
CQ2. Which types of descriptors are specific to datasets? It considers the instances of :DatasetDescriptorType.
CQ3. Which descriptor types are related to dataset descriptor types? It refers to the instances of :SourceDescriptorType.
CQ4. Which are the additional relationships employed in the catalogue? They are the instances of dcat:Relationship.
CQ5. Which types of descriptors does each additional relationship associate? They are obtained by analyzing the pairs of :SourceDescriptorType and :TargetDatasetType instances defined for each instantiated dcat:Relationship.
CQ6. What are the patterns established for the roles played by the additional relationships? It refers to the value spaces established for :RelationRoleValue.
The presented CQs assist administrators in managing the catalogue. For agents (human and machine), they allow checking descriptor types and acceptable standards before submission for publication. They could also be applied when publishing a descriptor. As previously mentioned, the model, in RDF, would make available the structure adopted in the catalogue, as well as the standardized terms (semantic artifacts). In the example of Figure 5, we highlight the standardization for the creation of new relationships, with the definition of value spaces associated with two standardized vocabularies (IANA Relations and ISO-19115-1).
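To make the competency questions more concrete, here is a minimal Python sketch that answers CQ1, CQ2, CQ4 and CQ5 over a handful of in-memory statements. It is only an illustration: the statements are a toy reading of the Figure 5 example, the predicate names `sourceType` and `targetType` are hypothetical, and an actual deployment would more likely query the RDF model with SPARQL.

```python
# Toy statements (subject, predicate, object) loosely mirroring the Figure 5 example.
# "instanceOf" follows the edge name used in the text; "sourceType" and "targetType"
# are hypothetical predicates introduced only for this sketch.
STATEMENTS = [
    ("dcat:Resource", "instanceOf", ":CataloguedResourceDescriptorType"),
    ("dcat:Dataset", "instanceOf", ":CataloguedResourceDescriptorType"),
    ("dcat:DataService", "instanceOf", ":CataloguedResourceDescriptorType"),
    ("dcat:Catalog", "instanceOf", ":CataloguedResourceDescriptorType"),
    ("dcat:Dataset", "instanceOf", ":DatasetDescriptorType"),
    ("dcat:Catalog", "instanceOf", ":DatasetDescriptorType"),
    (":Relation_isOwner", "instanceOf", "dcat:Relationship"),
    (":Relation_isOwner", "sourceType", "dcat:Dataset"),
    (":Relation_isOwner", "targetType", "dcat:Catalog"),
    (":Relation_isOwner", "dcat:hadRole", "IANA_relator:own"),
]


def instances_of(type_name):
    """Sorted subjects declared as instances of type_name."""
    return sorted(s for s, p, o in STATEMENTS if p == "instanceOf" and o == type_name)


def objects(subject, predicate):
    """Sorted objects of all statements with the given subject and predicate."""
    return sorted(o for s, p, o in STATEMENTS if s == subject and p == predicate)


def relationship_signatures():
    """Map each additional relationship to its (source types, target types) pair."""
    return {rel: (objects(rel, "sourceType"), objects(rel, "targetType"))
            for rel in instances_of("dcat:Relationship")}


cq1 = instances_of(":CataloguedResourceDescriptorType")  # descriptor types handled
cq2 = instances_of(":DatasetDescriptorType")             # dataset-specific types
cq4 = instances_of("dcat:Relationship")                  # additional relationships
cq5 = relationship_signatures()                          # types each one associates
```

Because the higher-order types are ordinary nodes in the data, each question reduces to a lookup of instances, which is what makes the catalogue structure inspectable by software agents.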
For software agents, the second-order domain types establish, together with the DCAT concepts, a language structure for understanding the catalogue organization, contributing to interoperability mechanisms. These types may also be employed by catalogue management software (or data platform software), allowing new types to be included dynamically. An example would be the addition, by the manager, of two new types of dataset descriptor. The first would handle databases (:DatabaseDataset) and the second data files (:DataFileDataset). As instances of the :DatasetDescriptorType type, they would specialize dcat:Dataset, keeping the rules established by its powertype. Summarizing this section, modeling with categorized types establishes conformance constraints that domain users must respect. In the ontology-based model of Figure 5, it is possible to explicitly identify the description of a domain-specific relationship and the value space employed to compose these relationship descriptors. Furthermore, the adoption of standardized value spaces for relationship roles (i) establishes restrictions to be respected when publishing descriptors, and (ii) clarifies the standards adopted by the catalogue. These aspects contribute to interoperability between repositories by providing means to discover, access, and understand the stored data.
6. Conclusion and future works
As discussed in this paper, the proliferation and use of diverse platforms for digital repositories with differentiated infrastructure for data storage demand special attention to interoperability. To this end, these platforms are highly dependent on metadata to support the discovery, access, and interoperability of their catalogued resources. Analyzing these metadata allows us to identify that they constitute a domain of descriptors that need a consistent conceptualization, explicitly and formally represented, aiming at sharing the concepts involved.
This paper presented an ontological review of DCAT and proposed an extension of its model by applying foundational ontologies. These ontologies clarify the nature of concepts and improve their semantics. As a result, they make explicit the information structures employed for representation. According to Guizzardi [1], the quality of an information system is associated with how truthful its information structures are concerning the reality aspects it needs to represent. Thus, by applying MLT combined with UFO to DCAT, we add a formal ontological structure that defines aspects independent of their particular nature to its types. These aspects help modelers better understand domain concepts, providing ambiguity handling and defining constraints to better align them to reality. With its treatment of higher-level types, MLT has provided a new perspective on DCAT concepts, adding types that assist with their understanding. Moreover, the reorganization and the new types can be employed by software for catalogue management and access to their descriptors. According to Kühne [33], if we interpret an ontological metamodel as a domain specific language definition, we can consider it as a linguistic metamodel. In this context, the higher-order types of the proposed model could be treated as a linguistic metamodel that makes explicit the specializations presented in DCAT and possible extensions to a specific domain. Thus, ontologies based on this canonical model can extend the interoperability mechanisms of digital platforms to repositories, maintaining compliance with protocols such as OAI-PMH and making RDF information available to search engines and retrieval agents. In this way, the improvement of model structures is reflected in the quality of the repository software and interoperability mechanisms that use it, providing: (i) a representation that handles ambiguities; and (ii) management of constraints that promote the proper functioning of solutions. 
As future work, we are studying the representation of the attributes and relations associated with DCAT entities with MLT. In parallel, we are considering a treatment to distinguish different forms of dataset representation. This explicit distinction will allow the catalogue to manage datasets with equivalent or non-equivalent distributions, avoiding semantic overload. Thus, the model will contribute to the representation of datasets in institutional repositories working with different communities and forms of data treatment. Furthermore, from the generated model, we intend to implement an operational ontology with the gUFO ontology [34], a lightweight implementation of UFO, and to establish a FAIR DP with native conformance.
7. Acknowledgements
This work has been partially supported by CAPES student grants (Process numbers 223038.014313/2020-19 and 88887.613048/2021-00).
8. References
[1] G. Guizzardi. Ontology, ontologies and the “I” of FAIR. Data Intelligence, v. 2, n. 1-2, p. 181-191, 2020. DOI: 10.1162/dint_a_00040.
[2] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, et al. Comment: The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data, v. 3, n. March, 2016. DOI: 10.1038/sdata.2016.18.
[3] B. Mons. Invest 5% of research funds in ensuring data are reusable. Nature, v. 578, n. 7796, p. 491-491, 2020. DOI: https://doi.org/10.1038/d41586-020-00505-7.
[4] L. O. B. Santos, et al. FAIR data points supporting big data interoperability. Enterprise Interoperability in the Digitized and Networked Factory of the Future. ISTE, London, p. 270-279, 2016.
[5] C. C. Austin, S. Brown, N. Fong, C. Humphrey, A. Leahey, P. Webster. (2016). Research data repositories: review of current features, gap analysis, and recommendations for minimum requirements. IASSIST Quarterly, 39(4), 24-24.
[6] R. Albertoni, D. Browning, S. Cox, A. G. Beltran, A. Perego, P. Winstanley. Data Catalog Vocabulary (DCAT) - Version 2. W3C Recommendation. 4 February 2020.
URL: https://www.w3.org/TR/vocab-dcat-2/
[7] H. Sheridan et al. Data Curation through Catalogs: A Repository-Independent Model for Data Discovery. Journal of eScience Librarianship, v. 10, n. 3, p. 4, 2021.
[8] D. Connolly. Catalogs: Resource Description and Discovery. W3C. 24 February 2014. URL: https://www.w3.org/Search/catalogs.html
[9] UMU. Storage, catalogue, repository and archive - what's the difference? Umeå University, [n.d.]. Available at: https://www.umu.se/en/library/research-data/specialised-topics/storage-catalogue-repository-and-archive/
[10] T. Donohue. DSpace 7.x Documentation, Feb 03, 2022. URL: https://wiki.lyrasis.org/display/DSDOC7x/DSpace+7.x+Documentation.
[11] L. O. Bonino; K. Burger; R. Kaliyaperumal. FAIR DATA POINT – Version 1.1. Jun 29, 2022. URL: https://specs.fairdatapoint.org/.
[12] C. Lagoze. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm, 2002.
[13] World Wide Web Consortium (W3C). Semantic Web - Vocabularies. 2015. https://www.w3.org/standards/semanticweb/ontology.
[14] G. Guizzardi. On ontology, ontologies, conceptualizations, modeling languages, and (meta)models. In: Frontiers in Artificial Intelligence and Applications, Databases and Information Systems IV, IOS. 2007.
[15] F. Maali, R. Cyganiak, V. Peristeras. Enabling interoperability of government data catalogues. In: International Conference on Electronic Government. Springer, Berlin, Heidelberg, 2010. p. 339-350.
[16] J. Mylopoulos. Conceptual modelling and Telos. Conceptual Modeling, Databases, and CASE: An integrated view of information systems development, p. 49–68, 1992.
[17] F. Brasileiro, J. P. A. Almeida, V. A. Carvalho, et al. "Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata", p. 975–980, 2016. DOI: 10.1145/2872518.2891117.
[18] V. A. Carvalho. Foundations for Ontology-based Multi-level Conceptual Modeling.
Doctoral dissertation - Universidade Federal do Espírito Santo, Brazil, 2016.
[19] V. A. Carvalho; J. P. A. Almeida. Toward a well-founded theory for multi-level conceptual modeling. Software & Systems Modeling, v. 17, n. 1, p. 205-231, 2018. DOI: 10.1007/s10270-016-0538-9.
[20] A. Rossini, J. de Lara, E. Guerra, et al. A comparison of two-level and multi-level modelling for cloud-based applications. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), v. 9153, p. 18–32, 2015. DOI: 10.1007/978-3-319-21151-0_2.
[21] C. M. Fonseca, et al. Multi-level conceptual modeling: Theory, language and application. Data & Knowledge Engineering, v. 134, p. 101894, 2021.
[22] L. Cardelli. Structural subtyping and the notion of power type. In: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 1988. p. 70-79.
[23] J. Odell. Power types. Journal of Object-Oriented Programming, 7(2), pp. 8-12, 1994.
[24] J. P. A. Almeida, V. A. Carvalho, C. M. Fonseca, G. Guizzardi. (2021). A Note on Properties in Multi-Level Modeling. In 2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C) (pp. 497-501). IEEE.
[25] G. Guizzardi, et al. Types and taxonomic structures in conceptual modeling: A novel ontological theory and engineering support. Data & Knowledge Engineering, v. 134, p. 101891, 2021.
[26] G. Guizzardi, et al. UFO: Unified Foundational Ontology. Applied Ontology, n. Preprint, p. 1-44, 2021.
[27] V. A. Carvalho, J. P. A. Almeida, C. M. Fonseca, G. Guizzardi. Multi-level ontology-based conceptual modeling. Data & Knowledge Engineering, v. 109, p. 3-24, 2017.
[28] K. Tomoyose. O Data Catalog Vocabulary (DCAT) para a publicação de dados de pesquisa nos princípios Linked Data. Master’s thesis, Universidade Federal de São Carlos, São Carlos, SP, Brazil. 2021. https://repositorio.ufscar.br/handle/ufscar/14116.
[29] J. L. R. Moreira et al. Towards findable, accessible, interoperable and reusable (FAIR) data repositories: improving a data repository to behave as a FAIR data point. Liinc em Revista, v. 15, n. 2, 2019.
[30] V. S. Silva, M. L. M. Campos, et al. An Approach for the Alignment of Biomedical Ontologies based on Foundational Ontologies. Journal of Information and Data Management. October, v. 2, n. 309307, p. 557–572, 2011.
[31] M. Igamberdiev, G. Grossmann, M. Selway, et al. An integrated multi-level modeling approach for industrial-scale data interoperability, Software and Systems Modeling, v. 17, n. 1, p. 269–294, 2018. DOI: 10.1007/s10270-016-0520-6.
[32] C. M. Fonseca, et al. Relations in ontology-driven conceptual modeling. In: International Conference on Conceptual Modeling. Springer, Cham, 2019. p. 28-42.
[33] T. Kühne. Matters of (meta-) modeling. Software & Systems Modeling, v. 5, n. 4, p. 369-385, 2006. DOI: 10.1007/s10270-006-0017-9.
[34] J. P. A. Almeida, G. Guizzardi, R. A. Falbo, T. P. Sales. gUFO: a lightweight implementation of the Unified Foundational Ontology (UFO), 2019.