CoMoDID: Combining explainable artificial
                                intelligence and conceptual modeling for data
                                intensive-domains management
                                Oscar Pastor1,∗,† , Diana Martínez Minguet1,† , Jose Fabián Reyes Román1,† ,
                                Alberto García S.1,† , Ana Leon1,† , Mireia Costa1,† and Ferran Pla1,†
                                1
                                  Valencian Research Institute for Artificial Intelligence (VRAIN). Universitat Politècnica de València, Camí de Vera S/N,
                                Valencia, 46022, Spain


                                                                      Abstract
                                                                      The large and heterogeneous data sets that characterize Data-Intensive Domains (DID) pose a challenge
                                                                      to developing data analysis and management approaches. A successful and efficient data knowledge
                                                                      extraction from DID-based systems is determined by assembling and analyzing such data sets, but
                                                                      integrating their different sources is arduous work. Finding sound solutions for this problem has become
                                                                      a relevant research goal, that existing DID-based systems are not solving in a final, convincing way. To
                                                                      solve this problem, a conceptual characterization of the data sets that constitute DID-based systems is
                                                                      essential. The use of foundational ontologies and conceptual modeling provides an adequate strategy to
                                                                      face the complexity of this problem by clarifying the data structure that is to be analyzed and managed.
                                                                      In this project we tackle this principle, by defining a method grounded on a conceptual model to
                                                                      develop efficient DID-based systems, and by making use of a well-grounded combination of Explainable
                                                                      Artificial Intelligence (XAI) and Machine Learning (ML) techniques to perform data analytics. In addition,
                                                                      the characterization of a platform for the implementation of the method is going to be designed and
                                                                      developed. The project’s chosen domain of application is genomics, specifically in predicting critical
                                                                      diseases before symptoms manifest. Leveraging XAI and ML with genomic information can contribute
                                                                      to the advancement of precision medicine, allowing for the prediction of future diseases based on the
                                                                      available genomic data. The ML dimension will cover the predictive knowledge (is a disease present in a
                                                                      patient?), while the XAI dimension will deal with the explainable part (why the patient has the disease).

                                                                      Keywords
                                                                      Data-Intensive Domains, Conceptual Modeling, Explainable Artificial Intelligence, Precision Medicine


                                ER2023: Companion Proceedings of the 42nd International Conference on Conceptual Modeling: ER Forum, 7th SCME,
                                Project Exhibitions, Posters and Demos, and Doctoral Consortium, November 06-09, 2023, Lisbon, Portugal
                                ∗
                                    Corresponding author.
                                †
                                    These authors contributed equally.
                                Envelope-Open opastor@dsic.upv.es (O. Pastor); dmarmin@pros.upv.es (D. Martínez Minguet); jreyes@pros.upv.es (J. F. Reyes
                                Román); algarsi3@pros.upv.es (A. García S.); aleon@vrain.upv.es (A. Leon); miscossan@pros.upv.es (M. Costa);
                                fpla@dsic.upv.es (F. Pla)
                                Orcid 0000-0002-1320-8471 (O. Pastor); 0009-0002-3191-1969 (D. Martínez Minguet); 0000-0002-9598-1301 (J. F. Reyes
                                Román); 0000-0001-5910-4363 (A. García S.); 0000-0003-3516-8893 (A. Leon); 0000-0002-8614-0914 (M. Costa);
                                0000-0003-4822-8808 (F. Pla)
                                                                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                 CEUR
                                 Workshop
                                 Proceedings
                                               http://ceur-ws.org
                                               ISSN 1613-0073
                                                                    CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
1. Introduction
Data has become an invaluable asset in today’s society, and its production is unparalleled,
continually increasing. This presents significant challenges for modern software platforms,
which must store, analyze, and quickly provide access to data for numerous users. Consequently,
various research fields related to data management and processing have undergone profound
transformations[1]. One of the most current, relevant challenges in the software development
context is dealing with DID-based systems, which require extensive and heterogeneous datasets
[2] to create knowledge from data. To develop effective and efficient methods and facilities for
data analysis and management, software developers must integrate complex, distributed, and
heterogeneous datasets from increasingly diverse data-generating technologies (e.g., sensors, the
internet, genome sequencing machines, and other sophisticated devices). Therefore, managing
this massive amount of data to find the most critical and actionable pieces of knowledge has
become a significant challenge.

   A fascinating example of DID-based systems are those that analyze the human genome [3].
Understanding the human genome is a significant scientific challenge, requiring the applica-
tion of sound conceptual modeling techniques to manage such complex systems adequately.
The continuous generation of genomic data from improved sequencing technologies [4, 5, 6]
necessitates selecting the right data management strategy for software platforms. Developing
software systems to deal with these DID are key for a proper genome analysis that would lead
to anticipating future illness in the human population [7].

  To address these issues, this proposal will be grounded on an interdisciplinary scientific
policy especially interested in combining two strong lines of research: conceptual modeling
(CM) [8] and explainable artificial intelligence (XAI) [9]. To this aim, two main components
need to be explored, designed, and developed: i) A method to deal with DIDs problem’s man-
agement (the methodological perspective) correctly and efficiently, and ii) a “materialization” of
the method in the form of a platform intended to assess the solution’s value in a challenging
and specially selected DID as the one related to the understanding of the human genome (the
practical perspective).

   In this scenario, applying a methodological framework based on XAI and CM to address
DIDs concerns effectively becomes a relevant, promising strategy that forms the basis of the
scientific approach used to achieve the project’s major goal. On the one hand, CM is recognized
as crucial for developing data-oriented computer systems, ensuring an accurate representation
of the application domain independently of the system that will be developed to address a
real-world problem. This is especially relevant when we want to “understand data” in a DID
context, which in our case applies to genomics. On the other hand, there is the application
of XAI principles [10, 11], which describe a system in which humans can easily understand
the results that an AIsystem provides, focusing primarily on understanding exactly “how” and
“why” decisions are taken to reach results [9, 12]. For DID-based systems, where the right
representation of concepts becomes a crucial step, CM becomes the perfect partner for a useful
XAI application [10] since by visualizing the relevant concepts, the structure of meaning people
use to understand the domain is clearly represented.
  Our approach -both methodological (a method) and practical (a platform for the genomics
domain)- is based on the group’s expertise [13, 14], focusing on understanding data’s true nature,
employing CM techniques, and addressing challenges such as data volume and processing.


1.1. Details of the project
The project combines XAI and CM for Data Intensive-Domains Management (CoMoDID). It is a
four-year project (Sept. 2022 – Dec. 2025). Currently, the Research team is constituted by Óscar
Pastor López, Juan Carlos Casamayor Ródenas, Tanja E. Vos, Lluís-F. Hurtado, Encarna Segarra,
Ferran Pla, Fernando García Granada, José F. Reyes Román (Postdoctoral Researcher), Alberto
García Simón (Postdoctoral Researcher) and Diana Martínez Minguet (Predoctoral Researcher),
in collaboration with the Genomics Team of the PROS research group. The project is supported
by the Generalitat Valenciana through the CIPROM/2021/023 project.


2. Project goals, tangible outputs & expected outcomes
The research proposed in this project focuses on the design of solutions for DIDs problems since
existing frameworks to build DID-based systems lack a sound conceptual modeling grounding,
and too frequently, ad-hoc implementations are built. Both the method and the platform to
materialize the solution in order to show how it works for a selected DID conform to the two
major objectives of this project:

    • Definition of a general method (the so-called DELFOS method (Figure 1), as the method
      to be used for the CoMoDID project) applicable to any DID for facing its analysis and
      design, which is based on a sound combination of conceptual modeling techniques and
      XAI technologies.
    • Development of a technological platform, that will instantiate and support the method in
      a particularly challenging and complex DID context: the genomic domain.

  To address the methodological and practical components of the approach, we break these
objectives into specific goals (G) with associated work packages (WPs) which the tangible
outputs obtained so far result from.

2.1. Specific goals for a general method
Figure 1 represents the different components that constitute the approach proposed by the
DELFOS method, which is the adopted name for the method we aim to develop in this project
proposal. The two specific goals to achieve our objective are:

 G1. Ontological characterization of DIDs. (WP1: Ontological characterization of DIDs.):
     The study and analysis of existing foundational ontologies related to DID characterization
     in conjunction with the state of the art about existing solutions for the development of
     DID platforms are reflected in the doctoral thesis:
Figure 1: Components of the proposed DELFOS method. The first component oversees the data
extraction and integration, based on the requirements represented in a conceptual model. With data
structured and ready to be processed, the second component applies the most suitable XAI and/or ML
techniques to satisfy knowledge requirements. The third component stores classified data using the
adequate technology and data schema according to the analytical requisites. Stored data is used to build
specific tools (fourth component), based on interaction models, that allow the extraction of knowledge.


         – García Simón, A. (2022). Understanding the Code of Life: Holistic Conceptual
           Modeling of the Genome. Universitat Politècnica de València. https:// doi.org/ 10.
           4995/ Thesis/ 10251/ 191432
      A preliminary definition of a foundational ontology for the development of DID platforms
      is published in:
         – Bernasconi, A., Guizzardi, G., Pastor, O., & Storey, V. C. (2022). Semantic interoper-
           ability: ontological unpacking of a viral conceptual model. BMC Bioinformat-
           ics, 23(11), 1-23. DOI: https:// doi.org/ 10.1186/ s12859-022-05022-0
 G2. Integration of XAI techniques for data management and exploitation. (WP2:
     Integration of XAI techniques for data management): Focusing on the case study of the
     genomic domain as one of the major DIDs that currently exist, a study and statistical
     comparison of different data sources with information associated with two groups of
     diseases: cancer and heart disease has been carried out. The results of these studies have
     been reported in:
         – Costa, M., García S, A., & Pastor, O. (2022). A Comparative Analysis of the
           Completeness and Concordance of Data Sources with Cancer-Associated
           Information. In International Conference on Conceptual Modeling (pp. 35-44).
           Springer, Cham. DOI: https:// doi.org/ 10.1007/ 978-3-031-22036-4_4
         – Costa, M., García S, A., & Pastor, O. (2022). Conceptual Modeling-Based Car-
           diopathies Data Management. In International Conference on Conceptual Modeling
           (pp. 15-24). Springer, Cham. DOI: https:// doi.org/ 10.1007/ 978-3-031-22036-4_2
      In order to replicate some of the criteria used by clinicians for genomic data to streamline
      the process of selecting relevant data by reducing the effort of manual activities per-
      formed by experts, different XAI techniques have been established according to the data
      collections to be analyzed, and different data storage and integration techniques have
      been evaluated.
         – García S, A., Costa, M., Leon, A., & Pastor, O. (2022). The challenge of managing the
           evolution of genomics data over time: a conceptual model-based approach.
           BMC Bioinformatics, 23(11), 1-33. DOI: https:// doi.org/ 10.1186/ s12859-022-04944-z
      (WP4: Analysis and design of a tool to help with the writing of medical case reports in
      a genomic domain): As a first draft, a Deep Learning (Transformers) model has been
      trained for a multi-class, multi-label classification task of radiology medical reports using
      Transfer Learning techniques:
         – Contreras-Ochando, L., León, A., Hurtado, L.F., Pla, F., Segarra, E. (2023) Enhancing
           Precision Medicine: An Automatic Pipeline Approach for Exploring Genetic
           Variant-Disease Literature The 4th International Workshop on Conceptual Modeling
           for Life Sciences (CMLS) @ER202 (Under revision)

2.2. Specific goals for a technological platform
The instantiation of the method in the Genomics domain aims to validate that the method is
a very complex DID, to provide a technological platform to collect, manage and analyze the
generated data in practical settings in order to improve the understanding of the human genome
challenge, and ultimately to obtain relevant value by the extraction of knowledge from the data.
To achieve these objectives, the following goals are defined:

 G3. Definition of the interaction mechanisms for DID-based systems. (WP3: Interaction
     Definition for DID-based Systems): The analysis of the interaction requirements for DID-
     based systems and the elicitation of such requirements to design sustainable interfaces
     are developed in:
         – Bernasconi, A., García S, A., Ceri, S., & Pastor, O. (2022). A Comprehensive
           Approach for the Conceptual Modeling of Genomic Data. In International
           Conference on Conceptual Modeling (pp. 194-208). Springer, Cham. DOI: https:
           // doi.org/ 10.1007/ 978-3-031-17995-2_14
         – García, A., Costa, M., León, A., Reyes, J. F., & Pastor, Ó. (2023). Human-Centered De-
           sign for the Efficient Management of Smart Genomic Information. In Proceed-
           ings of the 18th International Conference on Evaluation of Novel Approaches to Software
           Engineering-ENASE (pp. 1-12). DOI: https:// www.doi.org/ 10.5220/ 0011635800003464
         – Pastor, Ó., León Palacio, A., Panach, J. I., García, A., Costa, M. & Reyes Román, J. F.
           (2023). Usability Evaluation of a Method to Analyze Data Intensive Domains.
           Multimedia Tools and Applications (To be published).

Remaining goals to be tackled are G4. Development of a platform to support the DELFOS
method and G5. DELFOS method and platform validation, with the associated homony-
mous word packages (WP5 and WP6). This WPs leverage on all the previous ones which are in
the process of improvement and development.

  Finally, WP7: Communication, dissemination and exploitation of the results, is ubiquitous and
ensures consistent dissemination, visibility and outreach to all relevant stakeholders.

   The expected outcomes of the project are the development of a solid method that can be
applied to any complex DID, and the implementation of a platform that enables the instantiation
of the method for a particular DID-based system (genomics), providing the technological support.


3. Relevance for ER
The proposed project is aligned with several research topics relevant to the conceptual mod-
eling community. It is highly relevant to the topics of Ontological and cognitive foundations
and Semantics in conceptual modeling since the incorporation of foundational ontologies and
conceptual modeling in the project contributes to a solid theoretical foundation concerning
DID-based systems, in combination with the project’s focus on developing standardized ap-
proaches for data integration and analysis which involves addressing semantic aspects. In the
same line, the project is relevant to the topic of Complex management of large conceptual
models, given that the project addresses the challenge of managing large and heterogeneous
data sets in complex DID-based systems.

   In another direction, the project aims to develop a method and platform to automate the
development of DID-based systems, including data modeling. In this context, using Artificial
Intelligence is useful for optimizing and automatizing data analysis. However, in the Precision
Medicine field, where the practical instantiation of the project is embedded, the necessity of
transparency and minimization of uncertainties is essential for the resulting decisions to be
explainable. XAI satisfies these requirements, thus being suitable for data analysis counceling.
The use of XAI and ML techniques involves knowledge representation and reasoning for accu-
rate data analysis, being directly related to the topic of Logic-based knowledge representation
and reasoning.

   Overall, the proposed project’s alignment with various research topics highlights its relevance
and potential contributions to conceptual modeling, as well as knowledge representation and
reasoning in the context of DIDs. It aims to address existing challenges and improve the
efficiency and accuracy of DID-based systems, offering valuable insights for data analysts in
diverse research fields based on conceptual modeling techniques and foundational ontologies.
4. Current Project Status
The project is in the first quarter of its development and is well on schedule. So far, the project
tasks have involved the exhaustive characterization of the framework elements, as well as the
development of precursory solutions. The ongoing tasks concern further investigation and
the extension and improvement of the proposed solutions, for instance, the generation of a
preliminary platform prototype and precursory validation tests for both the method and the
platform. On the other hand, fruitful discussions with external companies have revealed future
lines of research which are being addressed in a new Ph.D. thesis conducted within the scope of
this project, regarding domain-specific concerns related to the genomic field.


Acknowledgments
This work was supported by the Generalitat Valenciana through the CoMoDiD project (CIPROM/2021/023),
through a GVA-Predoctoral Research Grant (ACIF/2021/117), a Margarita Salas Grant, and the
Spanish State Research Agency through the DELFOS (PDC2021-121243-I00,MICIN/AEI/10.13039/501
100011033) and SREC (PID2021-123824OB-I00) projects, and co-financed with ERDF and the
European Union Next Generation EU/PRTR.


References
 [1] A. Margara, G. Cugola, N. Felicioni, S. Cilloni, A model and survey of distributed data-
     intensive systems, 2022.
 [2] A. Elizarov, B. Novikov, S. Stupnikov, Data Analytics and Management in Data Intensive
     Domains, Springer International Publishing, 2020.
 [3] M. Felderer, B. Russo, F. Auer, On Testing Data-Intensive Software Systems, 2019, pp.
     129–148. doi:10.1007/978- 3- 030- 25312- 7_6 .
 [4] M. Cowley, R. Davis, Next-generation sequencing and emerging technologies, Seminars
     in Thrombosis and Hemostasis 45 (2019). doi:10.1055/s- 0039- 1688446 .
 [5] D. Rigden, X. Fernandez, The 27th annual nucleic acids research database issue and
     molecular biology database collection, Nucleic Acids Research 48 (2020) D1–D8. doi:10.
     1093/nar/gkz1161 .
 [6] W. Mccombie, J. McPherson, E. Mardis, Next-generation sequencing technologies, Cold
     Spring Harbor Perspectives in Medicine 9 (2018) a036798. doi:10.1101/cshperspect.
     a036798 .
 [7] B. Louie, P. Mork, F. Martin-Sanchez, A. Halevy, P. Tarczy-Hornoch, Data integration and
     genomic medicine, Journal of biomedical informatics 40 (2007) 5–16. doi:10.1016/j.jbi.
     2006.02.007 .
 [8] A. Olivé, Conceptual Modeling of Information Systems, 2007. doi:10.1007/
     978- 3- 540- 39390- 0 .
 [9] S. Spreeuwenberg, AIX: Artificial Intelligence needs explanation: Why and how trans-
     parency increases the success of AI solutions., Amsterdam: LibRT: the Lab for Intelligent
     Business Rules Technology, 2019.
[10] O. Pastor, A. Palacio, J. Reyes Román, J. Casamayor, Modeling Life: A Conceptual
     Schema-centric Approach to Understand the Genome, 2017, pp. 25–40. doi:10.1007/
     978- 3- 319- 67271- 7_3 .
[11] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Bar-
     bado González, S. Garcia, S. Gil-Lopez, D. Molina, V. R. Benjamins, R. Chatila, F. Herrera,
     Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and chal-
     lenges toward responsible ai, Information Fusion 58 (2019). doi:10.1016/j.inffus.2019.
     12.012 .
[12] J. Qin, M. Žumer, X. Wang, W. Fan, Conceptual models and ontological schemas for seman-
     tically sustainable digital libraries, 2020, pp. 441–442. doi:10.1145/3383583.3398545 .
[13] L. Kalinichenko, A. Volnova, E. Gordov, K. Nadezhda, D. Kovaleva, O. Malkov, I. Okladnikov,
     N. Podkolodnyy, A. Pozanenko, N. Ponomareva, S. Stupnikov, A. Fazliev, Data access chal-
     lenges for data intensive research in russia 10 (2016) 2–22. doi:10.14357/19922264160101 .
[14] S. Spreeuwenberg, Choose for AI and for Explainability, 2020, pp. 3–8. doi:10.1007/
     978- 3- 030- 40907- 4_1 .