                                A smart data holistic approach for context-aware data
                                analytics (AETHER-UA)
                                Ana Lavalle1 , Alejandro Maté1 , Juan Trujillo1 , Miguel A. Teruel1 and
                                Alexander Sánchez Díaz1
                                1
                                  Lucentia Research Group - Department of Software and Computing Systems, University of Alicante, Carretera de San
                                Vicent del Raspeig, s/n, 03690, San Vicent del Raspeig, Spain


                                                                      Abstract
                                                                      A smart data holistic approach for context-aware data analytics (AETHER-UA), developed at
                                                                      the University of Alicante, is one of the four sub-projects of the overall AETHER project. The
                                                                      project is carried out by four partners: (i) University of Malaga (Coordinator), (ii) University
                                                                      of Alicante, (iii) University of Castilla-La Mancha, and (iv) University of Seville, and is funded
                                                                      by the Ministry of Science and Innovation. Its main goal is to advance towards a knowledge-
                                                                      based framework integrating novel solutions for data, process and business analytics. The
                                                                      research activities for designing and developing Aether will mainly focus on three challenges:
                                                                      the characterization of the datasets, the improvement and automation of the algorithms, and
                                                                      the generation of mechanisms to enhance model explainability and the interpretation of the
                                                                      results. The project is highly related to data processing, integration, analysis and modeling.
                                                                      More concretely, within the AETHER-UA sub-project, several proposals are being developed
                                                                      for the modeling of user requirements for Machine Learning applications, the development of
                                                                      a framework based on Model-Driven Development (MDD) for eXplainable Artificial
                                                                      Intelligence, and several approaches for data bias analysis.

                                                                      Keywords
                                                                      Smart Data, Data Integration, Data Bias, Conceptual Modeling, User Requirements, Modeling
                                                                      of Machine Learning Applications




                                1. Project description
                                This project will develop Aether, a data platform conceived as an extension of BIGOWL [1],
                                an ontology to support knowledge management in Big Data analytics, by including new
                                dimensions such as data quality and cybersecurity, plus application domains like process
                                analytics and business analytics. Thus, the target metadata framework will include aspects
                                that have never been integrated into a single solution. This novel framework will open new
                                research challenges in taking advantage of it for data analytics: how the metadata can be
                                discovered, linked to the different analyses, and exploited.
                                   The first challenge is the characterization of the input datasets. Knowing properties of

                                ER2023: Companion Proceedings of the 42nd International Conference on Conceptual Modeling: ER Forum, 7th SCME,
                                Project Exhibitions, Posters and Demos, and Doctoral Consortium, November 06-09, 2023, Lisbon, Portugal
                                alavalle@dlsi.ua.es (A. Lavalle); amate@dlsi.ua.es (A. Maté); jtrujillo@dlsi.ua.es (J. Trujillo); materuel@dlsi.ua.es
                                (M. A. Teruel); alexander.sanchez@ua.es (A. S. Díaz)
                                ORCID: 0000-0002-8399-4666 (A. Lavalle); 0000-0001-7770-3693 (A. Maté); 0000-0003-0139-6724 (J. Trujillo);
                                0000-0003-0102-6918 (M. A. Teruel); 0000-0002-6236-9095 (A. S. Díaz)
                                                                    © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                 CEUR
                                 Workshop
                                 Proceedings
                                               http://ceur-ws.org
                                               ISSN 1613-0073
                                                                    CEUR Workshop Proceedings (CEUR-WS.org)




the datasets, such as semantics, topology, or statistical features, is critical for data optimization,
process and business analytics. These properties can be used to automatically select the best
algorithms depending on the problem and the input data, and to fine-tune their configuration
to obtain high-quality results.
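As a purely illustrative sketch of this idea (the property names and selection rules below are invented for this example and are not part of the Aether framework), a numeric column can be characterized by simple statistical properties that then drive a rule-based algorithm suggestion:

```python
from statistics import mean, stdev

def characterize(column):
    """Compute simple statistical properties of a numeric column (None = missing)."""
    present = [v for v in column if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(column),
        "mean": mean(present),
        "stdev": stdev(present) if len(present) > 1 else 0.0,
        "cardinality": len(set(present)),
    }

def suggest_algorithm(props):
    """Toy rule base mapping dataset properties to a candidate analysis strategy."""
    if props["missing_rate"] > 0.2:
        return "impute-then-model"
    if props["cardinality"] <= 10:
        return "decision-tree"
    return "linear-model"

props = characterize([1.0, None, None, 4.0, 5.0])
print(suggest_algorithm(props))  # -> impute-then-model (40% of the values are missing)
```

In Aether, such properties are meant to be captured as metadata and exploited along the analysis workflow, rather than hard-coded into rules as in this sketch.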
   However, properties discovered when characterizing the data at the beginning of the analysis
process may no longer be valid once the data are transformed, nor hold for the inferred
knowledge. The second challenge is therefore to research how these properties are propagated
throughout the analysis workflows, how new properties can be derived, and how they can be
used to: facilitate or automate the selection and configuration of the algorithms needed in the
analysis process; measure the actionability of the results based on values of authority and
provenance or reliability and confidence; and derive an intelligible explanation of why a given
result was obtained. Thus, we aim to move from using analysis algorithms as black boxes
towards facilitating the understanding and interpretability of the results by analysts.
   None of these steps forward can be taken in isolation from some fundamental aspects: (1) data
tend to be managed through workflows that can be represented using business processes, which
implies new challenges; (2) the quality of the data is crucial for later analysis; (3) not only the
data are relevant for analysis: cybersecurity aspects must also be included; and (4) the advance
after applying the various techniques must be measured by using KPIs.
   All these research activities will be applied in real settings by developing pilots in industrial
contexts across different application domains, taking advantage of existing collaborations with
other research organizations and companies in a multidisciplinary approach: precision
agriculture, personalized medicine, bioinformatics, Smart Cities, logistics, Industry 4.0 and
remote sensing.
   This project is being developed by four partners: (i) University of Malaga - Coordinator; (ii)
University of Alicante, (iii) University of Castilla La-Mancha, and (iv) University of Seville. The
project is funded by the Ministry of Science and Innovation and has a duration of four years.


2. Project information
    • Project name: A smart data holistic approach for context-aware data analytics (Aether)
    • Duration: from January 1st, 2021 until December 31st, 2024
    • Partners: the project participants, grouped by the corresponding partner, are:
         – Aether-UMA: KHAOS Research group - University of Málaga (UMA). The KHAOS
           group's main background is in semantic technologies.
         – Aether-UCLM: Alarcos and GSyA research groups - University of Castilla-La
           Mancha (UCLM). The Alarcos Research Group's main background is in data quality:
           the quality of databases, data warehouses, data models, and the data itself. The GSyA
           research group's main background is in software development security and in
           security management and governance.
         – Aether-UA: Lucentia Research group - University of Alicante (UA). The
           Lucentia Group has extensive experience in the field of Business Intelligence (BI),
           Data Analytics, Big Data, and Conceptual modeling of BI applications.
         – Aether-US: IDEA Research group - University of Sevilla (US). The IDEA Group
           is devoted to research on issues related to business process models and diagnosis of
           systems based on data analysis.

    • Funding agency: Ministry of Science and Innovation
    • Website: https://aether.es/


3. Project goals
Figure 1 summarizes the project goals. The main goal is to advance towards a knowledge-based
framework integrating novel solutions for data, process and business analytics. This goal will
be approached by extending BIGOWL to cope with the relevant dimensions as the core of the
Aether framework (General Goal 1). The research activities for designing and developing Aether
will mainly focus on three main challenges: the characterization of the datasets (General Goal 2),
the improvement and automation of the algorithms (General Goal 3), and the generation of
mechanisms to enhance model explainability and the interpretation of the results (General Goal 4).
The project results will be applied to relevant scenarios (General Goal 5). These five general
goals can be divided into more specific goals with their responsible researchers (due to space
constraints we cannot provide the full list of researchers):




Figure 1: Project goals
• General goal 1 (Responsible UMA): To design the semantic core for the Aether
  framework to deal with relevant aspects of data, process and business analytics.
• General goal 2 (Responsible UCLM): To investigate data characterization mechanisms
  for the analysis of datasets (not only Big Data) to obtain semantic (domain), topological,
  statistical (data biases, kind of distribution, data type, descriptive statistics, categories)
  and other properties (e.g., quality, security, provenance).
• General goal 3 (Responsible US): To design and develop new solutions for improved
  data analytics, process analytics and business analytics, taking advantage of the analysis
  of the datasets and processes.
• General goal 4 (Responsible UA): To facilitate the interpretability and the explainability
  of analytic results, with particular focus on aligning Machine Learning techniques with
  business objectives. This will lead to new insights and the identification of novel, relevant
  analysis.
     – Specific goal 4.1 (I. Navas): To study the use of knowledge graphs and ontologies
       to enhance the explainability of the models (of the algorithms) used in the analysis
       of data, and to improve the actionability of results while maintaining the privacy
       rules determined for the data.
     – Specific goal 4.2 (J.C. Trujillo): To define a methodology to derive indicators
       and visualizations from the metadata generated by the various investigations and
       consolidated in BIGOWL, taking into account user goals, semantics and machine
       learning algorithms outputs.
     – Specific goal 4.3 (A.J. Varela): To reason about security vulnerabilities and
       vulnerable configurations in order to improve interpretability and usability, based
       on the results over the datasets and the analysis of their security requirements,
       features, configurations, and system information.
     – Specific goal 4.4 (J.A. Cruz): To define quantitative (synthetic) indicators to
       measure the interpretability of the results of data analysis and machine learning
       algorithms. These indicators will allow us to warn users about underlying aspects
       that may affect their interpretation.
     – Specific goal 4.5 (M.T. Gómez): To study the alignment between the techniques
       applied in the data analytics of business processes and the information obtained,
       using knowledge graphs and ontologies to facilitate the interpretability and
       usability of the results (of the algorithms) of the data analysis, even before the
       algorithms are applied, in order to ascertain the complexity and usability of the
       techniques used.
     – Specific goal 4.6 (A. Maté): To develop reasoning techniques that use the seman-
       tics of the analytic workflow and KPI definitions to facilitate the identification of
       potential causes for underperforming objectives as well as their implications for the
       organization.
• General goal 5 (Responsible UMA): To develop "Piloting" actions, through the
  development of pilot experiences with companies and organizations in the application
  of the results of the project.
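To make the kind of KPI-aware reasoning targeted by specific goal 4.6 concrete, the following toy sketch flags underperforming KPIs and points back to the analytic workflow that feeds them (all KPI names, values and workflow identifiers below are invented for illustration and are not part of the project):

```python
# Hypothetical KPI definitions; "feeds" names the analytic workflow behind each KPI.
kpis = [
    {"name": "on-time delivery", "target": 0.95, "actual": 0.97,
     "feeds": "logistics-etl", "lower_is_better": False},
    {"name": "churn rate", "target": 0.05, "actual": 0.09,
     "feeds": "churn-model", "lower_is_better": True},
]

def underperforming(kpis):
    """Return (KPI, workflow) pairs for KPIs that miss their target,
    a first step towards hinting at root causes in the analytic pipeline."""
    missed = []
    for k in kpis:
        ok = (k["actual"] <= k["target"] if k["lower_is_better"]
              else k["actual"] >= k["target"])
        if not ok:
            missed.append((k["name"], k["feeds"]))
    return missed

print(underperforming(kpis))  # -> [('churn rate', 'churn-model')]
```

Linking each KPI to the semantics of its analytic workflow, as goal 4.6 proposes, would allow the reasoning to go beyond this flat check and suggest concrete causes for the deviation.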
4. Methodology and Work Plan
4.1. Methodology
Our work methodology builds on the reference framework provided by the Enterprise Unified
Process (Enterprise Unified Process. Several authors. Enterprise Unified Process Publishing,
2020, ISBN: 978-1867420781), which proposes an iterative, incremental life cycle that is very
appropriate for research projects. We have been adapting it over the last decade, so it is very
solid nowadays. In summary, the project is divided into the following main phases, within
which the corresponding disciplines are performed: (i) Inception, (ii) Preparation, (iii)
Construction, (iv) Transition, and (v) Recapping.

    • Inception: the goal is to identify the working hypotheses and the objectives, verify that
      they are realistic and aligned with current trends, ensure that they fit well with the
      existing research programs, and check that they will have positive scientific, technical,
      social and economic impacts. Brainstorming meetings are useful in this phase, so that
      researchers can contribute their ideas. This phase was carried out at the beginning of
      the project proposal preparation.
    • Preparation: the goal is to work on the general goal and motivation of the project,
      plus a summary of the state of the art that supports them. After that, the focus must
      shift towards devising the work packages, their specific goals, and the dissemination,
      transfer, and contingency plans. It is a good idea to work in small groups and
      brainstorming meetings so that every researcher can contribute ideas. This phase was
      carried out during the project proposal preparation. Still, it will be revisited in the first
      month of the project development to update it according to any new insights
      discovered.
    • Construction: this phase focuses on defining the goals within each work package and
      dividing these work packages into several tasks. It is a good idea to work in small groups
      that organise a kick-off meeting per goal in which they agree on the tasks to be done,
      the workflow, the intermediate milestones, and the key performance indicators. Then,
      the researchers work individually or in very small groups and meet periodically in plenary
      workshops in which they present their results, discuss new ideas, and identify deviations
      and corrective actions.
    • Transition: the goal is to transfer the research results to the end-users and the industry. We
      will pay special attention to transferring them to the companies that support the proposal
      using seminars and workshops. We will also pay special attention to complementary
      projects in which we can apply our results in the context of their industrial projects.
    • Recapping: the goal is to review how the project has gone in order to identify strong
      and weak points that may help us work better in future projects.

   Our software development methodology combines Kanban and iterative prototyping. Kanban
is an agile methodology based on keeping a board to visualise the tasks to be carried out
(backlog), in progress, and finished. Thus, the team members can see all the tasks in their
context. When a task is completed, the next one is taken from the backlog. The idea behind
using prototypes is to develop incomplete versions of the software project to get a working
system quickly, so that it can be easily tested and users can provide feedback early. Once a
prototype is ready, a new
one is planned to cover new features, and so on. Prototyping can be combined with continuous
integration, so that software is produced in short cycles to ensure that it can be ready to be
released at any time by frequently applying the steps of testing, building, and releasing. The
project will be managed with the help of OpenProject (https://www.openproject.org/), which
was specifically designed to help with project planning and scheduling, to define user stories
and tasks, to assign them to developers, and to report on timings and costs.
   The source code of the software projects will be monitored by the Travis CI (https://www.
travis-ci.org) continuous integration system. The applications will be packed into containers
using Docker (https://www.docker.com), which has proven to help deploy the results very
easily.


5. Work plan
The table shown in Figure 2 summarises the work plan and how it relates to the interdisciplinary
dimensions of the project (* means that a task involves several dimensions). In the following
Section 5.1, we describe the main work package developed by the University of Alicante,
pointing out the main role of conceptual modeling in different areas.

5.1. WP4: Model interpretability and result explainability (WP Leader: A.
     Maté)
The goal of this WP is to facilitate more confident and accurate decisions by improving the
interpretability and explainability of analytic workflows and outputs. The aim is to provide
decision-makers with additional pieces of evidence and insights on the rationale behind analyti-
cal workflows such as data subsets that lead to the activation of Machine Learning & Artificial
Intelligence rules, the semantic relationship between the data being analyzed and other data
sets stored by the organization, or the connection between analytic outputs and business goals
and their implications.


Figure 2: Work plan

    • Task 4.1. (UMA, UA, US, UCLM): Knowledge graph based explainability (9 Months).
      This task will analyse the use of knowledge graphs to enhance the understanding of the
      datasets and the models used to transform them through data analytics. The semantics
      defined at the different levels of knowledge will enable the exploration of these knowledge
      graphs to extract relevant knowledge. Data privacy will be preserved when exploring
      external knowledge graphs.
         – Deliverable 4.1 [REPORT] Analysis of knowledge graphs to improve model ex-
           plainability (M39).
    • Task 4.2. (UA, UMA): Semantically enhanced dashboards and scorecards (9 months).
      This task aims at developing a methodology for deriving dashboards and scorecards
      exploiting the semantic information in the Aether framework. To this aim, previous
      works of the UA will be combined with the expertise on semantics from UMA to create a
      methodology that takes into account not only user goals and the data at hand, but also
      data semantics to create richer visualizations and dashboards.
         – Deliverable 4.2 [REPORT] Methodology for the derivation of semantically en-
           hanced dashboards and scorecards (M39).
    • Task 4.3. (US, UCLM): Testing of Vulnerabilities (9 months). Automatic creation
      of security tests from exploit databases, based on the vulnerabilities extracted from
      relevant databases, to create models that describe the possible vulnerabilities and
      evolution of the systems. These models will be used to verify and diagnose the running
      configuration and, further, to define tests according to their potential vulnerabilities.
         – Deliverable 4.3 [REPORT, SOFTWARE] Report on the automatic extraction of
           models from databases, plus software to verify and diagnose configurations and
           security tests (M41).
    • Task 4.4. (UCLM, UMA, UA): Data Analytics Results’ Interpretability (12 months). The
      interpretation of the results of the Machine Learning models is vital because (1) it helps to
      correct unwanted biases, (2) it facilitates the detection of noise that alters the prediction
      and (3) it allows confirming that the variables the model has selected are significant
      in the framework of the problem. However, when faced with different possible models,
      there is currently no mechanism for measuring the level of interpretability of the model.
      Therefore, in this task, a series of experiments will be carried out to identify possible
      metrics that will allow the user to define the level of interpretability of the algorithms.
         – Deliverable 4.4 [REPORT] Experimental results in measuring interpretability of
           data analytics results in the project (M44).
    • Task 4.5. (US, UMA, UA): Alignment among processes and discovery techniques (10
      months). This task studies how the best configuration of process mining techniques
      can be ascertained before the techniques are applied. To fulfil this task, it is necessary
      to characterise the processes in order to discover the relation between the parameters
      of the techniques and the processes to which they are applied.
         – Deliverable 4.5 [REPORT] Methodology to facilitate the alignment of the most
           suitable algorithms and settings according to datasets in process discovery (M44).
    • Task 4.6. (UA, UMA): Business reasoning framework (10 months). This task aims at
      developing a novel methodology for reasoning about the current performance of the
      organization by making the system aware of existing KPIs and analytic workflows (ML
      outputs and data transformations) to provide as many insights as possible into the root
      causes of underperforming business objectives.
         – Deliverable 4.6 [REPORT] Business reasoning framework description (M46).


6. Project relevance
The Aether project aims to enhance the whole data analytics process, seamlessly augmenting
datasets and algorithms with semantic and domain knowledge, while also improving their
security, quality and actionability. The expected result is a new way to approach data analytics,
enabling advanced reasoning and applications with less effort, and augmenting the data
catalogue incorporated into business decisions. In turn, this change has profound implications
for academia, society, and industry. Numerous organizations, such as hospitals, universities or
industrial companies, can harness this new approach to easily integrate new datasets into their
analyses while having a clearer view of their implications for their objectives.


7. Current state
As can be seen from the above, the project is highly focused on "data". More particularly, we
should point out that conceptual modeling plays a relevant role in many work packages and
transversal tasks. One of the main novel research lines under development in this project is the
modeling of user requirements for Machine Learning applications. Currently, we are working
on an extension of the iStar [2] modeling approach in order to be able to gather the main ML
requirements.
   On the other hand, data bias is one of the most relevant issues to be tackled when working
with external data and/or developing ML models. Accordingly, we have developed several works
in which conceptual modeling and visualization play relevant roles in finding the main data
biases in a dataset [3, 4, 5, 6]. These works can be applied to small as well as Big Data sources
[7, 8]. In addition, one of the current research lines focuses on the modeling of ML models in
order to model the most relevant data to train them [9, 10, 11, 12, 13].
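As an illustrative sketch of the kind of data bias these works target (a deliberately simple class-imbalance metric with random oversampling; the methodologies in [3, 4, 5, 6] are considerably more elaborate):

```python
import random
from collections import Counter

def imbalance_ratio(labels):
    """Ratio between majority and minority class counts; 1.0 means balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until every class matches the majority."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

y = ["pos"] * 2 + ["neg"] * 8
x = list(range(10))
print(imbalance_ratio(y))        # -> 4.0 (heavily imbalanced)
x2, y2 = random_oversample(x, y)
print(imbalance_ratio(y2))       # -> 1.0 (balanced after oversampling)
```

Measuring bias before and after rebalancing, as in this sketch, is the basic loop behind the visual methodology of [6], which additionally shows the analyst how rebalancing changes the data distribution.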
   We also apply different process mining techniques for the discovery of de facto process models,
conformance analysis, and extensions combining AI models. We use algorithms such as the
Fuzzy Miner, to manage levels of abstraction in the models by tackling both correlation and
relevance variables, and the Trace Alignment algorithm, with clustering based on Agglomerative
Hierarchical Clustering (AHC), to detect patterns in process traces [14].
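A minimal, self-contained sketch of this kind of trace clustering (single-linkage AHC over the edit distance between activity sequences; a simplification for illustration, not the implementation used in [14]):

```python
def edit_distance(a, b):
    """Levenshtein distance between two activity sequences (traces)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def agglomerative(traces, threshold):
    """Naive single-linkage AHC: merge clusters whose closest traces
    are within the given edit-distance threshold."""
    clusters = [[t] for t in traces]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(edit_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

traces = [("start", "check", "approve", "end"),
          ("start", "check", "reject", "end"),
          ("start", "escalate", "review", "approve", "end")]
print(len(agglomerative(traces, threshold=1)))  # -> 2 clusters
```

The two "check" traces end up in one cluster and the "escalate" variant in another, illustrating how AHC over trace distances can surface behavioural patterns in an event log.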
References
 [1] C. Barba-González, J. García-Nieto, M. del Mar Roldán-García, I. Navas-Delgado, A. J.
     Nebro, J. F. Aldana-Montes, Bigowl: Knowledge centered big data analytics, Expert
     Systems with Applications 115 (2019) 543–556.
 [2] J. M. Barrera, A. Reina-Reina, A. Lavalle, A. Maté, J. Trujillo, An extension of istar for
     machine learning requirements by following the prise methodology, Available at SSRN
     4358075 (2023).
 [3] A. Lavalle, A. Maté, J. Trujillo, J. García-Carrasco, Law modeling for fairness requirements
     elicitation in artificial intelligence systems, in: Proceedings of the 41st International
     Conference on Conceptual Modeling, ER 2022, Springer, 2022, pp. 423–432.
 [4] M. del Mar Roldán-García, J. García-Nieto, A. Maté, J. Trujillo, J. F. Aldana-Montes,
     Ontology-driven approach for kpi meta-modelling, selection and reasoning, International
     Journal of Information Management 58 (2021) 102018.
 [5] A. Lavalle, A. Maté, J. Trujillo, J. García, A methodology based on rebalancing techniques
     to measure and improve fairness in artificial intelligence algorithms, in: Proceedings of the
      24th International Workshop on Design, Optimization, Languages and Analytical Processing
      of Big Data DOLAP@EDBT/ICDT 2022, volume 3130 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2022, pp. 81–85.
 [6] A. Lavalle, A. Mate, J. Trujillo, M. A. Teruel, A data analytics methodology to visually
     analyze the impact of bias and rebalancing, IEEE Access 11 (2023) 56691–56702.
 [7] A. Maté, J. Peral, J. Trujillo, C. Blanco, D. García-Saiz, E. Fernández-Medina, Improving
     security in nosql document databases through model-driven modernization, Knowledge
     and Information Systems 63 (2021) 2209–2230.
 [8] C. Blanco, D. García-Saiz, D. G. Rosado, A. Santos-Olmo, J. Peral, A. Maté, J. Trujillo,
     E. Fernández-Medina, Security policies by design in nosql document databases, Journal of
     Information Security and Applications 65 (2022) 103120.
 [9] S. García-Ponsoda, J. García-Carrasco, M. A. Teruel, A. Maté, J. Trujillo, Feature engineering
     of eeg applied to mental disorders: a systematic mapping study, Applied Intelligence (2023)
     1–41.
[10] J. García-Carrasco, A. Maté, J. Trujillo, A data-driven methodology for guiding the selection
     of preprocessing techniques in a machine learning pipeline, in: International Conference
     on Advanced Information Systems Engineering, Springer, 2023, pp. 34–42.
[11] R. Tardío, A. Maté, J. Trujillo, Beyond tpc-ds, a benchmark for big data olap systems
     (bdolap-bench), Future Generation Computer Systems 132 (2022) 136–151.
[12] J. M. Barrera, A. Reina, A. Mate, J. C. Trujillo, Fault detection and diagnosis for industrial
     processes based on clustering and autoencoders: a case of gas turbines, International
     Journal of Machine Learning and Cybernetics 13 (2022) 3113–3129.
[13] A. Reina Reina, J. M. Barrera, B. Valdivieso, M.-E. Gas, A. Maté, J. C. Trujillo, Machine
     learning model from a spanish cohort for prediction of sars-cov-2 mortality risk and critical
     patients, Scientific Reports 12 (2022) 5723.
[14] J.-F. Rodríguez-Quintero, A. Sánchez, L. Iriarte Navarro, A. Maté, M. Marco Such, J. Trujillo,
     Fraud audit based on visual analysis: A process mining approach, 2021-05-21.