MATILDA: Inclusive Data Science Pipelines Design through Computational Creativity

Genoveva Vargas-Solar1, Khalid Belhajjame2, Javier A. Espinosa-Oviedo1,3, Santiago Negrete-Yankelevich4 and José-Luis Zechinelli-Martini5

1 CNRS, Univ Lyon, INSA Lyon, UCBL, LIRIS, UMR5205, F-69221, France
2 PSL, Université Paris Dauphine, LAMSADE, UMR7243, France
3 CPE Lyon, 43 Blvd. du 11 Novembre 1918, 69616 Villeurbanne Cedex, France
4 Universidad Autónoma Metropolitana (Cuajimalpa). Avenida Vasco de Quiroga 4871, Cuajimalpa de Morelos 05348, Ciudad de México
5 Fundación Universidad de las Américas-Puebla, Exhacienda Sta. Catarina Mártir s/n 72820 San Andrés Cholula, Mexico

Abstract
This paper argues for developing innovative data science frameworks that make the latest advances in data engineering and artificial intelligence accessible to non-technical users across diverse fields. Such frameworks would empower these users to fully leverage the capabilities of advanced data science solutions. We propose a methodology that merges computational creativity with conversational computing to give non-experts an intuitive pathway to navigate datasets and derive insights from them. We present MATILDA, a platform rooted in creativity-driven data science, and demonstrate its utility in augmenting the design process of data science pipelines through the synergy of human innovation and algorithmic ingenuity.

Keywords
Data science pipelines, graph analytics, knowledge graphs, computational creativity

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
* Genoveva Vargas-Solar.
† The authors' list is alphabetical except for the first author.
genoveva.vargas-solar@cnrs.fr (G. Vargas-Solar); khalid.belhajjame@dauphine.fr (K. Belhajjame); javiera.espinosa@liris.cnrs.fr (J. A. Espinosa-Oviedo); snegrete@cua.uam.mx (S. Negrete-Yankelevich); joseluis.zechinelli@udlap.mx (J. Zechinelli-Martini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Harnessing extensive datasets across numerous sectors via data science techniques offers substantial economic and social benefits. These methods are predominantly within the purview of individuals proficient in AI, with deep knowledge of mathematics, statistics, numerical analysis, and artificial intelligence frameworks. However, there is a growing need for these data science methodologies to be accessible and applicable to inquiries and challenges from a broader range of disciplines, catering to users who may not possess extensive expertise in data science. These methodologies must extend beyond the data science community to embrace users from non-technical backgrounds (such as engineering, humanities, and social sciences) who rely on data analysis to address research questions pertinent to their domains.

This shift necessitates the emergence of novel data science solutions that capitalise on and democratise the latest advancements in data engineering and AI. These solutions should shield non-technical users from the intricacies of the underlying technologies while still allowing them to fully exploit the capabilities of these advanced tools to satisfy the specific analytical requirements of their respective fields.

Addressing complex transdisciplinary issues requires a collaborative effort where experts from diverse fields engage in dialogue and exchange ideas, a process that inherently demands creativity1. Creating data science solutions calls for a multifaceted approach, incorporating algorithmic, data-centred, information technology, and cross-disciplinary perspectives, particularly from those outside the data science domain. Consequently, there is a pressing need for innovative solutions underpinned by creativity that can make data science accessible to non-experts, allowing them to intuitively navigate data repositories and distil valuable knowledge.

1 Creativity is a process that can combine familiar ideas in new ways, explore the potential within existing conceptual spaces, or transform these spaces to allow for previously inconceivable ideas. Creativity is not considered novelty but the capacity to generate surprising and valuable ideas that push beyond conventional boundaries [1].

In this paper, we explore an approach that melds computational creativity, which enables users to venture into new realms of data analysis design, with conversational computing, which offers user-friendly abstractions for directing and customising their data analysis endeavours without the necessity of engaging with intricate technical specifics.

Accordingly, the remainder of the paper is organised as follows. Section 2 gives a general overview of approaches that contribute to modelling creativity-based processes and to the friendly design of data science pipelines. It discusses how both areas can provide novel ways of designing data science-driven solutions. Section 3 describes the challenges of modelling creativity-driven data science design processes. It gives the general lines of an approach that can enhance and envision a new way of addressing analytics problems using data and artificial intelligence models. Section 4 introduces a creativity-based data science design platform. It describes the general architecture and functions and shows how it can support the design process of data science pipelines guided by human and computational creativity. Finally, Section 5 concludes the paper and discusses future work.

2. Related work

Addressing the creative design of data science pipelines to make them inclusive for non-experts requires drawing on methods and results from three areas: creative models, friendly data science and provenance. This section gives an overview of the relevant results that provide a scientific background to the project and help to contextualise our objectives.

Creativity-driven systems. Artificial intelligence (AI) offers opportunities to reify and transform how we think about human cognitive capabilities. In this context, computational creativity (CC)2 aims at studying human creativity and building systems that perform in such a way as to be considered creative. A central question concerning creativity is whether this cognitive capability is individual or collective. CC has moved from the classic individualistic and cognitive model of creativity [1] to a social and collective creativity model [2, 3, 4, 5, 6, 7].

2 Computational creativity is the study of building software that exhibits behaviour that would be deemed creative in humans. Such creative software can be used for autonomous creative tasks, such as inventing mathematical theories, writing poems, painting pictures, and composing music (https://computationalcreativity.net).

Collective-creativity models are concerned with understanding the roles and tasks that different agents (both human and non-human) play in a process that requires creativity and how creativity can be measured in this context [8, 9]. Co-creativity is an excellent approach to establishing efficient, context-aware interactive systems and setting long-term programmes where solutions to problems requiring human and machine collaboration can be studied.

CC has been widely applied in art with systems that promote "artificial" creation. For example, Disco Diffusion is an online tool that runs on Google Colab to execute Python programs that, using a learning model, result in "creative" artwork. Language models such as GPT-3 are capable of interpreting and generating text. With 540 billion parameters, the Google Pathways Language Model (PaLM) can explain jokes, follow a chain of reasoning, recognise patterns, perform Q&A sessions on scientific knowledge, and summarise texts. As more parameters are added to language models, the depth of "understanding" they can demonstrate expands. "The Painting Fool" is a system developed by Simon Colton that draws portraits taking into account emotional information obtained through a camera from the subjects being painted [10]. Negrete-Yankelevich and Morales-Zaragoza [4] propose a model to develop and assess creativity in computational agents embedded in mixed teams. The Apprentice Framework model establishes a series of roles (or levels of responsibility) that agents can play within the group over time, with the possibility of ascending the ladder as the system is developed and thus acquiring more responsibility in the creative process. The model also helps identify the aspects of the product being produced in which the agent is supposed to be creative. By keeping track of both responsibilities and aspects, it is possible to plan and assess the development of the system. Negrete-Yankelevich and Morales-Zaragoza [11] propose a framework to create animatics. These animated storyboards constitute an essential artefact in the production of animations by a team of expert animators that made an award-winning series of one-minute shorts for Mexican TV called Imaginantes (Televisa, "Imaginantes* - YouTube."). The system's creativity is measured by how the overall creativity of the team is affected by the system's performance.

Developing friendly data science solutions. Friendly data science systems must provide intuitive and interactive access to data processing operations in an agile and visual step-by-step manner [12]. They should help a user to derive conclusions about the content of a data collection and identify the potential questions that the data can help answer. Through conversational loops and feedback, a friendly exploration and analysis system must calibrate the tasks according to the data's characteristics and the user's expertise and expectations. Through metadata collection and user profiling, an exploration and analysis conversation loop should propose actions, insights, and displays of results (including visualisations) that assist the user in completing a given goal.

Discussion. We believe that collective CC can reproduce the collaborative transdisciplinary conditions in which data science solutions are developed. It can drive the proposal of data science solutions combining data, algorithms, and computing resources to model complex systems and contribute to answering research questions to understand and predict them. While computational creativity and conversational techniques have each proven effective, they have not been explored for the design of exploratory data analysis pipelines. Moreover, the two approaches are somewhat opposed: conversational techniques tend to rely on known territories (i.e. previously explored data manipulation and analysis actions), whereas computational creativity allows for exploring unknown territories (new data manipulation and analysis actions), which may, in some cases, prove more effective. Our work addresses two challenges. On the one hand, adapting and leveraging both techniques to design an efficient and exploratory data analysis pipeline. On the other hand, striking the right balance, when creating data analysis pipelines, between 'known' prior data exploration and analysis actions and 'unknown' creative actions.

3. Creative process for designing data science driven solutions

Data science pipelines combining machine learning and deep learning are the new query types, with specific needs regarding how data must be structured and managed. The "one-size-fits-all" approach to data structures and associated management functions is no longer adapted to data science queries. Indeed, every query has a specific objective (modelling, prediction), and its design entirely depends on the input dataset and an initial research question (RQ). The data science query is not based on explicit knowledge of the data. It includes tasks devoted to mathematically understanding the data; the partial results of those tasks then determine the design of other studies devoted to computing a model that represents some hidden knowledge. Given statistical and machine learning methods and a target objective, data scientists rely on libraries that provide methods that they combine to define a data science pipeline. The results obtained by this pipeline are never definitive; they are only ever, to some degree, close to the target.

To illustrate the design process of a DS pipeline, consider the following scenario. A trendy decision-making group is willing to adopt a data-driven approach for designing public policies that enhance citizens' lives in urban spaces and reduce energy and economic costs. Public policies are intended to modify built environments to improve them from financial and well-being perspectives. Decision-makers know that, from the urbanism perspective, small changes in the built environment can alter how people use the space. For instance, increasing pedestrian areas in a city downtown close to restaurant zones reduces the CO2 footprint. Still, it impacts the influx of restaurant customers in the area and lowers real estate prices. Customers can suddenly start preferring restaurants close to parking slots. People living in the area can have problems accessing it and parking their cars close to home. The research question is: to what extent can public policies impact the quality of life of different categories of citizens willing to evolve in a given urban area? Decision-makers now call for data scientists' creativity to provide studies and mathematical evidence of the kind of urban changes to be considered in public policies.

Designing a data science pipeline. Datasets and research questions drive the design of DS pipelines. Consider a simplified creative scenario using the main phases of a DS pipeline: (1) collect or search for datasets that can be used for answering a research question; (2) prepare them (explore, clean, engineer) to feed one or several artificial intelligence (AI) models; (3) train and test models with dataset fragments, calibrating these tasks recurrently until specific performance scores are reached; (4) assess results constantly until they are considered good enough to be interpreted by experts, who draw conclusions on how far the initial research question has been answered. When sketching a creative process, the elements to consider are: What data are needed to answer the research question, and what strategy should be developed to collect them? How do we transform the initial research question into a quantitative statement that can be addressed by mathematical or AI models? Which model families can be pertinent for answering the question? How do we design a series of tasks where data are processed? How do we connect the results format with the research question statement? How do we determine whether results converge? How do we decide whether results are fair enough to be considered an answer?

For example, data scientists can film civilians in the target urban spaces to collect behavioural patterns on how they occupy and evolve along those spaces before and after implementing public policies. Extracting behavioural patterns implies designing a DS pipeline for processing videos and detecting civilians (for example, using perceptrons [13]) and behaviour patterns within a series of scenes. The patterns can then be classified according to properties that detect changes before and after implementing some change. Another possibility would be to run other data collection techniques, like questionnaires, to describe urban civilians' behaviour through quantitative variables that can be correlated for detecting changes produced after applying public policies. The possibilities are numerous, and they depend on the data scientists' expertise, on whether facilities exist for collecting certain types of data (e.g., video vs questionnaires) and on their knowledge of specific families of AI models.

Discussion. Data science and machine learning environments provide all the necessary AI models. They are supported by enactment stacks that deal with the storage, fragmentation, indexing and distribution of the data required and produced by the tasks composing a pipeline. What are the rules and strategies to combine different components that can transform input data into models and predictions that provide quantitative elements to answer initial research questions?

Generative artificial intelligence3 has started to be consolidated into solutions that give the illusion of creation through interactive and conversational approaches4. Systems like ChatGPT, in its various versions, mimic conversational and question-answering experiences intended to perform target tasks or produce "new" content based on existing evidence. The principle of such systems is to synthesise the creation process as an exercise of wrapping together "content" with specific characteristics, under some constraints, to produce artefacts that look, to some extent, novel.

3 According to the Bing chat-GPT and validated by this paper's authors: Generative AI refers to a category of artificial intelligence (AI) algorithms that generate new outputs based on the data they have been trained on (www.weforum.org/agenda/2023/02/generative-ai-explain-algorithms-work/).
4 Microsoft Deepspeed https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat

In the case of DS pipelines, the first challenge is to model the creation process behind them. How does someone (a domain expert) state a research question so that a data-driven quantitative study can be run? How are data collected and selected to answer such questions? Which comes first, data or questions? How do we conclude that given datasets representing observations of an object of study are representative enough to produce a model or predict the behaviour of that object? How is the human integrated into the loop, and how does she/he intervene in the design milestones of a DS pipeline?

Challenges and Open Issues. A computational-creativity-based methodology for designing DS pipelines should consider at least the following scientific challenges and associated open problems:

• Modelling hybrid (human and nonhuman) creativity-driven data science pipelines' design: propose a computational creativity model to represent end-to-end pipeline design. The creativity model can integrate design patterns like the ones presented in [14] (design, mutant shopping, chorus line, simulation and approximating feedback, entertaining evaluations and no blank canvas). Depending on the tasks to be designed within a DS pipeline, different creativity patterns may be best suited to address each task.
• Defining the interaction among humans designing a DS pipeline and artificial system(s) that can take on tasks and propose results: model the input required to feed the system, the output expected from it, and the type of feedback to be given by humans. Intervene in the process with an agent by selecting a relevant subprocess where creativity would contribute significantly to the overall solution, and assess how it works. Then, try other similar subprocesses and verify again. This bottom-up approach would establish a practice to turn the overall process into a friendly one in a stepwise fashion.
• Collecting provenance and data from DS pipelines design tasks: implement processes for data curation, annotation, identification, and quality control in research.
• Proposing an ad-hoc computational creativity tool for making DS pipelines design-friendly for non-data scientists.

4. Towards a Human in the loop creative platform for designing data science pipelines

Figure 1 shows the general architecture of the MATILDA platform, which assists people with different expertise in following a creative process for designing DS pipelines given datasets and target research questions. The platform relies on a step-by-step conversational approach based on our previous work [12] and provides interaction entry points that allow humans to give feedback, validate and guide the creative process. For each phase of a DS pipeline (data exploration and preparation, fragmentation, training, testing and assessing), the platform suggests possible scenarios that the user may adopt or reject. To this end, the platform relies on a knowledge base representing data science pipelines, together with models of research questions and data features, which can be used to propose solutions in a way similar to case-based reasoning approaches.

Figure 1: Matilda platform creation pipeline.

1. Data search: given keywords about the topic or a sample of data to be analysed, the platform relies on "queries as answers" and exploration techniques to propose related datasets. The platform shows the possible questions associated with the data through "queries as answers" techniques. Through an interactive process, a data scientist can converge to a sample of data representative of the type of questions she/he wishes to express (e.g., factual, modelling, prediction, etc.).
2. Designing data exploration and cleaning pipeline: given a dataset, the platform performs a quantitative analysis of the attributes, their dependencies and the distribution of their values. The platform also suggests cleaning and data engineering strategies that give the data specific mathematical properties. By interacting with the data scientists, the platform gathers information about their decisions. This information can be used to keep track of the design process. For now, this is a very quantitative perspective of the creation process; in future work, we will try to approach creativity from other perspectives.
3. DS pipeline creation: the current platform does not rely on existing AI model recommendation systems but on knowledge about the questions previously addressed with AI models; it proposes building blocks that can be combined into pipelines. These building blocks could be used to answer the questions produced in step 1. The building blocks include suggestions on the scores that can be used for assessing and calibrating training phases. For every building block, the platform also shares similar solution contexts in which it has been used.

Our DS creativity platform allows us to study how overall creativity is affected if computer systems take over different roles within the design of data science pipelines. The platform provides a collaborative environment that integrates an artificial actor into the creative production process of DS pipelines by data scientists.

5. Conclusions and Future Work

The research and development market associated with data science is fuelling the economies of countries around the world. Almost all sectors in the global economies see data science as a promising alternative to develop original solutions to critical societal problems and promote data-driven decision-making processes that can create know-how and value. Yet, despite the availability of datasets, of technology for transforming any phenomenon produced in reality into digital data, and of a variety of algorithms (mathematical and artificial intelligence models), the design of data science solutions remains artisanal. The impact on person-hours and economic investment is not anecdotal. The time has come to propose methodologies that can formalise the design of data science solutions and model the "know-how" developed by data scientists during the creation process. Besides, data science addresses transdisciplinary challenges. It is critical to bridge the gap between the technical vocabulary and tasks of data science and the vocabulary of other disciplines and of users with different expertise. This strategy will ensure the usability and acceptability of solutions (i.e. pipelines). In summary, data science must become inclusive and accessible to all. Our work addresses this challenge by aiming to adopt computational creativity methods to model the data science design process(es) that combine human and nonhuman creativity.

The MATILDA platform proposed in this paper is based on the original methodologies that we propose. It contributes to creating data science pipelines according to the knowledge-discovery expectations raised by the target research questions, the input data's characteristics and the data scientists' models. Creativity-based methodologies applied to data science will make it accessible and inclusive, so that it can address the increasingly complex problems humanity faces.

6. Acknowledgements

The work reported in this paper is performed in the context of the project FRIENDLY5 funded by the inter-group program of the laboratory LIRIS, Lyon.

5 http://vargas-solar.com/friendly/

References

[1] M. A. Boden, et al., The creative mind: Myths and mechanisms, Psychology Press, 2004.
[2] M. L. Maher, Evaluating creativity in humans, computers, and collectively intelligent systems, in: Proceedings of the 1st DESIRE Network Conference on Creativity and Innovation in Design, 2010, pp. 22–28.
[3] M. L. Maher, Computational and collective creativity: Who's being creative?, in: ICCC, 2012, pp. 67–71.
[4] S. Negrete-Yankelevich, N. Morales-Zaragoza, The apprentice framework: planning and assessing creativity, Proceedings of the Fifth International Conference on Computational Creativity, 2014.
[5] R. K. Sawyer, Explaining creativity: The science of human innovation, Oxford University Press, 2011.
[6] A. K. Goel, A. G. de Silva Garza, Special issue on artificial intelligence in design, Journal of Computing and Information Science in Engineering 10 (2010).
[7] N. Gu, P. Amini Behbahani, A critical review of computational creativity in built environment design, Buildings 11 (2021) 29.
[8] A. A. Kantosalo, J. M. Toivanen, H. T. T. Toivonen, Interaction evaluation for human-computer co-creativity: A case study, in: Proceedings of the Sixth International Conference on Computational Creativity, Brigham Young University, 2015.
[9] A. Kantosalo, S. Riihiaho, Experience evaluations for human–computer co-creative processes – planning and conducting an evaluation in practice, Connection Science 31 (2019) 60–81.
[10] S. Colton, J. W. Charnley, A. Pease, Computational creativity theory: The face and idea descriptive models, in: ICCC, Mexico City, 2011, pp. 90–95.
[11] S. Negrete-Yankelevich, N. Morales-Zaragoza, e-motion: a system for the development of creative animatics, Proceedings of the Fourth International Conference on Computational Creativity, 2013.
[12] P. Bethaz, K. Belhajjame, G. Vargas-Solar, T. Cerquitelli, Ds4all: All you need for democratizing data exploration and analysis, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE, 2021, pp. 4235–4242.
[13] E. Cruz-Esquivel, Z. J. Guzman-Zavaleta, An examination on autoencoder designs for anomaly detection in video surveillance, IEEE Access 10 (2022) 6208–6217.
[14] P. Glines, I. Griffith, P. M. Bodily, Software design patterns of computational creativity: A systematic mapping study, in: ICCC, 2021, pp. 218–221.
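
As an illustration only of the case-based suggestion step described in Section 4 (proposing pipeline building blocks from a knowledge base of previously designed pipelines), the following Python sketch ranks stored cases by their similarity to a new research question and dataset. It is a minimal sketch under stated assumptions and does not reproduce MATILDA's implementation: the PipelineCase structure, the suggest function, the choice of Jaccard similarity, the w_question weight and the toy knowledge-base entries are all hypothetical and were introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCase:
    """A previously designed pipeline in a hypothetical knowledge base."""
    question_terms: frozenset  # keywords describing the research question
    data_features: frozenset   # coarse descriptors of the input dataset
    building_blocks: tuple     # ordered pipeline steps that were used


def jaccard(a, b):
    """Jaccard similarity between two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def suggest(cases, question_terms, data_features, k=1, w_question=0.5):
    """Return the k stored cases most similar to the new problem.

    Similarity is a weighted sum of the Jaccard similarity on question
    keywords and on dataset descriptors (an assumption of this sketch).
    """
    def score(case):
        return (w_question * jaccard(case.question_terms, question_terms)
                + (1 - w_question) * jaccard(case.data_features, data_features))

    ranked = sorted(cases, key=score, reverse=True)
    return [(case.building_blocks, round(score(case), 3)) for case in ranked[:k]]


# Toy knowledge base; the entries are invented for illustration only.
KB = [
    PipelineCase(frozenset({"predict", "price"}),
                 frozenset({"tabular", "numeric"}),
                 ("impute", "scale", "regression")),
    PipelineCase(frozenset({"detect", "behaviour", "pedestrian"}),
                 frozenset({"video"}),
                 ("frame_sampling", "person_detection", "pattern_clustering")),
]

# New problem, loosely inspired by the urban-space scenario of Section 3.
best = suggest(KB, frozenset({"detect", "pedestrian", "change"}), frozenset({"video"}))
print(best)
```

In this toy run the video-based case is ranked first because it shares the dataset descriptor and two question keywords with the new problem. A human in the loop would then accept, adapt or reject the suggested building blocks, and a real knowledge base would need richer representations of questions and data than keyword sets.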