<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ontology for data science research results reuse</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Aritha</forename><surname>Kumarasinghe</surname></persName>
							<email>dewnith.kumarasinghe@rtu.lv</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Applied Computer Systems</orgName>
								<orgName type="institution">Riga Technical University</orgName>
								<address>
									<addrLine>6A Kipsalas Street</addrLine>
									<postCode>LV-1048</postCode>
									<settlement>Riga</settlement>
									<country key="LV">Latvia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marite</forename><surname>Kirikova</surname></persName>
							<email>marite.kirikova@rtu.lv</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Applied Computer Systems</orgName>
								<orgName type="institution">Riga Technical University</orgName>
								<address>
									<addrLine>6A Kipsalas Street</addrLine>
									<postCode>LV-1048</postCode>
									<settlement>Riga</settlement>
									<country key="LV">Latvia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Ontology for data science research results reuse</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">22FD480E0976110F57C36601EEA7A337</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Ontology</term>
					<term>Data Science</term>
					<term>Research Results Reuse</term>
					<term>Project Attributes</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data Science is the science that relates to the extraction of knowledge and information from data. As the amount of data we produce increases, data science projects have become a very popular endeavor in recent years, accompanied by an increased interest in research relating to data science resources such as the data sources, algorithms, technologies, and visualizations as well as the application domains of these data science resources. The amalgamation of the results gained by data science projects can be a complex process that can be time and labor-intensive. This research seeks to reduce the project complexity by proposing an ontology that can represent data science (research) project based on domain-specific (data science) project attributes that can represent all conceivable aspects of a data science project.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Data science aims to clean, prepare, and analyze different data sets to extract meaning from data <ref type="bibr" target="#b0">[1]</ref>. With the increased number of applications of data science in different sectors/domains such as social housing, shipping, and automotive retail <ref type="bibr" target="#b0">[1]</ref>, to name a few, there is an increased amount of knowledge being produced by these projects relating to how data science resources can be applied in data science projects and/or domains. How this knowledge from projects can be accumulated for reuse in future projects is the issue that will be the focus of this paper.</p><p>One solution for this problem is the use of a knowledge graph which is a knowledge representation that can effectively organize and represent knowledge <ref type="bibr" target="#b1">[2]</ref>. This solution was showcased in our previous work related to a knowledge graph for reusing research knowledge on related works in data analytics <ref type="bibr" target="#b2">[3]</ref>. The presented knowledge graph utilized a star-like ontology based on analytics project attributes, as a schema. The 18 defined analytics project attributes were based on an initial literature review and represented different aspects of data analytics projects such as the data analysis algorithm(s) used, the data set(s) used, and the analysis software(s) or tool(s) used, etc.</p><p>The star-like ontology was defined based on a triple structure that considers the subject to be the data analytics project, the object to be the data analytics project attribute relating to a specific aspect of a data science project, and the predicate to be the relationship type defined based on the data analytics project attribute. 
This structure meant that each attribute type needed to be represented as a class in the ontology with a corresponding property defined for the class representing the data analytics project, relating it to the data analytics project attribute value. This meant that when additional analytics project attributes were defined, it increased the complexity (number of classes) of the ontology on a class level, and when the ontology changed, the process related to changing the ontology was complex (editing many "one-level" ontology classes and their properties). Additionally, this ontology, like any other ontology, relied on a static representation of the data analytics domain and assumed that the user is only interested in the aspects of the data analytics project that are represented by the analytics project attribute types. This work seeks to resolve the issue by making an ontology that can be relatively easily modified. The previous ontology was constructed inductively by considering the data analytics projects. The one proposed in this paper is built to represent what can be considered as the already established(standard) aspects of a data science project defined based on the data science body of knowledge; it also seeks to allow additional aspects to be introduced (on an instance level) based on the users' needs or the results produced by data science projects. We accomplish this by proposing an ontology that does not seek to represent the domain of data science but instead to represent projects within said domain. 
As no dedicated body of knowledge is available for data analytics, we chose the body of knowledge that represents data science. We therefore widened the scope of the ontology to data science, with data analysis considered a knowledge area of this domain, as defined in the Data Science Body of Knowledge (DS-BoK) by the EDISON project <ref type="bibr" target="#b3">[4]</ref>. This extension of the scope is rather formal, because the attributes covered by the previous ontology are specific mostly to the data analysis aspects of data science. So, instead of a data analytics project attribute, we have a data science project attribute defined as a class, with the individual data science project attributes defined as instances of that class. We also propose a novel method with which data science attributes can be defined based on the Data Science Competence Framework (CF-DS), also created by the EDISON project <ref type="bibr" target="#b4">[5]</ref>, to represent all currently recognized aspects of projects in the data science domain, with the possibility to define more data science project attributes based on research projects as instances of the ontology.</p><p>To ensure that the created ontology conforms to existing knowledge engineering practices, we followed the Ontology Definition 101 methodology <ref type="bibr" target="#b5">[6]</ref>. Section 2 of this paper outlines the definition process of the proposed ontology based on the steps of the Ontology Definition 101 methodology. Section 3 demonstrates how the created ontology can be applied for data science knowledge reuse. Section 4 concludes the paper and outlines future research work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Ontology definition process</head><p>This section briefly presents the ontology definition process.</p><p>The Ontology Definition 101 methodology <ref type="bibr" target="#b5">[6]</ref> was used as a basis for ontology definition. It defines the following knowledge-engineering steps:</p><p>1. Determine the domain and scope of the ontology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Domain and scope of the ontology</head><p>The domain of the proposed ontology is projects within data science that aim to clean, prepare, and analyze different data sets for extracting meaning from data <ref type="bibr" target="#b0">[1]</ref>. The goal of the ontology is to accumulate and preserve knowledge that is produced in data science projects. To encapsulate this knowledge, the ontology will represent the different kinds of resources used within the domain of data science based on data science project attribute types (such as algorithms used, data sources used, and visualizations used) and based on these attribute types the knowledge presented by the ontology will change. It should be noted that this ontology does not seek to directly represent the data science domain but instead to represent projects within this domain (Fig. <ref type="figure" target="#fig_0">1</ref>) via the data science attribute type that represents the resources within the data science domain. A partial representation of the data science domain is possible through the resources represented.</p><p>To represent data science projects, the ontology will try to answer the following four competency questions (that are defined in the Ontology Definition 101 methodology as 'One of the ways to determine the scope of the ontology is to sketch a list of questions that a knowledge base based on the ontology should be able to answer'):</p><p>1. What data science projects have been/are being completed? 2. What data science application domains have the data science projects been completed in? 3. What type of data science project attributes represent the resources that are used in data science projects? 4. 
What attributes represent a completed/ongoing data science project?</p><p>Based on the competency question relating to the type of data science project attributes, as well as the one relating to the application domains of data science resources, further competency questions can be derived, such as 'Given a data science application domain (representing the application domains of data science resources), what machine learning algorithms can be used (representing the resources that are produced within the data science domain)?'. Basing the ontology on data science project attribute types (e.g., algorithms used) and data science project attributes (e.g., Decision Trees) separately allows for more domain-specific knowledge to be inferred from any knowledge graph that utilizes this ontology as a schema.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Reusing existing ontologies</head><p>In the authors' previous work <ref type="bibr" target="#b2">[3]</ref>, an ontology for reusing research knowledge on related work in data analytics, based on 18 analytics project attributes, was proposed and used. This work concerns an ontology with a similar purpose but expands it to the domain of data science, which is larger than data analytics.</p><p>In the domain of data science, several ontologies have been proposed, and some of them are considered in this section.</p><p>There is an ontology <ref type="bibr" target="#b6">[7]</ref> for Big Data Analytics as a Service (MBDSaaS) based on the declarative (sub)model proposed in the TrustwOrthy model-awaRE Analytics Data platform (TOREADOR) project, intending to aid 'incompatibility management and the creation of OWL-S descriptions enabling different approaches for the selection tasks'. This ontology concerns such aspects as (1) Data preparation, all activities aimed to prepare data for analytics; (2) Data representation, how data are represented and representation choices for each analysis process; (3) Data analytics, the analytics to be computed; (4) Data processing, how data are routed and parallelized; and (5) Data visualization and reporting, an abstract representation of how the results of analytics are organized for display and reporting. A similar ontology, but based on the Cross-Industry Standard Process for Data Mining (CRISP-DM), is proposed within Intelligent Big Data Analytics as an intermediary between the abstract tasks in data mining workflows, to automate the data mining process <ref type="bibr" target="#b7">[8]</ref>. Another noteworthy ontology within data analytics is an ontology-based framework relating to the recommendation of an analysis method <ref type="bibr" target="#b8">[9]</ref>. This work proposes an ontology based on data sources and analysis methods and demonstrates the value of ontology-based applications. 
The proposed ontologies do have the potential to further expand on data analysis method-related aspects of the ontology we propose.</p><p>To compare with our ontology used in the previous work <ref type="bibr" target="#b2">[3]</ref>, the above-mentioned approaches define data analytics in a much narrower sense. In our approach, the data analytics project practically included all the above-mentioned aspects; however, for instance, the ontology in <ref type="bibr" target="#b6">[7]</ref> goes to a higher level of detail regarding each of the aspects while, in our case, there is no further classification of individuals. The ontology in <ref type="bibr" target="#b7">[8]</ref> is more complex and might be harder to apply to information that is available in the scientific/project literature regarding specific data science projects. It would also be harder to maintain such ontology given that this ontology could change on a class level, whereas the ontology proposed in this paper would practically act as an OWL-based data schema that would only change on an instance level.</p><p>Thus, the question is about the granularity of the ontology. We might assume that higher granularity of ontology might give additional opportunities in knowledge amalgamation; however, as was already mentioned, the level of detail available in scientific works or project documentation does not always allow us to go to that level of detail. Also, the higher the level of detail, the more often reconsidering an ontology itself might be needed. Thus, the open question is what level of detail might be useful in amalgamating knowledge in the data science domain and what frameworks or initiatives might be used to maintain the ontology used to refer to the work in the respective domain.</p><p>As shown above, there are many applications of ontologies within the domain of data science and the proposed ontologies have different purposes. 
To our knowledge, no ontologies have yet been proposed for the reuse of knowledge within the data science domain. In this paper, based on our experience with scientific work in data analytics, we propose an ontology that is based on the skills and knowledge units defined within the EDSF (EDISON Data Science Framework) Data Science Competence Framework (CF-DS) <ref type="bibr" target="#b4">[5]</ref>.</p><p>The final ontology can be used to reduce the complexity related to knowledge reuse in data science and in data science projects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Defining classes and class properties</head><p>Within the previous sections, the following terms have been recurrent:</p><p>1. Data Science Project: a project that aims to clean, prepare, and analyze different data sets for extracting meaning from data. 2. Data Science Project Attribute Type: a representation of the type of attributes that describe a data science project resource, such as the data mining algorithm used within the project <ref type="bibr" target="#b9">[10]</ref>. When defining these attribute types, it is important to consider those that have already been recognized as well as newer technologies within the domain of data science. These two kinds of attribute types can be represented as standard and custom data science attribute types. 3. Data Science Project Attribute: an attribute that characterizes one or more data science projects, such as decision trees <ref type="bibr" target="#b10">[11]</ref>, which would be of the attribute type data mining algorithms used. 4. Data Science Application Domain: the domain in which data science is being applied, such as social housing, shipping, and automotive retail <ref type="bibr" target="#b0">[1]</ref>. It should be noted that, to our understanding, there is no existing taxonomy of the application domains of data science; therefore, the classification of a data science project to an application domain is at the user's discretion.</p><p>These four classes (and two subclasses) will be sufficient for accumulating data science knowledge based on data science project attributes and the class properties shown in Fig. <ref type="figure" target="#fig_1">2</ref>. OWL and RDFS are used for the definition of classes and class properties given that this ontology is meant to be used for a knowledge graph, created using RDF, as was the case in the authors' previous work <ref type="bibr" target="#b2">[3]</ref>. 
However, unlike the previously defined ontology, the ontology proposed in this paper allows the data science project attribute types to be defined on an instance level (thereby reducing the complexity of the ontology in relation to the number of classes) and does not require maintenance as would be the case for most of the other ontologies. This is because the conceptualization of the data science domain is done through data science project attributes, which are represented at an instance level and not the class level. Specifics of how this ontology can be utilized to store knowledge from completed or ongoing projects are demonstrated through the definition of instances for this ontology in Section 3.</p><p>Before the utility of the ontology can be demonstrated, the standard data science project attribute types need to be defined in such a way that the already recognized aspects of the data science domain must be represented; this is done in the next subsection.</p></div>
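<div xmlns="http://www.tei-c.org/ns/1.0"><p>The class structure described above can be illustrated with a minimal Python sketch. It is not the OWL definition of the ontology: the Python class names, including StandardAttributeType and CustomAttributeType for the two subclasses, and the example value 'Multimodal Fusion Pipeline' are illustrative assumptions. The point it shows is that a new attribute type is added as an instance, without introducing a new class.</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DSProjectAttributeType:
    """Type of attribute, e.g. a kind of resource used in a project."""
    name: str


@dataclass(frozen=True)
class StandardAttributeType(DSProjectAttributeType):
    """Hypothetical subclass: type taken from an established framework."""


@dataclass(frozen=True)
class CustomAttributeType(DSProjectAttributeType):
    """Hypothetical subclass: type introduced by a user or a project."""


@dataclass(frozen=True)
class DSProjectAttribute:
    """A concrete attribute value together with its attribute type."""
    name: str
    attribute_type: DSProjectAttributeType


@dataclass(frozen=True)
class DSApplicationDomain:
    """Application domain, assigned at the user's discretion."""
    name: str


@dataclass
class DSProject:
    """A data science project with its domain and attributes."""
    name: str
    domain: DSApplicationDomain
    attributes: list = field(default_factory=list)

    def attribute_types(self):
        # Names of all attribute types used by this project.
        return {a.attribute_type.name for a in self.attributes}


# Attribute types are plain instances, so adding one needs no new class.
ml_used = StandardAttributeType(
    "(Supervised)MachineLearning_Technology/Algorithm/Tool_Used")
fusion_used = CustomAttributeType("DataFusion_Method_Used")

project = DSProject("DSProject ABC", DSApplicationDomain("Health Care"))
project.attributes.append(DSProjectAttribute("Decision Trees", ml_used))
project.attributes.append(
    DSProjectAttribute("Multimodal Fusion Pipeline", fusion_used))
print(sorted(project.attribute_types()))
```

<p>In this sketch, the set of Python classes stays fixed while the list of attribute type instances grows, which mirrors the instance-level extension discussed in this section.</p></div>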
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Data science project attribute types definition process</head><p>The data science project attribute types are defined based on the Data Science Competence Framework (CF-DS) Release Two <ref type="bibr" target="#b4">[5]</ref>, which is part of the EDISON Data Science Framework (EDSF) (the result of the EU-funded EDISON project). The EDSF is a collection of documents that defines the Data Science Profession and includes the aforementioned CF-DS and the Data Science Body of Knowledge (DS-BoK) <ref type="bibr" target="#b3">[4]</ref>. The base body of knowledge does not directly address the research results in data science; rather, it reflects the knowledge that is needed to achieve them. Therefore, the attribute types of data science projects can be defined only indirectly through the concepts available in the chosen body of knowledge. In this work, we consider only the relationships between knowledge and skill (excluding those relating to the analytics languages, tools, platforms, and Big Data infrastructure) based on the established notion that knowledge is a prerequisite of skillful action <ref type="bibr" target="#b11">[12]</ref>.</p><p>Unlike the previous work, which limited the scope of the ontology to data analytics, in this work the scope represented in the ontology is expanded by defining data science project attribute types in such a way that they map the skills defined in the CF-DS to the knowledge units defined in the CF-DS, so that each knowledge area has at least one associated skill, ensuring that all knowledge required for data science related skills is represented in the ontology. 
This would allow for the representation of the data science domain as a whole and the encapsulation of knowledge relating to all established aspects of a data science project.</p><p>To ensure that the defined data science project attribute types are accurate and traceable to the CF-DS knowledge units/topics and skills, the data science project attributes are defined in an X_Y_Z format (Fig. <ref type="figure" target="#fig_4">4</ref>), where X represents (Data Science) Domain Specific Key Words found in both the Knowledge topics/units required and the Skills (such as Data Mining, Supervised Machine Learning, and Predictive Analytics), Y represents the (Data Science) Domain Specific Resources (such as Techniques, Tools, and Algorithms), and Z represents Action Verbs (such as used, implemented, and developed) that are defined based on the action words (such as use, implement, and develop) that are mentioned in the CF-DS skills. This provides a formal meta-structure for data science project attribute types that was missing in the previously defined data analytics project attributes, and it enables the systematization of the data science attribute definition process. Based on the defined Data Science Project Attribute Type Structure, the EDSF CF-DS Skills and EDSF CF-DS Knowledge units/topics were manually parsed to recognize the relevant Keywords and Action Verbs. An example of how this was accomplished can be seen in Fig. <ref type="figure" target="#fig_5">5</ref>, which shows how three data science project attributes were defined to map the knowledge units KDSDA01, KDSDA02, and KDSDA03 to the skill SDSDA01. Fig. 
<ref type="figure" target="#fig_5">5</ref> shows that the defined DS Project Attribute Types have additional text within brackets; this text is introduced to provide additional specificity for the domain-specific keywords and was defined based on the Skills or Knowledge Areas/Topics.</p><p>Utilizing this Data Science (DS) Project Attribute Types Definition Process, 77 Data Science Attribute Types were defined, mapping all knowledge areas to at least one skill, thus representing all currently established aspects of data science projects and providing the possibility to define more data science attributes for capturing data science research results than were discovered by the bottom-up approach in our previous work. All the skills themselves have at least one corresponding DS Project Attribute Type, with the only exception being SDSENG12 - Use of Recommender or Ranking system. As this skill relates to Recommender and Ranking Systems, the authors took the liberty of considering these systems as information systems, which allowed this skill to be mapped to KDSENG10 - Information Systems, collaborative systems by defining two DS Project Attribute Types: (i) (Information)Recomender_System_Used and (ii) (Information)Ranking_System_Used.</p><p>The table containing a list of all identified unique keywords, resources, and action words demonstrating the relationships between the EDSF CF-DS Skills, Knowledge, and the defined data science project attributes is available in a GitHub repository <ref type="bibr" target="#b15">[16]</ref>. It should be noted that, in some cases, the data science project attribute types have missing action verbs and/or resources due to limited text in either the knowledge unit/topic (e.g., KDSDA13 - Optimisation) or the skills (e.g., KDSDA14 - Optimisation). 
In some cases, the authors applied placeholders Y and Z, which were used to maximize the number of aspects of the data science domain represented by the data science project attributes: 16 (20%) of the DS project attributes are missing a Domain Specific Resource, and 3 of them are also missing a Domain Specific Action Verb. These missing values simply reduce the specificity of the defined DS Project Attribute Types while still providing a (limited) representation of this aspect of the data science project.</p><p>When comparing the newly defined data science project attribute types with the previously defined data analytics project attribute types <ref type="bibr" target="#b2">[3]</ref> (of which there are only 18), it is possible to conclude that these attribute types are general to the data science domain, whereas those previously defined and not present in the new ontology were specific to data analytics (which now is considered as a sub-domain). This is evidenced, for instance, by specific attribute types relating to data visualization and interactive results (dashboards) created, which here are not represented due to not being a knowledge unit related to data visualization (although it is represented as a skill related to tools and software) in the CF-DS.</p></div>
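<div xmlns="http://www.tei-c.org/ns/1.0"><p>The X_Y_Z naming structure described in this section can be sketched as a small helper function. This is only an illustration under stated assumptions: the function name attribute_type_name and its parameters are hypothetical, and a missing resource (Y) or action verb (Z) is simply omitted here, which may differ from how placeholders are handled in the published table.</p>

```python
def attribute_type_name(keyword, resource=None, action_verb=None,
                        qualifier=None):
    """Compose an attribute type name in the X_Y_Z format.

    keyword      X: domain-specific keyword, e.g. "PredictiveAnalytics"
    resource     Y: domain-specific resource, e.g. "Method" (optional)
    action_verb  Z: action verb, e.g. "Used" (optional)
    qualifier    optional bracketed text that narrows the keyword
    """
    if not keyword:
        raise ValueError("X (the domain-specific keyword) is required")
    # The bracketed qualifier, if any, is prefixed to the keyword.
    prefix = "(" + qualifier + ")" if qualifier else ""
    parts = [prefix + keyword]
    # Assumption: missing Y or Z parts are left out of the name.
    parts += [part for part in (resource, action_verb) if part]
    return "_".join(parts)


print(attribute_type_name("PredictiveAnalytics", "Method", "Used"))
print(attribute_type_name("MachineLearning", "Technology/Algorithm/Tool",
                          "Used", qualifier="Supervised"))
```

<p>With these inputs the helper reproduces two names that appear in this paper, PredictiveAnalytics_Method_Used and (Supervised)MachineLearning_Technology/Algorithm/Tool_Used.</p></div>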
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Utility of the created ontology</head><p>Given that the ontology was originally intended to be used as a schema for a knowledge graph, the UML schema (Fig. <ref type="figure" target="#fig_6">6</ref>) was defined for a knowledge graph using OWL classes and subclasses as well as object properties.</p><p>The UML class diagram shows the classes and subclasses defined for this ontology. Here, the class (rdf:ID="DSProjectAttributeType") was defined for the DS (Data Science) Project to represent the resources available in the data science domain. It organizes the knowledge gained within a previously completed data science project and relates to instances of the class (rdf:ID="DSProjectAttribute") through an 'isAnAttributeOfType' relationship, for instance:</p><p>- 'DSProject ABC' (instance of &lt;owl:Class rdf:ID="DSProject"&gt;) hasDSProjectAttributeValue (property of &lt;owl:Class rdf:ID="DSProject"&gt;) 'Decision Trees' (instance of &lt;owl:Class rdf:ID="DSProjectAttributeValue"&gt;)</p><p>- 'Decision Trees' (instance of &lt;owl:Class rdf:ID="DSProjectAttribute"&gt;) isAnAttributeOfType (property of &lt;owl:Class rdf:ID="DSProjectAttribute"&gt;) '(Supervised)MachineLearning_Technology/Algorithm/Tool_Used' (instance of &lt;owl:Class rdf:ID="DSProjectAttributeType"&gt;)</p><p>- 'DSProject ABC' (instance of &lt;owl:Class rdf:ID="DSProject"&gt;) hasDSProjectAttributeOfType (property of &lt;owl:Class rdf:ID="DSProject"&gt;) '(Supervised)MachineLearning_Technology/Algorithm/Tool_Used' (instance of &lt;owl:Class rdf:ID="DSProjectAttributeType"&gt;)</p><p>The three triples mentioned above represent the relationship between the data science project, data science project attribute type, and data science attribute in the form of 'DSProject ABC', '(Supervised)MachineLearning_Technology/Algorithm/Tool_Used', 'Decision Trees'. 
Similarly, other data science project attributes can represent project features such as Natural Language Processing_Method_Used, Data Preparation/Data Preprocessing_Method_Used, Performance/Accuracy_Metric_Used, etc. The additional RDF triple below relates the data science project to a data science application domain.</p><p>- 'DSProject ABC' (instance of &lt;owl:Class rdf:ID="DSProject"&gt;) relatesToDSProjectDomain (property of &lt;owl:Class rdf:ID="DSProject"&gt;) 'Health Care' (instance of &lt;owl:Class rdf:ID="DSProjectDomain"&gt;)</p><p>This new RDF triple, combined with the reasoning capabilities of knowledge graphs realized through rule-based reasoning, allows for inferring what resources can be used within a specific data science application domain. The rule-based inference is realized in this ontology through the use of a SWRL <ref type="bibr" target="#b12">[13]</ref> rule:</p><p>- DSProject(?project) ^ hasDSProjectAttribute(?project, ?tool) ^ relatesToDSProjectDomain(?project, ?domain) -&gt; canBeUsedInDSApplicationDomain(?tool, ?domain)</p><p>To demonstrate the use of this ontology from a practical standpoint, we implemented a knowledge graph that stores and presents knowledge acquired from a single project within the data science domain <ref type="bibr" target="#b13">[14]</ref>. That project introduces SatelliteBench, a framework for satellite image extraction and vector embeddings generation, and its utility in creating predictive models for poverty, education, and dengue prediction. The information provided in this project is presented using Protégé <ref type="bibr" target="#b14">[15]</ref> (Fig. <ref type="figure" target="#fig_6">6</ref>). The fact that all these resources can be utilized in the domain of Public Health can also be inferred using the SWRL rule that was defined relating the DSProjectDomain and the DSProjectAttribute instances. Most new instances required the definition of Custom data science project attribute types. 
It should be mentioned that these instances were defined using ChatGPT 4o (accessed 15 August 2024) with a query that outlined the structure of the ontology (including definitions of object properties as given in Section 2.3), provided instances of Standard Project Attributes (mentioned in Section 2.4), and attached the PDF version of the research article <ref type="bibr" target="#b13">[14]</ref>, combined with a request to present the result as instances in RDF/XML format.</p><p>This demonstrates the possibility of utilizing this ontology with LLMs or other advanced text parsing technologies to automate the accumulation and presentation of knowledge gained from completed or ongoing data science projects. The reliability of LLMs for the production of the instances for the defined data science project ontology requires further research, but this work demonstrates how domain-specific attribute types defined in a format that mimics natural language allow queries for LLMs to be formulated easily.</p><p>The flexibility of the ontology, enabled through the class representing data science project attribute types, allows the user to define the data resources they are interested in and to disregard the instances they are not interested in (demonstrated by the fact that, of the 77 standard attribute types defined, only one, 'PredictiveAnalytics_Method_Used', was needed to represent the project, whereas 7 additional custom attribute types needed to be defined). This flexibility also allows automation of the knowledge accumulation process based on the information that is available in the project reports (in this case, a research article published as a result of this data science project), and also allows the representation of resources that are not widely used (represented by the custom data science project attributes such as DataFusion_Method_Used, PovertyIndex_Assessment_Used, etc. (Fig. 
<ref type="figure" target="#fig_5">5</ref>)) or have recently been introduced by a research project.</p></div>
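<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of the SWRL rule given in this section can be sketched in plain Python as a single forward-chaining pass over a set of triples. This is an illustrative stand-in rather than the Protégé or SWRL implementation used in this work: the function name infer_domain_usage, the tuple encoding of triples, and the use of 'rdf:type' as the class membership predicate are assumptions.</p>

```python
def infer_domain_usage(triples):
    """Apply the rule once: every attribute of a DSProject can be used
    in every application domain that the same project relates to."""
    projects = {s for (s, p, o) in triples
                if p == "rdf:type" and o == "DSProject"}
    inferred = set()
    for project in projects:
        tools = {o for (s, p, o) in triples
                 if s == project and p == "hasDSProjectAttribute"}
        domains = {o for (s, p, o) in triples
                   if s == project and p == "relatesToDSProjectDomain"}
        # Pair each attribute of the project with each of its domains.
        for tool in tools:
            for domain in domains:
                inferred.add(
                    (tool, "canBeUsedInDSApplicationDomain", domain))
    return inferred


# Example triples taken from the 'DSProject ABC' illustration above.
triples = {
    ("DSProject ABC", "rdf:type", "DSProject"),
    ("DSProject ABC", "hasDSProjectAttribute", "Decision Trees"),
    ("DSProject ABC", "relatesToDSProjectDomain", "Health Care"),
}
print(infer_domain_usage(triples))
```

<p>For the example triples, the only inferred statement relates 'Decision Trees' to 'Health Care' through canBeUsedInDSApplicationDomain. A subject that is not declared as a DSProject produces no inferences, which corresponds to the DSProject(?project) atom in the rule body.</p></div>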
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions and future works</head><p>This work outlines the definition of an ontology that can be used to facilitate the reuse of knowledge acquired through the completion of data science projects. This ontology is based on data science project attributes (a concept of project attribute was introduced for data analytics in an author's previous work <ref type="bibr" target="#b2">[3]</ref>) that are meant to represent various aspects of data science projects concerning the various kinds of resources (e.g., machine learning algorithms) that are available within the domain of data science represented by the data science attribute types (machine learning algorithms used). This shifts the goal of the ontology from representing a domain to representing projects within that domain in relation to the resources available within that domain. An ontology was created using the Ontology Definition 101 Methodology with the data science project attribute and attribute type as classes within this ontology. This paper also introduces a method that can be used to systematically define Data science attribute types based on the EDISON Data Science Competence Framework <ref type="bibr" target="#b4">[5]</ref> to represent all currently recognized aspects of data science.</p><p>The utility of the data science project ontology was demonstrated by representing the knowledge acquired from a single data science project <ref type="bibr" target="#b13">[14]</ref>, presenting knowledge such as the Dengue Prediction Model as a Predictive Method Used and the Multimodal Fusion Pipeline as a Data Fusion Technique Used. Through the use of a SWRL rule that relates the defined data science project attributes to a data science application domain, it is possible to infer the application domain of these data science resources which in this case was public health. 
It was also demonstrated that using this ontology (with instances of the project attribute types within the domain of data science) in tandem with advanced text parsing technologies such as LLMs makes it possible to automate the knowledge accumulation process from project reports (which in the case of the discussed project <ref type="bibr" target="#b13">[14]</ref> was a single research article).</p><p>This paper demonstrates how shifting the goal of the ontology from domain representation to the representation of projects within a domain can simplify the ontology itself and reduce the complexity of maintaining it by making it more instance-centric than class-centric.</p><p>Future work can demonstrate further use of the ontology-defined data science project attributes for knowledge representation and for automating the knowledge accumulation process.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Proposed solution for the accumulation of knowledge produced within data science projects.</figDesc><graphic coords="4,86.05,228.72,424.90,337.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: UML class diagram representing the Ontology for Data Science Projects.</figDesc><graphic coords="6,129.00,471.25,336.56,172.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3</head><label>3</label><figDesc>is a model that is meant to give the reader an understanding of the elements within the CF-DS and to represent the knowledge and skills defined in the CF-DS. The CF-DS identifies skills and knowledge units by keywords followed by a number; examples of these values are also shown in the diagram.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Competency groups, competencies, skills, and knowledge units identified/defined in EDSF CF-DS [5].</figDesc><graphic coords="7,117.00,433.06,361.10,242.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: UML diagram representing the structure of the Data Science Project Attribute Type and its relationship to the EDSF CF-DS Skill and the EDSF CF-DS Knowledge unit/topic.</figDesc><graphic coords="8,90.50,469.87,414.00,57.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Data Science Project Attribute Types definition process demonstrated in relation to the EDSF CF-DS Knowledge unit/topics KDSDA01, KDSDA02, KDSDA03, and the skill SDSDA01.</figDesc><graphic coords="9,86.05,100.24,424.90,148.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: OntoGraf view of Data Science Project Ontology with instances representing a data science project [14].</figDesc><graphic coords="11,86.05,309.46,424.90,124.35" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A review of data science in business and industry and a future view</title>
		<author>
			<persName><forename type="first">G</forename><surname>Vicario</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Coleman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Stochastic Models in Business and Industry</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="6" to="18" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A review: Knowledge reasoning over knowledge graph</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">141</biblScope>
			<biblScope unit="page">112948</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Knowledge Graph for Reusing Research Knowledge on Related Work in Data Analytics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumarasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kirikova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Advanced Information Systems Engineering</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature Switzerland</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="186" to="199" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Demchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Manieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Belloum</surname></persName>
		</author>
		<ptr target="https://edison-project.eu/sites/edison-project.eu/files/filefield_paths/edison_ds-bok-release2-v04.pdf" />
		<title level="m">EDISON Data Science Framework: Part 2. Data Science Body of Knowledge (DS-BoK) Release 2</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Demchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Manieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Belloum</surname></persName>
		</author>
		<ptr target="https://edison-project.eu/sites/edison-project.eu/files/filefield_paths/edison_cf-ds-release2-v08_0.pdf" />
		<title level="m">EDISON Data Science Framework: Part 1. Data Science Competence Framework (CF-DS) Release 2</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Ontology development 101: A guide to creating your first ontology</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>McGuinness</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An OWL Ontology for Supporting Semantic Services in Big Data Platforms</title>
		<author>
			<persName><forename type="first">D</forename><surname>Redavid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Corizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malerba</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigDataCongress.2018.00039</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Congress on Big Data (BigData Congress)</title>
				<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="228" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Ontology-based service discovery for intelligent Big Data analytics</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">H</forename><surname>Akila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Siriweera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Paik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">T G S</forename><surname>Kumara</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICAwST.2015.7314022</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 7th International Conference on Awareness Science and Technology (iCAST)</title>
				<meeting><address><addrLine>Qinhuangdao, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="66" to="71" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An Ontology-Based Framework for Analysis Recommendation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Henriques</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Stacey</surname></persName>
		</author>
		<idno type="DOI">10.1109/BIBE.2014.70</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Bioinformatics and Bioengineering</title>
				<meeting><address><addrLine>Boca Raton, FL, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="277" to="282" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Data science and its relationship to big data and data-driven decision making</title>
		<author>
			<persName><forename type="first">F</forename><surname>Provost</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Fawcett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Big Data</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="51" to="59" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Ye</surname></persName>
		</author>
		<title level="m">Data mining: theories, algorithms, and examples</title>
				<imprint>
			<publisher>CRC Press</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Know how</title>
		<author>
			<persName><forename type="first">Jason</forename><surname>Stanley</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<publisher>OUP Oxford</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">SWRL: A semantic web rule language combining OWL and RuleML</title>
		<author>
			<persName><forename type="first">I</forename><surname>Horrocks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Patel-Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Boley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tabet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Grosof</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">W3C Member submission</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">79</biblScope>
			<biblScope unit="page" from="1" to="31" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A multimodal framework for extraction and fusion of satellite images and public health data</title>
		<author>
			<persName><forename type="first">D</forename><surname>Moukheiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Restrepo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Cajas</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41597-024-03366-1</idno>
	</analytic>
	<monogr>
		<title level="j">Sci Data</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">634</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The Protégé project: A look back and a look forward</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Musen</surname></persName>
		</author>
		<idno type="DOI">10.1145/2557001.25757003</idno>
	</analytic>
	<monogr>
		<title level="j">AI Matters</title>
		<imprint>
			<publisher>Association for Computing Machinery Special Interest Group on Artificial Intelligence</publisher>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2015-06">June 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Kumarasinghe</surname></persName>
		</author>
		<ptr target="https://github.com/ArithaRTU/Ontology-for-Data-Science-Research-Results-Reuse.git" />
		<title level="m">Ontology for Data Science Research Results Reuse (Version 1)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Computer software</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
