Towards Data Integrity Verification for More Sustainable Petroleum Industry Yuanwei Qu1,* , Zhuoxun Zheng2,1 , Baifan Zhou3,1 , Yan Zhou1 , Nicolau Santos4 , Ognjen Savkovic5 , Arild Waaler1 and David Cameron1 1 Department of Informatics, University of Oslo, Norway 2 Bosch Center for AI, Germany 3 Department of Computer Science, Oslo Metropolitan University, Norway 4 Federal University of Rio Grande do Sul, Brazil 5 Free University of Bozen-Bolzano, Italy Abstract As a conventional energy industry, the petroleum industry is responsible for supplying over half of the world’s energy. Facilitating sustainable development for petroleum energy production remains crucial. Data methods have emerged as powerful tools to advance sustainability by enabling efficient resource management and risk mitigation. However, the reliable implementation of data-driven methods relies on high-quality data, necessitating the verification of data integrity on substantial data volumes. To this end, this poster paper presents our ongoing research, leveraging ontologies and knowledge graphs as shared knowledge representation, and provides preliminary results on data integrity verification. Based on the ontologies, we formulate domain knowledge integrity constraints and test three technologies of integrity verification: Python, PySpark, and SPARQL, for exploring future potential industrial adoption. Keywords integrity verification, petroleum industry, sustainability 1. Introduction Background. With the development of technology, society is becoming increasingly aware of the importance of sustainable development. As a conventional energy industry, the petroleum industry is still responsible for supplying over half of the world’s energy [1]. Therefore, it is still extremely important to facilitate sustainable development for petroleum energy production. Being one of the most proactive industries embracing new technologies, the petroleum industry is widely adopting data-driven approaches to enhance energy production efficiency and safety, thereby reducing potential losses and environmental pollution during the production process. Currently, data-driven artificial intelligence (AI) has been widely applied to the petroleum industry to increase production efficiency and safety, from enhanced oil production and recovery to undesired event prediction [2]. This brings benefits such as reducing costly well tests, mitigating risks, increasing safety and operation efficiency. As results, data-driven AI contributes ISWC2023: The 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece * Corresponding author. $ quy@ifi.uio.no (Y. Qu) Β© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings a b Oil Well Water Well Main Well Subwell main well Role Role Role Role St 𝑄% (wt) BFO:roleOf V2 manifold RO:partOf FSO:exchangesFluidWith Wellbore O3PO:well O3PO: choke valve S1 S2 S1 RO:partOf subClass 𝑄!# 𝑄!" 𝑄!$ downstreamTo Of upstreamTo Well O3PO: choke choke choke Section Valve valve 1 valve 2 valve 3 BFO:hasContinuant FSO:suppliesFluidTo PartAtSomeTime Manifold subClass Of V1 V2 V3 O3PO: O3PO:piece O3PO: production of equipment Sensor zone hasUnit subwell 1 subwell 2 subwell 3 Datatype: String (w1) (w2) (w3) hasValue Datatype: Float rdfs:subClassOf owl:DatatypeProperty owl:ObjectProperty rdfs:Literal rdfs:Class Figure 1: (a) a simplified depiction of an oil production well example (S=sensor, V=valve, Q=fluid flow rate). (b) schematic illustration of the proposed ontology (partial) for petroleum production to suistainability by improving energy production efficiency, and reducing accidents with severe environmental damage, such as oil leaks. Challenge. The performance of data-driven approaches heavily relies on the data quality. Thereby, it presents an important challenge in ensuring the integrity of the data, which is crucial for data-driven solutions to deliver reliable predictions. In petroleum production, the purpose of collecting data from various sensors is to allow domain experts to analyse and make informed decisions based on their knowledge and experience. In this context, we discuss one of the issues of data integrity: the data should follow certain constraints of physical laws. This is in addition to other issues, such as missing values, sensor precision errors, etc. For verifying the constraints of physical laws, domain knowledge plays an important role, and should be incorporated in the integrity checking. Semantic technology is suited here due to its transparency, which the domain experts tend to have a high chance to trust because they can observe that their domain knowledge is used and how it is used. Our approach. In this poster paper, we present our ongoing research on a semantic solution for data integrity verification for petroleum industry. We develop a draft of a petrolum ontology aligned with upper ontologies as shared representation for data integration and knowledge representation; we construct knowledge graphs for transparent and unified human understand- ing; we experiment technologies for data integrity verification, including Python, PySpark and SPARQL. We provide preliminary experiment results and discussion of adoption. 2. Data and Knowledge Representation Ontology for petroleum production. In the petroleum domain, ontologies for petroleum exploration [3], reservoir [4], offshore production plant [5], subsurface fault [6], and petroleum risk assessment [7]. To meet our needs of verifying data integrity for the petroleum industry, we develop a draft of an ontology for petroleum production wells. Fig. 1a presents a simplified visual depiction of oil production wells. The fluid flow depicted in the figure originates from subsurface sub-wells, then merges towards the main well, and in the end reaches the offshore production platform. Fig. 1b depicts our petroleum ontology written in OWL 2, which includes 15 classes, 11 object properties, and 2 datatype properties. It contains several classes that are essential to the industry, including well section and subwell role. The ontology further includes core relations such as upstreamTo and downstreamTo, which facilitate the representation of the spatial locations of each well section and indicate the flow direction of the fluid supplied from the production zone. To ensure compatibility and interoperability, our ontology has been aligned with a domain ontology: the Offshore Petroleum Production Plant Ontology (O3PO) [8], which is built upon the Basic Formal Ontology (BFO) [9], Relation Ontology (RO) [10], and Industrial Ontologies Foundry Core [11], and Flow System Ontology (FSO) [12]. By using these classes and relations, our ontology provides a structured framework for integrating data, renaming the variables, and capturing and organising relevant knowledge and data, and supporting the integrity verification. Data from petroleum production wells. The data collected from the sensors in the production wells are typically presented in relational tables, including various sensor measurements on the range of well sections from the bottom-hole to the wellhead. These measurements include parameters such as pressure, temperature, and flow rate of each well section. Additionally, the collected data can contain other important equipment information, such as the ratio between the choke opening rate and flow coefficients. These relational tables provide a structured format for data-driven approaches to make predictions. Knowledge graphs for petroleum wells. Based on the proposed ontology, we construct knowledge graphs (KG) (Fig. 2) with domain experts to illustrate the production wells, well sections, well sensors, and their relations. These KGs serve as a flexible foundation to formalise the domain knowledge , and to support a transparent shared understanding between the domain experts, semantic experts, data scientists, etc. The KGs can also have the potential for sophisticated reasoning, for example, using domain-knowledge-based constraints to detect subtle anomalies, identify potential erroneous data, and ensure the consistency of the delivered data. 3. Integrity Verification with Preliminary Evaluation Integrity constraints. The integrity constraints play a crucial role in ensuring the quality of petroleum data. After renaming the features, we can formulate these constraints based on domain knowledge for verifying the data integrity. This allows validation of the sensor measurements, ensuring that they align with physical laws and empirical expectations. Here we give three examples (Fig. 2b): Example 1: the flow rate at any position within a well must consistently equal the flow rate of its upstream or downstream locations. Example 2: the total flow rate of the main well should precisely match the sum of the flow rates of all merged (or split) wells. Example 3: for each well, both the flow rate and pressure consistently adhere to the principles outlined in the Bernoulli function. Well 2 Example1: For each well section, the flowrate of fluid is always Main Well Subwell Well 2 Role Role shutdown equal to upstream or downstream fluid flowrates. Section 2 valve 𝑄& = 𝑄', for i, j from the same Well RO:partOf BFO: BFO:hasRole SELECT ?ent WHERE { hasRole upstreamTo ?ent a :Entity ; Well 2 Well 2 :hasWell1Section1Q ?q1 ; Well 2 :hasWell1Section2Q ?q2 ; Well A pressure Section 1 :hasWell1Section3Q ?q3 . sensor FILTER (?q1 != ?q2 || ?q2 != ?q3 || ?q1 != ?q3)} BFO:hasContinuant PartAtSomeTime Well 2 Example2: The sum of the subwell fluid flowrates equals to the fluid temperature FSO:feedsFuildTo sensor flowrate of the main well. 𝑄 = . 𝑄 $ %& SELECT ?ent WHERE { & ?ent a :Entity ; Well 1 Well 1 :hasWell1Q ?q1 ; pressure Section 1 :hasWell2Q ?q2 ; sensor 1 :hasMainWellQ ?q0 . RO:partOf Well 1 FILTER ( ABS(?q0 - ?q1 - ?q2) > 0.01)} Manifold Well 1 temperature sensor Example3: Constraint according to Bernoulli function: upstreamTo 𝑄 = 𝐢1 βˆ— 𝑐𝑣 βˆ— (𝑝 !" βˆ’π‘ #" )/𝐢2 SELECT ?ent WHERE { BFO: Well 1 Well 1 ?ent rdf:type :Entity ; hasRole Section 2 choke valve :hasChokeValue ?cv ; :hasChokeUpstreamP ?p1 ; Subwell upstreamTo :hasChokeDownstreamP ?p2 ; Role :hasWell1Section1Q ?q . Well 1 BIND (0.0007598* 0.0007598* ?cv * ?cv *(?p1- Well 1 pressure ?p2)/1.02 AS ?y) a Section 3 sensor FILTER (ABS(?q*?q - ?y) < 0.01)} b Figure 2: (a) Schematic illustration of the production KG, (b) example constraints formulated both in equations and SPARQL queries. The β€œstrict equal” is relaxed to account for calculation errors. The missing values are handled by other queries, and we assume all values exist here. C1, C2 in Example 3: constants in Bernoulli function. (ms) Example 1 Example 3 (GB) Example 2 Data 100000 5 Python Pyspark SPARQL csv ttl 10000 4 1000 3 100 2 10 1 (10! records) 1 0 1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 b 1 3 5 7 10 a Figure 3: (a) Running time of verification (x-axis: million records) (b) data size along number of records. Experiment dataset. To test the verification performance, we generate a large number of data with a random ratio of integrity violations regarding examples 1-3 in Fig. 2b (real data contain few violations) based on real production data, provided by two world-leading energy companies. In total, we generated five such tables with sizes ranging from 145MB to 1.45GB, containing from 1 million to 10 million records. In addition, we generate corresponding KGs (Fig. 2a) following our ontology. These KGs are saved as Turtle files ranging in size from 440MB to 4.4GB. Implementation. We implement the constraints in Example 1-3 with (a) Python, because it is relatively easy to learn and it is popular among the petroleum domain experts; (b) PySpark, for its similarity to Python and that it unlocks the potential of parallelizable computation; (c) and SPARQL, for its popularity in the semantic community. The Python implementation uses common libraries such as Pandas, Numpy. PySpark is the Python API for Apache Spark, which is a distributed computing framework that enables parallel computation for dealing with large-scale datasets. The SPARQL is implemented with Jena and Fuseki. Jena is an open-source Java framework for semantic applications, while Fuseki is for setting SPARQL endpoint. Results and discussion. From the results (Fig. 3) we can see that the Python running time increases significantly when the data size grows, while the running time for PySpark and SPARQL changes insignificantly. We postulate that the reason is that the data size is under a certain threshold so that the most consumed time for PySpark and SPARQL is used for loading the environment, not for querying. The results indicate that both PySpark and SPARQL have the potential for verifying large datasets. Yet, note that for generating the ttl files for SPARQL to query, it takes a large amount of time (some minutes to some hours). Besides, many domain experts are familiar with Python, but unfamiliar with Jena Fuseki and SPARQL. We expect they tend to learn writing constraint queries in PySpark than in SPARQL. All these factors need to be taken into account in considering industrial adoption. Acknowledgements This work is supported by the Norwegian Research Council via PeTWIN (294600), DigiWell(308817) and SIRIUS (237898). References [1] H. Ritchie, et al., Energy, Our World in Data (2022). Https://ourworldindata.org/energy. [2] L. Kuang, et al., Application and development trend of artificial intelligence in petroleum exploration and development, Petroleum Exploration and Development 48 (2021) 1–14. [3] J. Ge, Z. Li, T. Li, A novel chinese domain ontology construction method for petroleum exploration information., J. Comput. 7 (2012) 1445–1452. [4] F. Cicconeto, L. V. Vieira, M. Abel, R. dos Santos Alvarenga, J. L. Carbonera, L. F. Garcia, Georeservoir: An ontology for deep-marine depositional system geometry description, Computers & Geosciences 159 (2022) 105005. [5] N. Santos, et al., O3po: A domain ontology for offshore petroleum production plants, SSRN 4280151 (2022). [6] Y. Qu, M. Perrin, A. Torabi, M. Abel, M. Giese, Geofault: A well-founded fault ontology for interoperability in geological modeling, arXiv preprint arXiv:2302.07059 (2023). [7] P. F. Silva, L. F. Garcia, G. Figueiredo, R. J. de Moraes, R. K. Romeu, How do specialists express risks: an applied ontology for the oil & gas domain., in: ONTOBRAS, 2021, pp. 114–125. [8] N. O. Santos, M. Abel, F. H. Rodrigues, D. Schmidt, Towards an ontology of offshore petroleum production equipment, CEUR-WS, 2022. [9] R. Arp, et al., Building ontologies with basic formal ontology, MIT Press, 2015. [10] B. Smith, et al., Relations in biomedical ontologies, Genome biology 6 (2005) 1–15. [11] B. Smith, et al., A first-order logic formalization of the industrial ontologies foundry signature using basic formal ontology., in: JOWO, 2019. [12] V. Kukkonen, et al., An ontology to support flow system descriptions from design to operation of buildings, Automation in Construction 134 (2022) 104067.