=Paper=
{{Paper
|id=Vol-3005/sample-8col
|storemode=property
|title=Auditable Semantic Web Machine Learning Systems
|pdfUrl=https://ceur-ws.org/Vol-3005/08paper.pdf
|volume=Vol-3005
|authors=Laura Waltersdorfer
}}
==Auditable Semantic Web Machine Learning Systems==
Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 Auditable Semantic Web Machine Learning Systems Laura Waltersdorfer[0000−0002−6932−5036] TU Wien, Vienna, Austria, laura.waltersdorfer@tuwien.ac.at Abstract. Research in neurosymbolic Artificial Intelligence (AI) ap- proaches has surged recently: Symbolic and sub-symbolic methods are combined to solve complex tasks. Nevertheless the significance of this field, little systematised knowledge exists yet. To scope our research, we will focus on semantic web machine learning systems (SWeMLS). Fur- thermore, AI systems have been under scrutiny due to prominent cases of biased or incorrect systems in sensitive domains. Thus, arises the need to make hybrid systems auditable, supporting the examination of their cor- rect functioning. However, also this field has received limited attention. To that end, in this thesis, we want to investigate SWeMLS regarding 1) characteristics, interaction patterns and general system aspects, to pro- vide an overview of this emerging field 2) guiding methodologies, tech- nologies to make them auditable and 3) evaluation purposes by designing a generic end-to-end framework. Keywords: auditability · semantic web · machine learning. 1 Problem Statement Traditionally, AI research has been divided into symbolic and sub-symbolic approaches: Sub-symbolic techniques, i.e. machine learning methods and deep learning have successfully been applied to a variety of complex problem contexts, including computer vision, information retrieval and speech recognition [15]. On the other hand symbolic approaches, including logical and semantic web meth- ods, are well suited for reasoning and making latent knowledge explicit. Both approaches are well established in industry and academia, however also have limitations: Common criticism towards machine learning models is the lack of explainability [10] and the difficulty to be generalisable beyond training data [3]. In contrast, the creation and maintenance of symbolic knowledge is effort- intensive and interoperability between different models is challenging [5]. Thus, scientific interest has grown on how to benefit from the strengths of combining both approaches machine learning and symbolic domain knowledge [5] [21], while overcoming the aforementioned challenges. Neuro-symbolic AI describing the combination of both approaches [10], is also referred to as the third wave of AI. With the emergence and success of innovative approaches, such as artificial neural networks [25] and knowledge Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 57 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 graphs, industrial applications have surged. This development lead to a high divergence in architectures, models and applied techniques and therefore opening a major need for understanding these systems. While there are significant initial works in this area, such as Van Harmelen and ten Teije proposing a set of design patterns for hybrid systems [26], Seeliger et al. highlighting the semantic aspects of hybrid systems [23] and Sapna et al. exploring these from the machine learning perspective [22], there is not yet a systematic investigation from a general perspective. Thus, one goal of this research is to systematically research the combination of symbolic and sub-symbolic approaches. Due to the breadth of the field and missing taxonomies to characterise such systems, we want to focus on seman- tic web and machine learning systems (SWeMLS), as a subset of neuro- symbolic AI systems. To manage the scope of this emerging field, our working definition is as follows: In our understanding, SWeMLS need to have a machine learning component interacting with a symbolic knowledge component aiming to achieve a task. Heterogenous data sources Trusted Data Information System My result changed Data Capture -Why? ? ML analytics User ? For what was my Data Store data used? Knowledge Analyst Graph Fig. 1. Use Case for auditable SWeMLS. The second goal of this research is concerned with the auditability of SWeMLS. With the rise of data-intensive analysis and black-box AI systems, and the coverage of prominent cases resulting in biased or incorrect results [7] [19], concerns over the correct functioning of complex systems have grown. As a result, audits aimed at AI have become also more relevant [20, 1] to check for problematic behavior of complex systems. However, with growing complex- ity, specially in the emerging field of AI, but also in the subset of SWeMLS, a variety of challenges arise. First, the traceability of full system logs becomes un- manageable as systems reach a higher level of complexity [11]. Second, although, audits occur increasingly frequently, a principled methodology or process is also missing covering both lifecycles of machine learning and semantic web compo- nents. There are initial efforts, focusing on one subcomponent, such as Model Cards as proposed in [17], or an semantic framework for supporting the AI design lifecycle phase [18]. 58 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 To illustrate the relevance of auditability in the context of hybrid systems (see Fig. 1), we introduce a real-world use case from the medical domain, in the WellFort project1 . Personal data from medical devices is shared with the system and also user consent for research purposes. Medical researchers can access this trusted data information system and may analyse anonymised user data for experiments. Both stakeholders, users and analysts have diverging auditing needs for the system: 1) users how and in which context their personal data is used and 2) for the analyst/system operator perspective the evidence of conducted analysis and retrieved results. Analysis is conducted through machine learning, the semantic web compo- nent is used to check for checking user consent. Auditability in our context aims to go beyond standard provenance of who, when and what, and links additional contextual data to competency questions to support an auditor in examining the functionality of a system. 2 Importance Increased use of opaque applications and data mining in sensitive areas, such as health care, HR and education, leads to the following hypothesis: Demand for recurring, controlled examination and verification of hybrid systems, both internally and externally will grow to prevent undesired impacts on stakehold- ers. Based on this context we identified the following three main challenges in the context of auditable SWeMLS (cf. Fig. 2): P1) a missing systematic un- derstanding of the characteristics and building blocks of SWeMLS, P2) unclear requirements and capabilities for auditing SWeMLS, and P3) missing guidance on evaluation of the auditability level of SWeMLS. 3 Related Work Semantic Web Machine Learning Systems can be considered as a subset of neuro-symbolic systems, which yet lack a concrete taxonomy. Previous research related to SWeMLS has focused on specific application areas or supersets, such as explainable AI [23] or recommender systems [22]. There have been initial cate- gorisation efforts, such as [26], presenting an initial boxology for hybrid reasoning and learning systems focusing on neuro-symbolic systems. Besold et al. examine neuro-symbolic learning and reasoning from a cognitive perspective and point out several open research directions such as the confluence of knowledge rep- resentation and machine learning [4]. Hitzler et al. provide an initial overview of the integration of neuro-symbolic approaches and semantic web in a posi- tion paper [13]. Other related surveys include [21], focusing on data mining and knowledge discovery through semantic web technologies, while [8] is a qualita- tive, non-systematic review concentrating on machine learning techniques with semantic web technologies. Specific research targeting SWeMLS is limited and 1 https://www.sba-research.org/research/projects/wellfort/ 59 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 Semantic Web Machine Learning Auditability Context Context P2 Unclear requirements and capabilities for auditing SWeMLS. ML Audit resource ML engineer Auditor P1 Systematic P3 understanding of SWeMLs No guidance on is missing. SWeML evaluation of auditability system of selected systems. . SW resource Context SW engineer Stakeholder Fig. 2. Problem statement systematized approaches are missing (cf. P1 in Fig. 2), however the interest and variety of connected topics displays much activity in this field. Auditability Semantic Web Machine Learning Systems Historical and method- ological lessons for auditing can be learned from various domains where audits are common, such as financial, aerospace or medical [20]. Algorithmic and AI audits are still under-researched, however efforts have been made to close this gap: Bandy provides a systematic overview of audits on different public-facing algorithms [1]. Sikos and Philip investigate provenance-aware technologies and data models, showcasing the applicability of semantic web technologies. [24]. In [18], Naja et al. propose an approach to audit ML lifecycle of systems supported by semantic technologies, however it is currently semi-automatic and covering only the design phase of ML systems. Concluding, existing works already inves- tigated auditability from different contexts. However, current solutions are not covering the complexity of SWeMLS, requirements and capabilities for auditing such systems are missing (cf. P2 in Fig. 2) The reliance on (primarily) manual approaches does not scale and makes the evaluation of auditability of SWeMLS challenging (cf.P3 in Fig. 2). 4 Research Questions and Expected Results Based on the analysis of the research area and concrete gaps, this thesis aims to investigate the following overall research question: What are general characteristics of semantic web machine learning systems and how to support their auditability? In particular, we will investigate the following four focused research questions: RQ 1: What are key characteristics and technological elements of semantic web and machine learning systems? To address the gap of sys- 60 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 tematic knowledge (P1) due to the recent surge in hybrid systems combining various methods of both approach, we aim to establish a systematically derived taxonomy and characteristic of key technological elements of such systems based on a systematic mapping study and use case analysis. RQ 2: What are requirements and capabilities to enable auditing SWeML? This question aims to enable audits of these systems (P2). This en- compasses the process of auditing as well as the capabilities needed to automate such processes, while building on the findings of RQ1, using the taxonomy to categorize system capabilities and requirements regarding their auditability. RQ 3: What method enables a semantic-based auditing of SWeML systems? In order to ensure credibility of audits, external parties are invited to conduct the audit. However, in order to enable both internal and external parties to audit SWeML systems, a method is needed to provide the desired information (P2). Traditionally, for this purpose log information is gathered, leading to rich, but often unstructured data. With a semantic-based auditing we want to enrich log-based provenance and provide additional contextual information for audits in a proof of concept. RQ 4: How to assess the level of auditability of a SWeML system? Based on the identified typologies from RQ1, requirements and capabilities from RQ 2 and the baseline implementation from RQ 3 we will develop an evaluation framework for auditable SWeMLS (addressing P3). The goal will be to provide i) a method for the evaluation of auditability and ii) the support of automatic evaluation. Based on this framework, we will conceptualise approaches for auto- mated evaluation techniques, such as the generation of test cases or graph-based query templates and also test it with suitable users in real world use cases. 5 Research Plan and Preliminary Results We will apply the design science approach [12] and engineering cycle based on Wieringa [28]. In the problem investigation phase, we decided to focus on systems that incorporate a semantic web structure and a machine learning component solving a certain task to provide scope for the research. This scoping is necessary to con- Auditable Semantic Web Machine Learning Systems Systematic Mapping Study Application Characteristics Characteristics Patterns of Characteristics Auditability Areas of ML of SW Connections Fig. 3. Focus of Systematic Mapping Study ceptualise and limit the broad topic of neuro-symbolic systems to a manageable 61 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 breadth for a survey. We are conducting a systematic mapping study [14] to identify key characteristics of SWeMLS (cf. Fig. 3), initial results are discussed in [27]. Currently, we are finalising the data extraction which will be concluded by data analysis. Based on these findings we will design a taxonomy for SWeML systems and basic processing flows between components, thus addressing RQ1 and P1. Furthermore, we will analyse two exemplary use cases with SWeMLS that need to be auditable, which will be analysed for stakeholder, data and processing flows. The first use case is situated in the medical domain, aiming at integrating heterogeneous, sensitive data from multiple data sources. Auditability is added to increase transparency and credibility for conducted analyses via the provided platform. The second use case is in the ecological domain2 , combining semantic web and machine learning components to enrich the provided data [6]. In the treatment design phase, we will incorporate the findings of the problem investigation and will derive requirements from the discussed use cases. Based on these requirements and key characteristics from the taxonomy, we will conceptualise building blocks to model the lifecycle of SWeML systems. For the medical use case, we have extended the PROV-DM datamodel [2] for identified requirements for auditability and showed the feasability of our approach [9]. In the treatment validation, developed solutions will be evaluated based on the use cases and the coverage of the identified requirements and capabilities. Also the usability of the approach will be assessed in evaluation scenarios. In the treatment implementation the results of the previous phases will be incorporated to demonstrate the feasibility of the approach. Furthermore, suggestions and improvements to extending existing standards and processes will be discussed. 6 Evaluation The taxonomy of SWeMLS will be based on the results from the systematic map- ping study and also bottom-up via the investigated system architectures from the use cases. Requirements and capabilities will be also derived from the use cases to build the auditable SWeMLS framework and methodology. For this purpose, the methodology for developing provenance- aware applications described in [16] will be applied and extended. To validate our approach, we will conduct user studies with the developed auditable SWeMLS framework concerning usability (e.g. execution time, handling) and coverage of the identified requirements and will be compared to existing frameworks and approaches. Specifically, evalua- tion metrics for provenance-aware applications mentioned in [16] will be also considered, including design-based metrics and implementation metrics. 2 http://www.obaris.org 62 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 7 Reflection and Future Work With increasing complexity of SWeMLS, the need for auditing hybrid systems rises, to achieve various other goals such as explainability or reproducibility. An essential first step was to scope this research to semantic web and machine learning systems. The focus will be to complete the systematic mapping study to derive characteristics and a taxonomy of SWeMLS. Requirements and capa- bilities will be analysed through case studies of the discussed projects to identify needs for auditability. 8 Acknowledgement I would like to thank Dr. Marta Sabou, Dr. Fajar J. Ekaputra and Dr. Tomasz Miksa for their invaluable support and inputs. This work was funded by the Austrian Research Promotion Agency FFG under grant 871267 (WellFort) and 877389 (OBARIS). References 1. Bandy, J.: Problematic Machine Behavior: A Systematic Literature Review of Algorithm Audits. Proceedings of the ACM on Human-Computer Interaction 5(CSCW1), 1–34 (2021) 2. Belhajjame, K., B’Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J., et al.: Prov-DM: The prov data model. W3C Recommendation (2013) 3. Bengio, Y., Deleu, T., Rahaman, N., Ke, R., Lachapelle, S., Bilaniuk, O., Goyal, A., Pal, C.: A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912 (2019) 4. Besold, T.R., Garcez, A.d., Bader, S., Bowman, H., Domingos, P., Hitzler, P., Kühnberger, K.U., Lamb, L.C., Lowd, D., Lima, P.M.V., et al.: Neural- symbolic learning and reasoning: A survey and interpretation. arXiv preprint arXiv:1711.03902 (2017) 5. Bonatti, P.A., Decker, S., Polleres, A., Presutti, V.: Knowledge graphs: New direc- tions for knowledge representation on the semantic web (dagstuhl seminar 18371). In: Dagstuhl Reports. vol. 8. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2019) 6. Breit, A., Waltersdorfer, L., Ekaputra, F.J., Sabou, M.: An Architecture for Ex- tracting Key Elements from Legal Permits. In: 2020 IEEE International Conference on Big Data (Big Data). pp. 2105–2110. IEEE (2020) 7. Buolamwini, J., Gebru, T.: Gender shades: Intersectional accuracy disparities in commercial gender classification. In: Conference on fairness, accountability and transparency. pp. 77–91. PMLR (2018) 8. D’Amato, C.: Machine Learning for the Semantic Web: Lessons learnt and next research directions. Semantic Web 11(1), 195–203 (Jan 2020) 9. Ekaputra, F.J., Ekelhart, A., Mayer, R., Miksa, T., Šarčević, T., Tsepelakis, S., Waltersdorfer, L.: Semantic-enabled Architecture for Auditable Privacy-Preserving Data Analysis. Semantic Web Journal (under-review) (2021) 63 Proceedings of the Doctoral Consortium at ISWC 2021 - ISWC-DC 2021 10. Garcez, A.d., Lamb, L.C.: Neurosymbolic AI: The 3rd Wave. arXiv preprint arXiv:2012.05876 (2020) 11. Herschel, M., Diestelkämper, R., Lahmar, H.B.: A survey on provenance: What for? What form? What from? The VLDB Journal 26(6), 881–906 (2017) 12. Hevner, A.R., March, S.T., Park, J., Ram, S.: Design Science in Information Sys- tems Research. Design Science in IS Research MIS Quarterly 28(1), 75–105 (2004) 13. Hitzler, P., Bianchi, F., Ebrahimi, M., Sarker, M.K.: Neural-symbolic integration and the semantic web. Semantic Web 11(1), 3–11 (2020) 14. Kitchenham, B., Charters, S., et al.: Guidelines for performing systematic literature reviews in software engineering version 2.3. Engineering 45(4ve), 1051 (2007) 15. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015) 16. Miles, S., Groth, P., Munroe, S., Moreau, L.: Prime: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology (TOSEM) 20(3), 1–42 (2011) 17. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T.: Model cards for model reporting. In: Proceedings of the conference on fairness, accountability, and transparency. pp. 220–229 (2019) 18. Naja, I., Markovic, M., Edwards, P., Cottrill, C.: A Semantic Framework to Sup- port AI System Accountability and Audit. In: Proceedings of the 2021 Extended Semantic Web Conference (2021) 19. Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S.: Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019) 20. Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T., Hutchinson, B., Smith- Loud, J., Theron, D., Barnes, P.: Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. pp. 33–44 (2020) 21. Ristoski, P., Paulheim, H.: Semantic web in data mining and knowledge discovery: A comprehensive survey. Journal of Web Semantics 36, 1–22 (2016) 22. Sapna, R., Monikarani, H., Mishra, S.: Linked data through the lens of machine learning: an enterprise view. In: 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT). pp. 1–6. IEEE (2019) 23. Seeliger, A., Pfaff, M., Krcmar, H.: Semantic web technologies for explainable machine learning models: A literature review. Proceedings of the 1st Workshop on Semantic Explainability co-located with the 18th International Semantic Web Conference (ISWC 2019) 2465, 30–45 (2019) 24. Sikos, L.F., Philp, D.: Provenance-aware knowledge representation: A survey of data models and contextualized knowledge graphs. Data Science and Engineering 5, 293–316 (2020) 25. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp. 3104–3112 (2014) 26. Van Harmelen, F., Ten Teije, A.: A boxology of design patterns for hybrid learning and reasoning systems. Journal of Web Engineering 18(1-3), 97–124 (2019) 27. Waltersdorfer, L., Breit, A., Ekaputra, F., Sabou, M.: Bridging Semantic Web and Machine Learning: First Results of a Systematic Mapping Study (accepted for publications). In: Proceedings of the International Conference on Database and Expert Systems Applications (2021) 28. Wieringa, R.: Design Science Methodology for Information Systems and Software Engineering. Springer Berlin Heidelberg (2014) 64