Towards a visual framework for the incorporation of knowledge in the phases of machine learning

CJ Swanepoel and KM Malan
Department of Decision Sciences, University of South Africa. swanecj@unisa.ac.za

Abstract. Incorporating some domain knowledge into machine learning algorithms is almost unavoidable. Doing it well and explicitly can avoid unnecessary bias, improve efficiency and accuracy, and increase transparency. To raise awareness of the relative contributions of domain knowledge and machine learning expertise, and to indicate the direction of information flow, a tentative qualitative visualisation framework is suggested, and two examples are given. It is hoped that such a mechanism will encourage reflection on the (sometimes implicit and innate) inclusion of domain knowledge in machine learning systems.

Keywords: Domain knowledge · Machine learning · Characterisation framework · Visualisation.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Machine learning as a component of artificial intelligence, and especially deep learning, has experienced phenomenal growth over the last couple of years. (For example, in 1998 there were two papers with primary subcategory ‘Machine learning’ on arXiv, and in 2017 there were 2 332 papers [19].) This growth is the result of advances in computing power, especially through the use of graphics processing units and more recently tensor processing units, the ubiquity of available data, the connectivity afforded by the internet, and major advances in machine learning algorithms by Bengio, Hinton and LeCun, amongst others [14].

In well defined domains where data is plentiful or cheap to generate and where the context is stable, machine learning can be an extremely useful tool. However, many machine learning techniques (and especially deep learning) are fragile, greedy, shallow and not transparent [8, 12, 27]. The techniques are fragile because transfer to a slightly different domain or context usually breaks them; greedy because they require huge amounts of (labelled) data; shallow because they depend on superficial features and do not possess an underlying model based on physical reality; and not transparent because the internal structure is often too complex to analyse and connect to the features that determine the output. This necessitates careful attention to subtle aspects of the machine learning development process.

This paper focusses on the inclusion of domain knowledge in machine learning, because this is one factor that can affect the fragility, greediness, shallowness and lack of transparency of machine learning. The aim is to provide a framework to assist practitioners in making explicit the different ways in which domain knowledge and machine learning expertise are incorporated in the stages of machine learning. Being aware of subtle inclusions of domain knowledge and innate knowledge endowed to the machine learning system might assist in identifying opportunities to improve its performance, accuracy or even transparency, and can contribute to more accurate reporting of machine learning system designs.

2 The inclusion of domain knowledge

The recent successes of machine learning that depend on data only and use ‘no domain specific’ data (e.g.
AlphaZero) create the impression that domain specific knowledge is not really necessary in machine learning, and might even reduce the ‘generality’ of the resulting system. The practical difficulties in finding useful representations of expert knowledge, and the fact that domain knowledge is often incomplete or imperfect, reinforce this view [27].

Two seemingly diametrically opposed views are expressed in recent literature [15, 13]: on the one hand, the view that all problems can be solved by scaling the model up and relying on the data only (AlphaZero, for example), and on the other hand, the belief that using a combination of data and domain knowledge will eventually prove to be the best approach (an idea already propounded by Alan Turing in his 1950 paper [23]). The ‘data only’ camp has demonstrated spectacular results, particularly in the domains of game play and text generation, although the fear exists that the approach will not generalise easily (mainly due to data constraints in most domains) and that its performance will hit a ceiling.

Even when domain knowledge is not explicitly injected (the ‘data only’ approach), implicit domain knowledge almost always features in machine learning [9]. This implicit or innate domain knowledge can include the choice and structure of the algorithm, representational formats, and innate knowledge or experience [13]. For example, the choice of data encoding method depends partly on an understanding of the problem domain, and can greatly influence the effectiveness of the algorithm (see, e.g. [7, 18]). Feature selection is often task dependent and sometimes even based on intuition [5, 26]. The type of algorithm used also depends heavily on the nature of the problem domain. Marcus [13] quotes Pedro Domingos: “[Machine learning] paradigms differ in what assumptions they encode, and what form of additional knowledge they make it easy to encode.” The structure of a deep learning neural network is influenced by the nature of the problem. In the description of the neural network architecture for AlphaGo Zero, where the emphasis was on using as little explicit domain knowledge as possible, it is stated that “History features Xt, Yt are necessary, because Go is not fully observable solely from the current stones. . . ” [21]. Further examples of the extensive embedding of human domain specific knowledge in the construction of AlphaGo Zero and AlphaZero are given in [13]. In a genetic algorithm, the choice of cross-over and mutation mechanisms will to some extent depend on implicit domain knowledge [10]. Risk or loss functions include implicit domain knowledge, but can also encode selected prior knowledge [15, 22]. Constraining the output of a machine learning system based on heuristics or rules from the problem domain is common practice. (For example, all Go playing algorithms before AlphaGo Zero routinely removed stupid (but legal) moves [21].)

In most (non-game) domains there are additional considerations for including domain knowledge, such as the fact that data can be difficult or expensive to obtain, or might include bias, or be unbalanced (for example, through a lack of edge cases). It is often necessary to explicitly include additional information to arrive at a feasible, unbiased or useful solution [4, 6]. The explicit inclusion of domain knowledge can potentially also improve the transparency of the model [28].
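As a minimal sketch of the rule-based output constraint mentioned above, the snippet below masks out candidate moves that a domain rule set marks as illegal or clearly bad before the final choice is made. The scores, the boolean rule mask and the move indices are hypothetical placeholders, not taken from any of the systems cited above.

```python
import numpy as np

def select_move(move_scores: np.ndarray, allowed: np.ndarray) -> int:
    """Pick the highest-scoring move after masking out moves that a
    domain-knowledge rule set marks as illegal or clearly bad."""
    masked = np.where(allowed, move_scores, -np.inf)  # forbid disallowed moves
    return int(np.argmax(masked))

# Hypothetical example: five candidate moves, two ruled out by domain heuristics.
scores = np.array([0.2, 1.5, 0.7, 1.1, 0.3])      # raw scores from the model
allowed = np.array([True, False, True, True, False])
print(select_move(scores, allowed))                # -> 3
```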
Conversely, including less explicit domain knowledge might lead to more general algorithms. There are some obstacles to the inclusion of domain knowledge, though. These include the difficulty of finding workable encodings and injection mechanisms, and the fact that most domain experts are not data science experts and vice versa [27]. The restrictions introduced by injecting domain knowledge can potentially also prevent the discovery of valid but unexpected solutions [3].

The explicit integration of knowledge into machine learning is called ‘informed machine learning’ by von Rueden et al. [25], who developed a taxonomy for the explicit integration of knowledge. Their proposed taxonomy contains three components: the type of knowledge, the method used to integrate the knowledge into the machine learning system, and lastly where in the machine learning pipeline the integration happens. It is a useful tool for classifying papers in the assisted or informed machine learning domain.

However, the implicit inclusion of domain knowledge in the form of innate knowledge and convention or experience often goes unrecognised and is neglected in the reporting on and meta-analysis of machine learning algorithms. Many of the choices made throughout all phases of the machine learning process are based on some understanding of aspects of the problem or task, augmented by expertise and experience in the machine learning domain. This paper attempts to provide a tentative high level framework to visually characterise the inclusion of any domain knowledge into machine learning.

3 Proposed framework

In Figure 1 the generic phases of the machine learning pipeline are given as coloured blocks labelled with a capital letter. A very brief description of the phases is given in Table 1, and a more comprehensive account is given in Section 4.

Although the phases are listed more or less in the order in which they occur in the machine learning pipeline, this is an idealised representation that does not necessarily reflect the actual workflow of a specific system – in practice the order in which the phases are executed can be convoluted, and might include several iterative loops. The colours group the phases into six clusters: problem formulation (green), data preparation (yellow), machine learning activities (blue), output constraints (red), interpretation and explanation (brown), and external comparison (green).

A subjective qualitative estimate of the magnitude of the contribution from, respectively, domain knowledge and machine learning expertise is represented by the relative thicknesses of the arrows in the figure. In cases where the emphasis is on the exploitation of data only, many of the arrows from domain knowledge will be thin or have zero thickness, for example. The direction of the arrows indicates the direction of information flow. Such a representation will be unique to every particular instance of a machine learning system, and can give a quick visual overview of the nature of the interactions in that instance.

The representation is subjective, and as such cannot provide accurate or quantitative data. However, this simplified model allows a quick high-level evaluation of the relative contributions from the two knowledge domains.
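As an illustration of how such a qualitative diagram could be drawn, the minimal sketch below renders the two knowledge bars and one arrow per phase, with line width given by a subjective 0–3 score and direction given by a flow flag. The scores in the example call are purely illustrative and are not drawn from either case study in Section 5; the plotting conventions (box colour, arrow placement) are arbitrary choices for the sketch, not prescribed by the framework.

```python
import matplotlib.pyplot as plt

PHASES = list("ABCDEFGHIJKLMN")  # the fourteen phases of Table 1

def plot_framework(dk_scores, ml_scores, dk_flow=None, ml_flow=None):
    """Draw the qualitative diagram: one arrow per phase from each knowledge
    bar, with line width equal to a subjective 0-3 score and direction given
    by a flow flag ('in' = into the phase, 'out' = back to the bar)."""
    dk_flow = dk_flow or ["in"] * len(PHASES)
    ml_flow = ml_flow or ["in"] * len(PHASES)
    fig, ax = plt.subplots(figsize=(10, 3))
    for x, phase in enumerate(PHASES):
        ax.text(x, 0.5, phase, ha="center", va="center",
                bbox=dict(boxstyle="round", fc="lightblue"))
        # Top arrow: domain knowledge; bottom arrow: machine learning expertise.
        for score, flow, bar_y, phase_y in [(dk_scores[x], dk_flow[x], 1.0, 0.62),
                                            (ml_scores[x], ml_flow[x], 0.0, 0.38)]:
            if score > 0:
                start, end = (bar_y, phase_y) if flow == "in" else (phase_y, bar_y)
                ax.annotate("", xy=(x, end), xytext=(x, start),
                            arrowprops=dict(arrowstyle="->", lw=score))
    ax.text(-0.8, 1.0, "Domain knowledge", ha="right", va="center")
    ax.text(-0.8, 0.0, "Machine learning expertise", ha="right", va="center")
    ax.set_xlim(-6, len(PHASES))
    ax.set_ylim(-0.2, 1.2)
    ax.axis("off")
    plt.show()

# Purely illustrative scores (0 = no contribution, 3 = large contribution).
plot_framework(dk_scores=[3, 2, 2, 1, 1, 1, 2, 1, 0, 0, 3, 2, 1, 0],
               ml_scores=[1, 1, 1, 2, 3, 3, 2, 3, 3, 3, 2, 1, 2, 3])
```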
Table 1. Phases of machine learning

A: Problem identification and formulation
B: Data sourcing/labelling
C: Data cleaning/validation/quality evaluation
D: Data augmentation
E: Data encoding
F: Machine learning algorithm selection
G: Feature engineering
H: Machine learning algorithm structure determination
I: Learning process mechanisms
J: Hyperparameter tuning
K: Constraining outputs
L: Interpretation/validation
M: Explanation
N: Comparative evaluation

[Fig. 1. The contribution of domain knowledge and machine learning expertise to a typical machine learning workflow]

4 Phases of the machine learning pipeline

A: Problem identification and formulation Thorough knowledge of the domain is necessary for this step. The context will determine what data can be collected, what the objective(s) is (are), and what information might be available that is not included in the data.

B: Data sourcing/labelling Although the machine learning process often starts with available data, in some cases data will have to be sourced or labelled. Domain knowledge in the form of deep knowledge of the relationships between different features, the nature of the data, and the difficulty and cost of labelling, as well as an understanding of machine learning algorithms and how the data will be used, can contribute to the quality of the data that is eventually used, and hence influence the outcome or success of the machine learning process.

C: Data cleaning/validation/quality evaluation Knowledge of the properties of the domain and the data collection methodology can assist in identifying outliers or invalid data points, and allow for an evaluation of the quality of the data. It will also give insight into the coverage of the data space (whether there are areas where data is too sparse to be useful).

D: Data augmentation Both knowledge of the domain and of the machine learning algorithm is required for the successful augmentation of data. The imputation of missing data points, for example, can take various forms, and some of the techniques might be counterproductive in the training of a model. Another example: the perturbation of images by shifting a few pixels horizontally or vertically to provide additional training data assumes an understanding that such a translation will indeed provide novel information to the machine learning system, while it remains valid as a representation and does not influence the labelling of the image.

E: Data encoding Mainly machine learning expertise is required here. However, a deep understanding of the domain knowledge environment is assumed. For example, one-hot encoding can be difficult or impossible to work with when the feature space is very large. Piech et al. [17] address this data encoding challenge by utilising a random low-dimensional representation of a one-hot high-dimensional vector. This encoding is motivated by the idea of compressed sensing, described by Baraniuk [1] as an effective method to capture and represent compressible signals at a rate significantly below the Nyquist rate. Another beautiful example is given in the paper by Lusci et al. [11], where molecules are represented as ensembles of directed acyclic graphs as input to a recursive neural network to predict the solubility of these molecules.
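The following minimal sketch illustrates the general idea of replacing an impractically large one-hot encoding with a fixed random low-dimensional projection. The dimensions and the random construction are illustrative assumptions and do not reproduce the exact encoding used in [17].

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_CATEGORIES = 20_000   # a one-hot encoding this wide is unwieldy as input
N_DIMS = 100            # size of the compact random representation

# A fixed random matrix assigns each category a dense low-dimensional code.
# Random projections approximately preserve distances between the sparse
# one-hot vectors, which is the intuition borrowed from compressed sensing.
projection = rng.standard_normal((N_CATEGORIES, N_DIMS)) / np.sqrt(N_DIMS)

def encode(category_index: int) -> np.ndarray:
    """Dense code for one category: the product of its one-hot vector with
    the projection matrix, obtained directly as a row lookup."""
    return projection[category_index]

print(encode(42).shape)  # (100,)
```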
F: Machine learning algorithm selection This deals with the choice of a suitable machine learning algorithm. A deep understanding of the working of different machine learning techniques is required, as well as an understanding of the nature of the problem domain. For example, if the problem has a temporal component, a recurrent neural network might be a good fit, or if filtering is required, an auto-encoder should be considered. Also see the paper by Olson et al. [16], where thirteen machine learning algorithms are compared over 165 publicly available classification problems.

G: Feature engineering Feature engineering includes binning, transformation of features, scaling, grouping operations and feature selection. Domain knowledge in the form of an understanding of the relationships between features and the information content of different features is required.

H: Machine learning algorithm structure determination This is sometimes described as more of an art than a science. Typically the structure of the machine learning algorithm is determined empirically through experimentation. Here experience with similar or related problems, which is a form of domain knowledge, plays a huge role. For example, Chandrasekaran et al. [2] in a recent paper designed a machine learning predictor for the density of states and charge density of a material or molecule. For the charge density, the modelling is done with a simple fully connected neural network with one output neuron. The local density of states spectrum, on the other hand, is modelled with a recurrent neural network, where the local density of states at every energy window is represented as a single output neuron (linked via a recurrent layer to other neighbouring energy windows). Domain knowledge therefore played a huge part in determining the structure of the machine learning system.

I: Learning process mechanisms (selection of transfer functions, learning mechanisms, mutation operators, etc.) This is probably the area with the biggest opportunity for innovation. Novel functions for cross-over, or innovative transfer functions, can hugely influence the way in which the search space is traversed.

J: Hyperparameter tuning As with phase H, this is an area that is mostly approached empirically. The performance metric used in the hyperparameter optimisation process is influenced by domain knowledge. Setting ranges for grid or random searches to optimise hyperparameters, as well as determining which of the hyperparameters should be included in the search, cannot be done analytically, but knowledge of hyperparameter behaviour in related problems can provide a good starting point for an empirical search strategy.

K: Constraining outputs This is one of the most important mechanisms for including domain knowledge in the machine learning process. Techniques used include the augmentation or restriction of the loss or risk function, and filtering or transforming intermediate outputs or the final output. Stewart and Ermon [22], for example, use laws from physics to constrain the output space in training a convolutional neural network to track objects without using any labelled examples.
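A minimal sketch of a loss term in the spirit of [22] is given below: a network that predicts the height of a falling object in consecutive frames is penalised when its predictions violate constant acceleration under gravity. The frame interval, the batch shape and all names are illustrative assumptions, not the construction used in [22].

```python
import torch

def physics_penalty(heights: torch.Tensor, dt: float = 0.1, g: float = 9.8) -> torch.Tensor:
    """Penalise predicted height trajectories that violate free fall: the
    second difference of the sequence should be close to -g * dt**2."""
    second_diff = heights[:, 2:] - 2 * heights[:, 1:-1] + heights[:, :-2]
    return ((second_diff + g * dt ** 2) ** 2).mean()

def training_loss(predicted_heights: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    """Label-free loss: the domain constraint alone supervises the network."""
    return weight * physics_penalty(predicted_heights)

# Hypothetical batch: 4 sequences of 20 predicted heights (stand-ins for the
# output of a convolutional network applied to 20 consecutive video frames).
predictions = torch.randn(4, 20, requires_grad=True)
loss = training_loss(predictions)
loss.backward()  # gradients flow from the constraint back to the predictions
```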
L: Interpretation/validation A good understanding of the underlying knowledge domain will allow an evaluation of the feasibility or quality of the outcomes. In simple classification problems this is not really an issue, but for many decision support applications it is essential. The output of a machine learning algorithm might also increase our understanding of the domain, hence the possibility of a two-way arrow in the proposed framework.

M: Explanation One of the biggest criticisms against many machine learning techniques is the lack of transparency, or the inability to explain the output of the machine learning system. Legal and moral requirements dictate that in certain environments it should be possible to justify the outcomes in the light of the inputs. Here an understanding of the domain complexities as well as the machine learning mechanism is required – although this might not be sufficient in many cases. In the proposed framework the direction of the arrow towards the domain knowledge bar indicates the flow of information that might add to our understanding of the domain. In addition, an analysis of how the machine learning process obtained the outputs could add to machine learning expertise.

N: Comparative evaluation Comparing the performance of a machine learning algorithm against previously applied approaches (benchmarking) is often necessary to evaluate the performance of a new approach. This also contributes to the knowledge base of machine learning.

5 Case studies/examples

In this section two recent contributions to the machine learning environment are described briefly, and the proposed visual representations of knowledge are given for both.

5.1 Combination of domain knowledge and deep learning for sentiment analysis, by Vo et al.

In the paper ‘Combination of domain knowledge and deep learning for sentiment analysis’, published in 2017, Vo et al. [24] found that existing approaches in the application of machine learning to sentiment analysis suffer from two major drawbacks. The first is that, until then, no attention had been paid to the different types of sentiment terms: different domains use different terms to express positive and negative sentiments, and some words carry a higher emotive content than others. The second is that the loss functions used previously did not include a measure of the magnitude of sentiment misclassification, and did not distinguish between different types of misclassification.

To address these two issues, they proposed using sentiment scores (learnt by quadratic programming) to augment the training data, and introduced a penalty matrix to enhance the loss function. The enhancements were applied to a standard sentiment analysis workflow using a convolutional neural network. To evaluate the success of their approach, they compared the performance of the new system with a baseline convolutional neural network sentiment analyser, as well as with a traditional support vector machine based sentiment analyser. In the comparative analysis the new approach performs better than the previous versions, showing that the inclusion of the two enhancements is useful in this domain.

This knowledge and the description of the novel enhancements led to the diagram in Figure 2, where the incorporation of additional domain knowledge in the ‘data augmentation’ (D) and ‘constraining outputs’ (K) phases is emphasised.

[Fig. 2. The contribution of domain knowledge and machine learning expertise to the sentiment analyser of [24]]
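To illustrate the penalty-matrix idea described in this subsection, the following minimal sketch adds a misclassification cost, weighted by a penalty matrix, to a cross-entropy-style loss for a three-class sentiment problem. The class set, the matrix values and the predicted distribution are illustrative assumptions and do not reproduce the exact formulation of [24].

```python
import numpy as np

CLASSES = ["negative", "neutral", "positive"]

# Illustrative penalty matrix: rows index the true class, columns the predicted
# class. Mistaking 'negative' for 'positive' costs more than mistaking it for
# 'neutral', so the loss reflects the magnitude of the misclassification.
PENALTY = np.array([[0.0, 1.0, 2.0],
                    [1.0, 0.0, 1.0],
                    [2.0, 1.0, 0.0]])

def penalised_loss(probs: np.ndarray, true_class: int) -> float:
    """Cross-entropy plus a charge on probability mass placed on other
    classes, weighted by the penalty matrix."""
    cross_entropy = -np.log(probs[true_class])
    misclassification_cost = float(PENALTY[true_class] @ probs)
    return float(cross_entropy) + misclassification_cost

probs = np.array([0.2, 0.1, 0.7])            # hypothetical predicted distribution
print(penalised_loss(probs, true_class=0))   # true label: 'negative'
```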
5.2 Mastering the game of Go without human knowledge, by Silver et al.

During 2016 and 2017 DeepMind released three versions of their Go playing software. The first system, now called AlphaGo Lee, defeated the world champion Lee Sedol 4–1 in a five game match in March 2016. AlphaGo Lee used two neural networks – a ‘policy’ and a ‘value’ network – as well as a Monte Carlo tree search algorithm. It was trained using historic game information, and improved through self-play. A second version, AlphaGo Zero, was introduced in a paper in Nature on 19 October 2017 [21]. The title of the paper, ‘Mastering the game of Go without human knowledge’, expresses the major claim of this version – that it used no human or domain knowledge except for the rules of the game. A third version of the software (AlphaZero) was introduced in a paper published on arXiv in December 2017 [20]. This version added chess and shogi to the repertoire of games mastered by the system. It was claimed that each newer version in this progression learnt more quickly than, and exceeded the performance of, its predecessor. The contribution of knowledge to the second version, AlphaGo Zero, is discussed in this section.

In [13] Marcus extensively discusses different inclusions (some not mentioned in the original paper) of domain knowledge in the AlphaGo Zero machine learning system. He argues that the use of carefully constructed Monte Carlo tree search machinery, the artful placement of convolutional layers that allow the system to recognise that many patterns on the board are translation invariant, and the application of a sampling algorithm for dealing with reflections and rotations constitute the injection of domain knowledge into the system.

Based mostly on Marcus’s assessment, rough qualitative judgements on the inclusion of domain knowledge in AlphaGo Zero were made; these are reflected in Figure 3.

[Fig. 3. The contribution of domain knowledge and machine learning expertise to AlphaGo Zero]

The lack of explicit contributions from domain knowledge in the ‘constraining outputs’ phase (K) indicates that this popular method of injecting domain knowledge was not used. There are, however, substantial contributions in phases F and H, the selection of the machine learning algorithm (a combination of Monte Carlo tree search and reinforcement learning) and the structure determination of the algorithm (taking into account symmetries, for example). The direction of information flow in the three phases ‘interpretation/validation’ (L), ‘explanation’ (M) and ‘comparative evaluation’ (N) should also be noted. Domain knowledge and machine learning expertise are expanded to varying degrees by a successful implementation of a machine learning system. In the case of AlphaGo Zero new (‘alien’) gameplay strategies were discovered (domain knowledge), and some insight was gained into the comparative performance of different approaches (machine learning expertise).

6 Conclusion

A visualisation scheme for the inclusion of domain knowledge and machine learning expertise in machine learning systems is proposed. The hope is that it will aid in greater awareness of the innate, implicit and explicit use of domain knowledge in the machine learning workflow.

References

1. Baraniuk, R.: Compressive sensing [lecture notes]. IEEE Signal Processing Magazine 24(4), 118–121 (Jul 2007). https://doi.org/10.1109/msp.2007.4286571
2. Chandrasekaran, A., Kamal, D., Batra, R., Kim, C., Chen, L., Ramprasad, R.: Solving the electronic structure problem with machine learning. npj Computational Materials 5(1) (Feb 2019). https://doi.org/10.1038/s41524-019-0162-7
3. Childs, C.M., Washburn, N.R.: Embedding domain knowledge for machine learning of complex material systems. MRS Communications pp. 1–15 (Jul 2019). https://doi.org/10.1557/mrc.2019.90
4. Choo, J., Liu, S.: Visual analytics for explainable deep learning. IEEE Computer Graphics and Applications 38(4), 84–92 (Jul 2018). https://doi.org/10.1109/mcg.2018.042731661
5. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011), http://dl.acm.org/citation.cfm?id=2078186
6. Du, M., Liu, N., Hu, X.: Techniques for interpretable machine learning. arXiv:1808.00033 (2018)
7. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Computation 4(1), 1–58 (Jan 1992). https://doi.org/10.1162/neco.1992.4.1.1
8. Ghorbani, A., Abid, A., Zou, J.: Interpretation of neural networks is fragile. arXiv:1710.10547 (Oct 2017)
9. Hessel, M., van Hasselt, H., Modayil, J., Silver, D.: On inductive biases in deep reinforcement learning. arXiv:1907.02908 (2019)
10. Johns, M.B., Mahmoud, H.A., Walker, D.J., Ross, N.D.F., Keedwell, E.C., Savic, D.A.: Augmented evolutionary intelligence. In: Proceedings of the Genetic and Evolutionary Computation Conference – GECCO '19. ACM Press (2019). https://doi.org/10.1145/3321707.3321814
11. Lusci, A., Pollastri, G., Baldi, P.: Deep architectures and deep learning in chemoinformatics: The prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling 53(7), 1563–1575 (Jul 2013). https://doi.org/10.1021/ci400187y
12. Marcus, G.: Deep learning: A critical appraisal. arXiv:1801.00631 (2018)
13. Marcus, G.: Innateness, AlphaZero, and artificial intelligence. arXiv:1801.05667 (2018)
14. Marx, V.: Machine learning, practically speaking. Nature Methods 16(6), 463–467 (May 2019). https://doi.org/10.1038/s41592-019-0432-9
15. Muralidhar, N., Islam, M.R., Marwah, M., Karpatne, A., Ramakrishnan, N.: Incorporating prior domain knowledge into deep neural networks. In: 2018 IEEE International Conference on Big Data (Big Data). pp. 36–45 (Dec 2018). https://doi.org/10.1109/BigData.2018.8621955
16. Olson, R.S., La Cava, W., Mustahsan, Z., Varik, A., Moore, J.H.: Data-driven advice for applying machine learning to bioinformatics problems. arXiv:1708.05070 (Aug 2017)
17. Piech, C., Spencer, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L.J., Sohl-Dickstein, J.: Deep knowledge tracing. arXiv:1506.05908 (2015)
18. Potdar, K., Pardawala, T.S., Pai, C.D.: A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications 175(4), 7–9 (Oct 2017). https://doi.org/10.5120/ijca2017915495
19. Shoham, Y., Perrault, R., Brynjolfsson, E., Clark, J., Manyika, J., Niebles, J.C., Lyons, T., Etchemendy, J., Grosz, B., Bauer, Z.: The AI Index 2018 Annual Report. AI Index Steering Committee, Human-Centered AI Initiative, Stanford University, Stanford, CA (Dec 2018)
20. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T.P., Simonyan, K., Hassabis, D.: Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815 (2017)
21. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (Oct 2017). https://doi.org/10.1038/nature24270
22. Stewart, R., Ermon, S.: Label-free supervision of neural networks with physics and domain knowledge. arXiv:1609.05566 (2016)
23. Turing, A.M.: I.—Computing machinery and intelligence. Mind LIX(236), 433–460 (Oct 1950). https://doi.org/10.1093/mind/lix.236.433
24. Vo, K., Pham, D., Nguyen, M., Mai, T., Quan, T.: Combination of domain knowledge and deep learning for sentiment analysis. In: Lecture Notes in Computer Science, pp. 162–173. Springer International Publishing (2017). https://doi.org/10.1007/978-3-319-69456-6_14
25. von Rueden, L., Mayer, S., Garcke, J., Bauckhage, C., Schuecker, J.: Informed machine learning – towards a taxonomy of explicit integration of knowledge into machine learning. arXiv:1903.12394 (Mar 2019)
26. Yang, L., Zheng, Z., Sun, J., Wang, D., Li, X.: A domain-assisted data driven model for thermal comfort prediction in buildings. In: Proceedings of the Ninth International Conference on Future Energy Systems. pp. 271–276. e-Energy '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3208903.3208914
27. Yu, T., Jan, T., Simoff, S., Debenham, J.: Incorporating prior domain knowledge into inductive machine learning. Technical report, International Institute of Forecasters (IIF), University of Massachusetts, Amherst, USA (Oct 2006)
28. Yu, T., Simoff, S., Jan, T.: VQSVM: A case study for incorporating prior domain knowledge into inductive machine learning. Neurocomputing 73(13–15), 2614–2623 (Aug 2010). https://doi.org/10.1016/j.neucom.2010.05.007