Improving Taxonomy Maintenance: Automated Splitting and Merging of Taxonomies Moritz Flöter2 , Pascal Reuss12 , Johannes Ude2 , and Klaus-Dieter Althoff12 1 German Research Center for Artificial Intelligence Kaiserslautern, Germany http://www.dfki.de 2 Institute of Computer Science, Intelligent Information Systems Lab University of Hildesheim, Hildesheim, Germany http://www.uni-hildesheim.de Abstract. In Case-Based Reasoning (CBR), taxonomies are often used to model similarities. For complex domains and tasks such taxonomies tend to increase in size making them hard to model and maintain. This especially holds true if a group of people is working on the same taxonomy simultaneously. In this paper, we propose a solution by dividing larger taxonomies into sub-taxonomies, which can be regarded as individual and independent taxonomies. Through compilation, they can be merged and put back into their original context. 1 Introduction For structured CBR systems, taxonomies are a relatively efficient way of defining similarities between different values of a certain attribute. This is achieved by ordering the possible values in hierarchical form as nodes of a tree and assigning similarity values to each node [2][3]. Beginning from the most general descriptor for the attribute in the root node, the values on each child node get more and more specific towards the leaf nodes of the tree. The similarity values follow that pattern usually starting at 0.0 for the root node and ending at 1.0 for the leaves. The value of each node serves as its unique identifier as the calculation of similarities has to be consistent. Once the tree is complete, the similarity between two different known values of the same attribute is calculated by choosing the closest common predecessor in the tree-structure. In case the attribute-values are identical or if the query contained a more general attribute-value that is a direct predecessor to the value in a case from the case-base, the resulting similarity is 1.0. It should be noted that the same concepts for similarity-calculation remains applicable in our approach for a compiled taxonomy. Modeling taxonomies can be difficult and can cause a high effort, especially when modeling hundreds of values in a single taxonomy. For example, a taxon- omy of aircraft systems may contain several thousand values from components (e.g. cabin) to systems (e.g. in-flight entertainment system), to system parts (e.g. audio system) to so-called Line-Replaceable-Units (LRU, e.g. display, speaker). Modeling each system or system part in an individual taxonomy would reduce Copyright © 2017 by the paper’s authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org the modeling and maintenance effort, but similarity computation between values of different taxonomies will become very difficult. Therefore, splitting a big tax- onomy into smaller sub-taxonomies solves just a part of the effort problem. To compute the similarities between values of different sub-taxonomies, the relevant sub-taxonomies should be merged during retrieval. This way, the modeling and maintenance effort is reduced, while still allowing a comparison of values from different sub-taxonomies. The motivation for this work has its seeds in the OMAHA research project[4]. In this project we modeled taxonomies for aircraft systems with several thousand nodes. A taxonomy with three subsystems has 7668 nodes. Maintenance of this taxonomies is only possible with high effort. Therefore, the idea is to split up the big taxonomies into several smaller taxonomies with only dozens of nodes. This way we could maintain the smaller taxonomies with less effort. To have the same effect on similarity as the origin taxonomy, the smaller taxonomies have to be merged for similarity measurement. For the rest of this paper, we will regard taxonomies from their structural point of view, namely trees. When splitting up this structure, there are some obstacles that need to be overcome. It is important that the similarity measure of the sub-taxonomies can be used even when the respective attribute values are put back into their original context. Furthermore, because we are splitting up a structure, that could mean that some parts of the structure are reused at more than one single point. This offers the chance of eliminating redundant work on repetitive parts of the taxonomy, but also forces us to consider means to keep the nodes uniquely identifiable. This paper is structured as follows, in Section 2 we describe some related work and distinguish our approach from previous research. Section 3 describes in more detail the approach of splitting and merging taxonomies and its imple- mentation within the tool myCBR[1]. We then give an argumentative evaluation and conclude with an outlook on future work. 2 Related Work There is plenty of research on taxonomies and on combining and merging them, that is not directly related to CBR. Several algorithms have been described to merge different taxonomies and to unify them, such as RCC-5[10] or the Ex- tended Integrated Taxonomy Generation algorithm[8]. These algorithms merge taxonomies from different sources into a target taxonomy. During the merge process duplicate values are removed, sub-taxonomies and leaves are combined, and sometimes unused nodes are removed. The intention of these algorithms is to unify the values of different taxonomies to put them into a common context. Our approach tends not to unify the values of smaller taxonomies, but allows the integration of several taxonomies via reference nodes into another taxonomy. This way we also put the taxonomies into a common context, but with the focus on similarity computation during retrieval. There is also some research in the context of dynamic similarity measures. Approaches range from dynamic similarity on databases[6] via their use in data mining and clustering[9] to adaptive similarity measures in CBR[7]. These ap- proaches use similarity measures to retrieve data from data sources while dy- namically adapting these measures depending on the content of the query. Our approach also allows a dynamic similarity measure, using the dynamic composi- tion of taxonomies for retrieval. This can be done by the user before the retrieval or depending on the content of the query during the retrieval. This allows us to compose a taxonomy as similarity measure for attributes depending on the required information from sub-taxonomies. 3 Splitting and Merging of Taxonomies In this section we describe our approach for splitting and merging taxonomies and the current implementation in the open source tool myCBR. 3.1 General introduction to the concept of splitting taxonomies Before splitting up taxonomies and being able to combine them dynamically, it is necessary to model an interface between taxonomies that defines their relation. By assigning a unique identifier to each sub-taxonomy, it is possible to reference them when needed. A node in one taxonomy could for example reference an entire separate taxonomy by its name. As trees contain one single root element, the referencing node therefore gets replaced with the root-node of the referenced taxonomy. If a referenced sub-taxonomy can not be found, for example because it was deleted, the referencing node will be treated as a leaf node. Because the important part in the context of CBR is the use of taxonomies as similarity measures, the similarity values of the sub-taxonomies have to remain consistent. Maintaining the original values of the original single taxonomy would be the most obvious solution but does defeat the purpose of splitting the taxon- omy in the first place. If all similarity-values of the nodes are simply kept, the sub-taxonomies continue to be dependent on the context from which they are referenced from. It should be possible to model the similarity values of each sub- taxonomy individually without that kind of dependency to use a sub-taxonomy as an individual similarity measure. This also allows the same sub-taxonomy to be reused in multiple contexts by multiple different taxonomies - even across different projects - that reference it. Considering that a typical taxonomy has the similarity value of 0.0 in its root node and that similarity-values have to increase from the root towards the leaves of the taxonomy, a solution to this problem has to adjust the similarity values of the referenced sub-taxonomy when inserting it back into the project- context. When inserting a sub-taxonomy into another taxonomy, the important parameter to define the context is the similarity-value in the referencing node. Therefore, it is also important that a referencing node keeps the value that was originally in place at its position when a taxonomy is split. Fig. 1. Taxonomy with Reference Taxonomy: form factor form factor mobile stationary REFERENCE Laptop other mobile Workstation Server devices Taxonomy: other mobile devices other mobile devices Smartphone Tablet 3.2 Merging of taxonomies Once taxonomies are separated it should be possible to merge them together into a single taxonomy that can directly be used in the context of CBR. For the merging procedure we propose the compilation of a combined taxonomy, based on a formula that adjusts the similarity-values of a sub-taxonomy when inserted into a taxonomy that references it. In the following formula the similarity value of the referencing node is repre- sented by simref erencing while simref erenced.root represents the value stored for the root node of the referenced taxonomy. Given the old similarity value of any node of the referenced sub-taxonomy simold , it is possible to calculate a new similarity-value simcalculated for the node in the project-context. simold − simref erenced.root simcalculated = simref erencing + (1 − simref erencing ) 1 − simref erenced.root This formula proportionally maps the similarity values of the referenced sub- taxonomy from the interval [simref erenced.root , 1] to the interval [simref erencing , 1]. If simref erencing or simref erenced.root is already 1.0, all nodes of the sub-taxonomy are inserted into the project-context with the similarity-value of 1.0. The map- ping is achieved following the pattern of the mathematical mapping from a source-interval to a target-interval as shown later in this section with xmapped . Depending on the conventions of a project, one might assume that simref erenced.root always carries the value 0.0. This does however not have to be the case. Some projects might choose to keep their existing values when they split taxonomies because the people working on the different parts of the taxonomy are already used to seeing specific values for their part of the taxonomy. Using the proposed version of the formula, that remains possible while assuming simref erenced.root = 0.0 would offer the possibility to further simplify it. Fig. 2. Calculation of Similarity-Values Taxonomy: form factor form factor sim = 0 mobile stationary sim = 0.2 sim = 0.5 Laptop REFERENCE Workstation Server sim = 0.8 other mobile sim = 0.8 sim = 0.6 devices sim = 0.4 Calculation of similarity-values with the propsed formula For this graphical representation simplified to: f(sim_old) Taxonomy: other mobile devices other mobile devices sim = f(0) Smartphone Tablet sim = f(0.5) sim = f(0.5) Given that all nodes below the referencing node have higher or equal calcu- lated similarity-values compared to it, the condition to have increasing similarity- values from top to bottom is fulfilled. In the following, we describe the merge process of taxonomies. We call the referencing taxonomy T axA while the referenced taxonomy is called T axB. The similarity-values of the nodes in T axA remain the same as before. T axB initially starts with a root-node that carries the same similarity-value as the referencing node. All other nodes are assigned the same similarity-values as before. Because the goal is to be able to look at the taxonomies as separate and independent constructs, the values of T axB are now mapped proportionally from the original interval [simref erencing , 1] to [0, 1]. T axB now carries similarity- values that are detached from the original project-context. Finally and most importantly, the resulting values calculated for the project- context are actually identical to the values of the original taxonomy. In fact it does not matter to which interval [b, 1] we mapped the original values to as long as b 6= 1. Proof (Calculated similarity-values are identical to original after merge). Assuming the original values of a sub-taxonomy were mapped from [a,1] to [b,1], a, b 6= 1 and a given mapped similarity simmapped of an arbitrary element from that sub-taxonomy, we have to prove for that the calculated similarity based on simmapped is the same as the original similarity simoriginal (simcalc = simoriginal ) Note: b=a is not excluded Inserting into the formula, we get simmapped − b simcalculated = a + (1 − a) 1−b As proportional mapping from an interval [g, h] to another interval [i, j] follows the scheme x−g xmapped = i + (j − i) h−g and we originally mapped from [a,1] to [b,1], we get simoriginal − a simmapped = b + (1 − b) 1−a simoriginal − a b+ (1 − b) − b ⇒ simcalculated = a + 1−a (1 − a) 1−b simoriginal − a (1 − b) =a+ 1−a (1 − a) = simoriginal 1−b  3.3 Compilation of taxonomies When implementing the compilation algorithm for sub-taxonomies, the basic structure follows the patterns used for conventional trees. Each node contains a descriptive id which matches an attribute value of the CBR application and a similarity value. A non-leaf node may have children nodes attached directly to it or via a reference to another taxonomy. In our implementation the decision between children nodes and a reference is exclusive in order to keep the overall structure as simple to understand as possible. This, however, is a pure design decision and does not impose a tech- nical restriction. The decision was made in order to achieve a clear distinction between the two different ways in which children may be included. Nodes that reference another taxonomy are called referencing nodes while others are called non-referencing nodes. The compiler gets passed a set of sub-taxonomies as resources for the com- pilation as well as the information on which sub-taxonomy defines the root of the desired output. Because all of the sub-taxonomies are referenced by their id/name, it makes sense to use a key-value mapping for efficient access to the requested sub-taxonomy. When inserting the same taxonomy multiple times into one big taxonomy at different referencing nodes, the identification of the different instances of the sub-taxonomy needs to remain unambiguous. For those use-cases, the compiler offers the option to generate unique names for each of the nodes by inserting the name of the referencing nodes parent as a prefix before the names of the nodes in the referenced taxonomy. Using ’:’ as a separator symbol between the names, the names for the taxonomy-nodes from ’other mobile devices’ as shown in Figure 2 would result in ’mobile:other mobile devices’, ’mobile:Smartphone’ and ’mobile:Tablet’. Figure 3 shows an overview of the algorithm executed in order to compile the taxonomy. It should be noted that the algorithm treats directly attached children and children that are inserted by reference in exactly the same way. A referencing node gets replaced by the node that it references and the children of that referenced node are considered its children. Fig. 3. Overview of the Compilation-Algorithm compileTaxonomy Creation of an empty tree/taxonomy as target (the root of that tree is the first target node) compileNode Following the context of the target node for every childnode of the original node: Transfer of adjusted values Addition of a new node underneath the from the original node: current node in the target taxonomy as - Similarity new target node - Name Readjustment of node values: -similarity of current target node original is - id/name of current target node non-referencing node under consideration of the new context-parameters Adjustment of context parameters: original is - sim_referencing XOR referencing node - sim_reference.root - prefix 3.4 Runtime depending on input size We define the number of nodes of the resulting taxonomy as input length n. This definition of the input length is somewhat unconventional as it assumes information about the resulting taxonomy. It still makes sense though when put into the context of our concept. The alternative to this algorithm and concept is to have large singular taxonomies with all their disadvantages. The concept of subtaxonomies offers the possibility to represent the same taxonomies in a simplified way - therefore taking the resulting singular taxonomy as input size is not as far fetched as it might seem at first. The basic processing of each node happens in O(1) as the only action performed for each node is the adjustment of its values. Each node needs to be processed one time resulting in O(n) for the average, worst and best case. The part where a non-linear increase of the runtime is possible would be the process of finding the according taxonomy for a reference. As mentioned before, an access-optimize storage model for access from a given name/identificator of a taxonomy would make sense here. The expected effort for finding elements by using hashing is within O(1) depending on the amount of elements in the hashtable compared to the size of the hashtable [5, Chapter 6.4]. While the worst case can be O(n), optimized hybrid implementations like Java 8’s HashMap offer a runtime of O(log(n)) for the worst case by using search trees. The algorithm is able to efficiently compile a taxonomy within an expected runtime of O(n). In the end even the worst case runtime of the algorithm is in O(n ∗ log(n)). It should be noted that this would assume that the largest part of nodes are referencing nodes, which does not make sense in real life scenarios for obvious reasons. Assuming a low (compared to the number of nodes in total) and constant number of referencing nodes we further approach O(n) even when hitting the worst case for the hashtable. 3.5 Implementation within myCBR myCBR is an open source tool and library for CBR modeling and retrieval tasks that integrates structural CBR into an easy to use interface, the myCBR workbench and can also be accessed through an API. It is a joint effort of the Competence Center CBR at DFKI, Germany, and the School of Computing and Technology at UWL, UK. Various industry partners like Airbus and Lufthansa Industry Solutions work together with those institutes to explore new possibili- ties and approaches to CBR systems[4]. Before the integration of sub-taxonomies into myCBR, the tool already of- fered support for traditional taxonomies that was used as the basis for the in- tegration. As a matter of fact, the integration was achieved while keeping the existing calculations for similarities between cases intact. Sub-taxonomies get compiled into a construct that remains largely unchanged from the original structure which was already used before for the calculation of similarities be- tween cases. While adding the functionality for sub-taxonomies along with cycle-detection3 during the compilation of sub-taxonomies to the SDK, the user interface also had to be adjusted. We created a separate perspective for sub-taxonomies in the workbench, where all sub-taxonomies are collected. The tool myCBR separates attributes and similarity functions in a way that different types of similarity functions can be assigned to an attribute. The addition of the function-type ’Sub-taxonomy’ allows to assign a single sub-taxonomy to an attribute that is then compiled to a complete taxonomy using all other sub-taxonomies that exist in the project. Traditional taxonomies that existed in older myCBR projects can be im- ported as sub-taxonomies. The export of single sub-taxonomies is also possible to transfer them to other myCBR projects. Those taxonomies can be split ac- cording to their logical parts (e.g. every system inside of a plane gets its own sub-taxonomy for the plane-taxonomy). However, it serves the purpose of low- ering the effort for maintenance to pay special attention to duplicate (sub)parts inside of existing taxonomies. Processing a traditional taxonomy to gain a set of sub-taxonomies that elim- inates duplicate information is a workflow which consists of three steps. First we check if at least one of the imported taxonomies can be simplified. A taxonomy can be simplified if there is repetitive information across several taxonomies or within the same taxonomy. If there are several taxonomies that can be simplified, we use the new myCBR functions to split the repetitive part from the original taxonomy. Then we use set reference to replace the according part in all other taxonomies. After this process we check again if the taxonomies can be simpli- fied even further. If no further simplification is possible, the workflow is finished. Figure 4 shows an event-driven process chain for this workflow. Fig. 4. Workflow to simplify taxonomies with myCBR Workbench a new set of check if the the taxonomies taxonomy is references are taxonomies is XOR taxonomies can XOR can be split taxonomy set references split set imported be simplified simplified function the taxonomies event myCBR can not be myCBR myCBR simplified component Figure 5 shows the interface of the myCBR-Workbench in the ’Sub-taxonomies’- perspective.In order to use a sub-taxonomy for the calculation of similarities for an attribute, the user has to switch to the ’Modeling’-perspective and add a Sub-taxonomy-Function as a similarity function to an attribute inside of the 3 For example: TaxA references TaxB and TaxB references TaxA resulting in a circle Fig. 5. Compilation of sub-taxonomies current project. By assigning a Sub-taxonomy-Function as similarity function, the user can now choose any of the taxonomies contained in the collection of sub-taxonomies inside of the project. That taxonomy will be used as the root taxonomy for the compilation of a single taxonomy out of all referenced sub- taxonomies. As a result of the selection, the taxonomy gets compiled and myCBR can now perform the same similarity-calculations on that taxonomy that were possi- ble before, using traditional taxonomies. In contrast to traditional taxonomies, this compiled version of sub-taxonomies can not be modified directly. All modi- fications have to be performed directly on the sub-taxonomies which it consists of. After modification of one or multiple of those sub-taxonomies, the user can choose to recompile this taxonomy at any time by clicking ”Reload from Sub- taxonomies”. 4 Evaluation Following the principle of separation of concerns it makes sense to split up a taxonomy into multiple meaningful parts. Being able to regard each resulting part separately allows for individual modeling thereby making work easier for the ones working on the different parts. Without using the concept of sub-taxonomies proposed in this paper, a change in a parent node needs to be reflected in all the nodes that succeed it. In our concept, each sub-taxonomy that is integrated under the according node through compilation does not need to be touched. In instances where parts of a taxonomy are reused across multiple projects or multiple taxonomies in the same project, there is also less modeling effort as the changes do not need to be copied over through multiple taxonomies anymore and can instead be implemented by one single sub-taxonomy. While we already demonstrated the feasibility of using sub-taxonomies through mathematical means earlier in this paper, we also computed multiple random splits of taxonomies provided by airbus at 20 % of each taxonomies nodes, mapped each resulting sub-taxonomy from [simroot , 1] to [0, 1] 4 and finally compared the resulting compiled taxonomies to the original taxonomy checking for deviations. Fig. 6. Evaluation Method Split Normalize Compile Compare with Original ? = This process was repeated 10 000 times for the following taxonomies of the OMAHA research project resulting in the processing of 1 020 000 nodes in all iterations combined: tax engine type was split 6 times (Taxonomy size:30 nodes) sim tax ata was split 10 times (Taxonomy size:52 nodes) sim tax emitter was split 3 times (Taxonomy size:18 nodes) default function was split 0 times (Taxonomy size:2 nodes) Alongside consistency checks on the general structure of the taxonomy (nam- ing, tree structure), we also measured the average and maximum error between the nodes of the compiled taxonomy compared to its original counterpart. As expected, the results showed no measurable deviation from the original taxon- omy. 5 Summary and outlook In this paper, we presented a concept that offers the functionality to combine separately created and maintained sub-taxonomies into a single taxonomy. The feasibility of this concept has been mathematically proven and demonstrated 4 Mapping → Normalization of the sub-taxonomy with the integration into the myCBR-project. It was further shown that the concept does not impose any noteworthy computational overhead. During the design we aimed for a concept of taxonomies that allows for efficient maintenance and we assume that the adaption of this concept would bring benefits to any area where complex and hard to maintain taxonomies are prevalent. It is possible to separate complex taxonomies with relatively low effort and risk as similarity values can be kept identical before and after this process. On top of that our concept makes sharing parts of a taxonomy across one or multiple projects possible and comfortable. For the future an evaluation of this concept with respect to the maintenance effort after and before introduction together with industry partners would be desirable. References 1. Bach, K., Sauer, C.S., Althoff, K.D., Roth-Berghofer, T.: Knowledge modeling with the open source tool mycbr. In: Proceedings of the 10th Workshop on Knowledge Engineering and Software Engineering (2014) 2. Bergmann, R.: On the use of taxonomies for representing case features and local similarity measures. In: Proceedings of the Sixth German Workshop on CBR. pp. 23–32 (1998) 3. Bergmann, R.: Experience Management, Foundations, Development, Methodol- ogy and Internet-Based-Applications. Springer Verlag Berlin Heidelberg New York (2002) 4. BMWI: Luftfahrtforschungsprogramms v (2013), http://www.bmwi.de/ BMWi/Redaktion/PDF/B/bekanntmachung-luftfahrtforschungsprogramm-5, property=pdf,bereich=bmwi2012,sprache=de,rwb=true.pdf 5. Knuth, D.: The art of computer programming. Addison-Wesley Pub. Co, Reading, Mass (1973) 6. Köppelmann, M., Lange, D., Lehmann, C., Marszalkowski, M., Naumann, F., Ret- zlaff, P., Stange, S., Voget, L.: Scalable similarity search with dynamic similarity measures (2012) 7. Long, J., Stoecklin, S., Schwartz, D.G., Patel, M.K.: Adaptive similarity metrics in case-based reasoning 8. Raunich, S., Rahm, E.: Target-driven merging of taxonomies. CoRR abs/1012.4855 (2010), http://arxiv.org/abs/1012.4855 9. Ruiz-Miró, M.J., Miró-Julià, M.: Dynamic similarity and distance measures based on quantiles. In: Computer Aided Systems Theory – EUROCAST 2015 (2015) 10. Thau, D., Bowers, S., Ludäscher, B.: Merging taxonomies under rcc-5 algebraic articulations. In: Proceedings of the 2Nd International Workshop on Ontologies and Information Systems for the Semantic Web. pp. 47–54. ONISW ’08, ACM, New York, NY, USA (2008), http://doi.acm.org/10.1145/1458484.1458492