=Paper=
{{Paper
|id=Vol-1788/STIDS2016_T01
|storemode=property
|title=Scalable Semantically Driven Decision Trees for Crime Data
|pdfUrl=https://ceur-ws.org/Vol-1788/STIDS_2016_T01_Johnson_Karabatis.pdf
|volume=Vol-1788
|authors=Shawn Johnson,George Karabatis
|dblpUrl=https://dblp.org/rec/conf/stids/JohnsonK16
}}
==Scalable Semantically Driven Decision Trees for Crime Data==
Scalable Semantically Driven Decision Trees for Crime Data Shawn Johnson, George Karabatis Department of Information Systems University of Maryland, Baltimore County (UMBC), 1000 Hilltop Circle, Baltimore, MD 21250, USA {yv74924, GeorgeK}@umbc.edu Abstract. When dealing with large volumes of data in 1) We allow user specified search preferences to be organizations, there is always a need to associate data expressed with robust semantics resulting in with its appropriate meaning, since the same data object higher precision and recall that closely match the may have different meaning to different users. This user's intended search terms. creates a problem of delivering search results that is different from a requester’s intended purpose. To solve 2) We have developed a method for user specified this problem, we propose a parallelizable framework capable of capturing user specified constraints that are semantics to be expressed probabilistically in the both semantically relevant to a search/domain in search. Search results that have a semantic question as well as contextually relevant to a user and/or similarity to a set of user specified preferences organization. are also returned enabling us to handle uncertainty in our approach as well. I. INTRODUCTION When attempting to find data that is relevant to a 3) We have developed a way for multiple sets of user and/or organization based on a query, it is very user specified preferences to be expressed using common to retrieve exact matches in response to a the same ontology. search. While using the exact matching of strings as user search keywords can retrieve exact results, a user 4) Our methodology allows for robust semantics to may be looking for data with a specific meaning be expressed in a parallelizable way. based on his or her intended search preferences but the data that is provided by a system or organization 5) We have implemented a prototype and evaluated may have a completely different meaning. This our methodology by conducting experiments problem is further compounded by having to ensure using precision and recall as metrics. that data that has recently been ingested into a system is consistent with data that currently resides on the Motivating Example: Many municipalities often same system. This can create an intractable problem have a need to capture semantically relevant data for for a user such as having to constantly poll and an intended purpose. For example, Bob, a detective classify new data or accept new data that may be with the city of Baltimore, needs to get the current inappropriately classified to ensure the semantic statistics of all aggravated theft incidents between meaning of the data is consistent. In addition, the 1964 and 2013, in Baltimore, Maryland, USA. Bob problem of new data being added to a system on a wants to compare this data with the same data on a massive scale necessitates a scalable solution. We national scale over the same time period. propose an approach which allows users to Furthermore, there are different types of data on thefts personalize search terms according to the same set of that Bob wants to compare against. To further concepts where search results can be universally compound the issue, different vendors provide understood by the same community of users different labels for the same type. The impact of according to personalizable search profiles. Our capturing more semantically relevant data and getting approach also enhances the accuracy of search results more accurate results enables law enforcement by returning semantically similar results from a officials to make better informed decisions because specific domain. The impact of our approach the information they are looking for is more precise dramatically enhances the ability of users to and relevant to a specific situation for which the personalize a search thus retrieve more accurate officials need to make a decision about. results. With the introduction of our framework, we make a few key contributions: We describe our approach in section II, and then we continue on with a discussion of our experiments STIDS 2016 Proceedings Page 2 in section III, a validation of our approach in IV, and 2. Time Context - The time context is anything that then a discussion of relevant work in section V. We describes any kind of temporal attributes related then finish up our work with concluding thoughts and to an entity such as a year proposed future work in section VI. 3. Location Context - The location context describes any kind of location attribute related to an entity II. APPROACH such as a city or state To begin our learning, we make use of a single ontology that represents our categories of context, 4. Activity Context - The activity context describes with each category of context represented via separate a goal related to entity such as how much of branches of the ontology. something a law enforcement official may be looking for such as a total amount of violent Continuing our motivating example, we show a crime sample representation of our ontology using figure 1 to show the taxonomical tree of terms that a user can 5. Relations Context - The relations context are select as user preferences and how they organized. attributes of an entity that describe an entity’s Utilizing the Operational of Context [1] we model our relation to another entity or its parts ontology using 5 separate categories of context. User preferences in our ontology are saved with literals that are added to each node a user has selected. We chose not to add countries in our ontology because we are assuming that our domain is within all 50 states within the USA. Bob the detective stores his preferences in a user profile that matches parts of the ontology (in figure 1) such as all aggravated thefts that have occurred in Baltimore, Maryland, between 1964 and 2013. A special depth first search algorithm then searches the ontology and matches saved user specified literals for each node within each branch of the ontology and creates a special in memory tree model. This in memory tree model is a sub-tree of the ontology that matches all of the user selected preferences that were found in the original ontology. Referring again to figure 1, only parts of the ontology that match aggravated theft, Baltimore, Maryland that are between 1964 and 2013 are copied into the new in memory tree model. After the original ontology has been parsed and the in memory tree model has been implemented, we are now able to build the rest of our decision tree. Records (files) are initially classified based on user preferences using the in memory tree model provided in figure 2. For example, all records that are classified must include records that have the attributes Fig. 1 Sample representation of an ontology: Baltimore, Maryland, aggravated theft, or between the years 1964-2013 (meeting Bob's specified user Operational Definition of Context [1], we model our preferences). Now that only instances remain that ontology using 5 separate categories of context. They match the user specified constraints above, (i.e., the are: records are all containing values for the attributes Baltimore, Maryland, aggravated theft, or in the range 1. Individuality Context - The individuality context 1964-2013) we must build out the remaining part of a are attributes that describe an entity’s type such tree based on the number of user specified splits or as specifying the type of theft criteria to enable the system to return search results with increasing levels of precision by dividing the records into further subsets as specified by the user STIDS 2016 Proceedings Page 3 with each additional split. For example, if the user has in-memory tree model is created copying each of the specified that he or she wishes to have the ontology paths that a user has selected. First, the user specifies be 5 hops deep and the ontology is only 3 hops deep, the number of hops deep (or total depth) he or she additional splits must be completed (if there are wishes for the size of the decision tree to be. Second, enough records left to split against in order to meet the user picks from a list of selected labels that have the user's specified preferences). Referring to figure 2, already been mapped via our ontology via a tree like additional splits are made using the lowest level drop down list or via a search box. (The mappings are collection of leaves for the individuality and location a graphical representation of the ontology, where the branches of the ontology (additional splits were not user can select any one of the nodes of the ontology as made to the time branch of the ontology because the a user preference). After our search is complete we user chose not to do so since year was the most populate a list of labels from our ontology and copy granular measure of time specified in the data). the names of the nodes (and their parents) that the Because the user has selected the decision tree to be 4 user has selected. The same set of user preferences hops deep, additional splits are made using the lowest will be copied from all five categories of the ontology collection of leaves for each branch of the ontology. below (with each branch of the ontology being stored In the location ontology, the lowest collection of as a separate but parallel part of our in memory tree leaves in that branch of the ontology are Baltimore, model). The in memory tree model persists in Potomac, Los Angeles, Sacramento, Albany, and New memory as a service. Files are matched against the York City. Because Potomac has the lowest impurity user preferences that have been stored in the in- (see Semantically Driven Gini Algorithm) computed memory tree model. from the lowest collection of leaves, it is selected as the first candidate split using the Semantically Driven Gini-Index Algorithm. No more splits are performed Semantically Driven Gini-Split on the location branch of the ontology because it is now 4 hops deep. On the individuality branch of the ontology, a single split is made using The Semantically Driven Gini-Index Algorithm, with Vehicle Theft Potomac vehicle theft being selected because it had the lowest impurity (see Semantically Drive Gini Algorithm) of the lowest leaves. No more splits are made for the individuality branch of the ontology because it is now 4 hops deep as well. Our special in memory tree Aggravated Theft model/user driven ontology looks (logically) like Baltimore Figure 2. Any documents that match any of the user preferences or Semantically Driven Gini-Index Maryland Theft Algorithm are then tagged using a custom document summarization algorithm. The tagged files are then copied to The Hadoop Distributed File System (HDFS). Separate mapper calls are then kicked off using search conditions that meet the user specified Location preferences. A single reducer job is kicked off Individuality Root consolidating a set of files that meet a final set of preferences as specified by the user in the in-memory tree model. A final set of key value pairs of files using the key as the name of the matching file and a value of 1 that matches all of the user specified preferences Time are emitted signifying that the algorithm is complete. For example, only the names of files that meet all the user preferences Baltimore, aggravated theft, and between the time range of 1964-2013 are emitted. 1964-2013 In Memory Tree Building Algorithm: To utilize only paths of the ontology that the user has selected as Fig. 2. A logical representation of an in memory tree model (including splits from the Semantically Driven Gini-Index his or her search preferences, a smaller graph of the Algorithm): STIDS 2016 Proceedings Page 4 GINI (t ) 1 ¦[ p( j | t )]2 (1) We implement our depth first search algorithm j using the following representation: C for Category Input: Lowest collection of leaves for each branch of Contextual Model Data, O for Ontology, for each the in-memory model branch of the ontology (each primary branch of the ontology connected to the root), where Output: Additional children nodes for each branch of represents a current node, represents a child the in-memory model node, a indicates the current level (depth) of the ontology, b where a holds a 1 or 0 (1 if for each node Where t is the node, j is the class, and p is probability selected by user or 0 if not selected by the user holds of class j given a node t. The algorithm works as a pointer to the child node, and c is a list of all other follows: user specified preferences, and d is a list of pointers to all child nodes of the current (parent) nodes. 1) Initial splits are made based on classes In memory tree algorithm pseudocode: specified in the in memory tree model. For example, aggravated theft is a candidate for a Input: Ontology O decision tree split since aggravated theft is a Output: in-memory tree model class in the tree (originally specified within the ontology). LOOP: Repeat the following steps (for a + 1 of the current node until a = the lowest leaf in the tree), for 2) Additional splits at the next level of the tree - , until all of O or the entire ontology is finished are implemented using the same collection of leaves (i.e., the lowest collection of leaves 1) For each category of context from Root we for a given branch of the ontology stored in represent our Ontology as follows: O = the memory tree model) with the next split { . For each branch of the tree being the node with next lowest impurity using the calculation defined in (1). = and C is the starting node for each tree (i.e., the starting node after root is 3) Additional splits are induced using the same Location, Time, Individuality, Activity, and lowest collection of leaves (for each branch Relations for each of the 5 categories of of the tree) until the max number of splits context). has been reached either by the following: 2) For a of of add to c of where is a current node, is x A program driven default - This condition the child node of , a is the current depth happens when a program driven default of , b holds a 1 or 0 (1 if a user has chosen number of splits is reached for a given branch of the ontology. For example, if the a user preference or 0 if the node has not been program default is set to 5 splits, then the selected as a user preference), c holds a list of all other user specified preferences and d is a decision tree will not split beyond 5 hops list of pointers from to children deep for that given branch. This is true regardless of whether the split was based nodes on a modeling part of the ontology or a part of The Semantically Driven Gini- 3) If no children exist for d, d = NULL. Index Algorithm used to calculate a split. END LOOP; x A user specified limit is reached - This condition happens when the user specifies a max number of splits he or she sets for a Semantically Driven Gini-Index Algorithm: Once given branch of a decision tree. For the in-memory tree model building algorithm is example, if the max number of splits that complete, our custom Semantically Driven Gini-Index is specified is 7 for a particular branch of Algorithm is kicked off. First, we define our the decision tree then the decision tree will Semantically Driven Gini-Index Algorithm which stop inducing additional splits in that calculates the impurity of each node as follows: branch of the decision tree beyond that number regardless of whether or not the nodes being split come from the ontology STIDS 2016 Proceedings Page 5 or additional splits are determined by The Semantically Driven Gini-Index 2) Ontologies but No Context - Experiments with Algorithm. files that were tagged using ontologies in RDF/OWL Files saved in the in memory model, x The max number of possible splits has but without any user preferences saved within been reached - This condition occurs when them (copying the entire ontology for each all possible splits from within an ontology category of context). The tagged files were then as well as all of the lowest level of leaves processed in a MapReduce Job producing the have been utilized in a split resulting in a results. max number of splits that can be used to build a given decision tree. For example, if 3) Ontologies and Context - These experiments the ontology is 4 hops deep and the lowest were executed using the files that were tagged level of leaves totals 4 leaves as well, this using both the ontologies modeled in RDF/OWL makes the max number of splits possible Files, but also with the saved user preferences for the decision tree to be 8 split. that were parsed from the RDF/OWL Files as well. The tagged files were then processed using III. EXPERIMENTS MapReduce producing the results. To validate our approach, we took roughly 100,000 files from the UCR Data Repository and IV. VALIDATION processed them against 60 user specified preferences stored within the in memory tree model. The We use two well-known metrics to validate our semantically mapped features were saved as tags in a approach: Precision and Recall. Recall is defined as modified version of each file, making matching for the fraction of the records retrieved that are relevant each set of user preferences a matter of matching the to the query. In other words, recall reveals the tags that have been specified by the user saved in the percentage of the retrieved and relevant records that in memory tree model. Finally, the resulting are relevant (whether in the answer set or outside the MapReduce Jobs generated a file name with a value answer set). Precision is defined as the fraction of the of 1 for each set of semantically mapped preferences retrieved records that are relevant to the search. In that were matched. We used the following sample other words, precision reveals the percentage of Scenario 1: Bob is looking for all crimes that occurred retrieved and relevant records in the answer set. within the state of Maryland, between 1998 and 2002, Recall and precision are calculated as follows: he is searching for crime totals for larceny theft in which the files are saved as .xls files. Scenarios 2 through 10 are variants of scenario 1, where semantically mapped preferences were matched (2) against the same files from the UCR Data Repository Values for scenarios 2-10 were chosen at random. We have run three sets of experiments: (3) 1) No Context and No Ontologies - Experiments with user preferences as exact search terms on MapReduce Jobs. 1 0.8 0.6 0.4 No Ontologies and No Context 0.2 0 Ontologies and No Context Ontologies and Context Fig. 3. Experiment results using recall STIDS 2016 Proceedings Page 6 1 0.8 0.6 0.4 No Ontologies or Context 0.2 Ontologies and No Context 0 Ontologies and Context Fig. 4. Experiment results using precision In the above scenarios or sets of test conditions that matching terms in the in memory tree that were parsed from simulated using sets of user specified preferences (like in ontology and stored in the in memory tree model without any Scenario 1 discussed above), the recall was computed for no user specified preferences and matching them against the ontologies and no context by dividing matching exacting properties and content of the files that are being parsed. search terms from a user’s search against the content of the Ontologies and no context resulted in a low precision as well files that are being parsed. The recall was extremely low for because the entire ontology was stored in each in memory tree no ontologies and no context because the exact key words model resulting in tagged files that only partially or did not used in MapReduce Jobs matched a very small percentage of match a user's intended search terms at 26% precision. The the files that were relevant to a user's search. The mean precision for ontologies and context was computed by taking average of all 10 scenarios was 28%. The recall for ontologies any of the matching preferences stored in the in memory tree and no context was computed by taking the matching terms in model that parsed from the ontology and selected by the user the in memory tree model that were parsed from the ontology from the ontology and matching them against properties and and matching them against properties and content in the files content and of the files being parsed. Ontologies and context being parsed. The searches for ontologies and no context resulted in a very high precision versus no ontologies and no yielded a higher recall because all the files that were tagged context and ontologies and no context, because documents using the in memory tree model matched 1 or more of the that returned key value pairs in our MapReduce Jobs matched classes specified in the ontology with a total of 47% recall, the semantically mapped user preferences as well the semantic resulting in almost a 20% increase from no ontologies and no constraints specified in the ontology resulting in an 80% context. The recall for ontologies and context was computed precision; a 54% increase in ontologies and no context and a by taking the matching preferences stored in the in memory 63% increase in accuracy vs. no ontologies and no context. tree model and matching them against properties and content of the files being parsed. Searches involving ontologies and In summary we learned that enabling a user to pick context had a very high recall because all results retrieved semantically enriched preferences from an ontology of terms matched both the semantically driven preferences that were that reflect an existing domain from a corpus can lead to a saved in the in memory tree model as well as with the much higher precision and recall than using exact search terms semantically specified constraints from the ontology resulting or just using semantically enriched search terms parsed from in 63% recall, a 16% increase from ontologies and no context an ontology. By building an in memory tree model we are able and a roughly 35% increase improvement in recall vs. no to both represent the knowledge representative of a domain ontologies and no context. and we also enable personalization by a user of that knowledge as well. In order to enable robust personalization, Precision was computed for no ontologies and no context semantics must first be reflected in a model before a user can by dividing matching exacting search terms from a user’s pick them. This is depicted in our results by showing that the search against the content of the files that are being parsed. precision and recall is higher than with just choosing exact The precision for no ontologies and no context was extremely search terms and the precision and recall further improves low because the overwhelming majority of the search results when allowing a user to pick attributes he or she wishes to use that were returned did not match the intended user search in a search with terms picked from the ontology. preferences because exact key words were used for each search resulting in a 17% precision. The precision for ontologies and no context was computed by taking any of the STIDS 2016 Proceedings Page 7 V. RELATED WORK References Zhang et al. [2] formulated an approach that calculated [1] A. Zimmermann, A. Lorenz and R. Oppermann, "An Operational Definition of Context," 2007. the information gain for finding the best split for a node between two or more separate taxonomies. This approach does J. [2]ZhJ Zhang, A. Silvescu and V. Honava, "Ontology-Driven Induction of not incorporate any kind of semantic inference or contextually Decision Trees at Multiple Levels of Abstraction," Berlin Heidelberg, 2002. driven attributes for building a decision tree, neither does it address any issues with trying to make a decision tree [3] B. Gajderowicz, M. Soutchanski and A. Sadeghian, "Trees, Ontology parallelizable. Gajderowicz et al. formulated an approach for Enhancement through Inductive Decision Trees," in Uncertainty enriching manually created ontologies using decision trees [3]. Reasoning for the Semantic Web II, Springer Berlin Heidelberg, 2013. They also developed a system for using decision trees for ontology matching [4]. Johnson et al., formulated an approach [4] B. Gajderowicz, "Using Decision Trees for Inductively Driven Semantic Integration and Decision Matching," Ryerson University, Toronto, 2011. for enriching ontologies off of custom built decision trees [5]. Bouza et al. described an approach by using an ontology to [5] I. Johnson, J. Abécassis, B. Charnomordic, S. Destercke and R. build user profiles to make various recommendations on user Thomopoulos, "Making Ontology-Based Knowledge and Decision Trees behavior [6]. Our approach not only allows a user to specify Interact: An Approach to Enrich Knowledge and Increase Expert preferences, but also ensures that they are semantically similar Confidence in Data-Driven Models," in Knowledge Science, Engineering and Management, Springer Berlin Heidelberg, 2010. to any search results; it is parallelizable too. Fanizzi et al. developed a novel framework for learning custom description [6] A. Bouza, G. Reif, A. Bernstein and H. Gall, "SemTree: Ontology-Based logic learning languages using decision trees. While this Decision Tree Algorithm for Recommender Systems," Karlshruhe, 2008. approach is novel for learning description logic concept definitions, it does not incorporate user preferences [7]. [7] N. Fanizzi, C. d’Amato and F. Esposito, "Induction of Concepts in Web Ontologies through Terminological Decision Trees," Barcelona, 2010. Nenkova et al. developed a technique for summarizing documents based on frequency [8]. Arun and Gunavathi [8] A. Nenkova, L. Vanderwende and K. R. McKeown, "A Compositional developed a technique for summarizing documents using Context Sensitive Multi-document Summarizer: Exploring the Factors That Influence Summarization," Seattle, Washington, 2006. context sensitive weights for indexing [9]. This work did not utilize the contextual properties of parts of a document nor [9] J. M.E and C. Gunavathi, "Document Summarization and Classification were user preferences utilized when creating the summaries. sing Concept and Context Similarity Analysis," 2014. Witte et al. developed a fuzzy graph technique for multi- document summarization [10]. Barzilay et al. developed a [10] R. Witter, R. Krestel and S. Bergler, "Context-based Multi-Document technique for summarizing documents using the contextual Summarization Using Fuzzy Coreference Cluster Graphs". attributes found in text across a series of documents [11]. Yang et al. developed a summarizing framework using the [11] R. Barzilay, K. R. McKeown and M. Elhadad, "Information Fusion in social contextual information, but they did not utilize user the Context of Multi-Document Summarization". specified preferences beyond social ones such as a time or location that a document was created [12]. [12] Z. Yang, C. Keke, J. Tang, Z. Li, S. Zhong, L. Jaunzi and Y. Zi, "Social Context Summarization," Beijing, China, 2011. VI. CONCLUSIONS [13] Apache Spark, "Apache Spark," [Online]. Available: http://spark.apache.org. [Accessed 10 06 2016]. We described and validated an approach to specify semantically driven user preferences in a parallelizable way. [14] T. Boujari, "Instance-Based Ontology Alignment Using Decision Trees," We also encountered a few limitations. First, we found that Institutionen för Datavetenskap, 2012. extensive exploration and sampling of data files was needed to be able to properly model preferences in our ontology to [15] H. F. Witschel, "Using Decision Trees and Text Mining Techniques for confirm a consistent structure of a file format when creating Extending Taxonomies," in In Proceedings of Learning and Extending Lexical Ontologies by Using Machine Learning Methods, 2005. our in memory tree model. Second, we found that the user preferences that were specified in the in memory model [16] W. J. Wei Dai, "A MapReduce Implementation of C4.5 Decision Tree needed to closely mirror the user preferences that were Algorithm," International Journal of Database Theory and Application, specified in the ontology or this would lead to searches pp. 49-60, 2014. returning incorrect results or errors resulting in our program because the structure of RDF/OWL Model was incorrect. For [17] R. Mirambicka, A. R. Sulthana and G. Vadivu, Decision Tree Applied to our future work we plan on testing much larger datasets. Also Learning Relations Between Ontologies. planned are further attempts to model more expressive attributes for user specified preferences. Finally, we plan on [18] D. Jeon and W. Kim, "Development of Semantic Decision Tree," in Data Mining and Intelligent Information Technology Applications (ICMiA), utilizing Apache Spark [13] and new versions of Hadoop to 2011 3rd International Conference on, 2011. allow for both a more novel design and implementations of our approach. STIDS 2016 Proceedings Page 8 [19] M. Patil, S. Khomane, V. Saykar and K. Moholkar, "Web People Search Using Ontology Based Decision Trees," International Journal of Data Mining & Knowledge Management Process, 2012. [20] B. Panda, J. S. Herbach, S. Basu and R. J. Bayardo, "PLANET: Massively Parallel Learning of Tree Ensembles," Google, 2009. [21] J. Han, Y. Liu and X. Sun, "A Scalable Random Forest Algorithm Based on MapReduce," in Software Engineering and Service Science (ICSESS), 2013 4th IEEE International Conference on, Bejing, China, 2013. [22] W. Yin, V. Simmhan and V. K. Prasanna, "Scalable Regression Tree Learning on Hadoop Using OpenPlanet," in MapReduce '12 Proceedings of third international workshop on MapReduce and its Applications, New York, New York, 2012. [23] S. d. Río, V. López, J. M. Benítez and F. Herrera, "On The Use of MapReduce For Imbalanced Big Data Using Random Forest," Information Sciences, pp. 112-137, 2014. [24] S. Tyree, K. Q. Weinberger and K. Agrawal, "Parallel Boosted Regression Trees for Web Search Ranking," in WWW 2011 – Session: Ranking, NY, NY, 2011. [25] X. Zhanga, C. Liua, S. Nepalb, C. Yanga, W. Dou and J. Chen, "A Hybrid Approach For Scalable Sub-Tree Anonymization Over Big Data Using MapReduce On Cloud," Journal of Computer and System Sciences, pp. 1080-1020, 2014. [26] J. Ye, J.-H. Chow, J. Chen and Z. Zheng, "Stochastic Gradient Boosted Distributed Decision Trees," in CIKM '09 Proceedings of the 18th ACM Conference on Information and Knowledge Management, New York, New York, 2009. [27] A. Ghoting, P. Kambadur, E. Pednault and R. Kannan, "NIMBLE: a Toolkit For The Implementation of Parallel Data Mining and Machine," in KDD '11 Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, New York, 2011. [28] G.-Q. Wu, H.-G. Li, X.-G. Hu, Y.-J. Bi, J. Zhang and W. Xindong, "MReC4.5: C4.5 Ensemble Classification with MapReduce," in 2009 Fourth ChinaGrid Annual Conference, Yantai, Shandong, 2009. STIDS 2016 Proceedings Page 9