=Paper=
{{Paper
|id=Vol-1788/STIDS2016_T01
|storemode=property
|title=Scalable Semantically Driven Decision Trees for Crime Data
|pdfUrl=https://ceur-ws.org/Vol-1788/STIDS_2016_T01_Johnson_Karabatis.pdf
|volume=Vol-1788
|authors=Shawn Johnson,George Karabatis
|dblpUrl=https://dblp.org/rec/conf/stids/JohnsonK16
}}
==Scalable Semantically Driven Decision Trees for Crime Data==
<pdf width="1500px">https://ceur-ws.org/Vol-1788/STIDS_2016_T01_Johnson_Karabatis.pdf</pdf>
<pre>
Scalable Semantically Driven Decision Trees for
                 Crime Data
                                          Shawn Johnson, George Karabatis
                                         Department of Information Systems
                                 University of Maryland, Baltimore County (UMBC),
                                  1000 Hilltop Circle, Baltimore, MD 21250, USA
                                          {yv74924, GeorgeK}@umbc.edu

Abstract. When dealing with large volumes of data in          1) We allow user specified search preferences to be
organizations, there is always a need to associate data          expressed with robust semantics resulting in
with its appropriate meaning, since the same data object         higher precision and recall that closely match the
may have different meaning to different users. This              user's intended search terms.
creates a problem of delivering search results that is
different from a requester’s intended purpose. To solve
                                                              2) We have developed a method for user specified
this problem, we propose a parallelizable framework
capable of capturing user specified constraints that are         semantics to be expressed probabilistically in the
both semantically relevant to a search/domain in                 search. Search results that have a semantic
question as well as contextually relevant to a user and/or       similarity to a set of user specified preferences
organization.                                                    are also returned enabling us to handle
                                                                 uncertainty in our approach as well.
                  I. INTRODUCTION
     When attempting to find data that is relevant to a       3) We have developed a way for multiple sets of
user and/or organization based on a query, it is very            user specified preferences to be expressed using
common to retrieve exact matches in response to a                the same ontology.
search. While using the exact matching of strings as
user search keywords can retrieve exact results, a user       4) Our methodology allows for robust semantics to
may be looking for data with a specific meaning                  be expressed in a parallelizable way.
based on his or her intended search preferences but
the data that is provided by a system or organization         5) We have implemented a prototype and evaluated
may have a completely different meaning. This                    our methodology by conducting experiments
problem is further compounded by having to ensure                using precision and recall as metrics.
that data that has recently been ingested into a system
is consistent with data that currently resides on the              Motivating Example: Many municipalities often
same system. This can create an intractable problem           have a need to capture semantically relevant data for
for a user such as having to constantly poll and              an intended purpose. For example, Bob, a detective
classify new data or accept new data that may be              with the city of Baltimore, needs to get the current
inappropriately classified to ensure the semantic             statistics of all aggravated theft incidents between
meaning of the data is consistent. In addition, the           1964 and 2013, in Baltimore, Maryland, USA. Bob
problem of new data being added to a system on a              wants to compare this data with the same data on a
massive scale necessitates a scalable solution. We            national scale over the same time period.
propose an approach which allows users to                     Furthermore, there are different types of data on thefts
personalize search terms according to the same set of         that Bob wants to compare against. To further
concepts where search results can be universally              compound the issue, different vendors provide
understood by the same community of users                     different labels for the same type. The impact of
according to personalizable search profiles. Our              capturing more semantically relevant data and getting
approach also enhances the accuracy of search results         more accurate results enables law enforcement
by returning semantically similar results from a              officials to make better informed decisions because
specific domain. The impact of our approach                   the information they are looking for is more precise
dramatically enhances the ability of users to                 and relevant to a specific situation for which the
personalize a search thus retrieve more accurate              officials need to make a decision about.
results. With the introduction of our framework, we
make a few key contributions:                                     We describe our approach in section II, and then
                                                              we continue on with a discussion of our experiments


                                            STIDS 2016 Proceedings Page 2
in section III, a validation of our approach in IV, and          2.   Time Context - The time context is anything that
then a discussion of relevant work in section V. We                   describes any kind of temporal attributes related
then finish up our work with concluding thoughts and                  to an entity such as a year
proposed future work in section VI.
                                                                 3.   Location Context - The location context describes
                                                                      any kind of location attribute related to an entity
                       II. APPROACH                                   such as a city or state
    To begin our learning, we make use of a single
ontology that represents our categories of context,              4.   Activity Context - The activity context describes
with each category of context represented via separate                a goal related to entity such as how much of
branches of the ontology.                                             something a law enforcement official may be
                                                                      looking for such as a total amount of violent
     Continuing our motivating example, we show a                     crime
sample representation of our ontology using figure 1
to show the taxonomical tree of terms that a user can            5.   Relations Context - The relations context are
select as user preferences and how they organized.                    attributes of an entity that describe an entity’s
Utilizing the Operational of Context [1] we model our                 relation to another entity or its parts
ontology using 5 separate categories of context.
                                                                      User preferences in our ontology are saved with
                                                                 literals that are added to each node a user has selected.
                                                                 We chose not to add countries in our ontology
                                                                 because we are assuming that our domain is within all
                                                                 50 states within the USA. Bob the detective stores his
                                                                 preferences in a user profile that matches parts of the
                                                                 ontology (in figure 1) such as all aggravated thefts
                                                                 that have occurred in Baltimore, Maryland, between
                                                                 1964 and 2013. A special depth first search algorithm
                                                                 then searches the ontology and matches saved user
                                                                 specified literals for each node within each branch of
                                                                 the ontology and creates a special in memory tree
                                                                 model. This in memory tree model is a sub-tree of the
                                                                 ontology that matches all of the user selected
                                                                 preferences that were found in the original ontology.
                                                                 Referring again to figure 1, only parts of the ontology
                                                                 that match aggravated theft, Baltimore, Maryland that
                                                                 are between 1964 and 2013 are copied into the new in
                                                                 memory tree model.

                                                                      After the original ontology has been parsed and
                                                                 the in memory tree model has been implemented, we
                                                                 are now able to build the rest of our decision tree.
                                                                 Records (files) are initially classified based on user
                                                                 preferences using the in memory tree model provided
                                                                 in figure 2. For example, all records that are classified
                                                                 must include records that have the attributes
Fig. 1 Sample representation of an ontology:                     Baltimore, Maryland, aggravated theft, or between the
                                                                 years 1964-2013 (meeting Bob's specified user
Operational Definition of Context [1], we model our              preferences). Now that only instances remain that
ontology using 5 separate categories of context. They            match the user specified constraints above, (i.e., the
are:                                                             records are all containing values for the attributes
                                                                 Baltimore, Maryland, aggravated theft, or in the range
1.   Individuality Context - The individuality context
                                                                 1964-2013) we must build out the remaining part of a
     are attributes that describe an entity’s type such
                                                                 tree based on the number of user specified splits or
     as specifying the type of theft
                                                                 criteria to enable the system to return search results
                                                                 with increasing levels of precision by dividing the
                                                                 records into further subsets as specified by the user


                                               STIDS 2016 Proceedings Page 3
with each additional split. For example, if the user has     in-memory tree model is created copying each of the
specified that he or she wishes to have the ontology         paths that a user has selected. First, the user specifies
be 5 hops deep and the ontology is only 3 hops deep,         the number of hops deep (or total depth) he or she
additional splits must be completed (if there are            wishes for the size of the decision tree to be. Second,
enough records left to split against in order to meet        the user picks from a list of selected labels that have
the user's specified preferences). Referring to figure 2,    already been mapped via our ontology via a tree like
additional splits are made using the lowest level            drop down list or via a search box. (The mappings are
collection of leaves for the individuality and location      a graphical representation of the ontology, where the
branches of the ontology (additional splits were not         user can select any one of the nodes of the ontology as
made to the time branch of the ontology because the          a user preference). After our search is complete we
user chose not to do so since year was the most              populate a list of labels from our ontology and copy
granular measure of time specified in the data).             the names of the nodes (and their parents) that the
Because the user has selected the decision tree to be 4      user has selected. The same set of user preferences
hops deep, additional splits are made using the lowest       will be copied from all five categories of the ontology
collection of leaves for each branch of the ontology.        below (with each branch of the ontology being stored
In the location ontology, the lowest collection of           as a separate but parallel part of our in memory tree
leaves in that branch of the ontology are Baltimore,         model). The in memory tree model persists in
Potomac, Los Angeles, Sacramento, Albany, and New            memory as a service. Files are matched against the
York City. Because Potomac has the lowest impurity           user preferences that have been stored in the in-
(see Semantically Driven Gini Algorithm) computed            memory tree model.
from the lowest collection of leaves, it is selected as
the first candidate split using the Semantically Driven
Gini-Index Algorithm. No more splits are performed                            Semantically Driven Gini-Split
on the location branch of the ontology because it is
now 4 hops deep. On the individuality branch of the
ontology, a single split is made using The
Semantically Driven Gini-Index Algorithm, with                                                                   Vehicle Theft
                                                                    Potomac
vehicle theft being selected because it had the lowest
impurity (see Semantically Drive Gini Algorithm) of
the lowest leaves. No more splits are made for the
individuality branch of the ontology because it is now
4 hops deep as well. Our special in memory tree                                                                        Aggravated Theft
model/user driven ontology looks (logically) like                   Baltimore
Figure 2.

     Any documents that match any of the user
preferences or Semantically Driven Gini-Index
                                                                    Maryland                                                  Theft
Algorithm are then tagged using a custom document
summarization algorithm. The tagged files are then
copied to The Hadoop Distributed File System
(HDFS). Separate mapper calls are then kicked off
using search conditions that meet the user specified                 Location
preferences. A single reducer job is kicked off                                                                            Individuality
                                                                                                  Root
consolidating a set of files that meet a final set of
preferences as specified by the user in the in-memory
tree model. A final set of key value pairs of files using
the key as the name of the matching file and a value
of 1 that matches all of the user specified preferences                                          Time
are emitted signifying that the algorithm is complete.
For example, only the names of files that meet all the
user preferences Baltimore, aggravated theft, and
between the time range of 1964-2013 are emitted.                                                1964-2013

     In Memory Tree Building Algorithm: To utilize
only paths of the ontology that the user has selected as     Fig. 2. A logical representation of an in memory tree model
                                                             (including splits from the Semantically Driven Gini-Index
his or her search preferences, a smaller graph of the        Algorithm):


                                           STIDS 2016 Proceedings Page 4
                                                                       GINI (t ) 1  ¦[ p( j | t )]2              (1)
     We implement our depth first search algorithm                                           j
using the following representation: C for Category            Input: Lowest collection of leaves for each branch of
Contextual Model Data, O for Ontology, for each               the in-memory model
branch of the ontology (each primary branch of the
ontology connected to the root), where                        Output: Additional children nodes for each branch of
represents a current node,           represents a child       the in-memory model
node, a indicates the current level (depth) of the
ontology, b where a holds a 1 or 0 (1 if for each node        Where t is the node, j is the class, and p is probability
selected by user or 0 if not selected by the user holds       of class j given a node t. The algorithm works as
a pointer to the child node, and c is a list of all other     follows:
user specified preferences, and d is a list of pointers to
all child nodes of the current (parent) nodes.                    1) Initial splits are made based on classes
In memory tree algorithm pseudocode:                                  specified in the in memory tree model. For
                                                                      example, aggravated theft is a candidate for a
Input: Ontology O                                                     decision tree split since aggravated theft is a
Output: in-memory tree model                                          class in the tree (originally specified within
                                                                      the ontology).
 LOOP: Repeat the following steps (for a + 1 of the
current node until a = the lowest leaf in the tree), for          2)   Additional splits at the next level of the tree
   - , until all of O or the entire ontology is finished               are implemented using the same collection of
                                                                       leaves (i.e., the lowest collection of leaves
    1)    For each category of context from Root we                    for a given branch of the ontology stored in
         represent our Ontology as follows: O =                        the memory tree model) with the next split
         {                . For each branch of the tree                being the node with next lowest impurity
                                                                       using the calculation defined in (1).
            =       and C is the starting node for each
         tree (i.e., the starting node after root is
                                                                  3) Additional splits are induced using the same
         Location, Time, Individuality, Activity, and
                                                                     lowest collection of leaves (for each branch
         Relations for each of the 5 categories of
                                                                     of the tree) until the max number of splits
         context).
                                                                     has been reached either by the following:
    2) For a of           of       add          to c of
              where        is a current node,         is               x A program driven default - This condition
       the child node of        , a is the current depth                 happens when a program driven default
       of , b holds a 1 or 0 (1 if a user has chosen                     number of splits is reached for a given
                                                                         branch of the ontology. For example, if the
       a user preference or 0 if the node has not been
                                                                         program default is set to 5 splits, then the
       selected as a user preference), c holds a list of
       all other user specified preferences and d is a                   decision tree will not split beyond 5 hops
       list of pointers from                to children                  deep for that given branch. This is true
                                                                         regardless of whether the split was based
       nodes
                                                                         on a modeling part of the ontology or a
                                                                         part of The Semantically Driven Gini-
    3) If no children exist for d, d = NULL.                             Index Algorithm used to calculate a split.

END LOOP;                                                              x A user specified limit is reached - This
                                                                         condition happens when the user specifies
                                                                         a max number of splits he or she sets for a
    Semantically Driven Gini-Index Algorithm: Once                       given branch of a decision tree. For
the in-memory tree model building algorithm is                           example, if the max number of splits that
complete, our custom Semantically Driven Gini-Index                      is specified is 7 for a particular branch of
Algorithm is kicked off. First, we define our                            the decision tree then the decision tree will
Semantically Driven Gini-Index Algorithm which                           stop inducing additional splits in that
calculates the impurity of each node as follows:                         branch of the decision tree beyond that
                                                                         number regardless of whether or not the
                                                                         nodes being split come from the ontology


                                            STIDS 2016 Proceedings Page 5
             or additional splits are determined by The
             Semantically        Driven       Gini-Index         2) Ontologies but No Context - Experiments with
             Algorithm.                                             files that were tagged using ontologies in
                                                                    RDF/OWL Files saved in the in memory model,
         x The max number of possible splits has                    but without any user preferences saved within
           been reached - This condition occurs when                them (copying the entire ontology for each
           all possible splits from within an ontology              category of context). The tagged files were then
           as well as all of the lowest level of leaves             processed in a MapReduce Job producing the
           have been utilized in a split resulting in a             results.
           max number of splits that can be used to
           build a given decision tree. For example, if          3) Ontologies and Context - These experiments
           the ontology is 4 hops deep and the lowest               were executed using the files that were tagged
           level of leaves totals 4 leaves as well, this            using both the ontologies modeled in RDF/OWL
           makes the max number of splits possible                  Files, but also with the saved user preferences
           for the decision tree to be 8 split.                     that were parsed from the RDF/OWL Files as
                                                                    well. The tagged files were then processed using
                 III. EXPERIMENTS                                   MapReduce producing the results.
     To validate our approach, we took roughly
100,000 files from the UCR Data Repository and
                                                                                    IV. VALIDATION
processed them against 60 user specified preferences
stored within the in memory tree model. The                           We use two well-known metrics to validate our
semantically mapped features were saved as tags in a             approach: Precision and Recall. Recall is defined as
modified version of each file, making matching for               the fraction of the records retrieved that are relevant
each set of user preferences a matter of matching the            to the query. In other words, recall reveals the
tags that have been specified by the user saved in the           percentage of the retrieved and relevant records that
in memory tree model. Finally, the resulting                     are relevant (whether in the answer set or outside the
MapReduce Jobs generated a file name with a value                answer set). Precision is defined as the fraction of the
of 1 for each set of semantically mapped preferences             retrieved records that are relevant to the search. In
that were matched. We used the following sample                  other words, precision reveals the percentage of
Scenario 1: Bob is looking for all crimes that occurred          retrieved and relevant records in the answer set.
within the state of Maryland, between 1998 and 2002,             Recall and precision are calculated as follows:
he is searching for crime totals for larceny theft in
which the files are saved as .xls files. Scenarios 2
through 10 are variants of scenario 1, where
semantically mapped preferences were matched                                                                                (2)
against the same files from the UCR Data Repository
Values for scenarios 2-10 were chosen at random.

We have run three sets of experiments:                                                                                      (3)
1) No Context and No Ontologies - Experiments
   with user preferences as exact search terms on
   MapReduce Jobs.


         1
       0.8
       0.6
       0.4                                                                          No Ontologies and No Context
       0.2
         0                                                                          Ontologies and No Context
                                                                                    Ontologies and Context


                                        Fig. 3.   Experiment results using recall


                                           STIDS 2016 Proceedings Page 6
              1
            0.8
            0.6
            0.4                                                                         No Ontologies or Context
            0.2                                                                         Ontologies and No Context
              0
                                                                                        Ontologies and Context


                                           Fig. 4. Experiment results using precision


     In the above scenarios or sets of test conditions that      matching terms in the in memory tree that were parsed from
simulated using sets of user specified preferences (like in      ontology and stored in the in memory tree model without any
Scenario 1 discussed above), the recall was computed for no      user specified preferences and matching them against the
ontologies and no context by dividing matching exacting          properties and content of the files that are being parsed.
search terms from a user’s search against the content of the     Ontologies and no context resulted in a low precision as well
files that are being parsed. The recall was extremely low for    because the entire ontology was stored in each in memory tree
no ontologies and no context because the exact key words         model resulting in tagged files that only partially or did not
used in MapReduce Jobs matched a very small percentage of        match a user's intended search terms at 26% precision. The
the files that were relevant to a user's search. The mean        precision for ontologies and context was computed by taking
average of all 10 scenarios was 28%. The recall for ontologies   any of the matching preferences stored in the in memory tree
and no context was computed by taking the matching terms in      model that parsed from the ontology and selected by the user
the in memory tree model that were parsed from the ontology      from the ontology and matching them against properties and
and matching them against properties and content in the files    content and of the files being parsed. Ontologies and context
being parsed. The searches for ontologies and no context         resulted in a very high precision versus no ontologies and no
yielded a higher recall because all the files that were tagged   context and ontologies and no context, because documents
using the in memory tree model matched 1 or more of the          that returned key value pairs in our MapReduce Jobs matched
classes specified in the ontology with a total of 47% recall,    the semantically mapped user preferences as well the semantic
resulting in almost a 20% increase from no ontologies and no     constraints specified in the ontology resulting in an 80%
context. The recall for ontologies and context was computed      precision; a 54% increase in ontologies and no context and a
by taking the matching preferences stored in the in memory       63% increase in accuracy vs. no ontologies and no context.
tree model and matching them against properties and content
of the files being parsed. Searches involving ontologies and          In summary we learned that enabling a user to pick
context had a very high recall because all results retrieved     semantically enriched preferences from an ontology of terms
matched both the semantically driven preferences that were       that reflect an existing domain from a corpus can lead to a
saved in the in memory tree model as well as with the            much higher precision and recall than using exact search terms
semantically specified constraints from the ontology resulting   or just using semantically enriched search terms parsed from
in 63% recall, a 16% increase from ontologies and no context     an ontology. By building an in memory tree model we are able
and a roughly 35% increase improvement in recall vs. no          to both represent the knowledge representative of a domain
ontologies and no context.                                       and we also enable personalization by a user of that
                                                                 knowledge as well. In order to enable robust personalization,
     Precision was computed for no ontologies and no context     semantics must first be reflected in a model before a user can
by dividing matching exacting search terms from a user’s         pick them. This is depicted in our results by showing that the
search against the content of the files that are being parsed.   precision and recall is higher than with just choosing exact
The precision for no ontologies and no context was extremely     search terms and the precision and recall further improves
low because the overwhelming majority of the search results      when allowing a user to pick attributes he or she wishes to use
that were returned did not match the intended user search        in a search with terms picked from the ontology.
preferences because exact key words were used for each
search resulting in a 17% precision. The precision for
ontologies and no context was computed by taking any of the


                                               STIDS 2016 Proceedings Page 7
                   V. RELATED WORK                                                                References
     Zhang et al. [2] formulated an approach that calculated       [1] A. Zimmermann, A. Lorenz and R. Oppermann, "An Operational
                                                                       Definition of Context," 2007.
the information gain for finding the best split for a node
between two or more separate taxonomies. This approach does        J.
                                                                   [2]ZhJ Zhang, A. Silvescu and V. Honava, "Ontology-Driven Induction of
not incorporate any kind of semantic inference or contextually     Decision Trees at Multiple Levels of Abstraction," Berlin Heidelberg, 2002.
driven attributes for building a decision tree, neither does it
address any issues with trying to make a decision tree             [3] B. Gajderowicz, M. Soutchanski and A. Sadeghian, "Trees, Ontology
parallelizable. Gajderowicz et al. formulated an approach for          Enhancement through Inductive Decision Trees," in Uncertainty
enriching manually created ontologies using decision trees [3].        Reasoning for the Semantic Web II, Springer Berlin Heidelberg, 2013.
They also developed a system for using decision trees for
ontology matching [4]. Johnson et al., formulated an approach      [4] B. Gajderowicz, "Using Decision Trees for Inductively Driven Semantic
                                                                       Integration and Decision Matching," Ryerson University, Toronto, 2011.
for enriching ontologies off of custom built decision trees [5].
Bouza et al. described an approach by using an ontology to         [5] I. Johnson, J. Abécassis, B. Charnomordic, S. Destercke and R.
build user profiles to make various recommendations on user            Thomopoulos, "Making Ontology-Based Knowledge and Decision Trees
behavior [6]. Our approach not only allows a user to specify           Interact: An Approach to Enrich Knowledge and Increase Expert
preferences, but also ensures that they are semantically similar       Confidence in Data-Driven Models," in Knowledge Science, Engineering
                                                                       and Management, Springer Berlin Heidelberg, 2010.
to any search results; it is parallelizable too. Fanizzi et al.
developed a novel framework for learning custom description
                                                                   [6] A. Bouza, G. Reif, A. Bernstein and H. Gall, "SemTree: Ontology-Based
logic learning languages using decision trees. While this              Decision Tree Algorithm for Recommender Systems," Karlshruhe, 2008.
approach is novel for learning description logic concept
definitions, it does not incorporate user preferences [7].         [7] N. Fanizzi, C. d’Amato and F. Esposito, "Induction of Concepts in Web
                                                                       Ontologies through Terminological Decision Trees," Barcelona, 2010.
     Nenkova et al. developed a technique for summarizing
documents based on frequency [8]. Arun and Gunavathi               [8] A. Nenkova, L. Vanderwende and K. R. McKeown, "A Compositional
developed a technique for summarizing documents using                  Context Sensitive Multi-document Summarizer: Exploring the Factors
                                                                       That Influence Summarization," Seattle, Washington, 2006.
context sensitive weights for indexing [9]. This work did not
utilize the contextual properties of parts of a document nor
                                                                   [9] J. M.E and C. Gunavathi, "Document Summarization and Classification
were user preferences utilized when creating the summaries.            sing Concept and Context Similarity Analysis," 2014.
Witte et al. developed a fuzzy graph technique for multi-
document summarization [10]. Barzilay et al. developed a           [10] R. Witter, R. Krestel and S. Bergler, "Context-based Multi-Document
technique for summarizing documents using the contextual                Summarization Using Fuzzy Coreference Cluster Graphs".
attributes found in text across a series of documents [11].
Yang et al. developed a summarizing framework using the            [11] R. Barzilay, K. R. McKeown and M. Elhadad, "Information Fusion in
social contextual information, but they did not utilize user            the Context of Multi-Document Summarization".
specified preferences beyond social ones such as a time or
location that a document was created [12].                         [12] Z. Yang, C. Keke, J. Tang, Z. Li, S. Zhong, L. Jaunzi and Y. Zi, "Social
                                                                        Context Summarization," Beijing, China, 2011.


                    VI. CONCLUSIONS                                [13] Apache Spark, "Apache Spark," [Online]. Available:
                                                                        http://spark.apache.org. [Accessed 10 06 2016].
     We described and validated an approach to specify
semantically driven user preferences in a parallelizable way.      [14] T. Boujari, "Instance-Based Ontology Alignment Using Decision Trees,"
We also encountered a few limitations. First, we found that             Institutionen för Datavetenskap, 2012.
extensive exploration and sampling of data files was needed to
be able to properly model preferences in our ontology to           [15] H. F. Witschel, "Using Decision Trees and Text Mining Techniques for
confirm a consistent structure of a file format when creating           Extending Taxonomies," in In Proceedings of Learning and Extending
                                                                        Lexical Ontologies by Using Machine Learning Methods, 2005.
our in memory tree model. Second, we found that the user
preferences that were specified in the in memory model             [16] W. J. Wei Dai, "A MapReduce Implementation of C4.5 Decision Tree
needed to closely mirror the user preferences that were                 Algorithm," International Journal of Database Theory and Application,
specified in the ontology or this would lead to searches                pp. 49-60, 2014.
returning incorrect results or errors resulting in our program
because the structure of RDF/OWL Model was incorrect. For          [17] R. Mirambicka, A. R. Sulthana and G. Vadivu, Decision Tree Applied to
our future work we plan on testing much larger datasets. Also           Learning Relations Between Ontologies.
planned are further attempts to model more expressive
attributes for user specified preferences. Finally, we plan on     [18] D. Jeon and W. Kim, "Development of Semantic Decision Tree," in Data
                                                                        Mining and Intelligent Information Technology Applications (ICMiA),
utilizing Apache Spark [13] and new versions of Hadoop to               2011 3rd International Conference on, 2011.
allow for both a more novel design and implementations of
our approach.


                                                 STIDS 2016 Proceedings Page 8
[19] M. Patil, S. Khomane, V. Saykar and K. Moholkar, "Web People Search
     Using Ontology Based Decision Trees," International Journal of Data
     Mining & Knowledge Management Process, 2012.

[20] B. Panda, J. S. Herbach, S. Basu and R. J. Bayardo, "PLANET:
     Massively Parallel Learning of Tree Ensembles," Google, 2009.

[21] J. Han, Y. Liu and X. Sun, "A Scalable Random Forest Algorithm Based
     on MapReduce," in Software Engineering and Service Science (ICSESS),
     2013 4th IEEE International Conference on, Bejing, China, 2013.

[22] W. Yin, V. Simmhan and V. K. Prasanna, "Scalable Regression Tree
     Learning on Hadoop Using OpenPlanet," in MapReduce '12 Proceedings
     of third international workshop on MapReduce and its Applications,
     New York, New York, 2012.

[23] S. d. Río, V. López, J. M. Benítez and F. Herrera, "On The Use of
     MapReduce For Imbalanced Big Data Using Random Forest,"
     Information Sciences, pp. 112-137, 2014.

[24] S. Tyree, K. Q. Weinberger and K. Agrawal, "Parallel Boosted
     Regression Trees for Web Search Ranking," in WWW 2011 – Session:
     Ranking, NY, NY, 2011.

[25] X. Zhanga, C. Liua, S. Nepalb, C. Yanga, W. Dou and J. Chen, "A
     Hybrid Approach For Scalable Sub-Tree Anonymization Over Big Data
     Using MapReduce On Cloud," Journal of Computer and System
     Sciences, pp. 1080-1020, 2014.

[26] J. Ye, J.-H. Chow, J. Chen and Z. Zheng, "Stochastic Gradient Boosted
     Distributed Decision Trees," in CIKM '09 Proceedings of the 18th ACM
     Conference on Information and Knowledge Management, New York,
     New York, 2009.

[27] A. Ghoting, P. Kambadur, E. Pednault and R. Kannan, "NIMBLE: a
     Toolkit For The Implementation of Parallel Data Mining and Machine,"
     in KDD '11 Proceedings of the 17th ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining, New York, New
     York, 2011.

[28] G.-Q. Wu, H.-G. Li, X.-G. Hu, Y.-J. Bi, J. Zhang and W. Xindong,
     "MReC4.5: C4.5 Ensemble Classification with MapReduce," in 2009
     Fourth ChinaGrid Annual Conference, Yantai, Shandong, 2009.


                                                           STIDS 2016 Proceedings Page 9

</pre>