Feature Generation using Ontologies during Induction of
            Decision Trees on Linked Data

                                       Yordan Terziev

                       University of Duisburg-Essen, Essen, Germany
                       yordan.terziev@paluno.uni-due.de



       Abstract. Linked data has the potential of interconnecting data from different
       domains, opening new possibilities for machine agents to provide better services
       to web users. The ever-increasing amount of linked data in government open data,
       social linked data, and linked medical and patient data provides new opportunities
       for data mining and machine learning. Both are, however, strongly dependent on
       the selection of high-quality data features to achieve good results. In this work
       we present an approach that uses ontological knowledge to generate features that
       are suitable for building a decision tree classifier for a specific data set and
       classification problem. The approach we present has two main characteristics: it
       generates new features on demand as required by the induction algorithm, and it
       uses ontological knowledge about the linked data to restrict the set of possible
       options. These two characteristics enable the induction algorithm to look for
       features that might be connected through many entities in the linked data,
       enabling the generation of cross-domain explanation models.


1      Introduction

Humans can apply knowledge from one situation to another in which concepts with
similar properties occur. For example, the similarity of a cold and the flu as diseases is
known to humans, and when a decision about the treatment of these diseases is made,
they know that knowledge about one can be reused for the other, because both have
similar causes and a treatment that mainly mitigates symptoms such as pain or fever.
   Machine learning (ML) algorithms for classification, on the other hand, do not
automatically explore possible relationships between the properties of different
instances to build better models. While it is possible to identify the correlation of
particular data features with the class prior to the execution of the ML algorithm, the
algorithms do not attempt to generate better features during model induction.
   For example, if the training data of an ML algorithm contains the data of two patients
who were treated with paracetamol, one because he had the flu and the other because
he had a cold, the relationship between the common symptoms and the treatment will
remain undiscovered in the induced model. One reason for this is that ML techniques
are typically applied to data where the attributes are preselected and their values are
considered simple types (e.g., the attribute disease and its values cold and flu).
    The training data for ML algorithms is typically prepared manually by the user in
three steps: data selection, data preprocessing, and data transformation. In the first two
steps the data instances for learning are selected, cleaned, and converted into the format
required by the ML algorithm. In the last step the features (data attributes) used by the
ML algorithm are selected and/or generated through scaling, decomposition, or
aggregation of existing features. Returning to our example, a user of an ML algorithm
might expand the feature disease in the data set with the associated feature symptoms
in order to achieve a better prediction model and capture the “hidden” causality of
paracetamol treating symptoms rather than diseases.
    The relationships between features and their values, such as between cold and flu
and their symptoms, are however frequently available in ontologies. This leads to the
central idea of this work: to use existing ontological relationships between concepts
to generate new features that improve the quality of the induced ML model.
    This work is structured as follows: in the next section a formal representation of
the problem is introduced, followed by an overview of the approach in section 3.
Afterwards, in section 4, we discuss the relevance of the problem and what benefits
the approach would bring. Section 5 presents the related work and differentiates the
problem and our solution from the existing ones. In section 6 we present the research
questions we plan to address and the hypotheses we have. In section 7 the evaluation
plan is presented, followed by reflections in section 8 that conclude this work.


2      Problem Statement

To investigate the problem of using external ontological knowledge about the attributes
and their values, we consider a classical supervised learning problem where we have a
training set 𝑆 of 𝑁 training examples of the form {(𝑥1, 𝑦1), … , (𝑥𝑁, 𝑦𝑁)}, such that 𝑥𝑖 is
the feature vector of the i-th example, described by a set of discrete features 𝑋, and
𝑦𝑖 is its discrete class label (i.e., its class).
   In classical supervised learning the feature vectors 𝑥𝑖 are of the form [𝑓1, 𝑓2, … , 𝑓𝑛],
where the features are either binary (i.e., 𝑓𝑖 ∈ {𝑡𝑟𝑢𝑒, 𝑓𝑎𝑙𝑠𝑒}), numerical (i.e., 𝑓𝑖 ∈ ℝ),
or nominal (i.e., 𝑓𝑖 ∈ 𝑆𝑌𝑀, where 𝑆𝑌𝑀 is a finite set of symbols). The goal is to
produce a model y = f(x) from the 𝑁 training examples, so that it predicts the classes
y of future unknown examples x with high accuracy. The function f(x) may also
contain logical operations between the attributes. For the rest of this work we use the
notation 𝑣𝑖𝑗 to represent the value of feature 𝑓𝑗 of the i-th instance.
   In this work, we consider a scenario where 𝑓𝑗 is a concept in the ontology and its
feature values are vertices in a linked data graph (e.g., the SWRC ontology [1] and the
AIFB dataset [2]). A linked data graph is a directed multigraph G = (𝑉, 𝐸) built from a
set of triples of the form (s, p, o), where the subject 𝑠 ∈ 𝑉 is an entity, the predicate 𝑝
denotes a property, and the object 𝑜 ∈ 𝑉 is either another entity or a literal. Each edge
𝑒 ∈ 𝐸 in the graph is defined by such a triple (s, p, o).
    For example, the concept Employee might have the property birth date, which is
connected to a literal of type Date, but it might also have a relationship to another
concept Project, which connects the employee to the project he/she is working on.
The ML task might be to train a classifier on existing linked data in order to find a
suitable affiliation for new employees. The linked data and ontology graphs provide
useful information for learning the model. For example, in the affiliation prediction
task it might be the conferences at which the employee has published papers.
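
To make the data model concrete, here is a minimal Python sketch (with hypothetical
example URIs, not actual AIFB identifiers) of how such a linked data graph can be held
as a set of triples and indexed by the outgoing edges of an entity; in practice a library
such as rdflib would be used to load the RDF data:

```python
# Minimal sketch: a linked data graph as (subject, predicate, object) triples.
# The URIs below are hypothetical placeholders.
from collections import defaultdict

triples = [
    ("http://example.org/id34",      "rdf:type",            "swrc:Employee"),
    ("http://example.org/id34",      "swrc:worksAtProject", "http://example.org/proj/id31"),
    ("http://example.org/id34",      "swrc:publication",    "http://example.org/pub/id15"),
    ("http://example.org/proj/id31", "rdf:type",            "swrc:Project"),
]

# Index outgoing edges per subject, so the feature values (object vertices)
# reachable from an entity can be looked up quickly.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))

print(outgoing["http://example.org/id34"])
```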
    Finding suitable features for building a good classification model is a difficult
problem, because a good feature might be referenced through many other types of
concepts. For example, for the task of assigning a social network user to a particular
group, a good feature might be the religious view of the frontman of the music group
liked by the user. This information is, however, three hops away from the origin user
vertex in the linked data graph. Furthermore, in other classification problems the
features providing good predictions might be much further away than three hops. Not
knowing the depth to which to explore the RDF graph causes further problems:

• The number of features to consider grows exponentially with every hop away
  from the origin.
• A suitable halting criterion is required so that the approach does not search
  indefinitely throughout the ontology and linked data graph.
• Evaluating features with the entire set 𝑆 might be too computationally expensive.


3      Approach

In this section we present an approach that is capable of generating new features from
linked data and an associated ontology during the induction of a decision tree. In
decision tree induction, a feature 𝑓𝑥 is said to be better than another feature 𝑓𝑦 if
splitting on its values reduces the impurity (entropy) of the dataset more than splitting
on 𝑓𝑦 would. The entropy reduction is known as information gain (IG) and measures,
in bits, the information gained by splitting on a feature. IG is only one possible split
evaluation function; to denote split evaluation functions in general we use G(.). Since
the values of 𝑓𝑥 are vertices in the linked data graph G, our goal is to find related
features that have a higher value of G(.) than the original feature.
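
For illustration, a small sketch of the standard entropy-based information gain
computation for a nominal feature (this is textbook ID3-style IG, not code from the
approach itself):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction G(.) obtained by splitting on a nominal feature."""
    total = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy

# Toy example: splitting on the feature separates the classes partially.
print(information_gain(["flu", "flu", "cold", "cold"], ["A", "A", "A", "B"]))
```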
    To explore related features, we expand the original features (concepts in the
ontology graph) with related concepts connected through outgoing properties and add
these new features to the feature vector (cf. Fig. 1). Doing this repeatedly for all
connected features would eventually lead to an exponentially growing number of
options that would have to be evaluated. This requires heuristics that can restrict the
number of possible features. The solution we propose is to consider only entities that
are semantically related to the origin above some threshold value.
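
A minimal sketch of this one-hop expansion step is given below; the relatedness
function and the threshold are placeholders for an actual semantic relatedness measure
such as the one of Mazuel et al. [3]:

```python
def expand_one_hop(entity, outgoing, relatedness, threshold):
    """Return the (property, target) pairs reachable in one hop from `entity`
    whose targets are sufficiently semantically related to the origin."""
    return [(p, o) for p, o in outgoing.get(entity, [])
            if relatedness(entity, o) >= threshold]

# Toy usage with a constant relatedness function standing in for a real measure.
toy_graph = {"ex:id34": [("swrc:worksAtProject", "ex:proj31"),
                         ("swrc:publication", "ex:pub15")]}
print(expand_one_hop("ex:id34", toy_graph,
                     relatedness=lambda a, b: 0.8, threshold=0.5))
```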
    Semantic relatedness calculation in ontologies has not been widely studied and
only a few works exist. Mazuel et al. [3] calculate the semantic relatedness between
pairs of ontological concepts only over semantically correct paths as defined in [4].
Therefore, in our work we expand features from the origin feature in a breadth-first
search (BFS) manner, considering the rules for semantically correct paths defined by
Hirst & St-Onge [4]. The authors associate one of the directions Upward (U),
Downward (D), and Horizontal (H) with each property type and give three rules that
define a semantically correct path in terms of these directions. Finally, Hirst & St-Onge
enumerate eight patterns of semantically correct paths which match their three rules:
{U, UD, UH, UHD, D, DH, HD, H}. Only concepts on outgoing paths from the origin
entity conforming to these patterns are considered as possible features in the further
process.
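
The sketch below illustrates one way to check a candidate path against these eight
patterns. It assumes that every property on the path has already been mapped to one of
the directions U, D, or H (this mapping is ontology-specific and not shown) and that
each letter of a pattern may be matched by a run of consecutive links in that direction;
both assumptions go beyond what is stated above:

```python
import re

# The eight semantically correct patterns of Hirst & St-Onge [4].
ALLOWED = {"U", "UD", "UH", "UHD", "D", "DH", "HD", "H"}

def is_semantically_correct(directions):
    """directions: e.g. ['U', 'U', 'H', 'D'] for a 4-hop path."""
    # Collapse runs of the same direction, assuming a pattern letter may be
    # matched by several consecutive links in that direction.
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(directions))
    return collapsed in ALLOWED

print(is_semantically_correct(["U", "U", "H", "D"]))  # True  (pattern UHD)
print(is_semantically_correct(["D", "U"]))            # False (DU is not allowed)
```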
[Figure 1 is omitted in this text rendering; it shows the feature generation over the
SWRC ontology: the original feature vector (Employee, Affiliation as class) is
expanded in Hop 1 with the features WorksAtProject and Publication and in Hop 2
with features such as PRJ_FinancedBy and Pub_PublishedBy, next to the relevant
excerpt of the SWRC ontology (Thing, Person, Organisation, Project, Publication,
Employee and the worksAtProject, publication, and affiliation properties).]

                     Fig. 1. Feature Generation using SWRC Ontology

   Another problem we are addressing is the selection of a suitably sized subset of
instances, as searching with the entire dataset might be infeasible if the dataset is too
large. To address this issue we propose the application of a statistical technique, the
Hoeffding bound [5], used in a streaming algorithm for decision tree induction [6]. As
extending each value of feature 𝑓𝑥 and calculating the resulting split value with the
function G(.) for each related feature would be too expensive, we propose using the
Hoeffding bound to determine how many instances are enough to decide whether an
entity should be expanded (i.e., expanded to connected features) or not.
   To present the Hoeffding bound, consider a real-valued random variable r whose
range is R. Suppose we have made 𝑛 independent observations of this variable, and
computed their mean 𝑟̅. The Hoeffding bound states that, with probability 1 − δ, the
true mean of the variable is at least 𝑟̅ − ε, where

                              ε = √( R² ln(1/δ) / (2n) )                        (1)
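
As a small sketch, Equation (1) translates directly into code:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """ε such that, with probability 1 - delta, the true mean of a random
    variable with range `value_range` lies within ε of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain is measured in bits, so for a two-class problem
# the range R is 1 bit.
print(hoeffding_bound(value_range=1.0, delta=0.01, n=500))  # ~0.068
```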

   In our work, similarly to [6], we use the Hoeffding bound to find the lowest possible
value of G(.) given the considered number of instances and probability 1 − δ. While
the Hoeffding bound in [6] is used to select between two possible features, in our work
we use it to decide whether the algorithm should expand the feature search one hop
further away from the origin in the ontology graph. This decision is based on two
further parameters specified by the user: ER, the expected entropy reduction expressed
in percent, and DF(.), a decreasing function calculated from ER and the number of
hops from the origin (NH). The decision to expand the search one hop further is made
if there are no features at the current depth that fulfil the expectations set by DF. The
DF function is decreasing so that the expectations on new features are lowered with an
increasing number of hops and the algorithm terminates. Setting a high ER expectation
and a slowly decreasing DF function would eventually lead to much deeper exploration
of the graph.
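
The concrete form of DF is left open here; purely as an illustrative assumption, an
exponential decay of the ER expectation with the number of hops shows the required
behaviour:

```python
def df(er, nh, decay=0.5):
    """Hypothetical decreasing function: the expected entropy reduction ER is
    relaxed by the factor `decay` per hop, so the requirement on new features
    shrinks as the search moves away from the origin and the expansion
    eventually terminates."""
    return er * (decay ** nh)

for nh in range(4):
    print(nh, df(er=0.3, nh=nh))  # 0.3, 0.15, 0.075, 0.0375
```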
   Before expanding to related concepts, it should, however, be ensured that a good
estimate of G(.) for the currently available features has been calculated. We do that by
checking whether the value of ε is smaller than a user-defined threshold value 𝜎. If this
condition is fulfilled, the induction algorithm has collected enough samples for a good
estimate of G(.), but the best feature at the current hop is simply not suitable for the
classification problem. In that case, we expand all features one hop further. To present
the modifications we have made to the Hoeffding tree algorithm [6] (cf. Table 1), we
further introduce the parameter ES, a dynamically managed expansion schema
constructed from the new attributes added during induction. It is required for sorting
new examples into the tree (expansion of the original feature vector x).

             Table 1. Procedure: SemanticHoeffdingTree (S;X;G;δ; 𝜎;ER;DF)

Let HT be a tree with a single leaf 𝑙1 (the root).
Let 𝑋1 = 𝑋 ∪ {𝑋∅ }
Let 𝐺̅1 (𝑋∅ ) be the 𝐺̅ obtained by predicting the most frequent class in S
For each class 𝑦𝑘
  For each value 𝑥𝑖𝑗 of each attribute 𝑋𝑗 ∈ 𝑿
    Let 𝑛𝑖𝑗𝑘 (𝑙1 ) = 0
For each example (𝑥, 𝑦𝑘 ) 𝑖𝑛 𝑆
  Expand x according to ES
  Sort (𝑥, 𝑦) into a leaf l using HT
  For each 𝑥𝑖𝑗 in x such that 𝑋𝑖 ∈ 𝑋𝑙
    Increment 𝑛𝑖𝑗𝑘 (𝑙).
  Label 𝑙 with the majority class among the examples at l.
  If examples at l not all of the same class, then
    Compute 𝐺̅𝑙 (𝑋𝑗 ) for each attr. 𝑋𝑗 ∈ 𝑿𝒍 − {𝑋∅ } using 𝑛𝑖𝑗𝑘 (𝑙)
    Let 𝑋𝑎 be the attribute with the highest 𝐺̅𝑙
    Compute ε using Equation 1.
    If 𝐺̅𝑙 (𝑋𝑎 ) − 𝐷𝐹(𝐸𝑅, 𝑁𝐻) > ε and 𝑋𝑎 ≠ 𝑋∅ , then
        Replace l by an internal node that splits on 𝑋𝑎 .
        For each branch of the split
          Add a new leaf 𝑙𝑚 , and let 𝑿𝒎 = 𝑿 − {𝑋𝑎 }
          Let 𝐺𝑚  ̅ (𝑋∅ ) be the 𝐺̅ obtained by predicting the
          most frequent class at 𝑙𝑚
          For each class 𝑦𝑘 and each value 𝑥𝑖𝑗 of each
          attribute 𝑋𝑖 ∈ 𝑿𝒎 − {𝑋∅ }
              Let 𝑛𝑖𝑗𝑘 (𝑙𝑚 ) = 0.
    Else If ε < 𝜎, then
        Expand each 𝑋𝑗 ∈ 𝑿𝒍 with sem. related concepts
        Add generated attributes(related concepts) to ES
        Adjust 𝑛𝑖𝑗𝑘 (𝑙) to the newly introduced features
        NH=NH+1
Return HT.
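
To summarise the control flow at a leaf, the following condensed sketch captures the
split-or-expand decision of Table 1; the function and parameter names are illustrative
and not part of an actual implementation:

```python
def split_or_expand(best_gain, best_is_not_null, epsilon, sigma, df_value):
    """Condensed decision logic of Table 1 at a leaf.

    best_gain:        G of the best attribute X_a observed at the leaf
    best_is_not_null: True if X_a is not the "no split" attribute X_null
    epsilon:          Hoeffding bound computed with Equation (1)
    sigma:            user-defined threshold for "G(.) estimated well enough"
    df_value:         DF(ER, NH) for the current number of hops
    """
    if best_gain - df_value > epsilon and best_is_not_null:
        return "split"    # replace the leaf by an internal node splitting on X_a
    if epsilon < sigma:
        return "expand"   # estimate is reliable but no feature is good enough:
                          # expand all features one hop further (NH = NH + 1)
    return "wait"         # otherwise keep accumulating examples at this leaf

# Example: a reliable estimate but a weak best feature triggers an expansion.
print(split_or_expand(best_gain=0.02, best_is_not_null=True,
                      epsilon=0.05, sigma=0.06, df_value=0.15))  # "expand"
```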
4      Relevancy

The ever-increasing amount of linked data in different domains, ranging from
government open data to social linked data [7] and linked medical and patient data [8],
enables cross-domain interlinking. It provides new opportunities to the ML
community, such as:

• Analysis of Linked Open Data (LOD): interesting here is the analysis of government
  open data as presented in [9], but also medical research, e.g., connecting gene
  research with statistics about patients’ disease progression.
• Social network data has an inherently graph-like structure (such as the Facebook
  Graph API), which can be transformed into linked data using approaches like the
  one presented in [7]. One example would be to find out why a particular group of
  people clicked on a specific advertisement, e.g., users clicking on a holiday ad
  because their friends recently posted or liked pictures from a holiday resort.
• A typical application field of ML algorithms is recommender systems. Bouza et
  al. [10] present an approach that uses the semantic information available for items
  to build an item feature vector and subsequently uses the user’s item ratings to build
  a decision tree model for that user. The model is then capable of assigning a rating
  to items not yet rated by the user. Similarly, our approach can be used to
  dynamically select suitable features for the user’s rating decisions.


5      Related Work

Learning from linked data (RDF data) is typically divided into pre-processing,
instance extraction, optional feature extraction, and the actual learning [11]. In the
pre-processing step, some of the RDF verbosity is reduced; additionally, methods like
RDFS/OWL inferencing can be used to expose the relevant information more
efficiently. In RDF it is typically accepted that an instance is represented by a resource
in the RDF graph [11]; the resource itself, however, does not contain any information
about the instance. The actual description of an instance is given by the neighborhood
around the resource. Therefore, machine learning approaches such as [12, 13] achieve
instance extraction by extracting a full subgraph up to a given depth.
   After the instance extraction, two further options are available: either the learning
algorithms are executed directly on the extracted subgraphs, or feature vectors are
extracted. In the first option, different kernel functions are applied; one representative
is the Weisfeiler-Lehman (WL) graph kernel, which computes the number of subtrees
shared between two graphs by using the WL test of graph isomorphism. While the WL
kernel is designed for all kinds of graphs, Lösch et al. propose in their work [13] graph
kernels specialized to RDF. Their kernels address specifics of RDF graphs, such as
node labels being used as identifiers that occur only once per graph and nodes possibly
having a high degree. This differs from, e.g., chemical compound graphs, which
usually have few node labels that occur frequently in the graph and nodes with a low
degree. However, neither graph kernel approach creates feature vectors, and both work
with a fixed depth of instance extraction, usually 2 hops.
   Another approach is [14], where the authors introduce an expressive graph-based
language for extracting features from linked data and a theoretical framework for
constructing feature vectors from the extracted features. The construction of the
feature vector there is mainly driven by the provided queries (SPARQL/NAGA) and
their results, which are used as the basis for the feature generation.
   Yet another work that uses a technique similar to ours for creating feature vectors
from extracted instances is the one presented by Paulheim [9], where, however, the
main goal is the generation of possible interpretations for statistics using linked open
data. The main difference to our work regarding the feature generation technique is
that the author proposes an approach that first generates all possible properties for all
features and performs feature selection afterwards. Further differences to our work are
that no subset of instances is selected for the feature generation step, as we propose
with the Hoeffding bound, and that all related features are considered.


6      Research Question and Hypotheses

Two main research questions should be answered:

• How can ontological knowledge (especially relationships between concepts) be used
  to automatically generate features for a specific entity and a given classification
  problem using instance extraction with dynamic depth (cf. section 3)?
• How can ontological knowledge and instance extraction with dynamic depth be used
  to automatically improve manually generated/selected features in training data so
  that the accuracy of the induced ML model on unknown examples improves?

The main hypothesis is that enabling machine learning algorithms to dynamically
expand the set of possible features through the linked data graph would improve the
classification accuracy of the produced classifier on unknown instances in comparison
to models induced with a static depth of instance extraction.


7      Evaluation Plan

To validate our hypothesis, we envision a twofold evaluation: first, we would compare
our approach to the induction of a decision tree using the unmodified Hoeffding tree
algorithm [6] with a fixed depth of instance extraction (2 hops), as suggested in [12,
13]. We do this evaluation to minimize the effect of the underlying ML induction
algorithm and to measure only the effect of the dynamic expansion depth.
    Second, our goal is to compare the classification performance of the models
constructed with the approach presented here to the state-of-the-art approaches for
learning on RDF data. The works of Lösch et al. [13] and de Vries [12] are envisioned
as baselines. In particular, we compare our approach to both works because they
restrict the depth of the instance extraction to 2 hops.
   We plan to execute the evaluation on the entity classification and link prediction
learning tasks presented in [13]. The algorithms would be trained and tested on the two
datasets used in [13] and [12], namely the AIFB dataset [2] (SWRC ontology), where
the main task is classifying the affiliation of a staff member with a research group, and
the dataset from LiveJournal.com (Friend of a Friend ontology), where the main task
is to classify persons into one of four age classes.


8      Reflections

   In this work only the general idea of dynamic feature generation during the
induction of a decision tree is presented, and some specifics have not been addressed,
such as how referenced lists are handled (e.g., multiple published papers), how
semantic similarity can be used in this context, or how cycles in the linked data graph
are managed. A further research question that is within the research scope, but not
directly addressed in this work, is how multiple origins would be handled and how this
would impact feature selection. All of these are subjects of future work.
   While a similar technique of expanding linked data values has been explored in [9]
with the goal of generating hypotheses, in the current work we suggest a decision tree
induction algorithm that performs an on-demand expansion of linked data features,
considering only semantically related concepts. We expect that, through these
characteristics of the induction algorithm, the graph can be explored in greater depth,
thus finding cross-domain explanations that would eventually be a better representation
for the classification problem.
   Another difference of our technique from the one presented in [9] is that we explore
new features with a subset of the instances, using the Hoeffding bound [5] to determine
how many instances are enough to evaluate whether deeper exploration is required or
whether the currently selected features are good enough. Thus, we envision that the
approach is capable of exploring more possible features with a less expensive
calculation of split evaluation functions.


Acknowledgments

I would like to thank my supervisors Prof. Dr. Volker Gruhn and Dr. Tobias
Brückmann for their support and the opportunity for the realization of this work.


References

1. Sure-Vetter, Y., Bloehdorn, S., Haase, P., Hartmann, J., Oberle, D.: The SWRC
   Ontology - Semantic Web for Research Communities. In: Bento, C., Cardoso, A.,
   Dias, G. (eds.) Proceedings of the 12th Portuguese Conference on Artificial
   Intelligence - Progress in Artificial Intelligence (EPIA 2005), 3803, pp. 218–231.
   Springer, Covilha, Portugal (2005)
2. Bloehdorn, S., Sure, Y.: Kernel methods for mining instance data in ontologies.
    Springer (2007)
3. Mazuel, L., Sabouret, N.: Semantic Relatedness Measure Using Object Properties
    in an Ontology. In: Proceedings of the 7th International Conference on The Se-
    mantic Web, pp. 681–694. Springer-Verlag, Berlin, Heidelberg (2008)
4. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detec-
    tion and correction of malapropisms. WordNet: An electronic lexical database
    305, 305–332 (1998)
5. Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables.
    Journal of the American Statistical Association 58, 13–30 (1963)
6. Domingos, P., Hulten, G.: Mining High-speed Data Streams. In: Proceedings of
    the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
    Data Mining, pp. 71–80. ACM, New York, NY, USA (2000)
7. Weaver, J., Tarjan, P.: Facebook linked data via the graph API. Semantic Web 4,
    245–250 (2013)
8. Pathak, J., Kiefer, R.C., Chute, C.G.: Applying Linked Data Principles to Repre-
    sent Patient’s Electronic Health Records at Mayo Clinic: A Case Report. In: Pro-
    ceedings of the 2Nd ACM SIGHIT International Health Informatics Symposium,
    pp. 455–464. ACM, New York, NY, USA (2012)
9. Paulheim, H.: Generating Possible Interpretations for Statistics from Linked Open
    Data. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.)
    The Semantic Web: Research and Applications, 7295, pp. 560–574. Springer Ber-
    lin Heidelberg (2012)
10. Bouza, A., Reif, G., Bernstein, A., Gall, H.: SemTree: Ontology-Based Decision
    Tree Algorithm for Recommender Systems
11. Bloem, P., de Vries, G.: Machine Learning on Linked Data, a Position Paper. In:
    Workshop co-located with the European Conference on Machine Learning and
    Principles and Practice of Knowledge Discovery in Databases (ECML PKDD
    2014), Nancy, France, September 19, 2014
12. Vries, G. de: A Fast Approximation of the Weisfeiler-Lehman Graph Kernel for
    RDF Data. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds.) Machine
    Learning and Knowledge Discovery in Databases, 8188, pp. 606–621. Springer
    Berlin Heidelberg (2013)
13. Lösch, U., Bloehdorn, S., Rettinger, A.: Graph Kernels for RDF Data. In: Sim-
    perl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) The Semantic
    Web: Research and Applications, 7295, pp. 134–148. Springer Berlin Heidelberg
    (2012)
14. Cheng, W., Kasneci, G., Graepel, T., Stern, D., Herbrich, R.: Automated feature
    generation from structured knowledge. In: Proceedings of the 20th ACM interna-
    tional conference on Information and knowledge management, pp. 1395–1404
    (2011)