A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data

Petar Ristoski and Heiko Paulheim
University of Mannheim, Germany
Research Group Data and Web Science
{petar.ristoski,heiko}@informatik.uni-mannheim.de

Abstract. Linked Open Data has been recognized as a valuable source for background information in data mining. However, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. In this paper, we compare different strategies for creating propositional features from Linked Open Data (a process called propositionalization), and present experiments on different tasks, i.e., classification, regression, and outlier detection. We show that the choice of the strategy can have a strong influence on the results.

Keywords: Linked Open Data, Data Mining, Propositionalization, Feature Generation

1 Introduction

Linked Open Data [1] has been recognized as a valuable source of background knowledge in many data mining tasks. Augmenting a dataset with features taken from Linked Open Data can, in many cases, improve the results of a data mining problem at hand, while externalizing the cost of maintaining that background knowledge [18].

Most data mining algorithms work with a propositional feature vector representation of the data, i.e., each instance is represented as a vector of features ⟨f1, f2, ..., fn⟩, where the features are either binary (i.e., fi ∈ {true, false}), numerical (i.e., fi ∈ R), or nominal (i.e., fi ∈ S, where S is a finite set of symbols). Linked Open Data, however, comes in the form of graphs, connecting resources with types and relations, backed by a schema or ontology.
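To make this representation gap concrete, the following minimal sketch (all resource and predicate names are illustrative, loosely modeled on the DBpedia examples used later in the paper) turns a small set of RDF-style triples into the binary feature vectors that propositional tools expect:

```python
# Minimal sketch: a small RDF-style graph as (subject, predicate, object)
# triples, and its binary propositional representation -- one row per
# subject, one boolean column per (predicate, object) pair.
triples = [
    ("TrentReznor", "rdf:type", "MusicArtist"),
    ("TrentReznor", "genre", "IndustrialRock"),
    ("TrentReznor", "instrument", "Piano"),
    ("DavidBowie", "rdf:type", "MusicArtist"),
    ("DavidBowie", "genre", "ArtRock"),
]

def to_binary_vectors(triples):
    """One row per subject, one boolean column per (predicate, object) pair."""
    instances = sorted({s for s, _, _ in triples})
    features = sorted({(p, o) for _, p, o in triples})
    table = {s: {f: False for f in features} for s in instances}
    for s, p, o in triples:
        table[s][(p, o)] = True
    return instances, features, table

instances, features, table = to_binary_vectors(triples)
# Each row of `table` is a propositional feature vector; here, the two
# instances share only the ("rdf:type", "MusicArtist") column.
```

The count and TF-IDF variants discussed in Section 3 replace the boolean cells with numerical weights over the same columns.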
Thus, for accessing Linked Open Data with existing data mining tools, transformations have to be performed which create propositional features from the graphs in Linked Open Data, a process called propositionalization [11]. Usually, binary features (e.g., true if a type or relation exists, false otherwise) or numerical features (e.g., counting the number of relations of a certain type) are used [21]. Other variants, e.g., computing the fraction of relations of a certain type, are possible, but rarely used.

Our hypothesis in this paper is that the strategy of creating propositional features from Linked Open Data may have an influence on the data mining result. For example, proximity-based algorithms like k-NN will behave differently depending on the strategy used to create numerical features, as that strategy has a direct influence on most distance functions.

In this paper, we compare a set of different strategies for creating features from types and relations in Linked Open Data. We compare those strategies on a number of different datasets and across different tasks, i.e., classification, regression, and outlier detection.

The rest of this paper is structured as follows. Section 2 gives a brief overview of related work. In Section 3, we discuss a number of strategies used for the generation of propositional features. Section 4 introduces the datasets and tasks used for evaluation, and provides a discussion of results. We conclude with a review of our findings, and an outlook on future work.

2 Related Work

In the recent past, a few approaches for propositionalizing Linked Open Data for data mining purposes have been proposed. Many of those approaches are supervised, i.e., they let the user formulate SPARQL queries, which means that they leave the propositionalization strategy up to the user, and a fully automatic feature generation is not possible.
Usually, the resulting features are binary, or numerical aggregates using SPARQL COUNT constructs [2, 8, 9, 16, 10]. In [21], we have proposed an unsupervised approach allowing for both binary features and numerical aggregates.

A similar problem is handled by kernel functions, which compute the distance between two data instances. They are used in kernel-based data mining and machine learning algorithms, most commonly support vector machines (SVMs), but can also be exploited for tasks such as clustering. Several kernel functions suitable for Linked Open Data have been proposed [3, 7, 14]. While kernel functions can be designed in a flexible manner, and support vector machines often perform quite well on classification and regression tasks, they cannot be combined with arbitrary machine learning methods, e.g., decision tree learning.

3 Strategies

When creating features for a resource, we take into account the relation to other resources. We distinguish strategies that use the object of specific relations, and strategies that only take into account the presence of relations as such.

3.1 Strategies for Features Derived from Specific Relations

Some relations in Linked Open Data sources play a specific role. One example is rdf:type relations assigning a direct type to a resource. A statement r rdf:type C is typically translated into description logics as C(r), i.e., rdf:type is treated differently from any other predicate.

[Fig. 1 (figure not reproduced): Example DBpedia resource (dbpedia:Trent Reznor) and an excerpt of its types (dbpedia:MusicArtist, with rdfs:subClassOf path via dbpedia:Artist and dbpedia:Person to owl:Thing) and relations (dbpedia-owl:genre, dbpedia-owl:instrument, and others)]
For some datasets, similar relations exist, e.g., the dcterms:subject relations in DBpedia [13], which contain a link to the category of the original Wikipedia article a DBpedia resource is derived from. For such relations, we propose three strategies:

– Creating a binary feature indicating presence or absence of the relation's object.
– Creating a relative count feature indicating the relative count of the relation's object. For a resource that has a relation to n objects, each feature value is 1/n.
– Creating a TF-IDF feature, whose value is (1/n) · log(N / |{r | C(r)}|), where N is the total number of resources in the dataset, and |{r | C(r)}| denotes the number of resources that have the respective relation to C.

The rationale for using relative counts is that if there are only a few relations of a particular kind, each individual related object may be more important. For example, for a general book which has a hundred topics, each of those topics is less characteristic for the book than each topic of a specific book with only a few topics. Thus, that strategy takes into account both the existence and the importance of a certain relation. The rationale for using TF-IDF is to further reduce the influence of too general features, in particular when using a distance-based mining algorithm.

Table 1 shows the features generated for the example depicted in Fig. 1. It can be observed that using TF-IDF implicitly gives a higher weight to more specific features, which can be important in distance-based mining algorithms (i.e., it increases the similarity of two objects more if they share a more specific type than a more abstract one).

Table 1: Features for rdf:type and relations as such, generated for the example shown in Fig. 1. For TF-IDF, we assume that there are 1,000 instances in the dataset, all of which are persons, 500 of which are artists, and 100 of which are music artists with genres and instruments.
                Specific relation: rdf:type                 Relations as such
Strategy        MusicArtist  Artist  Person  Agent  Thing   genre  instrument
Binary          true         true    true    true   true    true   true
Count           –            –       –       –      –       9      21
Relative Count  0.2          0.2     0.2     0.2    0.2     0.091  0.212
TF-IDF          0.461        0.139   0       0      0       0.209  0.488

3.2 Strategies for Features Derived from Relations as Such

Generic relations describe how resources are related to other resources. For example, a writer is connected to her birthplace, her alma mater, and the books she has written. Such relations between a resource r and a resource r′ are expressed in description logics as p(r, r′) (for an outgoing relation) or p(r′, r) (for an incoming relation), where p can be any relation. We treat incoming (rel in) and outgoing (rel out) relations separately. For such generic relations, we propose four strategies:

– Creating a binary feature for each relation.
– Creating a count feature for each relation, specifying the number of resources connected by this relation.
– Creating a relative count feature for each relation, specifying the fraction of resources connected by this relation. For a resource that has a total number of P outgoing relations, the relative count value for a relation p(r, r′) is defined as np/P, where np is the number of outgoing relations of type p. The feature is defined accordingly for incoming relations.
– Creating a TF-IDF feature for each relation, whose value is (np/P) · log(N / |{r | ∃r′: p(r, r′)}|), where N is the overall number of resources, and |{r | ∃r′: p(r, r′)}| denotes the number of resources for which the relation p(r, r′) exists. The feature is defined accordingly for incoming relations.

The rationale of using relative counts is that resources may have multiple types of connections to other entities, but not all of them are equally important. For example, a person who is mainly a musician may also have written one book, but recorded many records, so that the relations get different weights.
In that case, he will be more similar to other musicians than to other authors – which is not the case if binary features are used. The rationale of using TF-IDF again is to reduce the influence of too general relations. For example, two persons will be more similar if both of them have recorded records, rather than if both have a last name. The IDF factor accounts for that weighting. Table 1 shows the features generated from the example in Fig. 1.

4 Evaluation

We evaluated the strategies outlined above on six different datasets, two for each task of classification, regression, and outlier detection.

4.1 Tasks and Datasets

The following datasets were used in the evaluation:

– The Auto MPG dataset¹, a dataset that captures different characteristics of cars (such as cylinders, transmission, and horsepower), where the target is to predict the fuel consumption in Miles per Gallon (MPG) as a regression task [23]. Each car in the dataset was linked to the corresponding resource in DBpedia.
– The Cities dataset contains a list of cities and their quality of living (as a numerical score), as captured by Mercer [17]. The cities are mapped to DBpedia. We use the dataset both for regression and for classification, discretizing the target variable into high, medium, and low.
– The Sports Tweets dataset consists of a number of tweets, with the target class being whether the tweet is related to sports or not.² The dataset was mapped to DBpedia using DBpedia Spotlight [15].
– The DBpedia-Peel dataset is a dataset where each instance is a link between the DBpedia and the Peel Sessions LOD datasets. Outlier detection is used to identify links whose characteristics deviate from the majority of links, which are then regarded to be wrong. A partial gold standard of 100 links exists, which were manually annotated as right or wrong [19].
– The DBpedia-DBTropes dataset is a similar dataset with links between DBpedia and DBTropes.

For the classification and regression tasks, we use direct types (i.e., rdf:type) and DBpedia categories (i.e., dcterms:subject), as well as all strategies for generic relations. For the outlier detection tasks, we only use direct types and generic relations, since categories do not exist in the other LOD sources involved. An overview of the datasets, as well as the size of each feature set, is given in Table 2.

For classification tasks, we use Naïve Bayes, k-Nearest Neighbors (with k=3), and C4.5 decision trees. For regression, we use Linear Regression, M5Rules, and k-Nearest Neighbors (with k=3). For outlier detection, we use Global Anomaly Score (GAS, with k=25), Local Outlier Factor (LOF), and Local Outlier Probabilities (LoOP, with k=25). We measure accuracy for classification tasks, root-mean-square error (RMSE) for regression tasks, and area under the ROC curve (AUC) for outlier detection tasks.

¹ http://archive.ics.uci.edu/ml/datasets/Auto+MPG
² https://github.com/vinaykola/twitter-topic-classifier/blob/master/training.txt

Table 2: Datasets used in the evaluation. Tasks: C=Classification, R=Regression, O=Outlier Detection

Dataset           Task  # instances  # types  # categories  # rel in  # rel out  # rel in & out
Auto MPG          R     391          264      308           227       370        597
Cities            C/R   212          721      999           1,304     1,081      2,385
Sports Tweets     C     5,054        7,814    14,025        3,574     5,334      8,908
DBpedia-Peel      O     2,083        39       –             586       322        908
DBpedia-DBTropes  O     4,228        128      –             912       2,155      3,067

The evaluations are performed in RapidMiner, using the Linked Open Data extension [22]. For classification, regression, and outlier detection, we use the implementations in RapidMiner where available; otherwise, the corresponding implementations from the Weka³ and Anomaly Detection [6] extensions in RapidMiner were used.
The RapidMiner processes and datasets used for the evaluation can be found online.⁴ The strategies for creating propositional features from Linked Open Data are implemented in the RapidMiner Linked Open Data extension⁵ [22].

4.2 Results

For each of the three tasks, we report the results for each of the feature sets, generated using different propositionalization strategies. The classification and regression results are calculated using stratified 10-fold cross validation, while for outlier detection the evaluations were made on the partial gold standard of 100 links for each of the datasets.⁶

Table 3 shows the classification accuracy for the Cities and Sports Tweets datasets. We can observe that the results are not consistent; the best results for each classifier and for each feature set are achieved using different representation strategies. Only for the incoming relations feature set is there a clear pattern: the best results for the Cities dataset for each classifier are achieved when using the Binary strategy, while for the Sports Tweets dataset the best results are achieved when using the Count strategy. We can observe that for most of the generic relation feature sets, using the TF-IDF strategy leads to poor results. That can be explained by the fact that TF-IDF tends to give higher weights to relations that appear rarely in the dataset, which might also be a result of erroneous data. Also, on the Cities dataset, it can be noticed that when using k-NN on the incoming relations feature set, the difference in the results using different strategies is rather high.

³ https://marketplace.rapid-i.com/UpdateServer/faces/product_details.xhtml?productId=rmx_weka
⁴ http://data.dws.informatik.uni-mannheim.de/propositionalization_strategies/
⁵ http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension
⁶ Note that we measure the capability of finding errors by outlier detection, not of outlier detection as such, i.e., natural outliers may be counted as false positives.
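The effect of rare relations on TF-IDF weights follows directly from the IDF factor log(N/df): as the document frequency df of a relation approaches 1, its weight grows toward log N, so a relation occurring for a single instance – possibly an extraction error – dominates the feature space. A small illustrative computation (N is taken from Table 2; the df values are made up):

```python
import math

N = 5054  # e.g. the number of instances in the Sports Tweets dataset (Table 2)

def idf(df, n=N):
    """Inverse document frequency log(n / df) for a relation occurring df times."""
    return math.log(n / df)

# A relation present for almost every instance carries almost no weight ...
common = idf(5000)   # ~0.011
# ... while a relation seen for a single instance -- possibly an extraction
# error -- receives the largest weight in the dataset.
rare = idf(1)        # ~8.528

assert rare > 700 * common
```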
Table 3: Classification accuracy results for the Cities and Sports Tweets datasets, using Naïve Bayes (NB), k-Nearest Neighbors (k-NN, with k=3), and C4.5 decision trees (C4.5) as classification algorithms, on five different feature sets, generated using three propositionalization strategies for the types and categories feature sets, and four propositionalization strategies for the incoming and outgoing relations feature sets. The best result for each feature set, for each classification algorithm, is marked in bold.

                                    Cities                   Sports Tweets
Features      Representation   NB    k-NN  C4.5  Avg.    NB    k-NN  C4.5  Avg.
types         Binary           .557  .561  .590  .569    .810  .829  .829  .822
              Relative Count   .571  .496  .552  .539    .809  .814  .818  .814
              TF-IDF           .571  .487  .547  .535    .821  .824  .826  .824
categories    Binary           .557  .499  .561  .539    .822  .765  .719  .769
              Relative Count   .595  .443  .589  .542    .907  .840  .808  .852
              TF-IDF           .557  .499  .570  .542    .896  .819  .816  .844
rel in        Binary           .604  .584  .603  .597    .831  .836  .846  .838
              Count            .566  .311  .593  .490    .832  .851  .854  .845
              Relative Count   .491  .382  .585  .486    .695  .846  .851  .798
              TF-IDF           .349  .382  .542  .424    .726  .846  .849  .808
rel out       Binary           .476  .600  .567  .547    .806  .823  .844  .824
              Count            .499  .552  .585  .546    .799  .833  .850  .827
              Relative Count   .480  .584  .566  .543    .621  .842  .835  .766
              TF-IDF           .401  .547  .585  .511    .699  .844  .841  .795
rel in & out  Binary           .594  .585  .564  .581    .861  .851  .864  .859
              Count            .561  .542  .608  .570    .860  .860  .871  .864
              Relative Count   .576  .471  .565  .537    .700  .845  .872  .806
              TF-IDF           .401  .462  .584  .482    .751  .848  .861  .820

Table 4 shows the results of the regression task for the Auto MPG and Cities datasets. For the Auto MPG dataset, for the M5Rules and k-NN algorithms, the best results are achieved when using Relative Count and TF-IDF for all feature sets, while the results for LR are mixed.
For the Cities dataset, we can observe that the results are mixed for the types and categories feature sets, but for the generic relations feature sets, the best results are achieved when using the Binary representation. Also, it can be noticed that when using linear regression, there is a drastic difference in the results between the strategies.

Table 4: Root-mean-square error (RMSE) results for the Auto MPG and Cities datasets, using Linear Regression (LR), M5Rules (M5), and k-Nearest Neighbors (k-NN, with k=3) as regression algorithms, on five different feature sets, generated using three propositionalization strategies for the types and categories feature sets, and four propositionalization strategies for the incoming and outgoing relations feature sets. The best result for each feature set, for each regression algorithm, is marked in bold.

                                 Auto MPG                     Cities
Features      Representation  LR    M5    k-NN  Avg.    LR      M5     k-NN   Avg.
types         Binary          3.95  3.05  3.63  3.54    24.30   18.79  22.16  21.75
              Relative Count  3.84  2.95  3.57  3.45    18.04   19.69  33.56  23.77
              TF-IDF          3.86  2.96  3.57  3.46    17.85   18.77  22.39  19.67
categories    Binary          3.69  2.90  3.61  3.40    18.88   22.32  22.67  21.29
              Relative Count  3.74  2.97  3.57  3.43    18.95   19.98  34.48  24.47
              TF-IDF          3.78  2.90  3.56  3.41    19.02   22.32  23.18  21.51
rel in        Binary          3.84  2.86  3.61  3.44    49.86   19.20  18.53  29.20
              Count           3.89  2.96  4.61  3.82    138.04  19.91  19.20  59.08
              Relative Count  3.97  2.91  3.57  3.48    122.36  22.33  18.87  54.52
              TF-IDF          4.10  2.84  3.57  3.50    122.92  21.94  18.56  54.47
rel out       Binary          3.79  3.08  3.59  3.49    20.00   19.36  20.91  20.09
              Count           4.07  2.98  4.14  3.73    36.31   19.45  23.99  26.59
              Relative Count  4.09  2.94  3.57  3.53    43.20   21.96  21.47  28.88
              TF-IDF          4.13  3.00  3.57  3.57    28.84   20.85  22.21  23.97
rel in & out  Binary          3.99  3.05  3.67  3.57    40.80   18.80  18.21  25.93
              Count           3.99  3.07  4.54  3.87    107.25  19.52  18.90  48.56
              Relative Count  3.92  2.98  3.57  3.49    103.10  22.09  19.60  48.26
              TF-IDF          3.98  3.01  3.57  3.52    115.37  20.62  19.70  51.89

Table 5 shows the results of the outlier detection task for the DBpedia-Peel and DBpedia-DBTropes datasets. In this task, we can observe a much larger difference in performance when using different propositionalization strategies. We can observe that the best results are achieved when using relative count features. The explanation is that in this task, we look at the implicit types of the entities linked when searching for errors (e.g., a book linked to a movie of the same name), and those types are best characterized by the distribution of relations, as also reported in [20]. On the other hand, TF-IDF again has the tendency to assign high weights to rare features, which may also be an effect of noise.

By analyzing the results on each task, we can conclude that the chosen propositionalization strategy has a major impact on the overall results. Also, in some cases there are drastic performance differences between the strategies that are used. Therefore, in order to achieve the best performance, it is important to choose the most suitable propositionalization strategy, which mainly depends on the given dataset, the given data mining task, and the data mining algorithm to be used.

When looking at aggregated results, we can see that for the classification and regression tasks, binary and count features work best in most cases. Furthermore, we can observe that algorithms that rely on the concept of distance, such as k-NN, linear regression, and most outlier detection methods, show a stronger variation of the results across the different strategies than algorithms that do not use distances (such as decision trees).
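The stronger sensitivity of distance-based methods can be illustrated with an entirely hypothetical two-instance example in the spirit of the musician/author case from Section 3.2: under binary features, two resources with the same set of relation types are indistinguishable, while relative counts separate them clearly.

```python
import math

# Two hypothetical resources, both with "recorded" and "wrote" relations,
# but in very different proportions (all counts are made up).
counts_musician = {"recorded": 40, "wrote": 1}
counts_author = {"recorded": 1, "wrote": 40}

def binary(counts):
    """Binary strategy: 1.0 for every relation that is present."""
    return {p: 1.0 for p, c in counts.items() if c > 0}

def relative(counts):
    """Relative count strategy: fraction of the resource's relations per type."""
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def euclidean(x, y):
    keys = set(x) | set(y)
    return math.sqrt(sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys))

d_bin = euclidean(binary(counts_musician), binary(counts_author))     # 0.0
d_rel = euclidean(relative(counts_musician), relative(counts_author)) # ~1.35
```

With a proximity-based learner such as k-NN, the musician would be grouped with other musicians only under the second representation, whereas decision trees, which split on one feature at a time, are less affected by such rescaling.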
5 Conclusion and Outlook

Until now, the problem of finding the most suitable propositionalization strategy for creating features from Linked Open Data has not been tackled, as previous research focused only on binary, or in some cases numerical, representations of features. In this paper, we have compared different strategies for creating propositional features from types and relations in Linked Open Data. We have implemented three propositionalization strategies for specific relations, like rdf:type and dcterms:subject, and four strategies for generic relations. We conducted experiments on six different datasets, across three different data mining tasks, i.e., classification, regression, and outlier detection. The experiments show that the chosen propositionalization strategy might have a major impact on the overall results. However, it is difficult to come up with a general recommendation for a strategy, as it depends on the given data mining task, the given dataset, and the data mining algorithm to be used.

Table 5: Area under the ROC curve (AUC) results for the DBpedia-Peel and DBpedia-DBTropes datasets, using Global Anomaly Score (GAS, with k=25), Local Outlier Factor (LOF), and Local Outlier Probabilities (LoOP, with k=25) as outlier detection algorithms, on four different feature sets, generated using three propositionalization strategies for the types feature set, and four propositionalization strategies for the incoming and outgoing relations feature sets. The best result for each feature set, for each outlier detection algorithm, is marked in bold.

                                   DBpedia-Peel                DBpedia-DBTropes
Features      Representation  GAS    LOF    LoOP   Avg.    GAS    LOF    LoOP   Avg.
types         Binary          0.386  0.486  0.554  0.476   0.503  0.627  0.605  0.578
              Relative Count  0.385  0.398  0.595  0.459   0.503  0.385  0.314  0.401
              TF-IDF          0.386  0.504  0.602  0.497   0.503  0.672  0.417  0.531
rel in        Binary          0.169  0.367  0.288  0.275   0.425  0.520  0.450  0.465
              Count           0.200  0.285  0.290  0.258   0.503  0.590  0.602  0.565
              Relative Count  0.293  0.496  0.452  0.414   0.589  0.555  0.493  0.546
              TF-IDF          0.140  0.353  0.317  0.270   0.509  0.519  0.568  0.532
rel out       Binary          0.250  0.195  0.207  0.217   0.325  0.438  0.432  0.398
              Count           0.539  0.455  0.391  0.462   0.547  0.577  0.522  0.549
              Relative Count  0.542  0.544  0.391  0.492   0.618  0.601  0.513  0.577
              TF-IDF          0.116  0.396  0.240  0.251   0.322  0.629  0.471  0.474
rel in & out  Binary          0.324  0.430  0.510  0.422   0.351  0.439  0.396  0.396
              Count           0.527  0.367  0.454  0.450   0.565  0.563  0.527  0.553
              Relative Count  0.603  0.744  0.616  0.654   0.667  0.672  0.657  0.665
              TF-IDF          0.202  0.667  0.483  0.451   0.481  0.462  0.500  0.481

For future work, additional experiments can be performed on more feature sets. For example, a feature set of qualified incoming and outgoing relations can be generated, where qualified relation features take the type of the related resource into account in addition to the type of the relation. The evaluation can be extended to more datasets, using and combining attributes from multiple Linked Open Data sources. Also, it may be interesting to examine the impact of the propositionalization strategies on further data mining tasks, such as clustering and recommender systems.
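As a rough illustration of such qualified relation features, a minimal sketch (the graph data and type assignments are made up, and coupling the predicate with the object's type is one possible reading of "qualified"):

```python
from collections import Counter

# Made-up outgoing relations of a single resource, as (predicate, object)
# pairs, plus a made-up type assignment for the related resources.
outgoing = [("influenced", "ArtistA"), ("influenced", "BandB"),
            ("birthPlace", "CityC")]
types = {"ArtistA": "Person", "BandB": "Band", "CityC": "Place"}

# A qualified relation feature couples the predicate with the type of the
# related resource; here realized as a count feature (cf. Section 3.2).
qualified = Counter((p, types[o]) for p, o in outgoing)
# qualified has three (predicate, type) features, each with count 1, where a
# plain rel out feature set would only see "influenced" (count 2) and
# "birthPlace" (count 1).
```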
So far, we have considered only statistical measures for feature representation, without exploiting the semantics of the data. More sophisticated strategies that combine statistical measures with the semantics of the data can be developed. For example, we can represent the connection between different resources in the graph by using some of the standard properties of the graph, such as the depth in the hierarchy of the resources, or the fan-in and fan-out values of the resources.

The problem of propositionalization and feature weighting has been extensively studied in the area of text categorization [4, 12]. Many approaches have been proposed which can be adapted and applied to Linked Open Data datasets. For example, adapting supervised weighting approaches, such as [5, 25], might resolve the problem with erroneous data when using the TF-IDF strategy.

Furthermore, some of the statistical measures can be used as feature selection metrics when extracting data mining features from Linked Open Data. For example, considering the semantics of the resources, the IDF value can be computed upfront for all feature candidates, and can be used for selecting the most valuable features before the costly feature generation. Thus, intertwining propositionalization and feature selection strategies for Linked Open Data [24] will be an interesting line of future work.

In summary, this paper has revealed some insights into a problem largely overlooked so far, i.e., choosing among different propositionalization strategies for mining Linked Open Data. We hope that these insights help researchers and practitioners in designing methods and systems for mining Linked Open Data.

Acknowledgements

The work presented in this paper has been partly funded by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD).

References

1. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data – The Story So Far.
   International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
2. Weiwei Cheng, Gjergji Kasneci, Thore Graepel, David Stern, and Ralf Herbrich. Automated feature generation from structured knowledge. In 20th ACM Conference on Information and Knowledge Management (CIKM 2011), 2011.
3. Gerben Klaas Dirk de Vries and Steven de Rooij. A fast and simple graph kernel for RDF. In Proceedings of the Second International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, 2013.
4. Zhi-Hong Deng, Shi-Wei Tang, Dong-Qing Yang, Ming Zhang, Li-Yu Li, and Kun-Qing Xie. A comparative study on feature weight in text categorization. In Advanced Web Technologies and Applications, pages 588–597. Springer, 2004.
5. Jyoti Gautam and Ela Kumar. An integrated and improved approach to terms weighting in text classification. IJCSI International Journal of Computer Science Issues, 10(1), 2013.
6. Markus Goldstein. Anomaly detection. In RapidMiner – Data Mining Use Cases and Business Analytics Applications. 2014.
7. Yi Huang, Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A scalable kernel approach to learning in semantic graphs with applications to linked data. In 1st Workshop on Mining the Future Internet, 2010.
8. Venkata Narasimha Pavan Kappara, Ryutaro Ichise, and O.P. Vyas. LiDDM: A data mining system for linked data. In Workshop on Linked Data on the Web (LDOW2011), 2011.
9. Mansoor Ahmed Khan, Gunnar Aastrand Grimnes, and Andreas Dengel. Two pre-processing operators for improved learning from semantic web data. In First RapidMiner Community Meeting And Conference (RCOMM 2010), 2010.
10. Christoph Kiefer, Abraham Bernstein, and André Locher. Adding data mining support to SPARQL via statistical relational learning methods.
    In Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC'08, pages 478–492, Berlin, Heidelberg, 2008. Springer-Verlag.
11. Stefan Kramer, Nada Lavrač, and Peter Flach. Propositionalization approaches to relational data mining. In Sašo Džeroski and Nada Lavrač, editors, Relational Data Mining, pages 262–291. Springer Berlin Heidelberg, 2001.
12. Man Lan, Chew-Lim Tan, Hwee-Boon Low, and Sam-Yuan Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW '05, pages 1032–1033, New York, NY, USA, 2005. ACM.
13. Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, 2013.
14. Uta Lösch, Stephan Bloehdorn, and Achim Rettinger. Graph kernels for RDF data. In The Semantic Web: Research and Applications, pages 134–148. Springer, 2012.
15. Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), 2011.
16. Jindřich Mynarz and Vojtěch Svátek. Towards a benchmark for LOD-enhanced knowledge discovery from structured data. In Proceedings of the Second International Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data, pages 41–48. CEUR-WS, 2013.
17. Heiko Paulheim. Generating possible interpretations for statistics from linked open data. In 9th Extended Semantic Web Conference (ESWC), 2012.
18. Heiko Paulheim. Exploiting linked open data as background knowledge in data mining. In Workshop on Data Mining on Linked Open Data, 2013.
19. Heiko Paulheim.
    Identifying wrong links between datasets by multi-dimensional outlier detection. In Workshop on Debugging Ontologies and Ontology Mappings (WoDOOM), 2014.
20. Heiko Paulheim and Christian Bizer. Type inference on noisy RDF data. In International Semantic Web Conference, pages 510–525, 2013.
21. Heiko Paulheim and Johannes Fürnkranz. Unsupervised Generation of Data Mining Features from Linked Open Data. In International Conference on Web Intelligence, Mining, and Semantics (WIMS'12), 2012.
22. Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, and Christian Bizer. Data mining with background knowledge from the web. In RapidMiner World, 2014. To appear.
23. J. Ross Quinlan. Combining instance-based and model-based learning. In ICML, page 236, 1993.
24. Petar Ristoski and Heiko Paulheim. Feature selection in hierarchical feature spaces. In Discovery Science, 2014. To appear.
25. Pascal Soucy and Guy W. Mineau. Beyond TFIDF weighting for text categorization in the vector space model. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, pages 1130–1135, San Francisco, CA, USA, 2005. Morgan Kaufmann Publishers Inc.