Towards Automatic Topical Classification of LOD Datasets

                 Robert Meusel                          Blerina Spahiu                    Christian Bizer
          Data and Web Science Group               Department of Computer          Data and Web Science Group
             University of Mannheim                 Science, Systems and              University of Mannheim
           B6 26, Mannheim, Germany                    Communication                B6 26, Mannheim, Germany
              robert@dwslab.de                    University of Milan Bicocca            chris@dwslab.de
                                                Viale Sarca, 336 20126 Milano
                                                  spahiu@disco.unimib.it
                                                     Heiko Paulheim
                                                Data and Web Science Group
                                                   University of Mannheim
                                                 B6 26, Mannheim, Germany
                                                       heiko@dwslab.de

ABSTRACT                                                             Since the proposal of the Linked Data best practices in
The datasets that are part of the Linking Open Data cloud         2006, the Linked Open Data cloud (LOD cloud) has grown to
diagramm (LOD cloud) are classified into the following top-       roughly 1 000 datasets (as of April 2014) [15]. The datasets
ical categories: media, government, publications, life sci-       cover various topical domains, with social media, govern-
ences, geographic, social networking, user-generated con-         ment data, and metadata about publications being the most
tent, and cross-domain. The topical categories were manu-         prominent areas [15].
ally assigned to the datasets. In this paper, we investigate to      The most well-known categorization of LOD datasets by
which extent the topical classification of new LOD datasets       topical domain is the coloring of the LOD cloud diagram.1
can be automated using machine learning techniques and the        Up till now, the topical categories were manually assigned
existing annotations as supervision. We conducted experi-         to the datasets in the cloud either by the publishers of the
ments with different classification techniques and different      datasets themselves via the datahub.io dataset catalog or
feature sets. The best classification technique/feature set       by the authors of the LOD cloud diagram. In this paper, we
combination reaches an accuracy of 81.62% on the task of          investigate to which extent the topical classification of new
assigning one out of the eight classes to a given LOD dataset.    LOD datasets can be automated for upcoming versions of
A deeper inspection of the classification errors reveals prob-    the LOD cloud diagram using machine learning techniques
lems with the manual classification of datasets in the current    and the existing annotations as supervision.
LOD cloud.                                                           Beside creating upcoming versions of the LOD cloud dia-
                                                                  gram, the automatic topical classification of LOD datasets
                                                                  can be interesting for other purposes as well: Agents navi-
Keywords                                                          gating on the Web of Linked Data should know the topical
Linked Open Data, Topic Detection, Data Space Profiling           domain of datasets that they discover by following links in
                                                                  order to judge whether the datasets might be useful for their
                                                                  use case at hand or not. Furthermore, as shown in [15], it is
1.   INTRODUCTION                                                 interesting to analyze characteristics of datasets grouped by
   The Web of Linked Data offers a rich collection of struc-      topical domain, so that trends and best practices that exist
tured data provided by hundreds of different data sources         only in a particular topical domain can be identified.
that use common standards such as dereferencable URIs                In this paper, we present – to the best of our knowledge –
and RDF. The central idea of Linked Data is that data             the first automatic approach to classify LOD datasets into
sources set RDF links pointing at other data sources – e.g.,      the topical categories that are used by the LOD cloud di-
owl:sameAs links – so that all data is connected into a global    agram. Using the data catalog underlying the recent LOD
data space [3, 8]. In this data space, agents can navigate        cloud, we train machine learning classifiers with different
from one data source to another by following RDF links,           sets of features. Our best classification technique/feature
thereby discovering new data sources on the fly.                  set combination reaches an accuracy of 82%.
                                                                     The rest of this paper is structured as follows. Section 2
                                                                  introduces the methodology of our experiments, followed by
                                                                  a presentation of the results in Section 3 and a discussion of
                                                                  remaining classification errors in Section 4. Section 5 gives
                                                                  an overview of related work. We conclude with a summary
                                                                  and an outlook on future work.


Copyright is held by the author/owner(s).                         1
WWW2015 Workshop: Linked Data on the Web (LDOW2015).                  http://lod-cloud.net
2.    METHODOLOGY
   In this section, we first briefly describe the data corpus
that we use for our experiments and the different feature
sets we derive from the data. We than briefly introduce the
classification techniques that we considered and sketch the
final experimental setup that was used for the evaluation.

2.1   Data Corpus
   In order to extract our features for the different datasets
which are contained in the LOD cloud, we used the data
corpus that was crawled by Schmachtenberg et al. [15] and
which was used to draw the most recent LOD cloud diagram.
Schmachtenberg et al. used the LD-Spider framework [9]           Figure 2: Number of datasets per category con-
to gather Linked Data from the Web in April 2014. The            tained in the LOD cloud.
crawler was seeded with URIs from three different sources:
(1) dataset descriptions in lod-cloud group of the datahub.io
dataset catalog, as well as other datasets marked with Linked         used by a dataset form a helpful indicator for deter-
Data related tags within the catalog; (2) a sample of the Bil-        mining the topical category of the dataset. Thus, we
lion Triple Challenge 2012 dataset2 ; and (3) datasets adver-         determine the vocabulary of all terms that are used as
tised on the public-lodw3.org mailing list since 2011. The            predicates or as the object of a type statement within
final crawl contains data from 1 014 different LOD datasets.3         each dataset. Altogether we identified 1 439 different
Altogether 188 million RDF triples were extracted from 900 129        vocabularies being used by the datasets (see [15] for
documents describing 8 038 396 resources. Figure 1 shows              details about the most widely used vocabularies).
the distribution of the number of resources and documents
per dataset contained in the crawl.                              Class URIs (CUri): As a more fine-grained feature, the
                                                                      rdfs: and owl:classes which are used to describe
                                                                      entities within a dataset might provide useful informa-
                                                                      tion to determine the topical category of the dataset.
                                                                      Thus, we extracted all the classes that are used by at
                                                                      least two different datasets, resulting in 914 attributes
                                                                      for this feature set.

                                                                 Property URIs (PUri): Beside the class information of
                                                                     an entity, information about which properties are used
                                                                     to describe the entity can be helpful. For example it
                                                                     might make a difference, if a person is described with
Figure 1: Distribution of the number of resources                    foaf:knows statements or if her professional affiliation
(− − −) and documents (    ) (log scale) per dataset                 is provided. To leverage this information, we collected
contained in the crawl.                                              all properties that are used within the crawled data by
                                                                     at least two datasets. This feature set consists of 2 333
   In order to create the 2014 version of the LOD cloud di-
                                                                     attributes.
agram, newly discovered datasets were manually classified
into one of the following categories: media, government, pub-    Local Class Names (LCN): Different vocabularies might
lications, life sciences, geographic, social networking, user-       contain synonymous (or at least closely related) terms
generated content, and cross-domain. A detailed definition           that share the same local name and only differ in their
of each category is available in [15].                               namespace, e.g. foaf:Person and dbpedia:Person.
   Figure 2 shows the number of datasets per category con-           Creating correspondences between similar classes from
tained in the 2014 version of the LOD cloud. As we can               different vocabularies reduces the diversity of features,
see, the LOD cloud is dominated by datasets belonging to             but on the other side might increase the number of
the category social networking (48%), followed by govern-            attributes which are used by more than one dataset.
ment (18%) and publications (13%) datasets. The categories           As we lack correspondences between all the vocabu-
media and geographic are only represented by less than 25            laries, we bypass this, by using only the local names
datasets within the whole corpus.                                    of the type URIs, meaning vocab1:Country and vo-
2.2   Feature Sets                                                   cab2:Country are mapped to the same attribute. We
                                                                     used a simple regular expression to determine the lo-
  For each of the datasets, we created the following eight           cal class name checking for #, : and / within the type
feature sets based on the crawled data.                              object. By focusing only on the local part of a class
                                                                     name, we increase the number of classes that are used
Vocabulary Usage (VOC): As many vocabularies target
                                                                     by more than one dataset in comparison to CUri and
    a specific topical domain, e.g. bibo bibliographic in-
                                                                     thus generate 1 041 attributes for the LCN feature set.
    formation, we assume that the vocabularies that are
2                                                                Local Property Names (LPN): Using the same assump-
 http://km.aifb.kit.edu/projects/btc-2012/
3                                                                    tion as for the LCN feature set, we also extracted the
 The crawled data is publicly available: http://data.dws.
informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/                   local name of each property that is used by a dataset.
     This results in treating vocab1:name and vocab2:name         2.3   Classification Approaches
     as a single property. We used the same heuristic for           We evaluated the following three classification techniques
     the extraction as for the LCN feature set and generated      on our task of assigning topical categories to LOD datasets.
     3 493 different local property names which are used by
     more than one dataset, resulting in an increase of the       k-Nearest Neighbor: k-Nearest Neighbor (k-NN) classifi-
     number of attributes in comparison to the PUri feature           cation models make use of the similarity between new
     set.                                                             cases and known cases to predict the class for the new
                                                                      case. A case is classified by its majority vote of its
Text from rdfs:label (LAB): Beside the vocabulary-level               neighbors, with the case being assigned to the class
    features, the names of the described entities might               most common among its k nearest neighbors measured
    also indicate the topical domain of a dataset. We                 by the distance function. In our experiments we used a
    thus extracted all values of rdfs:label properties,               k equal to 5 with Euclidean-similarity for non-binary
    lower-cased them, and tokenized the values at space-              term vectors and Jaccard-similarity for binary term
    characters. We further excluded tokens shorter than               vectors.
    three and longer than 25 characters. Afterward, we
    calculated the TF-IDF value for each token while ex-          J48 Decision Tree: A decision tree is a flowchart-like tree
    cluding tokens that appeared in less than 10 and more             structure which is built top-down from a root node and
    than 200 datasets, in order to reduce the influence of            involves some partitioning steps to divide data into
    noise. This resulted in a feature set consisting of 1 440         subsets that contain instances with similar values. For
    attributes.                                                       our experiments we use the Weka implementation of
                                                                      the C4.5 decision tree [12]. We learn a pruned tree,
Top-Level Domains (TLD): Another feature which might                  using a confidence threshold of 0.25 with a minimum
    help to assign datasets to topical categories is the top-         number of 2 instances per leaf.
    level domain of the dataset. For instance, govern-
    ment data is often hosted in the gov top-level domain,        Naive Bayes: As a last classification method, we used Naive
    whereas library data might be found more likely on                Bayes (NB). NB uses joint probabilities of some evi-
    edu or org top-level domains.4                                    dence to estimate the probability of some event. Al-
                                                                      though this classifier is based on the assumption that
In & Outdegree (DEG): In addition to vocabulary-based                 all features are independent, which is violated in many
    and textual features, the number of outgoing RDF                  use cases, NB has shown to work well in practice [14].
    links to other datasets and incoming RDF links from
    other datasets could provide useful information for clas-     2.4   Experimental Setup
    sifying the datasets. This feature could give a hint             In order to evaluate the performance of the three classifi-
    about the density of the linkage of a dataset, as well        cation methods, we use 10-fold cross-validation and report
    as the way the dataset is interconnected within the           the average accuracy in the end.
    whole LOD cloud ecosystem.                                       As the number of datasets per category is not equally dis-
                                                                  tributed within the LOD cloud, which might influence the
   We were able to create all features (except LAB) for 1 001     performance of the classification models, we also explore the
datasets. As only 470 datasets provide rdfs:labels, we            effect of balancing the training data. We used two different
only use these datasets for evaluating the utility of the LAB     balancing approaches: (1) we down sample the number of
feature set.                                                      datasets used for training until each category is represented
   As the total number of occurrences of vocabularies and         by the same number of datasets; this number is equal to
terms is heavily influenced by the distribution of entities       the number of datasets within the smallest category; and
within the crawl for each dataset, we apply two different         (2) we up sample the datasets for each category until each
normalization strategies to the values of the vocabulary-level    category is at least represented by the number of datasets
features VOC, CUri, PUri, LCN, and LPN: On the one hand           equal to the number of datasets of the largest category. The
side, we create a binary version (bin) where the feature vec-     first approach, reduces the chance to overfit a model into the
tors of each feature set consist of 0 and 1 indicating presence   direction of the larger represented classes, but it might also
and absence of the vocabulary or term. The second version,        remove valuable information from the training set, as ex-
the relative term occurrence (rto), captures the fraction of      amples are removed and not taken into account for learning
vocabulary or term usage for each dataset.                        the model. The second approach, ensures that all possible
   The following table shows an example of the two different      examples are taken into account and no information is lost
feature set versions for the terms ti :                           for training, but by creating the same entity many times
                                                                  can result in emphasizing those particular data points. For
                                         Feature Vector           example a neighborhood based classifier might look at the
   Feature Set Version                  t1 t2     t3    t4        5 nearest neighbors, which than could be one and the same
   Term Occurrence                      10   0     2     8        data point, which would result into looking only at the near-
   Binary (bin)                          1   0     1     1        est neighbor.
   Relative Term Occurrence (rto)      0.5   0 0.1 0.4
                                                                  3.    RESULTS
                                                                     In the following, we first report the results of our exper-
                                                                  iments using the different feature sets in separation. After-
4                                                                 ward, we report the results of experiments combining at-
  We restrict ourselves to top-level domains, and not public
suffixes.                                                         tributes from multiple feature sets.
 Table 1: Results of different single feature sets. Best three single and average results are marked in bold.
 Classification                     VOC            CUri               PUri              LCN              LPN
 Approach                         bin   rto      bin    rto         bin    rto        bin   rto        bin   rto      LAB     TLD     DEG
 Major Class                    51.85   51.85   51.85   51.85      51.85    51.85    51.85   51.85   51.85    51.85   33.62   51.85   51.85
 k-NN (no sampling)             77.92   76.33   76.83   74.08   79.81       75.30    76.73   74.38   79.80    76.10   53.62   58.44   49.25
 k-NN (down sampling)           64.74   66.33   68.49   60.67    71.80      62.70    68.39   65.35    73.10   62.80   19.57   30.77   29.88
 k-NN (up sampling)             71.83   72.53   64.98   67.08    75.60      71.89    68.87   69.82    76.64   70.23   43.97   10.74   11.89
 J48 (no sampling)              78.83   79.72   78.86   76.93    77.50      76.40   80.59    76.83    78.70   77.20   63.40   67.14   54.45
 J48 (down sampling)            57.65   66.63   65.35   65.24    63.90      63.00    64.02   63.20    64.90   60.40   25.96   34.76   24.78
 J48 (up sampling)              76.53   77.63   74.13   76.60    75.29      75.19    77.50   75.92    75.91   74.46   52.64   45.35   29.47
 Naive Bayes (no sampling)      34.97   44.26   75.61   57.93    78.90      75.70    77.74   60.77    78.70   76.30   40.00   11.99   22.88
 Naive Bayes (down sampling)    64.63   69.14   64.73   62.39    68.10      66.60    70.33   61.58    68.50   69.10   33.62   20.88   15.99
 Naive Bayes (up sampling)      77.53   44.26   74.98   55.94    77.78      76.12    76.02   58.67    76.54   75.71   37.82   45.66   14.19
 Average (no sampling)          63.91   66.77   77.10   69.65   78.73       75.80   78.35    70.66   79.07    76.53   52.34   45.86   42.19
 Average (down sampling)        62.34   67.34   66.19   62.77    67.93      64.10    67.58   63.38    68.83   64.10   26.38   28.80   23.55
 Average (up sampling)          75.30   64.81   71.36   66.54    76.22      74.40    74.13   68.14    76.36   73.47   44.81   33.92   18.52


3.1    Results for Single Feature Sets                                       Table 2 reports the results for the five different combined
   Table 1 shows the accuracy that is reached using the three              feature sets:
different classification algorithms with and without balanc-
ing the training data. Majority Class is the performance                   ALLrto : Combination of the attributes from all eight fea-
of a default baseline classifier always predicting the largest                ture sets, using the rto version of the vocabulary-based
class: social networking.                                                     features.
   As a general observation, the vocabulary-based feature
                                                                           ALLbin : Combination of the attributes from all eight fea-
sets (VOC, LCN, LPN, CUri, PUri) perform on a similar
                                                                               ture sets, using the bin version of the vocabulary-based
level, where DEG and TLD alone show a relatively poor
                                                                               features.
performance and in some cases are not at all able to beat
the majority class baseline. Classification models based on                NoLabrto : Combination of the attributes from all feature,
the attributes of the LAB feature set perform on average                      without the attributes of the LAB feature set, using
(without sampling) around 20% above the majority base-                        the rto version of the vocabulary-based features.
line, but predict still in half of all cases the wrong category.
Algorithm-wise, the best results are achieved using the deci-              NoLabbin : Combination of the attributes from all feature,
sion tree (J48) without balancing (maximal accuracy 80.59%                    without the attributes of the LAB feature set, using
for LCNrto ) and the k-NN algorithm, also without balanc-                     the bin version of the vocabulary-based features.
ing for the PUribin and LPNbin feature sets. Comparing
the two balancing approaches, we see better results using                  Best3: Includes the attributes from the three best perform-
the up sampling approach for almost all feature sets (except                   ing feature sets from the previous section based on
VOCrto and DEG). In most cases, the category-specific ac-                      their average accuracy: PUribin , LCNbin , and LPNbin .
curacy of the smaller categories is higher when using up
sampling. Using down sampling the learned models make                         We can observe that when selecting a larger set of at-
more errors for predicting the larger categories. Further-                 tributes, our model is able to reach a slightly higher accuracy
more, when comparing the results of the models trained on                  of 81.62% than using just the attributes from one feature set
unbalanced data with the best model trained on balanced                    (80.59%, LCNbin ). Still the trained model is unsure for cer-
data, the models on the unbalanced data are more accurate                  tain decisions and has a stronger bias towards the categories
except for the VOCbin feature set. Having a closer look at                 publications and social networking.
the confusion matrices, we see that the balanced approaches
are in general making more errors when trying to predict                   4.   DISCUSSION
datasets for the larger categories, like social networking and               In the following, we look at the best performing approach
government.                                                                (Naive Bayes trained on the attributes of the NoLabbin fea-
                                                                           ture set using up sampling). Table 3 shows the confusion
                                                                           matrix of this experiment, where on the left side we list the
3.2    Results for Combined Feature Sets                                   predictions by the learned model, while the head names the
   For our second set of experiments, we combine the avail-                actual category of the dataset. As observed in the table,
able attributes from the different feature sets and train again            there are three kinds of errors which occur more frequently
our classification models using the three described algorithms.            than 10 times.
As before, we generate a binary and relative term occurrence                 The most common confusion occurs for the publication
version of the vocabulary-based features. In addition, we                  domain, where a larger number of datasets are predicted to
create a second set (binary and relative term occurrence),                 belong to the government domain. A reason for this is that
where we omit the attributes from the LAB feature set, as                  government datasets often contain metadata about govern-
we wanted to measure the influence of this particular set                  ment statistics which are represented using the same vocab-
of attributes, which is only available for less than half of               ularies and terms (e.g. skos:Concept) that are also used
the datasets. Furthermore we created a combined set of at-                 in the publication domain. This makes it challenging for
tributes consisting of the three best performing feature sets              a vocabulary-based classifier to distinguish those two cate-
from the previous section.                                                 gories apart. In addition, for example the http://mcu.es
                                             Table 2: Results of combined feature sets. Best three results in bold.
                                               Classification                                                                                             Accuracy in %
                                               Approach                                                                            ALLbin        ALLrto     NoLabbin    NoLabrto   Best3
                                               k-NN (no sampling)                                                                      74.93      71.73         76.93      72.63   75.23
                                               k-NN (down sampling)                                                                    52.76      46.85         65.14      52.05   64.44
                                               k-NN (up sampling)                                                                      74.23      67.03         71.03      68.13   73.14
                                               J48 (no sampling)                                                                      80.02       77.92         79.32      79.01   75.12
                                               J48 (down sampling)                                                                     63.24      63.74         65.34      65.43   65.03
                                               J48 (up sampling)                                                                       79.12      78.12         79.23      78.12   75.72
                                               Naive Bayes (no sampling)                                                               21.37      71.03        80.32       77.22   76.12
                                               Naive Bayes (down sampling)                                                             50.99      57.84         70.33      68.13   67.63
                                               Naive Bayes (up sampling)                                                               21.98      71.03        81.62       77.62   76.32


Table 3: Confusion matrix for the NoLabbin feature                                                                                                    Although topical profiling has been studied in other set-
set, with Naive Bayes classification model, balanced                                                                                               tings before, only a small number of methods exist for profil-
by up sampling.                                                                                                                                    ing LOD datasets. These methods can be categorized based
                                                                       True Category                                                               on the general learning approach that is employed into the
                         social networking


                                                                                                                usergen. content
                                                                                                                                                   categories unsupervised and supervised. Where the first cat-
                                                                                                                                                   egory does not rely on labeled input data, the latter is only
                                               crossdomain

                                                             publications

                                                                            government


                                                                                         lifesciences


                                                                                                                                    geographic
                                                                                                                                                   applicable for labeled data.
                                                                                                                                                      Ellefi et al. [5] try to define the profile of datasets using
                                                                                                        media


                                                                                                                                                   semantic and statistical characteristics. They use statistics
     Prediction                                                                                                                                    about vocabulary, property, and datatype usage, as well as
     social networking       489                     4           5               10             2         4            11              1
     crossdomain               1                    10           3                1             1         0             1              1
                                                                                                                                                   statistics on property values, like string lengths, for char-
     publications              8                    10          54                9             4         4             2              2           acterizing datasets. For classification, they propose a fea-
     government                3                     4          14              151             1         2             0              2           ture/characteristic generation process, starting from the top
     lifesciences              5                     3          12                0            72         2             5              5           discovered types of a dataset and generating property/value
     media                     6                     3           4                1             1         7             2              0
     usergen. content          6                     1           1                2             0         2            26              0           pairs. In order to integrate the property/value pairs they
     geographic                1                     5           1                5             1         0             0              8           consider the problem of vocabulary heterogeneity of the datasets
                                                                                                                                                   by defining correspondences between features in different vo-
                                                                                                                                                   cabularies. The authors have pointed out that it is essen-
dataset – the Ministry of Culture in Spain – was manu-                                                                                             tial to automate the feature generations and proposed the
ally labeled as publication within the LOD cloud, whereas                                                                                          framework to do so, but do not evaluate their approach on
the model predicts government which turns out to be a                                                                                              real-world datasets. In our work, we draw from their ideas of
borderline case in the gold standard. A similar frequent                                                                                           using schema-usage characteristics as features for the topical
problem is the prediction of life sciences for datasets in the                                                                                     classification, but focus on LOD datasets.
publications category. This can be observed, e.g., for the                                                                                            An approach to detect latent topics in entity-relation-
http://ns.nature.com/publications/, which describe the                                                                                             ship graphs is introduced by Böhm et al. [4]. Their ap-
publications in Nature. Those publications, however, are of-                                                                                       proach works in two phases: (1) A number of subgraphs
ten in the life sciences field, which makes the labeling in the                                                                                    having strong relations between classes are discovered from
gold standard a borderline case.                                                                                                                   the whole graph, and (2) the subgraphs are combined to
   The third most common confusion occurs between the                                                                                              generate a larger subgraph, which is assumed to represent
usergenerated content and the social networking domain.                                                                                            a latent topic. Their approach explicitly omits any kind of
Here, the problem is in the shared use of similar vocabular-                                                                                       features based on textual representations and solely relies on
ies, such as foaf. At the same time, labeling a dataset as ei-                                                                                     the exploitation of the underlying graph. Böhm et al. used
ther one of the two is often not so simple. In [15], it has been                                                                                   the DBpedia dataset to evaluate their approach.
defined that social networking datasets should focus on the                                                                                           Fetahu et al. [6] propose an approach for creating dataset
presentation of people and their interrelations, while user-                                                                                       profiles represented by a weighted dataset-topic graph which
generated content should have a stronger focus on the con-                                                                                         is generated using the category graph and instances from
tent. Datasets from personal blogs, such as www.wordpress.com,                                                                                     DBpedia. In order to create such profiles, a processing
however, can convey both aspects. Due to the labeling rule,                                                                                        pipeline that combines tailored techniques for dataset sam-
these datasets are labeled as usergenerated content, but our                                                                                       pling, topic extraction from reference datasets, and relevance
approach frequently classifies them as social networking.                                                                                          ranking is used. Topics are extracted using named-entity-
   In summary, while we observe some true classification er-                                                                                       recognition techniques, where the ranking of the topics is
rors, many of the mistakes made by our approach actually                                                                                           based on their normalized relevance score for a dataset.
point at datasets which are difficult to classify, and which                                                                                          While the mentioned approaches are unsupervised, we em-
are rather borderline cases between two categories.                                                                                                ploy supervised learning techniques as we want to exploit the
                                                                                                                                                   existing topical annotation of the datasets in the LOD cloud.
5.     RELATED WORK
  Topical profiling has been studied in the data mining,
database, and information retrieval communities. The re-                                                                                           6.     CONCLUSION AND FUTURE WORK
sulting methods find application in domains such as docu-                                                                                             In this paper, we investigate to which extent the topical
ments classification, contextual search, content management                                                                                        classification of new LOD datasets can be automated using
and review analysis [1, 11, 2, 16, 17].                                                                                                            machine learning techniques. Our experiments indicate that
vocabulary-level features are a good indicator for the topical           efficiently generating structured dataset topic profiles.
domain, yielding an accuracy of around 82%.                              In The Semantic Web: Trends and Challenges - 11th
   The analysis of the limitations of our approach, i.e., the            International Conference, ESWC 2014, Anissaras,
cases where the automatic classification deviates from the               Crete, Greece, May 25-29, 2014. Proceedings, pages
manually labeled one, points to a problem of the categoriza-             519–534, 2014.
tion approach that is currently used for the LOD cloud: All          [7] L. Getoor and C. P. Diehl. Link mining: a survey.
datasets are labeled with exactly one topical category, al-              ACM SIGKDD Explorations Newsletter, 2005.
though sometimes two or more categories would be equally             [8] T. Heath and C. Bizer. Linked data: Evolving the web
appropriate. One such example are datasets describing life               into a global data space. Synthesis lectures on the
science publications, which can be either labeled as publica-            semantic web: theory and technology, 1(1):1–136, 2011.
tions or as life sciences. Thus, the LOD dataset classifica-         [9] R. Isele, J. Umbrich, C. Bizer, and A. Harth.
tion task might be more suitably formulated as a multi-label             LDSpider: An open-source crawling framework for the
classification problem [18, 10].                                         web of linked data. In Proc. ISWC ’10 –Posters and
   A particular challenge of the classification is the heavy             Demos, 2010.
imbalance of the dataset categories, with roughly half of the       [10] B. Liu, X. Li, W. S. Lee, and P. S. Yu. Text
datasets belonging to the social networking domain. Here, a              classification by labeling words. In Proceedings of the
two-stage approach might help, in which a first classifier tries         Nineteenth National Conference on Artificial
to separate the largest category from the rest, while a second           Intelligence, Sixteenth Conference on Innovative
classifier then tries to make a prediction for the remaining             Applications of Artificial Intelligence, July 25-29,
classes. When regarding the problem as a multi-label prob-               2004, San Jose, California, USA, pages 425–430, 2004.
lem, the corresponding approach would be classifier chains,
                                                                    [11] J. Nam, J. Kim, E. Loza Mencı́a, I. Gurevych, and
which make a prediction for one class after the other, taking
                                                                         J. Fürnkranz. Large-scale multi-label text
the prediction of the first classifiers into account as a feature
                                                                         classification - revisiting neural networks. In Machine
for the remaining classifications [13].
                                                                         Learning and Knowledge Discovery in Databases -
   In our experiments, RDF links have not been exploited
                                                                         European Conference, ECML PKDD 2014, Nancy,
beyond dataset in- and out-degree. For the task of web
                                                                         France, September 15-19, 2014. Proceedings, Part II,
page classification, link-based classification techniques, that
                                                                         pages 437–452, 2014.
exploit the contents of web pages linking to a particular
page, often yields good results [7] and it is possible that         [12] J. R. Quinlan. C4.5: Programs for Machine Learning.
such techniques could also work well for classifying LOD                 Morgan Kaufmann Publishers Inc., San Francisco,
datasets.                                                                CA, USA, 1993.
                                                                    [13] J. Read, B. Pfahringer, G. Holmes, and E. Frank.
                                                                         Classifier chains for multi-label classification. In
Acknowledgements                                                         European Conference on Machine Learning and
This research has been supported in part by FP7/2013-2015                Knowledge Discovery in Databases, 2009.
COMSODE (under contract number FP7-ICT-611358).                     [14] I. Rish. An empirical study of the naive bayes
                                                                         classifier. In IJCAI 2001 workshop on empirical
7.   REFERENCES                                                          methods in artificial intelligence, volume 3, pages
 [1] C. C. Aggarwal and C. Zhai. A survey of text                        41–46. IBM New York, 2001.
     clustering algorithms. In Mining Text Data, pages              [15] M. Schmachtenberg, C. Bizer, and H. Paulheim.
     77–128. 2012.                                                       Adoption of the linked data best practices in different
 [2] T. Basu and C. A. Murthy. Effective text classification             topical domains. In The Semantic Web–ISWC 2014,
     by a supervised feature selection approach. In 12th                 pages 245–260. Springer, 2014.
     IEEE International Conference on Data Mining                   [16] P. Shivane and R. Rajani. A survey on effective
     Workshops, ICDM Workshops, Brussels, Belgium,                       quality enhancement of text clustering & classification
     December 10, 2012, pages 918–925, 2012.                             using metadata.
 [3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data -          [17] G. Song, Y. Ye, X. Du, X. Huang, and S. Bie. Short
     the story so far. Int. J. Semantic Web Inf. Syst.,                  text classification: A survey. Journal of Multimedia,
     5(3):1–22, 2009.                                                    9(5):635–643, 2014.
 [4] C. Böhm, G. Kasneci, and F. Naumann. Latent topics            [18] G. Tsoumakas and I. Katakis. Multi label
     in graph-structured data. In 21st ACM International                 classification: An overview. International Journal of
     Conference on Information and Knowledge                             Data Warehousing and Mining, 3(3):1–13, 2007.
     Management, CIKM’12, Maui, HI, USA, October 29 -
     November 02, 2012, pages 2663–2666, 2012.
 [5] M. B. Ellefi, Z. Bellahsene, F. Scharffe, and
     K. Todorov. Towards semantic dataset profiling. In
     Proceedings of the 1st International Workshop on
     Dataset PROFIling & fEderated Search for Linked
     Data co-located with the 11th Extended Semantic Web
     Conference, PROFILES@ESWC 2014, Anissaras,
     Crete, Greece, May 26, 2014., 2014.
 [6] B. Fetahu, S. Dietze, B. P. Nunes, M. A. Casanova,
     D. Taibi, and W. Nejdl. A scalable approach for