Towards Automatic Topical Classification of LOD Datasets Robert Meusel Blerina Spahiu Christian Bizer Data and Web Science Group Department of Computer Data and Web Science Group University of Mannheim Science, Systems and University of Mannheim B6 26, Mannheim, Germany Communication B6 26, Mannheim, Germany robert@dwslab.de University of Milan Bicocca chris@dwslab.de Viale Sarca, 336 20126 Milano spahiu@disco.unimib.it Heiko Paulheim Data and Web Science Group University of Mannheim B6 26, Mannheim, Germany heiko@dwslab.de ABSTRACT Since the proposal of the Linked Data best practices in The datasets that are part of the Linking Open Data cloud 2006, the Linked Open Data cloud (LOD cloud) has grown to diagramm (LOD cloud) are classified into the following top- roughly 1 000 datasets (as of April 2014) [15]. The datasets ical categories: media, government, publications, life sci- cover various topical domains, with social media, govern- ences, geographic, social networking, user-generated con- ment data, and metadata about publications being the most tent, and cross-domain. The topical categories were manu- prominent areas [15]. ally assigned to the datasets. In this paper, we investigate to The most well-known categorization of LOD datasets by which extent the topical classification of new LOD datasets topical domain is the coloring of the LOD cloud diagram.1 can be automated using machine learning techniques and the Up till now, the topical categories were manually assigned existing annotations as supervision. We conducted experi- to the datasets in the cloud either by the publishers of the ments with different classification techniques and different datasets themselves via the datahub.io dataset catalog or feature sets. The best classification technique/feature set by the authors of the LOD cloud diagram. In this paper, we combination reaches an accuracy of 81.62% on the task of investigate to which extent the topical classification of new assigning one out of the eight classes to a given LOD dataset. LOD datasets can be automated for upcoming versions of A deeper inspection of the classification errors reveals prob- the LOD cloud diagram using machine learning techniques lems with the manual classification of datasets in the current and the existing annotations as supervision. LOD cloud. Beside creating upcoming versions of the LOD cloud dia- gram, the automatic topical classification of LOD datasets can be interesting for other purposes as well: Agents navi- Keywords gating on the Web of Linked Data should know the topical Linked Open Data, Topic Detection, Data Space Profiling domain of datasets that they discover by following links in order to judge whether the datasets might be useful for their use case at hand or not. Furthermore, as shown in [15], it is 1. INTRODUCTION interesting to analyze characteristics of datasets grouped by The Web of Linked Data offers a rich collection of struc- topical domain, so that trends and best practices that exist tured data provided by hundreds of different data sources only in a particular topical domain can be identified. that use common standards such as dereferencable URIs In this paper, we present – to the best of our knowledge – and RDF. The central idea of Linked Data is that data the first automatic approach to classify LOD datasets into sources set RDF links pointing at other data sources – e.g., the topical categories that are used by the LOD cloud di- owl:sameAs links – so that all data is connected into a global agram. Using the data catalog underlying the recent LOD data space [3, 8]. In this data space, agents can navigate cloud, we train machine learning classifiers with different from one data source to another by following RDF links, sets of features. Our best classification technique/feature thereby discovering new data sources on the fly. set combination reaches an accuracy of 82%. The rest of this paper is structured as follows. Section 2 introduces the methodology of our experiments, followed by a presentation of the results in Section 3 and a discussion of remaining classification errors in Section 4. Section 5 gives an overview of related work. We conclude with a summary and an outlook on future work. Copyright is held by the author/owner(s). 1 WWW2015 Workshop: Linked Data on the Web (LDOW2015). http://lod-cloud.net 2. METHODOLOGY In this section, we first briefly describe the data corpus that we use for our experiments and the different feature sets we derive from the data. We than briefly introduce the classification techniques that we considered and sketch the final experimental setup that was used for the evaluation. 2.1 Data Corpus In order to extract our features for the different datasets which are contained in the LOD cloud, we used the data corpus that was crawled by Schmachtenberg et al. [15] and which was used to draw the most recent LOD cloud diagram. Schmachtenberg et al. used the LD-Spider framework [9] Figure 2: Number of datasets per category con- to gather Linked Data from the Web in April 2014. The tained in the LOD cloud. crawler was seeded with URIs from three different sources: (1) dataset descriptions in lod-cloud group of the datahub.io dataset catalog, as well as other datasets marked with Linked used by a dataset form a helpful indicator for deter- Data related tags within the catalog; (2) a sample of the Bil- mining the topical category of the dataset. Thus, we lion Triple Challenge 2012 dataset2 ; and (3) datasets adver- determine the vocabulary of all terms that are used as tised on the public-lodw3.org mailing list since 2011. The predicates or as the object of a type statement within final crawl contains data from 1 014 different LOD datasets.3 each dataset. Altogether we identified 1 439 different Altogether 188 million RDF triples were extracted from 900 129 vocabularies being used by the datasets (see [15] for documents describing 8 038 396 resources. Figure 1 shows details about the most widely used vocabularies). the distribution of the number of resources and documents per dataset contained in the crawl. Class URIs (CUri): As a more fine-grained feature, the rdfs: and owl:classes which are used to describe entities within a dataset might provide useful informa- tion to determine the topical category of the dataset. Thus, we extracted all the classes that are used by at least two different datasets, resulting in 914 attributes for this feature set. Property URIs (PUri): Beside the class information of an entity, information about which properties are used to describe the entity can be helpful. For example it might make a difference, if a person is described with Figure 1: Distribution of the number of resources foaf:knows statements or if her professional affiliation (− − −) and documents ( ) (log scale) per dataset is provided. To leverage this information, we collected contained in the crawl. all properties that are used within the crawled data by at least two datasets. This feature set consists of 2 333 In order to create the 2014 version of the LOD cloud di- attributes. agram, newly discovered datasets were manually classified into one of the following categories: media, government, pub- Local Class Names (LCN): Different vocabularies might lications, life sciences, geographic, social networking, user- contain synonymous (or at least closely related) terms generated content, and cross-domain. A detailed definition that share the same local name and only differ in their of each category is available in [15]. namespace, e.g. foaf:Person and dbpedia:Person. Figure 2 shows the number of datasets per category con- Creating correspondences between similar classes from tained in the 2014 version of the LOD cloud. As we can different vocabularies reduces the diversity of features, see, the LOD cloud is dominated by datasets belonging to but on the other side might increase the number of the category social networking (48%), followed by govern- attributes which are used by more than one dataset. ment (18%) and publications (13%) datasets. The categories As we lack correspondences between all the vocabu- media and geographic are only represented by less than 25 laries, we bypass this, by using only the local names datasets within the whole corpus. of the type URIs, meaning vocab1:Country and vo- 2.2 Feature Sets cab2:Country are mapped to the same attribute. We used a simple regular expression to determine the lo- For each of the datasets, we created the following eight cal class name checking for #, : and / within the type feature sets based on the crawled data. object. By focusing only on the local part of a class name, we increase the number of classes that are used Vocabulary Usage (VOC): As many vocabularies target by more than one dataset in comparison to CUri and a specific topical domain, e.g. bibo bibliographic in- thus generate 1 041 attributes for the LCN feature set. formation, we assume that the vocabularies that are 2 Local Property Names (LPN): Using the same assump- http://km.aifb.kit.edu/projects/btc-2012/ 3 tion as for the LCN feature set, we also extracted the The crawled data is publicly available: http://data.dws. informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/ local name of each property that is used by a dataset. This results in treating vocab1:name and vocab2:name 2.3 Classification Approaches as a single property. We used the same heuristic for We evaluated the following three classification techniques the extraction as for the LCN feature set and generated on our task of assigning topical categories to LOD datasets. 3 493 different local property names which are used by more than one dataset, resulting in an increase of the k-Nearest Neighbor: k-Nearest Neighbor (k-NN) classifi- number of attributes in comparison to the PUri feature cation models make use of the similarity between new set. cases and known cases to predict the class for the new case. A case is classified by its majority vote of its Text from rdfs:label (LAB): Beside the vocabulary-level neighbors, with the case being assigned to the class features, the names of the described entities might most common among its k nearest neighbors measured also indicate the topical domain of a dataset. We by the distance function. In our experiments we used a thus extracted all values of rdfs:label properties, k equal to 5 with Euclidean-similarity for non-binary lower-cased them, and tokenized the values at space- term vectors and Jaccard-similarity for binary term characters. We further excluded tokens shorter than vectors. three and longer than 25 characters. Afterward, we calculated the TF-IDF value for each token while ex- J48 Decision Tree: A decision tree is a flowchart-like tree cluding tokens that appeared in less than 10 and more structure which is built top-down from a root node and than 200 datasets, in order to reduce the influence of involves some partitioning steps to divide data into noise. This resulted in a feature set consisting of 1 440 subsets that contain instances with similar values. For attributes. our experiments we use the Weka implementation of the C4.5 decision tree [12]. We learn a pruned tree, Top-Level Domains (TLD): Another feature which might using a confidence threshold of 0.25 with a minimum help to assign datasets to topical categories is the top- number of 2 instances per leaf. level domain of the dataset. For instance, govern- ment data is often hosted in the gov top-level domain, Naive Bayes: As a last classification method, we used Naive whereas library data might be found more likely on Bayes (NB). NB uses joint probabilities of some evi- edu or org top-level domains.4 dence to estimate the probability of some event. Al- though this classifier is based on the assumption that In & Outdegree (DEG): In addition to vocabulary-based all features are independent, which is violated in many and textual features, the number of outgoing RDF use cases, NB has shown to work well in practice [14]. links to other datasets and incoming RDF links from other datasets could provide useful information for clas- 2.4 Experimental Setup sifying the datasets. This feature could give a hint In order to evaluate the performance of the three classifi- about the density of the linkage of a dataset, as well cation methods, we use 10-fold cross-validation and report as the way the dataset is interconnected within the the average accuracy in the end. whole LOD cloud ecosystem. As the number of datasets per category is not equally dis- tributed within the LOD cloud, which might influence the We were able to create all features (except LAB) for 1 001 performance of the classification models, we also explore the datasets. As only 470 datasets provide rdfs:labels, we effect of balancing the training data. We used two different only use these datasets for evaluating the utility of the LAB balancing approaches: (1) we down sample the number of feature set. datasets used for training until each category is represented As the total number of occurrences of vocabularies and by the same number of datasets; this number is equal to terms is heavily influenced by the distribution of entities the number of datasets within the smallest category; and within the crawl for each dataset, we apply two different (2) we up sample the datasets for each category until each normalization strategies to the values of the vocabulary-level category is at least represented by the number of datasets features VOC, CUri, PUri, LCN, and LPN: On the one hand equal to the number of datasets of the largest category. The side, we create a binary version (bin) where the feature vec- first approach, reduces the chance to overfit a model into the tors of each feature set consist of 0 and 1 indicating presence direction of the larger represented classes, but it might also and absence of the vocabulary or term. The second version, remove valuable information from the training set, as ex- the relative term occurrence (rto), captures the fraction of amples are removed and not taken into account for learning vocabulary or term usage for each dataset. the model. The second approach, ensures that all possible The following table shows an example of the two different examples are taken into account and no information is lost feature set versions for the terms ti : for training, but by creating the same entity many times can result in emphasizing those particular data points. For Feature Vector example a neighborhood based classifier might look at the Feature Set Version t1 t2 t3 t4 5 nearest neighbors, which than could be one and the same Term Occurrence 10 0 2 8 data point, which would result into looking only at the near- Binary (bin) 1 0 1 1 est neighbor. Relative Term Occurrence (rto) 0.5 0 0.1 0.4 3. RESULTS In the following, we first report the results of our exper- iments using the different feature sets in separation. After- 4 ward, we report the results of experiments combining at- We restrict ourselves to top-level domains, and not public suffixes. tributes from multiple feature sets. Table 1: Results of different single feature sets. Best three single and average results are marked in bold. Classification VOC CUri PUri LCN LPN Approach bin rto bin rto bin rto bin rto bin rto LAB TLD DEG Major Class 51.85 51.85 51.85 51.85 51.85 51.85 51.85 51.85 51.85 51.85 33.62 51.85 51.85 k-NN (no sampling) 77.92 76.33 76.83 74.08 79.81 75.30 76.73 74.38 79.80 76.10 53.62 58.44 49.25 k-NN (down sampling) 64.74 66.33 68.49 60.67 71.80 62.70 68.39 65.35 73.10 62.80 19.57 30.77 29.88 k-NN (up sampling) 71.83 72.53 64.98 67.08 75.60 71.89 68.87 69.82 76.64 70.23 43.97 10.74 11.89 J48 (no sampling) 78.83 79.72 78.86 76.93 77.50 76.40 80.59 76.83 78.70 77.20 63.40 67.14 54.45 J48 (down sampling) 57.65 66.63 65.35 65.24 63.90 63.00 64.02 63.20 64.90 60.40 25.96 34.76 24.78 J48 (up sampling) 76.53 77.63 74.13 76.60 75.29 75.19 77.50 75.92 75.91 74.46 52.64 45.35 29.47 Naive Bayes (no sampling) 34.97 44.26 75.61 57.93 78.90 75.70 77.74 60.77 78.70 76.30 40.00 11.99 22.88 Naive Bayes (down sampling) 64.63 69.14 64.73 62.39 68.10 66.60 70.33 61.58 68.50 69.10 33.62 20.88 15.99 Naive Bayes (up sampling) 77.53 44.26 74.98 55.94 77.78 76.12 76.02 58.67 76.54 75.71 37.82 45.66 14.19 Average (no sampling) 63.91 66.77 77.10 69.65 78.73 75.80 78.35 70.66 79.07 76.53 52.34 45.86 42.19 Average (down sampling) 62.34 67.34 66.19 62.77 67.93 64.10 67.58 63.38 68.83 64.10 26.38 28.80 23.55 Average (up sampling) 75.30 64.81 71.36 66.54 76.22 74.40 74.13 68.14 76.36 73.47 44.81 33.92 18.52 3.1 Results for Single Feature Sets Table 2 reports the results for the five different combined Table 1 shows the accuracy that is reached using the three feature sets: different classification algorithms with and without balanc- ing the training data. Majority Class is the performance ALLrto : Combination of the attributes from all eight fea- of a default baseline classifier always predicting the largest ture sets, using the rto version of the vocabulary-based class: social networking. features. As a general observation, the vocabulary-based feature ALLbin : Combination of the attributes from all eight fea- sets (VOC, LCN, LPN, CUri, PUri) perform on a similar ture sets, using the bin version of the vocabulary-based level, where DEG and TLD alone show a relatively poor features. performance and in some cases are not at all able to beat the majority class baseline. Classification models based on NoLabrto : Combination of the attributes from all feature, the attributes of the LAB feature set perform on average without the attributes of the LAB feature set, using (without sampling) around 20% above the majority base- the rto version of the vocabulary-based features. line, but predict still in half of all cases the wrong category. Algorithm-wise, the best results are achieved using the deci- NoLabbin : Combination of the attributes from all feature, sion tree (J48) without balancing (maximal accuracy 80.59% without the attributes of the LAB feature set, using for LCNrto ) and the k-NN algorithm, also without balanc- the bin version of the vocabulary-based features. ing for the PUribin and LPNbin feature sets. Comparing the two balancing approaches, we see better results using Best3: Includes the attributes from the three best perform- the up sampling approach for almost all feature sets (except ing feature sets from the previous section based on VOCrto and DEG). In most cases, the category-specific ac- their average accuracy: PUribin , LCNbin , and LPNbin . curacy of the smaller categories is higher when using up sampling. Using down sampling the learned models make We can observe that when selecting a larger set of at- more errors for predicting the larger categories. Further- tributes, our model is able to reach a slightly higher accuracy more, when comparing the results of the models trained on of 81.62% than using just the attributes from one feature set unbalanced data with the best model trained on balanced (80.59%, LCNbin ). Still the trained model is unsure for cer- data, the models on the unbalanced data are more accurate tain decisions and has a stronger bias towards the categories except for the VOCbin feature set. Having a closer look at publications and social networking. the confusion matrices, we see that the balanced approaches are in general making more errors when trying to predict 4. DISCUSSION datasets for the larger categories, like social networking and In the following, we look at the best performing approach government. (Naive Bayes trained on the attributes of the NoLabbin fea- ture set using up sampling). Table 3 shows the confusion matrix of this experiment, where on the left side we list the 3.2 Results for Combined Feature Sets predictions by the learned model, while the head names the For our second set of experiments, we combine the avail- actual category of the dataset. As observed in the table, able attributes from the different feature sets and train again there are three kinds of errors which occur more frequently our classification models using the three described algorithms. than 10 times. As before, we generate a binary and relative term occurrence The most common confusion occurs for the publication version of the vocabulary-based features. In addition, we domain, where a larger number of datasets are predicted to create a second set (binary and relative term occurrence), belong to the government domain. A reason for this is that where we omit the attributes from the LAB feature set, as government datasets often contain metadata about govern- we wanted to measure the influence of this particular set ment statistics which are represented using the same vocab- of attributes, which is only available for less than half of ularies and terms (e.g. skos:Concept) that are also used the datasets. Furthermore we created a combined set of at- in the publication domain. This makes it challenging for tributes consisting of the three best performing feature sets a vocabulary-based classifier to distinguish those two cate- from the previous section. gories apart. In addition, for example the http://mcu.es Table 2: Results of combined feature sets. Best three results in bold. Classification Accuracy in % Approach ALLbin ALLrto NoLabbin NoLabrto Best3 k-NN (no sampling) 74.93 71.73 76.93 72.63 75.23 k-NN (down sampling) 52.76 46.85 65.14 52.05 64.44 k-NN (up sampling) 74.23 67.03 71.03 68.13 73.14 J48 (no sampling) 80.02 77.92 79.32 79.01 75.12 J48 (down sampling) 63.24 63.74 65.34 65.43 65.03 J48 (up sampling) 79.12 78.12 79.23 78.12 75.72 Naive Bayes (no sampling) 21.37 71.03 80.32 77.22 76.12 Naive Bayes (down sampling) 50.99 57.84 70.33 68.13 67.63 Naive Bayes (up sampling) 21.98 71.03 81.62 77.62 76.32 Table 3: Confusion matrix for the NoLabbin feature Although topical profiling has been studied in other set- set, with Naive Bayes classification model, balanced tings before, only a small number of methods exist for profil- by up sampling. ing LOD datasets. These methods can be categorized based True Category on the general learning approach that is employed into the social networking usergen. content categories unsupervised and supervised. Where the first cat- egory does not rely on labeled input data, the latter is only crossdomain publications government lifesciences geographic applicable for labeled data. Ellefi et al. [5] try to define the profile of datasets using media semantic and statistical characteristics. They use statistics Prediction about vocabulary, property, and datatype usage, as well as social networking 489 4 5 10 2 4 11 1 crossdomain 1 10 3 1 1 0 1 1 statistics on property values, like string lengths, for char- publications 8 10 54 9 4 4 2 2 acterizing datasets. For classification, they propose a fea- government 3 4 14 151 1 2 0 2 ture/characteristic generation process, starting from the top lifesciences 5 3 12 0 72 2 5 5 discovered types of a dataset and generating property/value media 6 3 4 1 1 7 2 0 usergen. content 6 1 1 2 0 2 26 0 pairs. In order to integrate the property/value pairs they geographic 1 5 1 5 1 0 0 8 consider the problem of vocabulary heterogeneity of the datasets by defining correspondences between features in different vo- cabularies. The authors have pointed out that it is essen- dataset – the Ministry of Culture in Spain – was manu- tial to automate the feature generations and proposed the ally labeled as publication within the LOD cloud, whereas framework to do so, but do not evaluate their approach on the model predicts government which turns out to be a real-world datasets. In our work, we draw from their ideas of borderline case in the gold standard. A similar frequent using schema-usage characteristics as features for the topical problem is the prediction of life sciences for datasets in the classification, but focus on LOD datasets. publications category. This can be observed, e.g., for the An approach to detect latent topics in entity-relation- http://ns.nature.com/publications/, which describe the ship graphs is introduced by Böhm et al. [4]. Their ap- publications in Nature. Those publications, however, are of- proach works in two phases: (1) A number of subgraphs ten in the life sciences field, which makes the labeling in the having strong relations between classes are discovered from gold standard a borderline case. the whole graph, and (2) the subgraphs are combined to The third most common confusion occurs between the generate a larger subgraph, which is assumed to represent usergenerated content and the social networking domain. a latent topic. Their approach explicitly omits any kind of Here, the problem is in the shared use of similar vocabular- features based on textual representations and solely relies on ies, such as foaf. At the same time, labeling a dataset as ei- the exploitation of the underlying graph. Böhm et al. used ther one of the two is often not so simple. In [15], it has been the DBpedia dataset to evaluate their approach. defined that social networking datasets should focus on the Fetahu et al. [6] propose an approach for creating dataset presentation of people and their interrelations, while user- profiles represented by a weighted dataset-topic graph which generated content should have a stronger focus on the con- is generated using the category graph and instances from tent. Datasets from personal blogs, such as www.wordpress.com, DBpedia. In order to create such profiles, a processing however, can convey both aspects. Due to the labeling rule, pipeline that combines tailored techniques for dataset sam- these datasets are labeled as usergenerated content, but our pling, topic extraction from reference datasets, and relevance approach frequently classifies them as social networking. ranking is used. Topics are extracted using named-entity- In summary, while we observe some true classification er- recognition techniques, where the ranking of the topics is rors, many of the mistakes made by our approach actually based on their normalized relevance score for a dataset. point at datasets which are difficult to classify, and which While the mentioned approaches are unsupervised, we em- are rather borderline cases between two categories. ploy supervised learning techniques as we want to exploit the existing topical annotation of the datasets in the LOD cloud. 5. RELATED WORK Topical profiling has been studied in the data mining, database, and information retrieval communities. The re- 6. CONCLUSION AND FUTURE WORK sulting methods find application in domains such as docu- In this paper, we investigate to which extent the topical ments classification, contextual search, content management classification of new LOD datasets can be automated using and review analysis [1, 11, 2, 16, 17]. machine learning techniques. Our experiments indicate that vocabulary-level features are a good indicator for the topical efficiently generating structured dataset topic profiles. domain, yielding an accuracy of around 82%. In The Semantic Web: Trends and Challenges - 11th The analysis of the limitations of our approach, i.e., the International Conference, ESWC 2014, Anissaras, cases where the automatic classification deviates from the Crete, Greece, May 25-29, 2014. Proceedings, pages manually labeled one, points to a problem of the categoriza- 519–534, 2014. tion approach that is currently used for the LOD cloud: All [7] L. Getoor and C. P. Diehl. Link mining: a survey. datasets are labeled with exactly one topical category, al- ACM SIGKDD Explorations Newsletter, 2005. though sometimes two or more categories would be equally [8] T. Heath and C. Bizer. Linked data: Evolving the web appropriate. One such example are datasets describing life into a global data space. Synthesis lectures on the science publications, which can be either labeled as publica- semantic web: theory and technology, 1(1):1–136, 2011. tions or as life sciences. Thus, the LOD dataset classifica- [9] R. Isele, J. Umbrich, C. Bizer, and A. Harth. tion task might be more suitably formulated as a multi-label LDSpider: An open-source crawling framework for the classification problem [18, 10]. web of linked data. In Proc. ISWC ’10 –Posters and A particular challenge of the classification is the heavy Demos, 2010. imbalance of the dataset categories, with roughly half of the [10] B. Liu, X. Li, W. S. Lee, and P. S. Yu. Text datasets belonging to the social networking domain. Here, a classification by labeling words. In Proceedings of the two-stage approach might help, in which a first classifier tries Nineteenth National Conference on Artificial to separate the largest category from the rest, while a second Intelligence, Sixteenth Conference on Innovative classifier then tries to make a prediction for the remaining Applications of Artificial Intelligence, July 25-29, classes. When regarding the problem as a multi-label prob- 2004, San Jose, California, USA, pages 425–430, 2004. lem, the corresponding approach would be classifier chains, [11] J. Nam, J. Kim, E. Loza Mencı́a, I. Gurevych, and which make a prediction for one class after the other, taking J. Fürnkranz. Large-scale multi-label text the prediction of the first classifiers into account as a feature classification - revisiting neural networks. In Machine for the remaining classifications [13]. Learning and Knowledge Discovery in Databases - In our experiments, RDF links have not been exploited European Conference, ECML PKDD 2014, Nancy, beyond dataset in- and out-degree. For the task of web France, September 15-19, 2014. Proceedings, Part II, page classification, link-based classification techniques, that pages 437–452, 2014. exploit the contents of web pages linking to a particular page, often yields good results [7] and it is possible that [12] J. R. Quinlan. C4.5: Programs for Machine Learning. such techniques could also work well for classifying LOD Morgan Kaufmann Publishers Inc., San Francisco, datasets. CA, USA, 1993. [13] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. In Acknowledgements European Conference on Machine Learning and This research has been supported in part by FP7/2013-2015 Knowledge Discovery in Databases, 2009. COMSODE (under contract number FP7-ICT-611358). [14] I. Rish. An empirical study of the naive bayes classifier. In IJCAI 2001 workshop on empirical 7. REFERENCES methods in artificial intelligence, volume 3, pages [1] C. C. Aggarwal and C. Zhai. A survey of text 41–46. IBM New York, 2001. clustering algorithms. In Mining Text Data, pages [15] M. Schmachtenberg, C. Bizer, and H. Paulheim. 77–128. 2012. Adoption of the linked data best practices in different [2] T. Basu and C. A. Murthy. Effective text classification topical domains. In The Semantic Web–ISWC 2014, by a supervised feature selection approach. In 12th pages 245–260. Springer, 2014. IEEE International Conference on Data Mining [16] P. Shivane and R. Rajani. A survey on effective Workshops, ICDM Workshops, Brussels, Belgium, quality enhancement of text clustering & classification December 10, 2012, pages 918–925, 2012. using metadata. [3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - [17] G. Song, Y. Ye, X. Du, X. Huang, and S. Bie. Short the story so far. Int. J. Semantic Web Inf. Syst., text classification: A survey. Journal of Multimedia, 5(3):1–22, 2009. 9(5):635–643, 2014. [4] C. Böhm, G. Kasneci, and F. Naumann. Latent topics [18] G. Tsoumakas and I. Katakis. Multi label in graph-structured data. In 21st ACM International classification: An overview. International Journal of Conference on Information and Knowledge Data Warehousing and Mining, 3(3):1–13, 2007. Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012, pages 2663–2666, 2012. [5] M. B. Ellefi, Z. Bellahsene, F. Scharffe, and K. Todorov. Towards semantic dataset profiling. In Proceedings of the 1st International Workshop on Dataset PROFIling & fEderated Search for Linked Data co-located with the 11th Extended Semantic Web Conference, PROFILES@ESWC 2014, Anissaras, Crete, Greece, May 26, 2014., 2014. [6] B. Fetahu, S. Dietze, B. P. Nunes, M. A. Casanova, D. Taibi, and W. Nejdl. A scalable approach for