=Paper=
{{Paper
|id=Vol-2465/profiles_paper3
|storemode=property
|title=Towards Automatic Domain Classification of LOV Vocabularies
|pdfUrl=https://ceur-ws.org/Vol-2465/profiles_paper3.pdf
|volume=Vol-2465
|authors=Alexis Pister,Ghislain Auguste Atemezing
|dblpUrl=https://dblp.org/rec/conf/semweb/PisterA19a
}}
==Towards Automatic Domain Classification of LOV Vocabularies==
Towards Automatic Domain Classification of LOV Vocabularies

Alexis Pister, Ghislain Atemezing
MONDECA, 35 boulevard de Strasbourg, 75010 Paris, France

Abstract. Assigning a topic or a domain to a vocabulary in a catalog is not always a trivial task. Fortunately, ontology experts can draw on their previous experience to achieve it. In the case of Linked Open Vocabularies (LOV), the small number of curators (only four people) and the high number of submissions call for automatic solutions that suggest to curators a domain in which to attach a newly submitted vocabulary. This paper proposes a machine learning approach to automatically classify newly submitted vocabularies in LOV, using statistical models that take as input any textual descriptions found in a vocabulary. The results show that the Support Vector Machine (SVM) model gives the best micro F1-score, 0.36. An evaluation on twelve vocabularies used for testing the classifier sheds light on a possible integration of the results to assist curators in assigning domains to vocabularies in the future.

Keywords: Ontologies, Classification, Machine Learning, Linked Open Vocabularies

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Linked Open Data (LOD) refers to the ecosystem of openly published structured data that follows standard web technologies such as RDF, URIs and HTTP. As the amount of available data grows, new datasets following these principles keep appearing. Linked Open Vocabularies (LOV, https://lov.linkeddata.es/dataset/lov/) is an initiative which aims to reference all the vocabularies published on the Web following best practices guided by the FAIR (Findable, Accessible, Interoperable, Reusable) principles. Each vocabulary can be seen as a knowledge graph describing its own properties and purpose, and it can be connected to other vocabularies by different types of links. Therefore, LOV can be seen as a knowledge graph of interlinked vocabularies [16] accessible on the Web of data.

When a new ontology is submitted for integration into LOV, a curator needs to assign to the vocabulary at least one tag representing a domain or a category, among the 43 existing categories, such as "Environment", "Music" or "W3C REC". A category aims at grouping ontologies according to a domain. For example, the tag "W3C REC" represents ontologies recommended by the World Wide Web Consortium (W3C), such as rdf or owl. As the number of domains increases and some vocabularies can be relatively small (in this paper, the terms ontology and vocabulary are used interchangeably), the tagging process can be biased. Figure 1 depicts the list of tags available in LOV at the time of writing, while Figure 2 depicts their distribution. One of the benefits of assigning a tag to a vocabulary is to index it by domain and make it easy to access from the interface. For example, to access vocabularies in the IoT domain, the direct URL in LOV is https://lov.linkeddata.es/dataset/lov/vocabs?tag=IoT. Additionally, any newly added vocabulary should belong to at least one domain.

Fig. 1: A view of the list of the tags available in the LOV backend used for classifying ontologies

We propose a machine learning approach to automatically classify newly submitted vocabularies with statistical models which take texts describing the subjects of the vocabularies as input.
Indeed, the majority of the graphs contain a lot of text describing the subjects and the properties of the vocabularies, in the form of string literals. For example, a URI in a given ontology (a class or a property) is often described by the predicate rdfs:comment, whose value is a free-text comment about the resource. Other predicates, such as rdfs:label or dct:description, are also linked to informative texts. We use all this textual information to train several machine learning models in order to classify the vocabularies into different categories. This paper is structured as follows: Section 2 describes related work in graph classification, followed by the machine learning approach used to build the classifier in Section 3. Section 4 presents the classification results, Section 5 provides an evaluation of our approach, and Section 6 concludes.

Fig. 2: Distribution of the tags among the vocabularies in LOV

2 Related Work

Graph classification is a problem well studied in the literature. Several strategies have been developed to tackle it, such as kernel methods or, more recently, graph neural networks [13]. However, far less work addresses knowledge graph classification. The closest problems are entity and triple classification, which consist in categorizing a very small subset of a knowledge graph [17]. Knowledge graphs are mainly described by their entities and relations, so it is very difficult to measure similarity between knowledge graphs that share few or no common entities or relations, which is often the case. This is why we depart from traditional graph classification methods and adopt a text mining strategy instead. A large body of work exists on document classification [1]. Various processing methods such as Bag-of-Words or Latent Semantic Analysis (LSA) produce representations that can be easily exploited by machine learning algorithms.

Classifying datasets created with semantic technologies has also been studied in the literature. The closest works are described in [7] and [15]. Meusel et al. present a methodology to automatically classify LOD datasets into the categories of the LOD cloud diagram. The paper uses eight feature sets extracted from the LOD datasets, among them the text of rdfs:label values. One of the main conclusions of the paper is that vocabulary-level features are good indicators of the topical domain.

Like the aforementioned approach, we use supervised learning, but we apply two additional steps to prepare the corpus for the classifier: a Bag-of-Words transformation and a truncated SVD transformation. Additionally, our corpus is much smaller, owing to the size of vocabularies compared to entire LOD datasets, and the number of available tags is higher (43 in LOV compared to 8 for the LOD cloud).

3 Data Preparation and Machine Learning Models

3.1 Data Preparation

Our approach is to use the texts contained in the vocabularies to classify them into categories. Indeed, the subject of an RDF graph and the purpose of its entities are usually described in string literals attached to specific predicates.
We first extract this relevant textual information (string literals) from each graph (a dump of the latest version of the vocabulary in N3) and concatenate it into one paragraph describing its subject. To this end, we download the most recent version of each vocabulary tracked by the LOV SPARQL endpoint and import it into a graph object with RDFLib (https://github.com/RDFLib/rdflib). Listing 1.1 shows the SPARQL query used to retrieve the latest version of each vocabulary, along with its domains and unique prefix.

<pre>
SELECT DISTINCT ?vocabPrefix ?domain ?versionUri {
  GRAPH {
    ?vocab a voaf:Vocabulary .
    ?vocab vann:preferredNamespacePrefix ?vocabPrefix .
    ?vocab dcterms:modified ?modified .
    ?vocab dcat:keyword ?domain .
    ?vocab dcat:distribution ?versionUri .
    BIND (STRAFTER(STR(?versionUri), "/versions/") AS ?v)
    BIND (STRBEFORE(STR(?v), ".") AS ?v1)
    BIND (STR(?modified) AS ?date)
    FILTER (?date = ?v1)
  }
}
GROUP BY ?vocabPrefix ?domain ?versionUri
ORDER BY ?vocabPrefix ?domain ?versionUri
</pre>
Listing 1.1: SPARQL query to retrieve the latest versions of the vocabularies stored in LOV

We then concatenate all the string values of predicates whose name ends with one of these suffixes: comment, description, label and definition. The predicate rdfs:label is often used to give a natural-language name to a URI, while the suffixes comment, description and definition give insight into the meaning and purpose of a given ontology or entity. The result of this step is one paragraph per vocabulary. As the texts describe the RDF properties of the graphs, they often contain the local names of these properties, formed of several words written without spaces, in camel case. For example, if an extracted text mentions the property "UnitPriceSpecification", this expression remains a single token in the final text, which can bias the statistical model applied to the data. Consequently, we separate such expressions with spaces whenever an uppercase letter occurs in the middle of a word, so "UnitPriceSpecification" becomes "Unit Price Specification" in the final text. After this transformation, the vocabulary of the whole corpus contains 21,435 different words. The mean word count per paragraph is 1,168.5, the maximum is 86,208 and the minimum is 0. Two paragraphs are empty and 25 of them have fewer than 20 words.
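The Python sketch below illustrates this extraction and camel-case splitting step under our reading of the approach; the dump file name ("rooms.n3") and the suffix-matching heuristic are assumptions for illustration, not the authors' exact code.

<pre>
# Minimal sketch of the text-extraction step, assuming a local N3 dump of one
# vocabulary (the file name "rooms.n3" is hypothetical). It keeps the literals of
# predicates whose name ends with comment, description, label or definition, and
# splits camel-case tokens as described above.
import re
from rdflib import Graph, Literal

SUFFIXES = ("comment", "description", "label", "definition")

def split_camel_case(text: str) -> str:
    # "UnitPriceSpecification" -> "Unit Price Specification"
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

def vocabulary_to_paragraph(dump_path: str) -> str:
    g = Graph()
    g.parse(dump_path, format="n3")
    chunks = []
    for _, predicate, obj in g:
        # keep only string literals attached to the selected predicates
        if isinstance(obj, Literal) and str(predicate).lower().endswith(SUFFIXES):
            chunks.append(split_camel_case(str(obj)))
    return " ".join(chunks)

print(vocabulary_to_paragraph("rooms.n3")[:300])
</pre>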
The text describing the rooms vocabulary (https://lov.linkeddata.es/dataset/lov/vocabs/rooms) obtained with the pre-processing step described in this section is presented in Listing 1.2. This ontology describes the rooms one can find in a building and has the following assigned tags in LOV: Geography and Environment.

<pre>
Floor Section. Contains. Desk. Building. Floor. A space inside a structure, typically
separated from the outside by exterior walls and from other rooms in the same structure
by internal walls. A human-made structure used for sheltering or continuous occupancy.
Site. A simple vocabulary for describing the rooms in a building. An agent that generally
occupies the physical area of the subject resource. Having this property implies being a
spatial object. Being the object of this property implies being an agent. Intended for
use with buildings, rooms, desks, etc. Room. The object resource is physically and
spatially contained in the subject resource. Being the subject or object of this property
implies being a spatial object. Intended for use in the context of buildings, rooms, etc.
A table used in a work or office setting, typically for reading, writing, or computer use.
A named part of a floor of a building. Typically used to denote several rooms that are
grouped together based on spatial arrangement or use. A level part of a building that has
a permanent roof. A storey of a building. Occupant. An area of land with a designated
purpose, such as a university Campus, a housing estate, or a building site.
</pre>
Listing 1.2: Paragraph describing the rooms vocabulary, obtained with the preprocessing pipeline described in Section 3.

3.2 Machine Learning Models

As we cannot feed text paragraphs directly to the machine learning models, we apply a processing pipeline that transforms the texts into fixed-size attribute vectors. For this purpose, we use several techniques described in [14]: we first apply a Bag-of-Words (BoW) transformation, mapping each text to a vector of frequencies of the words and 2- and 3-grams whose document frequency lies between 0.025 and 0.25. Then, Term Frequency-Inverse Document Frequency (TF-IDF) weighting is applied to normalize the frequencies of the words and ngrams by the length of each document. Finally, we apply Latent Semantic Analysis (LSA) [3], a dimensionality reduction technique based on a linear algebra method called truncated SVD, to map the space of word frequencies to a smaller space of concepts. Indeed, the dimension of the TF-IDF vectors is large, as it corresponds to the number of words used in the whole corpus plus the frequent ngrams (21,435), and it is well known that a high number of attributes often impacts a machine learning approach negatively [2]. We tried different values of n, the dimension of the reduced vector space: 50, 150 and 300. These attribute vectors are then used as input for the machine learning classifiers. The entire processing pipeline is summarized in Figure 3.

Fig. 3: Schematic view of the processing pipeline. From left to right, the diagram depicts the different steps: 1-Text extraction from vocabulary dump; 2-BoW transformation; 3-Normalization with TF-IDF; 4-Vector dimension reduction; and finally the classifiers.
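This feature pipeline can be expressed with scikit-learn roughly as follows; the exact vectorizer settings (ngram range, document-frequency cut-offs) are our reading of the description above rather than the authors' published code.

<pre>
# Minimal sketch of the feature pipeline (BoW -> TF-IDF -> truncated SVD) with
# scikit-learn. Settings are assumptions inferred from the text: unigrams plus
# 2- and 3-grams kept when their document frequency lies between 0.025 and 0.25,
# TF-IDF weighting, then an LSA projection to n dimensions.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD

def build_feature_pipeline(n_components: int = 150) -> Pipeline:
    return Pipeline([
        ("bow", CountVectorizer(ngram_range=(1, 3), min_df=0.025, max_df=0.25)),
        ("tfidf", TfidfTransformer()),
        ("lsa", TruncatedSVD(n_components=n_components, random_state=0)),
    ])

# `paragraphs` would be the list of per-vocabulary texts from the extraction step:
# X = build_feature_pipeline(150).fit_transform(paragraphs)
</pre>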
We then split the data into a training set (80% of the vocabularies) and a test set (the remaining 20%). The LOV dataset version used for the experiment is the snapshot of May 7th, 2019 (https://tinyurl.com/lovdataset), containing 666 vocabularies. We claim that the approach described in this paper can be replicated for any multi-label machine learning task that takes a knowledge graph as input.

As each vocabulary can have one to many tags, we tackle the problem as a multi-label classification task. A machine learning model is trained on the training set, trying to find relations between the attributes describing the graphs and their labels. The trained model is then applied to the test set. The predicted labels are finally compared to the ones assigned by human curators, and the micro-averaged precision, recall and F1-measure are computed, which are standard supervised learning metrics [11]. We tested several machine learning models with the Python library scikit-learn [10], with an emphasis on the Support Vector Machine (SVM) and the Multi-Layer Perceptron (MLP), which rank among the best classifiers for text classification tasks, mainly because they can handle large feature spaces [4, 12]. The K-Nearest-Neighbors (KNN) and Random Forest (RF) classifiers were tested as well; like the MLP, they natively support multi-label classification. However, we had to apply a One-vs-Rest strategy for the SVM [9], which consists in training a separate binary classifier for each label. The MLP has one hidden layer of size 100 with a Rectified Linear Unit (ReLU) activation function, i.e. f(z) = 0 for z < 0 and f(z) = z otherwise. For the SVM, we set C = 10 and gamma = 1, with a radial basis function (RBF) kernel (https://en.wikipedia.org/wiki/Radial_basis_function_kernel) and uniform class weights. We chose k = 7 for the KNN model.
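A sketch of how these classifiers and metrics could be set up in scikit-learn is given below; the variable names X (the LSA vectors) and tag_lists (the per-vocabulary tag lists) are illustrative assumptions, the hyperparameters are the ones reported above, and everything else is left at library defaults.

<pre>
# Minimal sketch of the multi-label classifiers and evaluation, under the assumptions
# stated above (X: LSA feature matrix, tag_lists: list of tag lists per vocabulary).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

Y = MultiLabelBinarizer().fit_transform(tag_lists)   # multi-label indicator matrix
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

models = {
    "SVM": OneVsRestClassifier(SVC(C=10, gamma=1, kernel="rbf")),  # one binary SVM per tag
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), activation="relu"),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "RF": RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, Y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        Y_test, model.predict(X_test), average="micro")
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
</pre>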
4 Results

The results of the classification for the 4 machine learning models, using n = 50, 150 and 300 for the truncated SVD, are presented in Table 1. The MLP and the SVM give the best micro F1-scores, respectively 0.34 and 0.36, with n = 150.

<pre>
          n = 50                 n = 150                n = 300
      Precision Recall  F1   Precision Recall  F1   Precision Recall  F1
SVM     0.22     0.50  0.31    0.39     0.32  0.36    0.47     0.23  0.31
RF      0.74     0.07  0.12    0.70     0.03  0.07    0.68     0.02  0.04
MLP     0.33     0.32  0.33    0.34     0.33  0.34    0.33     0.25  0.29
KNN     0.62     0.10  0.17    0.65     0.06  0.11    0.58     0.06  0.11
</pre>
Table 1: Results of the classification on the test set for the 4 machine learning algorithms, with 3 values of the dimension of the feature space.

5 Evaluation and Discussion

In this section, we describe the evaluation of the classifier on newly submitted ontologies in LOV, and we discuss the results obtained by comparing them with the manual assignments of two curators.

5.1 Evaluation

To evaluate our model, we took a list of 12 vocabularies in the back-end of LOV and asked two curators to assign domains to each vocabulary. Then, we passed the same vocabularies to the SVM classifier. The classifier's results are compared with the human-assigned tags in Table 2.

Table 2: Comparison of tags suggested by the classifier and the curator. Tags marked with an asterisk (*) are matched by both the human and the SVM classifier.
* https://w3id.org/vir | Curator: Multimedia | Classifier: Support
* https://w3id.org/usability | Curator: Support, Events | Classifier: API
* https://www.w3.org/ns/solid/terms# | Curator: Services*, General & Upper* | Classifier: Services*, General & Upper*, RDF
* http://ns.inria.fr/munc/v2# | Curator: Metadata | Classifier: RDF
* https://w3id.org/arco/ontology/core | Curator: Services, Society | Classifier: Catalogs, Events, Government, Multimedia
* https://w3id.org/arco/ontology/catalogue | Curator: Catalogs*, Society | Classifier: Catalogs*, Events, Government, Multimedia
* https://w3id.org/arco/ontology/context-description | Curator: Support, General & Upper | Classifier: Catalogs, Environment, Events, Government, Multimedia
* https://w3id.org/arco/ontology/denotative-description | Curator: Support, General & Upper | Classifier: Catalogs, Environment, Events, Government, Multimedia
* https://w3id.org/arco/ontology/cultural-event | Curator: Events*, Society | Classifier: Events*, Catalogs, Government, Multimedia
* https://w3id.org/arco/ontology/location | Curator: Geography, Geometry | Classifier: Catalogs, Events, Government, Multimedia
* https://w3id.org/arco/ontology/arco | Curator: General & Upper | Classifier: Catalogs, Environment, Events, Government, Multimedia
* https://w3id.org/cocoon/v1.0 | Curator: Services*, Contracts | Classifier: Industry, Services*

As the main goal of the system is to suggest recommendations to a curator, we compute a soft accuracy metric, corresponding to the number of graphs with at least one match between the curator tags and the classifier suggestions, divided by the total number of tested vocabularies. For a vocabulary $i$ with associated tags $y_i = \{y_{i1}, y_{i2}, \ldots, y_{il}\}$ and classifier prediction $y_i^{pred} = \{y_{i1}^{pred}, y_{i2}^{pred}, \ldots, y_{im}^{pred}\}$, we say that the classifier is softly accurate for vocabulary $i$ if $\exists\, y_{ik}^{pred} \in y_i^{pred}$ such that $y_{ik}^{pred} \in y_i$. The soft accuracy is then the ratio of the number $p$ of softly accurate outputs to the total number $n$ of inputs. We get a result of 0.33 for this evaluation.
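For clarity, a minimal sketch of this soft accuracy computation is given below; the function and variable names are illustrative.

<pre>
# Minimal sketch of the soft accuracy metric defined above: a prediction counts as
# softly accurate when at least one predicted tag is among the curator tags.
def soft_accuracy(curator_tags, predicted_tags):
    hits = sum(
        1 for truth, pred in zip(curator_tags, predicted_tags)
        if set(pred) & set(truth)          # at least one tag in common
    )
    return hits / len(curator_tags)

# From Table 2, 4 of the 12 vocabularies have at least one matching tag,
# hence the reported soft accuracy of 4/12 = 0.33.
</pre>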
5.2 Discussion

The results are average regarding the precision of the classifier's suggestions compared to the curators'. There could be several explanations, such as the imbalance of the tags in the dataset (13 labels are used in fewer than 10 vocabularies) or the diversity of subjects among vocabularies tagged with the same label. For example, the "Geography" tag is used for both the rooms and the Postcode (https://lov.linkeddata.es/dataset/lov/vocabs/postcode) ontologies, whereas they describe completely different things; we can therefore expect different word usage and very different feature vectors. Furthermore, multi-label classification for tag recommendation is a hard task, especially when, as in this setting, the number of possible tags is high (43) and the number of examples is low (666) [5]. It has been demonstrated that SVM classifiers work well for text classification problems, but their performance decreases strongly as the number of labels increases [6]. The list of domains grows according to need, and some tags have a more organizational function. For example, LOV curators introduced the IoT tag to group all the vocabularies related to the IoT domain. Historically, some of the tags are related to W3C vocabulary recommendations (W3C REC).

6 Conclusion and Future Work

This paper addresses one main issue: building and evaluating a classifier based on the content of the LOV catalog using machine learning techniques. The final goal of this work is to provide the human curators of vocabularies with a list of recommendations for a new ontology submitted in the back-end. The implemented classifier gives a micro F1-score of 36%. Although this score seems low, the system will not be used without a human who validates or rejects the suggested tags. We do not intend to compare the system with the human curator. Instead, we want a system that reduces the possible risk of bias when assigning domains to vocabularies and suggests tags to the curator. Future work includes ingesting the curators' feedback into the classifier to learn from newly added vocabularies in a continuous learning workflow, and testing deep learning models with a transfer learning strategy to overcome the low number of training examples. Indeed, deep learning approaches can perform well on multi-label classification, but they need many training examples [8].

References

1. M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut. A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919, 2017.
2. P. F. Evangelista, M. J. Embrechts, and B. K. Szymanski. Taming the curse of dimensionality in kernels and novelty detection. In Applied Soft Computing Technologies: The Challenge of Complexity, pages 425–438. Springer, 2006.
3. N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
4. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer, 1998.
5. I. Katakis, G. Tsoumakas, and I. Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD, volume 18, 2008.
6. T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. ACM SIGKDD Explorations Newsletter, 7(1):36–43, 2005.
7. R. Meusel, B. Spahiu, C. Bizer, and H. Paulheim. Towards automatic topical classification of LOD datasets. In Workshop on Linked Data on the Web (LDOW), co-located with the 24th International World Wide Web Conference (WWW), volume 1409, 2015.
8. J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz. Large-scale multi-label text classification: revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer, 2014.
9. M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
10. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
11. D. M. Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
12. M. E. Ruiz and P. Srinivasan. Automatic text categorization using neural networks. In Proceedings of the 8th ASIS SIG/CR Workshop on Classification Research, pages 59–72, 1998.
13. F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
14. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
15. B. Spahiu, A. Maurino, and R. Meusel. Topic profiling benchmarks in the linked open data cloud: Issues and lessons learned. Semantic Web, (Preprint):1–20, 2019.
16. P.-Y. Vandenbussche, G. A. Atemezing, M. Poveda-Villalón, and B. Vatant. Linked Open Vocabularies (LOV): A gateway to reusable semantic vocabularies on the Web. Semantic Web, 8(3):437–452, 2017.
17. Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.