Background knowledge for ontology construction

Blaz Fortuna, Marko Grobelnik and Dunja Mladenic
Institute Jozef Stefan, Slovenia, email: {blaz.fortuna, marko.grobelnik, dunja.mladenic}@ijs.si

Abstract. In this paper we describe a solution for incorporating background knowledge into the OntoGen system for semi-automatic ontology construction. This makes it easier for different users to construct different, more personalized ontologies for the same domain. To achieve this we introduce a word weighting schema to be used in the document representation. The weighting schema is learned from the background knowledge provided by the user and is then used by OntoGen's machine learning and text mining algorithms.

1 INTRODUCTION

When using ontology-based techniques for knowledge management it is important for the ontology to capture the domain knowledge in a proper way. Very often different tasks and users require the knowledge to be encoded into the ontology in different ways, depending on the task. For instance, the same document database in a company may be viewed differently by marketing, management, and technical staff. It is therefore crucial to develop techniques for incorporating the user's background knowledge into ontologies.

In [4] we introduced a system called OntoGen for semi-automatic construction of topic ontologies. A topic ontology consists of a set of topics (or concepts) and a set of relations between the topics which best describe the data. The OntoGen system helps the user by discovering possible concepts and relations between them within the data.

In this paper we propose a method which extends the OntoGen system so that the user can supervise the methods for concept discovery by providing background knowledge - his specific view on the data used by the text mining algorithms in the system. To encode the background knowledge we require the user to group documents into categories. These categories do not need to describe the data in detail; the important thing is that they show the system the user's view of the data - which documents are similar and which are different from the user's perspective. The process of manually labeling the documents with categories is time consuming, but it can be significantly sped up by the use of active learning [5], [8]. Another source of such labeled data could be popular online tagging services (e.g. Del.icio.us), which allow users to label the websites of interest with labels of their own choosing.

This paper is organized as follows. In Section 2 we introduce the OntoGen system and in Section 3 we derive the algorithm for calculating word weights. We conclude the paper with some preliminary results in Section 4.

2 ONTOGEN

OntoGen [4] is a system for semi-automatic ontology construction; a screenshot of the tool is presented in Figure 1. An important part of OntoGen are the methods for discovering concepts from a collection of documents. For the representation of the documents we use the well established bag-of-words representation, which heavily relies on the weights associated with the words. The weights of the words are commonly calculated by the so called TFIDF weighting. We argue that this provides just one of the possible views on the data, and we propose an alternative word weighting that takes into account the background knowledge which provides the user's view on the documents.

Figure 1. Screenshot of the interactive system for constructing topic ontologies.

OntoGen discovers concepts using Latent Semantic Indexing (LSI) [3] and k-means clustering [6]. LSI is a method for linear dimensionality reduction which learns an optimal sub-basis for approximating the documents' bag-of-words vectors; the sub-basis vectors are treated as concepts. The k-means method discovers concepts by clustering the documents' bag-of-words vectors into k clusters, where each cluster is treated as a concept.

Both methods heavily rely on the representation of the documents: the document representation provides the vectors which LSI tries to approximate, and the similarity of documents on which the clustering algorithm is based also depends on the document representation. By incorporating background knowledge directly into the document representation via word weighting reflecting similarity between the documents, we enable our methods to discover concepts which resemble the view that the user has on the data.
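To make the two concept-discovery routes concrete, here is a minimal sketch using scikit-learn (our illustration under stated assumptions, not the actual OntoGen implementation; the toy documents and all variable names are hypothetical):

```python
# Sketch of the two concept-discovery routes described above:
# LSI (via truncated SVD) and k-means over bag-of-words vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

documents = [  # hypothetical toy collection
    "stock markets rose in new york",
    "london markets fell on oil prices",
    "the government passed a new budget law",
    "parliament debated the budget",
]

# Bag-of-words vectors with raw term frequencies, x_k = TF_k.
bow = CountVectorizer()
X = bow.fit_transform(documents)
words = bow.get_feature_names_out()

# LSI route: each learned sub-basis vector is treated as a concept,
# described here by its highest-weighted words.
lsi = TruncatedSVD(n_components=2).fit(X)
for c, basis in enumerate(lsi.components_):
    top = basis.argsort()[::-1][:3]
    print("LSI concept", c, [words[i] for i in top])

# k-means route: each cluster of document vectors is treated as a concept.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("cluster assignments:", labels)
```

Both routes consume the same document vectors, which is why the word weighting applied to those vectors, discussed next, directly shapes the discovered concepts.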
3 WORD WEIGHTING

3.1 Bag-of-Words and Cosine Similarity

The most commonly used representation of documents in text mining is the bag-of-words representation. Let V = {w_1, ..., w_n} be the vocabulary of words and let TF_k be the number of occurrences of the word w_k in a document. In the bag-of-words representation a single document is encoded as a vector x with elements corresponding to the words from the vocabulary, e.g. x_k = TF_k. These vectors are in general very sparse, since the number of different words that appear in the whole collection is usually much larger than the number of different words that appear inside one specific document.

The measure usually used to compare text documents is the cosine similarity, defined as the cosine of the angle between two documents' bag-of-words vectors x_i and x_j:

    sim(x_i, x_j) = \frac{\sum_{k=1}^{n} x_i^k x_j^k}{\sqrt{\sum_{k=1}^{n} (x_i^k)^2} \sqrt{\sum_{k=1}^{n} (x_j^k)^2}}.    (1)

The performance of both the bag-of-words representation and the cosine similarity can be significantly improved by introducing word weights: each word from the vocabulary V is assigned a weight, and the elements of the vectors x_i are multiplied by the corresponding weights. As we already mentioned, our approach is based on the word weights being the key to viewing the same data from different angles. We can use the weights to store the background knowledge, since the weights define which words are important.
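The following short sketch (ours, plain NumPy; the toy vectors and weights are hypothetical) shows Equation (1) and how element-wise word weights change which words dominate the similarity:

```python
import numpy as np

def cosine_sim(x_i, x_j):
    # Equation (1): cosine of the angle between two bag-of-words vectors.
    return x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# Two toy term-frequency vectors over a 4-word vocabulary.
x1 = np.array([2.0, 1.0, 0.0, 5.0])
x2 = np.array([1.0, 0.0, 1.0, 5.0])

# A word weighting: multiplying each element by a per-word weight
# shifts the similarity toward the words deemed important.
weights = np.array([3.0, 1.0, 1.0, 0.1])  # hypothetical weights
print(cosine_sim(x1, x2))                      # unweighted view
print(cosine_sim(weights * x1, weights * x2))  # weighted view
```

Here the fourth word dominates the unweighted similarity; after weighting, the first word does, illustrating how different weightings yield different views on the same data.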
3.2 TFIDF

Most of the research on word weighting schemas was traditionally done in the information retrieval community. A typical goal in information retrieval is to find the most relevant documents in a collection for a given query. Many popular methods from information retrieval are based on measuring the cosine similarity between the documents and the query, and their performance can be significantly improved by appropriate weighting of the words.

Most of the popular methods for this task developed in the last decades do not involve learning: the word weights are calculated by predefined formulas from some basic statistics of the word frequencies inside the document and inside the whole document collection [10]. These methods are based on intuition and experimental validation. The most widely used is the TFIDF weighting schema [10], which defines the elements of the bag-of-words vectors as

    x_i^k = TF_k \cdot \log(N \cdot IDF_k),    (2)

where N is the number of documents in the collection and IDF_k is the inverse document frequency of the word w_k (one over the number of documents containing it). The intuition behind this weighting schema is that words which occur very often are not so important for determining whether a pair of documents is similar, while a less frequent word occurring in both documents is a strong sign of similarity. The TFIDF weighting can easily be modified to include category information by replacing IDF and the number of documents with ICF (inverse category frequency) and the number of categories. There are many extensions of this schema, the most famous being the Okapi weighting schema [9], which we skip here since it does not incorporate category information. As we will see in the next section, a different approach to generating word weights can also be taken, based on feature selection methods.

3.3 SVM Feature Selection

Feature selection methods based on the Support Vector Machine (SVM) [2] have been found to increase the performance of classification by discovering which words are important for determining the correct category of a document [1].

The method proceeds as follows. First, a linear SVM classifier is trained using all the features. A document is then classified by multiplying its bag-of-words vector x with the normal vector w computed by the SVM,

    x^T w = x_1 w_1 + x_2 w_2 + \dots + x_n w_n,    (3)

and if the result is above some threshold b then the document is considered positive. This process can also be seen as voting: each word is assigned a vote weight w_i and, when a document is being classified, each word from the document issues x_i w_i as its vote. All the votes are summed together to obtain the classification. A vote can be positive (the document should belong to the category) or negative (the document should not belong to the category).

A simple and naive way of selecting the most important words for a given category would be to select the words with the highest vote values w_i for the category. It turns out that it is more stable to select the words with the highest votes x_i w_i averaged over all the positive documents. The votes w_i can also be interpreted as word weights, since they are higher for the words which better separate the documents according to the given categories.
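The voting view of Equation (3) is easy to demonstrate; below is a sketch using scikit-learn's LinearSVC (our illustration, not the paper's code; the toy data and names are hypothetical):

```python
# Sketch of the voting interpretation of a linear SVM (Equation (3)).
import numpy as np
from sklearn.svm import LinearSVC

# X: documents as bag-of-words rows; y: 1 if in the category, else 0.
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 1],
              [0, 1, 1, 2]], dtype=float)
y = np.array([1, 1, 0, 0])

svm = LinearSVC(C=1.0).fit(X, y)
w = svm.coef_[0]        # the normal vector w
b = -svm.intercept_[0]  # the threshold b

# Classification as voting: word i issues the vote x_i * w_i;
# the document is positive iff the summed votes exceed b.
votes = X * w               # one row of per-word votes per document
scores = votes.sum(axis=1)  # x^T w

# Word importance: average the votes x_i * w_i over the positive
# documents, which is more stable than ranking words by w_i alone.
avg_votes = votes[y == 1].mean(axis=0)
print(np.argsort(-avg_votes))  # words ranked by importance
```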
3.4 Word Weighting with SVM

The algorithm we developed for assigning word weights using the SVM feature selection method is the following:

1. Train a classifier for each category from the document collection (the one-vs-all method for multi-class classification). The TFIDF weighting schema can be used at this stage. The result is a set of SVM normal vectors W = {w_j; j = 1, ..., m}, one for each category.

2. Calculate a weighting for each category from its classifier's normal vector. The weights are calculated by averaging the votes x_i w_i across all the documents from the category. Only the weights with a positive average are kept, while the negative ones are set to zero. This results in a separate set of word weights for each category; by \mu_{jk} we denote the weight of the k-th word for the j-th category.

3. Calculate the weighted bag-of-words vector for each document. Let C(d_i) be the set of categories of a document d_i. The elements of the vector x_i are calculated as

    x_i^k = \Big( \sum_{j \in C(d_i)} \mu_{jk} \Big) \cdot TF_k.    (4)

A sketch of these three steps in code is given below. This approach has another strong point: the weights are not only selected so that the similarities correspond to the categories given by the user, they also depend on the context. Let us illustrate this on a sample document which contains the words "machine learning". If the document belonged to the category "learning", then the word "learning" would get a high weight and the word "machine" a low one. However, if the same document belonged to the category "machine learning", then most probably both words would be found important by the SVM.
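The following is a compact sketch of the three steps above as we read them, under the same scikit-learn assumptions as before; the function and variable names are ours, and corner cases (e.g. categories with no positive documents) are ignored:

```python
# Sketch of the three-step SVM word weighting algorithm.
# X: TFIDF bag-of-words rows (step 1 input); TF: raw term-frequency rows;
# cats[i]: the set of category indices C(d_i) of document i.
import numpy as np
from sklearn.svm import LinearSVC

def svm_word_weights(X, cats, n_categories):
    n_docs, n_words = X.shape
    mu = np.zeros((n_categories, n_words))
    for j in range(n_categories):
        y = np.array([1 if j in c else 0 for c in cats])
        w_j = LinearSVC().fit(X, y).coef_[0]        # step 1: one-vs-all SVM
        avg_votes = (X[y == 1] * w_j).mean(axis=0)  # step 2: average votes x_i * w_i
        mu[j] = np.maximum(avg_votes, 0.0)          # keep only positive averages
    return mu

def reweight(TF, cats, mu):
    # Step 3, Equation (4): x_i^k = (sum of mu_jk over the document's
    # categories) * TF_k.
    return np.array([TF[i] * mu[list(c)].sum(axis=0)
                     for i, c in enumerate(cats)])
```

The reweighted vectors returned by `reweight` can then be fed to the LSI and k-means concept discovery methods of Section 2 in place of the TFIDF vectors.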
4 PRELIMINARY RESULTS

4.1 Reuters RCV1 Dataset

As the document collection for testing our method we chose the Reuters RCV1 [7] dataset. The reason we chose it is that each news article in the dataset carries two different types of labels (categories): each article is labeled according to (1) the topics it covers and (2) the countries involved in it. We used a subset of 5000 randomly chosen documents for the experiments. The 10 most frequent categories in this subset are listed in Table 1, separately for the topics view and the countries view.

Table 1. The 10 most frequent categories for the topics and countries views (statistics are for the subset used in the experiments).

  TOPICS VIEW                       |  COUNTRIES VIEW
  CCAT  corporate/industrial   46%  |  USA         33%
  GCAT  government/social      30%  |  UK          11%
  MCAT  markets                24%  |  Japan        6%
  C15   performance            19%  |  Germany      4%
  ECAT  economics              14%  |  France       4%
  C151  accounts/earnings      10%  |  Australia    3%
  M14   commodity/markets      10%  |  India        3%
  C152  comment/forecast        9%  |  China        3%
  GPOL  domestic politics       7%  |  EEC          3%
  M13   money markets           7%  |  Hong Kong    2%

4.2 Results

Figure 2 shows the top 3 concepts discovered by the k-means algorithm under each of the two word weighting schemas. The same documents are also placed in different concepts under the two schemas. For example, take two documents talking about stock prices, one at the New York stock exchange and the other at the UK stock exchange. Under the topics weighting (1) the New York document was placed in the markets concept, the same as the UK document, while under the countries weighting (2) it was placed in the USA concept and the UK document in the Europe concept.

Figure 2. The top 3 discovered concepts for topic labels (left) and for country labels (right).

5 CONCLUSION

In this paper we have presented a method for learning a document similarity measure through selecting appropriate word weights for the bag-of-words document representation model. We selected the word weights by training a linear SVM classifier for the given categories and then extracting the word weights from the hyperplane normal vector. The learned word weighting schema was used to adjust the concept discovery methods in the OntoGen system to the user's domain knowledge. As part of future work we plan to extend this method to the text categorization task, where category information is known only for the documents from the training set.

ACKNOWLEDGEMENTS

This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP), NeOn Networked Ontologies (IST-2004-27595) and PASCAL Network of Excellence (IST-2002-506778). This publication only reflects the authors' views.

REFERENCES

[1] J. Brank, M. Grobelnik, N. Milic-Frayling and D. Mladenic, Feature selection using support vector machines. Proceedings of the Third International Conference on Data Mining Methods and Databases for Engineering, Finance, and Other Fields, Bologna, Italy, 25-27 September 2002.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman, Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, vol. 41, no. 6, 391-407, 1990.
[4] B. Fortuna, D. Mladenic and M. Grobelnik, Semi-automatic construction of topic ontology. Proceedings of the ECML/PKDD Workshop on Knowledge Discovery for Ontologies, 2005.
[5] M. Grobelnik and D. Mladenic, Automated knowledge discovery in advanced knowledge management. Journal of Knowledge Management, vol. 9, 132-149, 2005.
[6] A. K. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review. ACM Computing Surveys, 1999.
[7] D. D. Lewis, Y. Yang, T. Rose and F. Li, RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004.
[8] B. Novak, D. Mladenic and M. Grobelnik, Text classification with active learning. Proceedings of GfKl 2005.
[9] S. E. Robertson, S. Walker, M. M. Hancock-Beaulieu, M. Gatford and A. Payne, Okapi at TREC-4. The Fourth Text REtrieval Conference (TREC-4), 1996.
[10] G. Salton, Developments in Automatic Text Retrieval. Science, vol. 253, 974-979, 1991.
[11] A. Singhal, C. Buckley and M. Mitra, Pivoted Document Length Normalization. Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.