Background knowledge for ontology construction

Blaz Fortuna, Marko Grobelnik and Dunja Mladenic
Institute Jozef Stefan, Slovenia, email: {blaz.fortuna, marko.grobelnik, dunja.mladenic}@ijs.si

Abstract. In this paper we describe a solution for incorporating background knowledge into the OntoGen system for semi-automatic ontology construction. This makes it easier for different users to construct different, more personalized ontologies for the same domain. To achieve this we introduce a word weighting schema to be used in the document representation. The weighting schema is learned from the background knowledge provided by the user and is then used by OntoGen's machine learning and text mining algorithms.

1 INTRODUCTION

When using ontology-based techniques for knowledge management it is important for the ontology to capture the domain knowledge in a proper way. Very often different tasks and users require the knowledge to be encoded into the ontology in different ways, depending on the task. For instance, the same document database in a company may be viewed differently by marketing, management, and technical staff. It is therefore crucial to develop techniques for incorporating the user's background knowledge into ontologies.

In [4] we introduced a system called OntoGen for semi-automatic construction of topic ontologies. A topic ontology consists of a set of topics (or concepts) and a set of relations between the topics which best describe the data. The OntoGen system helps the user by discovering possible concepts and relations between them within the data.

In this paper we propose a method which extends the OntoGen system so that the user can supervise the methods for concept discovery by providing background knowledge - his specific view on the data used by the text mining algorithms in the system. To encode the background knowledge we require the user to group documents into categories. These categories do not need to describe the data in detail; the important thing is that they show the system the user's view of the data - which documents are similar and which are different from the user's perspective. The process of manually labeling the documents with categories is time consuming, but it can be significantly sped up by the use of active learning [5], [8]. Another source of such labeled data could be popular online tagging services (e.g. Del.icio.us), which allow users to label the websites of interest with labels of their own choosing.

This paper is organized as follows. In Section 2 we introduce the OntoGen system and in Section 3 we derive the algorithm for calculating word weights. We conclude the paper with some preliminary results in Section 4.

2 ONTOGEN

OntoGen [4] is a system for semi-automatic ontology construction; a screenshot of the tool is presented in Figure 1. An important part of OntoGen are the methods for discovering concepts from a collection of documents. For the representation of the documents we use the well established bag-of-words representation, which heavily relies on the weights associated with the words. The weights of the words are commonly calculated by the so called TFIDF weighting. We argue that this provides just one of the possible views on the data, and we propose an alternative word weighting that takes into account the background knowledge which provides the user's view on the documents.

Figure 1. Screenshot of the interactive system for constructing topic ontologies.

OntoGen discovers concepts using Latent Semantic Indexing (LSI) [3] and k-means clustering [6]. LSI is a method for linear dimensionality reduction which learns an optimal sub-basis for approximating the documents' bag-of-words vectors; the sub-basis vectors are treated as concepts. The k-means method discovers concepts by clustering the documents' bag-of-words vectors into k clusters, where each cluster is treated as a concept.

Both methods heavily rely on the representation of the documents: the document representation provides the vectors which LSI tries to approximate, and the similarity of documents on which the clustering algorithm is based also depends on the document representation. By incorporating background knowledge directly into the document representation via word weighting reflecting similarity between the documents, we enable our methods to discover concepts which resemble the view that the user has on the data.
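To make the two concept-discovery routes concrete, here is a minimal sketch using scikit-learn (our illustration under stated assumptions, not the actual OntoGen implementation; the toy documents and all variable names are hypothetical):

```python
# Sketch of the two concept-discovery routes described above:
# LSI (via truncated SVD) and k-means over bag-of-words vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

documents = [  # hypothetical toy collection
    "stock markets rose in new york",
    "london markets fell on oil prices",
    "the government passed a new budget law",
    "parliament debated the budget",
]

# Bag-of-words vectors with raw term frequencies, x_k = TF_k.
bow = CountVectorizer()
X = bow.fit_transform(documents)
words = bow.get_feature_names_out()

# LSI route: each learned sub-basis vector is treated as a concept,
# described here by its highest-weighted words.
lsi = TruncatedSVD(n_components=2).fit(X)
for c, basis in enumerate(lsi.components_):
    top = basis.argsort()[::-1][:3]
    print("LSI concept", c, [words[i] for i in top])

# k-means route: each cluster of document vectors is treated as a concept.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("cluster assignments:", labels)
```

Both routes consume the same document vectors, which is why the word weighting applied to those vectors, discussed next, directly shapes the discovered concepts.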
3 WORD WEIGHTING

3.1 Bag-of-Words and Cosine Similarity

The most commonly used representation of documents in text mining is the bag-of-words representation. Let V = {w_1, ..., w_n} be the vocabulary of words and let TF_k be the number of occurrences of the word w_k in a document. In the bag-of-words representation a single document is encoded as a vector x with elements corresponding to the words from the vocabulary, e.g. x_k = TF_k. These vectors are in general very sparse, since the number of different words that appear in the whole collection is usually much larger than the number of different words that appear inside one specific document.

The measure usually used to compare text documents is the cosine similarity, defined as the cosine of the angle between two documents' bag-of-words vectors x_i and x_j:

    sim(x_i, x_j) = \frac{\sum_{k=1}^{n} x_i^k x_j^k}{\sqrt{\sum_{k=1}^{n} (x_i^k)^2} \sqrt{\sum_{k=1}^{n} (x_j^k)^2}}.    (1)

The performance of both the bag-of-words representation and the cosine similarity can be significantly improved by introducing word weights: each word from the vocabulary V is assigned a weight, and the elements of the vectors x_i are multiplied by the corresponding weights. As we already mentioned, our approach is based on the word weights being the key to viewing the same data from different angles. We can use the weights to store the background knowledge, since the weights define which words are important.
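The following short sketch (ours, plain NumPy; the toy vectors and weights are hypothetical) shows Equation (1) and how element-wise word weights change which words dominate the similarity:

```python
import numpy as np

def cosine_sim(x_i, x_j):
    # Equation (1): cosine of the angle between two bag-of-words vectors.
    return x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# Two toy term-frequency vectors over a 4-word vocabulary.
x1 = np.array([2.0, 1.0, 0.0, 5.0])
x2 = np.array([1.0, 0.0, 1.0, 5.0])

# A word weighting: multiplying each element by a per-word weight
# shifts the similarity toward the words deemed important.
weights = np.array([3.0, 1.0, 1.0, 0.1])  # hypothetical weights
print(cosine_sim(x1, x2))                      # unweighted view
print(cosine_sim(weights * x1, weights * x2))  # weighted view
```

Here the fourth word dominates the unweighted similarity; after weighting, the first word does, illustrating how different weightings yield different views on the same data.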
3.2 TFIDF

Most of the research on word weighting schemas was traditionally done in the information retrieval community. A typical goal in information retrieval is to find the most relevant documents in a collection for a given query. Many popular methods from information retrieval are based on measuring the cosine similarity between the documents and the query, and their performance can be significantly improved by appropriate weighting of the words.

Most of the popular methods for this task developed in the last decades do not involve learning: the word weights are calculated by predefined formulas from some basic statistics of the word frequencies inside the document and inside the whole document collection [10]. These methods are based on intuition and experimental validation. The most widely used is the TFIDF weighting schema [10], which defines the elements of the bag-of-words vectors as

    x_i^k = TF_k \cdot \log(N \cdot IDF_k),    (2)

where N is the number of documents in the collection and IDF_k is the inverse document frequency of the word w_k (one over the number of documents containing it). The intuition behind this weighting schema is that words which occur very often are not so important for determining whether a pair of documents is similar, while a less frequent word occurring in both documents is a strong sign of similarity. The TFIDF weighting can easily be modified to include category information by replacing IDF and the number of documents with ICF (inverse category frequency) and the number of categories. There are many extensions of this schema, the most famous being the Okapi weighting schema [9], which we skip here since it does not incorporate category information. As we will see in the next section, a different approach to generating word weights can also be taken, based on feature selection methods.

3.3 SVM Feature Selection

Feature selection methods based on the Support Vector Machine (SVM) [2] have been found to increase the performance of classification by discovering which words are important for determining the correct category of a document [1].

The method proceeds as follows. First, a linear SVM classifier is trained using all the features. A document is then classified by multiplying its bag-of-words vector x with the normal vector w computed by the SVM,

    x^T w = x_1 w_1 + x_2 w_2 + \dots + x_n w_n,    (3)

and if the result is above some threshold b then the document is considered positive. This process can also be seen as voting: each word is assigned a vote weight w_i and, when a document is being classified, each word from the document issues x_i w_i as its vote. All the votes are summed together to obtain the classification. A vote can be positive (the document should belong to the category) or negative (the document should not belong to the category).

A simple and naive way of selecting the most important words for a given category would be to select the words with the highest vote values w_i for the category. It turns out that it is more stable to select the words with the highest votes x_i w_i averaged over all the positive documents. The votes w_i can also be interpreted as word weights, since they are higher for the words which better separate the documents according to the given categories.
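The voting view of Equation (3) is easy to demonstrate; below is a sketch using scikit-learn's LinearSVC (our illustration, not the paper's code; the toy data and names are hypothetical):

```python
# Sketch of the voting interpretation of a linear SVM (Equation (3)).
import numpy as np
from sklearn.svm import LinearSVC

# X: documents as bag-of-words rows; y: 1 if in the category, else 0.
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 1],
              [0, 1, 1, 2]], dtype=float)
y = np.array([1, 1, 0, 0])

svm = LinearSVC(C=1.0).fit(X, y)
w = svm.coef_[0]        # the normal vector w
b = -svm.intercept_[0]  # the threshold b

# Classification as voting: word i issues the vote x_i * w_i;
# the document is positive iff the summed votes exceed b.
votes = X * w               # one row of per-word votes per document
scores = votes.sum(axis=1)  # x^T w

# Word importance: average the votes x_i * w_i over the positive
# documents, which is more stable than ranking words by w_i alone.
avg_votes = votes[y == 1].mean(axis=0)
print(np.argsort(-avg_votes))  # words ranked by importance
```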
3.4 Word Weighting with SVM

The algorithm we developed for assigning word weights using the SVM feature selection method is the following:

1. Train a classifier for each category from the document collection (the one-vs-all method for multi-class classification). The TFIDF weighting schema can be used at this stage. The result is a set of SVM normal vectors W = {w_j; j = 1, ..., m}, one for each category.

2. Calculate a weighting for each category from its classifier's normal vector. The weights are calculated by averaging the votes x_i w_i across all the documents from the category. Only the weights with a positive average are kept, while the negative ones are set to zero. This results in a separate set of word weights for each category; by \mu_{jk} we denote the weight of the k-th word for the j-th category.

3. Calculate the weighted bag-of-words vector for each document. Let C(d_i) be the set of categories of a document d_i. The elements of the vector x_i are calculated as

    x_i^k = \Big( \sum_{j \in C(d_i)} \mu_{jk} \Big) \cdot TF_k.    (4)

A sketch of these three steps in code is given below. This approach has another strong point: the weights are not only selected so that the similarities correspond to the categories given by the user, they also depend on the context. Let us illustrate this on a sample document which contains the words "machine learning". If the document belonged to the category "learning", then the word "learning" would get a high weight and the word "machine" a low one. However, if the same document belonged to the category "machine learning", then most probably both words would be found important by the SVM.
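The following is a compact sketch of the three steps above as we read them, under the same scikit-learn assumptions as before; the function and variable names are ours, and corner cases (e.g. categories with no positive documents) are ignored:

```python
# Sketch of the three-step SVM word weighting algorithm.
# X: TFIDF bag-of-words rows (step 1 input); TF: raw term-frequency rows;
# cats[i]: the set of category indices C(d_i) of document i.
import numpy as np
from sklearn.svm import LinearSVC

def svm_word_weights(X, cats, n_categories):
    n_docs, n_words = X.shape
    mu = np.zeros((n_categories, n_words))
    for j in range(n_categories):
        y = np.array([1 if j in c else 0 for c in cats])
        w_j = LinearSVC().fit(X, y).coef_[0]        # step 1: one-vs-all SVM
        avg_votes = (X[y == 1] * w_j).mean(axis=0)  # step 2: average votes x_i * w_i
        mu[j] = np.maximum(avg_votes, 0.0)          # keep only positive averages
    return mu

def reweight(TF, cats, mu):
    # Step 3, Equation (4): x_i^k = (sum of mu_jk over the document's
    # categories) * TF_k.
    return np.array([TF[i] * mu[list(c)].sum(axis=0)
                     for i, c in enumerate(cats)])
```

The reweighted vectors returned by `reweight` can then be fed to the LSI and k-means concept discovery methods of Section 2 in place of the TFIDF vectors.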
4 PRELIMINARY RESULTS

4.1 Reuters RCV1 Dataset

As the document collection for testing our method we chose the Reuters RCV1 [7] dataset. The reason we chose it is that each news article in the dataset carries two different types of labels (categories): each article is labeled according to (1) the topics it covers and (2) the countries involved in it. We used a subset of 5000 randomly chosen documents for the experiments. The 10 most frequent categories in this subset are listed in Table 1, separately for the topics view and the countries view.

Table 1. The 10 most frequent categories for the topics and countries views (statistics are for the subset used in the experiments).

  TOPICS VIEW                       |  COUNTRIES VIEW
  CCAT  corporate/industrial   46%  |  USA         33%
  GCAT  government/social      30%  |  UK          11%
  MCAT  markets                24%  |  Japan        6%
  C15   performance            19%  |  Germany      4%
  ECAT  economics              14%  |  France       4%
  C151  accounts/earnings      10%  |  Australia    3%
  M14   commodity/markets      10%  |  India        3%
  C152  comment/forecast        9%  |  China        3%
  GPOL  domestic politics       7%  |  EEC          3%
  M13   money markets           7%  |  Hong Kong    2%

4.2 Results

Figure 2 shows the top 3 concepts discovered by the k-means algorithm under each of the two word weighting schemas. The same documents are also placed in different concepts under the two schemas. For example, take two documents talking about stock prices, one at the New York stock exchange and the other at the UK stock exchange. Under the topics weighting (1) the New York document was placed in the markets concept, the same as the UK document, while under the countries weighting (2) it was placed in the USA concept and the UK document in the Europe concept.

Figure 2. The top 3 discovered concepts for topic labels (left) and for country labels (right).

5 CONCLUSION

In this paper we have presented a method for learning a document similarity measure through selecting appropriate word weights for the bag-of-words document representation model. We selected the word weights by training a linear SVM classifier for the given categories and then extracting the word weights from the hyperplane normal vector. The learned word weighting schema was used to adjust the concept discovery methods in the OntoGen system to the user's domain knowledge. As part of future work we plan to extend this method to the text categorization task, where category information is known only for the documents from the training set.

ACKNOWLEDGEMENTS

This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP), NeOn Networked Ontologies (IST-2004-27595) and PASCAL Network of Excellence (IST-2002-506778). This publication only reflects the authors' views.

REFERENCES

[1] J. Brank, M. Grobelnik, N. Milic-Frayling and D. Mladenic, Feature selection using support vector machines. Proceedings of the Third International Conference on Data Mining Methods and Databases for Engineering, Finance, and Other Fields, Bologna, Italy, 25-27 September 2002.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman, Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, vol. 41, no. 6, 391-407, 1990.
[4] B. Fortuna, D. Mladenic and M. Grobelnik, Semi-automatic construction of topic ontology. Proceedings of the ECML/PKDD Workshop on Knowledge Discovery for Ontologies, 2005.
[5] M. Grobelnik and D. Mladenic, Automated knowledge discovery in advanced knowledge management. Journal of Knowledge Management, vol. 9, 132-149, 2005.
[6] A. K. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review. ACM Computing Surveys, 1999.
[7] D. D. Lewis, Y. Yang, T. Rose and F. Li, RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004.
[8] B. Novak, D. Mladenic and M. Grobelnik, Text classification with active learning. Proceedings of GfKl 2005.
[9] S. E. Robertson, S. Walker, M. M. Hancock-Beaulieu, M. Gatford and A. Payne, Okapi at TREC-4. The Fourth Text REtrieval Conference (TREC-4), 1996.
[10] G. Salton, Developments in Automatic Text Retrieval. Science, vol. 253, 974-979, 1991.
[11] A. Singhal, C. Buckley and M. Mitra, Pivoted Document Length Normalization. Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.