A Category-Driven Approach to Deriving Domain Specific Subset of Wikipedia

Anton Korshunov, Denis Turdakov
ISPRAS
{korshunov, turdakov}@ispras.ru

Jinguk Jeong, Minho Lee, Changsung Moon
Convergence Solution Team, DMC R&D Center, Samsung Electronics Co., Ltd.
{jinguk.jeong, minho03.lee, albert.moon}@samsung.com

This research was carried out in collaboration with, and supported by, the Samsung Electronics Co., Ltd. DMC R&D Center Convergence Solution Team. Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems (SYRCoDIS), Moscow, Russia, 2011.

Abstract

While many researchers attempt to build various kinds of ontologies by means of Wikipedia, the possibility of deriving a high-quality domain specific subset of Wikipedia using its own category structure still remains undervalued. In this paper we argue for the necessity of such processing and propose an appropriate technique. As a result, the size of the knowledge base for our text processing framework has been reduced by more than an order of magnitude, while the precision of disambiguating musical metadata (ID3 tags) has decreased from 98% to 64%.

1 Introduction

There is no need to introduce Wikipedia as the world's largest and most rapidly expanding source of information on many domains in almost every language. At the time of drafting this paper, there are 279 Wikipedias in different languages, with more than 17,870,000 articles, 1,920,000 uploaded images, and 27,540,000 registered users [1]. 39 Wikipedias contain more than 100,000 articles each. The English edition remains the largest Wikipedia, over three times as large as the second largest edition, the German Wikipedia. That is why it is reasonable to start investigating new possibilities with the English Wikipedia as the most comprehensive source.

By now, Wikipedia has accumulated a lot of colorful, detailed descriptions. The conventional features of Wikipedia are well known and discussed widely in [2, 25].
They include concept identification by ID or URL, a multiple and dense link structure, and a category system which is edited and maintained by Wikipedia users, as well as the articles themselves.

The research community has proved more than once that the structure and content of Wikipedia are very peculiar and valuable domains to study. One of the most promising directions is automated ontology building, which may be accomplished by extracting well-defined concepts and relations among them from Wikipedia. The latest efforts in this field are DBpedia [3], the work of Ponzetto et al. [4], YAGO [5], etc. All these approaches mainly exploit data derived from infoboxes and the category structure. Due to the continuous enrichment of these techniques with better algorithms and data sources, the quality of the resulting ontologies is becoming remarkable.

Despite these successes in automated upper-level ontology building, most domain specific ontologies (for instance, UMLS [6]) are still assembled manually. Needless to say, they are often costly to build and to keep up to date.

In this paper, we consider the possibility of loosely supervised extraction of domain specific ontologies from upper-level ones. This is becoming feasible in recent years as high-quality upper-level ontologies grow more versatile and incorporate the knowledge of many new domains. Moreover, many field experts from all around the world tend to spend their effort on expanding existing widespread ontologies rather than on creating their own or refining domain specific ones.

Our interest in this research is caused by the necessity of reducing the knowledge base for our text processing framework Texterra (http://modis.ispras.ru/texterra). This base comprises a number of textual indices produced by our Wikipedia parser. These indices are loaded into RAM during the initialization of the Texterra server and take roughly 4.5 Gb of disk space and 2 Gb of RAM. Such consumption is acceptable for workstations, but not for mobile devices with a restricted amount of memory. Thus, if one attempted to build a standalone mobile application intended for text processing, its data structures simply would not fit in the RAM. Obviously, such applications, if developed, would be in high demand today.

The disambiguation of musical metadata (i.e., ID3 tags of MP3 files) is an important part of Texterra functionality. (Word sense disambiguation is an open problem of natural language processing: the task of identifying which sense of a word, i.e. which meaning, is used in a sentence when the word has multiple meanings, or polysemy.) These algorithms are well tuned by now and show a precision of 98% on our test set. But it is obvious that only a narrow range of the Wikipedia dictionary is utilized while processing ID3 tags of MP3 files. These are mostly named entities, such as musical compositions, song writers, singers, etc. Generally speaking, the amount of knowledge required to perform such disambiguation is likely much less than that available from the entire knowledge base.

Making the knowledge base more specialized by removing concepts which are unimportant for this task is what comes to mind first in this case. So, we have implemented a system intended to produce Wikipedia subsets covering the knowledge of a given domain. We decided not to narrow the scope of our research to Wikipedia derivatives (such as YAGO): although the algorithms have been evaluated with Wikipedia only, they are applicable to any ontology with a well-defined polyhierarchical taxonomy.

Of the many interesting aspects of Wikipedia, here we take into account only categorization and linkage within its content. The category system in Wikipedia plays the role of a taxonomy and provides the ability to search for articles by narrowing down categories.
Given this, the task of deriving domain specific concepts from the Wikipedia dictionary seems to be solvable by mining the category system for the list of concepts tightly connected with given base categories. However, it is impossible to simply determine all concepts belonging to a particular category. The reason is that the category system of Wikipedia is organized as a network, not a perfect tree structure. Furthermore, the lists of parent categories for many articles are redundant and contradictory. Thus, they do not allow detecting the most relevant categories for a given page without analyzing the neighbouring pages.

Therefore, we utilized the concept vectorization method specialized for the category system of Wikipedia, with several additional expansion methods, as proposed in [2]. The authors express affiliation relations among concepts as category-based concept vectors. Each element (dimension) of a concept vector represents not only binary affiliation information (whether the concept belongs to a certain category or not), but also the degree of affiliation (called the belonging degree in the rest of this paper). After applying different cutting techniques to the obtained concept vectors, the list of domain specific concept IDs is output. This list is then applied to the Texterra knowledge base, resulting in a reduced domain specific version of the latter. The explanation of our approach is given in Section 4.

Section 2 contains the review of related work. The key features of the Wikipedia category structure are discussed in Section 3. Having fast access to any point of the Wikipedia category structure was one of the crucial tasks of our research; after a number of unsuccessful attempts to reuse existing graph libraries, we coded our own implementation, named WikiGraph (see Section 5 for details). Given a set of specialized versions of the knowledge base, we conducted a series of experiments in order to learn how reducing the Texterra knowledge base affects the accuracy of results; refer to Section 6 for the evaluation methodology. Section 7 contains experimental results and discussion. We conclude in Section 8 with our research findings and a discussion of possible improvements.

2 Related work

2.1 Automated ontology building

The field of automated ontology learning usually proceeds by taking textual input and transforming it into a taxonomy or a proper ontology. The texts are usually obtained from printed sources (books, magazines, newspapers) and the Internet (online media, blog posts, results of querying search engines). However, the learned ontologies are small and hard to update; in addition, evaluations have revealed rather poor performance [7].

As mentioned before, the most accurate and, thus, valuable non-human-assembled ontologies are now built by automatically deriving explicit facts from Wikipedia. One of the early attempts was the work of Gregorowicz and Kramer [8]. They focused on deriving a term-concept map which consists of terms, concepts, and relationships between them. Only articles, redirects, and disambiguation pages are considered. Ponzetto and Strube [4] derived a taxonomy from the entire Wikipedia category structure. Despite their success, the resulting taxonomy is rather simple (it supports only is-a and not-is-a relations) and domain independent. The work described in [9] enhances this taxonomy with instance and class information for each node. Cui et al. [10] introduce an even more sophisticated approach to building an ontology of concepts by making use of infobox structures, definition sentences, and category labels. Finally, YAGO2 [11] is probably the most complete and accurate semantic knowledge base derived automatically from Wikipedia and other sources. The information extraction technique for YAGO2 assumes varied utilization of infoboxes, category structure, redirects, and other data within Wikipedia. Furthermore, a quality check is performed to find possible mistakes. As a result, the quality of the extracted ontology is sufficient for the majority of IR- and NLP-related tasks.

Notwithstanding the foregoing, newly emerging research fields often require well-structured and comprehensive domain specific ontologies. Researchers do not need a huge knowledge base, but an extensible corpus of specific concepts with good coverage of domain knowledge. If such ontologies were built in a completely automated way, this would avoid the necessity of assembling them from scratch manually. We believe that Wikipedia contains enough information for this task to be completed. Below is the description of the approaches we applied to narrowing the Wikipedia dictionary down to a domain specific subset.
2.2 Computing semantic relatedness

We consider the key problem of our study to be the computation of semantic relatedness between base top-level categories and the underlying articles. According to Budanitsky and Hirst [12], semantic relatedness is defined to cover any kind of lexical or functional association that may exist between two words. This definition suits us better than semantic similarity, which is typically defined via the lexical relations of synonymy and hypernymy.

Many approaches have been proposed for estimating semantic relatedness between concepts in Wordnet (Rada et al., Leacock and Chodorow, Wu and Palmer, Resnik, Jiang and Conrath, Lin) and between Wikipedia articles [13] (Dice, Jaccard, SimRank). Among them are purely graph-based measures and those involving information content. In this paper, we consider only graph-based approaches.

Since a significant part of Wikipedia knowledge is encoded in its graph-like link structure, it seems reasonable to apply existing graph-based methods or introduce new ones. Zesch and Gurevych [14] showed that many Wordnet-based semantic similarity measures are applicable to Wikipedia with minor changes.

Obviously, it is also attractive to estimate the strength of ties between different levels of the taxonomy (between base categories and their articles, in our case). In this context, the links among articles appear less important. Therefore, new category-based semantic relatedness measures are emerging.

Chernov et al. [15] measured semantic relatedness between categories, not between concepts and categories, as is required for our task. Nonetheless, they proposed several very useful and applicable techniques.

Strube and Ponzetto [16] employed Wikipedia, Wordnet, and Google for computing semantic relatedness between concepts. For two Wikipedia articles being compared, they extracted two category lists. Given the category lists, for each category pair they performed a depth-limited search of maximum depth 4 for a least common ancestor. As they noticed, limiting the search improves the results. But this is obviously inappropriate for computing relatedness between a top-level category and the articles from all its subcategories.
2.3 Identifying domain specific concepts

Syed et al. [17] tried to predict the topic of textual documents by matching them against Wikipedia articles based on cosine similarity. (Cosine similarity is a measure of similarity between two vectors obtained by measuring the cosine of the angle between them; it determines whether two vectors point in roughly the same direction and is often used to compare documents in text mining.) Then, they extracted the categories of the found articles and scored them using different scoring schemes with or without spreading activation. (Spreading activation is a method for searching associative, neural, or semantic networks: the search is initiated by labelling a set of source nodes with weights or "activation" and then iteratively propagating, or "spreading", that activation to other nodes linked to the source nodes.) The proposed approach implies scoring the links with many categories for each given Wikipedia article using bottom-up traversal of the category structure. This is acceptable for a small set of articles, but not for our task.

To the best of our knowledge, Cui et al. [18] were the first to introduce an approach to deriving a domain specific corpus from Wikipedia. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wikipedia; then, only articles strongly linked to this hierarchy are selected. They build a so-called Classification Tree by traversing down the directed graph of the Wikipedia category structure starting from the root node. This tree includes both categories and articles and in fact is merely a connected branch of the Wikipedia classification graph with a specified root node. The Classification Tree is then traversed with a simple adaptation of the breadth-first search algorithm. During the traversal, each node is given a score for its relevance to the specific domain. Once the traversal is completed, the terminal nodes (article pages) are ranked according to the domain relevance scores. Pages over a certain threshold are considered domain relevant. The node score can consider either ingoing or outgoing edges. Although the proposed technique is quite simple, the results are remarkable.

A more sophisticated algorithm was proposed later by Shirakawa et al. [2]. The concept vectorization method is introduced for finding concepts which are highly correlated with a base category (refer to Section 1 for a brief explanation). The main assumption there is that the relatedness between categories gets lower as the number of traversed pages (i.e., the hopcount) increases. In addition to the number of links for each node, they also take into account the number of paths between the concept and the base category, as well as the hopcounts of these paths. It seems to us that this understanding reflects the nature of the Wikipedia classification approach much more precisely than ever before. Several heuristics are suggested for estimating semantic relatedness by counting path properties in the subgraph of the desired base category. However, the authors did not compute these scores for thousands of articles with hundreds of paths each at a time, as is required in this research. Moreover, the efficiency of the approaches described in [2] and [18] has not been evaluated in real tasks. In both cases, the evaluation was performed by comparing the results of the algorithms with the answers of experts knowledgeable in certain fields. This is persuasive for proving theoretical applicability, but is not enough for unconditional embedding into a real system.

In this work, we mainly exploited the ideas formulated in [2] and [18]. Our main goal was to estimate the scalability and practical applicability of these approaches for real tasks which imply processing a large amount of data.
3 Features of Wikipedia category structure

The advantages of the Wikipedia category structure were studied by the authors of [14] and many others. Here we summarize only those features needed for a better understanding of our approach.

The categories of Wikipedia can be organized into a graph, where the nodes are categories and the edges are hyperlinks. In this work we also add articles to this graph. However, we still name it the Wikipedia category graph (WCG in the rest of the paper).

The links expressing which concept belongs to which categories are called category links. We call them belonging links or belonged links according to their direction. In this paper, we only consider the belonging links, i.e. links from articles or subcategories to upper-level categories. The English version of Wikipedia, as of September 2010, contains ~13 million category links.

The typical code of a categorized article page is shown in Figure 1. It combines XML and Wiki markup. The list of belonging categories is situated at the bottom. Categories have their own pages, similar to articles. Category links on these pages also express which category belongs to which categories.

[Figure 1. Sample XML structure of a categorized Wikipedia article page]

Categorization is a useful tool to group articles for ease of navigation and for correlating similar information. However, not every verifiable fact (or the intersection of two or more such facts) in an article requires an associated category. For lengthy articles, this could potentially result in hundreds of categories, most of which are not particularly relevant. This may also make it more difficult to find any particular category for a specific article. Such overcategorization is also known as "category clutter" [19].

For these reasons, the WCG has an extremely complex nature. It is directed and does not have as strong a hierarchical structure as some may expect. Any category may branch into subcategories, and it is possible for a category to be a subcategory of more than one parent [20]. Upon closer inspection, the WCG is rather a polyhierarchy, or even a net (Figure 2).

[Figure 2. A fragment of the Wikipedia category structure]

The figure has been produced by CatGraph [21]. This tool draws a cloud of links for the desired category. Each rectangle represents a category. Each arrow connecting two rectangles denotes a "belongs-to" relation, that is, the destination category is a subcategory of the initial one (an example of a belonging link). The cloud shown in the figure is for the "Recorded music" category (bolded).

It is worth noting here that not every Wikipedia page is categorized. According to statistics [22, 23], there are thousands of uncategorized articles and categories. Moreover, certain categories are assigned incorrectly [24]. We suggest considering these facts as possible drawbacks for any category-based algorithm. In addition, automated categorization (i.e., determining the topic of an uncategorized page) seems to be a challenging task. This can be done with certain accuracy by processing the page title and text with specific NLP techniques and finding appropriate categories in the WCG.

4 Deriving domain specific subsets

The developed system consists of three main parts (sketched after this list):
1. Link Filter produces a ready-for-load textual representation of the WCG;
2. Topic Deriver performs the main processing;
3. Reducer produces a domain specific version of the Texterra knowledge base (for this research, the knowledge base was obtained by parsing the dump of the English Wikipedia as of September 2010).

All algorithms evaluated in this paper were implemented in the Java programming language.
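To make the division of responsibilities concrete, the following is a minimal sketch of contracts for the three stages; all names and signatures here are ours for illustration and are not the actual Texterra interfaces.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Set;

// Illustrative contracts for the three stages described above.
interface LinkFilter {
    /** Reads the full Wikipedia links file and writes the category links file (the WCG). */
    void filter(Path wikipediaLinksFile, Path categoryLinksFile);
}

interface TopicDeriver {
    /** Builds a subgraph per base category, computes concept vectors,
     *  applies a cutting technique and returns the surviving concept IDs. */
    Set<Integer> derive(Path categoryLinksFile, List<String> baseCategories);
}

interface Reducer {
    /** Projects the complete knowledge base onto the derived concept IDs. */
    void reduce(Path completeKnowledgeBase, Set<Integer> domainConceptIds, Path reducedKnowledgeBase);
}
```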
4.1 Link Filter

The input for Link Filter is the Wikipedia links file containing information about all links between Wikipedia pages, along with their type and direction. The result is the category links file, which contains only the links forming the WCG. Every line of this file denotes the belonging relation between two pages and sets the type of the belonging page. For example,

12 780754 0

means that the page with ID 12 ("Anarchism") belongs to the category with ID 780754 ("Category:Anarchism"). Moreover, the belonging page ("Anarchism") is an article because the last field is "0" ("1" would mean that the belonging page is a category). Thus, the category links file is a complete textual representation of the WCG and contains no unwanted data such as page titles and link types. Furthermore, unlike the authors of [18], we have also removed pages of certain types: lists, classifications, portals, redirects, disambiguation pages, and user pages. This helped us to make the WCG more lightweight without loss of any meaningful concepts. As a result, the category links file contains 13,001,687 links between 593,796 categories and 3,156,822 articles.
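For illustration, a minimal sketch of loading the category links file into memory might look as follows; the field layout (child ID, parent category ID, child type flag) follows the example above, while the class and map names are ours.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Loads "childId parentCategoryId childType" lines (childType: 0 = article, 1 = category)
// into a parent -> children adjacency map. Names are illustrative.
public class CategoryLinksLoader {
    final Map<Integer, List<Integer>> childrenByCategory = new HashMap<>();
    final Set<Integer> categories = new HashSet<>();

    public void load(String categoryLinksFile) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(categoryLinksFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length != 3) continue;          // skip malformed lines
                int child = Integer.parseInt(fields[0]);   // article or subcategory ID
                int parent = Integer.parseInt(fields[1]);  // category ID
                boolean childIsCategory = "1".equals(fields[2]);
                if (childIsCategory) categories.add(child);
                categories.add(parent);                    // the parent is always a category
                childrenByCategory.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
            }
        }
    }
}
```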
4.2 Topic Deriver

Topic Deriver loads the category links file on start and fills in the internal structures of WikiGraph (refer to Section 5 for details). The workflow of this stage is shown in Figure 3; here we touch on only the main steps.

[Figure 3. Overall architecture of Topic Deriver. Step 1: get the list of categories to be processed. Step 2: connect to the Texterra server and obtain an ID for each category. Step 3: explore the pre-loaded Wikipedia category graph and build a separate subgraph for each category. Step 4: traverse the subgraphs and build concept vectors, each consisting of "concept ID → belonging degree" pairs. Step 5: apply a cutting technique to each vector and collect the remaining IDs into a common list of extracted concepts. Step 6: process the Texterra knowledge base so as to keep only the records associated with the obtained list.]

For our analysis, we denote by $C$ the set of concepts, by $G$ the set of categories, and by $L$ the set of belonging links. The category system of Wikipedia is then expressed as a directed graph $(C \cup G, L)$. A path is a sequence of edges that connects one node with another. The path length (hopcount) is the number of edges along that path.

The key task of Topic Deriver is to obtain the list of concepts semantically connected with a certain domain. Moreover, this connection should be the tightest one, that is, these concepts should be more relevant to the desired domain than to others. As this task is computationally complex and, thus, supposed to be run rarely, it is acceptable to choose the base categories manually for the experiments. We selected 3 base categories that likely cover the majority of concepts required for disambiguating musical metadata:
- Category:Musical compositions
- Category:Recorded music
- Category:Music-related lists

For each of the selected base categories, a separate subgraph is built. This subgraph is almost the same as the Classification Tree in [18]. The base category serves as the root node, and a tree-like structure of underlying pages is obtained from the WCG. The only difference from the approach proposed in [18] is that we use depth-first search (DFS), not breadth-first search (BFS). The reason is that the resulting subgraphs are often large, so it is inappropriate to waste memory on the FIFO queue required for BFS traversal [26]. Moreover, as a depth-first tree is expected to contain back edges and cross edges, a list of visited nodes has been added to avoid repeated visits and loops.

A concept vector in our research specifies the degree of affiliation between the base category and each of the articles reachable by traversing down the subgraph of the base category starting from its root node. As mentioned, the heuristics for building concept vectors have been borrowed from [2] with some modifications. We describe them briefly below; for more detailed information refer to the source paper.

BVG (Basic Vector Generation) generates concept vectors by tracking back parent categories in the category system and calculating the belonging degree for each concept. The belonging degree from concept $c$ to category $g$ is defined by the following equation:

$$bd(c, g) = \sum_{p \in Path(c, g)} \frac{1}{f(hop(p))}$$

Here, $Path(c, g)$ denotes the set of paths from $c$ to $g$, $hop(p)$ denotes the hopcount of path $p$, and $f$ denotes a monotonically increasing function of the hopcount of a path (the concrete choice of $f$ is given in [2]).

It is noteworthy that in the original method [2] paths with a hopcount of more than 4 were ignored. We asked the authors for the reasons for this. The response was: "Because long paths scarcely affect values in concept vectors in most cases. Of course, sometimes long paths affect the values". We decided to remove this constraint in our experiments, that is, we consider all paths between two nodes.
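As a rough illustration (not the Texterra implementation), the sketch below enumerates all simple downward paths from a base category and accumulates BVG belonging degrees for the articles it reaches; the penalty function f is only a stand-in, since the concrete function used in [2] is not reproduced in this paper.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Basic Vector Generation: the belonging degree of an article is the sum of
// 1 / f(hopcount) over all paths from the base category down to that article.
// The graph access maps and f are illustrative placeholders.
public class BvgSketch {
    private final Map<Integer, List<Integer>> childrenByCategory; // category -> children (articles or subcategories)
    private final Map<Integer, Boolean> isCategory;               // page ID -> type

    public BvgSketch(Map<Integer, List<Integer>> childrenByCategory, Map<Integer, Boolean> isCategory) {
        this.childrenByCategory = childrenByCategory;
        this.isCategory = isCategory;
    }

    /** Placeholder for the monotonically increasing penalty on path length (assumption, for illustration only). */
    private double f(int hopcount) {
        return Math.pow(2, hopcount);
    }

    /** Builds the concept vector (article ID -> belonging degree) for one base category. */
    public Map<Integer, Double> conceptVector(int baseCategory) {
        Map<Integer, Double> degrees = new HashMap<>();
        Deque<Integer> path = new ArrayDeque<>();        // nodes on the current DFS path
        walk(baseCategory, path, degrees);
        return degrees;
    }

    private void walk(int node, Deque<Integer> path, Map<Integer, Double> degrees) {
        if (path.contains(node)) return;                 // avoid cycles in the WCG
        path.push(node);
        int hopcount = path.size() - 1;
        if (hopcount > 0 && !isCategory.getOrDefault(node, false)) {
            degrees.merge(node, 1.0 / f(hopcount), Double::sum);  // one more path reached this article
        }
        for (int child : childrenByCategory.getOrDefault(node, List.of())) {
            walk(child, path, degrees);
        }
        path.pop();
    }
}
```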
As a result, the processing time may become too large for base categories from the upper levels of the WCG hierarchy. The reason for such behaviour is the high computational complexity of finding all paths between two arbitrary nodes in a graph; it is well known that this task is NP-complete in the general case. Since the WCG contains millions of edges, the maximal path length may reach hundreds of edges, leading to a rapid increase in processing time when one tries to process top-level categories. This is exactly why we picked a "safe" set of categories, which are processed relatively fast and cover the knowledge of the field we chose for the experiments. Nevertheless, we believe that taking all existing paths between two nodes into account allows the belonging degree to be estimated more precisely; however, we did not confirm this assumption experimentally.

To reduce the complexity of finding all paths between two arbitrary nodes, we tried to reuse one of the existing techniques [27-29]. In the end, the APAC algorithm [29] was chosen. This algorithm does not need to keep track of all visited vertices and only stores the feasible paths.

For domain specific areas where categories are excessively segmentalized, the BVG method cannot extract concept vectors accurately due to the increase in hopcount. To solve this problem, the Single Parent Integration (SPI) method was proposed in [2]. The authors confirmed from their experience that the part of the category system which corresponds to (excessively segmentalized) categories of a domain specific area forms almost a tree structure. Based on this fact, when a concept or a category has only one (one-hop or multi-hop) belonging link, the SPI method shortens that belonging link. The underlying idea is that the characteristic is not dispersed when parent categories are tracked back if the concept or category has only one (one-hop or multi-hop) belonging link. In the SPI method, if there is only one belonging link from a node (a concept $c$ or a category $g'$) to a category $g$, the length of that link is counted as 0, which effectively merges the node with its single parent, and then the BVG method is applied.

VVG (Variance-based Vector Generation) considers the weight of each category link. This method is based on the idea that the belonging degree from a certain category (or concept) to its parent categories depends on the number of parent categories, so the weight of each category link is inversely proportional to the number of parent categories. When the number of belonging links from a node (a concept $c$ or a category $g'$) to parent categories is $n$, the weight of each of these belonging links is defined as

$$w(l) = \frac{1}{n}$$

Thus, the weight of a category link becomes 1 if the category has only one parent category. That is why the authors argue that the VVG method already incorporates the same feature as the SPI method, and therefore they did not combine VVG with SPI. We, on the other hand, tried both the BVG + SPI and VVG + SPI combinations and confirmed that VVG + SPI performs slightly better than VVG itself (see Section 7 for details).

In the VVG method, weights are set on all belonging links, and the belonging degree from a concept to a category is calculated according to these weights. Given all paths from $c$ to $g$, the belonging degree from concept $c$ to category $g$ is defined as follows:

$$bd(c, g) = \sum_{p \in Path(c, g)} w(p)$$

where $w(p)$ is the weight of path $p$, calculated by the following equation:

$$w(p) = \prod_{l \in Link(p)} w(l)$$

Here, $Link(p)$ denotes the set of all belonging links forming path $p$ and $l$ denotes a belonging link.

After a vector is built and sorted in descending order of belonging degree, a cutting technique is applied to it to obtain the list of IDs most relevant to the base category. We tried out two approaches (illustrated in the sketch after this list):
1. belonging degree threshold – concepts with belonging degrees less than the mean belonging degree of the vector are filtered out;
2. percent threshold – the 25% of concepts with the lowest belonging degrees are filtered out.
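A minimal sketch of the two cutting techniques, assuming the concept vector is given as a map from concept ID to belonging degree; the 25% cut-off fraction follows the description above, while everything else (names, types) is illustrative.

```java
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Two cutting techniques applied to a concept vector (concept ID -> belonging degree).
public final class CuttingTechniques {
    private CuttingTechniques() {}

    /** Belonging degree threshold: keep concepts whose degree is at least the mean degree. */
    public static Set<Integer> meanThreshold(Map<Integer, Double> vector) {
        double mean = vector.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        return vector.entrySet().stream()
                .filter(e -> e.getValue() >= mean)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Percent threshold: drop the given fraction (e.g. 0.25) of concepts with the lowest degrees. */
    public static Set<Integer> percentThreshold(Map<Integer, Double> vector, double dropFraction) {
        int keep = (int) Math.ceil(vector.size() * (1.0 - dropFraction));
        return vector.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(keep)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```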
Finally, Topic Deriver produces the domain concepts file, which contains the IDs of the derived concepts.

4.3 Reducer

Reducer is the final part of the system. It takes the domain concepts file as input, applies the concept list to the complete Texterra knowledge base, and produces the reduced domain specific version of the latter. The result contains not only concept IDs, but also the full information about each of them, including the part of the category structure that covers the selected concepts. Therefore, the domain specific version is consistent and ready for loading into Texterra.

5 An approach to storing Wikipedia category graph

As shown above, fast access to any point of the Wikipedia category structure is necessary for the efficiency of all described computations. In particular, the VVG method requires both the entire WCG and the subgraph of the current base category to be available simultaneously.

Chernov et al. [15] studied semantic relationships between Wikipedia categories. They exported a dataset of about 670 thousand pages into a MySQL database; the data size was ~1.2 Gb. But, like many other researchers, they picked just a small sample of pages for processing (a few thousand). For such small-scale approaches, even a usual on-disk relational DB is fast enough.

Our goal, however, was to create a technique for fast iterative traversal even through top-level categories with millions of pages. Thus, we resorted to in-memory storage of the WCG.

5.1 Evaluation results for known graph libraries

We tried out two third-party libraries for storing the WCG in the JVM's memory. The first of them, JUNG [30], showed satisfactory performance on small-scale subgraphs, but the entire WCG expanded rapidly while loading and did not fit in the RAM of the test machine (8 Gb). The second one, JGraphT [31], demonstrated almost the same behaviour: the WCG consumed somewhat less memory, but still too much. These observations prevented us from utilizing these libraries as a solution for WCG storage.

There are a number of other libraries for graph storage which provide a handy interface to the stored data. We have found the neo4j [32] and WebGraph [33] libraries; they may prove useful in further research.

5.2 WikiGraph

The common shortcoming of all Java graph implementations we tested seemed to be the redundancy of the data stored in RAM. Thus, the right way is to change the data storage manner. We put this into practice in WikiGraph.

All the prominent features of WikiGraph are due to the fact that it is intended to store the WCG (a storage sketch follows this list):
1. It is directed (as category links have a direction);
2. It introduces the notions of category and article and provides powerful tooling to store and maintain the data on affiliations between them;
3. Only the IDs and types of pages are stored. Each vertex is presented as a map of [ID, isCategory] entries, which allows the page type to be stored as a Boolean variable (TRUE for a category, FALSE for an article). All page data are saved as primitive variables, not objects;
4. An incidence list has been chosen as the main data structure (along with the vertex and edge lists). This is particularly important as the WCG is quite dense: the number of edges greatly exceeds the number of vertices. The incidence list is organized as a set of [vertex, [list of incident edges]] entries. Each list is sorted in ascending order of edge IDs right after loading, which avoids the need to scan the entire incidence list to get all the edges incident to an arbitrary vertex. Moreover, due to the sorting of the lists, the search over them can be interrupted once the edge with the greatest expected ID is found;
5. All kinks (self-to-self links) are removed;
6. The initial capacity of the incidence list is set beforehand to the approximate number of vertices in the WCG (3,500,000 in this case). This saves some memory allocation costs while loading;
7. A set of helper methods has also been developed (for instance, a method for deriving the subgraph of a given base category). This set provides a usable and fast interface to the WCG data.

After the described features were implemented, they allowed us to fit the WCG entirely in RAM (~4 Gb needed) and led to a significant speed-up of loading and processing.
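A condensed sketch of the storage idea (primitive IDs, a boolean page type, and per-vertex adjacency arrays sorted for binary search); this is our reading of the feature list above, not the actual WikiGraph code, and all names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of WCG in-memory storage in the spirit of the feature list above:
// page IDs and types as primitives, adjacency kept per vertex and sorted
// so that membership tests can use binary search.
public class WcgStorageSketch {
    private final Map<Integer, Boolean> isCategory = new HashMap<>(3_500_000); // page ID -> type
    private final Map<Integer, int[]> outgoing = new HashMap<>(3_500_000);     // child ID -> sorted parent category IDs

    public void addPage(int id, boolean category) {
        isCategory.put(id, category);
    }

    /** Stores the belonging links of one page; duplicates and self-loops ("kinks") are dropped. */
    public void setBelongingLinks(int childId, int[] parentCategoryIds) {
        int[] parents = Arrays.stream(parentCategoryIds)
                .filter(p -> p != childId)
                .distinct()
                .sorted()
                .toArray();
        outgoing.put(childId, parents);
    }

    /** True if childId has a belonging link to parentCategoryId (binary search over the sorted array). */
    public boolean belongsTo(int childId, int parentCategoryId) {
        int[] parents = outgoing.get(childId);
        return parents != null && Arrays.binarySearch(parents, parentCategoryId) >= 0;
    }

    public boolean isCategory(int id) {
        return isCategory.getOrDefault(id, false);
    }
}
```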
6 Evaluation methodology

Obviously, a domain specific subset of Wikipedia should have good coverage of the domain knowledge. But there is no easy direct way to evaluate the quality of such a subset. The reason is that we would have to evaluate the completeness of the knowledge available from Wikipedia's articles in the resulting subset compared to that of the specific domain, which is clearly rather difficult. In addition, the quality of the link structure in the resulting subset should also be evaluated.

Therefore, we applied a so-called in vivo approach to evaluation. To estimate the quality of the proposed methods, we studied how applying the extracted subsets affects the performance of Texterra as a whole. As mentioned before, one of Texterra's components is the system that enriches ID3 tags of musical recordings with links to the corresponding articles of Wikipedia. This system utilizes the graph structure to compute semantic relatedness between Wikipedia pages [13]. Semantic relatedness is then exploited by the word sense disambiguation (WSD) algorithm, which is intended to choose the most relevant Wikipedia page from several homonymic variants.

We assume that each page of Wikipedia describes one possible meaning. The WSD algorithm selects the most consistent combination of meanings corresponding to the input ID3 tags. For a sequence of input tags, it computes the similarity between all pairs of meanings. The weight of a sequence is the sum of the weights of all its pairs. Then, the algorithm selects the sequence with the greatest weight.

To show good results, the derived subset should include as many as possible of the Wikipedia articles associated with the specific domain. In addition, the link structure should be good enough for relatedness computation. Therefore, this approach allows evaluating both the quality of the dictionary content and the quality of the link structure.

For testing purposes, we derived several music-related subsets of Wikipedia by running different combinations of heuristics and used these subsets in the described system. We then consecutively loaded these domain specific versions of the knowledge base into Texterra and ran the tests. We used a small corpus of 20 random musical compositions and 49 different tags. Finally, we estimated the precision of automated disambiguation by comparing the results of the algorithm with manually disambiguated tags.
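To make the sequence-scoring step concrete, the following is a small brute-force sketch: for each combination of candidate meanings (one per tag) it sums pairwise relatedness and keeps the heaviest combination. The relatedness function is passed in as a parameter; all names are illustrative and this is not the Texterra WSD implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Brute-force illustration of choosing the most consistent combination of meanings:
// the weight of a combination is the sum of pairwise relatedness values over all pairs.
public class SequenceWsdSketch {
    /** candidateMeanings.get(i) holds the candidate Wikipedia page IDs for the i-th tag. */
    public static List<Integer> bestCombination(List<List<Integer>> candidateMeanings,
                                                ToDoubleBiFunction<Integer, Integer> relatedness) {
        List<Integer> best = new ArrayList<>();
        double[] bestWeight = {Double.NEGATIVE_INFINITY};
        search(candidateMeanings, 0, new ArrayList<>(), best, bestWeight, relatedness);
        return best;
    }

    private static void search(List<List<Integer>> candidates, int index, List<Integer> current,
                               List<Integer> best, double[] bestWeight,
                               ToDoubleBiFunction<Integer, Integer> relatedness) {
        if (index == candidates.size()) {
            double weight = 0.0;
            for (int i = 0; i < current.size(); i++)          // sum relatedness over all pairs
                for (int j = i + 1; j < current.size(); j++)
                    weight += relatedness.applyAsDouble(current.get(i), current.get(j));
            if (weight > bestWeight[0]) {
                bestWeight[0] = weight;
                best.clear();
                best.addAll(current);
            }
            return;
        }
        for (int meaning : candidates.get(index)) {
            current.add(meaning);
            search(candidates, index + 1, current, best, bestWeight, relatedness);
            current.remove(current.size() - 1);
        }
    }
}
```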
7 Experimental results

The configuration of the test workstation was as follows: Intel Core 2 Duo CPU (3.16 GHz), 8 Gb RAM, Windows 7 Enterprise 64 bit, Java SE 6 Development Kit 1.6.0.20.

Sample vectors for different combinations of heuristics are provided in Tables 1-4. Each sample vector comprises the three concepts with the highest belonging degrees and the three concepts with the lowest values. The base category is Category:Musical compositions. It is noteworthy that the BVG vector differs significantly from the VVG one. Furthermore, enabling SPI affects both vectors.

Table 1. Sample vector produced by the BVG method

ID        | Title                              | Belonging degree
1784928   | Candle in the Wind 1997            | 0.54
1523941   | Axel F                             | 0.50
1728643   | Jeremy (song)                      | 0.49
3720518   | Ludwig Streicher                   | 2.44 × 10^-4
2175948   | Ian Bousfield                      | 2.44 × 10^-4
875344    | Willi Boskovsky                    | 2.44 × 10^-4

Table 2. Sample vector produced by the VVG method

ID        | Title                              | Belonging degree
27684606  | Niagara Falls Suite                | 0.50
20053503  | Kumikyoku Nico Nico Douga          | 0.50
2501716   | Megamix                            | 0.50
14054430  | Oh, by the Way                     | 6.02 × 10^-7
454136    | A Collection of Great Dance Songs  | 6.02 × 10^-7
361654    | Echoes: The Best of Pink Floyd     | 6.02 × 10^-7

Table 3. Sample vector produced by the BVG method with SPI enabled

ID        | Title                              | Belonging degree
1523941   | Axel F                             | 5.53
1815726   | I Will Always Love You             | 3.88
923235    | My Heart Will Go On                | 3.50
9010      | Dance Dance Revolution             | 0.00
4527      | Béla Bartók                        | 0.00
1370      | Ambrose                            | 0.00

Table 4. Sample vector produced by the VVG method with SPI enabled

ID        | Title                              | Belonging degree
27684606  | Niagara Falls Suite                | 0.50
20053503  | Kumikyoku Nico Nico Douga          | 0.50
2501716   | Megamix                            | 0.50
9010      | Dance Dance Revolution             | 0.00
4527      | Béla Bartók                        | 0.00
1370      | Ambrose                            | 0.00

The results of the experiments with different combinations of vector generation methods and cutting techniques are presented in Table 5. The contents of all 3 base categories listed in Section 4.2 are included. The "Ground truth" row corresponds to the original Texterra knowledge base. As can be seen, it is huge, but it ensures the best accuracy of disambiguation.

Table 5. Experimental results

Method                           | SPI enabled | Threshold        | Number of IDs | Size, Mb | Precision, %
Basic Vector Generation          | yes         | belonging degree | 334,575       | 112.3    | 45.10
Basic Vector Generation          | yes         | percent          | 574,810       | 251.2    | 62.75
Basic Vector Generation          | no          | belonging degree | 434,229       | 159.2    | 45.10
Basic Vector Generation          | no          | percent          | 574,940       | 251.8    | 64.71
Variance-based Vector Generation | yes         | belonging degree | 420,302       | 154.8    | 49.02
Variance-based Vector Generation | yes         | percent          | 549,791       | 244.5    | 60.78
Variance-based Vector Generation | no          | belonging degree | 418,998       | 150.1    | 41.18
Variance-based Vector Generation | no          | percent          | 549,639       | 245.5    | 56.86
No threshold                     | —           | —                | 675,228       | 311.5    | 72.55
Ground truth                     | —           | —                | 8,476,942     | 4528.3   | 98.04

"No threshold" is the case when no cutting technique is applied to the concept vectors. In other words, this set of IDs exactly matches the set of all articles from the subgraphs of all base categories. This version is much smaller, but the precision gets lower as well. This precision drop (when no threshold has been applied yet) is due solely to the imperfect choice of base categories: they simply do not cover all concepts required for precise disambiguation. It is also obvious that all comparisons of heuristics results should be made against the "No threshold" results, not against the ground truth.

As one can see, BVG performs a bit better than VVG. The most accurate combinations of heuristics are BVG + percent threshold and BVG + SPI + percent threshold. Enabling SPI for VVG slightly increases the precision of disambiguation. The percent threshold is definitely better than the belonging degree threshold.

What is important here is that the size of the Texterra knowledge base (both on disk and in RAM) depends linearly on the number of concepts. Thus, the challenge is to find a compromise between the precision of disambiguation and the size (and contents) of the knowledge base.

The conducted experiment was just our first effort of this kind. The implemented algorithms allowed us to reduce the size of the Texterra knowledge base by more than an order of magnitude, while the precision of disambiguating musical metadata decreased from 98% to 64%. We believe that these results prove the applicability of the proposed approach for deriving a domain specific subset of Wikipedia. Certainly, there are still many things to improve.

8 Conclusion and future work

According to the results of this study, we outline the following:
- the Wikipedia category network may be utilized for deriving a domain specific subset of Wikipedia;
- using concept vectors seems to be an appropriate way to represent the belonging affiliations between Wikipedia pages;
- BVG performs a bit better than VVG;
- SPI often improves the results;
- the percent threshold showed the best results as a cutting technique.

Possible directions of future work include:
1. As noted by the authors of [18], the selection of the root node is vital to the quality of the domain specific corpus. Thus, it is reasonable to introduce some heuristics for automatically identifying the most appropriate base category given just a set of specific keywords. Moreover, there can be several base categories with either manually or automatically set relevance levels. For example, to perform the search for "Musicians of World War II" a user should provide 2 base categories as input: Category:Musicians with a relevance level of 0.9 and Category:World_War_II with a relevance level of 0.5;
2. Try other cutting techniques for concept vectors (e.g., attempt to detect the distribution of belonging degrees and utilize it);
3. Detect and remove meaningless pages from the WCG (e.g., pages from the administrative section of Wikipedia [4]);
4. Distinguishing between classes and instances among categories [9] may help to prune and/or reorganize the WCG;
5. Add a facility for storing the results of semantic relatedness computation to speed up further processing;
6. Develop an approximation algorithm for finding all paths between two arbitrary nodes in the WCG.

In this work, we have demonstrated the possible benefits of automated building of domain specific ontologies. Also, we have tested different heuristics while implementing a system for such processing. An original approach to storing the WCG in RAM has been proposed, along with a specific evaluation methodology. The described approach can be applied to any ontology with a well-defined polyhierarchical taxonomy (for instance, YAGO2). As it seems to us, weighting the existing semantic connections is always a challenging task when building any more or less large ontology. This may be helpful for any domain dependent Wikipedia-related research [34, 35].
References

[1] List of Wikipedias - Meta. http://meta.wikimedia.org/wiki/List_of_Wikipedias
[2] M. Shirakawa, K. Nakayama, T. Hara, S. Nishio. Concept Vector Extraction from Wikipedia Category Network. In Proceedings of the 3rd International Conference on Ubiquitous Information Management and Communication (ICUIMC 2009), pp. 71-79, 2009.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives. DBpedia: A nucleus for a web of open data. In ISWC, volume 4825 of LNCS, pages 722-735. Springer, 2007.
[4] Simone P. Ponzetto, Michael Strube. Deriving a large scale taxonomy from Wikipedia. In AAAI'07: Proceedings of the 22nd national conference on Artificial intelligence, pp. 1440-1445, 2007.
[5] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. In Elsevier Journal of Web Semantics, Vol. 6, No. 3, pp. 203-217, 2008.
[6] Unified Medical Language System (UMLS) - Home. http://www.nlm.nih.gov/research/umls/
[7] P. Buitelaar, P. Cimiano, B. Magnini (Eds.). Ontology Learning from Text: Methods, Evaluation and Applications. In Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.
[8] A. Gregorowicz, M. A. Kramer. Mining a Large-Scale Term-Concept Network from Wikipedia. Technical Report #06-1028, The MITRE Corp., Oct. 2006.
[9] Cäcilia Zirn, Vivi Nastase, Michael Strube. Distinguishing between instances and classes in the Wikipedia taxonomy. In Proc. of ESWC-08, pages 376-387, 2008.
[10] Gaoying Cui, Qin Lu, Wenjie Li, Yi-Rong Chen. Mining Concepts from Wikipedia for Ontology Construction. In Proceedings of Web Intelligence/IAT Workshops, pp. 287-290, 2009.
[11] J. Hoffart, F. Suchanek, K. Berberich, G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Research Report MPI-I-2010-5-007, Max-Planck-Institut für Informatik, November 2010.
[12] A. Budanitsky, G. Hirst. Evaluating WordNet-based measures of semantic distance. In Computational Linguistics, 32(1), pp. 13-47, March 2006.
[13] D. Turdakov, P. Velikhov. Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation. In Proc. of SYRCoDIS, 2008.
[14] T. Zesch, I. Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT), 2007.
[15] S. Chernov, T. Iofciu, W. Nejdl, X. Zhou. Extracting Semantic Relationships between Wikipedia Categories. In Proceedings of the First International Workshop on Semantic Wikis - From Wiki To Semantics, June 2006.
[16] M. Strube, S. P. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st national conference on Artificial intelligence (AAAI'06), pp. 1419-1424, 2006.
[17] Z. Syed, T. Finin, A. Joshi. Wikipedia as an Ontology for Describing Documents. In Proceedings of the Second International Conference on Weblogs and Social Media, 2008.
[18] G. Y. Cui, Q. Lu, W. J. Li, Y. R. Chen. Corpus Exploitation from Wikipedia for Ontology Construction. In LREC 2008, Marrakech, pp. 2125-2132, 2008.
[19] Wikipedia:Overcategorization - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Overcategorization
[20] Wikipedia:Categorization - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Categorization
[21] CatGraph. http://toolserver.org/~dapete/catgraph/
[22] Wikipedia:WikiProject Categories/uncategorized - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Categories/uncategorized
[23] Wikipedia:Database reports/Uncategorized categories - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Database_reports/Uncategorized_categories
[24] Category:Better category needed - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Category:Better_category_needed
[25] J. Soto. Wikipedia: A Quantitative Analysis. PhD thesis, 2009.
[26] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. Section 22.3: Depth-first search, pp. 540-549.
[27] L.-E. Thorelli. An algorithm for computing all paths in a graph. In BIT 6, pp. 347-349, 1966.
[28] M. Migliore, V. Martorana, F. Sciortino. An algorithm to find all paths between two nodes in a graph. In Journal of Computational Physics, v.87, n.1, pp. 231-236, March 1990.
[29] R. Simoes. APAC: An exact algorithm for retrieving cycles and paths in all kinds of graphs. In Tékhne, no.12, pp. 39-55, 2009.
[30] JUNG - Java Universal Network/Graph Framework. http://jung.sourceforge.net/
[31] JGraphT - a free Java graph library. http://www.jgrapht.org/
[32] neo4j open source nosql graph database. http://neo4j.org/
[33] WebGraph. http://webgraph.dsi.unimi.it/
[34] Wikipedia:Academic studies of Wikipedia - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Academic_studies_of_Wikipedia
[35] Academic studies about Wikipedia - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Academic_studies_about_Wikipedia#Natural_language_processing