A Category-Driven Approach to Deriving Domain Specific Subset of Wikipedia

Anton Korshunov, Denis Turdakov
ISPRAS
{korshunov, turdakov}@ispras.ru

Jinguk Jeong, Minho Lee, Changsung Moon
Convergence Solution Team, DMC R&D Center, Samsung Electronics Co., Ltd.
{jinguk.jeong, minho03.lee, albert.moon}@samsung.com

This research was carried out in collaboration with, and supported by, the Samsung Electronics Co., Ltd. DMC R&D Center Convergence Solution Team. Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems (SYRCoDIS), Moscow, Russia, 2011.

Abstract

While many researchers attempt to build various kinds of ontologies by means of Wikipedia, the possibility of deriving a high-quality domain specific subset of Wikipedia using its own category structure still remains undervalued. In this paper we argue for the necessity of such processing and propose an appropriate technique. As a result, the size of the knowledge base for our text processing framework has been reduced by more than an order of magnitude, while the precision of disambiguating musical metadata (ID3 tags) has decreased from 98% to 64%.

1 Introduction

There is no need to introduce Wikipedia as the world's largest and most rapidly expanding source of information on many domains in almost every language. At the time of drafting this paper, there are 279 Wikipedias in different languages, with more than 17,870,000 articles, 1,920,000 uploaded images, and 27,540,000 registered users [1]. 39 Wikipedias contain more than 100,000 articles each. The English edition remains the largest Wikipedia, over three times as large as the second largest edition, the German Wikipedia. That is why it is reasonable to start investigating new possibilities with the English Wikipedia as the most comprehensive source.

By now, Wikipedia has accumulated a lot of colorful, detailed descriptions. The conventional features of Wikipedia are well known and discussed widely in [2, 25].
They include concept identification by ID or URL, a multiple and dense link structure, and a category system which is edited and maintained by Wikipedia users, as well as the articles themselves.

The research community has proved more than once that the structure and content of Wikipedia are very peculiar and valuable domains to study. One of the most promising directions is automated ontology building, which may be accomplished by extracting well-defined concepts and relations among them from Wikipedia. The latest efforts in this field are DBpedia [3], the work of Ponzetto et al. [4], YAGO [5], etc. All these approaches mainly exploit data derived from infoboxes and the category structure. Due to the continuous enrichment of these techniques with better algorithms and data sources, the quality of the resulting ontologies is becoming remarkable.

Despite these successes in automated upper-level ontology building, most domain specific ontologies (for instance, UMLS [6]) are still assembled manually. Needless to say, they are often costly to build and to keep up to date.

In this paper, we consider the possibility of loosely supervised extraction of domain specific ontologies from upper-level ones. This is becoming feasible in recent years as high-quality upper-level ontologies grow more versatile and incorporate the knowledge of many new domains. Moreover, many field experts from all around the world tend to spend their effort on expanding existing widespread ontologies rather than on creating their own or refining domain specific ones.

Our interest in this research is caused by the necessity of reducing the knowledge base for our text processing framework Texterra (http://modis.ispras.ru/texterra). This base comprises a number of textual indices produced by our Wikipedia parser. These indices are loaded into RAM during the initialization of the Texterra server and take roughly 4.5 Gb of disk space and 2 Gb of RAM. Such consumption is acceptable for workstations, but not for mobile devices with a restricted amount of memory. Thus, if one attempted to build a standalone mobile application intended for text processing, its data structures simply would not fit in the RAM. Obviously, such applications, if developed, would be in high demand today.

The disambiguation of musical metadata (i.e., ID3 tags of MP3 files) is an important part of Texterra functionality. (Word sense disambiguation is an open problem of natural language processing: the task of identifying which sense of a word, i.e. which meaning, is used in a sentence when the word has multiple meanings, or polysemy.) These algorithms are well tuned by now and show a precision of 98% on our test set. But it is obvious that only a narrow range of the Wikipedia dictionary is utilized while processing ID3 tags of MP3 files. These are mostly named entities, such as musical compositions, song writers, singers, etc. Generally speaking, the amount of knowledge required to perform such disambiguation is likely much less than that available from the entire knowledge base.

Making the knowledge base more specialized by removing concepts which are unimportant for this task is what comes to mind first in this case. So, we have implemented a system intended to produce Wikipedia subsets covering the knowledge of a given domain. We decided not to narrow the scope of our research to Wikipedia derivatives (such as YAGO): although the algorithms have been evaluated with Wikipedia only, they are applicable to any ontology with a well-defined polyhierarchical taxonomy.

Of the many interesting aspects of Wikipedia, here we take into account only categorization and linkage within its content. The category system in Wikipedia plays the role of a taxonomy and provides the ability to search for articles by narrowing down categories.
Given this, the task of deriving domain specific concepts from the Wikipedia dictionary seems to be solvable by mining the category system for the list of concepts tightly connected with given base categories. However, it is impossible to simply determine all concepts belonging to a particular category. The reason is that the category system of Wikipedia is organized as a network, not a perfect tree structure. Furthermore, the lists of parent categories for many articles are redundant and contradictory. Thus, they do not allow detecting the most relevant categories for a given page without analyzing the neighbouring pages.

Therefore, we utilized the concept vectorization method specialized for the category system of Wikipedia, with several additional expansion methods, as proposed in [2]. The authors express affiliation relations among concepts as category-based concept vectors. Each element (dimension) of a concept vector represents not only binary affiliation information (whether the concept belongs to a certain category or not), but also the degree of affiliation (called the belonging degree in the rest of this paper). After applying different cutting techniques to the obtained concept vectors, the list of domain specific concept IDs is output. This list is then applied to the Texterra knowledge base, resulting in a reduced domain specific version of the latter. The explanation of our approach is given in Section 4.

Section 2 contains the review of related work. The key features of the Wikipedia category structure are discussed in Section 3. Having fast access to any point of the Wikipedia category structure was one of the crucial tasks of our research; after a number of unsuccessful attempts to reuse existing graph libraries, we coded our own implementation, named WikiGraph (see Section 5 for details). Given a set of specialized versions of the knowledge base, we conducted a series of experiments in order to learn how reducing the Texterra knowledge base affects the accuracy of results; refer to Section 6 for the evaluation methodology. Section 7 contains experimental results and discussion. We conclude in Section 8 with our research findings and a discussion of possible improvements.

2 Related work

2.1 Automated ontology building

The field of automated ontology learning usually proceeds by taking textual input and transforming it into a taxonomy or a proper ontology. The texts are usually obtained from printed sources (books, magazines, newspapers) and the Internet (online media, blog posts, results of querying search engines). However, the learned ontologies are small and hard to update; in addition, evaluations have revealed rather poor performance [7].

As mentioned before, the most accurate and, thus, valuable non-human-assembled ontologies are now built by automatically deriving explicit facts from Wikipedia. One of the early attempts was the work of Gregorowicz and Kramer [8]. They focused on deriving a term-concept map which consists of terms, concepts, and relationships between them. Only articles, redirects, and disambiguation pages are considered. Ponzetto and Strube [4] derived a taxonomy from the entire Wikipedia category structure. Despite their success, the resulting taxonomy is rather simple (it supports only is-a and not-is-a relations) and domain independent. The work described in [9] enhances this taxonomy with instance and class information for each node. Cui et al. [10] introduce an even more sophisticated approach to building an ontology of concepts by making use of infobox structures, definition sentences, and category labels. Finally, YAGO2 [11] is probably the most complete and accurate semantic knowledge base derived automatically from Wikipedia and other sources. The information extraction technique for YAGO2 assumes varied utilization of infoboxes, category structure, redirects, and other data within Wikipedia. Furthermore, a quality check is performed to find possible mistakes. As a result, the quality of the extracted ontology is sufficient for the majority of IR- and NLP-related tasks.

Notwithstanding the foregoing, newly emerging research fields often require well-structured and comprehensive domain specific ontologies. Researchers do not need a huge knowledge base, but an extensible corpus of specific concepts with good coverage of domain knowledge. If such ontologies were built in a completely automated way, this would avoid the necessity of assembling them from scratch manually. We believe that Wikipedia contains enough information for this task to be completed. Below is the description of the approaches we applied to narrowing the Wikipedia dictionary down to a domain specific subset.
2.2 Computing semantic relatedness

We consider the key problem of our study to be the computation of semantic relatedness between base top-level categories and the underlying articles. According to Budanitsky and Hirst [12], semantic relatedness is defined to cover any kind of lexical or functional association that may exist between two words. This definition suits us better than semantic similarity, which is typically defined via the lexical relations of synonymy and hypernymy.

Many approaches have been proposed for estimating semantic relatedness between concepts in Wordnet (Rada et al., Leacock and Chodorow, Wu and Palmer, Resnik, Jiang and Conrath, Lin) and between Wikipedia articles [13] (Dice, Jaccard, SimRank). Among them are purely graph-based measures and those involving information content. In this paper, we consider only graph-based approaches.

Since a significant part of Wikipedia knowledge is encoded in its graph-like link structure, it seems reasonable to apply existing graph-based methods or introduce new ones. Zesch and Gurevych [14] showed that many Wordnet-based semantic similarity measures are applicable to Wikipedia with minor changes.

Obviously, it is also attractive to estimate the strength of ties between different levels of the taxonomy (between base categories and their articles, in our case). In this context, the links among articles appear less important. Therefore, new category-based semantic relatedness measures are emerging.

Chernov et al. [15] measured semantic relatedness between categories, not between concepts and categories, as is required for our task. Nonetheless, they proposed several very useful and applicable techniques.

Strube and Ponzetto [16] employed Wikipedia, Wordnet, and Google for computing semantic relatedness between concepts. For two Wikipedia articles being compared, they extracted two category lists. Given the category lists, for each category pair they performed a depth-limited search of maximum depth 4 for a least common ancestor. As they noticed, limiting the search improves the results. But this is obviously inappropriate for computing relatedness between a top-level category and the articles from all its subcategories.
2.3 Identifying domain specific concepts

Syed et al. [17] tried to predict the topic of textual documents by matching them against Wikipedia articles based on cosine similarity. (Cosine similarity is a measure of similarity between two vectors obtained by measuring the cosine of the angle between them; it determines whether two vectors point in roughly the same direction and is often used to compare documents in text mining.) Then, they extracted the categories of the found articles and scored them using different scoring schemes with or without spreading activation. (Spreading activation is a method for searching associative, neural, or semantic networks: the search is initiated by labelling a set of source nodes with weights or "activation" and then iteratively propagating, or "spreading", that activation to other nodes linked to the source nodes.) The proposed approach implies scoring the links with many categories for each given Wikipedia article using bottom-up traversal of the category structure. This is acceptable for a small set of articles, but not for our task.

To the best of our knowledge, Cui et al. [18] were the first to introduce an approach to deriving a domain specific corpus from Wikipedia. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wikipedia; then, only articles strongly linked to this hierarchy are selected. They build a so-called Classification Tree by traversing down the directed graph of the Wikipedia category structure starting from the root node. This tree includes both categories and articles and in fact is merely a connected branch of the Wikipedia classification graph with a specified root node. The Classification Tree is then traversed with a simple adaptation of the breadth-first search algorithm. During the traversal, each node is given a score for its relevance to the specific domain. Once the traversal is completed, the terminal nodes (article pages) are ranked according to the domain relevance scores. Pages over a certain threshold are considered domain relevant. The node score can consider either ingoing or outgoing edges. Although the proposed technique is quite simple, the results are remarkable.

A more sophisticated algorithm was proposed later by Shirakawa et al. [2]. The concept vectorization method is introduced for finding concepts which are highly correlated with a base category (refer to Section 1 for a brief explanation). The main assumption there is that the relatedness between categories gets lower as the number of traversed pages (i.e., the hopcount) increases. In addition to the number of links for each node, they also take into account the number of paths between the concept and the base category, as well as the hopcounts of these paths. It seems to us that this understanding reflects the nature of the Wikipedia classification approach much more precisely than ever before. Several heuristics are suggested for estimating semantic relatedness by counting path properties in the subgraph of the desired base category. However, the authors did not compute these scores for thousands of articles with hundreds of paths each at a time, as is required in this research. Moreover, the efficiency of the approaches described in [2] and [18] has not been evaluated in real tasks. In both cases, the evaluation was performed by comparing the results of the algorithms with the answers of experts knowledgeable in certain fields. This is persuasive for proving theoretical applicability, but is not enough for unconditional embedding into a real system.

In this work, we mainly exploited the ideas formulated in [2] and [18]. Our main goal was to estimate the scalability and practical applicability of these approaches for real tasks which imply processing a large amount of data.
3 Features of Wikipedia category structure

The advantages of the Wikipedia category structure were studied by the authors of [14] and many others. Here we summarize only those features needed for a better understanding of our approach.

The categories of Wikipedia can be organized into a graph, where the nodes are categories and the edges are hyperlinks. In this work we also add articles to this graph. However, we still name it the Wikipedia category graph (WCG in the rest of the paper).

The links expressing which concept belongs to which categories are called category links. We call them belonging links or belonged links according to their direction. In this paper, we only consider the belonging links, i.e. links from articles or subcategories to upper-level categories. The English version of Wikipedia, as of September 2010, contains ~13 million category links.

The typical code of a categorized article page is shown in Figure 1. It combines XML and Wiki markup. The list of belonging categories is situated at the bottom. Categories have their own pages, similar to articles. Category links on these pages also express which category belongs to which categories.

[Figure 1. Sample XML structure of a categorized Wikipedia article page]

Categorization is a useful tool to group articles for ease of navigation and for correlating similar information. However, not every verifiable fact (or the intersection of two or more such facts) in an article requires an associated category. For lengthy articles, this could potentially result in hundreds of categories, most of which are not particularly relevant. This may also make it more difficult to find any particular category for a specific article. Such overcategorization is also known as "category clutter" [19].

For these reasons, the WCG has an extremely complex nature. It is directed and does not have as strong a hierarchical structure as some may expect. Any category may branch into subcategories, and it is possible for a category to be a subcategory of more than one parent [20]. Upon closer inspection, the WCG is rather a polyhierarchy, or even a net (Figure 2).

[Figure 2. A fragment of the Wikipedia category structure]

The figure has been produced by CatGraph [21]. This tool draws a cloud of links for the desired category. Each rectangle represents a category. Each arrow connecting two rectangles denotes a "belongs-to" relation, that is, the destination category is a subcategory of the initial one (an example of a belonging link). The cloud shown in the figure is for the "Recorded music" category (bolded).

It is worth noting here that not every Wikipedia page is categorized. According to statistics [22, 23], there are thousands of uncategorized articles and categories. Moreover, certain categories are assigned incorrectly [24]. We suggest considering these facts as possible drawbacks for any category-based algorithm. In addition, automated categorization (i.e., determining the topic of an uncategorized page) seems to be a challenging task. This can be done with certain accuracy by processing the page title and text with specific NLP techniques and finding appropriate categories in the WCG.

4 Deriving domain specific subsets

The developed system consists of three main parts (sketched after this list):
1. Link Filter produces a ready-for-load textual representation of the WCG;
2. Topic Deriver performs the main processing;
3. Reducer produces a domain specific version of the Texterra knowledge base (for this research, the knowledge base was obtained by parsing the dump of the English Wikipedia as of September 2010).

All algorithms evaluated in this paper were implemented in the Java programming language.
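To make the division of responsibilities concrete, the following is a minimal sketch of contracts for the three stages; all names and signatures here are ours for illustration and are not the actual Texterra interfaces.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Set;

// Illustrative contracts for the three stages described above.
interface LinkFilter {
    /** Reads the full Wikipedia links file and writes the category links file (the WCG). */
    void filter(Path wikipediaLinksFile, Path categoryLinksFile);
}

interface TopicDeriver {
    /** Builds a subgraph per base category, computes concept vectors,
     *  applies a cutting technique and returns the surviving concept IDs. */
    Set<Integer> derive(Path categoryLinksFile, List<String> baseCategories);
}

interface Reducer {
    /** Projects the complete knowledge base onto the derived concept IDs. */
    void reduce(Path completeKnowledgeBase, Set<Integer> domainConceptIds, Path reducedKnowledgeBase);
}
```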
4.1 Link Filter

The input for Link Filter is the Wikipedia links file containing information about all links between Wikipedia pages, along with their type and direction. The result is the category links file, which contains only the links forming the WCG. Every line of this file denotes the belonging relation between two pages and sets the type of the belonging page. For example,

12 780754 0

means that the page with ID 12 ("Anarchism") belongs to the category with ID 780754 ("Category:Anarchism"). Moreover, the belonging page ("Anarchism") is an article because the last field is "0" ("1" would mean that the belonging page is a category). Thus, the category links file is a complete textual representation of the WCG and contains no unwanted data such as page titles and link types. Furthermore, unlike the authors of [18], we have also removed pages of certain types: lists, classifications, portals, redirects, disambiguation pages, and user pages. This helped us to make the WCG more lightweight without loss of any meaningful concepts. As a result, the category links file contains 13,001,687 links between 593,796 categories and 3,156,822 articles.
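For illustration, a minimal sketch of loading the category links file into memory might look as follows; the field layout (child ID, parent category ID, child type flag) follows the example above, while the class and map names are ours.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Loads "childId parentCategoryId childType" lines (childType: 0 = article, 1 = category)
// into a parent -> children adjacency map. Names are illustrative.
public class CategoryLinksLoader {
    final Map<Integer, List<Integer>> childrenByCategory = new HashMap<>();
    final Set<Integer> categories = new HashSet<>();

    public void load(String categoryLinksFile) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(categoryLinksFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length != 3) continue;          // skip malformed lines
                int child = Integer.parseInt(fields[0]);   // article or subcategory ID
                int parent = Integer.parseInt(fields[1]);  // category ID
                boolean childIsCategory = "1".equals(fields[2]);
                if (childIsCategory) categories.add(child);
                categories.add(parent);                    // the parent is always a category
                childrenByCategory.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
            }
        }
    }
}
```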
4.2 Topic Deriver

Topic Deriver loads the category links file on start and fills in the internal structures of WikiGraph (refer to Section 5 for details). The workflow of this stage is shown in Figure 3; here we touch on only the main steps.

[Figure 3. Overall architecture of Topic Deriver. Step 1: get the list of categories to be processed. Step 2: connect to the Texterra server and obtain an ID for each category. Step 3: explore the pre-loaded Wikipedia category graph and build a separate subgraph for each category. Step 4: traverse the subgraphs and build concept vectors, each consisting of "concept ID → belonging degree" pairs. Step 5: apply a cutting technique to each vector and collect the remaining IDs into a common list of extracted concepts. Step 6: process the Texterra knowledge base so as to keep only the records associated with the obtained list.]

For our analysis, we denote by $C$ the set of concepts, by $G$ the set of categories, and by $L$ the set of belonging links. The category system of Wikipedia is then expressed as a directed graph $(C \cup G, L)$. A path is a sequence of edges that connects one node with another. The path length (hopcount) is the number of edges along that path.

The key task of Topic Deriver is to obtain the list of concepts semantically connected with a certain domain. Moreover, this connection should be the tightest one, that is, these concepts should be more relevant to the desired domain than to others. As this task is computationally complex and, thus, supposed to be run rarely, it is acceptable to choose the base categories manually for the experiments. We selected 3 base categories that likely cover the majority of concepts required for disambiguating musical metadata:
- Category:Musical compositions
- Category:Recorded music
- Category:Music-related lists

For each of the selected base categories, a separate subgraph is built. This subgraph is almost the same as the Classification Tree in [18]. The base category serves as the root node, and a tree-like structure of underlying pages is obtained from the WCG. The only difference from the approach proposed in [18] is that we use depth-first search (DFS), not breadth-first search (BFS). The reason is that the resulting subgraphs are often large, so it is inappropriate to waste memory on the FIFO queue required for BFS traversal [26]. Moreover, as a depth-first tree is expected to contain back edges and cross edges, a list of visited nodes has been added to avoid repeated visits and loops.

A concept vector in our research specifies the degree of affiliation between the base category and each of the articles reachable by traversing down the subgraph of the base category starting from its root node. As mentioned, the heuristics for building concept vectors have been borrowed from [2] with some modifications. We describe them briefly below; for more detailed information refer to the source paper.

BVG (Basic Vector Generation) generates concept vectors by tracking back parent categories in the category system and calculating the belonging degree for each concept. The belonging degree from concept $c$ to category $g$ is defined by the following equation:

$$bd(c, g) = \sum_{p \in Path(c, g)} \frac{1}{f(hop(p))}$$

Here, $Path(c, g)$ denotes the set of paths from $c$ to $g$, $hop(p)$ denotes the hopcount of path $p$, and $f$ denotes a monotonically increasing function of the hopcount of a path (the concrete choice of $f$ is given in [2]).

It is noteworthy that in the original method [2] paths with a hopcount of more than 4 were ignored. We asked the authors for the reasons for this. The response was: "Because long paths scarcely affect values in concept vectors in most cases. Of course, sometimes long paths affect the values". We decided to remove this constraint in our experiments, that is, we consider all paths between two nodes.
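As a rough illustration (not the Texterra implementation), the sketch below enumerates all simple downward paths from a base category and accumulates BVG belonging degrees for the articles it reaches; the penalty function f is only a stand-in, since the concrete function used in [2] is not reproduced in this paper.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Basic Vector Generation: the belonging degree of an article is the sum of
// 1 / f(hopcount) over all paths from the base category down to that article.
// The graph access maps and f are illustrative placeholders.
public class BvgSketch {
    private final Map<Integer, List<Integer>> childrenByCategory; // category -> children (articles or subcategories)
    private final Map<Integer, Boolean> isCategory;               // page ID -> type

    public BvgSketch(Map<Integer, List<Integer>> childrenByCategory, Map<Integer, Boolean> isCategory) {
        this.childrenByCategory = childrenByCategory;
        this.isCategory = isCategory;
    }

    /** Placeholder for the monotonically increasing penalty on path length (assumption, for illustration only). */
    private double f(int hopcount) {
        return Math.pow(2, hopcount);
    }

    /** Builds the concept vector (article ID -> belonging degree) for one base category. */
    public Map<Integer, Double> conceptVector(int baseCategory) {
        Map<Integer, Double> degrees = new HashMap<>();
        Deque<Integer> path = new ArrayDeque<>();        // nodes on the current DFS path
        walk(baseCategory, path, degrees);
        return degrees;
    }

    private void walk(int node, Deque<Integer> path, Map<Integer, Double> degrees) {
        if (path.contains(node)) return;                 // avoid cycles in the WCG
        path.push(node);
        int hopcount = path.size() - 1;
        if (hopcount > 0 && !isCategory.getOrDefault(node, false)) {
            degrees.merge(node, 1.0 / f(hopcount), Double::sum);  // one more path reached this article
        }
        for (int child : childrenByCategory.getOrDefault(node, List.of())) {
            walk(child, path, degrees);
        }
        path.pop();
    }
}
```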
As a result, the processing time may become too large for base categories from the upper levels of the WCG hierarchy. The reason for such behaviour is the high computational complexity of finding all paths between two arbitrary nodes in a graph; it is well known that this task is NP-complete in the general case. Since the WCG contains millions of edges, the maximal path length may reach hundreds of edges, leading to a rapid increase in processing time when one tries to process top-level categories. This is exactly why we picked a "safe" set of categories, which are processed relatively fast and cover the knowledge of the field we chose for the experiments. Nevertheless, we believe that taking all existing paths between two nodes into account allows the belonging degree to be estimated more precisely; however, we did not confirm this assumption experimentally.

To reduce the complexity of finding all paths between two arbitrary nodes, we tried to reuse one of the existing techniques [27-29]. In the end, the APAC algorithm [29] was chosen. This algorithm does not need to keep track of all visited vertices and only stores the feasible paths.

For domain specific areas where categories are excessively segmentalized, the BVG method cannot extract concept vectors accurately due to the increase in hopcount. To solve this problem, the Single Parent Integration (SPI) method was proposed in [2]. The authors confirmed from their experience that the part of the category system which corresponds to (excessively segmentalized) categories of a domain specific area forms almost a tree structure. Based on this fact, when a concept or a category has only one (one-hop or multi-hop) belonging link, the SPI method shortens that belonging link. The underlying idea is that the characteristic is not dispersed when parent categories are tracked back if the concept or category has only one (one-hop or multi-hop) belonging link. In the SPI method, if there is only one belonging link from a node (a concept $c$ or a category $g'$) to a category $g$, the length of that link is counted as 0, which effectively merges the node with its single parent, and then the BVG method is applied.

VVG (Variance-based Vector Generation) considers the weight of each category link. This method is based on the idea that the belonging degree from a certain category (or concept) to its parent categories depends on the number of parent categories, so the weight of each category link is inversely proportional to the number of parent categories. When the number of belonging links from a node (a concept $c$ or a category $g'$) to parent categories is $n$, the weight of each of these belonging links is defined as

$$w(l) = \frac{1}{n}$$

Thus, the weight of a category link becomes 1 if the category has only one parent category. That is why the authors argue that the VVG method already incorporates the same feature as the SPI method, and therefore they did not combine VVG with SPI. We, on the other hand, tried both the BVG + SPI and VVG + SPI combinations and confirmed that VVG + SPI performs slightly better than VVG itself (see Section 7 for details).

In the VVG method, weights are set on all belonging links, and the belonging degree from a concept to a category is calculated according to these weights. Given all paths from $c$ to $g$, the belonging degree from concept $c$ to category $g$ is defined as follows:

$$bd(c, g) = \sum_{p \in Path(c, g)} w(p)$$

where $w(p)$ is the weight of path $p$, calculated by the following equation:

$$w(p) = \prod_{l \in Link(p)} w(l)$$

Here, $Link(p)$ denotes the set of all belonging links forming path $p$ and $l$ denotes a belonging link.

After a vector is built and sorted in descending order of belonging degree, a cutting technique is applied to it to obtain the list of IDs most relevant to the base category. We tried out two approaches (illustrated in the sketch after this list):
1. belonging degree threshold – concepts with belonging degrees less than the mean belonging degree of the vector are filtered out;
2. percent threshold – the 25% of concepts with the lowest belonging degrees are filtered out.
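A minimal sketch of the two cutting techniques, assuming the concept vector is given as a map from concept ID to belonging degree; the 25% cut-off fraction follows the description above, while everything else (names, types) is illustrative.

```java
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Two cutting techniques applied to a concept vector (concept ID -> belonging degree).
public final class CuttingTechniques {
    private CuttingTechniques() {}

    /** Belonging degree threshold: keep concepts whose degree is at least the mean degree. */
    public static Set<Integer> meanThreshold(Map<Integer, Double> vector) {
        double mean = vector.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        return vector.entrySet().stream()
                .filter(e -> e.getValue() >= mean)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Percent threshold: drop the given fraction (e.g. 0.25) of concepts with the lowest degrees. */
    public static Set<Integer> percentThreshold(Map<Integer, Double> vector, double dropFraction) {
        int keep = (int) Math.ceil(vector.size() * (1.0 - dropFraction));
        return vector.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(keep)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```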
Finally, Topic Deriver produces the domain concepts file, which contains the IDs of the derived concepts.

4.3 Reducer

Reducer is the final part of the system. It takes the domain concepts file as input, applies the concept list to the complete Texterra knowledge base, and produces the reduced domain specific version of the latter. The result contains not only concept IDs, but also the full information about each of them, including the part of the category structure that covers the selected concepts. Therefore, the domain specific version is consistent and ready for loading into Texterra.

5 An approach to storing Wikipedia category graph

As shown above, fast access to any point of the Wikipedia category structure is necessary for the efficiency of all described computations. In particular, the VVG method requires both the entire WCG and the subgraph of the current base category to be available simultaneously.

Chernov et al. [15] studied semantic relationships between Wikipedia categories. They exported a dataset of about 670 thousand pages into a MySQL database; the data size was ~1.2 Gb. But, like many other researchers, they picked just a small sample of pages for processing (a few thousand). For such small-scale approaches, even a usual on-disk relational DB is fast enough.

Our goal, however, was to create a technique for fast iterative traversal even through top-level categories with millions of pages. Thus, we resorted to in-memory storage of the WCG.

5.1 Evaluation results for known graph libraries

We tried out two third-party libraries for storing the WCG in the JVM's memory. The first of them, JUNG [30], showed satisfactory performance on small-scale subgraphs, but the entire WCG expanded rapidly while loading and did not fit in the RAM of the test machine (8 Gb). The second one, JGraphT [31], demonstrated almost the same behaviour: the WCG consumed somewhat less memory, but still too much. These observations prevented us from utilizing these libraries as a solution for WCG storage.

There are a number of other libraries for graph storage which provide a handy interface to the stored data. We have found the neo4j [32] and WebGraph [33] libraries; they may prove useful in further research.

5.2 WikiGraph

The common shortcoming of all Java graph implementations we tested seemed to be the redundancy of the data stored in RAM. Thus, the right way is to change the data storage manner. We put this into practice in WikiGraph.

All the prominent features of WikiGraph are due to the fact that it is intended to store the WCG (a storage sketch follows this list):
1. It is directed (as category links have a direction);
2. It introduces the notions of category and article and provides powerful tooling to store and maintain the data on affiliations between them;
3. Only the IDs and types of pages are stored. Each vertex is presented as a map of [ID, isCategory] entries, which allows the page type to be stored as a Boolean variable (TRUE for a category, FALSE for an article). All page data are saved as primitive variables, not objects;
4. An incidence list has been chosen as the main data structure (along with the vertex and edge lists). This is particularly important as the WCG is quite dense: the number of edges greatly exceeds the number of vertices. The incidence list is organized as a set of [vertex, [list of incident edges]] entries. Each list is sorted in ascending order of edge IDs right after loading, which avoids the need to scan the entire incidence list to get all the edges incident to an arbitrary vertex. Moreover, due to the sorting of the lists, the search over them can be interrupted once the edge with the greatest expected ID is found;
5. All kinks (self-to-self links) are removed;
6. The initial capacity of the incidence list is set beforehand to the approximate number of vertices in the WCG (3,500,000 in this case). This saves some memory allocation costs while loading;
7. A set of helper methods has also been developed (for instance, a method for deriving the subgraph of a given base category). This set provides a usable and fast interface to the WCG data.

After the described features were implemented, they allowed us to fit the WCG entirely in RAM (~4 Gb needed) and led to a significant speed-up of loading and processing.
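A condensed sketch of the storage idea (primitive IDs, a boolean page type, and per-vertex adjacency arrays sorted for binary search); this is our reading of the feature list above, not the actual WikiGraph code, and all names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of WCG in-memory storage in the spirit of the feature list above:
// page IDs and types as primitives, adjacency kept per vertex and sorted
// so that membership tests can use binary search.
public class WcgStorageSketch {
    private final Map<Integer, Boolean> isCategory = new HashMap<>(3_500_000); // page ID -> type
    private final Map<Integer, int[]> outgoing = new HashMap<>(3_500_000);     // child ID -> sorted parent category IDs

    public void addPage(int id, boolean category) {
        isCategory.put(id, category);
    }

    /** Stores the belonging links of one page; duplicates and self-loops ("kinks") are dropped. */
    public void setBelongingLinks(int childId, int[] parentCategoryIds) {
        int[] parents = Arrays.stream(parentCategoryIds)
                .filter(p -> p != childId)
                .distinct()
                .sorted()
                .toArray();
        outgoing.put(childId, parents);
    }

    /** True if childId has a belonging link to parentCategoryId (binary search over the sorted array). */
    public boolean belongsTo(int childId, int parentCategoryId) {
        int[] parents = outgoing.get(childId);
        return parents != null && Arrays.binarySearch(parents, parentCategoryId) >= 0;
    }

    public boolean isCategory(int id) {
        return isCategory.getOrDefault(id, false);
    }
}
```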
6 Evaluation methodology

Obviously, a domain specific subset of Wikipedia should have good coverage of the domain knowledge. But there is no easy direct way to evaluate the quality of such a subset. The reason is that we would have to evaluate the completeness of the knowledge available from Wikipedia's articles in the resulting subset compared to that of the specific domain, which is clearly rather difficult. In addition, the quality of the link structure in the resulting subset should also be evaluated.

Therefore, we applied a so-called in vivo approach to evaluation. To estimate the quality of the proposed methods, we studied how applying the extracted subsets affects the performance of Texterra as a whole. As mentioned before, one of Texterra's components is the system that enriches ID3 tags of musical recordings with links to the corresponding articles of Wikipedia. This system utilizes the graph structure to compute semantic relatedness between Wikipedia pages [13]. Semantic relatedness is then exploited by the word sense disambiguation (WSD) algorithm, which is intended to choose the most relevant Wikipedia page from several homonymic variants.

We assume that each page of Wikipedia describes one possible meaning. The WSD algorithm selects the most consistent combination of meanings corresponding to the input ID3 tags. For a sequence of input tags, it computes the similarity between all pairs of meanings. The weight of a sequence is the sum of the weights of all its pairs. Then, the algorithm selects the sequence with the greatest weight.

To show good results, the derived subset should include as many as possible of the Wikipedia articles associated with the specific domain. In addition, the link structure should be good enough for relatedness computation. Therefore, this approach allows evaluating both the quality of the dictionary content and the quality of the link structure.

For testing purposes, we derived several music-related subsets of Wikipedia by running different combinations of heuristics and used these subsets in the described system. We then consecutively loaded these domain specific versions of the knowledge base into Texterra and ran the tests. We used a small corpus of 20 random musical compositions and 49 different tags. Finally, we estimated the precision of automated disambiguation by comparing the results of the algorithm with manually disambiguated tags.
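To make the sequence-scoring step concrete, the following is a small brute-force sketch: for each combination of candidate meanings (one per tag) it sums pairwise relatedness and keeps the heaviest combination. The relatedness function is passed in as a parameter; all names are illustrative and this is not the Texterra WSD implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Brute-force illustration of choosing the most consistent combination of meanings:
// the weight of a combination is the sum of pairwise relatedness values over all pairs.
public class SequenceWsdSketch {
    /** candidateMeanings.get(i) holds the candidate Wikipedia page IDs for the i-th tag. */
    public static List<Integer> bestCombination(List<List<Integer>> candidateMeanings,
                                                ToDoubleBiFunction<Integer, Integer> relatedness) {
        List<Integer> best = new ArrayList<>();
        double[] bestWeight = {Double.NEGATIVE_INFINITY};
        search(candidateMeanings, 0, new ArrayList<>(), best, bestWeight, relatedness);
        return best;
    }

    private static void search(List<List<Integer>> candidates, int index, List<Integer> current,
                               List<Integer> best, double[] bestWeight,
                               ToDoubleBiFunction<Integer, Integer> relatedness) {
        if (index == candidates.size()) {
            double weight = 0.0;
            for (int i = 0; i < current.size(); i++)          // sum relatedness over all pairs
                for (int j = i + 1; j < current.size(); j++)
                    weight += relatedness.applyAsDouble(current.get(i), current.get(j));
            if (weight > bestWeight[0]) {
                bestWeight[0] = weight;
                best.clear();
                best.addAll(current);
            }
            return;
        }
        for (int meaning : candidates.get(index)) {
            current.add(meaning);
            search(candidates, index + 1, current, best, bestWeight, relatedness);
            current.remove(current.size() - 1);
        }
    }
}
```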
7 Experimental results

The configuration of the test workstation was as follows: Intel Core 2 Duo CPU (3.16 GHz), 8 Gb RAM, Windows 7 Enterprise 64 bit, Java SE 6 Development Kit 1.6.0.20.

Sample vectors for different combinations of heuristics are provided in Tables 1-4. Each sample vector comprises the three concepts with the highest belonging degrees and the three concepts with the lowest values. The base category is Category:Musical compositions. It is noteworthy that the BVG vector differs significantly from the VVG one. Furthermore, enabling SPI affects both vectors.

Table 1. Sample vector produced by the BVG method

ID        | Title                              | Belonging degree
1784928   | Candle in the Wind 1997            | 0.54
1523941   | Axel F                             | 0.50
1728643   | Jeremy (song)                      | 0.49
3720518   | Ludwig Streicher                   | 2.44 × 10^-4
2175948   | Ian Bousfield                      | 2.44 × 10^-4
875344    | Willi Boskovsky                    | 2.44 × 10^-4

Table 2. Sample vector produced by the VVG method

ID        | Title                              | Belonging degree
27684606  | Niagara Falls Suite                | 0.50
20053503  | Kumikyoku Nico Nico Douga          | 0.50
2501716   | Megamix                            | 0.50
14054430  | Oh, by the Way                     | 6.02 × 10^-7
454136    | A Collection of Great Dance Songs  | 6.02 × 10^-7
361654    | Echoes: The Best of Pink Floyd     | 6.02 × 10^-7

Table 3. Sample vector produced by the BVG method with SPI enabled

ID        | Title                              | Belonging degree
1523941   | Axel F                             | 5.53
1815726   | I Will Always Love You             | 3.88
923235    | My Heart Will Go On                | 3.50
9010      | Dance Dance Revolution             | 0.00
4527      | Béla Bartók                        | 0.00
1370      | Ambrose                            | 0.00

Table 4. Sample vector produced by the VVG method with SPI enabled

ID        | Title                              | Belonging degree
27684606  | Niagara Falls Suite                | 0.50
20053503  | Kumikyoku Nico Nico Douga          | 0.50
2501716   | Megamix                            | 0.50
9010      | Dance Dance Revolution             | 0.00
4527      | Béla Bartók                        | 0.00
1370      | Ambrose                            | 0.00

The results of the experiments with different combinations of vector generation methods and cutting techniques are presented in Table 5. The contents of all 3 base categories listed in Section 4.2 are included. The "Ground truth" row corresponds to the original Texterra knowledge base. As can be seen, it is huge, but it ensures the best accuracy of disambiguation.

Table 5. Experimental results

Method                           | SPI enabled | Threshold        | Number of IDs | Size, Mb | Precision, %
Basic Vector Generation          | yes         | belonging degree | 334,575       | 112.3    | 45.10
Basic Vector Generation          | yes         | percent          | 574,810       | 251.2    | 62.75
Basic Vector Generation          | no          | belonging degree | 434,229       | 159.2    | 45.10
Basic Vector Generation          | no          | percent          | 574,940       | 251.8    | 64.71
Variance-based Vector Generation | yes         | belonging degree | 420,302       | 154.8    | 49.02
Variance-based Vector Generation | yes         | percent          | 549,791       | 244.5    | 60.78
Variance-based Vector Generation | no          | belonging degree | 418,998       | 150.1    | 41.18
Variance-based Vector Generation | no          | percent          | 549,639       | 245.5    | 56.86
No threshold                     | —           | —                | 675,228       | 311.5    | 72.55
Ground truth                     | —           | —                | 8,476,942     | 4528.3   | 98.04

"No threshold" is the case when no cutting technique is applied to the concept vectors. In other words, this set of IDs exactly matches the set of all articles from the subgraphs of all base categories. This version is much smaller, but the precision gets lower as well. This precision drop (when no threshold has been applied yet) is due solely to the imperfect choice of base categories: they simply do not cover all concepts required for precise disambiguation. It is also obvious that all comparisons of heuristics results should be made against the "No threshold" results, not against the ground truth.

As one can see, BVG performs a bit better than VVG. The most accurate combinations of heuristics are BVG + percent threshold and BVG + SPI + percent threshold. Enabling SPI for VVG slightly increases the precision of disambiguation. The percent threshold is definitely better than the belonging degree threshold.

What is important here is that the size of the Texterra knowledge base (both on disk and in RAM) depends linearly on the number of concepts. Thus, the challenge is to find a compromise between the precision of disambiguation and the size (and contents) of the knowledge base.

The conducted experiment was just our first effort of this kind. The implemented algorithms allowed us to reduce the size of the Texterra knowledge base by more than an order of magnitude, while the precision of disambiguating musical metadata decreased from 98% to 64%. We believe that these results prove the applicability of the proposed approach for deriving a domain specific subset of Wikipedia. Certainly, there are still many things to improve.

8 Conclusion and future work

According to the results of this study, we outline the following:
- the Wikipedia category network may be utilized for deriving a domain specific subset of Wikipedia;
- using concept vectors seems to be an appropriate way to represent the belonging affiliations between Wikipedia pages;
- BVG performs a bit better than VVG;
- SPI often improves the results;
- the percent threshold showed the best results as a cutting technique.

Possible directions of future work include:
1. As noted by the authors of [18], the selection of the root node is vital to the quality of the domain specific corpus. Thus, it is reasonable to introduce some heuristics for automatically identifying the most appropriate base category given just a set of specific keywords. Moreover, there can be several base categories with either manually or automatically set relevance levels. For example, to perform the search for "Musicians of World War II" a user should provide 2 base categories as input: Category:Musicians with a relevance level of 0.9 and Category:World_War_II with a relevance level of 0.5;
2. Try other cutting techniques for concept vectors (e.g., attempt to detect the distribution of belonging degrees and utilize it);
3. Detect and remove meaningless pages from the WCG (e.g., pages from the administrative section of Wikipedia [4]);
4. Distinguishing between classes and instances among categories [9] may help to prune and/or reorganize the WCG;
5. Add a facility for storing the results of semantic relatedness computation to speed up further processing;
6. Develop an approximation algorithm for finding all paths between two arbitrary nodes in the WCG.

In this work, we have demonstrated the possible benefits of automated building of domain specific ontologies. Also, we have tested different heuristics while implementing a system for such processing. An original approach to storing the WCG in RAM has been proposed, along with a specific evaluation methodology. The described approach can be applied to any ontology with a well-defined polyhierarchical taxonomy (for instance, YAGO2). As it seems to us, weighting the existing semantic connections is always a challenging task when building any more or less large ontology. This may be helpful for any domain dependent Wikipedia-related research [34, 35].
References

[1] List of Wikipedias - Meta. http://meta.wikimedia.org/wiki/List_of_Wikipedias
[2] M. Shirakawa, K. Nakayama, T. Hara, S. Nishio. Concept Vector Extraction from Wikipedia Category Network. In Proceedings of the 3rd International Conference on Ubiquitous Information Management and Communication (ICUIMC 2009), pp. 71-79, 2009.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives. DBpedia: A nucleus for a web of open data. In ISWC, volume 4825 of LNCS, pages 722-735. Springer, 2007.
[4] Simone P. Ponzetto, Michael Strube. Deriving a large scale taxonomy from Wikipedia. In AAAI'07: Proceedings of the 22nd national conference on Artificial intelligence, pp. 1440-1445, 2007.
[5] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. In Elsevier Journal of Web Semantics, Vol. 6, No. 3, pp. 203-217, 2008.
[6] Unified Medical Language System (UMLS) - Home. http://www.nlm.nih.gov/research/umls/
[7] P. Buitelaar, P. Cimiano, B. Magnini (Eds.). Ontology Learning from Text: Methods, Evaluation and Applications. In Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.
[8] A. Gregorowicz, M. A. Kramer. Mining a Large-Scale Term-Concept Network from Wikipedia. Technical Report #06-1028, The MITRE Corp., Oct. 2006.
[9] Cäcilia Zirn, Vivi Nastase, Michael Strube. Distinguishing between instances and classes in the Wikipedia taxonomy. In Proc. of ESWC-08, pages 376-387, 2008.
[10] Gaoying Cui, Qin Lu, Wenjie Li, Yi-Rong Chen. Mining Concepts from Wikipedia for Ontology Construction. In Proceedings of Web Intelligence/IAT Workshops, pp. 287-290, 2009.
[11] J. Hoffart, F. Suchanek, K. Berberich, G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Research Report MPI-I-2010-5-007, Max-Planck-Institut für Informatik, November 2010.
[12] A. Budanitsky, G. Hirst. Evaluating WordNet-based measures of semantic distance. In Computational Linguistics, 32(1), pp. 13-47, March 2006.
[13] D. Turdakov, P. Velikhov. Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation. In Proc. of SYRCoDIS, 2008.
[14] T. Zesch, I. Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT), 2007.
[15] S. Chernov, T. Iofciu, W. Nejdl, X. Zhou. Extracting Semantic Relationships between Wikipedia Categories. In Proceedings of the First International Workshop on Semantic Wikis - From Wiki To Semantics, June 2006.
[16] M. Strube, S. P. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st national conference on Artificial intelligence (AAAI'06), pp. 1419-1424, 2006.
[17] Z. Syed, T. Finin, A. Joshi. Wikipedia as an Ontology for Describing Documents. In Proceedings of the Second International Conference on Weblogs and Social Media, 2008.
[18] G. Y. Cui, Q. Lu, W. J. Li, Y. R. Chen. Corpus Exploitation from Wikipedia for Ontology Construction. In LREC 2008, Marrakech, pp. 2125-2132, 2008.
[19] Wikipedia:Overcategorization - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Overcategorization
[20] Wikipedia:Categorization - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Categorization
[21] CatGraph. http://toolserver.org/~dapete/catgraph/
[22] Wikipedia:WikiProject Categories/uncategorized - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Categories/uncategorized
[23] Wikipedia:Database reports/Uncategorized categories - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Database_reports/Uncategorized_categories
[24] Category:Better category needed - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Category:Better_category_needed
[25] J. Soto. Wikipedia: A Quantitative Analysis. PhD thesis, 2009.
[26] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. Section 22.3: Depth-first search, pp. 540-549.
[27] L.-E. Thorelli. An algorithm for computing all paths in a graph. In BIT 6, pp. 347-349, 1966.
[28] M. Migliore, V. Martorana, F. Sciortino. An algorithm to find all paths between two nodes in a graph. In Journal of Computational Physics, v.87, n.1, pp. 231-236, March 1990.
[29] R. Simoes. APAC: An exact algorithm for retrieving cycles and paths in all kinds of graphs. In Tékhne, no.12, pp. 39-55, 2009.
[30] JUNG - Java Universal Network/Graph Framework. http://jung.sourceforge.net/
[31] JGraphT - a free Java graph library. http://www.jgrapht.org/
[32] neo4j open source nosql graph database. http://neo4j.org/
[33] WebGraph. http://webgraph.dsi.unimi.it/
[34] Wikipedia:Academic studies of Wikipedia - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Academic_studies_of_Wikipedia
[35] Academic studies about Wikipedia - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Academic_studies_about_Wikipedia#Natural_language_processing