Using Wikipedia Content to Derive an Ontology for Modeling and Recommending Web Pages across Systems

Pei-Chia Chang, Department of Information & Computer Science, 1680 East-West Road, Honolulu, HI 96822, USA, 1-808-2209701, pcchang@hawaii.edu
Luz M. Quiroga, Department of Information & Computer Science, 1680 East-West Road, Honolulu, HI 96822, USA, 1-808-9569988, lquiroga@hawaii.edu

ABSTRACT
In this work, we are building a cross-system recommender at the client side that uses Wikipedia's content to derive an ontology for content and user modeling. We speculate that the collaborative content of Wikipedia may cover many of the topical areas that people are generally interested in, and that its vocabulary may be closer to that of general public users and updated sooner. Using the Wikipedia-derived ontology as a shared platform for modeling web pages also addresses the issue of cross-system recommendation, which generally requires a unified protocol or a mediator. Preliminary tests of our system suggest that the derived ontology is a fair content model that maps an unknown web page to its related topical categories. Once page topics can be identified, user models are formulated by analyzing usage pages. Eventually, we will formally evaluate the topicality-based user model.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval -- Information filtering; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval -- Clustering

General Terms
User Modeling, Wikipedia, Management, Measurement, Documentation, Design, Experimentation.

Keywords
Recommender, Agent, User Modeling, Ontology.

1. INTRODUCTION
User modeling through content is one common solution for recommending web pages across systems [3,7,9,14]. In this work, we are interested in using the collaborative content of Wikipedia to derive an ontology as a unified knowledge base for modeling web pages. Wikipedia is one of the world's largest collaborative knowledge bases. Although only a few contributors (less than 10% of the user population) produce the content of Wikipedia [8], it has a huge pool of readers. As Sussan describes, "with Web 2.0 products, it is the user's engagement with the website that literally drives it." [13] Similarly, we speculate that Wikipedia's content and vocabulary may cover recent and popular topical areas that people are generally interested in. The language in Wikipedia may be closer to what the general public uses, rather than being controlled by domain experts. We emphasize that the topics of Wikipedia, though not necessarily their content accuracy, may reflect the dynamic information on the Internet.

Our recommender formulates a user model based on browsing behavior at the client side and on the usage pages mapped to the derived ontology. Given the research potential of Wikipedia's content, we are interested in the performance of recommending web pages based on the Wikipedia-derived ontology. Our research question is: "Does a recommender based on the Wikipedia content model provide topically relevant recommendations?"

2. Related Work
Content-based recommenders include WebWatcher [6], Syskill & Webert [10], WebMate [5], and ifWeb [2]. WebWatcher and WebMate adopt TF-IDF, the vector space model, and similarity clustering. Syskill & Webert relies on feature extraction, particularly expected information gain [11], which relies on the co-occurrence of related keywords, and on relevance feedback. The system formulates a profile vector that consists of keywords from pages with positive ratings and against pages with negative ratings. A Bayesian classifier is then employed to determine a page's topics and its similarity with the profile vector. ifWeb employs a semantic network and a consistency-based user modeling shell [4]. In general, these four systems apply statistical approaches, such as TF-IDF or expected information gain for keyword extraction and a cluster or classifier for similarity identification. Our work borrows Wikipedia's categorization system and augments it with keywords identified by predefined heuristics as topical indices. A full listing of existing categories and indexes in Wikipedia can be found at http://en.wikipedia.org/wiki/Portal:Contents/Categorical_index. In our study, page classification depends on the frequency with which those indexing keywords appear in a web page. Our difference from the previous systems is the use of Wikipedia's collaborative categorization system to derive an ontology that is augmented with heuristic information extraction from Wikipedia's content.
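As an illustration of the statistical keyword-vector approach taken by the systems above, the following Python sketch builds simple TF-IDF vectors and compares pages by cosine similarity. It is a generic, minimal example for exposition only; it is not the implementation of WebWatcher, WebMate, or any of the cited systems, and all names in it are our own.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    shared = set(u) & set(v)
    num = sum(u[t] * v[t] for t in shared)
    den = (math.sqrt(sum(w * w for w in u.values())) *
           math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0

# Toy usage: treat page 0 as a positively rated page and compare the others to it.
docs = [["wiki", "ontology", "user", "model"],
        ["ontology", "category", "wiki", "user"],
        ["game", "graphics", "engine"]]
vecs = tfidf_vectors(docs)
profile = vecs[0]
print(cosine(profile, vecs[1]), cosine(profile, vecs[2]))
```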
3. Method Description
3.1 System Architecture
Our recommender uses Wikipedia's content to derive an ontology for content and user modeling. With the ontology, the system automatically assigns to a new page the Wikipedia categories whose threshold values the page passes, which formulates a "categorical" vector space model for the page. The system also captures user interests in the user model through the categories. Pages whose topics are similar to the profile will be recommended. Figure 1 depicts the system's architecture, which is explained in the following paragraphs.

[Figure 1. System Architecture]

There are four major components in the system: the crawler, the Wikipedia knowledge base (WikiBase), the sensor, and the matcher. Beginning with the top part of the figure, the crawler fetches the pages hyperlinked from the usage pages and also queries search engines based on the user model managed by the sensor. Using the sensor, the crawler generates a corresponding content model for each newly fetched page.

Every component in the system uses WikiBase, which stores the ontology, keywords, content models, and the user model. We construct WikiBase by combining Wikipedia's categorization system with heuristic information extraction of keywords. Heuristics include page titles, categorical labels, anchor texts, italic and bold terms, and TF-IDF terms. In order to associate keywords with categories, we extract heuristic keywords from pages labeled with a given category by Wikipedia's editors. Therefore, each category has a list of keywords to be utilized by the sensor.

The sensor manages the user model and maps usage pages into content models. It calculates a page's topical relevance and formulates the corresponding content model according to WikiBase's keyword weights and the word frequencies of the web page. The user model is updated, on a frequency basis, by the sensor whenever it maps a usage page into a content model. Therefore, the user model constantly evolves: if a user accesses a specific categorical topic multiple times or through multiple pages, the user scores higher in the corresponding category of the user model. The keyword weighting and sensing formulas are defined below.

Definitions:
|Kj| is the number of keywords extracted for heuristic type j.
|Kc| is the number of keywords in category c.
|Categories| is the number of categories in the knowledge base.
freq(Kij) is the frequency of keyword Kij for heuristic type j.
max(K1, ..., Km) is the maximum value among the m elements.

The weight of keyword Kij among m heuristics is:

    W(Kij) = Σ_{j=1..m} aj * [ freq(Kij) / max( freq(K1j), ..., freq(K|Kj|j) ) ],

where 1 ≤ i ≤ |Kj|, 1 ≤ j ≤ m, and aj is a weighting coefficient assigned to heuristic j.

A page's relevance score Rc for a category c is:

    Rc = Σ_{i=1..|Kc|} d * W(Kic) * α^freq(Kic),

where 1 ≤ c ≤ |Categories|, 0.5 < α < 1, d = 1 for a full match, and 0 < d < 0.5 for a partial match.
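The following Python sketch makes the sensor's scoring step concrete: given a page's tokens and a category's keyword weights from WikiBase, it computes Rc and assigns the categories whose scores pass the threshold. The values of alpha, d, and the threshold, the toy WikiBase, and the crude partial-match test are all illustrative assumptions, not the system's actual parameters.

```python
from collections import Counter

ALPHA = 0.8          # 0.5 < alpha < 1 (illustrative value)
D_FULL = 1.0         # d = 1 for a full keyword match
D_PARTIAL = 0.3      # 0 < d < 0.5 for a partial match (illustrative value)
THRESHOLD = 1.0      # relevance threshold for assigning a category (illustrative)

def relevance_score(page_tokens, category_keywords):
    """R_c = sum_i d * W(K_ic) * alpha ** freq(K_ic) over the category's keywords.

    page_tokens        -- list of tokens from the fetched web page
    category_keywords  -- dict mapping keyword -> weight W(K_ic) stored in WikiBase
    """
    freq = Counter(page_tokens)
    page_text = " ".join(page_tokens)
    score = 0.0
    for keyword, weight in category_keywords.items():
        if freq[keyword] > 0:                       # full match on a token
            score += D_FULL * weight * ALPHA ** freq[keyword]
        elif keyword in page_text:                  # crude stand-in for a partial match
            score += D_PARTIAL * weight * ALPHA
    return score

def sense_page(page_tokens, wikibase):
    """Return the categories whose relevance score passes the threshold."""
    scores = {c: relevance_score(page_tokens, kws) for c, kws in wikibase.items()}
    return {c: s for c, s in scores.items() if s >= THRESHOLD}

# Toy usage with a two-category WikiBase.
wikibase = {
    "Algorithms": {"sorting": 0.9, "complexity": 0.7, "graph": 0.5},
    "Usability": {"interface": 0.8, "usability": 1.0},
}
page = "sorting algorithms and graph search complexity".split()
print(sense_page(page, wikibase))
```

The surviving categories and their scores form the page's "categorical" vector, which is what the matcher later compares against the user model.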
As for the matcher, it compares the cosine similarity of the crawler-retrieved pages with the user model and then generates recommendations. In addition to cosine similarity, the matcher also relies on the ontological structure of WikiBase. With this structure, topical associations among web pages can be revealed, and it also helps to identify whether a user is interested in particular domains. We define two indices, diversity and specificity, to represent the coverage of user interests. The procedure is as follows.

First, construct a minimal spanning tree that traverses all the identified categories in the user model. Identified categories are those categories with a relevance score over a predefined threshold. In order to connect identified categories together, connecting nodes, such as parents or neighbors of the identified categories, may be added to the tree. The two indices are defined as follows:

Diversity index: count the number of edges of the minimal spanning tree and normalize it by dividing by the number of identified nodes (excluding connecting nodes) in the spanning tree.

Specificity index: sum the minimal distances from the root to each identified category and normalize the total by dividing by the number of identified categories (excluding connecting nodes).
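The sketch below shows one way the two indices could be computed, assuming the category ontology is a rooted tree given as child-to-parent links. The toy tree, the set of identified categories, and the use of a connecting subtree as the stand-in for the minimal spanning tree are our assumptions for illustration; the paper does not prescribe a particular data structure.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def connecting_subtree(identified, parent):
    """Minimal connected subtree spanning the identified categories.

    In a rooted tree, the union of each category's root path, cut at the
    deepest common ancestor of the whole set, is that minimal subtree
    (our stand-in for the paper's minimal spanning tree).
    """
    paths = [path_to_root(c, parent) for c in identified]
    shared = set(paths[0])
    for p in paths[1:]:
        shared &= set(p)
    dca = next(n for n in paths[0] if n in shared)   # deepest common ancestor
    nodes = set()
    for p in paths:
        nodes.update(p[:p.index(dca) + 1])           # keep each path only up to the DCA
    return nodes

def diversity_and_specificity(identified, parent):
    nodes = connecting_subtree(identified, parent)
    edges = len(nodes) - 1                           # a tree has |nodes| - 1 edges
    diversity = edges / len(identified)              # normalize by identified categories
    depth = {c: len(path_to_root(c, parent)) - 1 for c in identified}
    specificity = sum(depth.values()) / len(identified)
    return diversity, specificity

# Toy category tree: Computer science -> {Algorithms -> Genetic algorithms, HCI -> Usability}
parent = {
    "Algorithms": "Computer science",
    "Genetic algorithms": "Algorithms",
    "HCI": "Computer science",
    "Usability": "HCI",
}
identified = ["Genetic algorithms", "Usability"]
print(diversity_and_specificity(identified, parent))
```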
3.2 Evaluation Method
We plan to recruit a few participants (fewer than 10) in the computer science domain, the domain from which we derive WikiBase. Each participant will rate a collection of web pages (more than 300) for topical relevance and novelty. Participants first provide some web pages (more than 30) of their own interest, which serve as the usage source for formulating their user models; afterward, they rate the collection. The ratings will be divided into a training set and a validation set, and our system will tune the keyword weights based on the training data. We will compare our system's performance with that of the SMART system [12], which also utilizes the vector space model.

4. Current Status & Discussion
4.1 Current Status
We have built WikiBase in the computer science domain, listed at http://www2.hawaii.edu/~pcchang/ICS699/results.html. We selected this domain due to its rich data. Two preliminary tests were conducted with two computer science professionals. In the first one, we tested the following pages for their topical relevance:

http://www.algosort.com/ (A)
http://tc.eserver.org (B)

Considering only the top two rankings, page A is sensed as belonging to the "algorithms" and "genetic algorithms" categories, and page B as belonging to the "human-computer interaction" and "usability" categories. In the evaluation of the classification result, both participants' rankings were the same as the system's ranking, considering only the top two.

In the second test, we selected fourteen pages, listed in the appendix, from four topical areas: algorithms, data mining, human-computer interaction (HCI), and computer games. Both participants evaluated at least five categorical keywords for each page, providing a degree of agreement from 1 (disagree) to 5 (agree) with the statement "The given phrase is a topical keyword of the page," where the given phrase is a categorical label generated by the sensor for that page. Table 1 summarizes the ratings.

Table 1. Evaluation of Categorical Keywords
                        Participant 1   Participant 2
Algorithm (3 pages)          3.88            2.83
Data Mining (4 pages)        3.95            3.40
HCI (4 pages)                4.22            4.27
Games (3 pages)              2.67            2.20
Average                      3.67            3.20

From these results, the ranking of both participants' average scores is: HCI, Data Mining, Algorithms, and Games. Except for the games topic, the agreement score is around 4 for participant 1 and 3.5 for participant 2. We suspect that our system performs worse in the games category because of the wide coverage of computer games. Another reason may be the nature of the computer science category: it reflects the common scientific techniques and theory for producing computer games, which differs from the tested pages, which view computer games from a player's perspective.

We are still in the process of tuning the keyword weights by utilizing the computer science pages from the Open Directory Project (ODP) [1]. Pages in ODP are manually categorized by its users, and we use that classification to evaluate our sensor. As for evaluating the recommendations, we are training the matcher with pages of a different topical coverage. Eventually, we will apply the evaluation method described earlier.
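The paper does not describe the tuning procedure itself, so the sketch below shows only one plausible approach: a grid search over the heuristic coefficients aj that maximizes agreement between the sensor's top-scoring category and the ODP label. The data layout, the grid, the accuracy criterion, and all names here are our assumptions, given purely to make the tuning step concrete.

```python
import itertools
from collections import Counter

ALPHA = 0.8   # illustrative sensing parameter, as in the earlier sketch

def category_score(tokens, keyword_freqs, coeffs):
    """Score one category: W(K) = sum_j a_j * normalized freq, folded into the R_c sum."""
    freq = Counter(tokens)
    score = 0.0
    for kw, per_heuristic in keyword_freqs.items():
        w = sum(a * per_heuristic.get(h, 0.0) for h, a in coeffs.items())
        if freq[kw]:
            score += w * ALPHA ** freq[kw]
    return score

def accuracy(pages, wikibase, coeffs):
    """Fraction of ODP pages whose top-scoring category matches the ODP label."""
    hits = 0
    for tokens, odp_label in pages:
        best = max(wikibase, key=lambda c: category_score(tokens, wikibase[c], coeffs))
        hits += (best == odp_label)
    return hits / len(pages)

def tune(pages, wikibase, heuristics, grid=(0.0, 0.5, 1.0)):
    """Grid-search the coefficients a_j to maximize agreement with the ODP labels."""
    best_coeffs, best_acc = None, -1.0
    for values in itertools.product(grid, repeat=len(heuristics)):
        coeffs = dict(zip(heuristics, values))
        acc = accuracy(pages, wikibase, coeffs)
        if acc > best_acc:
            best_coeffs, best_acc = coeffs, acc
    return best_coeffs, best_acc

# Toy data: per-category keywords with normalized frequencies per heuristic.
wikibase = {
    "Algorithms": {"sorting": {"title": 1.0, "tfidf": 0.6}},
    "Usability":  {"interface": {"title": 0.9, "tfidf": 0.8}},
}
pages = [("quick sorting routines".split(), "Algorithms"),
         ("interface design tips".split(), "Usability")]
print(tune(pages, wikibase, heuristics=["title", "tfidf"]))
```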
4.2 Discussion
Using the Wikipedia categories as an ontological model yields a simple user profile, and this modeling approach benefits cross-system recommendation significantly. Our recommendation engine works at the client side, which eases the privacy concern of disclosing sensitive information to web servers. Combining categories with heuristic information extraction leaves room for the selection of heuristics: different domains or user groups can apply the heuristics of interest to them. Given the above advantages, we look forward to the results of our evaluation.

5. Future Work
The Wikipedia content and categorization system play an important role in our method of generating recommendations. Our work emphasizes the framework for automating the ontology generation and its performance in recommendations. Nevertheless, the quality of Wikipedia content is controversial. It will be worthwhile to apply the same framework to another Wikipedia-like platform with a different user group, such as domain experts, to ensure content quality.

Another interesting area is to relate content statistics, such as the volume or granularity of the categories, to recommendation performance. Not every domain in Wikipedia contains categories and articles as rich as those of computer science; therefore, recommendation performance may be related to some of these statistics.

6. Appendix
Due to limited space, the categorical keywords are displayed only for the first page of each selected topic.

Algorithms
http://www.algosort.com/ (Algorithms, Genetic algorithms, Root-finding algorithms, Networking algorithms, Disk scheduling algorithms)
http://www.oopweb.com/Algorithms/Files/Algorithms.html
http://cgm.cs.mcgill.ca/~godfried/teaching/algorithms-web.html

Data Mining
http://www.data-mining-guide.net/ (Databases, Algorithms, Knowledge representation, Natural language processing, Knowledge discovery in databases, Machine learning, Data mining)
http://www.thearling.com/
http://databases.about.com/od/datamining/Data_Mining_and_Data_Warehousing.htm
http://www.ccsu.edu/datamining/resources.html

HCI
http://www.pcd-innovations.com/ (Human-computer interaction, Human-computer interaction researchers, Usability, Computer science organizations, Artificial intelligence, Software development)
http://www.nathan.com/
http://nooface.net/
http://www.hcibib.org/

Games
http://www.robinlionheart.com/gamedev/genres.xhtml (Image processing, Computer programming, Demo effects, Regression analysis, Computer graphics)
http://open-site.org/Games/Video_Games/
http://www.literature-study-online.com/essays/alice_video.html

7. REFERENCES
[1] Open Directory Project (ODP). http://www.dmoz.org/
[2] Asnicar, F., & Tasso, C. 1997. ifWeb: A Prototype of User Model-Based Intelligent Agent for Document Filtering and Navigation in the World Wide Web. In Proceedings of the 6th International Conference on User Modeling.
[3] Billsus, D., & Pazzani, M. 1999. A Personal News Agent that Talks, Learns and Explains. In Proceedings of the 3rd Ann. Conf. on Autonomous Agents.
[4] Brajnik, G., & Tasso, C. 1994. A Shell for Developing Non-Monotonic User Modeling Systems. Human-Computer Studies, 40, 31-62.
[5] Chen, L., & Sycara, K. 1998. WebMate: A Personal Agent for Browsing and Searching. In Proceedings of the Second International Conference on Autonomous Agents, Minneapolis, Minnesota, United States.
[6] Joachims, T., Freitag, D., & Mitchell, T. 1997. WebWatcher: A Tour Guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence.
[7] Mooney, R. J., & Roy, L. 2000. Content-Based Book Recommending Using Learning for Text Categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, Texas, United States.
[8] Ortega, F., Gonzalez-Barahona, J. M., & Robles, G. 2008. On the Inequality of Contributions to Wikipedia. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences.
[9] Pazzani, M., & Billsus, D. 1997. Learning and Revising User Profiles: The Identification of Interesting Web Sites. Machine Learning, 27, 313-331.
[10] Pazzani, M., Muramatsu, J., & Billsus, D. 1996. Syskill & Webert: Identifying Interesting Web Sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, Oregon, United States.
[11] Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning, 1(1), 81-106.
[12] Salton, G., & Lesk, M. E. 1965. The SMART Automatic Document Retrieval System -- An Illustration. Commun. ACM, 8(6), 391-398.
[13] Sussan, G. 2007. Web 2.0: The Academic Library and the Net Gen Student (p. 35). ALA Editions.
[14] Zhang, Y., Callan, J., & Minka, T. 2002. Novelty and Redundancy Detection in Adaptive Filtering. In Proceedings of the 25th Ann. Int'l ACM SIGIR Conf.