Using Wikipedia Content to Derive an Ontology for Modeling and Recommending Web Pages across Systems

Pei-Chia Chang, Department of Information & Computer Science, 1680 East-West Road, Honolulu, HI 96822, USA, 1-808-2209701, pcchang@hawaii.edu
Luz M. Quiroga, Department of Information & Computer Science, 1680 East-West Road, Honolulu, HI 96822, USA, 1-808-9569988, lquiroga@hawaii.edu

ABSTRACT
In this work, we are building a cross-system recommender at the client side that uses Wikipedia's content to derive an ontology for content and user modeling. We speculate that the collaborative content of Wikipedia may cover many of the topical areas that people are generally interested in, and that its vocabulary may be closer to that of general public users and updated sooner. Using the Wikipedia-derived ontology as a shared platform for modeling web pages also addresses the issue of cross-system recommendation, which generally requires a unified protocol or a mediator. Preliminary tests of our system suggest that the derived ontology is a fair content model that maps an unknown web page to its related topical categories. Once page topics can be identified, user models are formulated by analyzing usage pages. Eventually, we will formally evaluate the topicality-based user model.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval -- Information filtering; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval -- Clustering

General Terms
User Modeling, Wikipedia, Management, Measurement, Documentation, Design, Experimentation.

Keywords
Recommender, Agent, User Modeling, Ontology.

1. INTRODUCTION
User modeling through content is one common solution for recommending web pages across systems [3,7,9,14]. In this work, we are interested in using the collaborative content of Wikipedia to derive an ontology as a unified knowledge base for modeling web pages. Wikipedia is one of the world's largest collaborative knowledge bases. Although only a few contributors (less than 10% of the user population) produce the content of Wikipedia [8], it has a huge pool of readers. As Sussan describes, "with Web 2.0 products, it is the user's engagement with the website that literally drives it." [13] Similarly, we speculate that Wikipedia's content and vocabulary may cover recent and popular topical areas that people are generally interested in. The language in Wikipedia may be closer to what the general public uses, rather than being controlled by domain experts. We emphasize that the topics of Wikipedia, though not necessarily their content accuracy, may reflect the dynamic information on the Internet.

Our recommender formulates a user model based on browsing behavior at the client side and on the usage pages mapped to the derived ontology. Given the research potential of Wikipedia's content, we are interested in the performance of recommending web pages based on the Wikipedia-derived ontology. Our research question is: "Does a recommender based on the Wikipedia content model provide topically relevant recommendations?"

2. Related Work
Content-based recommenders include WebWatcher [6], Syskill & Webert [10], WebMate [5], and ifWeb [2]. WebWatcher and WebMate adopt TF-IDF, the vector space model, and similarity clustering. Syskill & Webert relies on feature extraction, particularly expected information gain [11], which relies on the co-occurrence of related keywords, and on relevance feedback. The system formulates a profile vector that consists of keywords from pages with positive ratings and against pages with negative ratings. A Bayesian classifier is then employed to determine a page's topics and its similarity with the profile vector. ifWeb employs a semantic network and a consistency-based user modeling shell [4]. In general, these four systems apply statistical approaches, such as TF-IDF or expected information gain for keyword extraction and a cluster or classifier for similarity identification. Our work borrows Wikipedia's categorization system and augments it with keywords identified by predefined heuristics as topical indices. A full listing of existing categories and indexes in Wikipedia can be found at http://en.wikipedia.org/wiki/Portal:Contents/Categorical_index. In our study, page classification depends on the frequency with which those indexing keywords appear in a web page. Our difference from the previous systems is the use of Wikipedia's collaborative categorization system to derive an ontology that is augmented with heuristic information extraction from Wikipedia's content.
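As an illustration of the statistical keyword-vector approach taken by the systems above, the following Python sketch builds simple TF-IDF vectors and compares pages by cosine similarity. It is a generic, minimal example for exposition only; it is not the implementation of WebWatcher, WebMate, or any of the cited systems, and all names in it are our own.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    shared = set(u) & set(v)
    num = sum(u[t] * v[t] for t in shared)
    den = (math.sqrt(sum(w * w for w in u.values())) *
           math.sqrt(sum(w * w for w in v.values())))
    return num / den if den else 0.0

# Toy usage: treat page 0 as a positively rated page and compare the others to it.
docs = [["wiki", "ontology", "user", "model"],
        ["ontology", "category", "wiki", "user"],
        ["game", "graphics", "engine"]]
vecs = tfidf_vectors(docs)
profile = vecs[0]
print(cosine(profile, vecs[1]), cosine(profile, vecs[2]))
```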
3. Method Description
3.1 System Architecture
Our recommender uses Wikipedia's content to derive an ontology for content and user modeling. With the ontology, the system automatically assigns to a new page the Wikipedia categories whose threshold values the page passes, which formulates a "categorical" vector space model for the page. The system also captures user interests in the user model through the categories. Pages whose topics are similar to the profile will be recommended. Figure 1 depicts the system's architecture, which is explained in the following paragraphs.

[Figure 1. System Architecture]

There are four major components in the system: the crawler, the Wikipedia knowledge base (WikiBase), the sensor, and the matcher. Beginning with the top part of the figure, the crawler fetches the pages hyperlinked from the usage pages and also queries search engines based on the user model managed by the sensor. Using the sensor, the crawler generates a corresponding content model for each newly fetched page.

Every component in the system uses WikiBase, which stores the ontology, keywords, content models, and the user model. We construct WikiBase by combining Wikipedia's categorization system with heuristic information extraction of keywords. Heuristics include page titles, categorical labels, anchor texts, italic and bold terms, and TF-IDF terms. In order to associate keywords with categories, we extract heuristic keywords from pages labeled with a given category by Wikipedia's editors. Therefore, each category has a list of keywords to be utilized by the sensor.

The sensor manages the user model and maps usage pages into content models. It calculates a page's topical relevance and formulates the corresponding content model according to WikiBase's keyword weights and the word frequencies of the web page. The user model is updated, on a frequency basis, by the sensor whenever it maps a usage page into a content model. Therefore, the user model constantly evolves: if a user accesses a specific categorical topic multiple times or through multiple pages, the user scores higher in the corresponding category of the user model. The keyword weighting and sensing formulas are defined below.

Definitions:
|Kj| is the number of keywords extracted for heuristic type j.
|Kc| is the number of keywords in category c.
|Categories| is the number of categories in the knowledge base.
freq(Kij) is the frequency of keyword Kij for heuristic type j.
max(K1, ..., Km) is the maximum value among the m elements.

The weight of keyword Kij among m heuristics is:

    W(Kij) = Σ_{j=1..m} aj * [ freq(Kij) / max( freq(K1j), ..., freq(K|Kj|j) ) ],

where 1 ≤ i ≤ |Kj|, 1 ≤ j ≤ m, and aj is a weighting coefficient assigned to heuristic j.

A page's relevance score Rc for a category c is:

    Rc = Σ_{i=1..|Kc|} d * W(Kic) * α^freq(Kic),

where 1 ≤ c ≤ |Categories|, 0.5 < α < 1, d = 1 for a full match, and 0 < d < 0.5 for a partial match.
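The following Python sketch makes the sensor's scoring step concrete: given a page's tokens and a category's keyword weights from WikiBase, it computes Rc and assigns the categories whose scores pass the threshold. The values of alpha, d, and the threshold, the toy WikiBase, and the crude partial-match test are all illustrative assumptions, not the system's actual parameters.

```python
from collections import Counter

ALPHA = 0.8          # 0.5 < alpha < 1 (illustrative value)
D_FULL = 1.0         # d = 1 for a full keyword match
D_PARTIAL = 0.3      # 0 < d < 0.5 for a partial match (illustrative value)
THRESHOLD = 1.0      # relevance threshold for assigning a category (illustrative)

def relevance_score(page_tokens, category_keywords):
    """R_c = sum_i d * W(K_ic) * alpha ** freq(K_ic) over the category's keywords.

    page_tokens        -- list of tokens from the fetched web page
    category_keywords  -- dict mapping keyword -> weight W(K_ic) stored in WikiBase
    """
    freq = Counter(page_tokens)
    page_text = " ".join(page_tokens)
    score = 0.0
    for keyword, weight in category_keywords.items():
        if freq[keyword] > 0:                       # full match on a token
            score += D_FULL * weight * ALPHA ** freq[keyword]
        elif keyword in page_text:                  # crude stand-in for a partial match
            score += D_PARTIAL * weight * ALPHA
    return score

def sense_page(page_tokens, wikibase):
    """Return the categories whose relevance score passes the threshold."""
    scores = {c: relevance_score(page_tokens, kws) for c, kws in wikibase.items()}
    return {c: s for c, s in scores.items() if s >= THRESHOLD}

# Toy usage with a two-category WikiBase.
wikibase = {
    "Algorithms": {"sorting": 0.9, "complexity": 0.7, "graph": 0.5},
    "Usability": {"interface": 0.8, "usability": 1.0},
}
page = "sorting algorithms and graph search complexity".split()
print(sense_page(page, wikibase))
```

The surviving categories and their scores form the page's "categorical" vector, which is what the matcher later compares against the user model.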
As for the matcher, it compares the cosine similarity of the crawler-retrieved pages with the user model and then generates recommendations. In addition to cosine similarity, the matcher also relies on the ontological structure of WikiBase. With this structure, topical associations among web pages can be revealed, and it also helps to identify whether a user is interested in particular domains. We define two indices, diversity and specificity, to represent the coverage of user interests. The procedure is as follows.

First, construct a minimal spanning tree that traverses all the identified categories in the user model. Identified categories are those categories with a relevance score over a predefined threshold. In order to connect identified categories together, connecting nodes, such as parents or neighbors of the identified categories, may be added to the tree. The two indices are defined as follows:

Diversity index: count the number of edges of the minimal spanning tree and normalize it by dividing by the number of identified nodes (excluding connecting nodes) in the spanning tree.

Specificity index: sum the minimal distances from the root to each identified category and normalize the total by dividing by the number of identified categories (excluding connecting nodes).
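The sketch below shows one way the two indices could be computed, assuming the category ontology is a rooted tree given as child-to-parent links. The toy tree, the set of identified categories, and the use of a connecting subtree as the stand-in for the minimal spanning tree are our assumptions for illustration; the paper does not prescribe a particular data structure.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def connecting_subtree(identified, parent):
    """Minimal connected subtree spanning the identified categories.

    In a rooted tree, the union of each category's root path, cut at the
    deepest common ancestor of the whole set, is that minimal subtree
    (our stand-in for the paper's minimal spanning tree).
    """
    paths = [path_to_root(c, parent) for c in identified]
    shared = set(paths[0])
    for p in paths[1:]:
        shared &= set(p)
    dca = next(n for n in paths[0] if n in shared)   # deepest common ancestor
    nodes = set()
    for p in paths:
        nodes.update(p[:p.index(dca) + 1])           # keep each path only up to the DCA
    return nodes

def diversity_and_specificity(identified, parent):
    nodes = connecting_subtree(identified, parent)
    edges = len(nodes) - 1                           # a tree has |nodes| - 1 edges
    diversity = edges / len(identified)              # normalize by identified categories
    depth = {c: len(path_to_root(c, parent)) - 1 for c in identified}
    specificity = sum(depth.values()) / len(identified)
    return diversity, specificity

# Toy category tree: Computer science -> {Algorithms -> Genetic algorithms, HCI -> Usability}
parent = {
    "Algorithms": "Computer science",
    "Genetic algorithms": "Algorithms",
    "HCI": "Computer science",
    "Usability": "HCI",
}
identified = ["Genetic algorithms", "Usability"]
print(diversity_and_specificity(identified, parent))
```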
3.2 Evaluation Method
We plan to recruit a few participants (fewer than 10) in the computer science domain, the domain from which we derive WikiBase. Each participant will rate a collection of web pages (more than 300) for topical relevance and novelty. Participants first provide some web pages (more than 30) of their own interest, which serve as the usage source for formulating their user models; afterward, they rate the collection. The ratings will be divided into a training set and a validation set, and our system will tune the keyword weights based on the training data. We will compare our system's performance with that of the SMART system [12], which also utilizes the vector space model.

4. Current Status & Discussion
4.1 Current Status
We have built WikiBase in the computer science domain, listed at http://www2.hawaii.edu/~pcchang/ICS699/results.html. We selected this domain due to its rich data. Two preliminary tests were conducted with two computer science professionals. In the first one, we tested the following pages for their topical relevance:

http://www.algosort.com/ (A)
http://tc.eserver.org (B)

Considering only the top two rankings, page A is sensed as belonging to the "algorithms" and "genetic algorithms" categories, and page B as belonging to the "human-computer interaction" and "usability" categories. In the evaluation of the classification result, both participants' rankings were the same as the system's ranking, considering only the top two.

In the second test, we selected fourteen pages, listed in the appendix, from four topical areas: algorithms, data mining, human-computer interaction (HCI), and computer games. Both participants evaluated at least five categorical keywords for each page, providing a degree of agreement from 1 (disagree) to 5 (agree) with the statement "The given phrase is a topical keyword of the page," where the given phrase is a categorical label generated by the sensor for that page. Table 1 summarizes the ratings.

Table 1. Evaluation of Categorical Keywords
                        Participant 1   Participant 2
Algorithm (3 pages)          3.88            2.83
Data Mining (4 pages)        3.95            3.40
HCI (4 pages)                4.22            4.27
Games (3 pages)              2.67            2.20
Average                      3.67            3.20

From these results, the ranking of both participants' average scores is: HCI, Data Mining, Algorithms, and Games. Except for the games topic, the agreement score is around 4 for participant 1 and 3.5 for participant 2. We suspect that our system performs worse in the games category because of the wide coverage of computer games. Another reason may be the nature of the computer science category: it reflects the common scientific techniques and theory for producing computer games, which differs from the tested pages, which view computer games from a player's perspective.

We are still in the process of tuning the keyword weights by utilizing the computer science pages from the Open Directory Project (ODP) [1]. Pages in ODP are manually categorized by its users, and we use that classification to evaluate our sensor. As for evaluating the recommendations, we are training the matcher with pages of a different topical coverage. Eventually, we will apply the evaluation method described earlier.
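The paper does not describe the tuning procedure itself, so the sketch below shows only one plausible approach: a grid search over the heuristic coefficients aj that maximizes agreement between the sensor's top-scoring category and the ODP label. The data layout, the grid, the accuracy criterion, and all names here are our assumptions, given purely to make the tuning step concrete.

```python
import itertools
from collections import Counter

ALPHA = 0.8   # illustrative sensing parameter, as in the earlier sketch

def category_score(tokens, keyword_freqs, coeffs):
    """Score one category: W(K) = sum_j a_j * normalized freq, folded into the R_c sum."""
    freq = Counter(tokens)
    score = 0.0
    for kw, per_heuristic in keyword_freqs.items():
        w = sum(a * per_heuristic.get(h, 0.0) for h, a in coeffs.items())
        if freq[kw]:
            score += w * ALPHA ** freq[kw]
    return score

def accuracy(pages, wikibase, coeffs):
    """Fraction of ODP pages whose top-scoring category matches the ODP label."""
    hits = 0
    for tokens, odp_label in pages:
        best = max(wikibase, key=lambda c: category_score(tokens, wikibase[c], coeffs))
        hits += (best == odp_label)
    return hits / len(pages)

def tune(pages, wikibase, heuristics, grid=(0.0, 0.5, 1.0)):
    """Grid-search the coefficients a_j to maximize agreement with the ODP labels."""
    best_coeffs, best_acc = None, -1.0
    for values in itertools.product(grid, repeat=len(heuristics)):
        coeffs = dict(zip(heuristics, values))
        acc = accuracy(pages, wikibase, coeffs)
        if acc > best_acc:
            best_coeffs, best_acc = coeffs, acc
    return best_coeffs, best_acc

# Toy data: per-category keywords with normalized frequencies per heuristic.
wikibase = {
    "Algorithms": {"sorting": {"title": 1.0, "tfidf": 0.6}},
    "Usability":  {"interface": {"title": 0.9, "tfidf": 0.8}},
}
pages = [("quick sorting routines".split(), "Algorithms"),
         ("interface design tips".split(), "Usability")]
print(tune(pages, wikibase, heuristics=["title", "tfidf"]))
```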
4.2 Discussion
Using the Wikipedia categories as an ontological model yields a simple user profile, and this modeling approach benefits cross-system recommendation significantly. Our recommendation engine works at the client side, which eases the privacy concern of disclosing sensitive information to web servers. Combining categories with heuristic information extraction leaves room for the selection of heuristics: different domains or user groups can apply the heuristics of interest to them. Given the above advantages, we look forward to the results of our evaluation.

5. Future Work
The Wikipedia content and categorization system play an important role in our method of generating recommendations. Our work emphasizes the framework for automating the ontology generation and its performance in recommendations. Nevertheless, the quality of Wikipedia content is controversial. It will be worthwhile to apply the same framework to another Wikipedia-like platform with a different user group, such as domain experts, to ensure content quality.

Another interesting area is to relate content statistics, such as the volume or granularity of the categories, to recommendation performance. Not every domain in Wikipedia contains categories and articles as rich as those of computer science; therefore, recommendation performance may be related to some of these statistics.

6. Appendix
Due to limited space, the categorical keywords are displayed only for the first page of each selected topic.

Algorithms
http://www.algosort.com/ (Algorithms, Genetic algorithms, Root-finding algorithms, Networking algorithms, Disk scheduling algorithms)
http://www.oopweb.com/Algorithms/Files/Algorithms.html
http://cgm.cs.mcgill.ca/~godfried/teaching/algorithms-web.html

Data Mining
http://www.data-mining-guide.net/ (Databases, Algorithms, Knowledge representation, Natural language processing, Knowledge discovery in databases, Machine learning, Data mining)
http://www.thearling.com/
http://databases.about.com/od/datamining/Data_Mining_and_Data_Warehousing.htm
http://www.ccsu.edu/datamining/resources.html

HCI
http://www.pcd-innovations.com/ (Human-computer interaction, Human-computer interaction researchers, Usability, Computer science organizations, Artificial intelligence, Software development)
http://www.nathan.com/
http://nooface.net/
http://www.hcibib.org/

Games
http://www.robinlionheart.com/gamedev/genres.xhtml (Image processing, Computer programming, Demo effects, Regression analysis, Computer graphics)
http://open-site.org/Games/Video_Games/
http://www.literature-study-online.com/essays/alice_video.html

7. REFERENCES
[1] Open Directory Project (ODP). http://www.dmoz.org/
[2] Asnicar, F., & Tasso, C. 1997. ifWeb: A Prototype of User Model-Based Intelligent Agent for Document Filtering and Navigation in the World Wide Web. In Proceedings of the 6th International Conference on User Modeling.
[3] Billsus, D., & Pazzani, M. 1999. A Personal News Agent that Talks, Learns and Explains. In Proceedings of the 3rd Ann. Conf. on Autonomous Agents.
[4] Brajnik, G., & Tasso, C. 1994. A Shell for Developing Non-Monotonic User Modeling Systems. Human-Computer Studies, 40, 31-62.
[5] Chen, L., & Sycara, K. 1998. WebMate: A Personal Agent for Browsing and Searching. In Proceedings of the Second International Conference on Autonomous Agents, Minneapolis, Minnesota, United States.
[6] Joachims, T., Freitag, D., & Mitchell, T. 1997. WebWatcher: A Tour Guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence.
[7] Mooney, R. J., & Roy, L. 2000. Content-Based Book Recommending Using Learning for Text Categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, Texas, United States.
[8] Ortega, F., Gonzalez-Barahona, J. M., & Robles, G. 2008. On the Inequality of Contributions to Wikipedia. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences.
[9] Pazzani, M., & Billsus, D. 1997. Learning and Revising User Profiles: The Identification of Interesting Web Sites. Machine Learning, 27, 313-331.
[10] Pazzani, M., Muramatsu, J., & Billsus, D. 1996. Syskill & Webert: Identifying Interesting Web Sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, Oregon, United States.
[11] Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning, 1(1), 81-106.
[12] Salton, G., & Lesk, M. E. 1965. The SMART Automatic Document Retrieval System -- An Illustration. Commun. ACM, 8(6), 391-398.
[13] Sussan, G. 2007. Web 2.0: The Academic Library and the Net Gen Student (p. 35). ALA Editions.
[14] Zhang, Y., Callan, J., & Minka, T. 2002. Novelty and Redundancy Detection in Adaptive Filtering. In Proceedings of the 25th Ann. Int'l ACM SIGIR Conf.