<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Wikipedia Content to Derive an Ontology for Modeling and Recommending Web Pages across Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pei-Chia Chang</string-name>
          <email>pcchang@hawaii.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luz M. Quiroga</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information &amp; Computer Science</institution>
          <addr-line>1680 East-West Road, Honolulu, HI 96822, USA</addr-line>
          <phone>1-808-2209701</phone>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information &amp; Computer Science</institution>
          <addr-line>1680 East-West Road, Honolulu, HI 96822, USA</addr-line>
          <phone>1-808-9569988</phone>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we are building a client-side, cross-system recommender that uses Wikipedia's content to derive an ontology for content and user modeling. We speculate that the collaborative content of Wikipedia may cover many of the topical areas that people are generally interested in, and that its vocabulary may be closer to that of the general public and updated sooner. Using the Wikipedia-derived ontology as a shared platform to model web pages also addresses the issue of cross-system recommendation, which generally requires a unified protocol or a mediator. Preliminary tests suggest that our derived ontology is a fair content model that maps an unknown web page to its related topical categories. Once page topics can be identified, user models are formulated by analyzing usage pages. Eventually, we will formally evaluate the topicality-based user model.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender</kwd>
        <kwd>Agent</kwd>
        <kwd>User Modeling</kwd>
        <kwd>Ontology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        User modeling through content is one common solution for
recommending web pages across systems [
        <xref ref-type="bibr" rid="ref14 ref3 ref7 ref9">3,7,9,14</xref>
        ]. In this work,
we are interested in using the collaborative content of Wikipedia
to derive an ontology as a unified knowledge base for modeling
web pages. Wikipedia is one of the world’s largest collaborative
knowledge bases. Although only a few users (less
than 10% of the user population) contribute to the content of Wikipedia [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], it
has a huge pool of readers. As Sussan describes, “with Web 2.0
products, it is the user’s engagement with the website that literally
drives it” [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Similarly, we speculate that Wikipedia’s content and its
vocabulary may cover recent and popular topical areas that people
are generally interested in. The language in Wikipedia may be
closer to what the general public uses than to a vocabulary controlled by
domain experts. We emphasize that the topics of Wikipedia, rather than the
accuracy of its content, may reflect the dynamic information
on the Internet.
      </p>
      <p>Our recommender formulates a user model based on the browsing
behavior at the client side and on the usage pages mapped to the
derived ontology. Given the research potential of Wikipedia’s
content, we are interested in how well web pages can be recommended
based on the Wikipedia-derived ontology. Our research
question is "Does the recommender based on Wikipedia’s
content model provide topically relevant recommendations?"</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Content-based recommenders include WebWatcher[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Syskill &amp;
Webert[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], WebMate[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and ifWeb[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. WebWatcher and
WebMate adopt TF-IDF, the vector space model and similarity
clustering. Syskill &amp; Webert relies on feature extraction,
particularly expected information gain[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which relies on the
co-occurrence of related keywords, and on relevance feedback. The
system formulates a profile vector that consists of keywords
from positively rated pages, set against keywords from negatively
rated pages. Then, a Bayesian classifier is employed to determine a
page’s topics and its similarity to the profile vector. ifWeb
employs a semantic network and a consistency-based user modeling
shell[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In general, these four systems apply statistical
approaches, such as TF-IDF or expected information gain, for
keyword extraction and a clustering method or classifier for similarity
identification. Our work borrows Wikipedia’s categorization
system and augments it with keywords identified by predefined
heuristics as topical indices. A full listing of existing categories
and indexes in Wikipedia can be found at
http://en.wikipedia.org/wiki/Portal:Contents/Categorical_index.
In our study, page classification depends on the frequency with which
those indexing keywords appear in a web page. Our work differs
from the previous systems in its use of Wikipedia’s collaborative
categorization system to derive an ontology that is augmented
with heuristic information extraction from Wikipedia’s content.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method Description</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 System Architecture</title>
      <p>Our recommender uses Wikipedia’s content to derive an
ontology for content and user modeling. With the ontology, our
system automatically assigns to a new page every Wikipedia category
whose threshold value the page passes, which yields a
“categorical” vector space model for the page. The system also
captures user interests in the user model through the categories.
Pages whose topics are similar to the profile are recommended.
There are four major components in the system: the crawler, the
Wikipedia knowledge base (WikiBase), the sensor, and the
matcher. Starting from the top part of the graph, the crawler
fetches the pages hyperlinked from the usage pages and
queries search engines based on the user model managed by the
sensor. Using the sensor, the crawler generates a corresponding
content model for each newly fetched page.</p>
      <p>Every component in the system uses WikiBase, which stores
ontologies, keywords, content models and the user model.
We construct WikiBase by combining
Wikipedia’s categorization system with heuristic information
extraction of keywords. The heuristics cover page titles, categorical
labels, anchor texts, italic and bold text, and TF-IDF terms. In order to
associate keywords with categories, we extract heuristic keywords
from pages labeled with one of the categories by Wikipedia’s
editors. Therefore, each category has a list of keywords to be
utilized by the sensor.</p>
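      <p>The association between categories and heuristic keywords can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the function name <monospace>build_keyword_index</monospace>, the shape of the input (category labels plus keywords already extracted per heuristic), the sample pages and the lower-casing step are all assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def build_keyword_index(labeled_pages):
    """Group heuristic keywords by the categories assigned to their source page.

    labeled_pages: iterable of (categories, extracted) pairs, where
    `categories` is the list of category labels on a page and `extracted`
    maps a heuristic name (for example "title" or "anchor") to the keywords
    that heuristic produced for the page.
    Returns {category: {heuristic: Counter({keyword: frequency})}}.
    """
    index = defaultdict(lambda: defaultdict(Counter))
    for categories, extracted in labeled_pages:
        for category in categories:
            for heuristic, keywords in extracted.items():
                # Lower-casing is an illustrative normalisation choice.
                index[category][heuristic].update(k.lower() for k in keywords)
    return index


# Hypothetical, hand-written input standing in for parsed Wikipedia pages.
pages = [
    (["Algorithms"], {"title": ["Quicksort"], "anchor": ["sorting", "pivot"]}),
    (["Algorithms", "Genetic algorithms"],
     {"title": ["Genetic algorithm"], "anchor": ["sorting", "mutation"]}),
]
index = build_keyword_index(pages)
print(index["Algorithms"]["anchor"].most_common(1))  # [('sorting', 2)]
```
]]></preformat>
      <p>Keeping one frequency table per heuristic, rather than a single merged list, leaves each heuristic free to be weighted separately later.</p>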
      <p>The sensor manages the user model and maps usage pages into
content models. It calculates a page's topical relevance and
formulates the corresponding content model according to the
WikiBase’s keyword weight and the word frequency of the web
page. The user model is updated, on a frequency basis, by the
sensor whenever it maps a usage page into a content model.
Therefore, the user model is constantly evolving. In other words, if
a user accesses a specific categorical topic multiple times or
through multiple pages, the user will score higher in the
corresponding category of the user model. Keyword weighting
and sensing formulas are defined below.</p>
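      <p>As a rough companion to the keyword weighting and relevance formulas defined in this section, the following Python sketch shows one way they could be computed. Several points are assumptions rather than details from the system: the function names, the dictionary-based inputs, the default values of α and d, the substring test used to detect a partial match, and in particular the reading of α as a plain multiplicative factor next to the keyword frequency.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def keyword_weights(freq_by_heuristic, coefficients):
    """Sum, over heuristics, a_j times the max-normalised keyword frequency.

    freq_by_heuristic: {heuristic: {keyword: frequency}}
    coefficients:      {heuristic: a_j}
    """
    weights = defaultdict(float)
    for heuristic, freqs in freq_by_heuristic.items():
        top = max(freqs.values(), default=0)
        if top == 0:
            continue  # nothing was observed for this heuristic
        for keyword, freq in freqs.items():
            weights[keyword] += coefficients[heuristic] * freq / top
    return dict(weights)


def relevance_score(page_term_freq, weights, alpha=0.75, partial_d=0.25):
    """Score a page against one category's weighted keywords.

    ASSUMPTIONS: alpha multiplies the term frequency; a full match (d = 1)
    means the page term equals the keyword; a partial match (d = partial_d)
    means one string contains the other.
    """
    score = 0.0
    for keyword, weight in weights.items():
        for term, freq in page_term_freq.items():
            if term == keyword:
                d = 1.0
            elif keyword in term or term in keyword:
                d = partial_d
            else:
                continue
            score += d * weight * alpha * freq
    return score


# Hypothetical frequencies for one category and two heuristics.
freq_by_heuristic = {
    "title": {"algorithm": 4, "sorting": 2},
    "anchor": {"algorithm": 1, "graph": 2},
}
coefficients = {"title": 1.0, "anchor": 0.5}
weights = keyword_weights(freq_by_heuristic, coefficients)
print(weights)  # {'algorithm': 1.25, 'sorting': 0.5, 'graph': 0.5}

page = {"algorithm": 3, "sorting algorithm": 1}
print(relevance_score(page, weights))  # 3.140625
```
]]></preformat>
      <p>A category would then be assigned to the page when its score exceeds that category's threshold, and the scores over all categories form the page's categorical vector.</p>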
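      <sec id="sec-4-1">
        <title>Definitions:</title>
        <p>|K<sub>j</sub>| is the number of keywords extracted for heuristic type j.</p>
        <p>|K<sup>c</sup>| is the number of keywords in category c.</p>
        <p>|Categories| is the number of categories in the knowledge base.</p>
        <p>freq(K<sub>ij</sub>) is the frequency of keyword K<sub>ij</sub> for heuristic type j.</p>
        <p>max(K<sub>1</sub>, …, K<sub>m</sub>) is the maximum value among the m elements.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The weight of keyword K<sub>ij</sub> among m heuristics is:</title>
        <p>
          <disp-formula>
            <tex-math>W(K_{ij}) = \sum_{j=1}^{m} a_j \left( \frac{\mathrm{freq}(K_{ij})}{\max\left(\mathrm{freq}(K_{1j}), \ldots, \mathrm{freq}(K_{|K_j|\,j})\right)} \right)</tex-math>
          </disp-formula>
        </p>
        <p>where 1 ≤ i ≤ |K<sub>j</sub>|, 1 ≤ j ≤ m, and
a<sub>j</sub> is a weighting coefficient assigned to heuristic j.</p>
      </sec>
      <sec id="sec-4-3">
        <title>A page’s Relevance Score R<sub>c</sub> to a category c is:</title>
        <p>
          <disp-formula>
            <tex-math>R_c = \sum_{i=1}^{|K^c|} d \, W(K_i^c) \, \alpha \, \mathrm{freq}(K_i^c)</tex-math>
          </disp-formula>
        </p>
        <p>where 1 ≤ i ≤ |K<sup>c</sup>|, 0.5 &lt; α &lt; 1, and 1 ≤ c ≤ |Categories|.</p>
        <p>The factor d depends on the kind of match: d = 1 for a full match,
and 0 &lt; d &lt; 0.5 for a partial match.</p>
        <p>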
The matcher computes the cosine similarity between each
crawler-retrieved page and the user model and then generates
recommendations. In addition to cosine similarity, the matcher
relies on the ontological structure of WikiBase. This
structure reveals topical associations among web pages
and helps to identify whether a user is interested in particular
domains. We define two indices, diversity and specificity,
to represent the coverage of user interests. The procedure
is as follows.</p>
        <p>First, construct a minimal spanning tree that traverses
all the identified categories in the user model. Identified
categories are those categories with a Relevance Score over a
predefined threshold. In order to connect the identified categories,
connecting nodes, such as parents or neighbors of the
identified categories, may be added to the tree. The
two indices are defined as follows.</p>
        <p>Diversity index: count the number of edges of the minimal
spanning tree and normalize it by dividing by the number of
identified nodes, excluding connecting nodes, in the spanning
tree.</p>
        <p>Specificity index: sum the minimal distances from the root to each
identified category and normalize the sum by dividing by the
number of identified categories, excluding connecting
nodes.</p>
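        <p>The two indices can be sketched in Python under a simplifying assumption that is not stated in this paper: every category has a single parent, so the category structure is a rooted tree and the minimal connecting tree is the union of the paths from the identified categories up to their lowest common ancestor. The function names and the small category hierarchy below are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def path_to_root(node, parent):
    """Return [node, its parent, ..., root] by following parent links."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def interest_indices(identified, parent):
    """Return (diversity, specificity) for the identified categories.

    parent maps each category to its single parent; the root maps to None.
    ASSUMPTION: the hierarchy is a tree, so the connecting tree is found by
    walking each identified category up to the lowest common ancestor.
    """
    identified = list(identified)
    if not identified:
        raise ValueError("at least one identified category is required")
    paths = [path_to_root(c, parent) for c in identified]

    # Specificity: mean distance (depth) from the root to identified nodes.
    specificity = sum(len(p) - 1 for p in paths) / len(identified)

    # Nodes on every path are common ancestors; the first one met while
    # walking upward is the lowest common ancestor.
    common = set.intersection(*(set(p) for p in paths))
    edges = set()
    for p in paths:
        for child, par in zip(p, p[1:]):
            if child in common:
                break  # reached the lowest common ancestor
            edges.add((child, par))

    # Diversity: connecting-tree edges per identified category.
    diversity = len(edges) / len(identified)
    return diversity, specificity


# Hypothetical toy hierarchy, not taken from WikiBase.
parent = {
    "Computer science": None,
    "Algorithms": "Computer science",
    "Genetic algorithms": "Algorithms",
    "Human-computer interaction": "Computer science",
    "Usability": "Human-computer interaction",
}
print(interest_indices(["Genetic algorithms", "Usability"], parent))   # (2.0, 2.0)
print(interest_indices(["Algorithms", "Genetic algorithms"], parent))  # (0.5, 1.5)
```
]]></preformat>
        <p>In this toy example, interests spread over two branches give a higher diversity index than interests confined to one branch, while the specificity index grows with the depth of the identified categories. A category graph with multiple parents would need a proper minimal connecting tree search instead.</p>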
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2 Evaluation Method</title>
      <p>
        We plan to recruit a few participants (&lt; 10) from the computer
science domain, the domain from which we derive WikiBase. Each participant
will first provide a set of web pages (&gt; 30) of
interest, which serves as the usage source for formulating the
user model. Afterward, each participant will rate a collection (&gt; 300)
of web pages on topical relevance and novelty. The
ratings will be divided into a training set and a validation set. Our
system will tune the keyword weights based on the training data.
We will compare our system's performance with the SMART
system[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which also uses the vector space model.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4. Current Status &amp; Discussion</title>
    </sec>
    <sec id="sec-7">
      <title>4.1 Current Status</title>
      <p>We have built WikiBase for the computer science domain; the result is listed at
http://www2.hawaii.edu/~pcchang/ICS699/results.html. We
selected this domain because of its rich data. Two preliminary tests
were conducted with two computer science professionals. In the
first one, we tested the following two pages for their topical relevance:
http://www.algosort.com/ (A) and
http://tc.eserver.org (B).
Considering only the top two rankings, page A is sensed as the
“algorithms” and “genetic algorithms” categories, and page B is
sensed as the “human-computer interaction” and “usability”
categories. In evaluating the classification result, both
participants’ rankings matched the system’s ranking for
the top two categories.</p>
      <p>In the second test, we selected fourteen pages, listed in the
appendix, from four topical areas: algorithm, data mining,
human-computer interaction (HCI) and computer games. Both
participants evaluated at least five categorical keywords for
each page. For each keyword, they rated their agreement, from 1
(disagree) to 5 (agree), with the following statement: "The given
phrase is a topical keyword of the page." The given phrase is a
categorical label generated by the sensor for the page. The
following table summarizes the ratings.</p>
      <table-wrap>
        <table>
          <thead>
            <tr>
              <th />
              <th>Algorithm (3 pages)</th>
              <th>Data Mining (4 pages)</th>
              <th>HCI (4 pages)</th>
              <th>Games (3 pages)</th>
              <th>Average</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Participant 1</td>
              <td>3.88</td>
              <td>3.95</td>
              <td>4.22</td>
              <td>2.67</td>
              <td>3.67</td>
            </tr>
            <tr>
              <td>Participant 2</td>
              <td>2.83</td>
              <td>3.40</td>
              <td>4.27</td>
              <td>2.2</td>
              <td>3.2</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>From the results, both participants’ average scores rank the topics
in the same order: HCI, Data Mining, Algorithm, and Games. Except for the
Games topic, the agreement score is around 4 for participant 1 and
3.5 for participant 2. We suspect that, due to the wide coverage of
computer games, our system performs worse in that category.
Another reason may be the nature of the computer science
category: it reflects the common scientific techniques or theory
for producing computer games, which differs from the tested
pages, which view computer games from a player’s perspective.</p>
      <p>
        We are still tuning the keyword weights by
utilizing the computer science pages from the Open Directory Project
(ODP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Pages in ODP are manually categorized by its users,
and we use that classification to evaluate our sensor. To
evaluate the recommendations, we are training the matcher with
pages of a different topical coverage. Eventually, we will apply
the evaluation method described earlier.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4.2 Discussion</title>
      <p>Using the Wikipedia categories as an ontological model yields a
simple user profile. This modeling approach is particularly beneficial
for cross-system recommendation, since the same category set serves as
a shared model across systems. Our recommendation engine
works at the client side, which eases the privacy concern of
disclosing sensitive information to web servers. Combining
categories with heuristic information extraction leaves room for
the selection of heuristics: different domains or user groups
can apply the heuristics of interest to them. Given these
advantages, we look forward to seeing the results of our
evaluation.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Future Work</title>
      <p>The Wikipedia content and categorization system play an
important role in our method of generating recommendations. Our
work emphasizes the framework that automates ontology
generation and its performance in recommendations.
Nevertheless, the quality of Wikipedia content is controversial. It
would be worthwhile to apply the same framework to another
Wikipedia-like platform with a different user group, such as
domain experts, to ensure the content quality.</p>
      <p>Another interesting area is to study how content statistics, such as
the volume or granularity of the categories, relate to recommendation
performance. Not every domain in Wikipedia contains
categories and articles as rich as those of computer science. Therefore, the
performance of recommendations may be related to some of these
statistics.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Appendix</title>
      <p>Due to limited space, the categorical keywords are displayed only
for the first page of each selected topic.</p>
      <sec id="sec-10-1">
        <title>Algorithms</title>
        <p>http://www.algosort.com/
(Algorithms, Genetic algorithms, Root-finding algorithms,
Networking algorithms, Disk scheduling algorithms)</p>
        <p>http://www.oopweb.com/Algorithms/Files/Algorithms.html</p>
        <p>http://cgm.cs.mcgill.ca/~godfried/teaching/algorithms-web.html</p>
      </sec>
      <sec>
        <title>Data Mining</title>
        <p>http://www.data-mining-guide.net/
(Databases, Algorithms, Knowledge representation, Natural language
processing, Knowledge discovery in databases, Machine learning, Data
mining)</p>
        <p>http://www.thearling.com/</p>
        <p>http://databases.about.com/od/datamining/Data_Mining_and_Data_Warehousing.htm</p>
        <p>http://www.ccsu.edu/datamining/resources.html</p>
      </sec>
      <sec>
        <title>HCI</title>
        <p>http://www.pcd-innovations.com/
(Human-computer interaction, Human-computer interaction researchers,
Usability, Computer science organizations,
Artificial intelligence, Software development)</p>
      </sec>
      <sec id="sec-10-2">
        <title>Games</title>
        <p>http://www.robinlionheart.com/gamedev/genres.xhtml
(Image processing, Computer programming, Demo effects,
Regression analysis, Computer graphics)</p>
        <p>http://open-site.org/Games/Video_Games/</p>
        <p>http://www.literature-study-online.com/essays/alice_video.html</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>7. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] http://www.dmoz.org/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Asnicar</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tasso</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>ifWeb: A Prototype of User Model-Based Intelligent Agent for Document Filtering and Navigation in the World Wide Web</article-title>
          .
          <source>In Proceedings of the 6th International Conference on User Modeling.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Billsus</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pazzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>A Personal News Agent that Talks, Learns and Explains</article-title>
          .
          <source>In Proceedings of the 3rd Ann. Conf. Autonomous Agents.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Brajnik</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tasso</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>A Shell for Developing NonMonotonic User Modeling Systems</article-title>
          . Human-Computer Studies,
          <volume>40</volume>
          ,
          <fpage>31</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sycara</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>WebMate: a personal agent for browsing and searching</article-title>
          .
          <source>In Proceedings of the second international conference on Autonomous agents, Minneapolis</source>
          , Minnesota, United States
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitag</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>WebWatcher: A Tour Guide for the World Wide Web</article-title>
          .
          <source>In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Content-based book recommending using learning for text categorization</article-title>
          .
          <source>In Proceedings of the fifth ACM conference on Digital libraries</source>
          , San Antonio, Texas, United States.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Barahona</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Robles</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>On the Inequality of Contributions to Wikipedia</article-title>
          .
          <source>In Proceedings of the 41st Annual Hawaii International Conference on System Sciences.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Pazzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Billsus</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Learning and Revising User Profiles: The Identification of Interesting Web Sites</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>27</volume>
          ,
          <fpage>313</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pazzani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muramatsu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Billsus</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Syskill &amp; Webert: Identifying Interesting Web Sites</article-title>
          .
          <source>In Proceedings of the Thirteenth National Conference on Artificial Intelligence</source>
          , Portland, Oregon, United States.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          <year>1986</year>
          .
          <article-title>Induction of Decision Trees</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>81</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lesk</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          <year>1965</year>
          .
          <article-title>The SMART automatic document retrieval systems -- an illustration</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ),
          <fpage>391</fpage>
          -
          <lpage>398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Sussan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Web 2.0 The Academic Library and the Net Gen Student (pp. 35): ALA editions</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Minka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Novelty and Redundancy Detection in Adaptive Filtering</article-title>
          .
          <source>In Proceedings of the 25th Ann. Int'l ACM SIGIR Conf.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>