<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recommender system based on user-generated content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Turdakov Denis</string-name>
          <email>turdakov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Institute for System Programming, RAS. Ph.D. adviser: Kuznetsov S.D.</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Recommender systems apply statistical and knowledge-discovery techniques to the problem of making recommendations during live user interaction. This paper describes a novel approach to building recommender systems for the Web with the aid of user-generated content. Recently, certain communities of Internet users have engaged in creating high-quality, peer-reviewed content for the Web. In our approach we plan to extract the semantics of such user-generated content and to use these semantics to make more useful recommendations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <sec id="sec-1-1">
        <title>1.1 Recommender Systems</title>
        <p>Recommender systems attempt to predict items (web
pages, movies, books) that a user may be interested in,
given some information about the user's profile.</p>
        <p>
          Collaborative filtering is the most popular approach
to building such systems: individual users are
automatically joined into groups based on similarity in
their interests or past behaviour, and recommendations
are made based on the preferences of their group
members. Another method, content-based filtering,
predicts a user's interests from his or her past
preferences: new content is compared against the
user's past preferences, and similar items are
recommended. Both kinds of systems suffer from a
number of deficiencies; mainly, both fail to recommend
novel interesting topics. For example, a recommender
system for travellers based on collaborative filtering
can assign a user to a class of people who visit
European capitals. However, once he has visited all of
the capitals, this system cannot recommend anything
new to him. More generally, it has been shown in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that recommender systems of both types that strive
for maximum classification accuracy do not necessarily
produce useful recommendations.
        </p>
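        <p>The failure mode in the travellers example can be made concrete with a toy user-based collaborative-filtering sketch. The user names, item names and function names below are invented for illustration only; this is not the system proposed in this paper:</p>
        <preformat>
```python
def jaccard(a, b):
    """Overlap similarity between two users' item sets."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def recommend(target, visits):
    """Recommend items seen by the most similar user but not by target."""
    sim, best = max((jaccard(visits[target], visits[o]), o)
                    for o in visits if o != target)
    return sorted(visits[best] - visits[target])

visits = {
    "traveller": {"paris", "berlin", "rome"},
    "peer_1":    {"paris", "berlin", "rome", "madrid"},
    "peer_2":    {"tokyo", "kyoto"},
}
```
        </preformat>
        <p>Here the traveller is matched to peer_1 and receives the one unvisited capital; once the traveller has visited everything peer_1 has, the same call returns an empty list, which is exactly the limitation described above.</p>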
        <p>In our work, we avoid this difficulty by generating
recommendations of Web pages with the aid of
semantics extracted from user-generated content. This
allows us to make recommendations based on
relationships between concepts that were created and
peer-reviewed by a large community of users. Obviously,
building a recommender system for Web pages that
extracts semantics from all of the content on the Web
would be very resource-demanding. Instead, we picked
Wikipedia as our source of user-generated semantics,
since it is a comprehensive and up-to-date corpus of
knowledge and relationships between concepts.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2 Wikipedia</title>
        <p>User-generated content refers to various kinds of
content that are produced or primarily influenced by
end-users. However, the average quality of Web content
is quite poor, as evidenced by vast amounts of Web
spam and unverified information submitted by
non-authoritative users. Therefore, instead of using the
whole Web, we chose Wikipedia as our base, since it is
a body of user-generated content that is all-encompassing
and at the same time of high quality and peer
reviewed.</p>
        <p>In every Wikipedia article, links guide users to
associated articles, often with additional information,
and the list of categories attached to each article
organizes Wikipedia articles into a taxonomic structure.
These links convey important semantic information that
we can use to produce high-quality recommendations.
Furthermore, any Internet user is welcome to add
further information, cross-references, or citations, so
long as they do so within Wikipedia's editing policies
and to an appropriate standard. Over time, then,
semantically rich, high-quality, peer-reviewed content
emerges.</p>
        <p>The English-language Wikipedia currently (at the
time of writing) contains more than 1,500,000
articles (6,000,000 when including redirects, discussion
pages and portals).</p>
        <p>Furthermore, Wikipedia can be considered a form of
Web summarization because of its broad scope,
conciseness and ability to quickly reflect new trends.
Therefore, Wikipedia can provide us with extremely
useful information about relationships between articles
and Web pages.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Problem formulation</title>
      <p>The main goal of this research is to develop a
recommender system for the Web based on
user-generated content.</p>
      <p>[Figure 1: system architecture. Content processing:
Wikipedia and blogs pass through link extraction and
cleaning, concept extraction, ontology creation and
concept weighting. Runtime system: the recommender
system combines the ontology with blogs annotated
with important concepts.]</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Cleaning Wikipedia links</title>
      <p>Wikipedia has its own markup, links to internal and
external articles, redirects and lists of categories. All of
this information would be useful for our research; so
far, however, we use only article titles and internal links.
Though this is only a small part of the information that
could be extracted from Wikipedia, it lets us obtain
results very quickly and rate their quality.</p>
      <p>When you analyze the link topology of Wikipedia
carefully, you will notice that in many cases an article
contains links to other articles that are completely
unrelated. For example, the article about Moscow
contains a link to an article about the Fahrenheit
temperature scale. Clearly, we should distinguish
between such links and high-quality links such as the
link from “Moscow” to “Capital of Russia”. So we need
a mechanism to clean or rank links on the basis of their
quality.</p>
      <p>
        To solve the link-cleaning task, it is necessary to
investigate how such low-quality links appear.
Typically, Wikipedia editors carefully insert links from
the key concepts of their article to other relevant
articles. Occasionally, a rogue user will insert a bunch of
irrelevant links into an otherwise high-quality article.
We can see a similar pattern with Web spam, where
spammers create large artificial chunks of the Web to
boost the page rank of some specific site. Therefore, we
would like to adapt emerging Web spam
combating algorithms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to clean Wikipedia links. For these methods to be
applicable, we need to make sure that Wikipedia has
the same structural properties as the Web.
      </p>
      <p>
        Widely known models of the evolution of the Web
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] describe global properties such as the degree
distribution or the appearance of communities. These
models indicate that the overall hyperlink structure
arises from copying links to pages depending on their
existing popularity. For example, in the most powerful
model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
pages on similar topics copy each other's links, which
results in a “rich-get-richer” effect and a power-law
degree distribution whose exponent varies approximately
from 2 to 3.
      </p>
      <p>Thus, the Web graph belongs to the class of scale-free
networks, whose most distinguishing characteristic is
that their degree distribution follows a power law. The
second property of this class of networks is
self-similarity: a large enough supporter set should
behave similarly to the entire Web. We can therefore
expect the properties of the Wikipedia link graph and
its subgraphs to be the same as those of the Web graph.</p>
      <p>The basic idea is to analyze the rank distribution of
a page over its neighborhood. If the link distribution in
some article's neighborhood does not follow a power
law, there is a high probability that the page's rank is
artificially inflated. Emerging spam detection
algorithms therefore require the following properties:
• Power-law link distribution
• Self-similarity</p>
      <p>We have analyzed Wikipedia and found that its link
structure follows a power-law distribution (figure 2);
self-similarity then also holds for Wikipedia. Hence,
we can use modified Web spam detection algorithms
for this task.</p>
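      <p>To check the power-law property concretely, the exponent of a degree sequence can be estimated with the continuous maximum-likelihood estimator of Clauset, Shalizi and Newman. The sketch below uses synthetic data under our own naming; real Wikipedia in-degrees would replace the generated sample:</p>
      <preformat>
```python
import math
import random

def powerlaw_exponent(degrees, x_min=1.0):
    """Continuous maximum-likelihood estimate of the power-law
    exponent: alpha = 1 + n / sum(ln(x_i / x_min)), taken over
    the tail of degrees greater than or equal to x_min."""
    tail = [d for d in degrees if d >= x_min]
    return 1.0 + len(tail) / sum(math.log(d / x_min) for d in tail)

# Synthetic degree sample from a power law with exponent 2.5,
# drawn by inverse-transform sampling.
random.seed(0)
true_alpha = 2.5
sample = [(1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0))
          for _ in range(20000)]
est = powerlaw_exponent(sample)
```
      </preformat>
      <p>For the claim in the text to hold, the exponent estimated on Wikipedia's degree sequence should fall roughly between 2 and 3, as for the Web graph.</p>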
      <sec id="sec-3-1">
        <title>2.2 Ontology extraction</title>
        <p>
          A naïve way to produce recommendations is to
recommend blogs associated with the nearest neighbors
in the Wikipedia link graph. However, there are serious
problems with this method. Researchers [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] proved that an uncorrelated power-law graph with
exponent approximately between 2 and 3 has an
ultrasmall diameter d ~ ln ln N (for Wikipedia, d = 2.75).
For our work this means we cannot use the link
structure alone to make recommendations, since we
would end up recommending the whole collection of
blogs. So we need to extract additional knowledge from
Wikipedia that will help us select a few relevant links
for recommendations. Therefore, the second stage of
Wikipedia processing deals with extracting semantic
information from links and articles. Next, we give a
short overview of ontologies used in information
retrieval and describe our ontology model.
        </p>
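        <p>The quoted diameter can be reproduced directly from the page counts given in Section 1.2; the short check below assumes N counts all pages, including redirects, discussion pages and portals:</p>
        <preformat>
```python
import math

def ultrasmall_diameter(n):
    """Ultrasmall-world diameter estimate d ~ ln ln N from [5]."""
    return math.log(math.log(n))

# About 6,000,000 English Wikipedia pages (Section 1.2):
d = ultrasmall_diameter(6000000)
```
        </preformat>
        <p>This gives d of about 2.75, matching the value quoted above.</p>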
        <p>Semantic extraction and ontology development is a
well-studied topic. WordNet is the most successful
hand-crafted semantic lexicon for the English language.
It groups English words into sets of synonyms called
synsets, provides short, general definitions, and records
the various semantic relations between these synonym
sets. WordNet distinguishes between nouns, verbs,
adjectives and adverbs because they follow different
grammatical rules. Every synset contains a group of
synonymous words or collocations (a collocation is a
sequence of words that go together to form a specific
meaning, such as "car pool"); different senses of a word
are in different synsets. The meaning of a synset is
further clarified with short defining glosses (definitions
and/or example sentences).</p>
        <p>
          Simple methods have been used at the IBM research
center to make automatic semantic annotations of Web
pages. Seeker, a platform for large-scale text analytics,
is described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. SemTag, an application written on that
platform, performs automated semantic tagging of
large corpora. The authors use the small and simple TAP
ontology [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to process large amounts of Web pages;
this is the largest-scale semantic tagging effort to date.
We work with much smaller data; therefore, we can use
a more sophisticated method and extract more semantic
data from a document.
        </p>
        <p>
          In recent works, researchers have started to extract
semantics from Wikipedia; the most profound work was
done by Kozlova [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. She extracts an ontology with a
structure similar to WordNet. To evaluate the quality of
the ontology, she compared ontology-driven
classification of the Reuters collection using the
extracted ontology versus WordNet, and achieved
better results with the former.
        </p>
        <p>In her work, the article link structure and the article
structure itself were used for ontology extraction. For
example, by analyzing the link [[Capital of France |
Paris]] we can easily produce the synonyms “Paris” and
“Capital of France”. Also, if a document is linked under
one of the special sections like “see also” or “similar
topics”, this indicates that the document has something
to do with the topic.</p>
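        <p>Synonym extraction from piped links can be sketched as a small parser over raw wikitext. The regular expression and function name are ours; real Wikipedia markup has further cases (section anchors, templates) that this sketch ignores:</p>
        <preformat>
```python
import re

# Matches [[target]] and [[target|anchor]] wiki links.
LINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def link_synonyms(wikitext):
    """Return (target, anchor) pairs; a pair whose parts differ,
    such as (Capital of France, Paris), is a synonym candidate."""
    pairs = []
    for m in LINK.finditer(wikitext):
        target = m.group(1).strip()
        anchor = (m.group(2) or target).strip()
        pairs.append((target, anchor))
    return pairs

sample = "[[Capital of France|Paris]] lies far from [[Moscow]]."
```
        </preformat>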
        <p>Unlike previous works, we will extract a more
semantically rich ontology. For now we use categories,
“see also” links and general links in the articles.</p>
        <p>The lists of categories form a directed graph over
the articles of Wikipedia, which can be very useful for
pruning irrelevant links when making
recommendations. For example, the Wikipedia article
about Kurchatov contains links to “physics”,
“Physico-Technical Institute” and other topics that are
poor recommendation candidates. With the aid of
categories we can prune these links and recommend
more relevant topics, such as articles about Kurchatov's
colleagues. This is the most basic use of our ontology;
we will investigate more sophisticated methods in
future work.</p>
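        <p>This basic use of the ontology can be sketched as a filter that keeps only links whose target shares at least one category with the source article. The article and category names below are invented for illustration:</p>
        <preformat>
```python
def prune_links(article, links, categories):
    """Keep only link targets sharing a category with the source.
    categories maps an article title to its set of category names."""
    src = categories.get(article, set())
    return [t for t in links
            if src.intersection(categories.get(t, set()))]

categories = {
    "Kurchatov": {"Soviet physicists"},
    "Physics":   {"Natural sciences"},
    "Colleague": {"Soviet physicists"},
}
kept = prune_links("Kurchatov", ["Physics", "Colleague"], categories)
```
        </preformat>
        <p>Here the link to the general “Physics” article is pruned, while the link to an article in a shared category survives as a recommendation candidate.</p>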
      </sec>
      <sec id="sec-3-2">
        <title>2.3 Web logs processing</title>
        <p>We now turn to preprocessing web logs. In order to
find the blogs most similar to a user's preference set, it
is necessary to extract terms from each blog and to
correlate these terms with concepts from the ontology.
This enables us to make recommendations based on
those concepts.</p>
        <p>
          When we correlate blogs with concepts, each blog
becomes associated with a large number of concepts.
In order to identify the essential concepts, we use a
modified tf-idf weighting scheme [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We avoid
recomputing idf every time the blog collection is
updated by computing idf over Wikipedia only.
        </p>
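        <p>A minimal sketch of this idea, with toy two-word documents: the idf table is built once from Wikipedia and reused for every blog, so updating the blog collection never forces an idf recomputation. The exact modified scheme of [9] is more involved; this only shows the freezing of idf:</p>
        <preformat>
```python
import math

def build_idf(wikipedia_docs):
    """Compute idf once, over Wikipedia only, so it need not be
    recomputed when the blog collection is updated."""
    n = len(wikipedia_docs)
    df = {}
    for doc in wikipedia_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n / df[t]) for t in df}

def weight_blog(blog_terms, idf):
    """tf-idf weights for one blog using the precomputed idf;
    terms unknown to Wikipedia get zero weight."""
    tf = {}
    for term in blog_terms:
        tf[term] = tf.get(term, 0) + 1
    total = len(blog_terms)
    return {t: (tf[t] / total) * idf.get(t, 0.0) for t in tf}

wiki = [["physics", "atom"], ["atom", "moscow"], ["moscow", "capital"]]
idf = build_idf(wiki)
weights = weight_blog(["atom", "atom", "physics"], idf)
```
        </preformat>
        <p>Note that the rarer term "physics" outweighs the more frequent but commoner "atom", which is the behaviour tf-idf is meant to deliver.</p>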
      </sec>
      <sec id="sec-3-3">
        <title>2.4 Generation of recommendations</title>
        <p>
          At runtime, the recommender system derives top-N
recommendations from the ontology based on the
user's preference set. Little research has been done on
this topic. An ontology-based information retrieval
model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] exploits ontology-based knowledge bases to
improve search over large document collections. This
approach includes an ontology-based scheme for the
semi-automatic annotation and retrieval of documents.
We plan to use and extend this technique for computing
similarity between blogs and ranking recommendations.
        </p>
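        <p>As a baseline, the ranking step can be sketched as cosine similarity between the user's concept weights and each blog's concept weights. This is a plain content-based ranking under our own naming; the model of [9] extends it with ontology relations:</p>
        <preformat>
```python
import math

def concept_similarity(a, b):
    """Cosine similarity between two concept-weight dicts."""
    num = sum(a[c] * b[c] for c in set(a).intersection(b))
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def top_n(profile, blogs, n=3):
    """Return the n blog names most similar to the user profile."""
    scored = sorted(((concept_similarity(profile, w), name)
                     for name, w in blogs.items()), reverse=True)
    return [name for score, name in scored[:n]]

profile = {"physics": 0.9, "moscow": 0.4}
blogs = {
    "blog_a": {"physics": 0.8, "atom": 0.3},
    "blog_b": {"travel": 0.9, "moscow": 0.2},
    "blog_c": {"cooking": 1.0},
}
ranked = top_n(profile, blogs, n=2)
```
        </preformat>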
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Related work</title>
      <p>
        Tapestry [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is one of the earliest implementations of
collaborative-filtering-based recommender systems.
This system relied on the explicit opinions of people
from a close-knit community, such as an office
workgroup. However, recommender systems for large
communities cannot depend on everyone knowing
everyone else. Later, several rating-based automated
recommender systems were developed. The GroupLens
research system [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
provides a pseudonymous collaborative filtering
solution for Usenet news and movies. Ringo and Video
Recommender are email- and web-based systems that
generate recommendations on music and movies,
respectively. A special issue of Communications of the
ACM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] presents a number of different recommender
systems. Although these systems have been successful
in the past, their widespread use has exposed some of
their limitations, such as problems of sparsity in the
data set, problems associated with high dimensionality,
and so on.
      </p>
      <p>
        A myriad of other recommender systems exist,
particularly on e-commerce sites. Schafer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] examines
and categorizes a large set of these commercial
recommender systems. In addition, numerous
recommenders in a variety of domains have been
developed for research purposes, including MovieLens
(films), Ringo (music), and Jester (jokes).
      </p>
      <p>All of these systems are based on collaborative
filtering and correspondingly suffer from the problems
stated above. We avoid these problems by using
high-quality user-generated content as a foundation for
making recommendations.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Conclusion</title>
      <p>In this article we propose a novel architecture for
building recommender systems and formulate the
major directions of future work. We plan to modify and
apply Web spam detection algorithms for cleaning
Wikipedia links. We will then evaluate various
approaches to making recommendations using the
extracted ontology (we have given basic examples of
such approaches in Sections 2.2 and 2.4).</p>
      <p>Finally, we will implement a complete blog
recommender system based on the described techniques
and evaluate it with Internet users.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.M.</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          .
          <article-title>"Being Accurate is Not Enough: How Accuracy Metrics have hurt Recommender Systems"</article-title>
          .
          <source>In the Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI</source>
          <year>2006</year>
          ), Montreal, Canada,
          <year>April 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Benczúr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Csalogány</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sarlós</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Uher</surname>
          </string-name>
          .
          <article-title>SpamRank - Fully Automatic Link Spam Detection</article-title>
          , work in progress, Computer and Automation Research Institute, Hungarian Academy of Sciences,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Barabási</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Albert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jeong</surname>
          </string-name>
          .
          <article-title>Scale-free characteristics of random networks: the topology of the world-wide web</article-title>
          .
          <source>Physica A</source>
          ,
          <volume>281</volume>
          :
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajagopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sivakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Upfal</surname>
          </string-name>
          .
          <article-title>Stochastic models for the web graph</article-title>
          .
          <source>In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science (FOCS)</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohen</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Havlin</surname>
          </string-name>
          ,
          <article-title>Scale-free networks are ultrasmall</article-title>
          ,
          <source>Phys. Rev. Lett</source>
          .
          <volume>90</volume>
          , 058701
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Dill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nadav</given-names>
            <surname>Eiron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Gruhl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anant</given-names>
            <surname>Jhingran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tapas</given-names>
            <surname>Kanungo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sridhar</given-names>
            <surname>Rajagopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Tomkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John A.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jason Y.</given-names>
            <surname>Zien</surname>
          </string-name>
          .
          <article-title>SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation</article-title>
          .
          <source>IBM Almaden Research Center</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <article-title>TAP ontology page</article-title>
          . http://ontap.stanford.edu/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Natalia</given-names>
            <surname>Kozlova</surname>
          </string-name>
          .
          <article-title>Automatic ontology extraction for document classification</article-title>
          . Computer science department, Saarland University,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>David</given-names>
            <surname>Vallet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Miriam</given-names>
            <surname>Fernández</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Castells</surname>
          </string-name>
          .
          <article-title>An Ontology-Based Information Retrieval Model</article-title>
          . Universidad Autonoma de Madrid.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nichols</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oki</surname>
            ,
            <given-names>B. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Terry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Using Collaborative Filtering to Weave an Information Tapestry</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Resnick</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iacovou</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergstrom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>GroupLens: An Open Architecture for Collaborative Filtering of Netnews</article-title>
          .
          <source>Proceedings of CSCW</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Resnick</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Varian</surname>
            ,
            <given-names>H. R.</given-names>
          </string-name>
          .
          <article-title>Recommender Systems</article-title>
          .
          <source>Special issue of Communication of the ACM</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ),
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schafer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>Electronic commerce recommender applications</article-title>
          .
          <source>Data Mining and Knowledge Discovery, Jan</source>
          .
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>