CEUR-WS Vol-256, paper 9. PDF: https://ceur-ws.org/Vol-256/submission_16.pdf. DBLP: https://dblp.org/rec/conf/syrcodis/Turdakov07
      Recommender system based on user-generated content

                                                © Turdakov Denis
                                   Institute for System Programming, RAS
                                             turdakov@gmail.com
                                        Ph.D. adviser: Kuznetsov S.D.



                       Abstract

    Recommender systems apply statistical and knowledge discovery techniques to the problem of making recommendations during live user interaction. This paper describes a novel approach to building recommender systems for the Web with the aid of user-generated content. Recently, certain communities of Internet users have engaged in creating high-quality, peer-reviewed content for the Web. In our approach we plan to extract the semantics of such user-generated content and to use these semantics to make more useful recommendations.

1 Introduction

1.1 Recommender Systems

Recommender systems attempt to predict items (web pages, movies, books) that a user may be interested in, given some information about the user's profile.
    Collaborative filtering is the most popular approach to building such systems. Individual users are automatically joined into groups based on similarity in their interests or past behaviour, and recommendations are made based on the preferences of their group members. Another method generates recommendations by predicting a user's interests from his or her past preferences. In the latter approach, new content is compared against the user's past preferences and similar items are recommended. Both kinds of systems suffer from a number of deficiencies: mainly, they both fail to recommend novel interesting topics. For example, a recommender system for travellers based on collaborative filtering can assign a user to a class of people who visit European capitals. However, once the user has visited all of the capitals, this system cannot recommend anything new to him. More generally, it has been argued in [1] that recommender systems of both types that strive for maximum classification accuracy do not produce useful recommendations.
    In our work, we avoid this difficulty by generating recommendations of Web pages with the aid of semantics extracted from user-generated content. This allows us to make recommendations based on the relationships between concepts, created and peer-reviewed by a large community of users. Obviously, building a recommender system for Web pages that extracts semantics from all of the content on the Web would be very resource demanding. Instead, we picked Wikipedia as our source of user-generated semantics, since it is a comprehensive and up-to-date corpus of knowledge and relationships between concepts.

1.2 Wikipedia

User-generated content refers to various kinds of content that is produced or primarily influenced by end-users. However, the average quality of Web content is quite poor, as evidenced by vast amounts of Web spam and unverified information submitted by non-authoritative users. Therefore, instead of using the Web at large, we chose Wikipedia as our base, since it is a body of user-generated content that is all-encompassing and at the same time of high quality and peer reviewed.
    In every Wikipedia article, links guide users to associated articles, often with additional information, and the lists of categories attached to each article organize Wikipedia in a taxonomic structure. These links convey important semantic information that we can use to produce high-quality recommendations. Furthermore, any Internet user is welcome to add further information, cross-references, or citations, so long as they do so within Wikipedia's editing policies and to an appropriate standard. Thus, over time, semantically rich, high-quality, peer-reviewed content emerges.
    The English-language Wikipedia currently (at the time of writing) contains more than 1,500,000 articles (6,000,000 when including redirects, discussion pages and portals).
    Furthermore, Wikipedia can be considered a form of Web summarization because of its broad scope, conciseness and ability to quickly reflect new trends. Therefore, Wikipedia can provide us with extremely useful information about the relationships between articles and Web pages.

2 Problem formulation

The main goal of this research is to develop a recommender system for the Web based on user-
Figure 1: Recommender system architecture. Content processing (left): Wikipedia feeds link extraction and cleaning, then ontology creation; blogs feed concept extraction, then concept weighting. Runtime system (right): the recommender system operates over the resulting ontology and the blogs annotated with important concepts.
generated content and novel semantic techniques. Instead of recommending web pages or sites, our system will recommend blogs. This choice stems from a number of considerations. First, blogs are the most dynamic part of the Internet and are constantly being renewed. Secondly, blogs have a simple structure in comparison to web sites, and it is easier to evaluate a recommender system for blogs than for complex web sites. Finally, crawling a representative subset of the web is a daunting task.
    A web log recommender system would have substantial practical value, since this type of content search is a difficult task for the user. For example, discussion of new products starts long before their official announcement. But since the user does not yet know about a product, he cannot perform a meaningful search for it.
    The overall system architecture is presented in figure 1. Within the content processing framework there are two key processes: Wikipedia analysis (top of figure) and blog processing (bottom of figure). We analyze Wikipedia and create an ontology based on its structure. Then we extract and rank concepts from the blogs, making use of the ontology. Also in this stage we associate blogs and the ontology through concepts. For example, our system associates a blog containing the keywords "Moscow" and "Capital of Russia" with the Wikipedia article about Moscow. Then we use these associations, together with the ontology, to find blogs similar to the user's preference set (the set of blogs that characterizes the user's interests).
    In the next two sections we focus on the two main stages of the system: the first is Wikipedia link cleaning and ontology extraction; the second is establishing semantic relationships between Wikipedia concepts and web logs. In the following sections we describe the remaining stages.

2.1 Cleaning Wikipedia links

Wikipedia has its own markup, links to internal and external articles, redirects and lists of categories. All of this information would be useful for our research. So far we are only using article titles and internal links. Though this is only a small part of the information that could be extracted from Wikipedia, it allows us to get results very quickly and rate their quality.
    When you analyze the link topology of Wikipedia carefully, you will notice that in many cases an article contains links to other articles that are completely unrelated. For example, the article about Moscow contains a link to the article about the Fahrenheit temperature scale. Clearly, we should distinguish between such links and high-quality links such as the link from "Moscow" to "Capital of Russia". So we need a mechanism to clean or rank links on the basis of their quality.
    To solve the link cleaning task it is necessary to investigate how such low-quality links appear. Typically, Wikipedia editors carefully insert relevant links from the key concepts of their article to other articles. Occasionally, a rogue user will insert a bunch of irrelevant links into an otherwise high-quality article. We can see a similar pattern with Web spam, where spammers create large artificial chunks of the Web to boost the page rank of some specific site. Therefore we would like to modify and apply emerging Web spam combating algorithms [2] to clean Wikipedia links. For these methods to be applicable, we need to make sure that Wikipedia has the same properties as the Web graph.
    Widely known models of the evolution of the Web [3, 4] describe its global properties, such as the degree distribution or the appearance of communities. These models indicate that the overall hyperlink structure arises from copying links to pages depending on their existing popularity. For example, in the most powerful model [4], pages on similar topics copy each other's links, which results in a "rich get richer" dynamic and a power-law degree distribution whose exponent varies approximately from 2 to 3.
    Thus the web graph belongs to the class of scale-free networks, whose most distinguishing characteristic is a degree distribution that follows a power law. The second property of this class of networks is self-similarity: a large-enough supporter set should behave similarly to the entire Web. We can therefore expect the properties of the Wikipedia link graph and its subgraphs to match those of the Web graph.
    The basic idea is to analyze the rank distribution of a page in its neighborhood. If the link distribution in
the neighborhood of some article does not follow a power law, there is a high probability that the page's rank has been artificially inflated. Therefore the emerging spam detection algorithms require the following properties:
        • Power-law link distribution
        • Self-similarity
    We have analyzed Wikipedia and found that its link structure follows a power-law distribution (figure 2), from which it follows that self-similarity holds for Wikipedia as well. Hence we can use modified Web spam detection algorithms for this task.

2.2 Ontology extraction

A naïve way to produce recommendations is to recommend the blogs associated with the nearest neighbors in the Wikipedia link graph. However, there are serious problems with this method. Researchers [5] proved that an uncorrelated power-law graph with exponent approximately between 2 and 3 also has an ultrasmall diameter d ~ ln ln N (for Wikipedia, d = 2.75). For our work this means we cannot use the link structure alone to make recommendations, since we would end up recommending the whole collection of blogs. So we need to extract additional knowledge from Wikipedia that will help us select a few relevant links for recommendations. Therefore the second stage of Wikipedia processing deals with extracting semantic information from links and articles. Next, we give a short overview of ontologies used in information retrieval and describe our ontology model.
    Semantic extraction and ontology development is a well-studied topic. WordNet is the most successful hand-crafted semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short general definitions, and records the various semantic relations between these synonym sets. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. Every synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are in different synsets. The meaning of the synsets is further clarified with short defining glosses (definitions and/or example sentences).
    Simple methods have been used at the IBM research center to perform automatic semantic annotation of Web pages. Seeker, a platform for large-scale text analytics, is described in [6]; SemTag is an application written on this platform to perform automated semantic tagging of large corpora. The authors use the small and simple TAP ontology [7] to process large amounts of web pages. This is the largest-scale semantic tagging effort to date. We work with much smaller data; therefore we can use a more sophisticated method and extract more semantic data from each document.
    In recent works researchers have started to extract semantics from Wikipedia; the most profound work was done by Kozlova [8]. She extracts an ontology with a structure similar to WordNet. To evaluate the quality of the ontology, she compared the performance of ontology-driven classification of the Reuters collection using the extracted ontology versus WordNet, and achieved better results with the former.
    In her work, both the article link structure and the structure of the articles themselves were used for ontology extraction. For example, by analyzing the link [[Capital of France | Paris]] we can easily produce the synonyms "Paris" and "Capital of France". Also, if a document is linked under one of the special sections like "see also" or "similar topics", this indicates that the document has something to do with the topic.
    Unlike the previous works, we will extract a more semantically rich ontology. For now we will use categories, "see also" links and general links in the articles.
    The lists of categories form a directed graph over the articles of Wikipedia, which can be very useful for pruning irrelevant links when making recommendations. For example, the Wikipedia article about Kurchatov contains links to "physics", "Physico-Technical Institute" and other topics that are poor recommendation candidates. With the aid of categories we can prune these links and recommend more relevant topics, such as articles about Kurchatov's colleagues. This is the most basic use of our ontology; we will investigate more sophisticated methods in our future work.

2.3 Web logs processing

Now we turn to preprocessing web logs. In order to find the blogs most similar to the user's preference set, it is necessary to extract terms from each blog and correlate these terms with concepts from the ontology. This will enable us to make recommendations based on these concepts.
    When we correlate blogs with concepts, each blog becomes associated with a large number of concepts. In order to identify the essential concepts we use a modified tf-idf weighting scheme [9]. We avoid recomputing idf every time the blog collection is updated by computing idf using only Wikipedia.
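The weighting step above can be sketched as follows. This is a minimal illustration of plain tf-idf with idf computed once from Wikipedia, not the exact modified scheme of [9]; the whitespace tokenizer, treating each token as a concept, and all function names are simplifying assumptions of ours.

```python
import math
from collections import Counter

def wikipedia_idf(wikipedia_articles):
    """Compute idf from the Wikipedia corpus only, so it need not be
    recomputed whenever the blog collection is updated."""
    n = len(wikipedia_articles)
    doc_freq = Counter()
    for article in wikipedia_articles:
        # Count each concept once per article (document frequency).
        doc_freq.update(set(article.split()))
    return {term: math.log(n / df) for term, df in doc_freq.items()}

def essential_concepts(blog_text, idf, top_k=3):
    """Weight each concept occurring in a blog by tf-idf and keep the
    top-k as the blog's essential concepts."""
    tokens = blog_text.split()
    tf = Counter(tokens)
    scores = {t: (count / len(tokens)) * idf.get(t, 0.0)
              for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because idf depends only on the (comparatively stable) Wikipedia corpus, newly crawled blogs can be scored without touching the rest of the collection.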
2.4 Generation of recommendations

At runtime, the recommender system derives top-N recommendations from the ontology based on the user's preference set. Little research has been done on this topic. An ontology-based information retrieval model [9] exploits ontology-based knowledge bases to improve search over large document collections. This approach includes an ontology-based scheme for semi-automatic annotation and retrieval of documents. We plan to use and extend this technique for computing similarity between blogs and for ranking recommendations.

3 Related work

Tapestry [10] is one of the earliest implementations of collaborative-filtering-based recommender systems. This system relied on the explicit opinions of people from a close-knit community, such as an office workgroup. However, recommender systems for large communities cannot depend on everyone knowing everyone else. Later on, several rating-based automated recommender systems were developed. The GroupLens research system [11] provides a pseudonymous collaborative filtering solution for Usenet news and movies. Ringo and Video Recommender are email- and web-based systems that generate recommendations for music and movies respectively. A special issue of Communications of the ACM [12] presents a number of different recommender systems. Although these systems have been successful in the past, their widespread use has exposed some of their limitations, such as sparsity of the data set, problems associated with high dimensionality, and so on.
    A myriad of other recommender systems exist, particularly on e-commerce sites. Schafer [13] examines and categorizes a large set of these commercial recommender systems. In addition, numerous recommenders in a variety of domains have been developed for research purposes, including MovieLens (films), Ringo (music), and Jester (jokes).
    All of these systems are based on collaborative filtering and correspondingly suffer from the problems stated above. We avoid these problems by using high-quality user-generated content as the foundation for making recommendations.

4 Conclusion

In this article we propose a novel architecture for building recommender systems and formulate the major directions of future work. In the future we plan to modify and apply Web spam detection algorithms for cleaning Wikipedia links. We will then evaluate various approaches to making recommendations using the extracted ontology (we have given a basic example of such an approach in Sections 2.2 and 2.4).
    Finally, we will implement a complete blog recommender system based on the described techniques and evaluate it with Internet users.

5 References

[1] S.M. McNee, J. Riedl, and J.A. Konstan. "Being Accurate is Not Enough: How Accuracy Metrics have Hurt Recommender Systems". In Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI 2006), Montreal, Canada, April 2006.
[2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. SpamRank – Fully Automatic Link Spam Detection. Work in progress, Computer and Automation Research Institute, Hungarian Academy of Sciences, 2005.
[3] A.-L. Barabási, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281:69–77, 2000.
[4] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2000.
[5] R. Cohen and S. Havlin. Scale-free networks are ultrasmall. Phys. Rev. Lett., 90, 058701, 2003.
[6] Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, and Jason Y. Zien. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. IBM Almaden Research Center, 2003.
[7] TAP ontology page. http://ontap.stanford.edu/
[8] Natalia Kozlova. Automatic ontology extraction for document classification. Computer Science Department, Saarland University, 2005.
[9] David Vallet, Miriam Fernández, and Pablo Castells. An Ontology-Based Information Retrieval Model. Universidad Autonoma de Madrid.
[10] Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 1992.
[11] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of CSCW, 1994.
[12] Resnick, P., and Varian, H. R. Recommender Systems. Special issue of Communications of the ACM, 40(3), 1997.
[13] J. Schafer, J. Konstan, and J. Riedl. Electronic commerce recommender applications. Data Mining and Knowledge Discovery, Jan. 2001.