=Paper=
{{Paper
|id=Vol-256/paper-9
|storemode=property
|title=Recommender system based on user-generated content
|pdfUrl=https://ceur-ws.org/Vol-256/submission_16.pdf
|volume=Vol-256
|dblpUrl=https://dblp.org/rec/conf/syrcodis/Turdakov07
}}
==Recommender system based on user-generated content==
© Turdakov Denis
Institute for System Programming, RAS
turdakov@gmail.com
Ph.D. adviser: Kuznetsov S.D.
Abstract

Recommender systems apply statistical and knowledge-discovery techniques to the problem of making recommendations during live user interaction. This paper describes a novel approach to building recommender systems for the Web with the aid of user-generated content. Recently, certain communities of Internet users have engaged in creating high-quality, peer-reviewed content for the Web. In our approach we plan to extract the semantics of such user-generated content and to use these semantics to make more useful recommendations.

1 Introduction

1.1 Recommender Systems

Recommender systems attempt to predict items (web pages, movies, books) that a user may be interested in, given some information about the user's profile. Collaborative filtering is the most popular approach to building such systems: individual users are automatically joined into groups based on similarity in their interests or past behaviour, and recommendations are made based on the preferences of their group members. Another method of generating recommendations is based on predicting a user's interests from his or her past preferences: new content is compared against the user's past preferences, and similar items are recommended. Both kinds of systems suffer from a number of deficiencies; most notably, both fail to recommend novel interesting topics. For example, a recommender system for travellers based on collaborative filtering can assign a user to a class of people who visit European capitals; however, once he has visited all of the capitals, this system cannot recommend anything new to him. More generally, it has been shown in [1] that recommender systems of both types that strive for maximum classification accuracy do not produce useful recommendations.

In our work, we avoid this difficulty by generating recommendations of Web pages with the aid of semantics extracted from user-generated content. This allows us to make recommendations based on relationships between concepts that were created and peer-reviewed by a large community of users. Obviously, building a recommender system for Web pages that extracts semantics from all of the content on the Web would be very resource-demanding. Instead, we picked Wikipedia as our source of user-generated semantics: a comprehensive and up-to-date corpus of knowledge and of relationships between concepts.

1.2 Wikipedia

User-generated content refers to various kinds of content that are produced or primarily influenced by end-users. The average quality of Web content, however, is quite poor, as evidenced by the vast amounts of Web spam and unverified information submitted by non-authoritative users. Therefore, instead of using the whole Web, we chose Wikipedia as our base: a body of user-generated content that is all-encompassing and at the same time high-quality and peer-reviewed.

In every Wikipedia article, links guide users to associated articles, often with additional information, and the lists of categories attached to each article organize Wikipedia articles into a taxonomic structure. These links convey important semantic information that we can use to produce high-quality recommendations. Furthermore, any Internet user is welcome to add further information, cross-references, or citations, so long as they do so within Wikipedia's editing policies and to an appropriate standard. So, over time, semantically rich, high-quality, peer-reviewed content emerges.

The English-language Wikipedia currently (at the time of writing) contains more than 1,500,000 articles (6,000,000 when including redirects, discussion pages and portals).

Furthermore, Wikipedia can be considered a form of Web summarization because of its broad scope, conciseness and ability to quickly reflect new trends. It can therefore provide us with extremely useful information about relationships between articles and Web pages.

2 Problem formulation

The main goal of this research is to develop a recommender system for the Web based on user-generated content and novel semantic techniques.

[Figure 1: Recommender system architecture. Content processing: Wikipedia → link extraction and cleaning → ontology creation; blogs → concept extraction → concept weighting. The runtime recommender system consumes the resulting ontology and the blogs with their important concepts.]
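As a rough illustration of the data flow in Figure 1, the pipeline can be sketched as follows. This is a toy sketch of our own: every function, data structure and value is a deliberately simplified placeholder (the "ontology" is just a cleaned internal-link graph, and concept matching is plain substring search), not the actual system.

```python
# Toy sketch of the Figure 1 pipeline; all names are illustrative placeholders.

def build_ontology(articles):
    # Content processing, Wikipedia side: keep only internal links whose
    # target is itself a known article (a crude stand-in for link cleaning).
    titles = set(articles)
    return {t: [l for l in links if l in titles] for t, links in articles.items()}

def extract_concepts(blog_text, ontology):
    # Content processing, blog side: associate a blog with every ontology
    # concept mentioned in its text.
    return {c for c in ontology if c.lower() in blog_text.lower()}

def recommend(preference_concepts, blogs, ontology, n=2):
    # Runtime system: rank blogs by concept overlap with the preference set.
    scored = [(name, len(extract_concepts(text, ontology) & preference_concepts))
              for name, text in blogs.items()]
    return [name for name, s in sorted(scored, key=lambda x: -x[1])[:n] if s]

ontology = build_ontology({"Moscow": ["Capital of Russia", "Fahrenheit"],
                           "Capital of Russia": ["Moscow"]})
blogs = {"travel": "A week in Moscow, the Capital of Russia",
         "cooking": "My favourite borscht recipe"}
print(recommend({"Moscow", "Capital of Russia"}, blogs, ontology))  # → ['travel']
```

The real system replaces each placeholder with the corresponding stage described in Sections 2.1 through 2.4.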
Instead of recommending web pages or sites, our system will recommend blogs. This choice stems from a number of considerations. First, blogs are the most dynamic part of the Internet and are constantly being renewed. Second, blogs have a simple structure compared to web sites, and it is easier to evaluate a recommender system on blogs than on complex web sites. Finally, crawling a representative subset of the Web is a daunting task.

A blog recommender system would have substantial practical value, since searching this type of content is difficult for the user. For example, discussion of new products starts long before their official announcement; but since the user does not yet know about a product, he cannot perform a meaningful search for it.

The overall system architecture is presented in figure 1. Within the content-processing framework there are two key processes: Wikipedia analysis (top of the figure) and blog processing (bottom of the figure). We analyze Wikipedia and create an ontology based on its structure. Then we extract and rank concepts from the blogs, making use of the ontology. In this stage we also associate blogs with the ontology through concepts; for example, our system associates a blog containing the keywords "Moscow" and "Capital of Russia" with the Wikipedia article about Moscow. We then use these associations, with the aid of the ontology, to find blogs similar to the user's preference set (the set of blogs that characterizes the user's interests).

In the next two sections we focus on the two main stages of the system: the first is Wikipedia link cleaning and ontology extraction; the second is establishing semantic relationships between Wikipedia concepts and web logs. The remaining stages are described in the sections that follow.

2.1 Cleaning Wikipedia links

Wikipedia has its own markup, links to internal and external articles, redirects, and lists of categories. All of this information would be useful for our research; so far, we use only article titles and internal links. Though this is only a small part of the information that could be extracted from Wikipedia, it lets us obtain results very quickly and assess their quality.

When you analyze the link topology of Wikipedia carefully, you will notice that in many cases an article contains links to other articles that are completely unrelated. For example, the article about Moscow links to the article about the Fahrenheit temperature scale. Clearly, we should distinguish between these kinds of links and high-quality links such as the link from "Moscow" to "Capital of Russia". So we need a mechanism to clean or rank links on the basis of their quality.

To solve the link-cleaning task, it is necessary to investigate how such low-quality links appear. Typically, Wikipedia editors carefully insert relevant links from the key concepts of their article to other articles; occasionally, a rogue user will insert a bunch of irrelevant links into an otherwise good article. We can see a similar pattern in Web spam, where spammers create large artificial chunks of the Web to boost the page rank of some specific site. Therefore we would like to modify and use emerging Web-spam-combating algorithms [2] to clean Wikipedia links. For these methods to be applicable, we need to make sure that Wikipedia has the same properties as the Web.

Widely known models of the evolution of the Web [3, 4] describe global properties such as the degree distribution or the appearance of communities. These models indicate that the overall hyperlink structure arises by copying links to pages depending on their existing popularity. For example, in the most powerful model [4], pages on similar topics copy each other's links; this results in a "rich get richer" dynamic, and we see a power-law degree distribution whose exponent varies approximately from 2 to 3.

So the Web graph belongs to the class of scale-free networks, whose most distinguishing characteristic is a degree distribution that follows a power law. The second property of this class of networks is self-similarity: a large enough supporter set should behave similarly to the entire Web. Thus we can expect the properties of the Wikipedia link graph and its subgraphs to match those of the Web graph.
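Checking the power-law property for a degree sample reduces to fitting the exponent. A minimal, generic way to do this (a standard maximum-likelihood estimator, not the SpamRank algorithm of [2]) is sketched below; the data here is synthetic.

```python
import math
import random

def powerlaw_exponent(degrees, k_min=1.0):
    """Continuous maximum-likelihood estimate of the power-law exponent:
    alpha = 1 + n / sum(ln(k / k_min)), over all degrees k >= k_min."""
    ks = [k for k in degrees if k >= k_min]
    return 1.0 + len(ks) / sum(math.log(k / k_min) for k in ks)

# Sanity check: sample 10,000 degrees from a power law with exponent 2.5
# by inverse-transform sampling, then recover the exponent.
rng = random.Random(0)
degrees = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(10_000)]
alpha = powerlaw_exponent(degrees)
print(round(alpha, 2))  # close to 2.5
```

A neighborhood whose fitted exponent falls far outside the 2 to 3 range expected for Web-like graphs would then be a candidate for closer inspection.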
The basic idea is to analyze the rank distribution of a page within its neighborhood: if the link distribution in some article's neighborhood does not follow a power law, there is a high probability that the page's rank has been artificially inflated. The emerging spam-detection algorithms therefore require the following properties:
• Power-law link distribution
• Self-similarity
We have analyzed Wikipedia and found that its link structure follows the power-law distribution (figure 2), and it follows that self-similarity holds for Wikipedia as well. Hence we can use modified Web spam detection algorithms for this task.

2.2 Ontology extraction

A naïve way to produce recommendations is to recommend blogs associated with the nearest neighbors in the Wikipedia link graph. However, there are serious problems with this method. It has been proved [5] that an uncorrelated power-law graph with an exponent approximately between 2 and 3 has an ultrasmall diameter d ~ ln ln N (for Wikipedia, d = 2.75). For our work this means that we cannot use the link structure alone to make recommendations, since we would end up recommending the whole collection of blogs. So we need to extract additional knowledge from Wikipedia that will help us select a few relevant links for recommendations. The second stage of Wikipedia processing therefore deals with extracting semantic information from links and articles. Next, we give a short overview of ontologies used in information retrieval and describe our ontology model.

Semantic extraction and ontology development are well-studied topics. WordNet is the most successful hand-crafted semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short general definitions, and records the various semantic relations between these synonym sets. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. Every synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are placed in different synsets. The meaning of each synset is further clarified with a short defining gloss (a definition and/or example sentences).

Simple methods have been used at the IBM research center for automatic semantic annotation of Web pages. Seeker, a platform for large-scale text analytics, is described in [6]; SemTag is an application written on this platform that performs automated semantic tagging of large corpora. The authors use the small and simple TAP ontology [7] to process large amounts of web pages; this is the largest-scale semantic tagging effort to date. We work with much smaller data and can therefore use a more sophisticated method and extract more semantic data from each document.

In recent works, researchers have started to extract semantics from Wikipedia; the most thorough work was done by Kozlova [8], who extracts an ontology with a structure similar to WordNet. To evaluate the quality of the ontology, she compared ontology-driven classification of the Reuters collection using the extracted ontology against classification using WordNet, and achieved better results with the extracted ontology. In her work, both the link structure and the structure of the articles themselves were used for ontology extraction. For example, by analyzing the link [[Capital of France | Paris]] we can easily produce the synonyms "Paris" and "Capital of France". Also, if a document is linked from one of the special sections such as "see also" or "similar topics", this indicates that the document has something to do with the topic.

Unlike the previous works, we will extract a more semantically rich ontology. For now we use categories, "see also" links and general links in the articles. The lists of categories form a directed graph over the articles of Wikipedia, which can be very useful for pruning irrelevant links when making recommendations. For example, the Wikipedia article about Kurchatov contains links to "physics", "Physico-Technical Institute" and other topics that are poor recommendation candidates. With the aid of categories we can prune these links and recommend more relevant topics, such as articles about Kurchatov's colleagues. This is the most basic use of our ontology; we will investigate more sophisticated methods in future work.

2.3 Web logs processing

We now turn to the preprocessing of web logs. To find the blogs most similar to a user's preference set, it is necessary to extract terms from each blog and correlate these terms with concepts from the ontology; this will enable us to make recommendations based on those concepts. When we correlate blogs with concepts, each blog becomes associated with a large number of concepts. To identify the essential ones, we use a modified tf-idf weighting scheme [9]. We avoid recomputing idf every time the blog collection is updated by computing idf over Wikipedia only.
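The idf-from-Wikipedia idea can be sketched as follows. This is a minimal illustration with made-up document-frequency counts, not the actual modified scheme of [9]; all names and numbers are our own placeholders.

```python
import math

# Hypothetical document frequencies of concepts over Wikipedia articles;
# in the real system these would be counted once from the Wikipedia dump.
WIKI_DF = {"Moscow": 120, "Capital of Russia": 15, "Weather": 900}
WIKI_N = 1_500_000  # total article count, as quoted in Section 1.2

def concept_weights(blog_concepts, wiki_df=WIKI_DF, wiki_n=WIKI_N):
    """tf-idf over concepts: tf comes from the blog's own concept counts,
    while idf is precomputed from Wikipedia, so it does not have to be
    recomputed when the blog collection is updated."""
    total = sum(blog_concepts.values())
    return {c: (n / total) * math.log(wiki_n / (1 + wiki_df.get(c, 0)))
            for c, n in blog_concepts.items()}

weights = concept_weights({"Moscow": 5, "Weather": 1})
# "Moscow" is both more frequent in the blog and rarer in Wikipedia,
# so it receives the larger weight.
```

Because the idf table depends only on Wikipedia, adding or removing blogs changes only the tf factors, which is exactly the property the text above relies on.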
2.4 Generation of recommendations

At runtime the recommender system derives top-N recommendations from the ontology based on the user's preference set. Little research has been done on this topic. An ontology-based information retrieval model [9] exploits ontology-based knowledge bases to improve search over large document collections; this approach includes an ontology-based scheme for the semi-automatic annotation and retrieval of documents. We plan to use and extend this technique for computing similarity between blogs and for ranking recommendations.

3 Related work

Tapestry [10] is one of the earliest implementations of collaborative-filtering-based recommender systems. This system relied on the explicit opinions of people from a close-knit community, such as an office workgroup; however, a recommender system for large communities cannot depend on everyone knowing everyone else. Later on, several rating-based automated recommender systems were developed. The GroupLens research system [11] provides a pseudonymous collaborative filtering solution for Usenet news and movies. Ringo and Video Recommender are email- and web-based systems that generate recommendations on music and movies respectively. A special issue of Communications of the ACM [12] presents a number of different recommender systems. Although these systems have been successful in the past, their widespread use has exposed some of their limitations, such as sparsity of the data set, problems associated with high dimensionality, and so on.

A myriad of other recommender systems exist, particularly on e-commerce sites. Schafer [13] examines and categorizes a large set of these commercial recommender systems. In addition, numerous recommenders in a variety of domains have been developed for research purposes, including MovieLens (films), Ringo (music), and Jester (jokes).

All of these systems are based on collaborative filtering and consequently share the problems stated above. We avoid these problems by using high-quality user-generated content as the foundation for making recommendations.

4 Conclusion

In this article we have proposed a novel architecture for building recommender systems and formulated the major directions of future work. In the future we plan to modify and apply Web spam detection algorithms for cleaning Wikipedia links. We will then evaluate various approaches to making recommendations using the extracted ontology (we have given a basic example of such an approach in Sections 2.2 and 2.4). Finally, we will implement a complete blog recommender system based on the described techniques and evaluate it with Internet users.

6 References

[1] S.M. McNee, J. Riedl, and J.A. Konstan. Being Accurate is Not Enough: How Accuracy Metrics have hurt Recommender Systems. In Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI 2006), Montreal, Canada, April 2006.
[2] A.A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank – Fully Automatic Link Spam Detection. Work in progress, Computer and Automation Research Institute, Hungarian Academy of Sciences, 2005.
[3] A.-L. Barabási, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281:69–77, 2000.
[4] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2000.
[5] R. Cohen and S. Havlin. Scale-free networks are ultrasmall. Physical Review Letters, 90, 058701, 2003.
[6] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J.A. Tomlin, and J.Y. Zien. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. IBM Almaden Research Center, 2003.
[7] TAP ontology page. http://ontap.stanford.edu/
[8] N. Kozlova. Automatic ontology extraction for document classification. Computer Science Department, Saarland University, 2005.
[9] D. Vallet, M. Fernández, and P. Castells. An Ontology-Based Information Retrieval Model. Universidad Autónoma de Madrid.
[10] D. Goldberg, D. Nichols, B.M. Oki, and D. Terry. Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 1992.
[11] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of CSCW, 1994.
[12] P. Resnick and H.R. Varian. Recommender Systems. Special issue of Communications of the ACM, 40(3), 1997.
[13] J. Schafer, J. Konstan, and J. Riedl. Electronic commerce recommender applications. Data Mining and Knowledge Discovery, January 2001.