=Paper=
{{Paper
|id=Vol-1542/paper2
|storemode=property
|title=Combining Collaborative Filtering and Search Engine into Hybrid News Recommendations
|pdfUrl=https://ceur-ws.org/Vol-1542/paper2.pdf
|volume=Vol-1542
|authors=Toon De Pessemier,Sam Leroux,Kris Vanhecke,Luc Martens
|dblpUrl=https://dblp.org/rec/conf/recsys/PessemierLVM15
}}
==Combining Collaborative Filtering and Search Engine into Hybrid News Recommendations==
content was GroupLens [14]. GroupLens used collaborative filtering to generate recommendations for Usenet news and was evaluated by a public trial with users from over a dozen newsgroups. This research identified some important challenges involved in creating a news recommender system.

SCENE [15] is such a news service. It stands for a SCalable two-stage pErsonalized News rEcommendation system. The system considers characteristics such as news content, access patterns, named entities, popularity, and recency of news items when performing recommendation. The proposed news selection mechanism demonstrates the importance of a good balance between user interests, novelty, and diversity of the recommendations.

The News@hand system [5] is a news recommender which applies semantic-based technologies to describe and relate news contents and user preferences in order to produce enhanced recommendations. This news system ensures multimedia source applicability. The resultant recommendations can be adapted to the current context of interest, thereby emphasizing the importance of contextualization in the domain of news.

In the CLEF NEWSREEL track [3], news recommendation techniques could be evaluated in real time by providing news recommendations to actual users that visit commercial news portals. A web-based platform is used to distribute recommendations to the users and return users’ impressions of the recommendations to the researchers.

The News Recommender Systems Challenge [22] focused on providing live recommendations for readers of German news media articles. This challenge highlighted why news recommendations have not been analyzed as thoroughly as some of the other domains such as movies, books, or music. Reasons for this include the lack of data sets as well as the lack of open systems to deploy algorithms in. In the challenge, the deployed recommenders for generating news recommendations are: Recent Recommender (based only on the recency of the articles), Lucene Recommender (a text retrieval system built on top of Apache Lucene), Category-based Recommender (using the article’s category), User Filter (filters out the articles previously observed by the current user), and Combined Recommender (a stack or cascade of two or more of the above recommenders).

The usefulness of retrieval algorithms for content-based recommendations has been demonstrated with experiments using a large data set of news content [2]. Binary and graded evaluation were compared, and graded evaluation proved to be intrinsically better for news recommendations. This study emphasizes the potential of combining content-based approaches with collaborative filtering into a hybrid recommender system for news.

Although the various initiatives emphasize the importance of a personalized news offer, most of them focus on the recommendation algorithms. However, the way in which content is gathered, delivered, and presented to end-users is of crucial importance for a successful service. Users want an up-to-date, personalized news offer, providing a complete overview of all news events, which is clearly structured and classified by topic. In this study, the focus is not on improving state-of-the-art recommendation algorithms or search engines, since many studies covered this already [22, 3, 6, 2]. The focus of this paper is rather on investigating the real-time aspect of delivering personalized recommendations (up-to-date content offer), the aggregation of multiple content sources of a different nature, such as premium content, blogs, Twitter, etc. (complete overview), and the clustering of content items by topic (clearly structured).

The remainder of this paper is structured as follows. Section 3 compares the recommendation and content retrieval problem and indicates resemblances between the two approaches. Section 4 discusses the architecture of our system and zooms in on the data fetching, search engine, recommender, and clustering component of the proposed system. Section 5 provides details on the implementation, the user interaction with the system, and the user interface. Finally, Section 6 draws conclusions.

3. RECOMMENDATION AS A CONTENT RETRIEVAL PROBLEM

Content-based algorithms typically compare a representation of the user profile with (the metadata of) the content, and deliver the best matching items as recommendations [16]. These algorithms often use relatively simple retrieval models, such as keyword matching or the Vector Space Model (VSM) with basic Term Frequency - Inverse Document Frequency (TF-IDF) weighting [17]. As such, the matching process of content and profile in a content-based algorithm shows many resemblances with the content retrieval process of a search engine.

Before employing the VSM and TF-IDF weighting in a content-based algorithm, preprocessing of the content is often required. If the content consists of complete sentences, the text stream must be broken up into tokens: phrases, words, symbols or other meaningful elements. Tokens that belong together, e.g. United States of America or New York, deserve special attention, and can be handled by reasoning based on uppercase letters and n-gram models [4]. Before further processing of the content, the next operation is filtering out stop words, the most common words in a language that typically have a limited intrinsic value. Another important operation is stemming, the process for reducing inflected (or sometimes derived) words to their word stem, or root form. In our implementation, Snowball [20] is used, a powerful stemmer for the English language. Again, a resemblance with content retrieval processes can be noticed, since these preprocessing operations are also performed during the indexing of web pages in search engines.

Based on these similarities between the content recommendation and content retrieval problem, we opted to utilize a search engine as the core component of our recommender service. The user profile is used as search query and provides the input for the search engine. Consequently, the search results are the content items best matching the user profile and can therefore be considered as personalized recommendations for the user.

Utilizing a search engine to generate personalized recommendations for news content brings some additional advantages.

• Short response time. Search engines are strongly optimized to quickly identify and retrieve relevant content items. An inverted index [6] is used as a very efficient structuring of the content, which makes it possible to handle massive numbers of documents.

• Fast processing of new content. New content items can be processed quickly by making additions to the index structure, thereby making these new content items available for recommendation almost immediately. In contrast, traditional recommender systems often require intensive calculations of similarities before a new item can be recommended.

• Limited storage requirements. The index structure of search engines is a very efficient way to store and retrieve documents.

4. ARCHITECTURE

Figure 1 shows the architecture and content flow of the news recommender system. The different components will be discussed in more detail in this section.

Figure 1: The architecture and content flow of the news recommender system.

4.1 Data Fetching

The first phase of the recommendation process is to fetch the news content periodically from different sources. When new items are available, their content is fetched and processed. Many online news services provide their content through RSS-feeds. To parse these feeds, the Rome project [28] is used since this is a robust parser. Besides RSS-feeds, other sources, such as blogs, can also be incorporated into the system by using a specific content parser.

In order to keep track of the most recent news content, news sources are checked regularly for new content. Different news sources have a different publishing frequency, ranging from one news item per day to multiple news items per minute. Therefore, we used a simple mechanism to adapt the frequency of checking for new content to the publishing frequency of the content source. For each content source, a dynamic timer is used to determine when to check for new content. After a timeout, the content is fetched. If new content is available, the content item is added to the search engine and the timeout is reduced by half. If no new content is available, the timeout is doubled. This simple mechanism proved to be sufficient as a convergence method for the timeout parameter.

In order to process the stream of incoming news articles of different sources continuously, Apache Storm [1] was used. Storm enables the processing of large streams of data in real time. As opposed to batch processing, Storm handles the news articles as soon as these are available. To use Storm, a topology composed of ‘Spouts’ and ‘Bolts’ has to be built, which describes how messages flow into the system and how they have to be processed. A Spout is a source of data streams. A Bolt consumes any number of data streams, does some processing, and can emit new data streams. Storm can make duplicates of these components, and even distribute these duplicates over multiple machines, in order to process large amounts of data. As a result, Storm makes the system scalable and distributed.

In our implementation, the Spouts input data into the system as URLs of RSS-feeds, blogs, or social network accounts. Storm will distribute the work load over different Bolts of the first type, which fetch the data from the feeds. In case new articles are available in the feed, the URL of these articles is passed to the Bolts of the second type. These Bolts fetch the article content and remove non-topical information, such as advertisements, by identifying specific HTML tags in the source code of the web page. Subsequently, the Bolts pass the article content to Bolts of the third type. The task of Bolts of the third type is to analyze the content and obtain information such as the title, date, category, etc. Next, the article content is passed to the fourth type of Bolts, which will input the news articles into the search engine. After inputting the content into the search engine, statistical information about the article content is stored by the fifth and last type of Bolts. E.g., the frequency of occurrence of a term at a specific moment in time is used to determine if a news topic is trending and important (Section 4.3).

4.2 Search Engine

In the second phase, the content is processed by a search engine. We opted to use Apache Lucene [24], a Java library that is typically used for services handling large amounts of data and offering search functionalities. Since Lucene’s performance, simplicity, and ease-of-use have been investigated in related work [12], this research does not focus on the characteristics of Lucene, but rather on the combination of search engine and recommender system.

As alternative search engines, we considered Solr [26] and ElasticSearch [10]. Solr is a ready-to-use, open source search engine based on Lucene. In comparison with Lucene, Solr provides more specific features such as a REST web interface to index and search for documents. However, the disadvantage of Solr is that some of the specialized functionality is hidden and not directly usable. Besides, the overhead of the web interface of Solr introduced some delay in comparison with Lucene in our experiments. Similar to Solr, ElasticSearch hides some of Lucene’s functionality by using a simple web interface. Specific information about the content items, such as the term frequencies or statistics about the complete index, is not directly accessible using ElasticSearch. Therefore, Lucene was chosen to provide the functionality of the search engine. In case the processing load for the Lucene index becomes an issue, distribution over different machines is possible by solutions such as Katta [13], thereby making it scalable.

4.3 Recommender

In the third phase, personalized recommendations are generated. The user profile is used as a search query and sent to the search engine. The resulting search results are considered as personalized recommendations. As is common practice in the VSM [16], the user profile is modeled as a vector of terms (tags) together with a value specifying the user’s interest in the term. These terms are words (or N-grams) in the article that are identified as relevant for the content. The current implementation is based on the traditional TF-IDF, but alternative solutions can easily be integrated. When the user reads a news article, the profile vector is updated with the TF-IDF values of the terms of the article. However, this update process is only executed if the user has spent more time on the article than a predefined threshold. In our implementation, we have chosen 10 seconds as a minimum time period for users to read the title and get an impression of the article content. More advanced approaches are possible using the reading time and article length, but these are not always reliable in a mobile environment.

Since our system uses implicit feedback based on users’ selections (see Section 5), the profile update process is a simple summation of the item vectors of different articles. Articles from the past are considered as less representative for the user’s preferences than recent articles. Therefore, the value of a term decreases exponentially as the age (in hours) of the article increases, meaning that older items will contribute less to the profile. Although these terms with their corresponding interest values may form a rather long profile vector, and as a result a long search query, Lucene is designed to handle such search requests in a very short time. Therefore, recommendations are requested when needed and hence always up-to-date.

News events with a high impact (e.g., a natural disaster in a remote part of the world) have to be detected and considered as a recommendation, even if the topic does not completely match the user’s interests. These trending topics can be identified based on their frequency of occurrence. If the current frequency of occurrence is significantly higher than the frequency of occurrence in the past, the topic is considered as trending. Besides, trending topics are discovered by checking trends on Google’s search queries [11]. Every hour, Google publishes a short list with trending searches. A special Spout was implemented to fetch these trending topics hourly. Trending topics are used to create a query for the search engine, and the resulting news items are added to the user’s recommendation list. A final source of trending topics is Twitter. Research has shown that Twitter messages are a good reflection of topical news [18]. Therefore, another Spout was assigned specifically to query tweets regarding news topics using the Twitter API. Twitter accounts of specialized news services and newspapers were followed. The tweets originating from these accounts focus on recent news and are characterized by a high quality. Retweets and Favorites give an indication of the popularity and impact of a tweet. Subsequently, Tweets are processed in the same manner as other news items by Bolts.

As stated in the introduction, straightforward collaborative filtering is not usable for news recommendations because of the new item problem. Unfortunately, content-based recommendations are typically characterized by a low serendipity; recommendations are too obvious. To introduce serendipity, a hybrid approach was taken by adding a collaborative filtering aspect to the content-based recommender. A traditional nearest neighbor approach was used to calculate similarities between user-user pairs. Instead of recommending the items that the neighbors have consumed, our implementation will recommend profile terms that are prominent in neighboring profiles. These profile terms of the neighbors are used to extend the profile of the user, thereby making it more diverse. Subsequently, this extended profile is used to generate content-based recommendations using the search engine. By extending the profile of a user with terms that are significant in the profiles of the user’s neighbors, profiles are broadened and diversified with related terms. These extended profiles will produce more diverse recommendations covering a broad range of topics. Since the additional profile terms originate from neighbors’ profiles, the added terms will probably be in the area of interest of the user.

The collaborative filtering component is based on the implementation of Apache Mahout [25]. Mahout ensures the scalability of this component of the system. Moreover, the profile extension is not a time-critical component, and is therefore implemented as a batch process running periodically. Content-based recommendations are based on the current version of the user profile, and as soon as the profile extension is finished, the profile is updated. This ensures that real-time recommendations can be generated at all times.

Finally, the publishing date of the article is also taken into account in the recommendation process. In the current implementation, only news articles of the last two days are candidate recommendations. However, a more intelligent degradation over time, with a degradation rate depending on the category or content of the article, is left for future work.

4.4 Clustering

In the fourth phase, the recommended news items are clustered into topics. Since the news items in our system originate from different content sources, multiple items may cover the same news story. To provide users with a clear overview of the news without removing content items, items about the same topic are clustered together. To cluster the content, three clustering approaches were considered during the design.

1. A periodic clustering of the complete content library before generating recommendations. Traditional clustering algorithms, which assume that all items are known before the clustering starts, can be used to periodically cluster all news items [23]. This approach does not allow the recommendation process to begin before the complete clustering of the content library is finished. Since this disadvantage introduces too much delay when adding new content to the library, it was not a suitable approach for our system.

2. An incremental clustering of the content library before generating recommendations. In this approach, new content items are assigned to the best matching cluster, or a new cluster is made in case there is no match. Although this clustering approach is used in different existing systems [15, 7], we did not opt for this approach because it is not personalized. For a large content library, a large number of clusters can be identified. Since the clustering process is performed before the recommendation process, the clusters are identical for all users. However, personal interests may require a personalized clustering of the news content.

3. A clustering of the recommended content items. This is the approach that is used in our system, using a hierarchical clustering algorithm. Content items are not clustered until the recommendation process is finished. The advantage of this approach is that only a small set of content items (250 candidate recommendations in our system) has to be clustered. Another advantage of clustering the recommendation results is the personalized nature of this set. For each user, the clustering process will result in a different clustering. Even a different level of clustering (number of clusters) can be chosen for every user. Users who are very interested in sports may find different clusters for soccer, baseball, cycling, etc., whereas users who are moderately interested may receive only one sports cluster containing all sporting disciplines. On the downside, users may not be familiar with a personalized clustering. As user preferences change or as collaborative filtering is applied to extend profile vectors, clusters are not stable over time. This behavior may surprise users who first got used to the existing clusters and then cannot find their ‘old favorite’ clusters anymore.

5. USER INTERACTION

Mobile has become, especially amongst younger media consumers, the first gateway to most news events published online. In a recent survey [21], conducted in 10 countries with high Internet penetration, one-fifth of the users now claim that their mobile phone is the primary access point for news. The small screen and typical interaction methods of mobile devices (touch screen) induce extra challenges and possibilities for news services.

Because of this, we made our news service available as a web application that is usable on desktop but also on tablets and smartphones. Figure 2 shows a screenshot of the user interface of the (mobile) web application, based on HTML5 and JavaScript. On the left hand side, an overview of the recommended content items is shown. For each article, the number indicates how many articles covering this topic are clustered together. Selecting one of the items in the left column will show the article content on the right hand side using an HTML iframe. HTML iframes are used in order to provide all functionality of the source website, such as hyperlinks, while providing users the ability to browse their recommendations using the left column. Parsing the content of the source and reproducing it inside our own application is a technically feasible alternative, but violates the terms of use of many websites. Redirecting the users to the source website (using hyperlinks) would imply that users leave our web application and continue their news consumption on the source website, thereby making it impossible to track their behavior. The user interface is adapted to mobile devices by providing a clearly readable overview of the content, and interaction through tapping and swiping the touch screen. For smaller screens, such as smartphones, the column on the left hand side can be hidden to show the news articles in full screen. Further optimizations for mobile devices and touch screens are provided by using jQuery Mobile [27].

Figure 2: A screenshot of the user interface of the (mobile) web application.

Explicit feedback for news services is difficult to interpret and therefore less common. E.g., a 1-star on a 5 point rating scale can be interpreted as a lack of interest in the content, or as sympathizing with a story about some tragic event. Therefore, our system is using implicit feedback based on the user’s viewing behavior. If an article is selected and shown on the screen for at least 10 seconds, we assume that the user has some interest in the topic of the story.

Evaluating the system performance in terms of response time gave the following results. A mean response time of 800 ms was measured to generate 250 recommendations. This request includes retrieving the user profile and trending terms, executing the query on the search engine, and clustering the resulting items. These results were obtained on our test system, an Intel Xeon E5645 CPU at 2.40 GHz with 8 GB of RAM running CentOS 6.6.

6. CONCLUSIONS

In this paper, we proposed a hybrid, real-time recommender system for news, combining technologies such as Storm, Lucene, and Mahout to ensure scalability and quick response times. Storm enables the processing of large streams of news content. Lucene provides the functionality of a search engine and is used as a content-based recommender. The collaborative filter of Mahout is used to exchange profile terms among neighboring users. User profile vectors are extended with related terms interesting to read about. The resulting hybrid recommendations are clustered according to their topic and presented to the user through a web application that is optimized for mobile devices. This research discussed the possibility of combining collaborative filtering and a search engine to compose a hybrid news recommender system, thereby combining the advantages of both. Search engines ensure a real-time response behavior while collaborative filtering adds community knowledge to the system. As future work, we consider making a distinction between short-term interests and long-term interests of users. We also plan to focus more on entities mentioned in articles.

7. ACKNOWLEDGMENTS

We would like to thank Sam Leroux for the work he performed in the context of this research during his master’s thesis.

8. REFERENCES

[1] Apache Software Foundation. Apache storm, 2015. Available at http://storm.apache.org/.
[2] T. Bogers and A. van den Bosch. Comparing and evaluating information retrieval algorithms for news recommendation. In Proceedings of the 2007 ACM Conference on Recommender Systems, RecSys ’07, pages 141–144, New York, NY, USA, 2007. ACM.
[3] T. Brodt and F. Hopfgartner. Shedding light on a living lab: The clef newsreel open recommendation platform. In Proceedings of the 5th Information Interaction in Context Symposium, IIiX ’14, pages 223–226, New York, NY, USA, 2014. ACM.
[4] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Comput. Linguist., 18(4):467–479, Dec. 1992.
[5] I. Cantador, A. Bellogín, and P. Castells. News@hand: A semantic web approach to recommending news. In W. Nejdl, J. Kay, P. Pu, and E. Herder, editors, Adaptive Hypermedia and Adaptive Web-Based Systems, volume 5149 of Lecture Notes in Computer Science, pages 279–283. Springer Berlin Heidelberg, 2008.
[6] D. Cutting and J. Pedersen. Optimization for dynamic inverted index maintenance. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’90, pages 405–411, New York, NY, USA, 1990. ACM.
[7] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 271–280, New York, NY, USA, 2007. ACM.
[8] T. De Pessemier, S. Coppens, K. Geebelen, C. Vleugels, S. Bannier, E. Mannens, K. Vanhecke, and L. Martens. Collaborative recommendations with content-based filters for cultural activities via a scalable event distribution platform. Multimedia Tools and Applications, 58(1):167–213, 2012.
[9] T. De Pessemier, C. Courtois, K. Vanhecke, K. Van Damme, L. Martens, and L. De Marez. A user-centric evaluation of context-aware recommendations for a mobile news service. Multimedia Tools and Applications, pages 1–29, 2015.
[10] Elastic. Elasticsearch, 2015. Available at https://www.elastic.co/.
[11] Google. Google Hourly Trends, 2015. Available at http://www.google.com/trends/hottrends/atom/hourly.
[12] E. Hatcher and O. Gospodnetic. Lucene in action (in action series). 2004.
[13] Katta. Lucene & more in the cloud, 2015. Available at http://katta.sourceforge.net/.
[14] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. Grouplens: Applying collaborative filtering to usenet news. Commun. ACM, 40(3):77–87, Mar. 1997.
[15] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan. Scene: A scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 125–134, New York, NY, USA, 2011. ACM.
[16] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 73–105. Springer US, 2011.
[17] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008.
[18] O. Phelan, K. McCarthy, and B. Smyth. Using twitter to recommend real-time topical news. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages 385–388, New York, NY, USA, 2009. ACM.
[19] L. Pizzato, T. Rej, T. Chung, I. Koprinska, and J. Kay. Recon: A reciprocal recommender for online dating. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 207–214, New York, NY, USA, 2010. ACM.
[20] M. F. Porter. Snowball: A language for stemming algorithms, 2001. Available at http://snowball.tartarus.org/.
[21] Reuters Institute for the Study of Journalism. Digital News Report, 2015. Available at http://www.digitalnewsreport.org/.
[22] A. Said, A. Bellogín, and A. de Vries. News recommendation in the wild: Cwi’s recommendation algorithms in the NRS challenge. In Proceedings of the 2013 International News Recommender Systems Workshop and Challenge. NRS, volume 13, 2013.
[23] K. G. Saranya and G. S. Sadhasivam. A personalized online news recommendation system. International Journal of Computer Applications, 57(18):6–14, November 2012.
[24] The Apache Software Foundation. Apache Lucene, 2015. Available at https://lucene.apache.org/.
[25] The Apache Software Foundation. Apache Mahout, 2015. Available at http://mahout.apache.org/users/recommender/recommender-documentation.html.
[26] The Apache Software Foundation. Apache Solr, 2015. Available at http://lucene.apache.org/solr/.
[27] The jQuery Foundation. jQuery mobile, a touch-optimized web framework, 2015. Available at http://jquerymobile.com.
[28] M. Woodman. Rome, 2015. Available at https://rometools.jira.com/wiki/display/ROME/Home.
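
Illustrative sketch for Section 3 (content matching). The paper states that the user profile is sent to the search engine as a query and that matching relies on the VSM with TF-IDF weighting. The following is a minimal, stdlib-only approximation of that idea, not the Lucene scoring formula and not the authors' code. The stop-word list, the `idf = ln(N / df)` variant, the cosine ranking, and all function names are assumptions made for illustration; stemming (Snowball in the paper) is left out to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document, with idf = ln(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def recommend(profile, docs, k=2):
    """Use the profile vector as the query; return indices of the k best documents."""
    vectors = tfidf_vectors(docs)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(profile, vectors[i]), reverse=True)
    return ranked[:k]
```

For example, `recommend({"election": 1.0}, docs)` returns the indices of the articles whose TF-IDF vectors are closest to a profile interested only in the term "election". A real deployment would delegate both indexing and ranking to the search engine.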
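
Illustrative sketch for Section 4.1 (adaptive polling). The paper describes a per-source timer whose timeout is halved when a check finds new content and doubled when it does not. The sketch below shows that rule in isolation. The class name, the starting value, and the lower and upper bounds are hypothetical; the paper does not report any of them, and the clamping step is an addition made here so that the timer stays within a sensible range.

```python
from dataclasses import dataclass

# Hypothetical bounds; the paper does not state minimum or maximum timeouts.
MIN_TIMEOUT_S = 30.0
MAX_TIMEOUT_S = 24 * 3600.0


@dataclass
class FeedTimer:
    """Per-source polling timer that adapts to the source's publishing rate."""

    timeout_s: float = 600.0  # assumed starting value

    def update(self, found_new_content: bool) -> float:
        """Halve the timeout after a productive check, double it otherwise."""
        if found_new_content:
            self.timeout_s /= 2
        else:
            self.timeout_s *= 2
        # Clamp so a very busy or a silent feed cannot push the timer to extremes.
        self.timeout_s = min(MAX_TIMEOUT_S, max(MIN_TIMEOUT_S, self.timeout_s))
        return self.timeout_s
```

A fetch loop would call `update(...)` after every check of a source and schedule the next check `timeout_s` seconds later, so frequently updated sources end up being polled more often than slow ones.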
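
Illustrative sketch for Section 4.3 (profile update). According to the paper, the profile is a sum of the TF-IDF vectors of read articles, an article only counts when it was shown for at least 10 seconds, and a term's value decreases exponentially with the article's age in hours. The sketch below combines these three rules. The decay rate is not given in the paper, so the 48-hour half-life is purely an assumption, as are the function names and the event tuple layout.

```python
from collections import defaultdict

MIN_READ_SECONDS = 10.0  # threshold stated in the paper
HALF_LIFE_HOURS = 48.0   # assumption: the paper gives no decay rate


def decay_weight(age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Exponential decay: the weight halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def build_profile(events):
    """Sum decayed article vectors into a profile.

    events: iterable of (tfidf_vector, read_seconds, age_hours), where
    tfidf_vector maps each term of the article to its TF-IDF value.
    """
    profile = defaultdict(float)
    for vector, read_seconds, age_hours in events:
        if read_seconds < MIN_READ_SECONDS:
            continue  # too short to count as implicit positive feedback
        weight = decay_weight(age_hours)
        for term, value in vector.items():
            profile[term] += weight * value
    return dict(profile)
```

The resulting dictionary can be turned into a weighted query, one weighted term per profile entry, which matches the paper's remark that a long profile vector leads to a long search query.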
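
Illustrative sketch for Section 4.3 (profile extension). The paper extends a user's profile with terms that are prominent in the profiles of nearest neighbors, using Apache Mahout in a periodic batch job. The sketch below does not use Mahout and is only one plausible reading of that step: it assumes that user-user similarity is the cosine between profile vectors, that candidate terms are scored by similarity-weighted neighbor values, and that added terms are damped relative to the user's own terms. The names `extend_profile`, `n_neighbors`, `n_terms`, and `damping` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def extend_profile(user, profiles, n_neighbors=2, n_terms=3, damping=0.5):
    """Return a copy of profiles[user] extended with prominent neighbor terms."""
    own = profiles[user]
    # Rank all other users by similarity to this user and keep the closest ones.
    neighbors = sorted(
        ((cosine(own, p), name) for name, p in profiles.items() if name != user),
        reverse=True,
    )[:n_neighbors]
    # Score terms the user does not have yet by similarity-weighted neighbor value.
    candidates = defaultdict(float)
    for sim, name in neighbors:
        for term, weight in profiles[name].items():
            if term not in own:
                candidates[term] += sim * weight
    extended = dict(own)
    best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]
    for term, score in best:
        if score > 0:
            extended[term] = damping * score  # added terms count less than own terms
    return extended
```

Because dissimilar users contribute a score of zero, terms from unrelated profiles are never added, which reflects the paper's argument that added terms will probably lie within the user's area of interest.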