content was GroupLens [14]. GroupLens used collaborative filtering to generate recommendations for Usenet news and was evaluated by a public trial with users from over a dozen newsgroups. This research identified some important challenges involved in creating a news recommender system.

SCENE [15] is such a news service. It stands for a SCalable two-stage pErsonalized News rEcommendation system. The system considers characteristics such as news content, access patterns, named entities, popularity, and recency of news items when performing recommendation. The proposed news selection mechanism demonstrates the importance of a good balance between user interests, the novelty, and diversity of the recommendations.

The News@hand system [5] is a news recommender which applies semantic-based technologies to describe and relate news contents and user preferences in order to produce enhanced recommendations. This news system ensures multimedia source applicability. The resultant recommendations can be adapted to the current context of interest, thereby emphasizing the importance of contextualization in the domain of news.

In the CLEF NEWSREEL track [3], news recommendation techniques could be evaluated in real-time by providing news recommendations to actual users that visit commercial news portals. A web-based platform is used to distribute recommendations to the users and return users’ impressions of the recommendations to the researchers.

The News Recommender Systems Challenge [22] focused on providing live recommendations for readers of German news media articles. This challenge highlighted why news recommendations have not been analyzed as thoroughly as some of the other domains such as movies, books, or music. Reasons for this include the lack of data sets as well as the lack of open systems to deploy algorithms in. In the challenge, the deployed recommenders for generating news recommendations are: Recent Recommender (based only on the recency of the articles), Lucene Recommender (a text retrieval system built on top of Apache Lucene), Category-based Recommender (using the article’s category), User Filter (filters out the articles previously observed by the current user), and Combined Recommender (a stack or cascade of two or more of the above recommenders).

The usefulness of retrieval algorithms for content-based recommendations has been demonstrated with experiments using a large data set of news content [2]. Binary and graded evaluation were compared, and graded evaluation proved to be intrinsically better for news recommendations. This study emphasizes the potential of combining content-based approaches with collaborative filtering into a hybrid recommender system for news.

Although the various initiatives emphasize the importance of a personalized news offer, most of them focus on the recommendation algorithms. However, the way in which content is gathered, delivered, and presented to end-users is of crucial importance for a successful service. Users want an up-to-date, personalized news offer, providing a complete overview of all news events, which is clearly structured and classified by topic. In this study, the focus is not on improving state-of-the-art recommendation algorithms or search engines, since many studies covered this already [22, 3, 6, 2]. The focus of this paper is rather on investigating the real-time aspect of delivering personalized recommendations (up-to-date content offer), the aggregation of multiple content sources of a different nature, such as premium content, blogs, Twitter, etc. (complete overview), and the clustering of content items by topic (clearly structured).

The remainder of this paper is structured as follows. Section 3 compares the recommendation and content retrieval problem and indicates resemblances between the two approaches. Section 4 discusses the architecture of our system and zooms in on the data fetching, search engine, recommender, and clustering component of the proposed system. Section 5 provides details on the implementation, the user interaction with the system, and the user interface. Finally, Section 6 draws conclusions.

3. RECOMMENDATION AS A CONTENT RETRIEVAL PROBLEM

Content-based algorithms typically compare a representation of the user profile with (the metadata of) the content, and deliver the best matching items as recommendations [16]. These algorithms often use relatively simple retrieval models, such as keyword matching or the Vector Space Model (VSM) with basic Term Frequency - Inverse Document Frequency (TF-IDF) weighting [17]. As such, the matching process of content and profile in a content-based algorithm shows many resemblances with the content retrieval process of a search engine.

Before employing the VSM and TF-IDF weighting in a content-based algorithm, preprocessing of the content is often required. If the content consists of complete sentences, the text stream must be broken up into tokens: phrases, words, symbols or other meaningful elements. Tokens that belong together, e.g. United States of America or New York, deserve special attention, and can be handled by reasoning based on uppercase letters and n-gram models [4]. Before further processing of the content, the next operation is filtering out stop words, the most common words in a language that typically have a limited intrinsic value. Another important operation is stemming, the process for reducing inflected (or sometimes derived) words to their word stem, or root form. In our implementation, Snowball [20] is used, a powerful stemmer for the English language. Again, a resemblance with content retrieval processes can be noticed, since these preprocessing operations are also performed during the indexing of web pages in search engines.

Based on these similarities between the content recommendation and content retrieval problem, we opted to utilize a search engine as the core component of our recommender service. The user profile is used as search query and provides the input for the search engine. Consequently, the search results are the content items best matching the user profile and can therefore be considered as personalized recommendations for the user.

Utilizing a search engine to generate personalized recommendations for news content brings some additional advantages.

• Short response time. Search engines are strongly optimized to quickly identify and retrieve relevant content items. An inverted index [6] is used as a very efficient structuring of the content, which makes it possible to handle massive amounts of documents.

• Fast processing of new content. New content items can be processed quickly by making additions to the index structure, thereby making these new content items available for recommendation almost immediately. In contrast, traditional recommender systems often require intensive calculations of similarities before a new item can be recommended.

• Limited storage requirements. The index structure of search engines is a very efficient way to store and retrieve documents.

4. ARCHITECTURE

Figure 1 shows the architecture and content flow of the news recommender system. The different components will be discussed in more detail in this section.

Figure 1: The architecture and content flow of the news recommender system.

4.1 Data Fetching

The first phase of the recommendation process is to fetch the news content periodically from different sources. When new items are available, their content is fetched and processed. Many online news services provide their content through RSS-feeds. To parse these feeds, the Rome project [28] is used since this is a robust parser. Besides RSS-feeds, other sources, such as blogs, can also be incorporated into the system by using a specific content parser.

In order to keep track of the most recent news content, news sources are checked regularly for new content. Different news sources have a different publishing frequency, ranging from one news item per day, to multiple news items per minute. Therefore, we used a simple mechanism to adapt the frequency of checking for new content to the publishing frequency of the content source. For each content source, a dynamic timer is used to determine when to check for new content. After a timeout, the content is fetched. If new content is available, the content item is added to the search engine and the timeout is reduced by half. If no new content is available, the timeout is doubled. This simple mechanism proved to be sufficient as a convergence method for the timeout parameter.

In order to process the stream of incoming news articles of different sources continuously, Apache Storm [1] was used. Storm enables the processing of large streams of data in real time. As opposed to batch processing, Storm handles the news articles as soon as these are available. To use Storm, a topology composed of ‘Spouts’ and ‘Bolts’ has to be built, which describes how messages flow into the system and how they have to be processed. A Spout is a source of data streams. A Bolt consumes any number of data streams, does some processing, and can emit new data streams. Storm can make duplicates of these components, and even distribute these duplicates over multiple machines, in order to process large amounts of data. As a result, Storm makes the system scalable and distributed.

In our implementation, the Spouts input data into the system as URLs of RSS-feeds, blogs, or social network accounts. Storm will distribute the work load over different Bolts of the first type, which fetch the data from the feeds. In case new articles are available in the feed, the URL of these articles is passed to the Bolts of the second type. These Bolts fetch the article content and remove non-topical information, such as advertisements, by identifying specific HTML tags in the source code of the web page. Subsequently, the Bolts pass the article content to Bolts of the third type. The task of Bolts of the third type is to analyze the content and obtain information such as the title, date, category, etc. Next, the article content is passed to the fourth type of Bolts, which will input the news articles into the search engine. After inputting the content into the search engine, statistical information about the article content is stored by the fifth and last type of Bolts. For example, the frequency of occurrence of a term at a specific moment in time is used to determine if a news topic is trending and important (Section 4.3).

4.2 Search Engine

In the second phase, the content is processed by a search engine. We opted to use Apache Lucene [24], a Java library that is typically used for services handling large amounts of data and offering search functionalities. Since Lucene’s performance, simplicity, and ease-of-use have been investigated in related work [12], this research does not focus on the characteristics of Lucene, but rather on the combination of search engine and recommender system.

As alternative search engines, we considered Solr [26] and ElasticSearch [10]. Solr is a ready-to-use, open source search engine based on Lucene. In comparison with Lucene, Solr provides more specific features such as a REST web interface to index and search for documents. However, the disadvantage of Solr is that some of the specialized functionality is hidden and not directly usable. Besides, the overhead of the web interface of Solr introduced some delay in comparison with Lucene in our experiments. Similar to Solr, ElasticSearch hides some of Lucene’s functionality by using a simple web interface. Specific information about the content items, such as the term frequencies or statistics about the complete index, is not directly accessible using ElasticSearch. Therefore, Lucene was chosen to provide the functionality of the search engine. In case the processing load for the Lucene index becomes an issue, distribution over different machines is possible by solutions such as Katta [13], thereby making it scalable.

4.3 Recommender

In the third phase, personalized recommendations are generated. The user profile is used as a search query and sent to the search engine. The resulting search results are considered as personalized recommendations. As is common practice in the VSM [16], the user profile is modeled as a vector of terms (tags) together with a value specifying the user’s interest in the term. These terms are words (or N-grams) in the article that are identified as relevant for the content. The current implementation is based on the traditional TF-IDF, but alternative solutions can easily be integrated. When the user reads a news article, the profile vector is updated with the TF-IDF values of the terms of the article. However, this update process is only executed if the user has spent more time on the article than a predefined threshold. In our implementation, we have chosen 10 seconds as a minimum time period for users to read the title and get an impression of the article content. More advanced approaches are possible using the reading time and article length, but these are not always reliable in a mobile environment.

Since our system uses implicit feedback based on users’ selections (see Section 5), the profile update process is a simple summation of the item vectors of different articles. Articles from the past are considered as less representative for the user’s preferences than recent articles. Therefore, the value of a term decreases exponentially as the age (in hours) of the article increases, meaning that older items will contribute less to the profile. Although these terms with their corresponding interest values may form a rather long profile vector, and as a result a long search query, Lucene is designed to handle such search requests in a very short time. Therefore, recommendations are requested when needed and hence always up-to-date.

News events with a high impact (e.g., a natural disaster in a remote part of the world) have to be detected and considered as a recommendation, even if the topic does not completely match the user’s interests. These trending topics can be identified based on their frequency of occurrence. If the current frequency of occurrence is significantly higher than the frequency of occurrence in the past, the topic is considered as trending. Besides, trending topics are discovered by checking trends on Google’s search queries [11]. Every hour, Google publishes a short list with trending searches. A special Spout was implemented to fetch these trending topics hourly. Trending topics are used to create a query for the search engine, and the resulting news items are added to the user’s recommendation list. A final source of trending topics is Twitter. Research has shown that Twitter messages are a good reflection of topical news [18]. Therefore, another Spout was assigned specifically to query tweets regarding news topics using the Twitter API. Twitter accounts of specialized news services and newspapers were followed. The tweets originating from these accounts focus on recent news and are characterized by a high quality. Retweets and Favorites give an indication of the popularity and impact of a tweet. Subsequently, Tweets are processed in the same manner as other news items by Bolts.

As stated in the introduction, straightforward collaborative filtering is not usable for news recommendations because of the new item problem. Unfortunately, content-based recommendations are typically characterized by a low serendipity; recommendations are too obvious. To introduce serendipity, a hybrid approach was taken by adding a collaborative filtering aspect to the content-based recommender. A traditional nearest neighbor approach was used to calculate similarities between user-user pairs. Instead of recommending the items that the neighbors have consumed, our implementation will recommend profile terms that are prominent in neighboring profiles. These profile terms of the neighbors are used to extend the profile of the user, thereby making it more diverse. Subsequently, this extended profile is used to generate content-based recommendations using the search engine. By extending the profile of a user with terms that are significant in the profiles of the user’s neighbors, profiles are broadened and diversified with related terms. These extended profiles will produce more diverse recommendations covering a broad range of topics. Since the additional profile terms originate from neighbors’ profiles, the added terms will probably be in the area of interest of the user.

The collaborative filtering component is based on the implementation of Apache Mahout [25]. Mahout ensures the scalability of this component of the system. Moreover, the profile extension is not a time-critical component, and is therefore implemented as a batch process running periodically. Content-based recommendations are based on the current version of the user profile, and as soon as the profile extension is finished, the profile is updated. This ensures that real-time recommendations can be generated at all times.

Finally, the publishing date of the article is also taken into account in the recommendation process. In the current implementation, only news articles of the last two days are candidate recommendations. However, a more intelligent degradation over time, with a degradation rate depending on the category or content of the article, is left for future work.

4.4 Clustering

In the fourth phase, the recommended news items are clustered into topics. Since the news items in our system originate from different content sources, multiple items may cover the same news story. To provide users a clear overview of the news without removing content items, items about the same topic are clustered together. To cluster the content, three clustering approaches were considered during the design.

1. A periodic clustering of the complete content library before generating recommendations. Traditional clustering algorithms, which assume that all items are known before the clustering starts, can be used to periodically cluster all news items [23]. This approach does not allow the recommendation process to begin before the complete clustering of the content library is finished. Since this disadvantage introduces too much delay when adding new content to the library, it was not a suitable approach for our system.

2. An incremental clustering of the content library before generating recommendations. In this approach, new content items are assigned to the best matching cluster, or a new cluster is made in case there is no match. Although this clustering approach is used in different existing systems [15, 7], we did not opt for this approach because it is not personalized. For a large content library, a large number of clusters can be identified. Since the clustering process is performed before the recommendation process, the clusters are identical for all users. However, personal interests may require a personalized clustering of the news content.

3. A clustering of the recommended content items. This is the approach that is used in our system, using a hierarchical clustering algorithm. Content items are not clustered until the recommendation process is finished. The advantage of this approach is that only a small set of content items (250 candidate recommendations in our system) has to be clustered. Another advantage of clustering the recommendation results is the personalized nature of this set. For each user, the clustering process will result in a different clustering. Even a different level of clustering (number of clusters) can be chosen for every user. Users who are very interested in sports may find different clusters for soccer, baseball, cycling, etc., whereas users who are moderately interested may receive only one sports cluster containing all sporting disciplines. On the downside, users may not be familiar with a personalized clustering. As user preferences change or as collaborative filtering is applied to extend profile vectors, clusters are not stable over time. This behavior may surprise users who first got used to the existing clusters and then cannot find their ‘old favorite’ clusters anymore.

5. USER INTERACTION

Mobile has become, especially amongst younger media consumers, the first gateway to most news events published online. In a recent survey [21], conducted in 10 countries with high Internet penetration, one-fifth of the users now claim that their mobile phone is the primary access point for news. The small screen and typical interaction methods of mobile devices (touch screen) induce extra challenges and possibilities for news services.

Because of this, we made our news service available as a web application that is usable on desktop but also on tablets and smartphones. Figure 2 shows a screenshot of the user interface of the (mobile) web application, based on HTML5 and JavaScript. On the left hand side, an overview of the recommended content items is shown. For each article, the number indicates how many articles covering this topic are clustered together. Selecting one of the items in the left column will show the article content on the right hand side using an HTML iframe. HTML iframes are used in order to provide all functionality of the source website, such as hyperlinks, while providing users the ability to browse their recommendations using the left column. Parsing the content of the source and reproducing it inside our own application is a technically feasible alternative, but violates the terms of use of many websites. Redirecting the users to the source website (using hyperlinks) would imply that users leave our web application and continue their news consumption on the source website, thereby making it impossible to track their behavior. The user interface is adapted to mobile devices by providing a clearly readable overview of the content, and interaction through tapping and swiping the touch screen. For smaller screens, such as smartphones, the column on the left hand side can be hidden to show the news articles in full screen. Further optimizations for mobile devices and touch screens are provided by using jQuery Mobile [27].

Figure 2: A screenshot of the user interface of the (mobile) web application.

Explicit feedback for news services is difficult to interpret and therefore less common. For example, a 1-star rating on a 5-point scale can be interpreted as a disinterest for the content, or as sympathizing with a story about some tragic event. Therefore, our system is using implicit feedback based on the user’s viewing behavior. If an article is selected and shown on the screen for at least 10 seconds, we assume that the user has some interest in the topic of the story.

Evaluating the system performance in terms of response time gave the following results. A mean response time of 800 ms was measured to generate 250 recommendations. This request includes retrieving the user profile and trending terms, executing the query on the search engine, and clustering the resulting items. These results were obtained on our test system, an Intel Xeon E5645 CPU at 2.40 GHz with 8GB of RAM running CentOS 6.6.

6. CONCLUSIONS

In this paper, we proposed a hybrid, real-time recommender system for news, combining technologies such as Storm, Lucene, and Mahout to ensure scalability and quick response times. Storm enables the processing of large streams of news content. Lucene provides the functionality of a search engine and is used as a content-based recommender. The collaborative filter of Mahout is used to exchange profile terms among neighboring users. User profile vectors are extended with related terms interesting to read about. The resulting hybrid recommendations are clustered according to their topic and presented to the user through a web application that is optimized for mobile devices. This research discussed the possibility of combining collaborative filtering and a search engine to compose a hybrid news recommender system, thereby combining the advantages of both. Search engines ensure a real-time response behavior while collaborative filtering adds community knowledge to the system. As future work, we consider making a distinction between short-term interests and long-term interests of users. We also plan to focus more on entities mentioned in articles.

7. ACKNOWLEDGMENTS

We would like to thank Sam Leroux for the work he performed in the context of this research during his master’s thesis.

8. REFERENCES

[1] Apache Software Foundation. Apache storm, 2015. Available at http://storm.apache.org/.
[2] T. Bogers and A. van den Bosch. Comparing and evaluating information retrieval algorithms for news recommendation. In Proceedings of the 2007 ACM Conference on Recommender Systems, RecSys ’07, pages 141–144, New York, NY, USA, 2007. ACM.
[3] T. Brodt and F. Hopfgartner. Shedding light on a living lab: The clef newsreel open recommendation platform. In Proceedings of the 5th Information Interaction in Context Symposium, IIiX ’14, pages 223–226, New York, NY, USA, 2014. ACM.
[4] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Comput. Linguist., 18(4):467–479, Dec. 1992.
[5] I. Cantador, A. Bellogín, and P. Castells. News@hand: A semantic web approach to recommending news. In W. Nejdl, J. Kay, P. Pu, and E. Herder, editors, Adaptive Hypermedia and Adaptive Web-Based Systems, volume 5149 of Lecture Notes in Computer Science, pages 279–283. Springer Berlin Heidelberg, 2008.
[6] D. Cutting and J. Pedersen. Optimization for dynamic inverted index maintenance. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’90, pages 405–411, New York, NY, USA, 1990. ACM.
[7] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 271–280, New York, NY, USA, 2007. ACM.
[8] T. De Pessemier, S. Coppens, K. Geebelen, C. Vleugels, S. Bannier, E. Mannens, K. Vanhecke, and L. Martens. Collaborative recommendations with content-based filters for cultural activities via a scalable event distribution platform. Multimedia Tools and Applications, 58(1):167–213, 2012.
[9] T. De Pessemier, C. Courtois, K. Vanhecke, K. Van Damme, L. Martens, and L. De Marez. A user-centric evaluation of context-aware recommendations for a mobile news service. Multimedia Tools and Applications, pages 1–29, 2015.
[10] Elastic. Elasticsearch, 2015. Available at https://www.elastic.co/.
[11] Google. Google Hourly Trends, 2015. Available at http://www.google.com/trends/hottrends/atom/hourly.
[12] E. Hatcher and O. Gospodnetic. Lucene in action (in action series). 2004.
[13] Katta. Lucene & more in the cloud, 2015. Available at http://katta.sourceforge.net/.
[14] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. Grouplens: Applying collaborative filtering to usenet news. Commun. ACM, 40(3):77–87, Mar. 1997.
[15] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan. Scene: A scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 125–134, New York, NY, USA, 2011. ACM.
[16] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 73–105. Springer US, 2011.
[17] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008.
[18] O. Phelan, K. McCarthy, and B. Smyth. Using twitter to recommend real-time topical news. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages 385–388, New York, NY, USA, 2009. ACM.
[19] L. Pizzato, T. Rej, T. Chung, I. Koprinska, and J. Kay. Recon: A reciprocal recommender for online dating. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 207–214, New York, NY, USA, 2010. ACM.
[20] M. F. Porter. Snowball: A language for stemming algorithms, 2001. Available at http://snowball.tartarus.org/.
[21] Reuters Institute for the Study of Journalism. Digital News Report, 2015. Available at http://www.digitalnewsreport.org/.
[22] A. Said, A. Bellogín, and A. de Vries. News recommendation in the wild: Cwi’s recommendation algorithms in the NRS challenge. In Proceedings of the 2013 International News Recommender Systems Workshop and Challenge. NRS, volume 13, 2013.
[23] K. G. Saranya and G. S. Sadhasivam. A personalized online news recommendation system. International Journal of Computer Applications, 57(18):6–14, November 2012.
[24] The Apache Software Foundation. Apache Lucene, 2015. Available at https://lucene.apache.org/.
[25] The Apache Software Foundation. Apache Mahout, 2015. Available at http://mahout.apache.org/users/recommender/recommender-documentation.html.
[26] The Apache Software Foundation. Apache Solr, 2015. Available at http://lucene.apache.org/solr/.
[27] The jQuery Foundation. jQuery mobile, a touch-optimized web framework, 2015. Available at http://jquerymobile.com.
[28] M. Woodman. Rome, 2015. Available at https://rometools.jira.com/wiki/display/ROME/Home.
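
The idea of Section 3, using the user profile as a query against an inverted index, can be illustrated with a minimal Python sketch. This is not the Lucene-based implementation of the system: the class `TinyIndex`, the tiny stop word list, and the plain TF-IDF sum used for scoring are assumptions made for illustration, and stemming (done with Snowball in the system) is omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop word list (an assumption, not the list used by the system).
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.n_docs = 0

    def add(self, doc_id, text):
        """Adding a document only appends to the postings, so it is searchable at once."""
        self.n_docs += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, profile, top_k=10):
        """Score documents against a profile given as {term: interest value}."""
        scores = defaultdict(float)
        for term, interest in profile.items():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.n_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += interest * tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:top_k]


index = TinyIndex()
index.add("a1", "Soccer final: late goal decides the cup")
index.add("a2", "Election results and the vote count")
index.add("a3", "Goal of the season in soccer")
recommended = index.search({"soccer": 2.0, "goal": 1.0})
```

In this sketch only the postings of the profile terms are visited, which is the property that keeps response times short, and a newly added article becomes a candidate recommendation without any similarity recomputation.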
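
The dynamic timer of Section 4.1 can be sketched in a few lines of Python. The text specifies only the halving and doubling rule; the class name `AdaptivePoller`, the starting timeout, and the lower and upper bounds are assumptions added so that the timeout cannot shrink or grow without limit.

```python
from dataclasses import dataclass


@dataclass
class AdaptivePoller:
    """Per-source timer: halve the timeout on new content, double it otherwise."""

    timeout_s: float = 600.0        # assumed starting timeout
    min_timeout_s: float = 30.0     # assumed lower bound, not specified in the text
    max_timeout_s: float = 86400.0  # assumed upper bound, not specified in the text

    def update(self, new_content_found: bool) -> float:
        """Adapt the timeout after one fetch and return the new value in seconds."""
        if new_content_found:
            self.timeout_s /= 2
        else:
            self.timeout_s *= 2
        # Clamp to the assumed bounds.
        self.timeout_s = max(self.min_timeout_s, min(self.timeout_s, self.max_timeout_s))
        return self.timeout_s


poller = AdaptivePoller()
after_hit = poller.update(True)    # new article found: check sooner
after_miss = poller.update(False)  # nothing new: back off again
```

A source that publishes several items per minute quickly converges to the lower bound, whereas a source with one item per day drifts towards the upper bound.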
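
The profile update of Section 4.3 (summation of TF-IDF item vectors, a 10 second reading threshold, and exponential decay with article age) can be sketched as follows. The decay rate is not given in the text, so the 24 hour half-life is an assumption, as are the function names. The query string uses the `term^boost` notation of the classic Lucene query syntax as one possible way to pass interest values to the search engine; whether the system does it this way is not stated.

```python
MIN_READ_SECONDS = 10.0  # reading threshold stated in Section 4.3
HALF_LIFE_HOURS = 24.0   # assumed decay rate, not specified in the text


def decay_weight(age_hours, half_life_hours=HALF_LIFE_HOURS):
    """Exponential decay: the weight halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def build_profile(read_events):
    """Sum decayed TF-IDF vectors of all articles read for long enough.

    Each event is a tuple (tfidf_vector, read_seconds, article_age_hours).
    """
    profile = {}
    for tfidf, read_seconds, age_hours in read_events:
        if read_seconds < MIN_READ_SECONDS:
            continue  # too short to count as implicit positive feedback
        weight = decay_weight(age_hours)
        for term, value in tfidf.items():
            profile[term] = profile.get(term, 0.0) + weight * value
    return profile


def to_query(profile):
    """Render the profile as a boosted query string, strongest terms first."""
    ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(f"{term}^{weight:.2f}" for term, weight in ranked)


events = [
    ({"soccer": 2.0, "goal": 1.0}, 30.0, 0.0),       # read just now
    ({"soccer": 1.0, "election": 4.0}, 45.0, 24.0),  # one day old: weight 0.5
    ({"weather": 9.0}, 5.0, 1.0),                    # below the 10 s threshold
]
profile = build_profile(events)
query = to_query(profile)
```

Computing the decay when the profile is built, rather than when the article is read, keeps the profile consistent with the current age of each article.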
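
The frequency-based detection of trending topics in Section 4.3 can be sketched as a comparison of the current term count with a historical baseline. The text says only that the current frequency must be "significantly higher" than in the past; the ratio of 3, the minimum count of 5, and the use of the mean as baseline are assumptions.

```python
def is_trending(current_count, past_counts, ratio=3.0, min_count=5):
    """Flag a term whose current frequency clearly exceeds its past average.

    `past_counts` holds the term's counts in earlier time windows; `ratio`
    and `min_count` are assumed tuning parameters.
    """
    if current_count < min_count:
        return False  # too rare to be meaningful
    baseline = sum(past_counts) / len(past_counts) if past_counts else 0.0
    # Use a floor of 1 so that unseen terms do not trend on a single mention.
    return current_count >= ratio * max(baseline, 1.0)


spike = is_trending(40, [3, 5, 4])   # 40 mentions against an average of 4
steady = is_trending(6, [5, 6, 7])   # about as frequent as before
```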
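
The profile extension of Section 4.3, in which terms that are prominent in neighboring profiles are added to the user's profile, can be sketched with a plain nearest neighbor search. The system relies on Apache Mahout for this step; the sketch below does not use Mahout, and the cosine similarity, the neighborhood size `k`, the number of added terms, and the damping factor are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extend_profile(user, others, k=2, n_terms=3, damping=0.5):
    """Add up to `n_terms` new terms taken from the `k` most similar profiles."""
    neighbors = sorted(others, key=lambda p: cosine(user, p), reverse=True)[:k]
    scores = {}
    for neighbor in neighbors:
        sim = cosine(user, neighbor)
        if sim <= 0.0:
            continue  # unrelated profiles contribute nothing
        for term, value in neighbor.items():
            if term not in user:
                scores[term] = scores.get(term, 0.0) + sim * value
    extended = dict(user)
    best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]
    for term, score in best:
        extended[term] = damping * score  # added terms weigh less than own terms
    return extended


user = {"soccer": 1.0, "goal": 1.0}
others = [
    {"soccer": 1.0, "goal": 1.0, "transfer": 2.0},  # similar user
    {"election": 3.0, "vote": 1.0},                 # unrelated user
]
extended = extend_profile(user, others)
```

Because only terms from similar profiles are added, the extended profile is broader but still likely to stay within the user's area of interest.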
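
The third clustering approach of Section 4.4, a hierarchical clustering of the recommended items only, can be sketched as a simple agglomerative procedure. The text does not name the linkage criterion or the stopping rule; average linkage over cosine similarity and a similarity threshold are assumptions. The quadratic pair search is acceptable here because only a small candidate set has to be clustered.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(items, threshold=0.5):
    """Agglomerative clustering; returns clusters as lists of item indices."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Average linkage: mean pairwise similarity between the two clusters.
        total = sum(cosine(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining clusters are about different topics
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


recommended = [
    {"soccer": 1.0, "goal": 1.0},
    {"soccer": 1.0, "goal": 2.0},
    {"election": 1.0, "vote": 1.0},
    {"election": 2.0, "vote": 1.0},
]
topics = cluster(recommended)
```

Lowering `threshold` for a user yields fewer, coarser clusters, which corresponds to choosing a different level of clustering per user.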