A News Recommender Engine with a Killer Sequence

Pieter Bons (pieter.bons@ortec.com), Nick Evans (nick.evans@ortec.com), Peter Kampstra (peter.kampstra@ortec.com), Timo Van Kessel (timo.vankessel@ortec.com)

ORTEC, Houtsingel 5, 2719 EA Zoetermeer, The Netherlands

Keywords: Recommender Systems, News Recommender Engine, Graph Database, Sequence, Evaluation, Collaborative filtering, Real-time recommendations

Our submission to the CLEF NewsREEL 2017 News Recommendation Evaluation Lab attempts to apply the concept of storytelling to the participating news domains. The goal is to guide the user through a series of related articles where each transition from one article to the next provides an opportunity to steer the storyline in a certain direction using recommendations. The approach employs collaborative filtering to discover an optimal sequence of articles: a killer sequence. The choices made by users reading two or more successive articles were stored in the graph database Neo4J. The recommendations were generated by querying this database for the most popular historic sequences containing the article in question. For articles that were not yet part of any sequence, we generated the recommendations from a dynamic set of the most recently changed publications for that domain. The Click Through Rate (CTR) of our combined algorithm was approximately 79% higher than that of the competition baseline. We investigated whether this top performance was due to unwanted behavior of the recommender, such as only answering on certain domains or times, but could not conclude that this was the case.

Introduction

The holy grail for (online) publishers is reader engagement. The more engaging the user's experience, the more likely it is that they will come back, increasing their loyalty to the domain with each visit until they may even become brand ambassadors. Once the consumer has crossed the barrier to enter the domain, the goal is to keep them there as long as possible. One of the important strategies to achieve this is storytelling. On the content side this requires rich and interactive media to retain the reader's attention. This can be reinforced by artificial intelligence and automation tools providing relevant personalized content to individual readers. Many types of algorithms could be considered for this job. However, traditional news recommendation algorithms rarely consider the temporal order of user browsing behavior. Strategies from other domains such as movies or music offer no solution either: there the order in which items are consumed is hardly relevant, whereas it is crucial for news articles, which are much less independent of each other. Text mining may be employed to classify the articles into topics or to calculate similarities, but this approach is also symmetric under the order in which articles are read.

To tackle this subtle issue, we have chosen to go with a collaborative filtering approach where the wisdom of the crowds will provide the information in which order the articles are best presented. The collective behavior of the users will uncover dominant sequences of articles that together form a powerful story. This sequence can then be suggested to new users via the recommendations causing them to stay more engaged and consume more content from the domain.

There is some related work which uses the timestamps of user-item interaction events [11,13]. We have not based our work specifically on these papers but followed a pragmatic approach based on our experience in the online news industry.

Approach

The Living Lab Evaluation [2] was performed using the open recommendation platform (ORP) provided by plista. By redirecting some of the live traffic, the ORP allows participants to test algorithms in a real environment. There are four types of events that are sent from ORP: Recommendation requests, Impressions, Item Updates, and Error Messages (such as timeout).

Fig. 1. Data processing infrastructure

Our infrastructure choices were dictated by two main requirements: 1) the system should answer recommendation requests as fast as possible, and 2) the system should be extensible and support many different algorithms running in parallel. In order to satisfy both these requirements it was necessary to decouple the processing of recommendation requests from the processing of impressions, clicks and item updates. We settled on Apache Kafka [1] to implement this decoupling. As can be seen in Fig. 1, all requests are first processed by a Node.js [9] server. All requests are stored in MongoDB [3] for tracking and debugging. If the request is a recommendation request, the recommendations are gathered and returned. If the request is an impression, a click or an item update, the request is published to a Kafka topic. The recommendation algorithms can subscribe to this topic and process the necessary information as it becomes available. This decoupling ensures that the time to answer a recommendation request is not affected by the number or nature of the recommendation algorithms. We found that a single Node.js process was enough to adequately handle all the recommendation requests (note that Node.js processes are always single-threaded).
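The decoupling described above can be illustrated with a minimal Python sketch. It is not the Node.js and Kafka implementation used in the contest: an in-process queue stands in for the Kafka topic, the MongoDB logging step is omitted, and the names `event_topic`, `handle_request` and `drain` as well as the request field `type` are hypothetical.

```python
import queue
from typing import Callable

# Stand-in for a Kafka topic (assumption): an in-process FIFO queue.
event_topic: "queue.Queue[dict]" = queue.Queue()


def handle_request(request: dict, recommend: Callable[[dict], list]) -> list:
    """Answer recommendation requests directly; publish everything else."""
    if request["type"] == "recommendation_request":
        # The reply does not wait for any algorithm to process pending events.
        return recommend(request)
    # Impressions, clicks and item updates are queued for later processing.
    event_topic.put(request)
    return []


def drain(consumers: list) -> int:
    """Deliver every queued event to every subscribed algorithm."""
    delivered = 0
    while not event_topic.empty():
        event = event_topic.get()
        for consume in consumers:
            consume(event)
        delivered += 1
    return delivered
```

Because the request handler only enqueues non-recommendation events, adding more subscribed algorithms in this sketch changes the cost of `drain`, not the cost of answering a recommendation request.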

The setup we used for this contest relied only on two recommendation algorithms: most recent (the baseline) and killer sequence. Recommendations were taken from killer sequence if available and otherwise from most recent. In the future, once more recommendation algorithms have been implemented, we would like to experiment with a meta-algorithm for selecting the algorithm from which recommendations should be used for a particular request. Therefore, our infrastructure already takes multiple algorithms into account.

Killer sequence algorithm

For each new item update event an article node is generated in the Neo4J graph database [10]. A flag is also set to indicate whether the article is available for recommendations. This flag can be altered by later item updates with the same id.

Based on the impression events we keep track of user sessions. When an impression event is received, a new session containing the user ID and the article ID is created in the Redis [4] database with an expiration time of 60 seconds. When a second impression with the same user ID is handled within these 60 seconds, the session is updated with the new article ID and a relation is set in Neo4J between the first and the second article nodes, signifying that these two articles were read in this specific order by the user. The sequence can grow to as many articles as the user consumes, as long as there are less than 60 seconds between the events so that the session does not expire. The replies to the recommendation requests are generated by querying the Neo4J graph database for the strongest sequence containing the article in question. The top X results are returned and their IDs are sent to the ORP. We employ a breadth-first search only one layer deep because it could be undesirable to skip an article in the sequence that forms a news story. To make the recommendations more personalized, we would like to incorporate the historical sequence of a user into the search, but in the current implementation this information is no longer available since we store the sum of all interactions and not the individual actions.
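The session and sequence logic of this section can be sketched in a few lines of Python. This is an in-memory illustration only: plain dictionaries stand in for Redis and Neo4J, timestamps are passed in explicitly, and the class name `SequenceGraph`, its method names and the tie-breaking by article ID are assumptions made for the example.

```python
from collections import defaultdict

SESSION_TTL = 60.0  # seconds; the expiration time named in the text


class SequenceGraph:
    def __init__(self):
        # user_id -> (last article id, timestamp of that impression)
        self.sessions = {}
        # from_article -> to_article -> number of users who read them in order
        self.edges = defaultdict(lambda: defaultdict(int))

    def on_impression(self, user_id, article_id, now):
        last = self.sessions.get(user_id)
        if last is not None and now - last[1] < SESSION_TTL:
            # Session still alive: record the ordered transition.
            self.edges[last[0]][article_id] += 1
        # Create or refresh the session with the newest article.
        self.sessions[user_id] = (article_id, now)

    def recommend(self, article_id, limit):
        # One layer deep: only direct successors of the current article.
        followers = self.edges.get(article_id, {})
        ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
        return [article for article, _ in ranked[:limit]]
```

Only aggregated transition counts are kept, which mirrors the limitation mentioned above: the individual reading history of a user cannot be recovered from the stored edges.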

When the graph query does not return any recommendations because the article is not yet part of a sequence, a fallback recommendation is used, taken at random from a dynamic set of the most recently changed articles per domain, as implemented in the example code provided for the challenge.
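A minimal sketch of this fallback and of the selection rule between the two algorithms is given below. It is not the challenge's example code: the class `MostRecent`, the `capacity` of 100 and the helper `combined` are hypothetical, and this sketch keeps each article at most once in the set, which is an assumption of the example.

```python
import random
from collections import OrderedDict


class MostRecent:
    def __init__(self, capacity=100):
        self.capacity = capacity  # hypothetical size of the dynamic set
        self.recent = {}  # domain_id -> OrderedDict of article ids, oldest first

    def on_item_update(self, domain_id, article_id):
        items = self.recent.setdefault(domain_id, OrderedDict())
        items.pop(article_id, None)  # move an updated article to the newest slot
        items[article_id] = True
        while len(items) > self.capacity:
            items.popitem(last=False)  # drop the oldest article

    def recommend(self, domain_id, limit, rng=random):
        pool = list(self.recent.get(domain_id, ()))
        return rng.sample(pool, min(limit, len(pool)))


def combined(sequence_recs, fallback):
    """Use killer sequence results if there are any, otherwise the fallback."""
    return sequence_recs if sequence_recs else fallback()
```

The fallback is passed to `combined` as a callable so that the random draw is only made when the sequence query returned nothing.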

The expiration time of 60 seconds was chosen arbitrarily and we expect that optimizing it will improve the performance of the algorithm.

Results and analysis

By participating in the CLEF NewsREEL 2017 Task 1 we were able to test our algorithm in a real-life environment and compare it to other algorithms. Originally, multiple test periods were planned, with a single evaluation period at the end. However, due to technical issues on both sides, caused by the corporate environments involved, we went straight into the final evaluation period.

In Table 1 we show the overall results of CLEF NewsREEL 2017 Task 1. Our algorithm is listed as ORLY_KS, while the baseline algorithm is listed as BL2Beat. Our algorithm performs 79% better than the baseline, and scores the highest CTR of all algorithms with more than 1000 widget impressions. Overall, the CTRs are higher than during CLEF NewsREEL 2016, where the highest CTR reported was 1.23% [6]. One recommendation widget could contain multiple items, with the number of items not always being the same. Since clicking an item triggers a page refresh, only one recommendation item per page could be clicked, hence the CTR for items is always lower than for widgets. Because the item/widget ratio for our recommender is the lowest of all recommenders, the per-item CTR gain is even slightly higher, at 82% better than the baseline. However, the number of widget impressions for our recommender is quite a bit lower than for the baseline and some other algorithms.
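The percentages above can be recomputed from the raw counts in Table 1. The short Python sketch below does this for ORLY_KS and BL2Beat; the function names are chosen for this example only. From the raw counts the widget CTR gain comes out at roughly 79% and the per-item gain at roughly 82 to 83%, depending on rounding.

```python
def ctr(clicks, shown):
    """Click-through rate as a fraction."""
    return clicks / shown


def lift_over_baseline(ctr_value, baseline_ctr):
    """Relative improvement over the baseline, in percent."""
    return (ctr_value / baseline_ctr - 1.0) * 100.0


# Counts taken from Table 1.
orly = {"clicks": 896, "widgets": 42786, "items": 130221}
base = {"clicks": 726, "widgets": 62052, "items": 193014}

widget_lift = lift_over_baseline(
    ctr(orly["clicks"], orly["widgets"]), ctr(base["clicks"], base["widgets"])
)
item_lift = lift_over_baseline(
    ctr(orly["clicks"], orly["items"]), ctr(base["clicks"], base["items"])
)
items_per_widget = orly["items"] / orly["widgets"]

print(f"widget CTR lift: {widget_lift:.1f}%")
print(f"item CTR lift:   {item_lift:.1f}%")
print(f"items per widget: {items_per_widget:.4f}")
```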

From previous years of CLEF NewsREEL (e.g. [8]) it is known that CTR ratios can differ hugely between different times and different domains. Therefore, answering a selective number of recommendation requests or having a different sampling of the available requests could make the CTR ratios incomparable. We will therefore focus our analysis on finding out if we inadvertently achieved a high CTR by answering selectively.

Response time analysis

One of the properties of a good real-time news recommendation system is a fast response time, preferably within 100 milliseconds. Since the ORP might also drop recommendations after 100 ms, we decided to log our response times. The results are shown as a density plot in Fig. 3. In 99.9% of the cases we responded within 90 milliseconds, and on average we responded in 7 milliseconds. Given that our recommender was located at the same hosting provider (Hetzner Online GmbH), 10 milliseconds is more than enough for transfer times. Therefore it is very unlikely that response time problems played a role in having fewer widget recommendations.

Fig. 3. Density plot of response times for our recommender during evaluation on a log-scale. Data gathering starts during the contest at 2017-04-28.

Weekly results and error reports

ORP provided us with an API to query the weekly results during the contest. The results are shown in Table 2. While our error ratios for both weeks are higher than those of most recommenders, other recommenders like RIADI_nehyb have even higher error ratios but are able to produce more impressions. The most likely explanation for the connection problems is our corporate policy, which requires the TCP port to be firewalled when an insecure connection is used. As the connection was insecure, the port was firewalled and only certain IP addresses were whitelisted. In week 1 some of the required IP addresses were not whitelisted, while in the second week plista added servers, which were whitelisted only after a small delay.

In Table 3 we show the error notifications received by our server. The number of content errors received matches exactly, with 1415 instances. These content errors were probably caused by a deploy on 2017-06-28. Strangely, the connection errors also seem to match somewhat, although it is usually impossible for a server to communicate such an error when a firewall blocks the connection. The number of timeout errors is negligible.

Overall, the number of errors is still quite low and does not prevent other recommenders from achieving high numbers of impressions. Also, the errors are not only related to recommendation requests, but also to other requests. We therefore assume that the impact from these errors was quite low.

Returning the wrong (number of) recommendations

A possible cause for having the lowest number of recommendation items per widget is that our recommender simply does not return enough recommendations. As we stored almost all recommendations made by our recommender in MongoDB, it was possible to investigate this. We also recorded which algorithm produced each recommendation. The results of this analysis are shown in Table 4. It is clearly visible that something went wrong during a deploy with a re-implementation of the baseline recommender (Most recent) in Redis. The idea of this re-implementation was to be able to restart the recommender without losing state; however, it contained an off-by-one error and therefore usually returned one recommendation more than requested. We expect that returning one recommendation too many is not a huge problem, as ORP usually requests more recommendations than it needs anyway. Most likely the extra recommendation was simply ignored.

Table 4 also shows that our killer sequence recommender more frequently gives fewer recommendations than requested. In some cases it handed out 0 recommendations, which was fixed later on and probably explains the difference between outgoing and incoming recommendations. If neither recommender could obtain a result, no result was returned. If the killer sequence recommender had fewer recommendations than requested, we did not add results from the baseline recommender. This might explain why our algorithm had the lowest item/widget ratio. When an article is received from ORP, it also contains a flag indicating whether this article is recommendable. Articles with a flag other than 0 are not considered recommendable; however, the baseline implementation ignores this flag when recommending, in both the Java and Node.js versions of the example code supplied by the organizers. It also does not check whether it recommends the same article multiple times.

In Table 5 we show an analysis of whether a recommendation contains a non-recommendable article or a duplicated recommendation. The baseline recommender in particular almost always does this. This is no surprise, as it does not check the recommendable flag and draws from a set of recently updated articles. Out of 698340 received article updates, 460743 (66%) concern a non-recommendable article. When selecting 6 or 7 articles from these updates, the probability of getting at least one non-recommendable article is quite high. Also, it is likely that the same article is updated multiple times within a short period, resulting in duplicate recommendations in 22% of the cases.
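How high this probability is can be estimated with a small calculation. The sketch below assumes that each selected article is an independent draw with the same 66% chance of being non-recommendable; this independence is a simplification made for the example, since the real set holds recent updates that are not independent of each other.

```python
def p_at_least_one(p_bad, k):
    """Probability that at least one of k independent draws is 'bad'."""
    return 1.0 - (1.0 - p_bad) ** k


# Share of received article updates that concern a non-recommendable article.
p_bad = 460743 / 698340

for k in (6, 7):
    print(f"k={k}: {p_at_least_one(p_bad, k):.4f}")
```

Under this assumption the chance of a fully valid set of 6 articles is well below 1%, which is consistent with the observation that the baseline almost always returns at least one non-recommendable article.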

While our killer sequence never returned duplicates, it also failed to check the status flag which we did store in Neo4J. Therefore, it did return non-recommendable articles in 23% of the cases.
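The checks that were missing can be sketched as a single post-processing step. The function below is a hypothetical illustration and not part of our contest code: it merges a primary and a fallback list, skips duplicates and non-recommendable articles, and stops at the requested limit. The mapping `recommendable` is assumed to be derived from the ORP flag (true when the flag equals 0).

```python
def clean_recommendations(primary, fallback, recommendable, limit):
    """Merge two ranked lists into at most `limit` unique, recommendable ids."""
    result, seen = [], set()
    for article_id in list(primary) + list(fallback):
        if len(result) == limit:
            break
        # Skip duplicates and articles that are unknown or flagged as not recommendable.
        if article_id in seen or not recommendable.get(article_id, False):
            continue
        seen.add(article_id)
        result.append(article_id)
    return result
```

Placing the primary list first keeps the killer sequence order intact, while the fallback list is only used to top up the result.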

All in all there is huge room for improvement in returning valid recommendation results. However, given that the baseline recommender makes the same mistakes a lot of the time, we do not believe this is the cause of having fewer widget impressions.

Table 5. Different situations for different sub-algorithms; percentages are relative to the total recommendations given by the sub-algorithm. A recommendation is considered ok if it contains only articles with status=0 and no duplicates. Ok <6 means that fewer than 6 recommendations were returned, while Ok >6 means that more than 6 recommendations were returned. Given that almost all requests were for 6 recommendations, Ok 6 (6 returned recommendations and no invalid or duplicate articles) is probably preferable.

So why did we get fewer widget impressions?

The above analysis did not conclusively identify the cause of having fewer impressions. Also, some recommenders in Table 1 receive significantly more impressions than the baseline recommender does. We therefore assume that the variation in impressions between different recommenders is probably also the result of some nonrandom preference within the ORP platform.

Discussion

Currently the killer sequence algorithm behaves the same for every user. Given that not all users will have the same preferences, it is probably beneficial to personalize the algorithm towards certain user behavioral patterns. For example, instead of giving all follow relations the same weight, it is likely beneficial to give follow relations of similar users a higher weight. This would of course require a metric of similarity between users or a grouping of the users, where relations from the same user group are given more weight for the current user. We expect that a personalized version of the killer sequence will increase the performance.
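One possible shape of such a weighting is sketched below. It assumes that every stored transition also carries the group of the user who made it, which the current implementation does not record; the function name, the `same_group_weight` of 2.0 and the group labels are all hypothetical.

```python
from collections import defaultdict


def personalized_ranking(transitions, user_group, same_group_weight=2.0):
    """Rank follow-up articles; transitions is an iterable of (next_article_id, reader_group)."""
    scores = defaultdict(float)
    for next_article, group in transitions:
        # Transitions made by users from the same group count more heavily.
        scores[next_article] += same_group_weight if group == user_group else 1.0
    # Highest score first; ties are broken by article id for a stable order.
    return sorted(scores, key=lambda a: (-scores[a], a))
```

With a weight of 1.0 this reduces to the unpersonalized ranking by transition count.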

It is also likely that trends are not static over time. Currently we do not take into account temporal trends in the data, but it is likely that it would be better to take the time when paths occurred into account. For example, it would be possible to use a decay function which gives less weight to paths that have taken place a long time ago.
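As an illustration of such a decay function, the sketch below uses exponential decay with a half-life. It assumes that a timestamp is kept for every observed transition, which the current implementation does not do, and the half-life value is a free parameter that would have to be tuned.

```python
def decayed_weight(age_seconds, half_life_seconds):
    """Weight of a transition that happened age_seconds ago."""
    return 0.5 ** (age_seconds / half_life_seconds)


def edge_score(transition_times, now, half_life_seconds):
    """Sum of decayed weights over all observed transitions of one edge."""
    return sum(decayed_weight(now - t, half_life_seconds) for t in transition_times)
```

For example, with a half-life of one hour, a transition observed just now counts as 1.0 and one observed an hour ago counts as 0.5, so an edge seen at both moments scores 1.5.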

In many machine learning problems, an ensemble of multiple algorithms performs better than a single algorithm [6]. Our infrastructure already takes into account the possibility of using an ensemble of algorithms and a meta-algorithm; however, it currently uses just two algorithms and a very simple meta-algorithm. This offers many possibilities for using a better ensemble.

Due to the test setup, it was unfortunately impossible to track which sub-algorithm led to which results. In the world of Real Time Bidding [12], it is commonplace to have a unique identifier for each impression, which is used throughout the whole interaction, for example also when an impression is clicked. In this case it would have been quite beneficial for the evaluation to have both a unique identifier and the recommender used to produce the impression as attributes in the protocol. As this was not the case, we can only estimate the performance of the killer sequence sub-algorithm. Fortunately our algorithm did not require a direct feedback loop on how well the recommendations were doing, as that would have been impossible. Debugging and click attribution would be greatly improved if a future version of the ORP supported attributes for unique identifiers.
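The kind of attribution this would enable is sketched below. The sketch is purely hypothetical: the ORP protocol does not carry such an identifier, and the class `AttributionLog` and its methods are invented for the example.

```python
import uuid
from collections import Counter


class AttributionLog:
    def __init__(self):
        self.served = {}  # impression_id -> name of the algorithm that answered
        self.impressions = Counter()
        self.clicks = Counter()

    def record_served(self, algorithm):
        """Register a served recommendation and return its unique identifier."""
        impression_id = uuid.uuid4().hex
        self.served[impression_id] = algorithm
        self.impressions[algorithm] += 1
        return impression_id

    def record_click(self, impression_id):
        """Attribute a click to the algorithm behind the impression, if known."""
        algorithm = self.served.get(impression_id)
        if algorithm is not None:
            self.clicks[algorithm] += 1
        return algorithm

    def ctr(self, algorithm):
        shown = self.impressions[algorithm]
        return self.clicks[algorithm] / shown if shown else 0.0
```

With the identifier echoed back on a click, the CTR of each sub-algorithm could be measured directly instead of being estimated.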

Conclusion

The CLEF NewsREEL 2017 Task 1 provided a unique opportunity to test our killer sequence recommender in a real-life situation and compare the results with other recommenders. It turned out that our killer sequence recommender performed 79% better than the baseline recommender. This resulted in the top position among the recommenders with more than 1000 recommendations. Such a result could have been caused by unwanted behavior, such as answering requests only for certain high-CTR domains or times. We investigated whether this higher than expected performance was due to such behavior of the recommender, but could not conclude that this was the case. Therefore, it appears that our algorithm performs very well in the test environment.

There are still many ideas left for improvement of the results. A future version of the recommender should avoid non-recommendable articles, duplicate articles within the recommendation, or returning fewer results than needed. Also, personalization of the killer sequence recommender and ensemble recommendation is expected to yield better results.

Fig. 2. A schematic overview of the event handling for the killer sequence algorithm.

Table 1. Overall results of algorithms in CLEF NewsREEL 2017 Task 1, sorted by CTR.

Name          Clicks  Widgets shown     CTR  Items shown  Item CTR  Items/Widget  % of baseline
Riadi_NV_01       12            443  2.709%         1380   0.8696%        3.1151           232%
ORLY_KS          896          42786  2.094%       130221   0.6881%        3.0435           179%
ody4            1139          72601  1.569%       230896   0.4933%        3.1803           134%
IRS5              58           3708  1.564%        11856   0.4892%        3.1974           134%
ody5            1268          81245  1.561%       255663   0.4960%        3.1468           133%
ody3             813          59227  1.373%       184052   0.4417%        3.1076           117%
ody2             875          63950  1.368%       199547   0.4385%        3.1204           117%
IT5              925          68582  1.349%       214922   0.4304%        3.1338           115%
Eins             817          61524  1.328%       191647   0.4263%        3.1150           114%
yl-2             747          60814  1.228%       192207   0.3886%        3.1606           105%
WIRG             600          49830  1.204%       154419   0.3886%        3.0989           103%
ody1             810          68768  1.178%       214406   0.3778%        3.1178           101%
BL2Beat          726          62052  1.170%       193014   0.3761%        3.1105           100%
RIADI_pn         879          77723  1.131%       244334   0.3598%        3.1437            97%
IL               813          79120  1.028%       249492   0.3259%        3.1533            88%
RIADI_nehyb      764          75535  1.011%       236322   0.3233%        3.1286            86%
Has logs           6            816  0.735%         2610   0.2299%        3.1985            63%
ody0             166          23023  0.721%        72599   0.2287%        3.1533            62%
RIADI_hyb          2            349  0.573%         1146   0.1745%        3.2837            49%

Table 2. Weekly results of algorithms in CLEF NewsREEL 2017 Task 1, sorted by overall CTR. Note that the counts for both weeks almost, but not precisely, match the overall results.

Columns per evaluation week (week 1, then week 2): clicks, impressions, CTR, connect error, content error.

Riadi_NV_01  12 443 2.709% 72044576
ORLY_KS  367 18930 1.939% 3836 1415  529 23931 2.211% 33550
ody4  571 36917 1.547% 5830  560 35404 1.582% 4650
IRS5  58 3708 1.564% 90231
ody5  608 41515 1.465% 6300  655 39449 1.660% 9940
ody3  392 29983 1.307% 11130  417 29035 1.436% 5310
ody2  457 32751 1.395% 6240  416 30995 1.342% 4890
IT5  438 33766 1.297% 49812  482 34534 1.396% 4668
Eins  405 32207 1.257% 8670  411 29024 1.416% 1090
yl-2  401 32572 1.231% 33184  343 27891 1.230% 6384
WIRG  316 27031 1.169% 400155  284 22882 1.241% 25951
ody1  413 35382 1.167% 5900  397 33141 1.198% 4990
BL2Beat  322 29535 1.090% 1740402  405 32586 1.243% 5480
RIADI_pn  409 37488 1.091% 33413  470 40184 1.170% 930
IL  388 40345 0.962% 7528  422 38526 1.095% 7657
RIADI_nehyb  326 36414 0.895% 24847578  437 38858 1.125% 96946
Has logs  6 816 0.735% 120
ody0  103 12982 0.793% 178150  61 9828 0.621% 00
RIADI_hyb  2 349 0.573% 827870

Table 3. Error codes received. Unfortunately, these can also occur outside of recommendation requests.

Error code  Constant                    Count
455         ERRCODE_CONNECTION_FAILED    6658
442         ERRCODE_FORMAT_INVALID       1415
408         ERRCODE_CONNECTION_TIMEOUT     76
Total                                    8149

Table 4. Number of recommendations versus recommendation limit for different algorithms. Returning 7 recommendations is the result of an off-by-one bug.

# rec's         0     1     2     3     4    5       6      7   Total        %
K. Sequence   526  2358  1384  1140  1114  929   66061          73512   42.93%
Most Recent          24    85    30    12   23   35141  62415   97730   57.07%
All Outgoing  526  2382  1469  1170  1126  952  101202  62415  171242  100.00%

All Incoming: 88 172144 172160

References

1. Apache Kafka. https://kafka.apache.org/, last accessed 2017/06/01.
2. Azzopardi, L., Balog, K.: Towards a living lab for information retrieval research and development. In: International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Heidelberg (2011).
3. Banker, K.: MongoDB in Action. Manning Publications Co. (2011).
4. Carlson, J.L.: Redis in Action. Manning Publications Co. (2013).
5. Corsini, F., Larson, M.: CLEF NewsREEL 2016: Image based Recommendation. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016 (2016).
6. Dietterich, T.G.: Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer, Heidelberg (2000).
7. Kille, B., Lommatzsch, A., Gebremeskel, G.G., Hopfgartner, F., Larson, M., Seiler, J., Malagoli, D., Serény, A., Brodt, T., De Vries, A.P.: Overview of NewsREEL'16: Multidimensional Evaluation of Real-Time Stream-Recommendation Algorithms. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th International Conference of the CLEF Association, CLEF 2016, Proceedings, Évora, Portugal, September 5-8, 2016 (2016).
8. Kille, B., Lommatzsch, A., Turrin, R., Serény, A., Larson, M., Brodt, T., Seiler, J., Hopfgartner, F.: Overview of CLEF NewsREEL 2015: News recommendation evaluation lab (2015).
9. Tilkov, S., Vinoski, S.: Node.js: Using JavaScript to build high-performance network programs. IEEE Internet Computing 14(6) (2010).
10. Webber, J.: A programmatic introduction to Neo4j. In: Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity. ACM (2012).
11. Xiang, L., Yuan, Q., Zhao, S., Chen, L., Zhang, X., Yang, Q., Sun, J.: Temporal recommendation on graphs via long- and short-term preference fusion. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM (2010).
12. Yuan, S., Wang, J., Zhao, X.: Real-time bidding for online advertising: measurement and analysis. In: Proceedings of the Seventh International Workshop on Data Mining for Online Advertising, 3. ACM (2013).
13. Zhou, B., Hui, S.C., Chang, K.: An intelligent recommender system using sequential web access patterns. In: Cybernetics and Intelligent Systems, vol. 1. IEEE (2004).