<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ECML-PKDD 2011 Discovery Challenge Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nino Antulov-Fantulin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matko Bošnjak</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Žnidaršič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miha Grčar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikolaj Morzy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomislav Šmuc</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Poznań University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Rudjer Bošković Institute</institution>
          ,
          <addr-line>Zagreb</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>This year's Discovery Challenge was dedicated to solving video lecture recommendation problems based on data collected at the VideoLectures.Net site. The challenge had two tasks: task 1, which simulated the new-user/new-item recommendation problem, and task 2, which simulated clickstream-based recommendation. In this overview we present the challenge datasets, tasks and evaluation measure, and we analyze the submitted solutions and results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>the leaderboard and the test set), together with task and evaluation descriptions, is
publicly available for non-commercial research purposes [28].</p>
      <p>We secured prize sponsorship (5500 €) from the European Commission through
the e-LICO EU project (2009-2012), whose primary goal is to build a virtual laboratory
for interdisciplinary collaborative research in data mining and data-intensive sciences.</p>
      <p>The prizes for each of the tracks are:
– 1500 € for the first place
– 700 € for the second place
– 300 € for the third place</p>
    </sec>
    <sec id="sec-2">
      <title>The prizes for the Workflow contest are:</title>
      <p>– 500 € for the best workflow
– Free admission to the RapidMiner Community Meeting and Conference 2012 for the
best RapidMiner workflow (sponsor: Rapid-I)</p>
      <sec id="sec-2-1">
        <title>The challenge has been hosted on TunedIt7.</title>
        <sec id="sec-2-1-1">
          <title>Background</title>
          <p>
            Recommender systems have become an important research area ever since information
overload became a problem for the typical internet user. Personalized
recommender systems take user profiles into account when generating a prediction for a
particular user and item. The prediction techniques for recommender systems
[
            <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
            ] can be divided into three main categories: content-based, collaborative,
and hybrid prediction techniques.
          </p>
          <p>
            Content-based techniques [
            <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
            ] are based on interactions between a particular user
and all the items in the system. Content-based recommender systems use information
about items and the user’s past activities on items in order to recommend similar
items.
          </p>
          <p>
            Collaborative filtering techniques [
            <xref ref-type="bibr" rid="ref6 ref7 ref8">6–8</xref>
            ] analyze interactions between all users and
all items through users’ ratings, clicks, comments, tags, etc. Collaborative filtering
recommender systems do not use any specific knowledge about the items except their
unique identifiers. These prediction techniques are domain-independent and can
provide serendipitous recommendations for users. However, collaborative filtering needs
a sufficient amount of collaborative data before it can produce recommendations for a
new user or a new item (the cold-start problem) [
            <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
            ].
          </p>
          <p>
            Hybrid prediction techniques [
            <xref ref-type="bibr" rid="ref11 ref12 ref13">11–13</xref>
            ] merge collaborative-based and content-based
techniques and are more resistant to cold start problems. This challenge was designed
to tackle the problems of cold start and hybridization of content and collaborative data
in realistic setting of the VL.Net website. In comparison to recommender challenges
of recent years (Netflix challenge, KDDCup challenge 2008, KDDCup challenge 2011)
this challenge relies on indirect collaborative data, and is more focussed on utilization
of content and descriptions of items.
7 http://tunedit.org
3
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Description of the challenge dataset</title>
          <p>The data snapshot which is the basis for the VideoLectures.Net dataset was taken in
August 2010. At that time, the database contained 8 105 video lectures, 5 286 of which
were manually categorized into a taxonomy of roughly 350 scientific topics such as Arts,
Computer Science, and Mathematics.</p>
          <p>VideoLectures.Net dataset includes:
1. Data about lectures: every lecture has a title, type (e.g. lecture, keynote,
tutorial, press conference), a language identifier (e.g. en, sl, fr), number of
views, publication date, event identifier, and a set of authors. Many lectures come
with a short textual description and/or with slide titles from the respective
presentations. Specifically, 5 724 lectures are enriched with this additional unstructured
data. The training part of the data also contains lecture-pair co-viewing frequencies
(CVS, common view score) and pooled-sequence collaborative data, which are
not available for the set of test lectures. The test set contains lectures with
publication date after July 01, 2009, which are used for task 1 scoring. Neither
CVS nor pooled viewing sequences containing these lectures are available in the
training data.
2. Data about authors: each author has a name, e-mail address, homepage
address, gender, affiliation, and the respective list of lectures. The dataset contains
8 092 authors. The data about the authors is represented by authors’ names,
VL.Net url, e-mail, homepage, gender, affiliation, and pairwise relations to the
lectures delivered by the author at VL.Net.
3. Data about events: a set of lectures can be associated with an event (e.g. a
specific conference). In a similar fashion, events can be further grouped into
metaevents. An event is described in a similar way as a lecture: it has a title, type (e.g.
project, event, course), language identifier, publication date, and a meta-event
identifier. The VideoLectures.Net dataset contains data about 519 events and
meta-events (245 events are manually categorized, 437 events are enriched with
textual descriptions).
4. Data about the categories: The data about the categories is represented in
the shape of the scientific taxonomy used on VL.Net. The taxonomy is described
in a pairwise form, using parent and child relations.
5. View statistics: The VideoLectures.Net software observes the users accessing
the content. Each browser, identified by a cookie, is associated with the sequence
of lectures that were viewed in the identified browser. Temporal information,
view durations, and/or user demographics are not available. The dataset contains
anonymized data of 329 481 distinct cookie-identified browsers. The data about
view statistics is given in the form of frequencies: (i) for a pair of lectures viewed
together (not necessarily consecutively) with at least two distinct cookie-identified
browsers; (ii) for pooled viewing sequences - triplets of lectures viewed together
prior to a given sequence of ten lectures. This is a special construct based on
aggregation of click-streams, which is used for training and scoring in task 2.
</p>
          <p>Creating pooled viewing sequences</p>
          <p>In order to comply with privacy-preserving constraints, lecture viewing sequences
for task 2 have been transformed into what we call pooled sequences. A pooled
viewing sequence is given by the set of three lectures on the left side (triplet) and a
ranked list of at most ten lectures on the right side. The set of three lectures does
not imply an ordering; it is merely a set that comes upstream of the lectures given on
the right of a pooled viewing sequence. The ranked list on the right side of a pooled
viewing sequence is constructed from all the clickstreams containing the particular triplet
on the left side. The transformation process for constructing pooled viewing
sequences is given below.</p>
          <p>Consider a sequence of viewed lectures:</p>
          <p>id1 → id7 → id2 → id1 → id4 → id5 → id6 → id3</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>We first filter out duplicates (here - id1):</title>
        <p>id1 → id7 → id2 → id4 → id5 → id6 → id3
Then, we determine all possible unordered triplets in the sequence. For each triplet,
cut the sequence after the right-most lecture from the triplet.</p>
        <p>In the above example, if {id1, id4, id5} is the triplet, the sequence is cut right after
id5. Finally, increase triplet-specific counts for all the lectures after the cut. In the
above example, given the triplet {id1, id4, id5}, the triplet-specific counts for id6 and
id3 are increased:</p>
        <p>{id1, id4, id5} → id6 : 1, id3 : 1
Suppose there is another click-stream sequence that, amongst others, contains the
unordered triplet {id1, id4, id5}, and that id6, id3, and id7 are lectures appearing after the
cut. Then the counts for {id1, id4, id5} are increased as follows:</p>
        <p>{id1, id4, id5} → id6 : 2, id3 : 2, id7 : 1</p>
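        <p>The transformation just described can be sketched in Python; this is an illustrative reconstruction of the steps, not the organizers' actual preprocessing code:</p>
        <preformat>
```python
from collections import defaultdict
from itertools import combinations

def pool_sequences(clickstreams, max_list_len=10):
    """Build pooled viewing sequences from raw clickstreams: map each
    unordered triplet of lectures to a ranked list of the lectures
    viewed after the triplet's cut point."""
    counts = defaultdict(lambda: defaultdict(int))
    for stream in clickstreams:
        # Step 1: filter out duplicate lecture ids, keeping first occurrences.
        seen, seq = set(), []
        for lec in stream:
            if lec not in seen:
                seen.add(lec)
                seq.append(lec)
        # Step 2: for every unordered triplet, cut the sequence right after
        # the right-most member of the triplet ...
        for triplet in combinations(seq, 3):
            cut = seq.index(triplet[-1])
            # ... and increase triplet-specific counts for lectures after the cut.
            for lec in seq[cut + 1:]:
                counts[frozenset(triplet)][lec] += 1
    # Step 3: keep a ranked list of at most ten lectures per triplet.
    return {triplet: sorted(c, key=c.get, reverse=True)[:max_list_len]
            for triplet, c in counts.items()}
```
        </preformat>
        <p>Applied to the two example clickstreams above, the function yields the ranked list id6, id3, id7 for the triplet {id1, id4, id5}.</p>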
        <p>Creating lecture co-viewing frequencies</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Consider two sequences of viewed lectures:</title>
      <p>id1 → id7 → id2 → id1,
id2 → id3 → id7.</p>
      <p>We first filter out duplicates in the sequences:
id1 → id7 → id2,
id2 → id3 → id7.</p>
      <p>Then, we determine the lecture co-viewing frequencies (CVS):
CVS (id1, id2) = 1, CVS (id1, id7) = 1,
CVS (id2, id7) = 2, CVS (id2, id3) = 1,
CVS (id3, id7) = 1.</p>
    </sec>
    <sec id="sec-4">
      <title>Table 1: Train-test data statistics</title>
      <p>Moment t1: 01.07.2009. Moment t2: 05.08.2010.
Total number of lectures in the train set: 6983
Total number of lectures in the test set: 1122
Number of common-view pairs in the train set: 363 880
Number of common-view pairs in the test set: 18 450</p>
      <p>A train-test split logic</p>
      <p>
Basic statistics of lectures in the training and test sets are given in Table 1. The common
view score matrix CVS is the lecture co-viewing frequency matrix collected at the site
at some moment t2; it represents the adjacency matrix of the lecture-lecture
graph G at moment t2, an undirected weighted graph over all lectures. Each
lecture in this graph carries temporal information: its date of publication on the
VideoLectures.Net site. Using a publication-date threshold t1, we partition G
into two disjoint graphs G1 and G2: each lecture in G1 was published before
the threshold, while each lecture in G2 was published after the threshold
t1. We define the pair common viewing time as the period that two lectures spend together
in the system. All lecture pairs (xi, xj) : xi ∈ G1, xj ∈ G1 have pair common time
strictly greater than (t2 − t1), and all lecture pairs (xi, xj) : xi ∈ G1, xj ∈ G2
have pair common time strictly less than (t2 − t1).</p>
      <p>In order to make a proper training-test split based on G1 and G2, we
had to ensure a similar distribution of pair common times in both the training and the test
set. We divided the nodes of subgraph G2, in a randomized fashion (with some
constraints), into two approximately equal sets (G21, G22), and appended G21
to the training set. The subset of lecture pairs (xi, xj) : xi ∈ G1, xj ∈ G21 in
the training set then has a distribution of pair common times that overlaps with that of the pairs
(xi, xj) : xi ∈ G1, xj ∈ G22 in the test set. Figure 1 gives the distribution of edges
related to the graphs G1 and G22.</p>
      <p>Finally, the train-test split logic was implemented through a series of steps:
1. Split the lectures by publication date into two subsets: old (publication date &lt;
July 01, 2009) and new (publication date ≥ July 01, 2009). Put the old lectures
into the training set;
2. Move all new lectures with a parent id occurring in the old lecture subset to the
training set;
3. Split the rest of the new lectures randomly into two disjoint sets of similar
cardinality, taking care of their parent ids:
(a) lectures with the same parent id can be only in one of the sets;
(b) lectures without parent id are just randomly divided between two sets.
4. Finally, add one of the disjoint sets to the training set; the other disjoint set
represents the test set.</p>
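      <p>The four steps above can be sketched in Python. This is a simplified illustration: the record format ('id', 'pub_date', 'parent_id') and the exact randomization constraints are assumptions, not the organizers' implementation.</p>
      <preformat>
```python
import random

def split_lectures(lectures, threshold_date, seed=0):
    """Sketch of the four-step train-test split. Each lecture is assumed to
    be a dict with 'id', 'pub_date' (ISO string) and 'parent_id' (or None)."""
    rng = random.Random(seed)
    # Step 1: old lectures (published before the threshold) form the training set.
    new = [lec for lec in lectures if lec['pub_date'] >= threshold_date]
    train = [lec for lec in lectures if not lec['pub_date'] >= threshold_date]
    old_ids = {lec['id'] for lec in train}
    # Step 2: new lectures whose parent id occurs among the old lectures
    # are moved to the training set as well.
    rest = []
    for lec in new:
        (train if lec['parent_id'] in old_ids else rest).append(lec)
    # Step 3: group the remaining new lectures by parent id so that lectures
    # sharing a parent stay on the same side; orphans form singleton groups.
    groups = {}
    for lec in rest:
        key = lec['parent_id'] if lec['parent_id'] is not None else ('solo', lec['id'])
        groups.setdefault(key, []).append(lec)
    side_a, side_b = [], []
    keys = list(groups)
    rng.shuffle(keys)
    for key in keys:
        # Greedily keep the two sides of similar cardinality.
        min(side_a, side_b, key=len).extend(groups[key])
    # Step 4: one disjoint set joins the training set; the other is the test set.
    train.extend(side_a)
    return train, side_b
```
      </preformat>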
      <p>At the end of the process, we get the training set consisting of all the lectures
with publishing date prior to July 01, 2009, together with approximately half of the
lectures after the aforementioned date, and the test set consisting of the rest of the
lectures published after the aforementioned date.
Due to the nature of the problem, each of the tasks has its own merit: task 1
simulates new-user and new-item recommendation (cold start mode); task 2 simulates
clickstream-based (implicit preference) recommendation.
The first task of the challenge is related to solving the so-called cold-start problem,
commonly associated with pure collaborative filtering (CF) recommenders. Generally,
cold start recommending quality should be measured through user satisfaction surveys
and analysis. For the challenge, one needs a quantitative measure and a simulated cold
start situation. In order to be able to score solutions, new video lectures are those that
entered the site more recently, but for which there is already some viewing information
available.</p>
      <p>In this task, we assume that the user has seen one of the lectures that entered the
site earlier (old lectures). As a solution for this task, a ranked list of lectures from
the new-lecture set is to be recommended after viewing one of the old lectures. The
length of the recommended list is fixed at 30 lectures. The overall score for a
submission is based on the mean average R-precision score (MARp), explained in
Section 5.</p>
      <p>The solution for task 1 is based on ranking lectures by the withheld lecture
co-viewing frequencies, in descending order. Suppose the co-viewing frequencies (CVS)
from some old lecture id1 to the new lectures {id2, id3, id4, id5} are:</p>
      <p>CVS (id1, id2) = 12, CVS (id1, id3) = 2,
CVS (id1, id4) = 43, CVS (id1, id5) = 3.</p>
      <p>Then we construct the solution ranked list for old lecture id1:
id1 : id4, id2, id5, id3.</p>
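      <p>Constructing such a reference list amounts to sorting the candidate new lectures by CVS in descending order; a small illustrative helper (names assumed, not from the challenge code):</p>
      <preformat>
```python
def rank_new_lectures(old_id, cvs, new_ids, k=30):
    """Rank new lectures for an old lecture by co-viewing frequency (CVS),
    in descending order; missing pairs count as zero. Returns the top k."""
    ranked = sorted(new_ids, key=lambda nid: cvs.get((old_id, nid), 0),
                    reverse=True)
    return ranked[:k]
```
      </preformat>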
      <p>Pooled lecture viewing sequences task</p>
      <p>In task 2, contestants are asked to recommend a ranked list of ten lectures to be
viewed after a set of three lectures. In contrast to task 1, this situation is close to a
typical recommendation scenario (submission and evaluation for task 2). The solution
for task 2 is based on ranking lectures by their frequencies in the withheld pooled
lecture viewing sequences, in descending order. Test lectures from task 1 are in this
case not included in the training pooled sequences, but they can be part of the ranked
solution list for task 2.</p>
      <p>Suppose there is a pooled lecture viewing sequence:</p>
      <p>{id1, id4, id5} → id6 : 5, id3 : 4, id7 : 2, id2 : 1,
then we construct solution ranked list for triplet {id1, id4, id5}:</p>
      <p>{id1, id4, id5} → id6, id3, id7, id2.</p>
      <sec id="sec-4-1">
        <title>Challenge evaluation function</title>
        <p>Taking into account the relative scarcity of items available for learning, recommending
and evaluation (especially in the cold-start task), we defined R-precision variants
of the standard information retrieval evaluation measures p@k and MAP. The overall
score of a submission is the mean value over all queries R (recommended lists r) given
in the test sets:</p>
        <p>MARp = (1/|R|) Σ_{r∈R} AvgRp(r)</p>
        <p>The average R-precision score AvgRp(r) for a single recommended ranked list r is
defined as:</p>
        <p>AvgRp(r) = (1/|Z|) Σ_{z∈Z} Rp@z(r)</p>
        <p>where Rp@z(r) is the R-precision at cut-off length z ∈ Z, defined as the ratio of the
number of retrieved relevant items to the number of relevant items at the particular
cut-off z of the list:</p>
        <p>Rp@z(r) = |relevant ∩ retrieved|_z / |relevant|_z = |relevant ∩ retrieved|_z / min(m, z)</p>
        <p>The number of relevant items at cut-off length z is defined as min(m, z), where m is
the total number of relevant items. When m ≤ z, the number of relevant items at z is m;
otherwise it is limited to the top z relevant items from the (real) solution
ranked list s. A special situation occurs when there are several equally relevant items
at the same rank (ties) at the cut-off length of the s list. In that case, any of these
items is treated as relevant (a true positive) when calculating Rp@z(r). For task 1,
the cut-off lengths for the calculation of MARp are z ∈ {5, 10, 15, 20, 25, 30}; for
task 2, they are z ∈ {5, 10}.
We introduced R-precision because it is better suited to our situation: it adjusts to
the size of the set of relevant documents. Typically, in information retrieval one
has to filter and rank a large pool of both relevant and irrelevant items. This is
not the case in the simulated cold-start situation of this challenge. As an example,
if there were only 4 items (lectures) in the whole collection relevant to a particular
query, a perfect recommender system would score 1 when measured by Rp@10, whereas its
p@10 would be only 0.4. Using this measure makes more sense for our application,
as the number of relevant items can vary from 1 to above 30; in such situations
Rp@z expresses the quality of retrieval at a predefined retrieval (cut-off) length
more fairly than p@z. The reason we use AvgRp(r) over a set of different Rp@z values
is that the averaging also takes ranking into account and, at the same
time, improves the ability to differentiate between similar solutions (recommenders).</p>
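        <p>Under these definitions the score can be computed directly; a minimal sketch (ignoring the tie-handling rule above, which requires the full ranked solution list):</p>
        <preformat>
```python
def rp_at_z(recommended, relevant, z):
    """R-precision at cut-off z: relevant items retrieved in the top z,
    divided by min(m, z), where m is the total number of relevant items."""
    hits = len(set(recommended[:z]).intersection(relevant))
    return hits / min(len(relevant), z)

def avg_rp(recommended, relevant, cutoffs):
    """Average R-precision over the task's cut-off lengths Z."""
    return sum(rp_at_z(recommended, relevant, z) for z in cutoffs) / len(cutoffs)

def marp(queries, cutoffs=(5, 10, 15, 20, 25, 30)):
    """Mean average R-precision over all queries; each query is a pair
    (recommended ranked list, set of relevant items)."""
    return sum(avg_rp(rec, rel, cutoffs) for rec, rel in queries) / len(queries)
```
        </preformat>
        <p>For the 4-relevant-item example above, rp_at_z at cut-off 10 is 1.0 for a perfect recommender, whereas plain p@10 would be 0.4.</p>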
        <p>We also considered the MAP (mean average precision) measure, which is the
closest to the proposed measure. However, MAP does not take into account the absolute
ranking positions of recommended items, since permutations of relevant (true positive)
items in the recommended list do not affect the MAP score.</p>
        <p>
          Normalized discounted cumulative gain (NDCG) [16, 17] takes into account that
relevant documents are more useful when appearing earlier in a recommendation
list. It is the most common measure used for ranking the results of a search in
information retrieval. This measure has also been used in other challenges where the
main task was to learn ranking [
          <xref ref-type="bibr" rid="ref14">14, 15</xref>
          ].
        </p>
        <p>If the ranking order need not be strict for the top-n item recommendations [18], the
“granularity” of the ranking can be relaxed. This is the main reason why we use the
MARp measure instead of NDCG. The proposed MARp measure takes into account
absolute ranking positions with a granularity of five items. This granularity was chosen
after studying the influence of ranking recall on recommender system evaluation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Challenge submissions results</title>
        <p>The ECML-PKDD 2011 Discovery Challenge started on 18 April and ended on 8 July
2011. The competition attracted a significant number of participants: 303 teams
with 346 members, with 62 and 22 active teams on the two tasks. More than 2000
submissions were sent, and the best approaches outperformed the baseline solution
several times over.</p>
        <p>Winners of the challenge for task 1 are:</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Winners of the challenge for task 2 are:</title>
      <p>The final scores, for the teams that scored better than the random recommender,
are presented in Figures 3 and 4, one for each task. The scores are
accompanied by graphs of the differences between the preliminary MARp score on the
leaderboard set and the final MARp score on the test set. For task 1 (Figure 3),
the majority of teams had positive difference scores, which may suggest
overtraining. In contrast, the majority of teams had negative difference scores
in task 2 (see Figure 4).</p>
      <p>The distributions of the average R-precision over queries for the winning entry
on each of the tasks are presented in Figure 5. The difference in distributions between
the tasks also reflects the difference in the approaches used: while for the first task the
main features for solving the problem are constructed from lecture content and metadata
similarity, for the second task only co-viewing information is utilized. We have
also noted that these distributions are qualitatively very similar among the three
top-ranked entries on each of the tasks, reflecting a general similarity in the approaches of
different teams.</p>
      <p>The dependence of the query average R-precision score on the size of the solution list for
task 1 is presented in Figure 6 (left graph). On average, the query score diminishes
only slightly as the solution list grows. In contrast, the dependence of the query
average R-precision score on the triplet frequency for task 2 (right graph in
Figure 6) shows that, on average, the quality of the result for a query is
proportional to the triplet frequency.
The teams approached task 1 using quite different learning techniques, with the
primary effort focused on feature engineering and optimization. Almost all of the
participants utilized all the lecture-content-related data (lecture taxonomy, event
tree, lecture types, descriptions, etc.), differing only slightly in their definitions of
the similarity of two lectures. Important for the overall score was
the process of filling in missing values for the lectures that lack some of the content-related
data. The winning solutions used a more sophisticated approach, filling in missing lecture
content and metadata feature values using lecture co-viewing information
(weighted CVS feature vector expansion [19], query expansion [20]), thus utilizing
collaborative information to “enrich” content-based features.</p>
      <p>Table 2 gives a summary of the feature engineering approaches and learning
methods used in solving the challenge tasks.</p>
      <sec id="sec-5-1">
        <title>Conclusion</title>
        <p>In the last couple of years, a number of challenges have been organized around
recommendation problems. Most of them focused on prediction problems related
to large-scale explicit or implicit user preference matrices, in some cases combined with
(mostly obfuscated) user, item and/or context related information. The ECML-PKDD
2011 Discovery Challenge differed from this mainstream in two aspects: (i)
instead of user preferences, only item-to-item preference information is available, in
the form of the co-viewing frequency graph; (ii) a rich and explicit description of the
lectures is available in the form of structured and unstructured text. On both tasks,
participants obtained significantly higher MARp values than those set by the baseline
solutions.</p>
        <p>The analysis of the results shows that the most important part of a successful
solution was careful feature engineering. Defining a similarity scoring function
capable of capturing content, context and temporal information turned out to
be crucial for success in the cold-start competition (task 1). Task 2, the pooled
sequence completion problem, was easier to solve, and both the approaches and the
results of the participants were much more similar to one another. Rather unexpectedly,
content-related information was not used in ranking the lectures to be viewed in
succession to the test set triplets. Most of the participants also reported on the
complexity and scaling of their solutions.</p>
        <p>In our opinion, the results of the challenge could be quite useful for constructing
a new recommendation system for VideoLectures.Net. In particular, there are
several approaches that could significantly improve the recommendation quality for new
lectures at the site, with a modest consumption of additional computational resources.
Using lecture co-viewing frequency information instead of the original preference
information in the form of click-streams should be studied in more detail, in order to
understand the implications of this transformation on personalized recommendation
quality from the user's perspective.</p>
        <p>Acknowledgements</p>
        <p>The Discovery Challenge 2011 has been supported by the EU collaborative project
e-LICO (GA 231519). The organizers of the Challenge are grateful to the Center
for Knowledge Transfer in Information Technologies of the Jožef Stefan Institute
and to Viidea Ltd for the data of the VideoLectures.Net site, and to TunedIT for the
professional support in conducting the competition. Finally, we want to thank all the
active participants of the challenge for their effort and their willingness to
share their solutions and experience through the contributions in this workshop.</p>
        <p>15. Internet Mathematics 2009 contest: Limited Liability Company,
http://imat2009.yandex.ru/academic/mathematic/2009/en/.
16. K. Jarvelin, J. Kekalainen: Cumulated gain-based evaluation of IR techniques. ACM
Transactions on Information Systems 20(4), pp 422-446, (2002).
17. B. Croft, D. Metzler, and T. Strohman: Search Engines: Information Retrieval in
Practice. Addison Wesley, (2009).
18. A. Turpin, W. Hersh: Why batch and user evaluations do not give the same results. In
Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development
in Information Retrieval. ACM, New York, pp 17-24, (2001).
19. A. D’yakonov: Two Recommendation Algorithms Based on Deformed Linear
Combinations. In Proc. of ECML-PKDD 2011 Discovery Challenge Workshop, pp 21-27, (2011).
20. E. Spyromitros-Xioufis, E. Stachtiari, G. Tsoumakas, and I. Vlahavas: A Hybrid
Approach for Cold-start Recommendations of Videolectures. In Proc. of ECML-PKDD 2011
Discovery Challenge Workshop, pp 29-39, (2011).
21. M. Možina, A. Sadikov, and I. Bratko: Recommending VideoLectures with Linear
Regression. In Proc. of ECML-PKDD 2011 Discovery Challenge Workshop, pp 41-49, (2011).
22. J. A. Kreiner and E. Abraham: Recommender system based on purely probabilistic
model from pooled sequence statistics. In Proc. of ECML-PKDD 2011 Discovery
Challenge Workshop, pp 51-57, (2011).
23. V. Nikulin: OpenStudy: Recommendations of the Following Ten Lectures After
Viewing a Set of Three Given Lectures. In Proc. of ECML-PKDD 2011 Discovery Challenge
Workshop, pp 59-69, (2011).
24. H. Liu, S. Das, D. Lee, P. Mitra, C. Lee Giles: Using Co-views Information to Learn
Lecture Recommendations. In Proc. of ECML-PKDD 2011 Discovery Challenge Workshop,
pp 71-82, (2011).
25. M. Chevalier, T. Dkaki, D. Dudognon, J. Mothe: IRIT at VLNetChallenge. In Proc. of
ECML-PKDD 2011 Discovery Challenge Workshop, pp 83-93, (2011).
26. L. Iaquinta and G. Semeraro: Lightweight Approach to the Cold Start Problem in the
Video Lecture Recommendation. In Proc. of ECML-PKDD 2011 Discovery Challenge
Workshop, pp 95-101, (2011).
27. G. Capan, O. Yilmazel: Joint Features Regression for Cold-Start Recommendation on
VideoLectures.Net. In Proc. of ECML-PKDD 2011 Discovery Challenge Workshop, pp
103-109, (2011).
28. N. Antulov-Fantulin, M. Bošnjak, T. Šmuc, M. Jermol, M. Žnidaršič, M. Grčar, P. Keše,
N. Lavrač: ECML/PKDD 2011 - Discovery challenge: VideoLectures.Net Recommender
System Challenge, http://lis.irb.hr/challenge/.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tso-Sutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huijsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wartena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brussee</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wibbels</surname>
          </string-name>
          :
          <article-title>Report on State of the Art Recommender Algorithms (Update)</article-title>
          .
          <source>MyMedia public deliverable D4.1</source>
          .2., (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          <article-title>Towards the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions</article-title>
          .
          <source>IEEE Transactions on knowledge and data engineering</source>
          ,
          <volume>17</volume>
          (
          <issue>6</issue>
          ) (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Montaner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lopez</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.L.</given-names>
            <surname>de la Rosa</surname>
          </string-name>
          :
          <article-title>A Taxonomy of Recommender Agents on the Internet</article-title>
          .
          <source>Artificial Intelligence Review</source>
          ,
          <volume>19</volume>
          , (
          <year>2003</year>
          ),
          <fpage>285</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          :
          <article-title>Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer</article-title>
          .
          <source>Addison Wesley</source>
          , (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          :
          <article-title>Modern Information Retrieval</article-title>
          .
          <source>Addison Wesley</source>
          , (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>W.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenstein</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Furnas</surname>
          </string-name>
          :
          <article-title>Recommending and Evaluating Choices in a Virtual Community of Use</article-title>
          .
          <source>Proc. Conf. Human Factors in Computing Systems</source>
          , (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Iacovou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suchak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bergstrom</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          :
          <article-title>GroupLens: An Open Architecture for Collaborative Filtering of Netnews</article-title>
          .
          <source>Proc. Computer Supported Cooperative Work Conf.</source>
          , (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>U.</given-names>
            <surname>Shardanand</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Maes</surname>
          </string-name>
          :
          <article-title>Social Information Filtering: Algorithms for Automating 'Word of Mouth'</article-title>
          .
          <source>Proc. Conf. Human Factors in Computing Systems</source>
          , (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>C.</given-names>
            <surname>Boutilier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.S.</given-names>
            <surname>Zemel</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Marlin</surname>
          </string-name>
          :
          <article-title>Active Collaborative Filtering</article-title>
          .
          <source>In Proc. of the Nineteenth Annual Conference on Uncertainty in Artificial Intelligence</source>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ungar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pennock</surname>
          </string-name>
          :
          <article-title>Generative models for cold-start recommendations</article-title>
          .
          <source>In Proceedings of the 2001 SIGIR Workshop on Recommender Systems</source>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Balabanovic</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shoham</surname>
          </string-name>
          :
          <article-title>Fab: Content-based, collaborative recommendation</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ), (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Basilico</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          :
          <article-title>Unifying collaborative and content-based filtering</article-title>
          .
          <source>In Proceedings of the Twenty-First International Conference on Machine Learning</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          , New York, NY, USA, ACM Press, (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>R.</given-names>
            <surname>Burke</surname>
          </string-name>
          :
          <article-title>Hybrid recommender systems: Survey and experiments</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ), pp
          <fpage>331</fpage>
          -
          <lpage>370</lpage>
          , (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          :
          <article-title>Yahoo! Learning to Rank Challenge Overview</article-title>
          .
          <source>JMLR: Workshop and Conference Proceedings</source>
          <volume>14</volume>
          , pp
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>