Recommending VideoLectures with Linear Regression

Martin Možina, Aleksander Sadikov, and Ivan Bratko
Faculty of Computer and Information Science, University of Ljubljana, Slovenia
{martin.mozina,aleksander.sadikov,ivan.bratko}@fri.uni-lj.si

Abstract. This paper describes our approach to Task 1 of the ECML PKDD 2011 VideoLectures.Net Recommender System Challenge. The task was to select a set of lectures to be recommended to a visitor of the VideoLectures.Net homepage after the visitor has already seen another lecture. Our proposed approach is a hybrid recommender system combining content-based and collaborative approaches. The core of the system is a linear regression model for predicting the rank of a lecture, where by rank we mean the lecture's position in the list of all lectures ordered by the visitor's interest. Due to the complexity of the problem, the model could not be learned by a classical approach; instead, we had to employ stochastic gradient descent optimization. Through evaluation, the paper furthermore identifies and describes some interesting properties of the domain and of the algorithm that were crucial for achieving higher prediction accuracy. The final accuracy of the model was enough to take third place in the competition.

1 Introduction

In this paper, we describe our approach to the first task (cold start) of the ECML PKDD 2011 Discovery Challenge (VideoLectures.Net Recommender System Challenge) hosted by TunedIT (http://tunedit.org/challenge/VLNetChallenge). The task was to make recommendations for video lectures on the VideoLectures.Net website. Our proposed approach is a hybrid recommender system [2] based on linear regression for predicting the ranks of newly acquired lectures after some of the "older" lectures have been viewed.

First, we give a short description of the domain and of the learning problem, and present the attributes that we decided to use in learning. Afterwards, we define the problem of learning ranks, present a method for learning a linear regression model based on stochastic gradient descent, and use it to make content-based and collaborative predictions. In Section 5, we evaluate the method and assess the results. We finish the paper with conclusions.

2 Domain description

The cold-start problem in the challenge was defined as follows:

1. after seeing an "old" lecture (a lecture present in the system for a while),
2. recommend 30 of the "new" lectures (published after the 1st of July 2009) that the user might like.

The results have to be ranked: more "likeable" lectures should come first. The organizers collected 6983 training (old) lectures and 1122 test (new) lectures. The task of the competitors was to make a selection of 30 test lectures for almost every training lecture and submit the results for evaluation.

The most important part of the provided data was the so-called pairs data set. It contained the frequencies of co-occurrences of lectures from the training set. Two lectures are said to co-occur if they were seen together (in the same browser session, not necessarily consecutively) and the user watched at least half of each of the two lectures. The co-occurrences between training and test lectures were withheld from the competitors and were used to produce the true rankings, against which the submitted solutions were evaluated.
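Since the pairs data set is central to everything that follows, it may help to fix a concrete in-memory representation for it. The sketch below is one minimal possibility; the file name and the whitespace-separated three-column layout are assumptions of ours, not the challenge's actual format.

```python
def load_pairs(path):
    """Read co-occurrence counts into a sparse dict keyed by lecture-id pairs.

    Assumes a whitespace-separated file with lines of the form
    'lecture_a lecture_b count'; the real challenge format may differ.
    """
    pairs = {}
    with open(path) as f:
        for line in f:
            a, b, count = line.split()
            # Store symmetrically: co-occurrence is order-independent.
            pairs[(int(a), int(b))] = int(count)
            pairs[(int(b), int(a))] = int(count)
    return pairs
```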
The provided lectures were described with the following attributes:

- type of lecture (discrete; sample values: normal lecture, tutorial, debate, invited talk, etc.)
- language (discrete; sample values: Slovene, English, etc.)
- recorded date (date value)
- published date (date value)
- title of lecture (textual attribute)
- description (a short textual description of the lecture)
- slide titles (textual attribute)
- categories (discrete; a lecture can belong to multiple categories, e.g. machine learning, reinforcement learning, etc.)
- event (discrete; the event where the lecture was recorded)
- author (discrete; a lecture can have several authors)
- views (the number of all views of the lecture; this information was given for the training lectures only)

For categories, events, and authors, additional information was provided. Each category was specified by its name and a link to its description on Wikipedia. Events were described with: type (normal event, project, etc.), language of the event, recorded and published dates, and the name and a short description (both textual) of the event. Finally, for each author we were given his or her name, email, homepage, gender, and organization.

2.1 Attributes used in Learning

We manually constructed a set of attributes to be used in learning. Some of them were simply taken as originally provided, while others were combinations of the original attributes. The set of used attributes is summarized in Table 1.

Published date is the number of days between the day the lecture was published and the day the snapshot of the database was taken (31st July 2010). Categories is the list of categories that the lecture belongs to. The Authors attribute contains the names of the actual authors and also the domain names extracted from their email addresses. Therefore, if a lecture was presented by two persons coming from Google, the lecture's value of the Authors attribute would be: "name of first author", "name of second author", and "google.com". Significant title words are a selection of specific words found in the titles of lectures that, as we assumed, imply a higher number of views. These words are: "introduction", "tutorial", "basic", "lecture 1:", "lecture 2:", "lecture 3:", "lecture 4:", and "lecture". The last attribute, Title words, contains all words mentioned at least once in a title.

The constructed attributes include almost all the information provided, with the exception of the textual attributes, of which only the titles of lectures were used. The original attributes not included in our model are: the recording date of a lecture, its description and slide titles, its number of views, the descriptions of categories on Wikipedia, anything about events, and the authors' homepages and organizations.

Table 1. An overview of the attributes used in learning. No. of values is the number of distinct values found in the training and test data. The column Can have multiple values? specifies whether a single lecture can have several values (e.g. a lecture can belong to several categories) or not.

Name                     Type        No. of values  Can have multiple values?
Published date           continuous  n/a            no
Type                     discrete    17             no
Language                 discrete    9              no
Categories               discrete    291            yes
Authors                  discrete    8109           yes
Events                   discrete    520            yes
Significant title words  discrete    9              yes
Title words              discrete    12812          yes
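To make the attribute construction above concrete, the following sketch derives the Authors attribute (author names plus email domains) and the Significant title words attribute for one lecture. The function names and the input representation are hypothetical, and the substring matching is a simplification (it ignores word boundaries).

```python
SIGNIFICANT_WORDS = ["introduction", "tutorial", "basic", "lecture 1:",
                     "lecture 2:", "lecture 3:", "lecture 4:", "lecture"]

def authors_attribute(authors):
    """authors: list of (name, email) tuples -> set of attribute values.

    E.g. [("Jane Doe", "jane@google.com")] -> {"Jane Doe", "google.com"}.
    """
    values = set()
    for name, email in authors:
        values.add(name)
        if "@" in email:
            values.add(email.split("@", 1)[1].lower())
    return values

def significant_title_words(title):
    """Return the subset of the significant words that occur in the title."""
    lowered = title.lower()
    return {w for w in SIGNIFICANT_WORDS if w in lowered}
```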
3 Content-Based Recommendations with Linear Regression

3.1 Terminology

Throughout the remainder of the paper we use the following conventions. Lectures L_O and L_N correspond to an "old" and a "new" lecture, respectively. To address an attribute value of a lecture we use a dot and the name of the attribute; for instance, L_N.pub_date is the published date of a new lecture. The Greek letters β and γ are used for the parameters of the linear regression model. Attr denotes the set of learning attributes from Table 1, and Train is the set of all training lectures.

3.2 The Linear Regression Model for Predicting Rank

In this section, we describe our linear regression model for predicting the rank of a lecture L_N given that the user has already seen another lecture L_O. To learn such a model, we needed to estimate the true ranks of lectures in the training data and use them as the values of the class variable. Let Rank(L_O, L_N) be the rank of lecture L_N given that L_O has been seen by the user, and let Pairs(L_O, L_N) be the number of co-occurrences of L_O and L_N taken from the pairs data set. To compute the ranks Rank(L_O, L_N) for all training lectures L_N, we used the standard ranking algorithm:

1. For a given training lecture L_O, create a list of co-occurrences Pairs(L_O, L_N) for all L_N, where L_N ≠ L_O and L_N and L_O are not from the same event.
2. Order the list in ascending sequence.
3. The rank Rank(L_O, L_N) is the position of Pairs(L_O, L_N) in the above ordered list. The pair with the lowest value Pairs(L_O, L_low) gets rank 1 and the pair with the highest value gets a rank equal to the length of the list. In the case of ties, the mean rank is assigned.
4. Divide all ranks Rank(L_O, L_N) by the length of the ordered list:

$$Rank(L_O, L_N) \leftarrow \frac{Rank(L_O, L_N)}{\text{length of the list}}.$$

This procedure is repeated for every training lecture L_O. We avoided using lectures from the same event (first step) to make the learning data more alike to the test data; namely, two lectures, one from the learning set and one from the test set, always belong to different events. The second and third steps are standard steps in ranking. The normalization in the last step was applied to avoid different maximal ranks when the lengths of the lists differ.

The main reason to use ranking, instead of predicting Pairs(L_O, L_N) directly, was to minimize the influence of lecture L_O. For example, if L_O is a tutorial, it will for this reason have a high number of views, and hence Pairs(L_O, L_N) will also be relatively high for all L_N. The ranks, on the other hand, depend only on L_N and on the similarities between L_O and L_N, but not much on L_O itself.

The linear regression model for predicting rank has the form:

$$\widehat{Rank}(L_O, L_N) = \beta_0 + \beta_{date} \cdot \min(L_O.pub\_date,\, L_N.pub\_date) + \sum_{a \in Attr \setminus \{pub\_date\}} \Bigg[ \sum_{v \in L_N.a} \beta_{a.v} \Bigg] + \sum_{a \in Attr \setminus \{pub\_date\}} \Bigg[ \sum_{v \in L_N.a \,\cap\, L_O.a} \gamma_{a.v} \Bigg] \tag{1}$$

The first term, β_0, is the intercept and is usually close to 0.5, which is the average normalized rank of lectures. The second term contains β_date, the parameter that models the influence of the published date. This parameter has to be positive, as lectures with a low published date were published later and therefore had less chance to be viewed. The third term is the sum over all remaining attributes and values of the lecture L_N. Each attribute value has an assigned parameter β_{a.v} that models the increase (or decrease) of the rank of a lecture if the lecture contains this value. The last term is again the sum over all attributes, but considering only the values that are the same for L_O and L_N. The parameter γ_{a.v}, therefore, models the change in the rank of L_N if both lectures have the same value of attribute a. This parameter is usually positive, as lectures with the same values are more alike, and it is therefore more probable that they will be viewed together.
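A direct transcription of Eq. 1 into code can make the roles of the β and γ parameters easier to see. The sketch below assumes lectures are represented as dicts mapping attribute names to sets of values, plus a numeric pub_date field, with the parameters held in plain dictionaries; this is one possible reading of the model, not our actual implementation.

```python
def predict_rank(L_O, L_N, beta0, beta_date, beta, gamma, attrs):
    """Evaluate Eq. 1.

    L_O, L_N: dicts mapping attribute names to sets of values, plus a
    numeric entry 'pub_date'. beta and gamma map (attribute, value)
    pairs to learned weights; missing keys count as 0. attrs is the
    list of attributes excluding pub_date.
    """
    rank = beta0 + beta_date * min(L_O["pub_date"], L_N["pub_date"])
    for a in attrs:
        for v in L_N[a]:                 # values of the new lecture
            rank += beta.get((a, v), 0.0)
        for v in L_N[a] & L_O[a]:        # values shared by both lectures
            rank += gamma.get((a, v), 0.0)
    return rank
```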
Such a process, where the recommendation relies only on the characteristics of the lectures, is commonly called content-based recommendation [2]. In Section 4, we describe a possible approach to extending this model with some techniques from collaborative filtering.

3.3 The Learning Algorithm

The learning algorithm has to find such parameters (βs and γs) as to minimize the residual sum of squares. A penalty factor λ, as in ridge regression [1], was included to prevent overfitting:

$$Err = \sum_{L_O \in Train} \sum_{\substack{L_N \in Train \\ L_N.event \neq L_O.event}} \Big( \widehat{Rank}(L_O, L_N) - Rank(L_O, L_N) \Big)^2 + \lambda \Bigg( \beta_{date}^2 + \sum_{a \in Attr \setminus \{pub\_date\}} \sum_{v \in a} \big( \beta_{a.v}^2 + \gamma_{a.v}^2 \big) \Bigg) \tag{2}$$

Usually, the parameters of a linear regression are fit easily using the closed-form formula involving matrix multiplication and inversion. However, with 6983 training lectures, we have approximately 6983² = 48762289 ranks to predict. Moreover, we are dealing with a large number of parameters: adding up all the values from Table 1 and multiplying by 2 gives 43544 parameters. The data matrix, in statistics called the design matrix, would therefore be enormous. Even if each multi-valued attribute had only one value, the matrix would still contain 48762289 × 43544 entries. Even with optimal coding, where each value takes only 1 byte, such a matrix would require over 2 TB of memory. The data could be coded more efficiently by exploiting the sparseness of the input (most of the values are 0); however, we believe it would still present a problem for many implementations of linear regression.

Instead, we used the stochastic gradient descent algorithm. This algorithm iteratively updates the model given one learning example at a time and removes the need to store the complete data set. An outline of the algorithm is given in Algorithm 1 below. The algorithm begins by setting all parameters to 0. With the third line, it commences an iteration over all attributes (including an attribute with constant value 1, used to learn the β_0 parameter). In each iteration, the parameters for the values of only one attribute are fit. The algorithm keeps optimizing the parameters of one attribute as long as the error of the model is decreasing (the while clause in line 5). Lines 8-16 contain the core of the stochastic gradient descent algorithm: the algorithm makes one sweep over the pairs of training lectures (avoiding those from the same event) and updates the values of the parameters in the updateFeatures procedure. The update of the parameters consists of the two classical steps of gradient optimization: 1) compute the gradient of the error function (Eq. 2) on the current learning example for all relevant parameters, and 2) update these parameters by a small negative ratio of the corresponding gradient.
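The update step just described can be written out explicitly: for one training pair, the squared-error term in Eq. 2 has gradient 2·(prediction − true rank) with respect to each active parameter, and the ridge penalty contributes 2λ·parameter. The sketch below applies both for the currently optimized attribute, reusing the predict_rank sketch above. The learning rate lr, the per-example application of the λ penalty, and the frequency-based MF check are assumptions of this sketch, not our exact scheme.

```python
def update_features(L_O, L_N, a, params, all_attrs, true_rank,
                    lr, lam, freq, mf):
    """One stochastic gradient step for the pair (L_O, L_N), touching only
    the parameters of the currently optimized attribute a (cf. Alg. 1).

    params holds 'beta0', 'beta_date', 'beta' and 'gamma'; beta/gamma are
    dicts keyed by (attribute, value). freq counts how many lectures carry
    each value; values rarer than mf stay frozen at 0 (the MF
    hyperparameter described below).
    """
    pred = predict_rank(L_O, L_N, params["beta0"], params["beta_date"],
                        params["beta"], params["gamma"], all_attrs)
    residual = pred - true_rank   # d(sq. error)/d(prediction) = 2 * residual
    for v in L_N[a]:
        if freq.get((a, v), 0) < mf:
            continue
        w = params["beta"].get((a, v), 0.0)
        params["beta"][(a, v)] = w - lr * (2 * residual + 2 * lam * w)
    for v in L_N[a] & L_O[a]:     # shared values drive the gamma terms
        if freq.get((a, v), 0) < mf:
            continue
        w = params["gamma"].get((a, v), 0.0)
        params["gamma"][(a, v)] = w - lr * (2 * residual + 2 * lam * w)
```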
The learning is governed by four hyperparameters of the algorithm:

- ITER: the number of passes over the attributes made by the algorithm.
- MP: the minimal number of pairs Pairs(L_O, L_N) for a given L_O with Pairs(L_O, L_N) > 0. A high number of non-zero pairs assures that the ranks Rank(L_O, L_N) were estimated better.
- λ: the penalty factor from the error function in Equation 2.
- MF: the minimal frequency of an attribute value. In order to update a parameter for an attribute value, at least MF lectures must have this value; otherwise, the parameter's value remains at 0. For example, if MF=10, then, since there are only 6 lectures in the Russian language in the training set, the parameters related to the Russian language would remain at 0.

Algorithm 1 Skeleton of the algorithm for fitting the parameters of the linear regression model. Inputs: training data Train and the hyperparameters ITER, MP, λ, and MF. Outputs: parameters βs and γs of the linear model.

 1: set values of all parameters (βs and γs) to 0
 2: for n = 1 → ITER do
 3:   for each a in Attr do
 4:     oldError, newError ← ∞, 0 {So that the while condition holds on first entry.}
 5:     while oldError > newError do {Keep sweeping while the error decreases.}
 6:       if newError > 0 then oldError ← newError end if; newError ← 0 {Remember the previous sweep's error and reset the accumulator.}
 7:       call shuffle(Train) {Shuffling the order of the training examples is a standard step in stochastic gradient descent optimization.}
 8:       for each LO in Train do
 9:         for each LN in Train do
10:           if LO.event == LN.event or LO occurs in fewer than MP non-zero pairs then
11:             continue
12:           end if
13:           newError ← newError + (predictRank(LO, LN) − trueRank(LO, LN))²
14:           call updateFeatures(LO, LN, a, λ, MF) {updateFeatures computes the derivatives and updates the parameters accordingly.}
15:         end for
16:       end for
17:     end while
18:   end for
19: end for

4 Towards a Collaborative Approach

When the description of a lecture L_O is inadequate or even wrong, the content-based approach by itself cannot provide good results. In such cases, it might be better to predict using some of the lectures that were often viewed together with L_O. The question is how to select these lectures. A natural choice for measuring the importance of a "friendly" lecture is the number of its co-occurrences with L_O. A weighted sum, where the weights are the co-occurrences, implements this idea:

$$Coll(L_O, L_N) = \frac{\sum_{j \in Train} Pairs(L_O, L_j) \cdot \widehat{Rank}(L_j, L_N)}{\sum_{j \in Train} Pairs(L_O, L_j)}, \tag{3}$$

where the ranks $\widehat{Rank}(L_j, L_N)$ are estimated with the linear model described in the previous section.

The above formula neglects the content-based prediction $\widehat{Rank}(L_O, L_N)$ completely. A hybrid approach, combining the content and collaborative approaches, could result in a more accurate predictor:

$$Hybrid(L_O, L_N) = \frac{\widehat{Rank}(L_O, L_N) + CC \cdot Error(L_O) \cdot Coll(L_O, L_N)}{1 + CC \cdot Error(L_O)}, \tag{4}$$

where the term Error(L_O) is the mean squared error of the content-based predictor with respect to the lecture L_O:

$$Error(L_O) = \frac{\sum_{\substack{L_N \in Train \\ L_N.event \neq L_O.event}} \big( \widehat{Rank}(L_O, L_N) - Rank(L_O, L_N) \big)^2}{\text{number of lectures in } Train}. \tag{5}$$

The motivation for the error term is to give a larger weight to the collaborative predictor when the content-based predictor makes less accurate predictions. The parameter CC sets the proportion of the error to be used as a weight.
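Equations 3-5 combine into a short routine. The sketch below assumes the content-based predictions are available through a rank_hat callable, the co-occurrence counts through a pairs dict as in the earlier sketch, and the Eq. 5 errors through a precomputed mapping; these names are ours, not part of the paper.

```python
def hybrid(L_O_id, L_N_id, train_ids, pairs, rank_hat, error, cc):
    """Eq. 4: blend the content-based prediction with the collaborative
    estimate of Eq. 3, weighted by the content model's error on L_O (Eq. 5).

    rank_hat(j, n) returns the linear model's predicted rank; pairs is the
    co-occurrence dict; error maps lecture ids to their Eq. 5 values.
    """
    # Eq. 3: co-occurrence-weighted average of predicted ranks.
    num = sum(pairs.get((L_O_id, j), 0) * rank_hat(j, L_N_id)
              for j in train_ids)
    den = sum(pairs.get((L_O_id, j), 0) for j in train_ids)
    coll = num / den if den else 0.0   # guard: L_O may have no co-occurrences
    w = cc * error[L_O_id]
    return (rank_hat(L_O_id, L_N_id) + w * coll) / (1 + w)
```

Note that with cc = 0 the weight w vanishes and the expression reduces to the content-based prediction alone, matching the behaviour of the CC hyperparameter discussed in the evaluation.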
5 Evaluation

The submitted rankings were evaluated with the mean average R-precision score (MARp), a measure commonly used in information retrieval. Initially, we tried to produce a representative holdout set from the provided training data, to be used for optimizing the algorithm's hyperparameters. However, as we were unable to sample a set that would give results similar to those on the leaderboard, we decided to use the available number of submissions (60) to select the hyperparameters.

Table 2. The results of the experiments on 20% of the test data (used for the leaderboard) during hyperparameter optimization. Only the relevant experiments are shown. The initial values were MF=10, λ=200, MP=0, CC=0, ITER=4; the final values are MF=3, λ=200, MP=200, CC=3, ITER=8. Alongside each value, the MARp achieved on the leaderboard is given in brackets.

MF           10 (0.157)   5 (0.162)    3 (0.264)
λ            200 (0.264)  100 (0.263)  50 (0.262)
MP           0 (0.264)    50 (0.271)   100 (0.288)  200 (0.290)
CC (MP=100)  0 (0.288)    3 (0.290)    6 (0.289)
ITER         4 (0.292)    8 (0.307)

Table 2 shows some of the relevant experiments. We omit several initial experiments and all experiments not leading to the final model. At the start, the values of the hyperparameters were MF=10, λ=200, MP=0, CC=0, and ITER=4.

We started off with larger values of MF and noticed a significant growth in performance when MF was reduced to 3. It seems there must be some attribute values occurring in only three training lectures that are nevertheless critical for making good predictions for the new lectures. The next parameter tested was λ, which appeared to have a small influence on prediction: there was a slight decline in precision when λ was decreased, but the change was negligible.

After fixing MF and λ, we moved on to MP. We increased its value from 0 up to 200 and noticed that values of 100 and above clearly lead to more accurate models. As described above, with high values of MP the algorithm learns from lectures whose ranks were estimated more reliably; in machine learning terms, we are removing examples with large noise in the class value. Since the increase in precision was minimal when MP went from 100 to 200, we set 200 as the final value of the parameter.

Afterwards, we investigated whether the hybrid approach helps to improve the model. The CC parameter weighs the importance of the collaborative predictor: setting it to zero results in the use of the content predictor only, while high values give preference to the collaborative predictor. As it turned out (to our surprise), different values of CC did not affect the performance much. The results suggest that both approaches make similarly good predictions, and that their combination brings only a slight improvement. The final value of CC was set to 3.

Lastly, we explored the ITER parameter, the number of passes made by the algorithm over all attributes. Initially it was set to 4 and was then increased to 8. Since 8 passes led to notably better results, it seems that our algorithm converges slowly towards the global optimum, and it would have been useful to let it run even longer. However, we were out of available submissions.

The final score on the leaderboard was 0.307, while the final result on the complete test set was only 0.277. Such a difference was expected, as the hyperparameters of the algorithm were optimized using the leaderboard data. This was just enough to achieve third place in the competition, with a small advantage over the fourth place (0.271). The first- and second-placed competitors scored 0.359 and 0.307, respectively.
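The MARp score is not given in closed form in the paper. For readers unfamiliar with it, the sketch below shows plain R-precision averaged over queries: R-precision is the fraction of the top-R recommended items that are relevant, where R is the number of relevant items. The challenge's "mean average" variant reportedly also averages over several list cutoffs, so this plain mean is only an approximation of the official measure.

```python
def r_precision(recommended, relevant):
    """Fraction of the top-R recommendations that are relevant,
    with R = number of relevant items (standard IR definition)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return len(set(recommended[:r]) & set(relevant)) / r

def mean_r_precision(queries):
    """Average R-precision over (recommended, relevant) pairs, one pair
    per query lecture. The official MARp additionally averages over
    several cutoffs (an assumption); this is a simplification."""
    return sum(r_precision(rec, rel) for rec, rel in queries) / len(queries)
```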
6 Conclusions

In this paper, we have presented a model for predicting the ranks of lectures. It is a large-scale linear regression model that cannot be fit by conventional methods; hence we used stochastic gradient descent. We implemented content-based and collaborative approaches. Both exhibited similar prediction accuracies, with a small improvement when they were used together.

The evaluation showed an interesting property of the data: there are some rare attribute values, occurring in only 3 (out of 6983) training lectures, that are critical for good prediction on the test data. It would be interesting to examine this matter further and seek out which these values are. Furthermore, it turned out to be important to have reliable class values: removing the lectures with a low number of views and learning the model from well-represented lectures resulted in significantly better models.

During the experiments we noticed that our stochastic gradient descent converged only slowly towards an optimum. Increasing the number of passes visibly improved the accuracy of the model, and we believe that letting it run even longer would result in further improvements. Another way of improving our model would be to consider the textual attributes, to which we did not pay much attention. Finally, it would also be useful to construct more sensible attributes for the given domain. A possible approach would be to use argument based machine learning (ABML), where new attributes are constructed from arguments (explanations) given for some of the critical learning examples [4, 3].

Acknowledgements

This work was partly supported by the Guru Cue d.o.o. company (http://www.gurucue.com/) and by the Slovene Agency for Research and Development (ARRS). We would also like to thank the organizers for putting up an interesting challenge.

References

1. Hoerl, A.E., Kennard, R.W.: Ridge regression: Applications to nonorthogonal problems. Technometrics 12(1), 69-82 (1970)
2. Jannach, D., Zanker, M., Felfernig, A., Friedrich, G.: Recommender Systems: An Introduction. Cambridge University Press (2010)
3. Možina, M., Guid, M., Krivec, J., Sadikov, A., Bratko, I.: Fighting knowledge acquisition bottleneck with argument based machine learning. In: The 18th European Conference on Artificial Intelligence (ECAI). pp. 234-238. Patras, Greece (2008)
4. Možina, M., Žabkar, J., Bratko, I.: Argument based machine learning. Artificial Intelligence 171(10-15), 922-937 (2007)