Clustering multi-relational TV data by diverting supervised ILP

Vincent Claveau
IRISA - CNRS
Campus de Beaulieu, Rennes, France
vincent.claveau@irisa.fr

Abstract

Traditionally, clustering operates on data described by a fixed number of (usually numerical) features; this description schema is said to be propositional or attribute-value. Yet, when the data cannot be described in that way, usual data-mining or clustering algorithms are no longer suitable. In this paper, we consider the problem of discovering similar types of programs in TV streams. The TV data have two important characteristics: 1) they are multi-relational, that is to say with multiple relationships between features; 2) their interpretation requires external background knowledge. To process the data, we use Inductive Logic Programming (ILP) [MD94]. In this paper, we show how to divert ILP to work unsupervised in this context: from artificial learning problems, we induce a notion of similarity between broadcasts, which is later used to perform the clustering. The experiments presented show the soundness of the approach and thus open up many research avenues.

1 Introduction

Many TV services require the TV stream to be correctly segmented and tagged (thematic corpora from archives, TV on demand...). Thus, one needs a complete TV guide, also documenting inter-programs (short spots between main programs, such as ads, trailers...), with a very high precision (at the frame level). Such guides usually do not exist, which makes their automatic construction necessary. This task is at the heart of automatic structuring of TV streams. Several approaches have been proposed; some rely on meta-data [Pol08] or audio/video clues [NG08, MB10, IG11]. They all rely on a supervised classification step (assigning a class to each TV segment), thus requiring a priori knowledge (the user needs to define the classes) and too much manually annotated data to be actually usable.
In this paper, we propose to reduce this important a priori involvement of the user by tackling the problem as an unsupervised one, that is, as clustering. The remaining role of the user would then be to tag the clusters. As with the well-known k-means, clustering techniques rely on a simple representation of the data and on a notion of distance operating on these representations, which has to be provided by the user [Jai10]. In our case, this leads to two problems. First, our data need to be represented in a complex way, as they are multi-relational. Second, we do not know how to define a priori a relevant distance over these complex representations. In this paper, we propose to define a clustering technique suited to our complex data by diverting supervised Inductive Logic Programming (ILP) into an unsupervised technique. ILP makes it possible to easily represent our multi-relational data, and a distance between broadcasts is automatically derived from fake supervised classification problems, in the vein of [SH05, CN13].

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Nicolas Lachiche, Christel Vrain (eds.): Late Breaking Papers of ILP 2017, Orléans, France, September 4-6, 2017, published at http://ceur-ws.org

Figure 1: Multiple relations of the trailer Clara Sheller.

2 ILP and multi-relational data

For classification problems, objects are usually described in a propositional form, also said attribute-value or vector-based. In this representation, objects must have the same number of features, and the features are considered independently (relations between features are not exploited). In our case, each object is a segment of TV stream corresponding to a program or an inter-program. But each object may have several occurrences, such as a particular ad which is repeated several times in the stream. The number of occurrences varies from one object to another, which makes the attribute-value description impossible.
Moreover, certain relations between occurrences may be very relevant (e.g. two occurrences are broadcast on different TV channels, two occurrences are broadcast less than 1 day apart...). This multi-relational aspect of our data is thus important to consider for the clustering task. Figure 1 shows these different relations between occurrences and their features as arrows with different colors (in gray: the class of broadcast, which is unknown in our problem).

ILP is usually used as a supervised machine learning technique able to infer rules (e.g. Horn clauses) H from examples (E+) and counter-examples (E-) of a concept, with the help of background knowledge B [MD94]. Figure 2 shows how a program can be described in B (with standard Prolog). One can see how the relations between the occurrences are easily encoded with the predicates next_occ/2 and next_in_stream/2. In B we also define the predicates that can be used to infer rules in H, such as prev_occ/2, which indicates two occurrences of the same program, one occurring after the other, or interval/3, which indicates the time interval between two occurrences of two programs. Here is an example of a rule that can be inferred:

broadcast(A) :- has_occ(A,B), duration(B,3), next_occ(B,C), next_in_stream(B,D), next_in_stream(C,E), has_occ(F,D), has_occ(F,E).

This rule highlights the interest of the multi-relational representation: it covers every broadcast A having two occurrences B and C, lasting 3 seconds, such that these two occurrences are followed by two occurrences (D, E) of a same program (F). This rule typically covers sponsoring broadcasts that always appear before a program.

3 From supervised to unsupervised

3.1 Principles

Our approach aims at deducing distances (or similarities) between two programs from repeated random classification problems with ILP. For a given random classification problem, if the two programs are covered by H, it tends to show that they are related.
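To make the coverage test of such a rule concrete, here is a minimal Python sketch. It is a hypothetical stand-in for the actual Prolog resolution: the facts mirror the style of Figure 2, but the broadcasts `b12`/`b7` and the helper `owner` are invented for illustration.

```python
# Hypothetical facts, dictionary-encoded: occurrences of each broadcast,
# their durations, the next occurrence of the same broadcast, and the
# next occurrence in the stream (regardless of broadcast).
has_occ = {"b12": ["b12_occ1", "b12_occ2"], "b7": ["b7_occ1", "b7_occ2"]}
duration = {"b12_occ1": 3, "b12_occ2": 3, "b7_occ1": 120, "b7_occ2": 120}
next_occ = {"b12_occ1": "b12_occ2"}
next_in_stream = {"b12_occ1": "b7_occ1", "b12_occ2": "b7_occ2"}

def owner(occ):
    """Return the broadcast a given occurrence belongs to."""
    return next(b for b, occs in has_occ.items() if occ in occs)

def rule_covers(a):
    """broadcast(A) :- has_occ(A,B), duration(B,3), next_occ(B,C),
       next_in_stream(B,D), next_in_stream(C,E), has_occ(F,D), has_occ(F,E)."""
    for b in has_occ[a]:
        if duration.get(b) != 3 or b not in next_occ:
            continue
        c = next_occ[b]
        d, e = next_in_stream.get(b), next_in_stream.get(c)
        if d and e and owner(d) == owner(e):
            return True
    return False

print(rule_covers("b12"))  # both 3 s occurrences of b12 precede b7 -> True
```

An ILP engine performs this matching by unification over B; the sketch only illustrates what "covering" means on the multi-relational encoding.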
If this is the case for every random classification problem, it means that they are very similar. Algorithm 1 gives an overview of the process. As with bagging [Bre96], classification is repeated many times with different learning parameters: examples (step 3, which divides the data into a positive set E+_train and an out-of-bag set E_OoB used later), counter-examples (step 4), and the hypothesis language (step 5). At each iteration, we record the pairs of programs (x_i, x_j) that are covered by the same inferred clauses (called co-covers hereafter) in a matrix M_co-cov. One can give more weight to a clause covering very few examples, and less weight to a clause covering most of the examples (function weight). The last step is simply to use a standard clustering technique on the co-cover matrix, considered as a similarity matrix. In the experiments presented below, we use Markov Clustering [vD00]. Its main advantage compared with k-means/k-medoids is to avoid the need to decide a priori the number of expected clusters. The strategy at the heart of this approach is to vary the learning biases at each iteration.

%%% description of the 1st occurrence
has_occ(broadcast12,b12_occ1).
duration(b12_occ1,69).
date_time(b12_occ1,20,42,1,10,june,2005,friday).
channel(b12_occ1,2).
next_occ(b12_occ1,b12_occ2).
next_in_stream(b12_occ1,b28_occ5).

%%% description of the 2nd occurrence
has_occ(broadcast12,b12_occ2).
date_time(b12_occ2,24,48,19,10,june,2005,friday).
duration(b12_occ2,66).
channel(b12_occ2,2).
next_occ(b12_occ2,b12_occ3).
next_in_stream(b12_occ2,b5_occ19).
...

%%% other knowledge / predicate definitions
prev_occ(Occ1,Occ2) :- next_occ(Occ2,Occ1).
interval(Occ1,Occ2,Duration) :-
    date_time(Occ1,H1,Min1,S1,D1,M1,Y1,_),
    date2epoch(H1,Min1,S1,D1,M1,Y1,Epoch1),
    date_time(Occ2,H2,Min2,S2,D2,M2,Y2,_),
    date2epoch(H2,Min2,S2,D2,M2,Y2,Epoch2),
    Duration is abs(Epoch1-Epoch2).
...

Figure 2: Excerpt of the example description and background knowledge.
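The co-cover accumulation (steps 7-11 of Algorithm 1) can be sketched as follows. This is a simplified illustration: the per-clause coverage sets are assumed to be precomputed by the ILP step, and the 1/coverage weighting is one plausible instance of the weight function, not necessarily the paper's exact choice.

```python
import itertools
from collections import defaultdict

def cocover_matrix(coverage_per_iteration):
    """Accumulate co-cover weights (steps 7-11 of Algorithm 1).

    coverage_per_iteration: for each random classification problem,
    a list of sets; each set holds the out-of-bag programs covered
    by one inferred clause h_l.
    """
    M = defaultdict(float)
    for clauses in coverage_per_iteration:
        for covered in clauses:
            if len(covered) < 2:
                continue
            # a clause covering few examples contributes more weight
            w = 1.0 / len(covered)
            for ea, eb in itertools.combinations(sorted(covered), 2):
                M[(ea, eb)] += w
    return M

# two iterations: one broad clause, then one more specific clause
M = cocover_matrix([[{"p1", "p2", "p3"}], [{"p1", "p2"}]])
```

Read as a similarity matrix, M is then handed to the clustering step (Markov Clustering in our experiments).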
Algorithm 1 Clustering with ILP
1: input: E_total: programs
2: for i in [1..N] do
3:   E_i^+, E_i^OoB <- Divide(E_total)
4:   Generate negative examples E_i^-
5:   Randomly generate the hypothesis language L_H^i and the ILP parameters θ_i
6:   Inferring: H_i <- ILP(E_i^+, E_i^-, L_H^i, θ_i)
7:   for all clauses h_l among H_i do
8:     for all pairs e_a, e_b from E_i^OoB such that B, h_l ⊢ e_a, e_b do
9:       M_co-cov(e_a, e_b) += weight_i(e_a, e_b)
10:    end for
11:  end for
12: end for
13: return clustering(M_co-cov)

The first bias is the set of examples used. In our experiments, we use 1/10 of the programs as positive examples. The inferred rules are then applied to the 9/10 remaining programs to find which ones are co-covered. The generation of negative examples is an important step in our algorithm. In our case, it means inventing programs, with their occurrences and features. They have to be realistic enough to produce learning problems that will generate sufficiently discriminative clauses, and thus relevant co-covers. In order to generate counter-examples, we randomly copy parts of the description of real programs (with a renaming of the constants in order to produce a coherent set of occurrences and features). The hypothesis language, setting the format of acceptable clauses, is also different at each iteration. In practice, every mode of every predicate is given at the initialization of the algorithm, and a subset is randomly chosen at each iteration. All these machine learning problems on fake supervised tasks bring, through their variety, important properties to the obtained similarity: it mixes complex descriptions, implements feature selection, takes into account redundancy between descriptions, and is robust to outliers.

Figure 3: Class repartition in the ground-truth.

4 Experiments

4.1 Experimental setting

The data used for our experiments are those developed by [NG08]; they consist of a 22-day recording of the French France2 channel in May 2005.
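As a side note on step 4 of Algorithm 1, the counter-example generation described in Section 3.1 (copying parts of real program descriptions and renaming their constants) could be sketched as below. The fact encoding and the choice of drawing from two source programs are illustrative assumptions, not the paper's exact procedure.

```python
import random

def make_negative(real_programs, fake_id, rng):
    """Build one fake program by copying occurrence descriptions from
    randomly chosen real programs and renaming their constants so the
    copies form a coherent set of occurrences."""
    facts, occ_names = [], []
    for prog in rng.sample(real_programs, 2):        # two random sources
        occ = rng.choice(prog["occurrences"])        # one occurrence each
        new_occ = f"{fake_id}_occ{len(occ_names) + 1}"
        occ_names.append(new_occ)
        facts.append(("has_occ", fake_id, new_occ))
        for pred, value in occ["facts"]:             # copy its features
            facts.append((pred, new_occ, value))
    # re-link the renamed occurrences into a coherent chain
    for o1, o2 in zip(occ_names, occ_names[1:]):
        facts.append(("next_occ", o1, o2))
    return facts

real = [{"occurrences": [{"facts": [("duration", 69), ("channel", 2)]}]},
        {"occurrences": [{"facts": [("duration", 12), ("channel", 2)]}]}]
fake = make_negative(real, "neg1", random.Random(0))
```

The renaming step is what keeps the invented program internally consistent while its feature values remain realistic, since they are copied from real broadcasts.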
The stream is segmented into programs, and the different occurrences of a same program have been identified automatically and manually consolidated [NG08]. To build the ground-truth needed to evaluate our clustering results, we used the manual annotation of the data proposed by [NG08], who tagged the programs according to 6 classes: movie/show, series, commercials, sponsoring, branding (short programs displaying the name or logo of the channel), and trailers (short programs announcing what will be broadcast later). This ground-truth tagging of the stream will be used as reference clusters (cf. Figure 3 for their repartition). The evaluation scores are those commonly used for clustering comparison (the one produced automatically vs. the ground-truth one): Adjusted Purity, Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) [Ran71, HA85, VEB10].

4.2 Results

Figure 4 presents the results of the relational clustering after 1,000 iterations as well as several baselines relying on a usual propositional representation. In the latter case, the features used are: number of occurrences, average duration, mean, minimal and maximal interval between two occurrences, maximal number of occurrences during a 24h period, duration between the first and last occurrences, presence or not of all occurrences in the same day, and average number of other programs occurring before or after the program occurrences. The baseline algorithms are: k-means, EM, and CobWeb, as implemented in Weka [HFH+09]; for each of them, we only report the results of the configuration yielding the best ARI. The ILP algorithm used is aleph [Sri01], and the data are described as shown in Section 2. We also report the results of our ILP-based approach exploiting the same propositional representation (i.e., discarding the relational predicates of L_H).
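Of these scores, the Adjusted Rand Index is compact enough to state in a few lines. The sketch below follows the standard pair-counting formulation of [HA85]; it is a generic textbook implementation, not code from the paper's pipeline.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI: the Rand index corrected for chance agreement,
    computed from the pair counts of the contingency table."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in contingency.values())
    a = sum(comb(c, 2) for c in Counter(labels_true).values())
    b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    return (index - expected) / (max_index - expected)

# identical partitions (up to label names) score 1.0
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

ARI is 1.0 for identical partitions and close to 0.0 (possibly negative) for random labelings, which makes it well suited to comparing clusterings with different numbers of clusters.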
For every evaluation score, our ILP-based clustering approach performs better than the propositional approaches; this clearly shows the added value of the ability to handle the multi-relational representation of the data. The generated clusters are nonetheless different in terms of the number of clusters and of their content. An analysis of the differences between the ILP clusters and the ground-truth ones shows that the trailer class is difficult to capture (such programs appear in several ILP clusters). Other problems are caused by programs at the boundaries of our 22-day TV recording, or by programs for which the 3 weeks are not enough to capture the recurrence patterns.

Figure 4: Results of clustering methods in terms of Adjusted Purity, Normalized Mutual Information and Adjusted Rand Index.

An analysis of the inferred rules at each iteration also allows an indirect validation of our approach, since they exhibit the multi-relational property of our data. This is the case of the following rule, which covers programs broadcast at a fixed interval:

broadcast(A) :- has_occ(A,B), next_occ(B,C), next_occ(C,D), interval(B,C,E), interval(C,D,E).

5 Conclusions

Our clustering approach, relying on ILP, allows us to make the most of the multi-relational aspect of our TV data. It makes it possible to get a notion of distance even in rich description spaces where metrics cannot be defined a priori. Of course, even if there is no explicit definition of the distance, other biases from the user are unavoidable, such as the way the data are described, the definition of the modes in the hypothesis language... Several perspectives are foreseen. For our TV application, the use of a larger dataset (recording several months of several channels) would allow us to limit the errors mentioned in the previous section. Adding multimodal features (logo detection, black frames, speech detection...) would also bring useful information about the content of the TV segments.
These features should help the clustering process to distinguish between branding and sponsoring, or to better categorize trailers. More generally, the good results obtained by the ILP-based clustering argue in favor of applying this approach to other problems where the multi-relational aspect is important [DL01, MDP+12].

References

[Bre96] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

[CN13] Vincent Claveau and Abir Ncibi. Découverte de connaissances dans les séquences par CRF non-supervisés. In Actes de la conférence TALN 2013, 2013.

[DL01] S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Berlin: Springer-Verlag, 2001.

[HA85] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193-218, 1985.

[HFH+09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10-18, 2009.

[IG11] Zein Al Abidin Ibrahim and Patrick Gros. TV stream structuring. ISRN Signal Processing, 2011.

[Jai10] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651-666, 2010.

[MB10] Gaël Manson and Sid-Ahmed Berrani. Automatic TV broadcast structuring. International Journal of Digital Multimedia Broadcasting, 2010.

[MD94] Stephen Muggleton and Luc De Raedt. Inductive Logic Programming: Theory and methods. Journal of Logic Programming, 19-20:629-679, 1994.

[MDP+12] Stephen Muggleton, Luc De Raedt, David Poole, Ivan Bratko, Peter A. Flach, Katsumi Inoue, and Ashwin Srinivasan. ILP turns 20 - biography and future challenges. Machine Learning, 86(1):3-23, 2012.

[NG08] Xavier Naturel and Patrick Gros. Detecting repeats for video structuring. Multimedia Tools and Applications, 38(2):233-252, 2008.

[Pol08] Jean-Philippe Poli. An automatic television stream structuring system for television archives holders. Multimedia Systems, 14(5):255-275, 2008.

[Ran71] William M. Rand.
Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.

[SH05] Tao Shi and Steve Horvath. Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(1):118-138, 2005.

[Sri01] Ashwin Srinivasan. The Aleph manual. Machine Learning at the Computing Laboratory, Oxford University, 2001.

[vD00] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

[VEB10] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison. Journal of Machine Learning Research, 2010.