<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Clustering multi-relationnal TV data by diverting supervised ILP</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Vincent</forename><surname>Claveau</surname></persName>
							<email>vincent.claveau@irisa.fr</email>
							<affiliation key="aff0">
								<orgName type="institution" key="instit1">IRISA</orgName>
								<orgName type="institution" key="instit2">CNRS Campus de Beaulieu</orgName>
								<address>
									<settlement>Rennes</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Clustering multi-relationnal TV data by diverting supervised ILP</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CC93DAD8EF2DC67A2D85020F88FF2EEE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Traditionally, clustering operates on data described by a fixed number of (usually numerical) features; this description schema is said propositional or attribute-value. Yet, when the data cannot be described in that way, usual data-mining or clustering algorithms are no longer suitable. In this paper, we consider the problem of discovering similar types of programs in TV streams. The TV data have two important characteristics: 1) they are multi-relational, that is to say with multiple relationships between features; 2) they require background knowledge external to their interpretation. To process the data, we use Inductive Logic Programming (ILP) <ref type="bibr" target="#b8">[MD94]</ref>. In this paper, we show how to divert ILP to work unsupervised in this context: from artificial learning problems, we induce a notion of similarity between broadcasts, which is later used to perform the clustering. Experiments presented show the soundness of the approach, and thus open up many research avenues.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Many TV services require the TV stream to be correctly segmented and tagged (thematic corpora from archives, TV on demand...). Thus, one needs a complete TV guide, also documenting inter-program (short spots between main programs, such as ads, trailers...), with a very high precision (at the frame level). Such guides usually do not exist, which makes their automatic building necessary. This task is at the heart of automatic structuring of TV streams. Several approaches have been proposed; some relies on meta-data <ref type="bibr" target="#b11">[Pol08]</ref> or audio/video clues <ref type="bibr" target="#b10">[NG08,</ref><ref type="bibr" target="#b7">MB10,</ref><ref type="bibr" target="#b5">IG11]</ref>. They all rely on a supervised classification step (assign a class to each TV segment), thus requiring a priori knowledge (the user need to define the classes) and also too many manually annotated data to be actually usable. In this paper, we propose to reduce this important a priori involvement of the user by tackling the problem as a non-supervised one, that is as clustering. The remaining role of the user would then be to tag the clusters.</p><p>As with the well-known k-means, clustering techniques rely on a simple representation of the data and on a distance notion operating of these representations which has to be provided by the user <ref type="bibr" target="#b6">[Jai10]</ref>. In our case, this leads to two problems. First, our data need to be represented in a complex way, as they are multi-relational. Second, we do not know how to define a priori a relevant distance over these complex representations. In this </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">ILP and multi-relational data</head><p>For classification problems, objects are usually described in a propositional form, also said attribute-value or vector-based. In this representation, objects must have the same number of features, and the features are to be considered independently (relations between features are not exploited). In our case, each object is a segment of TV-streams corresponding to a program or an inter-program. But each object may have several occurrences, such as a particular ad which is repeated several times in the stream. The number of occurrences vary from one object to another, which makes the attribute-value description impossible. Moreover, certain relations between occurrences may be very relevant (eg. two occurrences are broadcast on different TV channels, two occurrences are broadcast in less than 1 day...). This multi-relational aspect of our data is thus important to consider for the clustering task. Figure <ref type="figure" target="#fig_0">1</ref> shows these different relations between occurrences and their feature as arrows with different colors (in gray: the class of broadcast, which is unknown in our problem).</p><p>ILP is usually used as supervised machine learning technique able to infer rules (eg. Horn clauses) H from examples (E + ) and counter-examples E − of a concept, and with the help of background knowledge B [MD94]. Figure <ref type="figure" target="#fig_2">2</ref> shows how a program can be described in B (with standard Prolog). One can see how the relations between the occurrences are easily encoded with predicates next occ/2 and next in stream/2.</p><p>In B we also define the predicates that can be used to infer rules in H, such as prev occ/2 which indicate two occurrences of the same program, one occurring after the other, or such as interval/3 wich indicates the time interval between two occurrences of two program. Here is an example of rule that can be inferred : This rule highlights the interest of the multi-relational representation: it covers every broadcast A having two occurrences B and C, lasting 3 seconds, such as these two occurrences are followed by two occurrences (D,E) from a same program (F). This rule typically covers sponsoring broadcast always appearing before a program.</p><p>3 From supervised to unsupervised</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Principles</head><p>Our approach aims at deducing distances (or similarities) between two programs from repeated random classifications problems with ILP. For a given random classification problem, if the two programs are covered by H, it tends to show that they are related. If this is the case for every random classification problem, it means that they are very similar. Algorithm 1 gives an overview of the process. As for bagging <ref type="bibr" target="#b0">[Bre96]</ref>, classification is repeated many times with different learning parameters: examples (step 3 wichich divides the data into positive E + train and E OoB , a out-of-bag set used later), counter-examples (step 4), the hypothesis language (step 5). At each iteration, we record the pairs of programs (x i , x j ) that are covered by the same inferred clauses (called  The strategy at the heart of this approach is to vary the learning biases at each iteration. The first bias is the set of examples used. In our experiments we use 1/10 of the programs to be used as positive examples. The inferred rules are then applied on the 9/10 remaining programs to find which one are co-covered. The generation of negative examples is an important step in our algorithm. In our case, it means inventing programs, with their occurrences and features. They have to be realistic enough in order to produce learning problems that will generate discriminative enough clauses, and thus relevant co-covers. In order to generate counter-examples, we randomly copy parts of the description of real programs (with a renaming of the constants in order to produce a coherent set of occurrences and features). The hypothesis language, setting the format of acceptable clauses, Algorithm 1 Clustering with ILP 1: input: E total : programs 2: for i in [1 .. N ] do 3:</p><formula xml:id="formula_0">E + i , E OoB i ← Divide(E total ) 4: Generate negative examples E − i 5:</formula><p>Generate randomly the hypothesis language L H i and the ILP parameters θ i</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6:</head><p>Inferring :</p><formula xml:id="formula_1">H i ← ILP(E + i ,E − i ,L H i ,θ i ) 7:</formula><p>for all clause h l among H i do is also different at each iteration. In practice, every mode of every predicate is given at the initialization of the algorithm, and a subset is randomly chosen at each iteration. All these machine learning problems on fake supervised tasks brings, through their variety, important properties to the obtained similarity: it mixes complex descriptions, implements feature selection, take into account redundancy between descriptions, and is robust to outliers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental setting</head><p>The data use for our experiments are those developed by <ref type="bibr" target="#b10">[NG08]</ref>; it consists of a 22-day recording of the French France2 channel in May 2005. The stream is segmented in programs and the different occurrences of a same program have been identified automatically and manually consolidated <ref type="bibr" target="#b10">[NG08]</ref>. To build the groundtruth needed to evaluate our clustering results, we used the manual annotation of the data proposed by <ref type="bibr" target="#b10">[NG08]</ref> who tagged the programs according to 6 classes: movie/show, series, commercials, sponsoring, branding (short programs displaying the the name or logo of the channel), trailers (short programs announcing what will be broadcast later). This ground-truth tagging of the stream will be used as reference clusters (cf. Figure <ref type="figure">3</ref> for their repartition). The evaluation scores are those commonly used for clustering comparison (the one produced automatically vs. the ground-truth one): Adjusted Purity, Normalized mutual information and Adjusted Rand Index (ARI) [Ran71, HA85, VEB10].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results</head><p>Figure <ref type="figure" target="#fig_4">4</ref> presents the results of the relational clustering after 1 000 iterations as well as several baselines relying on a usual propositional representation. In this latter case, the features used are: number of occurrences, average duration, mean, minimal and maximal interval between two occurrences, maximal number of occurrences during a 24h, duration between the first and last occurrences, presence or not of every occurrences in the same day, and average number of other programs occurring before or after the program occurrences. The baseline algorithms are: k-means, EM, CobWeb, such that implemented in weka [HFH + 09]; for each of them, we only report the results of the configurations yielding the best ARI. The ILP algorithm used is aleph <ref type="bibr" target="#b14">[Sri01]</ref>, the data are described as shown in Section 2. We also, report the results of our ILP-based approach exploiting the same representation (i.e. discarding the relational predicates of L H ).</p><p>For any evaluation score, our ILP-based clustering approach perform better than the propositional approaches; it clearly shows the added-value of the ability to handle the multi-relational representation of the data. The generated clusters are nonetheless different in terms of numbers of clusters and in terms and of the content of these clusters. An analysis of the differences between the ILP clusters and ground-truth ones shows that the trailer class is difficult to capture (such programs appear in several ILP clusters). Other problems are caused by programs at the boundaries of our 22-day TV recording or for programs for which the 3 weeks are not enough to capture the recurrence patterns. An analysis of the inferred rule for each iteration also allow an indirect validation of our approach since they exhibit the multi-relational property of our data. This is the case of the following rule which covers programs broadcast at fixed interval: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>Our clustering approach, relying on ILP, allows us to make the most of the multi-relational aspect of our TV data. It makes it possible to get a notion of distance even in rich description spaces where metrics cannot be defined a priori. Of course, even if there is no explicit definition of the distance, other biases from the user are unavailable, such as the way the data are described, the definition of the modes in the hypothesis language... Several perspectives are foreseen. For our TV application, the use of a larger dataset (recording several months with several channels) would allow us to limit the errors mentioned in the previous section. Adding multimodal features (logo detection, black frames, speech detection...) would also bring useful information about the content of the TV segment. These features should help the clustering process to distinguish between branding and sponsoring, or to better categorize trailers. More generally, the good results obtained by the ILPbased clustering argues in favor of applying this approach to other problems where the multi-relational aspect in important [DL01, MDP + 12].</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Multiple relations of the trailer Clara Sheller. paper, we propose to define a clustering technique suited to our complex data by diverting supervised Inductive Logic Programming (ILP) into a non-supervised technique. ILP makes it possible to easily represent our multirelational data, and a distance between broadcasts is automatically from fake supervised classification problems, in the vein of [SH05, CN13].</figDesc><graphic coords="2,137.70,54.07,340.20,162.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>broadcast(A) :-has occ(A,B), duration(B,3), next occ(B,C), next in stream(B,D), next in stream(C,E), has occ(F,D), has occ(F,E).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Excerpt of the example description and background knowledge co-covers hereafter) in a matrix M co-cov . One can give more weight to a clause covering very few examples, and less weight to a clause covering most of the examples (function weight). The last step is simply to use a standard clustering technique on the co-cover matrix, considered as a similarity matrix. In the experiments presented below, we use Markov Clustering [vD00]. Its main advantage compared with k-means/k-medoids is to avoid the need to decide a priori the number of expected clusters.The strategy at the heart of this approach is to vary the learning biases at each iteration. The first bias is the set of examples used. In our experiments we use 1/10 of the programs to be used as positive examples. The inferred rules are then applied on the 9/10 remaining programs to find which one are co-covered. The generation of negative examples is an important step in our algorithm. In our case, it means inventing programs, with their occurrences and features. They have to be realistic enough in order to produce learning problems that will generate discriminative enough clauses, and thus relevant co-covers. In order to generate counter-examples, we randomly copy parts of the description of real programs (with a renaming of the constants in order to produce a coherent set of occurrences and features). The hypothesis language, setting the format of acceptable clauses,</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>8 :Figure 3 :</head><label>83</label><figDesc>Figure 3: Class repartition in the ground-truth.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Results of clustering methods in terms of Adjusted Purity, Normalized mutual information and Adjusted Rand Index.</figDesc><graphic coords="5,186.30,54.07,243.00,185.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>broadcast(A) :-has occ(A,B), next occ(B,C), next occ(C,D), interval(B,C,E), interval(C,D,E).</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Bagging predictors</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="123" to="140" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Dcouverte de connaissances dans les squences par CRF nonsuperviss</title>
		<author>
			<persName><forename type="first">Vincent</forename><surname>Claveau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abir</forename><surname>Ncibi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Actes de la confrence TALN 2013</title>
				<meeting>s de la confrence TALN 2013</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m">Relational Data Mining</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Dzerosky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Lavrac</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Comparing partitions</title>
		<author>
			<persName><forename type="first">Lawrence</forename><surname>Hubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Phipps</forename><surname>Arabie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Classification</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="193" to="218" />
			<date type="published" when="1985">1985</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The WEKA data mining software: An update</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eibe</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Pfahringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Reutemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explorations</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="10" to="18" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Tv stream structuring</title>
		<author>
			<persName><forename type="first">Zein</forename><surname>Al</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abidin</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrick</forename><surname>Gros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ISRN Signal Processing</title>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Data clustering: 50 years beyond k-means</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Jain</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="651" to="666" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Automatic tv broadcast structuring</title>
		<author>
			<persName><forename type="first">Gal</forename><surname>Manson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sid-Ahmed</forename><surname>Berrani</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>International Journal of Digital Multimedia Broadcasting</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Inductive Logic Programming: Theory and Methods</title>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Muggleton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luc</forename><surname>De Raedt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Logic Programming</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="629" to="679" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">ILP turns 20 -biography and future challenges</title>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Muggleton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luc</forename><surname>De Raedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Poole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Bratko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><forename type="middle">A</forename><surname>Flach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katsumi</forename><surname>Inoue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashwin</forename><surname>Srinivasan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="3" to="23" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Detecting repeats for video structuring</title>
		<author>
			<persName><forename type="first">Xavier</forename><surname>Naturel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrick</forename><surname>Gros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="233" to="252" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">An automatic television stream structuring system for television archives holders</title>
		<author>
			<persName><forename type="first">Jean-Philippe</forename><surname>Poli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Systems</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="255" to="275" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Objective criteria for the evaluation of clustering methods</title>
		<author>
			<persName><forename type="first">William</forename><forename type="middle">M</forename><surname>Rand</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Statistical Association</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="issue">336</biblScope>
			<biblScope unit="page" from="846" to="850" />
			<date type="published" when="1971">1971</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Unsupervised learning with random forest predictors</title>
		<author>
			<persName><forename type="first">Tao</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steve</forename><surname>Horvath</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Computational and Graphical Statistics</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="118" to="138" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">The aleph manual. Machine Learning at the Computing Laboratory</title>
		<author>
			<persName><forename type="first">Aswin</forename><surname>Srinivasan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
		<respStmt>
			<orgName>Oxford University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Graph Clustering by Flow Simulation</title>
		<author>
			<persName><forename type="first">Stijn</forename><surname>van Dongen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
		<respStmt>
			<orgName>University of Utrecht</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Thse de doctorat</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Information theoretic measures for clusterings comparison</title>
		<author>
			<persName><forename type="first">Nguyen</forename><forename type="middle">Xuan</forename><surname>Vinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Epps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Bailey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
