=Paper=
{{Paper
|id=Vol-1707/alatiknow_paper5
|storemode=property
|title=Active Subtopic Detection in Multitopic Data
|pdfUrl=https://ceur-ws.org/Vol-1707/alatiknow_paper5.pdf
|volume=Vol-1707
|authors=Benjamin Bergner,Georg Krempl
|dblpUrl=https://dblp.org/rec/conf/iknow/BergnerK16
}}
==Active Subtopic Detection in Multitopic Data==
Benjamin Bergner (benjamin.bergner@st.ovgu.de) and Georg Krempl (georg.krempl@ovgu.de)
Knowledge Management & Discovery, Otto von Guericke University Magdeburg, Germany

Abstract

Subtopic detection is a useful text information retrieval tool to create relations between documents and to add descriptive information to them. For the task of detecting subtopics with user guidance, clustering by intent (CBI) has recently been proposed. However, this approach is limited to single-topic environments. We extend this approach for interactive subtopic detection in multi-topic environments, and for the incorporation of positive and negative user feedback. Our multi-topic clustering by intent (MCBI) approach iteratively constructs so-called similarity sets of documents within the same topic, derives candidates for new subtopics, and actively queries feedback from the user, which is then used to refine the subtopic and similarity sets in the next iteration. For evaluation, we construct a corpus of the Wikipedia articles for the 4309 most common English nouns, comprising a broad range of different topics. Our MCBI approach is compared with the recently proposed CBI approach and with random sampling. Each approach is evaluated based on the number of subtopics found in the same predefined, closed topic (countries). The results show that MCBI finds up to 137% and 445% more correct subtopics than random term selection and CBI, respectively.

1 Introduction

A business user might intend to group text documents, such as support logs or customer reviews, into meaningful subsets that correspond to the different subtopics addressed in the texts. Or, as a further illustrative example, consider a user who wants to cluster documents by using the different types of sport as subtopics.
As pointed out in [7], it might be too tedious or even impossible to provide a priori an exhaustive list of possible subtopics, to label documents manually, or to experiment with different parameters until the desired clustering is obtained. However, it is comparatively easy for the user to illustrate the intended clustering by providing one (or a few) exemplary subtopics, for example by giving a cluster of tennis-related documents. Given the documents and initial subtopic(s), the clustering algorithm's task is to construct candidates for new subtopics, to actively seek the user's confirmation or rejection of these candidates as members of the intended topic, and to extend the clustering accordingly. Thus, this problem, recently described as clustering by intent in [7], is a combination of an active, incremental clustering task with very weak, interactive supervision. A first approach for this task is provided in [7]. However, this approach is limited to single-topic environments. That is, documents unrelated to the intended topic might confuse the approach, for example food-related documents when the intent is deriving animal-related subtopics. Furthermore, the approach is focused on the user's confirmation of candidates, neglecting negative feedback by the user. In this paper, we extend this approach to interactive subtopic detection in multi-topic environments, and to incorporating positive and negative user feedback. Our Multi-topic Clustering By Intent (MCBI) approach iteratively constructs so-called similarity sets, which comprise solely documents of the same topic.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: G. Krempl, V. Lemaire, E. Lughofer, and D. Kottke (eds.): Proceedings of the Workshop Active Learning: Applications, Foundations and Emerging Trends, AL@iKNOW 2016, Graz, Austria, 18-OCT-2016, published at http://ceur-ws.org
Then, it derives candidates for new subtopics, for which it actively queries feedback from the user. Information on the confirmed as well as on the rejected subtopics is then incorporated into the clustering model and used in subsequent iterations. The remainder of this paper is structured as follows: in Section 2, we provide a more detailed background of clustering by intent and review the related work. Our method is presented in Section 3, followed by the experimental evaluation in Section 4. The paper closes with concluding remarks and an outlook on future work in Section 5.

2 Background and Related Work

We first provide a more formal definition of clustering by intent, in order to facilitate the subsequent discussion of the related work. Clustering by intent [7] corresponds to an interactive clustering task, where a given set of documents D should be partitioned into clusters. Each document d ∈ D is represented by its bag-of-words feature vector w. Each cluster c_st is defined and labelled by its corresponding subtopic st. The clusterer has neither access to an explicitly given similarity function, nor to an expert willing to perform a tedious tuning of such a function, nor to single similarity measurements between objects. Instead, the clusterer is provided with one (or few) exemplary clusters as an indication of the human intent. Its task is then to actively propose further clusters to the user that should correspond to the same intent. Documents belonging to clusters that the user has already confirmed constitute the labelled set L ⊆ D. The unlabelled set U ⊆ D comprises all remaining documents. This is illustrated in Fig. 1(a), where the two subtopics c_Dog and c_Cat as well as their corresponding documents (blue and yellow dots) have already been identified, while other documents (black dots) remain unlabelled.
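As a minimal sketch of this setup (using only the definitions above; the toy documents and subtopic names are invented for illustration), the document set D, the bag-of-words vectors, and the split into L and U might be represented as follows:

```python
from collections import Counter

# Toy document set D; texts and subtopic names are invented for illustration.
docs = {
    0: "the dog barked at the other dog",   # confirmed subtopic: dog
    1: "my cat sleeps on the sofa",         # confirmed subtopic: cat
    2: "a hamster ran inside its wheel",    # not yet labelled
    3: "bricks and mortar for the wall",    # not yet labelled
}
clusters = {"dog": {0}, "cat": {1}}         # C: subtopic -> document ids

# Bag-of-words feature vector w for each document d in D.
bow = {i: Counter(text.split()) for i, text in docs.items()}

L = set().union(*clusters.values())         # labelled set L ⊆ D
U = set(docs) - L                           # unlabelled set U ⊆ D

print(sorted(L), sorted(U))                 # -> [0, 1] [2, 3]
```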
The approach proposed in [7] for CBI iteratively (1) trains a probabilistic multi-class classifier that discriminates between the known subtopics, (2) computes for each unlabelled document its confidence as the margin between its highest and second-highest posterior estimate, (3) builds a so-called residual set that comprises the low-confidence documents, (4) selects the terms with the highest precision in separating L from U as subtopic candidates, and (5) queries their membership to the intent from the user, updates the sets, and restarts the process. This is illustrated in Fig. 1(b), where the most uncertain documents are used to form a residual set. The pseudocode for our implementation of this clustering by intent algorithm is given in Fig. 5 in the appendix. Clustering by intent is related to constrained clustering and active learning, as well as to novel class detection, subgroup discovery, and topic detection; we briefly review the most related literature below. Within the rich literature on clustering, constrained clustering is of particular relevance. A recent survey on constrained (also denoted semi-supervised) clustering is provided in [5]. As an example, text clustering is given, where the objective is to group texts according to similarities in e.g. their content or their authors. The expert might provide must-link constraints for documents that are deemed to be similar, or cannot-link constraints for documents that are different. In [5], three categories are named: search-based (constraint-based) approaches, where the solution space is modified e.g. by penalizing violations of constraints; distance-based (similarity-based) approaches, where the similarity function is modified to consider the constraints; and hybrids of both. Active approaches have also been researched, e.g. for the selection of informative document pairs for which user feedback is queried [1], and more recently also noisy constraints [18].
However, in contrast to [7, page 33], constrained clustering assumes the similarity measure to be a priori specified, and the pairwise constraints require querying information on the document level, rather than on the subtopic level. In (inter)active clustering (e.g. [9, 6, 11]), not all similarities (or distances) are known a priori. Rather, the approach actively selects similarity measurements and queries them from an oracle. Recent works comprise the so-called interactive (hierarchical) clustering proposed in [11] and the active hierarchical clustering in [6]. However, these (inter)active clustering approaches require the similarity measure to be known a priori or similarity measurements to be provided by the user. Many works in the active learning literature address classification, where labels for instances are queried from the user [15]. Although some of these works address active learning on text data (e.g. sentiment classification in [10]), they assume the set of labels to be known and are therefore not applicable here. Similarly, in [14], active learning in text categorization is studied. The authors propose a two-fold process, where active learning is done both on the document level (querying a document's category, i.e. its label) and on the term level. The latter corresponds to asking the oracle about the importance of features, i.e. asking for the most predictive words. However, the text categorization problem is posed as a one-versus-the-rest classification problem, where the single category of interest is initially known. In contrast, in clustering by intent, the categories are not known a priori but rather need to be learned on the way. Nevertheless, the approach in [14] might be used as post-processing for labelling all documents, once the set of categories has been determined by a clustering-by-intent approach.
Another related field of supervised machine learning research is subgroup discovery ([17], see e.g. [2] for a recent survey). Given a population of individuals, its objective is to find subgroups that are as large as possible and show a distributional unusualness with respect to a certain property of interest [17]. The interestingness of a pattern is measured by a quality function [2], which in general exploits the target concept's distribution. A common exemplary quality function combines a measure of the subgroup's size with its deviation from the whole population in the target concept value. There exist interactive subgroup discovery frameworks (e.g. [8]) that allow the expert to affect the attributes that are used for learning. However, the quality function and the target concept (i.e. class attribute) need to be specified a priori. Furthermore, novel class detection (e.g. [12]) in data streams is a related area of research. There, the objective is to detect novel classes, whose instances differ from those of already known classes. Some approaches are passive (e.g. [12]), while others actively query labels from the user (e.g. [13, 3]). However, similar to clustering, these approaches rely on an a priori specified similarity function. Furthermore, they focus on finding emerging topics, requiring a chronological ordering of instances. In contrast, in our setting all documents are provided at once, before starting to detect subtopics interactively. Likewise, one-class classification, outlier detection and anomaly detection approaches are not applicable here [7], as their objective is the detection of abnormal, rare data points that differ from the data available at training time. Thus, they detect small rather than large groups, and do not aim to provide a meaningful clustering. In topic detection and modelling, the aim is to discover the most relevant topics in text documents. An example is [4], where frequent topics in Twitter messages are discovered.
Another, more recent, example is the discovery of the most interesting topics (and subtopics) in students' messages in a learning management system [16]. However, these approaches rely on temporal information. For example, the detected "emerging topics" in [4] and the "spike topics" in [16] correspond to terms that have gained in frequency. More related is the detection of "chatter topics" in [16]. These chatter topics are sustained discussion topics, which are the most frequent words that are not in a set of so-called "DumpTerms". However, this requires providing a list of DumpTerms, which is infeasible for large vocabularies.

Figure 1: Dots represent single documents, some assigned to clusters of initially given subtopics. (a) Left: Example topic animals; c_Dog and c_Cat contain documents with subtopics dog and cat, respectively; remaining documents are unlabelled and might contain further animal-related subtopics. (b) Right: Building the residual set R of documents whose classification is uncertain.

3 Multi-Topic Clustering by Intent

We first explain the main ideas of our approach, before providing details. Similar to CBI [7], we use an uncertainty-related measure to construct a residual set of documents that are not certainly classified into one known subtopic. However, our aim is to build a residual set consisting solely of documents from the intended topic. Thus, for constructing the residual set, we include a threshold on the document's similarity with respect to the union of known subtopics. Having this residual set of documents, we extract the words that are used therein. In contrast to CBI, we exclude all words that have previously been rejected as subtopics. Furthermore, some terms are not specific to a subtopic, but rather related to the topic as a whole. Therefore, we also exclude words that occur frequently in all subtopics. We then compute a score for the remaining words.
Similar to CBI, we consider the discriminatory power of a word, but we also consider the frequency of its co-occurrence with rejected words. The highest-ranking words in this combined score are proposed as new subtopics. Upon querying the user's decision, the lists of rejected words and accepted subtopics are updated. The pseudocode of our Multi-Topic Clustering by Intent (MCBI) approach is given in Figure 3. Its inputs are lists of unlabelled U as well as labelled L documents. Using a bag-of-words representation over the vocabulary V, each document d therein has a feature vector d. Furthermore, the current clustering C is given as a set of subtopics s1, s2, ..., each subtopic consisting of one or more words. Finally, a list of previously rejected subtopics VREJ is given. These four lists are updated and returned by the algorithm. The first step of the algorithm (lines 2–12) is the construction of the residual set. This is done by first training a Multinomial Naive Bayes classifier (MNB) to classify documents into the different known subtopics (line 2). Iterating over each document u in the unlabelled set U (lines 3–12), we use this classifier to obtain a posterior-based score p_s|u, normalized by the vocabulary size, for each known subtopic s (line 5). For simplicity and speed, we consider in subsequent calculations solely the scores of the best (s) and second-best (s′) subtopic (line 6). For better comparison with CBI, our algorithm uses the same uncertainty-based strategy to select the most ambiguous unlabelled documents. Thus, MCBI calculates for each unlabelled instance the difference between the score of its best subtopic s and its second-best subtopic s′ (line 7)¹:

    uncertainty_u = p_s|u − p_s′|u    (1)

In a multi-topic corpus, one might distinguish three types of documents in the residual set: first, documents belonging to one of the known subtopics.
Second, documents belonging to a new subtopic of the intended topic; and third, documents belonging to a different topic. Our aim is to identify documents of the second kind. For illustration, consider the example of the intended topic "sports", with known subtopics "soccer" and "tennis", the yet undiscovered subtopic "hockey", and the unrelated topic "building" (see Fig. 2(a)). A classifier using the frequency of words like "soccer" or "tennis" might differentiate documents of the first type from the rest. However, it fails in distinguishing between the second and the third type, as both are equally unlikely to contain soccer- or tennis-specific words (see Fig. 2(b)). Nevertheless, sports-related documents of the second type contain words that occur in both known subtopics, as for example the word "athlete". Such topic-specific words are more frequent in the second (and first) type than they are in the third (see Fig. 2(c)). Thus, the score over all subtopics will be greater for topic-related documents. We exploit this to remove topic-unrelated documents from the residual set by applying a threshold on the sum of scores, which we denote as similarity (line 8):

    similarity_u = p_s|u + p_s′|u    (2)

Finally, all documents that are ambiguous with respect to being classified into one existing subtopic (high uncertainty) and sufficiently similar to the union of the two subtopics in question (high similarity) are selected for the residual set R (lines 9–11 in Figure 3). That is, R is a subset of the set selected via Equation 1, restricted by the similarity condition in order to keep the pre-selected documents within the desired topic. Having the residual set of topic-related documents, the next task is selecting candidate words for new subtopics, for which the user's feedback is then queried. Thus, we iterate over each word w occurring in a document of the residual set (lines 15–24).
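Equations (1) and (2) translate directly into a residual-set filter. In the sketch below, the per-document score table and both thresholds are invented for illustration:

```python
# p_{s|u}: score of each known subtopic s for each unlabelled document u.
# Values and the thresholds tau_m, tau_s are invented for illustration.
scores = {
    "u1": {"soccer": 0.40, "tennis": 0.38},   # ambiguous but sports-like
    "u2": {"soccer": 0.45, "tennis": 0.05},   # confidently soccer
    "u3": {"soccer": 0.04, "tennis": 0.03},   # ambiguous and off-topic
}
tau_m, tau_s = 0.10, 0.50   # margin (uncertainty) and similarity thresholds

R = set()
for u, p in scores.items():
    best, second = sorted(p.values(), reverse=True)[:2]
    uncertainty = best - second        # equation (1): small -> ambiguous
    similarity = best + second         # equation (2): large -> on-topic
    if uncertainty <= tau_m and similarity >= tau_s:
        R.add(u)

print(R)   # -> {'u1'}
```

Only u1 enters R: u2 is confidently classified (large margin), while u3, although ambiguous, is removed by the similarity condition.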
However, the most frequent words in the residual set might be specific to the topic, but not to a subtopic. Continuing the sports topic above, an exemplary word is "athlete". Excluding such topic-specific words from being suggested as subtopic candidates saves annotation time and prevents them from being added to the list of rejected terms VREJ. Such topic-specific words are frequent in the documents of the labelled set L, too. Thus, we exclude any word that occurs in more than half of the labelled documents² (line 16). Here, the subfunction getDocumentsWithWords(D, W) returns all documents in D that contain the words W. Furthermore, our algorithm makes use of negative feedback in different ways. First, by excluding w in case it corresponds to a previously proposed but rejected subtopic (line 17). Second, by computing a score that considers how often this word co-occurs with rejected words. Thereby, we aim to exclude words that are specific to unrelated topics. For each rejected subtopic n ∈ VREJ, we compute the co-occurrence frequency of w and n and divide it by the frequency of n. Then, we use the maximum of these relative co-occurrence frequencies as a reject score in line 20. Next, we compute a score reflecting the discriminatory power of the word w, similar to [7]. This discriminative score corresponds to the number of unlabelled documents that are classified by the word w, minus the number of labelled documents that are (wrongly) classified by w as a discriminative feature (line 21). Finally, the best-scoring candidate words are selected and added to the query list Q (line 25). The user's feedback on these subtopic candidates is queried (line 27), which might result in an accept or reject decision.

¹ Here, entropy might be considered as another uncertainty measure.
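The word-scoring step just described (topic-term filter in line 16, reject score in line 20, discriminative score in line 21) can be sketched as follows; the toy corpus and the rejected word are invented for illustration:

```python
docs = {
    0: "soccer athlete goal",      # labelled (subtopic soccer)
    1: "tennis athlete racket",    # labelled (subtopic tennis)
    2: "hockey athlete puck",      # residual, wanted new subtopic
    3: "brick wall mortar",        # residual, off-topic
}
L, U = {0, 1}, {2, 3}
R = {2, 3}                         # residual set
V_rej = {"brick"}                  # a previously rejected candidate

def with_words(ids, words):
    """Ids of documents containing every word in `words`."""
    return {i for i in ids if all(w in docs[i].split() for w in words)}

scores = {}
for w in {w for i in R for w in docs[i].split()}:
    if len(with_words(L, [w])) >= 0.5 * len(L):
        continue                   # occurs in half of L: a topic term
    if w in V_rej:
        continue                   # already rejected by the user
    # squared co-occurrence count with each rejected word n, relative to
    # n's document frequency (line 20 of Figure 3)
    rej = max((len(with_words(L | U, [w, n])) ** 2
               / len(with_words(L | U, [n])) for n in V_rej), default=0.0)
    disc = len(with_words(U, [w])) - len(with_words(L, [w]))
    scores[w] = (disc, rej)

print(scores["wall"], scores["hockey"])   # -> (1, 1.0) (1, 0.0)
```

Here "athlete" is dropped as a topic term, "brick" as a rejected word, and "wall" inherits a high reject score from co-occurring with "brick", so "hockey" and "puck" remain the preferable candidates.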
With the accepted words, the list of subtopics in the clustering C is updated (line 28), and their corresponding documents are moved from U to L (lines 29–30). Rejected words are simply added to the rejected word list VREJ (line 31), and the resulting lists U, L, C, VREJ are returned.

Figure 2: Illustration of the subtopics' word bags. (a) Left: Overlap of the word sets of the sport subtopics "soccer" (given), "tennis" (given), and "hockey" (unknown) with the unrelated topic "building" (unknown). (b) Center: When considering solely ambiguity with respect to classification into "soccer" and "tennis", the residual set (in red) contains "building"- and "hockey"-related documents. (c) Right: When also considering the similarity to the topic-common word "athlete", only "sport"-related documents like "hockey" overlap.

4 Evaluation

Several questions will be answered on one composed dataset in order to evaluate an active-learning topic detection algorithm. How many subtopics will be found in comparison to random term selection? How many subtopics will be found out of all subtopics in D? What is the round distribution, i.e. how many subtopics will be found in every round? Finally, how many unique subtopics will be found over rounds with different initial subtopics? An ideal algorithm would detect all subtopics that are available in D. In fact, it would detect them all in the first round without needing much time of a human oracle. Furthermore, it would quickly indicate when it is no longer able to find new subtopics, and it would run reliably, such that its performance is equal over differing starting values.

4.1 Dataset and Evaluation Design

For constructing a dataset with multiple topics, a list of 4309 common English nouns³ has been parsed to search for Wikipedia articles, which build up the document set D.
In case of getting more than one search result on Wikipedia, the first choice has been considered. Stop words have been removed, followed by lemmatization. On D, tf-idf with a threshold of 5 words per document was applied to keep solely the most important words.

² This threshold value IGN = 0.5 allows the algorithm to differentiate already with two documents in L, and also showed the best performance in experiments.
³ Available at http://www.desiquintans.com/nounlist

 1: function MCBI(U, L, C, VREJ)
 2:   MNB ← trainClassifier(L, C)                  ▷ train Multinomial Naive Bayes
 3:   R ← ∅                                        ▷ build residual set
 4:   for u ∈ U do
 5:     p·|u ← posteriorScores(MNB, u)             ▷ compute scores
 6:     (p_best, p_2nd) ← maxN(p·|u, 2)            ▷ get the two highest scores
 7:     uncertainty_u ← p_best − p_2nd             ▷ compute their difference
 8:     similarity_u ← p_best + p_2nd              ▷ compute their sum
 9:     if (uncertainty_u ≤ τm) ∧ (similarity_u ≥ τs) then
10:       R ← R ∪ {u}                              ▷ add document to residual set
11:     end if
12:   end for
13:
14:   i ← 0                                        ▷ get words and their scores
15:   for w ∈ words(R) do
16:     if |getDocumentsWithWords(L, w)| < 0.5 · |L| then       ▷ no topic term
17:       if w ∉ VREJ then                         ▷ no rejected term
18:         i ← i + 1
19:         word_i ← w                             ▷ compute scores
20:         rejscore_i ← max_{n ∈ VREJ} |getDocumentsWithWords(U ∪ L, {w, n})|² / |getDocumentsWithWords(U ∪ L, n)|
21:         discscore_i ← |getDocumentsWithWords(U, w)| − |getDocumentsWithWords(L, w)|
22:       end if
23:     end if
24:   end for
25:   Q ← getBestScores(word, discscore, rejscore) ▷ select candidates
26:
27:   (Q_accepted, Q_rejected) ← queryUser(Q)      ▷ query user's feedback
28:   C ← C ∪ Q_accepted                           ▷ update clustering
29:   L ← L ∪ getDocumentsWithWords(U, Q_accepted) ▷ update labelled set
30:   U ← U \ getDocumentsWithWords(U, Q_accepted) ▷ update unlabelled set
31:   VREJ ← VREJ ∪ Q_rejected                     ▷ update reject list
32:   return U, L, C, VREJ
33: end function

Figure 3: The MCBI Algorithm
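The preprocessing above (keeping only each document's highest-ranked tf-idf terms) can be sketched as follows. The toy documents use k = 2 instead of the paper's 5 words per document, and the plain tf · log(N/df) weighting is an assumption chosen for illustration:

```python
import math
from collections import Counter

# Keep each document's top-k tf-idf terms; toy corpus, k = 2 for brevity.
docs = [
    "dog dog bone park",
    "cat cat bone sofa",
    "dog cat vet visit",
]
k = 2

N = len(docs)
df = Counter(w for d in docs for w in set(d.split()))   # document frequency
idf = {w: math.log(N / df[w]) for w in df}

kept = []
for d in docs:
    tf = Counter(d.split())
    ranked = sorted(tf, key=lambda w: tf[w] * idf[w], reverse=True)
    kept.append(set(ranked[:k]))             # top-k terms of this document

vocabulary = set().union(*kept)              # resulting vocabulary V
print(sorted(vocabulary))   # -> ['cat', 'dog', 'park', 'sofa', 'vet', 'visit']
```

The common, low-idf word "bone" is ranked below each document's distinctive terms and drops out of the vocabulary.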
The resulting vocabulary set V has a size of 6,594 unique terms, which were considered in the bag-of-words representation. In order to enable a fair and automatic evaluation, a topic with known subtopics as ground truth was required. We have chosen the topic countries, as it is closed and identifiable. This means that a complete list of all countries' names (as subtopics) is obtainable. We also counted languages⁴ and demonyms (names for a country's natives)⁵ because of their high relatedness. As in most natural language applications, some words are ambiguous, e.g. the country Georgia, which is also a US state. For simplicity, such ambiguous terms were considered as valid subtopics. Furthermore, countries like Marshall Islands and New Zealand were transformed into single terms like Marshall and Zealand to reduce ambiguity. We compared our MCBI approach (denoted as MCBI, NF=on) against three baselines: CBI [7]⁶, which in [7] was already shown to compare favourably against other clustering-based approaches; random selection; and a variant of MCBI without using negative feedback (denoted as MCBI, NF=off). For CBI's residual set size |R|, different parameter values were used (see the first column in CBI's results in Table 1a). On this dataset, we ran CBI and the MCBI algorithms 20 times, while random was run 1000 times to get reliable results. Each of those iterations consisted of 20 sampling rounds, where in each round 20 subtopics were queried. For the experiments we used 2, respectively 4, initially given subtopics. Within iterations, the initial subtopics changed between countries (e.g. Germany & Portugal).

⁴ Available at http://www.infoplease.com/ipa/A0855611.html
⁵ Available at http://geography.about.com/library/weekly/aa030900a.htm
⁶ We reimplemented this approach, as the original source code is not available.
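Since the topic is closed, user feedback can be simulated by an automatic oracle that accepts a proposed term iff it occurs in the ground-truth list of countries, languages, and demonyms. A minimal sketch, with tiny invented stand-ins for the real ground-truth lists and proposals:

```python
# Simulated oracle for automatic evaluation: accept a proposed term iff it
# is a known country, language, or demonym.  All lists here are tiny
# invented stand-ins for the real ground truth.
ground_truth = {"germany", "portugal", "france", "german", "french"}

def oracle(proposals):
    accepted = {w for w in proposals if w in ground_truth}
    return accepted, set(proposals) - accepted

found = set()
rounds = [["france", "brick"], ["german", "window"]]  # stand-in proposals
for proposals in rounds:
    accepted, rejected = oracle(proposals)
    found |= accepted          # count unique correct subtopics over rounds

print(sorted(found))           # -> ['france', 'german']
```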
When constructing MCBI's residual set R, we iteratively lowered the thresholds τs and τm until the residual set's size was equal to or greater than one percent of the unlabelled set's size. We also evaluated the effect of choosing values other than 0.5 for the threshold IGN for ignored topic terms in line 16, which confirmed our choice.

4.2 Results and Discussion

We first provide the aggregated results in Fig. 4, where we compare the four algorithms in terms of their average (over all iterations) number of found subtopics (ordinate) in each round (abscissa). In these experiments, Random yields a nearly uniform performance over the rounds. This is as expected: due to the large size of V, there is little dependence between rounds (i.e. it corresponds to sampling without replacement from a very large urn). On this dataset, Random consistently performs better than CBI. This is in contrast to the single-topic environments that CBI originates from, and it shows the necessity of adaptations for handling multi-topic environments. Furthermore, Random performs worse than MCBI in most rounds. The performance of MCBI increases over the first rounds, before starting to decline towards the last quarter of the rounds. In particular, the MCBI variant using negative feedback improves at the beginning. This indicates that the negative feedback helps in determining the relevance of subtopics. However, in the last quarter both MCBI variants perform similarly. This indicates that at this point negative feedback might start to be too restricting, although this requires further investigation. When relating the maximum average number of found subtopics (26.20) to the total number of available subtopics in D (183), ca. 14% are discovered, with a standard deviation of 4.69 over all iterations.
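The adaptive construction of R described above (relaxing τm and τs until |R| reaches one percent of |U|) can be sketched as follows; the per-document scores, step sizes, and starting thresholds are invented for illustration:

```python
# Relax the margin threshold tau_m and similarity threshold tau_s stepwise
# until the residual set holds at least one percent of the unlabelled set.
# Scores, step sizes and starting values are invented for illustration.
uncertainty = {"u1": 0.02, "u2": 0.40, "u3": 0.01, "u4": 0.15}
similarity  = {"u1": 0.78, "u2": 0.50, "u3": 0.07, "u4": 0.60}
n_unlabelled = 200                      # pretend |U| = 200 -> need |R| >= 2

tau_m, tau_s = 0.01, 0.90               # strict initial thresholds
R = set()
while len(R) < max(1, 0.01 * n_unlabelled) and tau_s > 0:
    R = {u for u in uncertainty
         if uncertainty[u] <= tau_m and similarity[u] >= tau_s}
    tau_m += 0.01                       # accept larger margins
    tau_s -= 0.01                       # accept lower similarity

print(sorted(R))                        # -> ['u1', 'u4']
```

The loop stops as soon as enough ambiguous, on-topic documents (u1, u4) qualify, before the off-topic u3 or the confidently classified u2 would be admitted.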
Another interesting figure is the count of unique subtopics found over all iterations, in relation to the total number available, within the highest-scoring settings (#subtopics = 2, 4; NF = on; IGN share variable); it ranges between 60 and 66. Since we already detect up to 26.20 subtopics in one iteration, it may not be worthwhile to restart with differing initial values, depending on the application.

Figure 4: Average number of found subtopics over 20 rounds with original CBI, 2 (left panel) and 4 (right panel) initially given subtopics, and varying size of R.

More detailed results for CBI and random are given in Table 1a. In the first column, settings are listed to compare different numbers of initially given subtopics and the document size of R. In the other columns, the average number of found countries, languages & demonyms, and the total count of subtopic and related words over all iterations with different initially given subtopics are depicted. Selecting words at random gives much better results than applying CBI in its original form. When composing R, those documents are rated best for which it is uncertain how to associate them with the given subtopic clusters. Since we work in a multi-topic environment, R will mostly comprise documents that do not share any words with those from L; the posteriors for the given clusters and u ∈ U are then nearly equal, which results in small margins. For new subtopic proposals, those words are preferred that occur most often in R and least often in L. Since R is multi-topical, the most frequent words from R, with a high unrelatedness to the searched topic, are proposed. The size of R is set very low to limit the number of unrelated documents and therefore to achieve the best results under these critical circumstances. The more documents we consider for R, the worse the performance gets, variance aside.
Detailed results for MCBI in different configurations (number of initial subtopics, negative feedback, and ignoring shares) are given in Table 1b. MCBI without using negative feedback finds 44% more subtopics than random. Incorporating negative feedback, without factoring in ignored words, yields a further jump to a total improvement of 117%. Furthermore, the choice of setting the IGN threshold to 0.5 yields the best results, as explained in Section 3.

Table 1: Detailed results (average subtopics found over all iterations).

(a) CBI and random:

  Setting (#subtopics, |R|)   Countries   Lang. & Demonyms   Total
  Random                      5.52        5.60               11.12
  2, 0.003·|U|                1.2         3.25               4.45
  2, 0.005·|U|                0.9         3.9                4.8
  2, 0.01·|U|                 0.4         3.15               3.55
  2, 0.05·|U|                 0.05        2.4                2.45
  2, 0.1·|U|                  0.0         2.3                2.3
  4, 0.003·|U|                1.05        2.95               4.0
  4, 0.005·|U|                0.9         2.8                3.7
  4, 0.01·|U|                 0.95        2.85               3.8
  4, 0.05·|U|                 0.0         1.45               1.45
  4, 0.1·|U|                  0.1         2.15               2.25

(b) MCBI and random:

  Setting (#subtopics, NF, IGN)   Countries   Lang. & Demonyms   Total
  Random                          5.52        5.60               11.12
  2, off, −                       7.75        8.20               15.95
  2, on, 0                        15.40       8.60               24.00
  2, on, 0.1                      15.55       8.9                24.45
  2, on, 0.3                      15.35       8.9                24.25
  2, on, 0.5                      15.90       10.30              26.20
  2, on, 0.7                      15.60       10.15              25.75
  2, on, 1.0                      10.15       7.05               17.20
  4, off, −                       9.00        7.10               16.10
  4, on, 0                        15.40       7.90               23.30
  4, on, 0.1                      15.90       8.00               23.90
  4, on, 0.3                      15.95       8.45               24.40
  4, on, 0.5                      16.30       9.05               25.35
  4, on, 0.7                      15.90       9.35               25.25
  4, on, 1.0                      9.00        6.90               15.90

5 Conclusion

This work addressed the clustering by intent scenario recently introduced in [7]. For this scenario, an active approach for detecting subtopics was presented. This approach extends the CBI algorithm to multi-topic environments, and to the incorporation of positive and negative user feedback. It iteratively constructs so-called residual sets of documents within the same topic.
Based on these residual sets, it derives candidates for new subtopics. Then, feedback on these candidates is actively queried from the user. Finally, this feedback is used to refine the subtopic and residual sets in the next iteration. The approach was evaluated on a corpus of the Wikipedia articles for the 4309 most common English nouns, comprising a broad range of different topics. In our experiments, MCBI provides an improvement of up to 137% over random sampling, and 445% over CBI. While these first results are promising, a more extensive experimental evaluation is planned for the future. Since the learning task is unsupervised, a way to automatically tune parameters, such as the uncertainty and similarity thresholds for constructing the residual set as well as IGN, across differing datasets is desirable. Furthermore, the role of negative feedback in such a system seems worth investigating further.

Acknowledgements

We would like to thank Andreas Nürnberger, Daniel Kottke, and George Forman for insightful discussions on this topic.

References

[1] Ruizhang Huang and Wai Lam. An active learning framework for semi-supervised document clustering with language modeling. Data and Knowledge Engineering, 68(1):49–67, 2009.

[2] Martin Atzmüller. Subgroup discovery. WIREs Data Mining and Knowledge Discovery, 5(1):35–49, 2015.

[3] Mohamed-Rafik Bouguelia, Yolande Belaïd, and Abdel Belaïd. Efficient active novel class detection for data stream classification. In 22nd International Conference on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August 24-28, 2014, pages 2826–2831, 2014.

[4] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on twitter based on temporal and social terms. In Proceedings of the Tenth International Workshop on Multimedia Data Mining, pages 4:1–4:10, 2010.

[5] Derya Dinler and Mustafa Kemal Tural. A Survey of Constrained Clustering, pages 207–235. Springer International Publishing, Cham, 2016.
[6] Brian Eriksson, Gautam Dasarathy, Aarti Singh, and Robert Nowak. Active clustering: Robust and efficient hierarchical clustering using adaptively selected similarities. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of JMLR Workshop and Conference Proceedings, pages 260–268, 2011.

[7] George Forman, Hila Nachlieli, and Renato Keshet. Clustering by intent: A semi-supervised method to discover relevant clusters incrementally. In Albert Bifet, Michael May, Bianca Zadrozny, Ricard Gavalda, Dino Pedreschi, Francesco Bonchi, Jaime Cardoso, and Myra Spiliopoulou, editors, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part III, pages 20–36, Cham, 2015. Springer International Publishing.

[8] Dragan Gamberger, Nada Lavrač, and Goran Krstačić. Active subgroup mining: a case study in coronary heart disease risk group detection. Artificial Intelligence in Medicine, 28:27–57, 2003.

[9] Thomas Hofmann and Joachim M. Buhmann. Active data clustering. In Proceedings of the 10th Conference on Advances in Neural Information Processing Systems, NIPS '97, pages 528–534. MIT Press, 1998.

[10] Janez Kranjc, Jasmina Smailović, Vid Podpečan, Miha Grčar, Martin Žnidaršič, and Nada Lavrač. Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform. Information Processing and Management, 51:187–203, 2014.

[11] Akshay Krishnamurthy. Interactive Algorithms for Unsupervised Machine Learning. PhD thesis, 2015.

[12] Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Trans. on Knowl. and Data Eng., 23(6):859–874, June 2011.

[13] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham.
Classification and novel class detection in data streams with active mining. In Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part II, PAKDD 2010, pages 311–324, Berlin, Heidelberg, 2010. Springer-Verlag.

[14] Hema Raghavan, Omid Madani, and Rosie Jones. Active learning with feedback on features and instances. Journal of Machine Learning Research, 7:1655–1686, 2006.

[15] Burr Settles. Active Learning. Number 18 in Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2012.

[16] Llanos Tobarra, Antonio Robles-Gómez, Salvador Ros, Roberto Hernández, and Agustín C. Caminero. Discovery of interest topics in web-based educational communities. In Proceedings of the International Symposium on Computers in Education (SIIE), pages 87–92. IEEE, 2012.

[17] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proc. of the 1st Europ. Symposium on Principles of Data Mining and Knowledge Discovery, pages 78–87. Springer, 1997.

[18] Xiatian Zhu, Chen Change Loy, and Shaogang Gong. Constrained clustering with imperfect oracles. IEEE Transactions on Neural Networks and Learning Systems, 27(6):1345–1357, 2015.

 1: function CBI(U, L, C)
 2:     MNB ← trainClassifier(L, C)                        ▷ train Multinomial Naive Bayes
 3:     R ← ∅                                              ▷ build residual set
 4:     for u ∈ U do
 5:         p·|u ← posteriorScores(MNB, u)                 ▷ compute scores
 6:         (p_best, p_2nd) ← maxN(p·|u, 2)                ▷ get the two highest scores
 7:         uncertainty_u ← p_best − p_2nd                 ▷ compute their difference
 8:         if uncertainty_u ≤ τ_m then
 9:             R ← R ∪ {u}                                ▷ add document to residual set
10:         end if
11:     end for
12:
13:     i ← 0                                              ▷ get words and their scores
14:     for w ∈ words(R) do
15:         i ← i + 1
16:         word_i ← w                                     ▷ compute scores
17:         discscore_i ← |getDocumentsWithWords(U, w)| − |getDocumentsWithWords(L, w)|
18:     end for
19:     Q ← getBestScores(word, discscore)                 ▷ select candidates
20:
21:     Q_accepted ← queryUser(Q)                          ▷ query user's feedback
22:     C ← C ∪ Q_accepted                                 ▷ update clustering
23:     L ← L ∪ getDocumentsWithWords(U, Q_accepted)       ▷ update labelled set
24:     U ← U \ getDocumentsWithWords(U, Q_accepted)       ▷ update unlabelled set
25:     return U, L, C
26: end function

Figure 5: The implemented CBI algorithm (based on [7]).
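To make the core of Figure 5 concrete, the following is a minimal Python sketch of a single CBI iteration: training a Multinomial Naive Bayes classifier, building the residual set from the documents with the smallest gap between the two highest posterior scores, and scoring candidate subtopic words. It uses scikit-learn for illustration only and is not the authors' implementation; the threshold value tau_m, the whitespace tokenization, and the toy documents in the usage example are assumptions for demonstration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def cbi_iteration(U, L, labels, tau_m=0.2, n_candidates=5):
    """One CBI iteration sketch: train MNB on the labelled documents L,
    build the residual set R from uncertain documents in U, and return
    candidate subtopic words Q to present to the user for feedback."""
    vec = CountVectorizer()
    X_L = vec.fit_transform(L)
    mnb = MultinomialNB().fit(X_L, labels)     # train Multinomial Naive Bayes

    # Uncertainty of a document: difference of its two highest posteriors.
    post = mnb.predict_proba(vec.transform(U))
    top2 = np.sort(post, axis=1)[:, -2:]       # [second-best, best] per doc
    uncertainty = top2[:, 1] - top2[:, 0]      # p_best - p_2nd
    R = [u for u, unc in zip(U, uncertainty) if unc <= tau_m]

    # Discriminative score of a word: number of documents containing it
    # in the unlabelled set minus the number in the labelled set.
    def doc_count(docs, w):
        return sum(1 for d in docs if w in d.split())

    words = {w for doc in R for w in doc.split()}
    scores = {w: doc_count(U, w) - doc_count(L, w) for w in words}
    return sorted(scores, key=scores.get, reverse=True)[:n_candidates]
```

The accepted candidates would then be used to update the clustering C and to move the matching documents from U to L, as in lines 21-24 of Figure 5.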