Query-based Topic Detection Using Concepts and Named Entities

Ilias Gialampoukidis, Dimitris Liparas, Stefanos Vrochidis, and Ioannis Kompatsiaris
Information Technologies Institute, CERTH, Thessaloniki, Greece, email: {heliasgj, dliparas, stefanos, ikom}@iti.gr

Abstract. In this paper, we present a framework for topic detection in news articles. The framework receives as input the results retrieved from a query-based search and clusters them by topic. To this end, the recently introduced "DBSCAN-Martingale" method for automatically estimating the number of topics and the well-established Latent Dirichlet Allocation topic modelling approach for the assignment of news articles to topics of interest are utilized. Furthermore, the proposed query-based topic detection framework works on high-level textual features (such as concepts and named entities) that are extracted from the news articles. Our topic detection approach is tackled as a text clustering task, without knowing the number of clusters, and compares favorably to several text clustering approaches on a public dataset of retrieved results with respect to four representative queries.

1 INTRODUCTION

The need of both journalists and media monitoring companies to master the large amounts of news articles produced on a daily basis, in order to identify and detect interesting topics and events, has highlighted the importance of the topic detection task. In general, topic detection aims at grouping together stories/documents that discuss the same topic or event. Formally, a topic is defined in [1] as "a specific thing that happens at a specific time and place along with all necessary preconditions and unavoidable consequences". It is clarified in [1] that the notion of "topic" is not general, like "accidents", but is limited to a specific collection of related events of the same type, such as "cable car crash". We shall refer to topics as news clusters, or simply clusters.

The two main challenges involved in the topic detection problem are the following: one needs to (1) estimate the correct number of topics/news clusters and (2) assign the most similar news articles to the same cluster. In addition, the following assumptions must be made: firstly, real data is highly noisy and the number of clusters is not known a priori; secondly, there is a lower bound for the minimum number of documents per news cluster.

In this context, we present and describe a hybrid clustering framework for topic detection, which has been developed within the FP7 MULTISENSOR project (http://www.multisensorproject.eu/). For a given query-based search, the main idea is to efficiently cluster the retrieved results, without the need for a pre-specified number of topics. To this end, the framework, recently introduced in [2], combines automatic estimation of the number of clusters and assignment of news articles to topics of interest on the results of a text query. The estimation of the number of clusters is done by the novel "DBSCAN-Martingale" method [2], which can deal with the aforementioned assumptions. All clusters are progressively extracted (by a density-based algorithm) by applying Doob's martingale, and Latent Dirichlet Allocation is then applied for the assignment of news articles to topics. Contrary to [2], the contribution of this paper lies in the fact that the overall framework relies on high-level textual features (concepts and named entities) that are extracted from the retrieved results of a textual query, and can therefore assist any search engine.

The rest of the paper is organized as follows: Section 2 provides related work with respect to topic detection, news clustering and density-based clustering. In Section 3, our framework for topic detection is presented and described. Section 4 discusses the experimental results from the application of our framework and several other clustering methods to four collections of text documents, related to four given queries, respectively. Finally, some concluding remarks are provided in Section 5.
2 RELATED WORK

Topic detection is traditionally considered as a clustering problem [3], due to the absence of training sets. The clustering task usually involves feature selection [4], spectral clustering [5] and k-means oriented [3] techniques, assuming mainly that the number of topics to be discovered is known a priori and that there is no noise, i.e. no news items that do not belong to any of the news clusters. Latent Dirichlet Allocation (LDA) is a popular approach for topic modelling for a given number of topics k [6]. LDA has been generalized to nonparametric Bayesian approaches, such as the hierarchical Dirichlet process [7] and DP-means [8], which predict the number of topics k. The extraction of the correct number of topics is equivalent to the estimation of the correct number of clusters in a dataset. The majority vote among 30 clustering indices has been proposed in [9] as an indicator for the number of clusters in a dataset. In contrast, we propose an alternative majority vote among 10 realizations of the "DBSCAN-Martingale", which is a modification of the DBSCAN algorithm [10], whose parameters are the density level ε and a lower bound on the minimum number of points per cluster. The DBSCAN-Martingale [2], however, regards the density level ε as a random variable, and the clusters are progressively extracted. We consider the general case, where the number of topics to be discovered is unknown and it is possible to have news articles which are not assigned to any topic.

Graph-based methods for event detection and multimodal clustering in social media streams have appeared in [11], where a graph clustering algorithm is applied on the graph of items. The decision whether to link two items or not is based on the output of a classifier, which assigns, or not, the candidate items to the same cluster. Contrary to this graph-based approach, we cluster news items in an unsupervised way.

Density-based clustering does not require the number of topics as input. OPTICS [12] is very useful for the visualization of the cluster structure and for the optimal selection of the density level ε. The OPTICS-ξ algorithm [12] requires an extra parameter ξ, which has to be set manually in order to find "dents" in the OPTICS reachability plot. The automatic extraction of clusters from the OPTICS reachability plot, as an extension of the OPTICS-ξ algorithm, has been presented in [13] and has been outperformed by HDBSCAN [14] on several datasets of any nature. In the context of news clustering, however, we examine whether these density-based algorithms perform well on the topic detection problem, by comparing them with our DBSCAN-Martingale in terms of the number of estimated topics. All the aforementioned methods, which do not require the number of topics to be known a priori, are combined with LDA in order to examine whether the use of the DBSCAN-Martingale (combined with LDA) provides the most efficient assignment of news articles to topics.
3 TOPIC DETECTION USING CONCEPTS AND NAMED ENTITIES

The MULTISENSOR framework for topic detection, which is presented in Figure 1, approaches the task as a news clustering problem in which the number of topics needs to be estimated. The overall framework is based on textual features, namely concepts and named entities. The number of topics k is estimated by the DBSCAN-Martingale and the assignment of news articles to topics is done using Latent Dirichlet Allocation (LDA).

LDA has shown great performance in text clustering, given the number of topics. However, in realistic applications the number of topics is unknown to the system. On the other hand, DBSCAN does not require the number of clusters as input, but its performance in text clustering is very weak, because it assigns too large a part of the news article collection to noise, which results in very limited performance [2]. Moreover, it is difficult to find a unique density level that can output all clusters. Thus, we keep only the number of clusters from density-based clustering, and the assignment of documents to topics is done by the well-performing LDA.

Figure 1. The MULTISENSOR topic detection framework using DBSCAN-Martingale and LDA

In our approach, the constructed DBSCAN-Martingale combines several density levels and is applied on high-level concepts and named entities. In the following, the construction of the DBSCAN-Martingale is briefly reported.

3.1 The DBSCAN-Martingale

Given a collection of n news articles, density-based clustering algorithms output a clustering vector C whose values are the cluster IDs C[j] for each news item j = 1, 2, …, n, where we denote by C[j] the j-th element of the vector C. In case the j-th document is not assigned to any of the clusters, the j-th cluster ID is zero. Assuming that C_DBSCAN(ε) is the clustering vector provided by the DBSCAN algorithm [10] for the density level ε, the problem is to combine the results for several values of ε into one unique clustering result. To that end, a martingale construction has been presented in [2], where the density level ε is a random variable, uniformly sampled in a pre-defined interval.

Figure 2. One realization of the DBSCAN-Martingale with T = 2 iterations and 3 topics detected [2]

The DBSCAN-Martingale progressively updates the estimation of the number of clusters (topics), as shown in Figure 2, where 3 topics are detected in 2 iterations of the process. Due to the randomness in the selection of the density levels ε, each realization of the DBSCAN-Martingale outputs a (possibly different) random variable k̂ as an estimation of the number of clusters. Hence, we allow 10 realizations k̂_1, k̂_2, …, k̂_10, and the final estimation of the number of clusters is the majority vote over them. An illustrative example of 5 clusters in the 2-dimensional plane is demonstrated in Figure 3.

Figure 3. Example in the 2-dimensional plane and the histogram of results after 100 realizations of the DBSCAN-Martingale

In brief, the DBSCAN-Martingale is formulated as follows. Firstly, a sample of size T of density levels ε_t, t = 1, 2, …, T, is randomly generated in [0, ε_max], where ε_max is an upper bound for the density levels. The sample ε_t, t = 1, 2, …, T, is then sorted in increasing order. For each density level ε_t we find the corresponding clustering vector C_DBSCAN(ε_t), for all stages t = 1, 2, …, T. In the first stage, all clusters detected by C_DBSCAN(ε_1) are kept, corresponding to the lowest density level ε_1. In the second stage (t = 2), some of the clusters detected by C_DBSCAN(ε_2) are new and some of them have also been detected by C_DBSCAN(ε_1). In order to keep only the newly detected clusters, we keep only groups of points with the same cluster ID whose size is greater than minPts. Finally, the cluster IDs are relabelled and the maximum value of the resulting clustering vector provides the number of clusters.

Complexity: The DBSCAN-Martingale requires T iterations of the DBSCAN algorithm, which runs in O(n log n) if a tree-based spatial index can be used and in O(n²) without tree-based spatial indexing [12]. Therefore, the DBSCAN-Martingale runs in O(Tn log n) for tree-based indexed datasets and in O(Tn²) without tree-based indexing. Our code (https://github.com/MKLab-ITI/topic-detection) is written in R (https://www.r-project.org/), using the dbscan package (https://cran.r-project.org/web/packages/dbscan/index.html), which runs DBSCAN in O(n log n) with kd-tree data structures for fast nearest neighbor search.
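To make the construction above concrete, the following is a minimal sketch of the procedure in R, using the dbscan package. The function and argument names (dbscan_martingale, estimate_k, n_iter, eps_max, n_real) are ours and are not taken from the released code linked above; the input x is assumed to be a numeric feature matrix with one row per news article.

```r
# One realization of the DBSCAN-Martingale, as described above (a sketch, not the
# authors' reference implementation).
library(dbscan)

dbscan_martingale <- function(x, eps_max, n_iter = 5, minPts = 10) {
  eps <- sort(runif(n_iter, min = 0, max = eps_max))  # random density levels, increasing
  assigned <- rep(0L, nrow(x))                        # 0 = not assigned to any cluster yet
  k <- 0L
  for (t in seq_len(n_iter)) {
    ct <- dbscan(x, eps = eps[t], minPts = minPts)$cluster
    ct[assigned != 0L] <- 0L                          # ignore points already clustered
    for (id in setdiff(unique(ct), 0L)) {
      members <- which(ct == id)
      if (length(members) >= minPts) {                # keep only newly detected clusters
        k <- k + 1L                                   # relabel with a fresh cluster ID
        assigned[members] <- k
      }
    }
  }
  k                                                   # estimated number of topics
}

# Majority vote over 10 realizations, as described above
estimate_k <- function(x, eps_max, n_iter = 5, minPts = 10, n_real = 10) {
  k_hat <- replicate(n_real, dbscan_martingale(x, eps_max, n_iter, minPts))
  as.integer(names(which.max(table(k_hat))))
}
```

The estimate returned by the majority vote is then passed to LDA (Section 3.2), which performs the actual assignment of articles to topics.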
3.2 Latent Dirichlet Allocation (LDA)

LDA assumes a Bag-of-Words (BoW) representation of the collection of documents, and each topic is a distribution over the terms of a fixed vocabulary. LDA assigns probabilities to words and assumes that documents exhibit multiple topics, in order to assign a probability distribution to the set of documents. Finally, LDA assumes that the order of words does not matter and, therefore, is not applicable to word n-grams for n ≥ 2, but can be applied to named entities and concepts. This input allows topic detection even in multilingual corpora, where n-grams are not available in a common language.
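For the assignment step, one possible realization in R uses the tm and topicmodels packages; this is a hedged sketch and not necessarily the implementation used in MULTISENSOR. Here docs is assumed to be a character vector that holds, for each article, its extracted concepts and named entities concatenated into a single pseudo-document, and k is the estimate returned by the DBSCAN-Martingale.

```r
# A possible LDA assignment step (tm + topicmodels); docs and k are assumed to come
# from the preprocessing and the DBSCAN-Martingale, respectively.
library(tm)
library(topicmodels)

assign_topics <- function(docs, k) {
  corpus <- VCorpus(VectorSource(docs))        # one pseudo-document per article
  dtm <- DocumentTermMatrix(corpus)            # Bag-of-Words counts over concepts/entities
  lda <- LDA(dtm, k = k, method = "Gibbs")     # fit LDA with the estimated number of topics
  list(
    assignment = topics(lda),                  # most probable topic per news article
    labels     = terms(lda, 3)                 # top-3 concepts/entities per topic
  )
}
```

The top terms returned for each topic correspond to the cluster labels used in the evaluation of Section 4.2, since those labels are the concepts or named entities with the highest LDA probability within each topic.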
4 EXPERIMENTS

In this Section, we describe our dataset and evaluate our method.

4.1 Dataset description

A part of the present MULTISENSOR database (in which articles crawled from international news websites are stored) was used for the evaluation of our query-based topic detection framework. We use the results retrieved for a given query in order to cluster them into labelled clusters (topics) without knowing the number of clusters. The concepts and named entities are extracted using the DBpedia Spotlight online tool (https://dbpedia-spotlight.github.io/demo/), and the final concepts and named entities replaced the raw text of each news article. The final collection of text documents is available online (http://mklab2.iti.gr/project/query-based-topic-detection-dataset). The queries that were used for the experiments are the following:

• energy crisis
• energy policy
• home appliances
• solar energy

It should be noted that the aforementioned queries are considered representative with respect to the use cases addressed by the MULTISENSOR project. The output of our topic detection framework is visualized in Figure 4 for the query "home appliances", where the retrieved results are clustered into 9 topics. The font size of each cluster's labels depends on the corresponding word probability within that cluster.

Figure 4. Demonstration of the MULTISENSOR topic detection framework

4.2 Evaluation results

In order to evaluate the clustering of the retrieved news articles, we use the average precision (AP), broadly used in the context of information retrieval, clustering and classification. A document d of a cluster C is considered relevant to C (true positive) if at least one concept associated with document d also appears in the label of cluster C. It should be noted that the labels of the clusters (topics) are provided by the concepts or named entities that have the highest probability (provided by LDA) within each topic. Precision is the fraction of relevant documents in a cluster, and average precision is the average over all clusters of a query. Finally, we average the AP scores over all considered queries to obtain the Mean Average Precision (MAP).

We compared the clustering performance of the proposed topic detection framework, in which the DBSCAN-Martingale algorithm (for estimating the number of topics) and LDA (for assigning news articles to topics) are employed, against a variety of well-known clustering approaches, which were also combined with LDA for a fair comparison. DP-means [8] is a k-means-like algorithm derived from the Dirichlet process, for which we used its available implementation in R (https://github.com/johnmyleswhite/bayesian_nonparametrics). HDBSCAN is a hierarchical DBSCAN approach, which uses the "excess-of-mass" (EOM) criterion to find the optimal cut. NbClust is the majority vote of the first 16 indices, which are all described in detail in [9].
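The evaluation measure just described can be written down compactly; the sketch below assumes doc_concepts (a list of character vectors with the concepts of each document), assignment (the topic ID per document) and topic_labels (a list of character vectors with the label concepts of each topic), all of which are our own placeholder names.

```r
# Average Precision for one query, following the definition above: a document is
# relevant to its cluster if at least one of its concepts appears in the cluster label.
average_precision <- function(doc_concepts, assignment, topic_labels) {
  precisions <- sapply(seq_along(topic_labels), function(cl) {
    members <- which(assignment == cl)
    if (length(members) == 0) return(NA_real_)           # empty cluster: skip
    relevant <- sapply(members, function(d)
      length(intersect(doc_concepts[[d]], topic_labels[[cl]])) > 0)
    mean(relevant)                                       # precision of cluster cl
  })
  mean(precisions, na.rm = TRUE)                         # AP: average over the clusters
}

# MAP: average of the AP scores over all queries, e.g.
# mean(sapply(queries, function(q)
#   average_precision(q$doc_concepts, q$assignment, q$topic_labels)))
```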
Table 1. Average Precision (± standard deviation) and Mean Average Precision over 10 runs of LDA using the estimated number of topics

Index + LDA          energy crisis    energy policy    home appliances  solar energy     MAP
CH                   0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Duda                 0.4498±0.0671    0.5534±0.0457    0.4299±0.0237    0.4484±0.0067    0.4703
Pseudo t^2           0.4498±0.0671    0.5534±0.0457    0.4299±0.0237    0.4484±0.0067    0.4703
C-index              0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Ptbiserial           0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
DB                   0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Frey                 0.3541±0.0181    0.3911±0.0033    0.3745±0.064     0.4484±0.0067    0.3920
Hartigan             0.5938±0.0502    0.5336±0.0375    0.5942±0.0282    0.5961±0.0347    0.5794
Ratkowsky            0.5357±0.0151    0.5371±0.0357    0.4962±0.0721    0.5375±0.0446    0.5266
Ball                 0.4207±0.0093    0.4501±0.0021    0.4975±0.016     0.4464±0.0614    0.4536
McClain              0.5786±0.0425    0.5371±0.0357    0.3745±0.064     0.5961±0.0347    0.5215
KL                   0.5786±0.0425    0.5371±0.0357    0.5701±0.0145    0.5961±0.0347    0.5704
Silhouette           0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Dunn                 0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
SDindex              0.3541±0.0181    0.3911±0.0033    0.5942±0.0282    0.4484±0.0067    0.4469
SDbw                 0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
NbClust              0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
DP-means             0.3541±0.0181    0.3911±0.0033    0.3745±0.064     0.4484±0.0067    0.3920
HDBSCAN-EOM          0.4498±0.0671    0.3911±0.0033    0.5951±0.0184    0.5375±0.0446    0.4933
DBSCAN-Martingale    0.7691±0.0328    0.5534±0.0457    0.6115±0.0225    0.6073±0.0303    0.6353

Table 2. Estimation of the number of topics in the MULTISENSOR queries

Index                energy crisis    energy policy    home appliances  solar energy
CH                   12               8                15               15
Duda                 4                4                3                2
Pseudo t^2           4                4                3                2
C-index              12               8                15               15
Ptbiserial           12               8                15               15
DB                   12               8                15               15
Frey                 2                2                2                2
Hartigan             11               7                15               15
Ratkowsky            7                8                5                5
Ball                 3                3                3                3
McClain              12               8                2                15
KL                   12               8                11               15
Silhouette           12               8                15               15
Dunn                 12               8                15               15
SDindex              2                2                15               2
SDbw                 12               8                15               15
NbClust              12               8                15               15
DP-means             2                2                2                2
HDBSCAN-EOM          4                2                10               5
DBSCAN-Martingale    6                4                9                10

The AP scores per query and the MAP scores per method over 10 runs of LDA are displayed in Table 1, for each estimation of the number of topics combined with LDA. In addition, the numbers of news clusters estimated by the considered clustering indices for each query are presented in Table 2. Looking at Table 1, we observe a relative increase of 9.65% in MAP when our topic detection framework is compared to the second highest MAP score (by Hartigan+LDA), and a relative increase of 10.20% when compared to the most recent approach (NbClust+LDA).

In general, the proposed topic detection framework outperforms all the considered clustering approaches, both in terms of AP (within each query) and in terms of MAP (overall performance over all queries), with the exception of the "energy policy" query, where the performance of our framework is matched by that of the Duda and Pseudo t^2 clustering indices.

Finally, we evaluated the time performance of the DBSCAN-Martingale method and selected several baseline approaches in order to compare their processing time with that of our approach. In Figure 5, the number of news clusters is estimated with T = 5 iterations for the DBSCAN-Martingale and with the maximum number of clusters set to 15 for the indices Duda, Pseudo t^2, Silhouette, Dunn and SDindex. We observe that the DBSCAN-Martingale is faster than all other methods. Even when it is applied to 500 documents, it is able to reach a decision about the number of clusters in approximately 0.4 seconds.

Figure 5. Time performance of DBSCAN-Martingale and several baseline approaches to estimate the number of news clusters
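For completeness, a rough sketch of how such a timing comparison can be set up with system.time and the NbClust package [9] for an index-based baseline; the synthetic matrix x merely stands in for the feature vectors of 500 retrieved articles, eps_max = 3 is an arbitrary illustrative choice, and estimate_k is the sketch from Section 3.1, not the authors' benchmark script.

```r
# Timing sketch (assumed settings): DBSCAN-Martingale with T = 5 iterations vs. a
# clustering index searched over 2..15 clusters, on a stand-in matrix of 500 documents.
library(NbClust)

x <- matrix(rnorm(500 * 10), nrow = 500)  # placeholder for 500 document feature vectors

t_martingale <- system.time(
  estimate_k(x, eps_max = 3, n_iter = 5, minPts = 10)
)
t_silhouette <- system.time(
  NbClust(x, min.nc = 2, max.nc = 15, method = "kmeans", index = "silhouette")
)
print(rbind(t_martingale, t_silhouette))
```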
5 CONCLUSIONS

In this paper, we have presented a hybrid topic detection framework, developed for the purposes of the MULTISENSOR project. Given a query-based search, the framework clusters the retrieved results by topic, without the need to know the number of topics a priori. The framework employs the recently introduced DBSCAN-Martingale method for efficiently estimating the number of news clusters, coupled with Latent Dirichlet Allocation for assigning the news articles to topics. Our topic detection framework relies on high-level textual features that are extracted from the news articles, namely textual concepts and named entities. In addition, it is multimodal, since it fuses more than one source of information from the same multimedia object. The query-based topic detection experiments have shown that our framework outperforms several well-known clustering methods, both in terms of Average Precision and Mean Average Precision. A direct comparison in terms of time performance has shown that our approach is faster than several well-performing methods in the estimation of the number of clusters, given the same number of query-based retrieved news articles as input.

As future work, we plan to investigate the behavior of our framework when additional modalities/features are introduced, to examine the application of alternative (other than LDA) text clustering approaches, and to investigate the extraction of language-agnostic concepts and named entities, which could provide multilingual capabilities to our topic detection framework.

ACKNOWLEDGEMENTS

This work was supported by the projects MULTISENSOR (FP7-610411) and KRISTINA (H2020-645012), funded by the European Commission.

REFERENCES

[1] J. Allan (Ed.), 'Topic detection and tracking: event-based information organization', vol. 12, Springer Science & Business Media, (2012).
[2] I. Gialampoukidis, S. Vrochidis and I. Kompatsiaris, 'A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA', In: Perner, P. (Ed.) Machine Learning and Data Mining in Pattern Recognition, LNAI 9729, pp. 170-184, (2016).
[3] C. C. Aggarwal and C. Zhai, 'A survey of text clustering algorithms', In Mining Text Data, pp. 77-128, Springer US, (2012).
[4] M. Qian and C. Zhai, 'Unsupervised feature selection for multi-view clustering on text-image web news data', In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pp. 1963-1966, ACM, (2014).
[5] A. Kumar and H. Daumé, 'A co-training approach for multi-view spectral clustering', In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393-400, (2011).
[6] D. M. Blei, A. Y. Ng and M. I. Jordan, 'Latent dirichlet allocation', Journal of Machine Learning Research, vol. 3, pp. 993-1022, (2003).
[7] Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei, 'Hierarchical dirichlet processes', Journal of the American Statistical Association, 101(476), (2006).
[8] B. Kulis and M. I. Jordan, 'Revisiting k-means: New algorithms via Bayesian nonparametrics', arXiv preprint arXiv:1111.0352, (2012).
[9] M. Charrad, N. Ghazzali, V. Boiteau and A. Niknafs, 'NbClust: an R package for determining the relevant number of clusters in a data set', Journal of Statistical Software, 61(6), pp. 1-36, (2014).
[10] M. Ester, H. P. Kriegel, J. Sander and X. Xu, 'A density-based algorithm for discovering clusters in large spatial databases with noise', In KDD, 96(34), pp. 226-231, (1996).
[11] G. Petkos, M. Schinas, S. Papadopoulos and Y. Kompatsiaris, 'Graph-based multimodal clustering for social multimedia', Multimedia Tools and Applications, pp. 1-23, (2016).
[12] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander, 'OPTICS: ordering points to identify the clustering structure', In ACM SIGMOD Record, 28(2), pp. 49-60, ACM, (1999).
[13] J. Sander, X. Qin, Z. Lu, N. Niu and A. Kovarsky, 'Automatic extraction of clusters from hierarchical clustering representations', In Advances in Knowledge Discovery and Data Mining, pp. 75-87, Springer Berlin Heidelberg, (2003).
[14] R. J. Campello, D. Moulavi and J. Sander, 'Density-based clustering based on hierarchical density estimates', In Advances in Knowledge Discovery and Data Mining, pp. 160-172, Springer Berlin Heidelberg, (2013).