Query-based Topic Detection Using Concepts and Named Entities

Ilias Gialampoukidis, Dimitris Liparas, Stefanos Vrochidis, and Ioannis Kompatsiaris
Information Technologies Institute, CERTH, Thessaloniki, Greece, email: {heliasgj, dliparas, stefanos, ikom}@iti.gr

Abstract. In this paper, we present a framework for topic detection in news articles. The framework receives as input the results retrieved from a query-based search and clusters them by topic. To this end, the recently introduced "DBSCAN-Martingale" method for automatically estimating the number of topics and the well-established Latent Dirichlet Allocation topic modelling approach for the assignment of news articles to topics of interest are utilized. Furthermore, the proposed query-based topic detection framework works on high-level textual features (such as concepts and named entities) that are extracted from the news articles. Our topic detection approach is tackled as a text clustering task, without knowing the number of clusters, and compares favorably to several text clustering approaches on a public dataset of retrieved results with respect to four representative queries.

1 INTRODUCTION

The need of both journalists and media monitoring companies to master the large amounts of news articles produced on a daily basis, in order to identify and detect interesting topics and events, has highlighted the importance of the topic detection task. In general, topic detection aims at grouping together stories/documents that discuss the same topic or event. Formally, a topic is defined in [1] as "a specific thing that happens at a specific time and place along with all necessary preconditions and unavoidable consequences". It is clarified in [1] that the notion of "topic" is not general, like "accidents", but is limited to a specific collection of related events of the same type, such as "cable car crash". We shall refer to topics as news clusters, or simply clusters.

The two main challenges involved in the topic detection problem are the following: one needs to (1) estimate the correct number of topics/news clusters and (2) assign the most similar news articles to the same cluster. In addition, the following assumptions must be made: firstly, real data is highly noisy and the number of clusters is not known a priori; secondly, there is a lower bound for the minimum number of documents per news cluster.

In this context, we present and describe a hybrid clustering framework for topic detection, which has been developed within the FP7 MULTISENSOR project (http://www.multisensorproject.eu/). For a given query-based search, the main idea is to efficiently cluster the retrieved results, without the need for a pre-specified number of topics. To this end, the framework, recently introduced in [2], combines automatic estimation of the number of clusters and assignment of news articles to topics of interest on the results of a text query. The estimation of the number of clusters is done by the novel "DBSCAN-Martingale" method [2], which can deal with the aforementioned assumptions. All clusters are progressively extracted (by a density-based algorithm) by applying Doob's martingale, and Latent Dirichlet Allocation is then applied for the assignment of news articles to topics. Contrary to [2], the contribution of this paper lies in the fact that the overall framework relies on high-level textual features (concepts and named entities) that are extracted from the retrieved results of a textual query, and can therefore assist any search engine.

The rest of the paper is organized as follows: Section 2 provides related work with respect to topic detection, news clustering and density-based clustering. In Section 3, our framework for topic detection is presented and described. Section 4 discusses the experimental results from the application of our framework and several other clustering methods to four collections of text documents, related to four given queries, respectively. Finally, some concluding remarks are provided in Section 5.
2 RELATED WORK

Topic detection is traditionally considered as a clustering problem [3], due to the absence of training sets. The clustering task usually involves feature selection [4], spectral clustering [5] and k-means oriented [3] techniques, assuming mainly that the number of topics to be discovered is known a priori and that there is no noise, i.e. no news items that do not belong to any of the news clusters. Latent Dirichlet Allocation (LDA) is a popular approach for topic modelling for a given number of topics k [6]. LDA has been generalized to nonparametric Bayesian approaches, such as the hierarchical Dirichlet process [7] and DP-means [8], which predict the number of topics k. The extraction of the correct number of topics is equivalent to the estimation of the correct number of clusters in a dataset. The majority vote among 30 clustering indices has been proposed in [9] as an indicator for the number of clusters in a dataset. In contrast, we propose an alternative majority vote among 10 realizations of the "DBSCAN-Martingale", which is a modification of the DBSCAN algorithm [10], whose parameters are the density level ε and a lower bound on the minimum number of points per cluster. The DBSCAN-Martingale [2], however, regards the density level ε as a random variable, and the clusters are progressively extracted. We consider the general case, where the number of topics to be discovered is unknown and it is possible to have news articles which are not assigned to any topic.

Graph-based methods for event detection and multimodal clustering in social media streams have appeared in [11], where a graph clustering algorithm is applied on the graph of items. The decision whether to link two items or not is based on the output of a classifier, which assigns, or not, the candidate items to the same cluster. Contrary to this graph-based approach, we cluster news items in an unsupervised way.

Density-based clustering does not require the number of topics as input. OPTICS [12] is very useful for the visualization of the cluster structure and for the optimal selection of the density level ε. The OPTICS-ξ algorithm [12] requires an extra parameter ξ, which has to be set manually in order to find "dents" in the OPTICS reachability plot. The automatic extraction of clusters from the OPTICS reachability plot, as an extension of the OPTICS-ξ algorithm, has been presented in [13] and has been outperformed by HDBSCAN [14] on several datasets of any nature. In the context of news clustering, however, we examine whether these density-based algorithms perform well on the topic detection problem, by comparing them with our DBSCAN-Martingale in terms of the number of estimated topics. All the aforementioned methods, which do not require the number of topics to be known a priori, are combined with LDA in order to examine whether the use of the DBSCAN-Martingale (combined with LDA) provides the most efficient assignment of news articles to topics.
3 TOPIC DETECTION USING CONCEPTS AND NAMED ENTITIES

The MULTISENSOR framework for topic detection, which is presented in Figure 1, approaches the task as a news clustering problem in which the number of topics needs to be estimated. The overall framework is based on textual features, namely concepts and named entities. The number of topics k is estimated by the DBSCAN-Martingale and the assignment of news articles to topics is done using Latent Dirichlet Allocation (LDA).

LDA has shown great performance in text clustering, given the number of topics. However, in realistic applications the number of topics is unknown to the system. On the other hand, DBSCAN does not require the number of clusters as input, but its performance in text clustering is very weak, because it assigns too large a part of the news article collection to noise, which results in very limited performance [2]. Moreover, it is difficult to find a unique density level that can output all clusters. Thus, we keep only the number of clusters from density-based clustering, and the assignment of documents to topics is done by the well-performing LDA.

Figure 1. The MULTISENSOR topic detection framework using DBSCAN-Martingale and LDA

In our approach, the constructed DBSCAN-Martingale combines several density levels and is applied on high-level concepts and named entities. In the following, the construction of the DBSCAN-Martingale is briefly reported.

3.1 The DBSCAN-Martingale

Given a collection of n news articles, density-based clustering algorithms output a clustering vector C whose values are the cluster IDs C[j] for each news item j = 1, 2, …, n, where we denote by C[j] the j-th element of the vector C. In case the j-th document is not assigned to any of the clusters, the j-th cluster ID is zero. Assuming that C_DBSCAN(ε) is the clustering vector provided by the DBSCAN algorithm [10] for the density level ε, the problem is to combine the results for several values of ε into one unique clustering result. To that end, a martingale construction has been presented in [2], where the density level ε is a random variable, uniformly sampled in a pre-defined interval.

Figure 2. One realization of the DBSCAN-Martingale with T = 2 iterations and 3 topics detected [2]

The DBSCAN-Martingale progressively updates the estimation of the number of clusters (topics), as shown in Figure 2, where 3 topics are detected in 2 iterations of the process. Due to the randomness in the selection of the density levels ε, each realization of the DBSCAN-Martingale outputs a (possibly different) random variable k̂ as an estimation of the number of clusters. Hence, we allow 10 realizations k̂_1, k̂_2, …, k̂_10, and the final estimation of the number of clusters is the majority vote over them. An illustrative example of 5 clusters in the 2-dimensional plane is demonstrated in Figure 3.

Figure 3. Example in the 2-dimensional plane and the histogram of results after 100 realizations of the DBSCAN-Martingale

In brief, the DBSCAN-Martingale is formulated as follows. Firstly, a sample of size T of density levels ε_t, t = 1, 2, …, T, is randomly generated in [0, ε_max], where ε_max is an upper bound for the density levels. The sample ε_t, t = 1, 2, …, T, is then sorted in increasing order. For each density level ε_t we find the corresponding clustering vector C_DBSCAN(ε_t), for all stages t = 1, 2, …, T. In the first stage, all clusters detected by C_DBSCAN(ε_1) are kept, corresponding to the lowest density level ε_1. In the second stage (t = 2), some of the clusters detected by C_DBSCAN(ε_2) are new and some of them have also been detected by C_DBSCAN(ε_1). In order to keep only the newly detected clusters, we keep only groups of points with the same cluster ID whose size is greater than minPts. Finally, the cluster IDs are relabelled and the maximum value of the resulting clustering vector provides the number of clusters.

Complexity: The DBSCAN-Martingale requires T iterations of the DBSCAN algorithm, which runs in O(n log n) if a tree-based spatial index can be used and in O(n²) without tree-based spatial indexing [12]. Therefore, the DBSCAN-Martingale runs in O(Tn log n) for tree-based indexed datasets and in O(Tn²) without tree-based indexing. Our code (https://github.com/MKLab-ITI/topic-detection) is written in R (https://www.r-project.org/), using the dbscan package (https://cran.r-project.org/web/packages/dbscan/index.html), which runs DBSCAN in O(n log n) with kd-tree data structures for fast nearest neighbor search.
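To make the construction above concrete, the following is a minimal sketch of the procedure in R, using the dbscan package. The function and argument names (dbscan_martingale, estimate_k, n_iter, eps_max, n_real) are ours and are not taken from the released code linked above; the input x is assumed to be a numeric feature matrix with one row per news article.

```r
# One realization of the DBSCAN-Martingale, as described above (a sketch, not the
# authors' reference implementation).
library(dbscan)

dbscan_martingale <- function(x, eps_max, n_iter = 5, minPts = 10) {
  eps <- sort(runif(n_iter, min = 0, max = eps_max))  # random density levels, increasing
  assigned <- rep(0L, nrow(x))                        # 0 = not assigned to any cluster yet
  k <- 0L
  for (t in seq_len(n_iter)) {
    ct <- dbscan(x, eps = eps[t], minPts = minPts)$cluster
    ct[assigned != 0L] <- 0L                          # ignore points already clustered
    for (id in setdiff(unique(ct), 0L)) {
      members <- which(ct == id)
      if (length(members) >= minPts) {                # keep only newly detected clusters
        k <- k + 1L                                   # relabel with a fresh cluster ID
        assigned[members] <- k
      }
    }
  }
  k                                                   # estimated number of topics
}

# Majority vote over 10 realizations, as described above
estimate_k <- function(x, eps_max, n_iter = 5, minPts = 10, n_real = 10) {
  k_hat <- replicate(n_real, dbscan_martingale(x, eps_max, n_iter, minPts))
  as.integer(names(which.max(table(k_hat))))
}
```

The estimate returned by the majority vote is then passed to LDA (Section 3.2), which performs the actual assignment of articles to topics.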
3.2 Latent Dirichlet Allocation (LDA)

LDA assumes a Bag-of-Words (BoW) representation of the collection of documents, and each topic is a distribution over the terms of a fixed vocabulary. LDA assigns probabilities to words and assumes that documents exhibit multiple topics, in order to assign a probability distribution to the set of documents. Finally, LDA assumes that the order of words does not matter and, therefore, is not applicable to word n-grams for n ≥ 2, but can be applied to named entities and concepts. This input allows topic detection even in multilingual corpora, where n-grams are not available in a common language.
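For the assignment step, one possible realization in R uses the tm and topicmodels packages; this is a hedged sketch and not necessarily the implementation used in MULTISENSOR. Here docs is assumed to be a character vector that holds, for each article, its extracted concepts and named entities concatenated into a single pseudo-document, and k is the estimate returned by the DBSCAN-Martingale.

```r
# A possible LDA assignment step (tm + topicmodels); docs and k are assumed to come
# from the preprocessing and the DBSCAN-Martingale, respectively.
library(tm)
library(topicmodels)

assign_topics <- function(docs, k) {
  corpus <- VCorpus(VectorSource(docs))        # one pseudo-document per article
  dtm <- DocumentTermMatrix(corpus)            # Bag-of-Words counts over concepts/entities
  lda <- LDA(dtm, k = k, method = "Gibbs")     # fit LDA with the estimated number of topics
  list(
    assignment = topics(lda),                  # most probable topic per news article
    labels     = terms(lda, 3)                 # top-3 concepts/entities per topic
  )
}
```

The top terms returned for each topic correspond to the cluster labels used in the evaluation of Section 4.2, since those labels are the concepts or named entities with the highest LDA probability within each topic.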
4 EXPERIMENTS

In this Section, we describe our dataset and evaluate our method.

4.1 Dataset description

A part of the present MULTISENSOR database (in which articles crawled from international news websites are stored) was used for the evaluation of our query-based topic detection framework. We use the results retrieved for a given query in order to cluster them into labelled clusters (topics) without knowing the number of clusters. The concepts and named entities are extracted using the DBpedia Spotlight online tool (https://dbpedia-spotlight.github.io/demo/), and the final concepts and named entities replaced the raw text of each news article. The final collection of text documents is available online (http://mklab2.iti.gr/project/query-based-topic-detection-dataset). The queries that were used for the experiments are the following:

• energy crisis
• energy policy
• home appliances
• solar energy

It should be noted that the aforementioned queries are considered representative with respect to the use cases addressed by the MULTISENSOR project. The output of our topic detection framework is visualized in Figure 4 for the query "home appliances", where the retrieved results are clustered into 9 topics. The font size of each cluster's labels depends on the corresponding word probability within that cluster.

Figure 4. Demonstration of the MULTISENSOR topic detection framework

4.2 Evaluation results

In order to evaluate the clustering of the retrieved news articles, we use the average precision (AP), broadly used in the context of information retrieval, clustering and classification. A document d of a cluster C is considered relevant to C (true positive) if at least one concept associated with document d also appears in the label of cluster C. It should be noted that the labels of the clusters (topics) are provided by the concepts or named entities that have the highest probability (provided by LDA) within each topic. Precision is the fraction of relevant documents in a cluster, and average precision is the average over all clusters of a query. Finally, we average the AP scores over all considered queries to obtain the Mean Average Precision (MAP).

We compared the clustering performance of the proposed topic detection framework, in which the DBSCAN-Martingale algorithm (for estimating the number of topics) and LDA (for assigning news articles to topics) are employed, against a variety of well-known clustering approaches, which were also combined with LDA for a fair comparison. DP-means [8] is a k-means-like algorithm derived from the Dirichlet process, for which we used its available implementation in R (https://github.com/johnmyleswhite/bayesian_nonparametrics). HDBSCAN is a hierarchical DBSCAN approach, which uses the "excess-of-mass" (EOM) criterion to find the optimal cut. NbClust is the majority vote of the first 16 indices, which are all described in detail in [9].
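The evaluation measure just described can be written down compactly; the sketch below assumes doc_concepts (a list of character vectors with the concepts of each document), assignment (the topic ID per document) and topic_labels (a list of character vectors with the label concepts of each topic), all of which are our own placeholder names.

```r
# Average Precision for one query, following the definition above: a document is
# relevant to its cluster if at least one of its concepts appears in the cluster label.
average_precision <- function(doc_concepts, assignment, topic_labels) {
  precisions <- sapply(seq_along(topic_labels), function(cl) {
    members <- which(assignment == cl)
    if (length(members) == 0) return(NA_real_)           # empty cluster: skip
    relevant <- sapply(members, function(d)
      length(intersect(doc_concepts[[d]], topic_labels[[cl]])) > 0)
    mean(relevant)                                       # precision of cluster cl
  })
  mean(precisions, na.rm = TRUE)                         # AP: average over the clusters
}

# MAP: average of the AP scores over all queries, e.g.
# mean(sapply(queries, function(q)
#   average_precision(q$doc_concepts, q$assignment, q$topic_labels)))
```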
Table 1. Average Precision (± standard deviation) and Mean Average Precision over 10 runs of LDA using the estimated number of topics

Index + LDA          energy crisis    energy policy    home appliances  solar energy     MAP
CH                   0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Duda                 0.4498±0.0671    0.5534±0.0457    0.4299±0.0237    0.4484±0.0067    0.4703
Pseudo t^2           0.4498±0.0671    0.5534±0.0457    0.4299±0.0237    0.4484±0.0067    0.4703
C-index              0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Ptbiserial           0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
DB                   0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Frey                 0.3541±0.0181    0.3911±0.0033    0.3745±0.064     0.4484±0.0067    0.3920
Hartigan             0.5938±0.0502    0.5336±0.0375    0.5942±0.0282    0.5961±0.0347    0.5794
Ratkowsky            0.5357±0.0151    0.5371±0.0357    0.4962±0.0721    0.5375±0.0446    0.5266
Ball                 0.4207±0.0093    0.4501±0.0021    0.4975±0.016     0.4464±0.0614    0.4536
McClain              0.5786±0.0425    0.5371±0.0357    0.3745±0.064     0.5961±0.0347    0.5215
KL                   0.5786±0.0425    0.5371±0.0357    0.5701±0.0145    0.5961±0.0347    0.5704
Silhouette           0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
Dunn                 0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
SDindex              0.3541±0.0181    0.3911±0.0033    0.5942±0.0282    0.4484±0.0067    0.4469
SDbw                 0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
NbClust              0.5786±0.0425    0.5371±0.0357    0.5942±0.0282    0.5961±0.0347    0.5765
DP-means             0.3541±0.0181    0.3911±0.0033    0.3745±0.064     0.4484±0.0067    0.3920
HDBSCAN-EOM          0.4498±0.0671    0.3911±0.0033    0.5951±0.0184    0.5375±0.0446    0.4933
DBSCAN-Martingale    0.7691±0.0328    0.5534±0.0457    0.6115±0.0225    0.6073±0.0303    0.6353

Table 2. Estimation of the number of topics in the MULTISENSOR queries

Index                energy crisis    energy policy    home appliances  solar energy
CH                   12               8                15               15
Duda                 4                4                3                2
Pseudo t^2           4                4                3                2
C-index              12               8                15               15
Ptbiserial           12               8                15               15
DB                   12               8                15               15
Frey                 2                2                2                2
Hartigan             11               7                15               15
Ratkowsky            7                8                5                5
Ball                 3                3                3                3
McClain              12               8                2                15
KL                   12               8                11               15
Silhouette           12               8                15               15
Dunn                 12               8                15               15
SDindex              2                2                15               2
SDbw                 12               8                15               15
NbClust              12               8                15               15
DP-means             2                2                2                2
HDBSCAN-EOM          4                2                10               5
DBSCAN-Martingale    6                4                9                10

The AP scores per query and the MAP scores per method over 10 runs of LDA are displayed in Table 1, for each estimation of the number of topics combined with LDA. In addition, the numbers of news clusters estimated by the considered clustering indices for each query are presented in Table 2. Looking at Table 1, we observe a relative increase of 9.65% in MAP when our topic detection framework is compared to the second highest MAP score (by Hartigan+LDA), and a relative increase of 10.20% when compared to the most recent approach (NbClust+LDA).

In general, the proposed topic detection framework outperforms all the considered clustering approaches, both in terms of AP (within each query) and in terms of MAP (overall performance over all queries), with the exception of the "energy policy" query, where the performance of our framework is matched by that of the Duda and Pseudo t^2 clustering indices.

Finally, we evaluated the time performance of the DBSCAN-Martingale method and selected several baseline approaches in order to compare their processing time with that of our approach. In Figure 5, the number of news clusters is estimated with T = 5 iterations for the DBSCAN-Martingale and with the maximum number of clusters set to 15 for the indices Duda, Pseudo t^2, Silhouette, Dunn and SDindex. We observe that the DBSCAN-Martingale is faster than all other methods. Even when it is applied to 500 documents, it is able to reach a decision about the number of clusters in approximately 0.4 seconds.

Figure 5. Time performance of DBSCAN-Martingale and several baseline approaches to estimate the number of news clusters
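For completeness, a rough sketch of how such a timing comparison can be set up with system.time and the NbClust package [9] for an index-based baseline; the synthetic matrix x merely stands in for the feature vectors of 500 retrieved articles, eps_max = 3 is an arbitrary illustrative choice, and estimate_k is the sketch from Section 3.1, not the authors' benchmark script.

```r
# Timing sketch (assumed settings): DBSCAN-Martingale with T = 5 iterations vs. a
# clustering index searched over 2..15 clusters, on a stand-in matrix of 500 documents.
library(NbClust)

x <- matrix(rnorm(500 * 10), nrow = 500)  # placeholder for 500 document feature vectors

t_martingale <- system.time(
  estimate_k(x, eps_max = 3, n_iter = 5, minPts = 10)
)
t_silhouette <- system.time(
  NbClust(x, min.nc = 2, max.nc = 15, method = "kmeans", index = "silhouette")
)
print(rbind(t_martingale, t_silhouette))
```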
5 CONCLUSIONS

In this paper, we have presented a hybrid topic detection framework, developed for the purposes of the MULTISENSOR project. Given a query-based search, the framework clusters the retrieved results by topic, without the need to know the number of topics a priori. The framework employs the recently introduced DBSCAN-Martingale method for efficiently estimating the number of news clusters, coupled with Latent Dirichlet Allocation for assigning the news articles to topics. Our topic detection framework relies on high-level textual features that are extracted from the news articles, namely textual concepts and named entities. In addition, it is multimodal, since it fuses more than one source of information from the same multimedia object. The query-based topic detection experiments have shown that our framework outperforms several well-known clustering methods, both in terms of Average Precision and Mean Average Precision. A direct comparison in terms of time performance has shown that our approach is faster than several well-performing methods in the estimation of the number of clusters, given the same number of query-based retrieved news articles as input.

As future work, we plan to investigate the behavior of our framework when additional modalities/features are introduced, to examine the application of alternative (other than LDA) text clustering approaches, and to investigate the extraction of language-agnostic concepts and named entities, which could provide multilingual capabilities to our topic detection framework.

ACKNOWLEDGEMENTS

This work was supported by the projects MULTISENSOR (FP7-610411) and KRISTINA (H2020-645012), funded by the European Commission.

REFERENCES

[1] J. Allan (Ed.), 'Topic detection and tracking: event-based information organization', vol. 12, Springer Science & Business Media, (2012).
[2] I. Gialampoukidis, S. Vrochidis and I. Kompatsiaris, 'A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA', In: Perner, P. (Ed.) Machine Learning and Data Mining in Pattern Recognition, LNAI 9729, pp. 170-184, (2016).
[3] C. C. Aggarwal and C. Zhai, 'A survey of text clustering algorithms', In Mining Text Data, pp. 77-128, Springer US, (2012).
[4] M. Qian and C. Zhai, 'Unsupervised feature selection for multi-view clustering on text-image web news data', In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pp. 1963-1966, ACM, (2014).
[5] A. Kumar and H. Daumé, 'A co-training approach for multi-view spectral clustering', In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393-400, (2011).
[6] D. M. Blei, A. Y. Ng and M. I. Jordan, 'Latent dirichlet allocation', Journal of Machine Learning Research, vol. 3, pp. 993-1022, (2003).
[7] Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei, 'Hierarchical dirichlet processes', Journal of the American Statistical Association, 101(476), (2006).
[8] B. Kulis and M. I. Jordan, 'Revisiting k-means: New algorithms via Bayesian nonparametrics', arXiv preprint arXiv:1111.0352, (2012).
[9] M. Charrad, N. Ghazzali, V. Boiteau and A. Niknafs, 'NbClust: an R package for determining the relevant number of clusters in a data set', Journal of Statistical Software, 61(6), pp. 1-36, (2014).
[10] M. Ester, H. P. Kriegel, J. Sander and X. Xu, 'A density-based algorithm for discovering clusters in large spatial databases with noise', In KDD, 96(34), pp. 226-231, (1996).
[11] G. Petkos, M. Schinas, S. Papadopoulos and Y. Kompatsiaris, 'Graph-based multimodal clustering for social multimedia', Multimedia Tools and Applications, pp. 1-23, (2016).
[12] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander, 'OPTICS: ordering points to identify the clustering structure', In ACM SIGMOD Record, 28(2), pp. 49-60, ACM, (1999).
[13] J. Sander, X. Qin, Z. Lu, N. Niu and A. Kovarsky, 'Automatic extraction of clusters from hierarchical clustering representations', In Advances in Knowledge Discovery and Data Mining, pp. 75-87, Springer Berlin Heidelberg, (2003).
[14] R. J. Campello, D. Moulavi and J. Sander, 'Density-based clustering based on hierarchical density estimates', In Advances in Knowledge Discovery and Data Mining, pp. 160-172, Springer Berlin Heidelberg, (2013).