<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query-based Topic Detection Using Concepts and Named Entities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilias Gialampoukidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Liparas</string-name>
          <email>dliparas@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanos Vrochidis</string-name>
          <email>stefanos@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute, CERTH</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present a framework for topic detection in news articles. The framework receives as input the results retrieved from a query-based search and clusters them by topic. To this end, we use the recently introduced “DBSCAN-Martingale” method to automatically estimate the number of topics and the well-established Latent Dirichlet Allocation topic modelling approach to assign news articles to topics of interest. Furthermore, the proposed query-based topic detection framework works on high-level textual features (such as concepts and named entities) that are extracted from news articles. We treat topic detection as a text clustering task in which the number of clusters is not known in advance. Our approach compares favorably to several text clustering approaches on a public dataset of retrieved results for four representative queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The need by both journalists and media monitoring companies to
master large amounts of news articles produced on a daily basis, in
order to identify and detect interesting topics and events, has
highlighted the importance of the topic detection task. In general,
topic detection aims at grouping together stories (documents) that
discuss the same topic (event). Formally, a topic is defined in
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as “a specific thing that happens at a specific time and place
along with all necessary preconditions and unavoidable
consequences”. It is clarified [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that the notion of “topic” is not
general like “accidents” but is limited to a specific collection of
related events of the type accident, such as “cable car crash”. We
shall refer to topics as news clusters, or simply clusters.
      </p>
      <p>The two main challenges involved in the topic detection
problem are the following: one needs to (1) estimate the correct
number of topics/news clusters and (2) assign the most similar
news articles to the same cluster. In addition, the following assumptions
are made: firstly, real data is highly noisy and the number of
clusters is not known a priori; secondly, there is a lower bound for
the minimum number of documents per news cluster.</p>
      <p>
        In this context, we present the hybrid clustering
framework for topic detection, which has been developed within
the FP7 MULTISENSOR project. For a given query-based search,
the main idea is to efficiently cluster the retrieved results, without
the need for a pre-specified number of topics. To this end, the
framework, recently introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], combines automatic
estimation of the number of clusters and assignment of news
articles into topics of interest, on the results of a text query. The
estimation of the number of clusters is done by the novel
“DBSCAN-Martingale” method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which can deal with the
aforementioned assumptions. All clusters are progressively
extracted (by a density-based algorithm) by applying Doob’s
martingale and then Latent Dirichlet Allocation is applied for the
assignment of news articles to topics. In contrast to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the
contribution of this paper is that the overall
framework relies on high-level textual features (concepts and
named entities) that are extracted from the retrieved results of a
textual query, and can assist any search engine.
      </p>
      <p>
        The rest of the paper is organized as follows: Section 2 provides
related work with respect to topic detection, news clustering and
density-based clustering. In Section 3, our framework for topic
detection is presented and described. Section 4 discusses the
experimental results from the application of our framework and
several other clustering methods to four collections of text
documents, related to four given queries, respectively. Finally,
some concluding remarks are provided in Section 5.
      </p>
    </sec>
    <sec>
      <title>RELATED WORK</title>
      <p>
        Topic detection is traditionally considered as a clustering problem
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], due to the absence of training sets. The clustering task usually
involves feature selection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], spectral clustering [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and k-means
oriented [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] techniques, assuming mainly that the number of topics
to be discovered is known a priori and there is no noise, i.e. news
items that do not belong to any of the news clusters. Latent
Dirichlet Allocation (LDA) is a popular approach for topic
modelling for a given number of topics k [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. LDA has been
generalized to nonparametric Bayesian approaches, such as the
hierarchical Dirichlet process [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and DP-means [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which predict
the number of topics k. The extraction of the correct number of
topics is equivalent to the estimation of the correct number of
clusters in a dataset. The majority vote among 30 clustering indices
has been proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as an indicator for the number of clusters
in a dataset. In contrast, we propose an alternative majority vote
among 10 realizations of the “DBSCAN-Martingale”, which is a
modification of the DBSCAN algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], whose parameters are the
density level <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> and a lower bound for the minimum number of
points per cluster. Unlike DBSCAN, the DBSCAN-Martingale [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] regards
the density level <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> as a random variable and extracts the clusters
progressively. We consider the general case, where the
number of topics to be discovered is unknown and some
news articles may not be assigned to any topic.
      </p>
      <p>
        Graph-based methods for event detection and multimodal
clustering in social media streams have appeared in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where a
graph clustering algorithm is applied on the graph of items. The
decision whether to link two items is based on the output of
a classifier, which predicts whether the candidate items belong to the same
cluster. Contrary to this graph-based approach, we cluster news
items in an unsupervised way.
      </p>
      <p>
        Density-based clustering does not require as input the number of
topics. OPTICS [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is very useful for the visualization of the
cluster structure and for the optimal selection of the density level <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>.
The OPTICS-ξ algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] requires an extra parameter ξ, which
has to be manually set in order to find “dents” in the OPTICS
reachability plot. The automatic extraction of clusters from the
OPTICS reachability plot, as an extension of the OPTICS-ξ
algorithm, has been presented in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and has been outperformed
by HDBSCAN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] on several datasets of various kinds. In the context
of news clustering, we examine whether these
density-based algorithms perform well on the topic detection
problem by comparing them with our DBSCAN-Martingale in
terms of the number of estimated topics. All the aforementioned
methods, which do not require the number of topics to be known a
priori, are combined with LDA in order to examine whether the use
of DBSCAN-Martingale (combined with LDA) provides the most
efficient assignment of news articles to topics.
      </p>
    </sec>
    <sec id="sec-3">
      <title>TOPIC DETECTION USING CONCEPTS AND NAMED ENTITIES</title>
      <p>The MULTISENSOR framework for topic detection, which is
presented in Figure 1, treats topic detection as a news clustering problem
in which the number of topics needs to be estimated. The overall
framework is based on textual features, namely concepts and
named entities. The number of topics k is estimated by the
DBSCAN-Martingale and the assignment of news articles to topics is done
using Latent Dirichlet Allocation (LDA).</p>
      <p>
        LDA has shown great performance in text clustering, given the
number of topics. However, in realistic applications, the number of
topics is unknown to the system. On the other hand, DBSCAN
does not require the number of clusters as input, but it
performs poorly in text clustering, because it
labels too many of the news articles in the collection as noise [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, it is difficult to
find a unique density level that can output all clusters. Thus, we
keep only the number of clusters estimated by density-based clustering
and the assignment of documents to topics is done by the
well-performing LDA.
      </p>
      <p>In our approach, the constructed DBSCAN-Martingale
combines several density levels and is applied on high-level
concepts and named entities. In the following, the construction of
the DBSCAN-Martingale is briefly described.</p>
    </sec>
    <sec id="sec-4">
      <title>The DBSCAN-Martingale</title>
      <p>
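        Given a collection of <inline-formula><tex-math>n</tex-math></inline-formula> news articles, density-based clustering
algorithms output a clustering vector <inline-formula><tex-math>C</tex-math></inline-formula> whose values are the cluster IDs
<inline-formula><tex-math>C[j]</tex-math></inline-formula> for each news item <inline-formula><tex-math>j = 1, 2, \ldots, n</tex-math></inline-formula>, where we denote by <inline-formula><tex-math>C[j]</tex-math></inline-formula>
the <inline-formula><tex-math>j</tex-math></inline-formula>-th element of a vector <inline-formula><tex-math>C</tex-math></inline-formula>. In case the <inline-formula><tex-math>j</tex-math></inline-formula>-th document is not
assigned to any of the clusters, the <inline-formula><tex-math>j</tex-math></inline-formula>-th cluster ID is zero.
Assuming that <inline-formula><tex-math>C_{\mathrm{DBSCAN}(\varepsilon)}</tex-math></inline-formula> is the clustering vector provided by the
DBSCAN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] algorithm for the density level <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>, the problem is to
combine the results for several values of <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> into one unique
clustering result. To that end, a martingale construction has been
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where the density level <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> is a random variable,
uniformly sampled in a pre-defined interval.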
      <p>The DBSCAN-Martingale progressively updates the estimation
of the number of clusters (topics), as shown in Figure 2, where 3
topics are detected in 2 iterations of the process. Due to the
randomness in the selection of the density levels <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>,
each realization of the DBSCAN-Martingale outputs a random
variable <inline-formula><tex-math>\hat{k}</tex-math></inline-formula> as an estimation of the number of clusters. Hence, we
allow 10 realizations <inline-formula><tex-math>\hat{k}_1, \hat{k}_2, \ldots, \hat{k}_{10}</tex-math></inline-formula> and the final estimation of the
number of clusters is the majority vote over them. An illustrative
example of 5 clusters in the 2-dimensional plane is demonstrated in
Figure 3.</p>
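      <p>The majority vote over the realizations can be illustrated with a minimal sketch. The function name majority_vote, the example values and the tie-breaking rule (the smallest value wins) are assumptions made for illustration only; the text does not specify how ties are resolved.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def majority_vote(estimates):
    """Return the most frequent estimate of the number of clusters.

    Ties are broken in favour of the smaller value. This rule is an
    assumption of the sketch, not something stated in the paper.
    """
    counts = Counter(estimates)
    best_count = max(counts.values())
    return min(k for k, c in counts.items() if c == best_count)


# Ten hypothetical realizations of the estimate (illustrative values only).
realizations = [5, 4, 5, 5, 6, 5, 4, 5, 5, 6]
k_final = majority_vote(realizations)
print(k_final)
```
]]></preformat>
      <p>In this sketch, the value returned by majority_vote would play the role of the number of topics that is subsequently passed to LDA.</p>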
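      <p>In brief, the DBSCAN-Martingale is mathematically formulated
as follows. Firstly, a sample of size <inline-formula><tex-math>T</tex-math></inline-formula>, <inline-formula><tex-math>\varepsilon_t, t = 1, 2, \ldots, T</tex-math></inline-formula>, is randomly
generated in <inline-formula><tex-math>[0, \varepsilon_{max}]</tex-math></inline-formula>, where <inline-formula><tex-math>\varepsilon_{max}</tex-math></inline-formula> is an upper bound for the
density levels. The sample of <inline-formula><tex-math>\varepsilon_t, t = 1, 2, \ldots, T</tex-math></inline-formula>, is then sorted in
increasing order. For each density level <inline-formula><tex-math>\varepsilon_t</tex-math></inline-formula> we find the
corresponding clustering vectors <inline-formula><tex-math>C_{\mathrm{DBSCAN}(\varepsilon_t)}</tex-math></inline-formula> for all stages <inline-formula><tex-math>t = 1, 2, \ldots, T</tex-math></inline-formula>.
In the first stage, all clusters detected by <inline-formula><tex-math>C_{\mathrm{DBSCAN}(\varepsilon_1)}</tex-math></inline-formula> are
kept, corresponding to the lowest density level <inline-formula><tex-math>\varepsilon_1</tex-math></inline-formula>. In the second
stage (<inline-formula><tex-math>t = 2</tex-math></inline-formula>), some of the clusters detected by <inline-formula><tex-math>C_{\mathrm{DBSCAN}(\varepsilon_2)}</tex-math></inline-formula> are new
and some of them have also been detected by <inline-formula><tex-math>C_{\mathrm{DBSCAN}(\varepsilon_1)}</tex-math></inline-formula>. In order
to keep only the newly detected clusters, we keep only groups of
numbers of the same cluster ID with size greater than <inline-formula><tex-math>minPts</tex-math></inline-formula>, the lower bound for the number of points per cluster.
Finally, the cluster IDs are relabelled and the maximum value of a
clustering vector provides the number of clusters.</p>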
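      <p>The stages described above can be sketched in a few lines of Python. This is a minimal illustration written from the textual description, not the authors' R implementation: the function names dbscan and dbscan_martingale, the rng argument, the quadratic-time neighbour search and the toy 2-dimensional points are all assumptions of the sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def dbscan(points, eps, min_pts):
    """Plain quadratic-time DBSCAN. Returns one cluster ID per point, 0 = noise."""
    n = len(points)
    labels = [None] * n  # None marks a point that has not been visited yet

    def neighbours(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = 0  # provisionally noise, may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster_id  # noise point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def dbscan_martingale(points, eps_max, min_pts, T, rng):
    """One realization: returns (clustering vector, estimated number of clusters)."""
    # Sample T density levels uniformly in [0, eps_max], sorted increasingly.
    eps_levels = sorted(rng.uniform(0.0, eps_max) for _ in range(T))
    final = [0] * len(points)
    k = 0
    for eps in eps_levels:
        labels = dbscan(points, eps, min_pts)
        # Group the points that are clustered now but were not clustered before.
        groups = {}
        for j, lab in enumerate(labels):
            if lab != 0 and final[j] == 0:
                groups.setdefault(lab, []).append(j)
        for members in groups.values():
            if len(members) > min_pts:  # "size greater than minPts", as in the text
                k += 1  # relabel the newly detected cluster
                for j in members:
                    final[j] = k
    return final, k


# Toy data: three well-separated groups of six points each (illustrative only).
offsets = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (0.05, 0.05), (0.1, 0.0), (0.0, 0.1)]
centres = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

vector, k_hat = dbscan_martingale(points, eps_max=1.0, min_pts=3, T=5, rng=random.Random(0))
print(k_hat, vector)
```
]]></preformat>
      <p>Repeating the call with different random generators would give the realizations over which the majority vote is taken. The sketch uses Euclidean distance on 2-dimensional points purely for readability.</p>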
      <p>
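        Complexity: The DBSCAN-Martingale requires <inline-formula><tex-math>T</tex-math></inline-formula> iterations of the
DBSCAN algorithm, which runs in <inline-formula><tex-math>O(n \log n)</tex-math></inline-formula> if a tree-based
spatial index can be used and in <inline-formula><tex-math>O(n^2)</tex-math></inline-formula> without tree-based spatial
indexing [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Therefore, the DBSCAN-Martingale runs
in <inline-formula><tex-math>O(Tn \log n)</tex-math></inline-formula> for tree-based indexed datasets and in <inline-formula><tex-math>O(Tn^2)</tex-math></inline-formula>
without tree-based indexing. Our code<sup>3</sup> is written in R<sup>4</sup>, using the
dbscan<sup>5</sup> package, which runs DBSCAN in <inline-formula><tex-math>O(n \log n)</tex-math></inline-formula> with kd-tree
data structures for fast nearest neighbor search.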
      </p>
    </sec>
    <sec id="sec-5">
      <title>Latent Dirichlet Allocation (LDA)</title>
      <p>LDA assumes a Bag-of-Words (BoW) representation of the
collection of documents and each topic is a distribution over terms
in a fixed vocabulary. LDA assigns probabilities to words and
assumes that documents exhibit multiple topics, in order to assign a
probability distribution on the set of documents. Finally, LDA
assumes that the order of words does not matter and, therefore,
LDA is not applicable to word <inline-formula><tex-math>n</tex-math></inline-formula>-grams for <inline-formula><tex-math>n \geq 2</tex-math></inline-formula>, but can be
applied to named entities and concepts. This input allows topic
detection even in multilingual corpora, where <inline-formula><tex-math>n</tex-math></inline-formula>-grams are not
available in a common language.</p>
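      <p>To make the Bag-of-Words assumption concrete, the following minimal sketch builds a document-term count matrix from documents that have already been reduced to concepts and named entities. The example documents and the function name bag_of_words are hypothetical; fitting LDA on the resulting matrix is left to a topic modelling library and is not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

# Hypothetical documents, each already reduced to its extracted concepts
# and named entities (illustrative values, not taken from the dataset).
docs = [
    ["Solar_energy", "Germany", "Photovoltaics", "Solar_energy"],
    ["Energy_policy", "European_Union", "Germany"],
]


def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


vocab, matrix = bag_of_words(docs)

# The order of terms inside a document does not affect its row.
reversed_first = [docs[0][::-1], docs[1]]
print(bag_of_words(reversed_first)[1] == matrix)
```
]]></preformat>
      <p>Because each concept or named entity is treated as a single vocabulary term, the representation does not depend on word order, which is the property that LDA relies on.</p>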
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-7">
      <title>Dataset description</title>
      <p>In this section, we describe our dataset and evaluate our method.
A part of the present MULTISENSOR database (in which articles
crawled from international news websites are stored) was used for
the evaluation of our query-based topic detection framework. We
use the retrieved results for a given query in order to cluster them
into labelled clusters (topics) without knowing the number of
clusters. The concepts and named entities are extracted using the
DBpedia Spotlight<sup>6</sup> online tool and the extracted concepts and named
entities replace the raw text of each news article. The final
collection of text documents is available online<sup>7</sup>.</p>
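      <p>The queries that were used for the experiments are the following:</p>
      <list list-type="bullet">
        <list-item><p>energy crisis</p></list-item>
        <list-item><p>energy policy</p></list-item>
        <list-item><p>home appliances</p></list-item>
        <list-item><p>solar energy</p></list-item>
      </list>
      <p><sup>3</sup> https://github.com/MKLab-ITI/topic-detection</p>
      <p><sup>4</sup> https://www.r-project.org/</p>
      <p><sup>5</sup> https://cran.r-project.org/web/packages/dbscan/index.html</p>
      <p><sup>6</sup> https://dbpedia-spotlight.github.io/demo/</p>
      <p><sup>7</sup> http://mklab2.iti.gr/project/query-based-topic-detection-dataset</p>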
      <p>It should be noted that the aforementioned queries are
considered representative, with respect to the use cases addressed
by the MULTISENSOR project. The output of our topic detection
framework is visualized in Figure 4 for the query “home
appliances”, where the retrieved results are clustered into 9 topics.
The font size of the clusters’ labels depends on the particular word
probability within each cluster.</p>
    </sec>
    <sec id="sec-8">
      <title>Evaluation results</title>
      <p>In order to evaluate the clustering of the retrieved news articles, we
use the average precision (AP), broadly used in the context of
information retrieval, clustering and classification. A document
of a cluster is considered relevant to that cluster (true positive) if at least
one concept associated with the document also appears in the label
of the cluster. It should be noted that the labels of the clusters
(topics) are provided by the concepts or named entities that have
the highest probability (provided by LDA) within each topic.
Precision is the fraction of relevant documents in a
cluster and average precision is the average of the precision over all clusters of a
query. Finally, we average the AP scores for all considered queries
to obtain the Mean Average Precision (MAP).</p>
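      <p>The evaluation measure described above can be written down as a short sketch. The function names, the data layout (a cluster as a pair of label terms and per-document concept lists) and the example values are assumptions made for illustration; they are not the results reported in this paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def cluster_precision(label, cluster_docs):
    """Fraction of documents sharing at least one concept with the cluster label."""
    if not cluster_docs:
        return 0.0
    label_terms = set(label)
    relevant = sum(1 for concepts in cluster_docs if label_terms & set(concepts))
    return relevant / len(cluster_docs)


def average_precision(clusters):
    """Average of the per-cluster precision over all clusters of one query."""
    return sum(cluster_precision(label, docs) for label, docs in clusters) / len(clusters)


def mean_average_precision(queries):
    """Average of the AP scores over all queries."""
    return sum(average_precision(clusters) for clusters in queries.values()) / len(queries)


# Hypothetical clustering output for two queries (illustrative values only).
queries = {
    "solar energy": [
        (["Photovoltaics"], [["Photovoltaics", "Germany"], ["Wind_power"]]),
        (["Solar_panel"], [["Solar_panel"], ["Solar_panel", "China"]]),
    ],
    "energy policy": [
        (["European_Union"], [["European_Union"], ["European_Union", "Germany"],
                              ["Coal"], ["European_Union"]]),
    ],
}

ap_per_query = {q: average_precision(c) for q, c in queries.items()}
print(ap_per_query, mean_average_precision(queries))
```
]]></preformat>
      <p>In this toy example the first query has per-cluster precisions 0.5 and 1.0, so its AP is 0.75; the second query has a single cluster with precision 0.75, and the MAP over both queries is therefore 0.75.</p>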
      <p>
        We compared the clustering performance of the proposed topic
detection framework, in which the DBSCAN-Martingale algorithm
(for estimating the number of topics) and LDA (for assigning news
articles to topics) are employed, against a variety of well-known
clustering approaches, which were also combined with LDA for a
fair comparison. DP-means is based on a Dirichlet process and we used its
implementation in R<sup>8</sup>. HDBSCAN is a hierarchical DBSCAN
approach, which uses the “excess-of-mass” (EOM) approach to
find the optimal cut. NbClust is a majority vote of the first 16
indices, which are all described in detail in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p><sup>8</sup> https://github.com/johnmyleswhite/bayesian_nonparametrics</p>
      <p>The AP scores per query and the MAP scores per method over
10 runs of LDA are displayed in Table 1, for each estimation of the
number of topics combined with LDA. In addition, the numbers of
news clusters estimated by the considered clustering indices for
each query are presented in Table 2. Looking at Table 1, we
observe a relative increase of 9.65% in MAP, when our topic
detection framework is compared to the second highest MAP score
(by Hartigan+LDA) and a relative increase of 10.20%, when
compared to the most recent approach (NbClust+LDA).</p>
      <p>In general, the proposed topic detection framework outperforms
all the considered clustering approaches both in terms of AP
(within each query) and in terms of MAP (overall performance for
all queries), with the exception of the “energy policy” query, where
the performance of our framework is matched by that of the Duda
and Pseudo t^2 clustering indices.</p>
      <p>Finally, we evaluated the time performance of the
DBSCAN-Martingale method and we selected several baseline approaches in
order to compare their processing time with that of our approach.
In Figure 5, the number of news clusters is estimated for T = 5
iterations for the DBSCAN-Martingale and for maximum number
of clusters set to 15 for the indices Duda, Pseudo t^2, Silhouette,
Dunn and SDindex. We observe that DBSCAN-Martingale is faster
than all other methods. Even when it is applied to 500 documents,
it is able to reach a decision about the number of clusters in
approximately 0.4 seconds.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS</title>
      <p>In this paper, we have presented a hybrid topic detection
framework, developed for the purposes of the MULTISENSOR
project. Given a query-based search, the framework clusters the
retrieved results by topic, without the need to know the number of
topics a priori. The framework employs the recently introduced
DBSCAN-Martingale method for efficiently estimating the number
of news clusters, coupled with Latent Dirichlet Allocation for
assigning the news articles to topics. Our topic detection
framework relies on high-level textual features that are extracted
from the news articles, namely textual concepts and named entities.
In addition, it is multimodal, since it fuses more than one source
of information from the same multimedia object. The query-based
topic detection experiments have shown that our framework
outperforms several well-known clustering methods, both in terms
of Average Precision and Mean Average Precision. A direct
comparison in terms of processing time has shown that our
approach is faster than several well-performing methods in the
estimation of the number of clusters, given the same
number of query-based retrieved news articles as input.</p>
      <p>As future work, we plan to investigate the behavior of our
framework by introducing additional modalities/features, examine
the application of alternative (other than LDA) text clustering
approaches, as well as investigate the extraction of
language-agnostic concepts and named entities, something that could provide
multilingual capabilities to our topic detection framework.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work was supported by the projects MULTISENSOR
(FP7-610411) and KRISTINA (H2020-645012), funded by the European
Commission.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          (Ed.), '
          <article-title>Topic detection and tracking: event-based information organization'</article-title>
          , vol.
          <volume>12</volume>
          ,
          <publisher-name>Springer Science &amp; Business Media</publisher-name>, (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gialampoukidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          , '
          <article-title>A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA'</article-title>
          , In:
          <string-name>
            <given-names>P.</given-names>
            <surname>Perner</surname>
          </string-name>
          (Ed.)
          <source>Machine Learning and Data Mining in Pattern Recognition, LNAI 9729</source>
          , pp.
          <fpage>170</fpage>
          -
          <lpage>184</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , '
          <article-title>A survey of text clustering algorithms'</article-title>
          ,
          <source>In Mining Text Data</source>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>128</lpage>
          ,
          <publisher-name>Springer US</publisher-name>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Qian</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , '
          <article-title>Unsupervised feature selection for multi-view clustering on text-image web news data'</article-title>
          ,
          <source>In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          , pp.
          <fpage>1963</fpage>
          -
          <lpage>1966</lpage>
          , ACM, (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé</surname>
          </string-name>
          , '
          <article-title>A co-training approach for multi-view spectral clustering'</article-title>
          ,
          <source>In Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source>
          , pp.
          <fpage>393</fpage>
          -
          <lpage>400</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , '
          <article-title>Latent Dirichlet allocation'</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Beal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          , '
          <article-title>Hierarchical dirichlet processes'</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>101</volume>
          (
          <issue>476</issue>
          ), (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kulis</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , '
          <article-title>Revisiting k-means: New algorithms via Bayesian nonparametrics'</article-title>
          ,
          <source>arXiv preprint arXiv:1111.0352</source>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Charrad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghazzali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Boiteau</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Niknafs</surname>
          </string-name>
          , '
          <article-title>NbClust: an R package for determining the relevant number of clusters in a data set'</article-title>
          ,
          <source>Journal of Statistical Software</source>
          ,
          <volume>61</volume>
          (
          <issue>6</issue>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          , '
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise'</article-title>
          , In Kdd,
          <volume>96</volume>
          (
          <issue>34</issue>
          ), pp.
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          , (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Petkos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          , '
          <article-title>Graph-based multimodal clustering for social multimedia'</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ankerst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Breunig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          , '
          <article-title>OPTICS: ordering points to identify the clustering structure'</article-title>
          ,
          <source>In ACM Sigmod Record</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          , ACM, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kovarsky</surname>
          </string-name>
          , '
          <article-title>Automatic extraction of clusters from hierarchical clustering representations'</article-title>
          ,
          <source>In Advances in Knowledge Discovery and Data Mining</source>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>87</lpage>
          , Springer Berlin Heidelberg, (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moulavi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          , '
          <article-title>Density-based clustering based on hierarchical density estimates'</article-title>
          ,
          <source>In Advances in Knowledge Discovery and Data Mining</source>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>172</lpage>
          , Springer Berlin Heidelberg, (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>