    Dialog Clustering: a Framework for Automatic Text Clustering

             Massarenti Nicola1[0000−0002−8882−4252] and Lazzarinetti
                          Giorgio1[0000−0003−0326−8742]

               Noovle S.p.A, Milan, Italy https://www.noovle.com/en/




        Abstract. Despite the recent advances in Natural Language Processing techniques, when it comes to unsupervised applications such as text clustering, many critical issues arise, such as evaluating the results of an unsupervised algorithm or making an unsupervised algorithm work in an automatic fashion. In this research we propose an innovative framework based on machine learning for the creation of a system capable of addressing these critical issues. The framework consists of four main steps: a step for converting text into embeddings, a step for reducing the dimensionality of the generated embeddings, a step for clustering the data and, finally, a step for evaluating the results obtained and selecting the best clusterings produced. Each of these steps uses different models equipped with a mechanism for identifying and automatically selecting hyperparameters. Our framework guarantees good performance in clustering text without a-priori knowledge of the data. The results show how the subdivisions of the dataset vary according to the models used in the different steps of the proposed system and how groups focused on specific themes are identified.

        Keywords: Natural Language Processing · Text Clustering · Dimen-
        sionality Reduction · Text Embedding.


1     Overview

In recent years, thanks to the advances in Natural Language Processing (NLP) techniques [1–3], an ever increasing number of text applications, such as part-of-speech tagging [6], text classification [4], document summarization [4], named entity recognition [7], text clustering [9], machine translation and chatbots, have been developed [5]. Among these, chatbots have been very popular since their inception in the 1960s. In the past several years, giant companies have invested
    Activities were partially funded by the Italian "Ministero dello Sviluppo Economico", Fondo per la Crescita Sostenibile, Bando "Agenda Digitale", D.M. Oct. 15th, 2014 - Project n. F/020012/02/X27 - "Smart District 4.0".

    Copyright ©2021 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).

in artificial intelligence and developed conversational engines (such as Google
DialogFlow, Amazon Lex, Azure Bot Service) that allow customers to create
their own personalized dialog system [8]. However, the design of the conversa-
tional flow is complex, often error-prone and incomplete [65]. To enhance their creation, many companies make use of existing unstructured corpora of text coming from real conversations with customers, but the exploration and exploitation of such corpora is difficult, since the data is unstructured and noisy. A better under-
standing of such data can be obtained by using unsupervised machine learning
algorithms, such as clustering, which, indeed, aims at detecting patterns and get-
ting insights from data. Text clustering applications, given their unsupervised
nature, however, are not always directly measurable and need to involve humans to explain the results. Thus, the goal of this research is to develop a text clustering methodology that addresses these issues and automates a process that has until now been manual. This research is driven by the business need of a partner company that aims at creating an application to continuously improve an existing conversational customer service agent, able to assist customers in managing and maintaining the machinery the company produces. The goal is to collect all the chatbot's fallbacks (i.e. questions to which the chatbot is not able to give an answer) and use them subsequently to derive new intents to re-train the chatbot. Thus, we design a methodology to automatically create and evaluate clusters of text coming from conversations, in order to derive
from the corpus new possible intents. The developed framework is composed of
a multi-step pipeline with an embedding step for creating text embeddings, a
dimensionality reduction step, a clustering step and a final evaluation to select
and evaluate the clusterings produced. The rest of this paper is organized as
follows. In Section 2 an overview of the state of the art is presented, with a focus on text embedding, dimensionality reduction and clustering. In Section 3 the pipeline that enables the creation of the automatic clustering framework is defined, and in Section 4 the experimental results are shown. Finally, Section 5 presents some concluding remarks.


2     State of the Art

In the context of text clustering [9], the critical points for the realization of a clustering system are three: the conversion of the text into a set of features, a.k.a. embedding; the dimensionality reduction of the features; and the clustering algorithm itself [10]. In the following we focus on these main issues.


2.1    Text Embedding

The representation of words and documents is a fundamental activity in NLP ap-
plications and for years the predominant methodology for this task has been that
of the Vector Space Model (VSM) [11]. The intuition is to produce an encoding
of words and documents in t-dimensional vectors, where each element represents
a unique term contained in the documents. VSM can be performed considering
global information (i.e. all the corpus of text) often via a co-occurrence matrix,
or local information (i.e. words and their context within a sentence) via the use
of neural networks. One of the first approaches to global information-based VSM
is the calculation of the term frequency - inverse document frequency (TF-IDF) statistic which, given a corpus of texts, weights each component of the t-dimensional vector with respect to the frequency of the corresponding word in the document and with respect to its frequency across all documents of the corpus [12]. More recently GloVe [14] has been
introduced, which performs an encoding of semantic relationships between words
by leveraging the intuition that co-occurrence relationships between word pairs
are more informative than simple counting. On the other hand, the first impor-
tant contributions to modern local information-based VSM were the continuous
bag-of-words (CBOW) and the skip-gram (SG) models [15]. These models have
been used for the development and dissemination of Word2Vec [16, 17], a model that leverages the hidden-layer representations of a neural network as features. In 2016 Facebook Inc. published FastText [18], which proposed the use of character n-grams [19, 20], thus including in the embedding the morphology and composition of words. In 2018 three important models were published: ELMo [21, 22], a bidirectional Long Short Term Memory (LSTM) model; Universal Sentence Encoder (USE) [48], a model developed specifically for producing sentence embeddings; and Bidirectional Encoder Representations from Transformers (BERT) [23], a Google model pre-trained on Wikipedia and BooksCorpus [49] that uses transformers [24, 25] in a bidirectional architecture implementing an attention mechanism for contextual representations of sentences. Following BERT, several models trained on other corpora have been released, such as AlBERTo [26], a model developed specifically for the Italian language.
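As an illustration of the global information-based approach described above, the following minimal sketch computes TF-IDF vectors with scikit-learn; the toy corpus and parameter choices are illustrative and not part of the proposed pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for a set of chatbot fallback sentences.
corpus = [
    "how do I reset the machine",
    "the machine does not start after the reset",
    "where can I download the contract documents",
]

# Each document becomes a t-dimensional vector, where t is the vocabulary size;
# each component is weighted by term frequency and inverse document frequency.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(corpus)

print(X.shape)                             # (3, vocabulary size)
print(vectorizer.get_feature_names_out())  # the terms indexing the vector components
```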

2.2   Dimensionality Reduction
Dimensionality reduction is the process of transforming multidimensional vari-
ables from vector spaces (VS) with a large number of dimensions to VSs with
a significantly reduced number of dimensions so that the intrinsic and most sig-
nificant properties of the variables are preserved [28]. One of the best known
dimensionality reduction techniques is Principal Component Analysis (PCA)
[27, 29] a statistical technique that determines the principal components (PC)
by carrying out a linear transformation of the variables such as to maximize
the variance [30]. PCA has proven to be effective in many applications even though it does not take into account the distribution of the data; for this reason it has been extended to Kernel PCA, a reformulation of PCA obtained by applying a kernel function in order to take advantage of the kernel trick
[31, 32]. Other effective approaches are Isomap [33] and UMAP [34], which are
manifold-based dimensionality reduction methods. Finally, more recently, Autoencoders [35], based on multiple feed-forward neural networks, have also been applied to reduce the dimensionality of the data. The main criticality in the use of autoencoders for dimensionality reduction lies in the training process of the neural network, which requires a large dataset.
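As a sketch of the autoencoder-based approach mentioned above (not used in the proposed pipeline), the following Keras snippet reduces hypothetical 768-dimensional embeddings to a 50-dimensional latent space; the layer sizes and training settings are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Random data standing in for 1000 sentence embeddings of size 768.
X = np.random.rand(1000, 768).astype("float32")
latent_dim = 50  # illustrative target dimensionality

# The encoder compresses the input, the decoder reconstructs it.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(768,)),
    tf.keras.layers.Dense(latent_dim),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(768),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# The reconstruction objective is what requires a large training dataset.
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# The encoder alone provides the reduced representation.
X_reduced = encoder.predict(X, verbose=0)
print(X_reduced.shape)  # (1000, 50)
```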

2.3   Clustering
Clustering is an unsupervised machine learning technique that aims at grouping
data points into two or more sets so that data in the same set are closer to each
other with respect to the data points in the other set [36, 37, 39]. The clustering
models are mainly divided into 8 categories as described in Table 1.

                   Table 1. Clustering Algorithms per category.

Partition based: base their theory on designating the data center as the corresponding cluster center. Algorithms: K-means [39].
Hierarchy based: build hierarchical relationships between the data to be clustered and, on the basis of these, separate the data. Algorithms: BIRCH [40], CURE and ROCK [36].
Fuzzy theory based: based on the idea that the discrete cluster membership value, 0 or 1, is to be replaced with the continuous interval [0, 1] in order to describe the degree to which data belong to clusters. Algorithms: FCM [41] and FCS [42].
Distribution based: base the clustering of data on the statistical distribution of the features. Algorithms: Gaussian Mixture Model (GMM) [43].
Density based: base their output on the idea that regions with high density are those that contain data belonging to the same cluster. Algorithms: DBSCAN [44], OPTICS [45] and Mean-shift [46].
Graph theory based: based on graphs, where nodes are the data and links are the relationships between the data. Algorithms: CLICK [36].
Grid based: change the original data space into a grid structure. Algorithms: STING and CLIQUE [47].
Fractal based: based on fractal theory. Algorithms: FC [36].

    Partition based models have low complexity but they perform poorly on non-convex data or data with many outliers and, like hierarchy based models, they require the number of clusters to be defined in advance. Fuzzy theory based
models, on the other hand, have poor scalability and strong sensitivity to the
hyperparameters of the models. On the contrary, the distribution based models
have a higher scalability but they strongly depend on the hyperparameters too.
The same holds for density based models, which however are very efficient and
suitable for data with arbitrary topologies. Graph theory based models have high
accuracy and efficiency of clustering, however the complexity increases signifi-
cantly as the complexity of the graph increases. Grid-based models are highly
scalable and characterized by low complexity but the output is very sensitive to
the granularity (mesh size). Finally, fractal theory based models have the advantage of linear complexity in the number of data points; however, they do not perform well enough to compete with the other clustering techniques.

3     Methodology
To understand how to create a system that, starting from a corpus of text, trains different clustering models using hyperparameter optimization techniques and selects the best ones based on a series of combined evaluation metrics, it is first necessary to focus on the dataset provided to implement the solution.

3.1   Dataset Description
The available dataset is associated with a virtual agent structured in intents, according to the business needs of the partner company. The dataset is composed of 1297 phrases, used by the conversational engine for training the model, representative of 140 intents, with an average of 9.26 training phrases per intent. The main topics concern document management, the resolution of problems encountered with machinery, contract management and restoration operations.

3.2   Clustering Pipelines
The results of the clustering system are computed by pipelines $P_i$, which are defined by the set of models implemented and their configurations. More formally:
\[ P_i = \left( E_e^l,\ D_d^m,\ C_c^n \right) \tag{1} \]
where $E_e \in \{E_1, \dots, E_E\}$ is the $e$-th embedding model and $E_e^l \in \{E_e^1, \dots, E_e^L\}$ is the $l$-th configuration of the model $E_e$; $D_d$ is the $d$-th dimensionality reduction model and $D_d^m$ its $m$-th configuration; $C_c$ is the $c$-th clustering model and $C_c^n$ its $n$-th configuration. At the end of the execution of all the pipelines, different clusterings are obtained, which are analyzed and evaluated in order to choose the ones that best adhere to the data, that is, those that have the best evaluation.
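In practice, the set of pipelines $P_i$ can be enumerated as the Cartesian product of the configured steps. The sketch below illustrates the idea; the configuration names are placeholders standing in for the models and hyperparameter grids described in Sections 3.3–3.5.

```python
from itertools import product

# Placeholder configuration registries; the real entries are the models of
# Sections 3.3-3.5 with their hyperparameter grids.
embedding_configs = ["bert", "alberto", "use"]
reduction_configs = [("pca", {"variance": 0.90}),
                     ("umap", {"n_neighbors": 5, "n_components": 30}),
                     ("umap", {"n_neighbors": 15, "n_components": 50})]
clustering_configs = ["kmeans", "optics", "spectral"]

# Each pipeline P_i is one combination (E_e^l, D_d^m, C_c^n) as in Eq. (1).
pipelines = list(product(embedding_configs, reduction_configs, clustering_configs))
for i, (emb, red, clu) in enumerate(pipelines):
    print(f"P_{i}: embedding={emb}, reduction={red}, clustering={clu}")
```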

3.3   Embedding
The first step of the proposed methodology is the application of a set of embedding models to the input dataset. This step is carried out over a dataset cleaned via stop-word elimination and pre-processed via stemming, in order to reduce the noise present in the data. At this step three different embeddings are produced: BERT, AlBERTo and USE. These models, indeed, represent the state of the art for embedding production and, moreover, they are equipped with tools suitable to support their industrialization. Indeed, all the models are written in Python [51] and use the TensorFlow framework [52] which, through the TensorFlow Hub module [53] (except for AlBERTo [56]), allows the pre-trained models (BERT [54], AlBERTo and USE [57]) and the pre-processing [55, 56] used for tokenization to be downloaded. BERT expects the inputs
to be pre-processed and structured according to specific requirements. For each
input phrase it is required to divide the words into sequences of tokens and to add
the token [CLS] at the beginning of the sequence and the token [SEP] at the end.
BERT uses the WordPiece tokenizer [50], whose vocabulary is initialized with
all individual characters and augmented with the most frequent and probable
word combinations. AlBERTo implements both the model and the tokenization
and pre-processing. USE is a model based on the Transformer architecture that
transforms the sentences into 512-dimensional embeddings.
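As an example of how the pre-trained models are consumed, the following minimal sketch loads the multilingual USE model from TensorFlow Hub [57] and embeds two illustrative sentences; it assumes the tensorflow-hub and tensorflow-text packages are installed and omits the stop-word removal and stemming applied by the framework.

```python
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops required by the model

# Pre-trained multilingual Universal Sentence Encoder (see reference [57]).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

sentences = [
    "come posso resettare la macchina",
    "non riesco a scaricare il contratto",
]

# Each sentence is mapped to a 512-dimensional embedding.
embeddings = embed(sentences)
print(embeddings.shape)  # (2, 512)
```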

3.4   Dimensionality reduction
The second step of the proposed methodology is the application of a set of dimensionality reduction models to each of the embeddings produced at the previous step. At this step three different dimensionality reduction techniques are implemented: PCA, Kernel PCA and UMAP. With respect to hyperparameter selection, these models can be of two different types: subject to hyperparameter tuning (auto-tunable), i.e. there is a methodology to automatically select the optimal hyperparameters, or not subject to hyperparameter tuning (non-auto-tunable), i.e. there is no methodology to automatically select the optimal hyperparameters. The same considerations hold for the clustering algorithms. Following the execution of the dimensionality reduction model, the standardization of the features is performed, to project the data into a uniform space, independent of the pipeline, for the next clustering step.

PCA The first dimensionality reduction model used by the proposed system is the implementation of Incremental PCA [58] which, differently from PCA, does not process the whole dataset in a single run, but computes the principal components incrementally, thus limiting the required memory without placing constraints on the size of the dataset. This method is non-auto-tunable, thus, to determine the number of principal components used to reduce the dimensionality, we select all and only the principal components that describe 90% of the variance. The data is processed in blocks of 5000 samples.
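A minimal sketch of this step with scikit-learn is shown below, assuming a random matrix in place of the embeddings produced by the previous step.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

# Random matrix standing in for the sentence embeddings.
X = np.random.rand(12000, 512)

# Fit the PCA incrementally, processing the data in blocks of 5000 samples.
ipca = IncrementalPCA(batch_size=5000)
ipca.fit(X)

# Keep all and only the leading components describing 90% of the variance.
cumulative = np.cumsum(ipca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
X_reduced = ipca.transform(X)[:, :n_components]

# Standardize the reduced features before the clustering step.
X_reduced = StandardScaler().fit_transform(X_reduced)
print(n_components, X_reduced.shape)
```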

Kernel PCA As mentioned, Kernel PCA is an extension of the PCA dimensionality reduction model based on the use of kernel methods. Kernel PCA is auto-tunable, i.e. the best configuration among all those available can be searched automatically. The methodology applied for the identification and automatic selection of the hyperparameters of the Kernel PCA algorithm is a variant of the one presented in [59]. Specifically, after defining the Kernel PCA configurations $K_1, \dots, K_K$ to be explored during hyperparameter tuning, the configuration is selected such that:
\[ K_{\bar{k}} \;\Big|\; \bar{k} = \operatorname*{arg\,max}_{1 \le i \le K} \mathrm{score}(K_i) \tag{2} \]

That is, given the original dataset $X$, each configuration $K_i$, $1 \le i \le K$, is applied such that:
\[ I_i = \mathrm{kernelPCA}(K_i, X) \tag{3} \]

Once the image $I_i$ of $X$ is obtained through the kernel function defined by the configuration $K_i$, the eigenvalues and eigenvectors of $I_i$ are computed. If the number of eigenvalues $N_i$ required to describe 90% of the variance of $I_i$ is greater than the number of components of the original feature space, a very low score is assigned; otherwise, the new feature space $I_i^{N_i}$, obtained by applying Kernel PCA with the configuration $K_i^{N_i} = (K_i, N_i)$, is calculated. The selection of the best set of hyperparameters is done by evaluating the reconstruction error. An inverse image $Z_i$ of the variable $I_i^{N_i}$ is defined such that:

\[ Z_i = H^{-1}(I_i^{N_i}) \tag{4} \]

where $H^{-1}$ is the function that determines the inverse image. The reconstruction error is calculated as:
\[ E_i = d(X, Z_i) \tag{5} \]
where $d(X, Z_i)$ is the Manhattan distance. Following the calculation of the reconstruction error, the score associated with the configuration is calculated as follows:
\[ K_i \;\Big|\; \mathrm{score}_i = - \sum_{1 \le n \le |X|} \; \sum_{1 \le m \le N_i} E_i(n, m) \tag{6} \]

    An application problem that arises with Kernel PCA when the size of the input is large is how to balance the parallelization of the executions with the management of the available memory. To meet this need we compute a heuristic based on the available memory, the number of available processors and the size of the input.
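The following sketch illustrates the selection procedure with scikit-learn (version ≥ 1.0 for the eigenvalues_ attribute); the kernel grid is a hypothetical example and the inverse image is obtained through the library's learned inverse transform rather than an explicit $H^{-1}$.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.rand(500, 50)  # stand-in for the embedded dataset

# Hypothetical grid of kernel configurations K_1..K_K to explore.
configurations = [
    {"kernel": "rbf", "gamma": 0.01},
    {"kernel": "rbf", "gamma": 0.1},
    {"kernel": "cosine"},
]

best_score, best = -np.inf, None
for config in configurations:
    # First pass: number of eigenvalues N_i describing 90% of the variance of I_i.
    full = KernelPCA(**config).fit(X)
    ratios = np.cumsum(full.eigenvalues_ / full.eigenvalues_.sum())
    n_i = int(np.searchsorted(ratios, 0.90) + 1)
    if n_i > X.shape[1]:
        continue  # penalized: more components than the original feature space

    # Second pass: project onto N_i components and compute the inverse image Z_i.
    kpca = KernelPCA(n_components=n_i, fit_inverse_transform=True, **config)
    Z = kpca.inverse_transform(kpca.fit_transform(X))

    # Score: negative total Manhattan reconstruction error, as in Eqs. (5)-(6).
    score = -np.abs(X - Z).sum()
    if score > best_score:
        best_score, best = score, (config, n_i)

print("selected configuration:", best)
```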

UMAP Finally, UMAP [34] is a dimensionality reduction model based on the theoretical framework of Riemannian geometry and algebraic topology, which builds a high-dimensional graph where the weights of the links between two nodes represent their probability of being connected. The connectivity of the graph is determined by defining a circle around each node and connecting the nodes whose circles overlap. UMAP guarantees the conservation of the local structure through a mechanism that binds the connection of each point to the one closest to it. Following the construction of the high-dimensional graph, UMAP optimizes its structure to reduce the components and creates an analogous graph of reduced dimensionality, as similar as possible to the one originally created. This module is non-auto-tunable, thus we set two configurations chosen empirically: the first implements the model with the parameter n_neighbors equal to 5 and n_components equal to 30, the second sets n_neighbors equal to 15 and n_components equal to 50.
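A sketch of the two configurations with the umap-learn package is shown below; the input matrix is a random stand-in for the embeddings.

```python
import numpy as np
import umap  # umap-learn package

X = np.random.rand(1297, 512)  # stand-in for the sentence embeddings

# The two empirically chosen configurations used by the framework.
configurations = [
    {"n_neighbors": 5, "n_components": 30},
    {"n_neighbors": 15, "n_components": 50},
]

for config in configurations:
    X_reduced = umap.UMAP(**config).fit_transform(X)
    print(config, X_reduced.shape)
```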

3.5   Clustering
The third step of the methodology is the computation of a set of clustering algorithms for each dataset produced by the dimensionality reduction step. Given the large number of models available for clustering, for each clustering method,

it is necessary to have a methodology for automatic selection of the optimal
configuration of the hyperparameters. For this reason, we decided to implement
only auto-tunable clustering algorithms.


K-means K-Means [39] bases its clustering strategy on the updating of cluster centroids through an iterative process carried out until convergence. The algorithm performs initialization (1), assignment of clusters (2) and updating of centroids (3). While the initialization, which involves the random assignment of the centroids, is performed only once, the step of assigning the data to the nearest centroid and that of recomputing the position of the centroids are carried out until the difference between the last centroids obtained and those calculated at the previous iteration is negligible, i.e. until convergence is reached. K-Means requires the definition of the number of clusters in advance, thus we implement an optimization process to determine the best number of clusters. The process involves performing different clusterings, each with a different and progressively larger number of clusters. For each clustering, the distances of the points from the centroid of the cluster to which they have been assigned are calculated. Finally, the Within Cluster Sum of Squares (WCSS) value of each configuration is calculated as the sum, over all clusters, of the squared Manhattan distances from each point to the centroid of the cluster it is assigned to. Given the set of WCSS values of the clusterings produced (each with a different number of clusters), the elbow point is sought through the kneedle algorithm [60]; the best configuration is then identified and, consequently, the optimal number of clusters.
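A minimal sketch of the cluster-number selection is shown below, using scikit-learn and the kneed implementation of the kneedle algorithm [60]; note that scikit-learn's inertia_ is a squared Euclidean WCSS, used here as a stand-in for the squared Manhattan variant described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from kneed import KneeLocator  # implementation of the kneedle algorithm [60]

X = np.random.rand(500, 30)  # stand-in for the reduced embeddings

# Run clusterings with a progressively larger number of clusters and record
# the within-cluster sum of squares (WCSS) of each configuration.
cluster_range = list(range(2, 31))
wcss = []
for k in cluster_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # sum of squared distances to the closest centroid

# The elbow of the decreasing WCSS curve gives the optimal number of clusters.
knee = KneeLocator(cluster_range, wcss, curve="convex", direction="decreasing")
print("optimal number of clusters:", knee.knee)
```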


OPTICS OPTICS [45] is based on the following strategy: a point belongs to a cluster if it is close to other points belonging to that cluster. The key parameter of OPTICS is minPts, which identifies the minimum number of points required to define a cluster. Based on minPts, OPTICS defines three types of points: core points, border points and outliers. The core points are those that are close to at least minPts points, the border points are those reachable from a core point but which are not themselves close to at least minPts points, and the outliers are the points that are neither core points nor border points. To select minPts we develop an automatic mechanism that runs different configurations, exploring the clusterings determined with different minPts values, and then assigns each of them a score determined through the general evaluator described in Section 3.6.
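A sketch of the minPts selection is shown below, with scikit-learn's OPTICS and the silhouette score as a stand-in for the general evaluator of Section 3.6; the minPts grid is a hypothetical example.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 30)  # stand-in for the reduced embeddings

best_score, best_min_pts = -1.0, None
for min_pts in (3, 5, 10, 20):  # hypothetical grid of minPts values
    labels = OPTICS(min_samples=min_pts).fit_predict(X)
    mask = labels != -1  # exclude the outliers from the evaluation
    if len(set(labels[mask])) < 2:
        continue  # the silhouette is undefined with fewer than two clusters
    score = silhouette_score(X[mask], labels[mask])
    if score > best_score:
        best_score, best_min_pts = score, min_pts

print("selected minPts:", best_min_pts)
```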


Spectral Clustering Spectral Clustering [39] is a graph based model that de-
termines how to cluster data by analyzing the spectrum of the similarity matrix.
Given an enumerated set of data points, the similarity matrix may be defined
as a symmetric matrix A, where Ai,j > 0 represents a measure of the similarity
between data points with indices i and j. The general approach to spectral clus-
tering is to use K-means on the relevant eigenvectors of the Laplacian matrix
of A. The eigenvectors that are relevant are the ones that correspond to the
smallest eigenvalues of the Laplacian except for the smallest eigenvalue which
will have a value of 0. One of the open issues of Spectral Clustering is the management of data at multiple scales. For this reason we introduce a local scaling mechanism, originally presented in [61], to improve the data representation. The introduction of local scaling involves calculating the affinity of two points using the Manhattan distance and the local scale obtained by applying the k-nearest neighbor rule [62] with k = 7, as described in [61]. Each clustering produced is then associated with a score assigned by the general evaluator described in Section 3.6: at the end of the iterations the clustering with the best score is chosen.
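The locally scaled affinity can be built explicitly and passed to a spectral clustering implementation, as in the sketch below; the number of clusters is a placeholder, since in the framework each candidate clustering is scored by the general evaluator of Section 3.6.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise_distances

X = np.random.rand(300, 30)  # stand-in for the reduced embeddings

# Local scale sigma_i: distance of each point from its 7th nearest neighbour,
# following [61]; Manhattan distance as in the rest of the framework.
k = 7
nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(X)
distances, _ = nn.kneighbors(X)
sigma = distances[:, -1]  # column 0 is the point itself

# Locally scaled affinity A_ij = exp(-d(x_i, x_j)^2 / (sigma_i * sigma_j)).
D = pairwise_distances(X, metric="manhattan")
A = np.exp(-(D ** 2) / np.outer(sigma, sigma))

labels = SpectralClustering(n_clusters=10, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(np.bincount(labels))
```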


3.6   Clustering evaluation and selection

The unsupervised clustering system presented in this research aims at determining the best N clusterings among those produced by the I pipelines. In order to achieve this goal we designed a general evaluator, which determines the goodness of each clustering with respect to the characteristics of cohesion and separation. Moreover, we designed a relative cohesion evaluator, which determines the cohesion of clusters with respect to a clustering set, and a relative separation evaluator, which determines the separation between clusters against a clustering set. The procedure for selecting the best clusterings involves selecting, among all the clusterings produced by the pipelines, only the first 10 with the best score assigned by the general evaluator. The relative cohesion evaluator and the relative separation evaluator are also computed, to give the user the possibility of further exploring the results and selecting the best clustering according to their needs. As general evaluator we use the Silhouette [63], defined as:

\[ \mathrm{sil}(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}} \tag{7} \]

where $a(x_i)$ is the average distance of $x_i$ from the other points belonging to the same cluster and $b(x_i)$ represents the distance between $x_i$ and the nearest cluster, calculated as the average distance of $x_i$ from all points belonging to the closest cluster. The range of values returned by the silhouette metric is $[-1, 1]$, where a good clustering for the point is achieved when $\mathrm{sil}(x_i) \to 1$. The silhouette score of a whole clustering is obtained by averaging the silhouette scores of the points in the dataset.
    The relative cohesion evaluator determines the cohesion of the clusters produced by a model and normalizes this value with respect to the cohesion values of a clustering set. Given a set of clusterings $R_1, \dots, R_k$, defining $C^t$ as the set of clusters created by the $t$-th clustering and $X$ as the dataset shared by all clusterings, the cohesion associated with clustering $t$ is defined as:
\[ \theta(R_t) = \sqrt{\frac{1}{|C^t|} \sum_{C_i \in C^t} \frac{1}{m_i} \sum_{x \in C_i} d(x, c_i)^2} \tag{8} \]

where $m_i$ is the number of points belonging to the $i$-th cluster, $c_i$ is the centroid of the $i$-th cluster and $d(x, y)$ is the Manhattan distance. The relative cohesion is calculated using the following equation:
\[ \theta_{rel}(R_t) = 1 - \frac{\theta(R_t) - \min\{\theta(R_1), \dots, \theta(R_k)\}}{\max\{\theta(R_1), \dots, \theta(R_k)\} - \min\{\theta(R_1), \dots, \theta(R_k)\}} \tag{9} \]
    Similarly, the relative separation evaluator determines the separation between clusters and normalizes that value against the separation values of a clustering set. Thus, given $c_i$ the centroid of cluster $C_i$, the separation associated with clustering $t$ is
\[ \lambda(R_t) = \sqrt{\frac{1}{|C^t|} \sum_{C_i \in C^t} m_i \, d(c_i, c)^2} \tag{10} \]
from which $\lambda_{rel}(R_t)$ is computed in the same way as the relative cohesion.
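The sketch below implements the three evaluators under two stated assumptions: the reference centroid $c$ in Eq. (10) is taken to be the centroid of the whole dataset, and the clustering set is replaced by random label assignments.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def cohesion(X, labels):
    """theta(R_t) of Eq. (8), with the Manhattan distance."""
    per_cluster = []
    for c in np.unique(labels):
        points = X[labels == c]
        centroid = points.mean(axis=0)
        per_cluster.append(np.mean(np.abs(points - centroid).sum(axis=1) ** 2))
    return np.sqrt(np.mean(per_cluster))

def separation(X, labels):
    """lambda(R_t) of Eq. (10); c is assumed to be the dataset centroid."""
    global_centroid = X.mean(axis=0)
    terms = []
    for c in np.unique(labels):
        points = X[labels == c]
        d = np.abs(points.mean(axis=0) - global_centroid).sum()
        terms.append(len(points) * d ** 2)
    return np.sqrt(np.mean(terms))

def relative(values):
    """Normalization of Eq. (9) over a clustering set."""
    v = np.asarray(values, dtype=float)
    return 1 - (v - v.min()) / (v.max() - v.min())

# Random stand-ins for the dataset and for the clusterings produced by the pipelines.
X = np.random.rand(400, 20)
clusterings = [np.random.randint(0, k, size=len(X)) for k in (5, 10, 20)]

general = [silhouette_score(X, labels) for labels in clusterings]  # general evaluator
theta_rel = relative([cohesion(X, labels) for labels in clusterings])
lambda_rel = relative([separation(X, labels) for labels in clusterings])
print(general, theta_rel, lambda_rel)
```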


4    Experimental results
As explained, the goal of this research is to create a framework for the automatic creation and evaluation of text clusterings. The goal is not to produce a unique clustering, but a set of clusterings with descriptive evaluation metrics that the final user can exploit to explore the corpus of text and derive a set of intents for the re-training of a chatbot system, or even for its first training. Indeed, the corpus of text comes from the chatbot's fallbacks, i.e. a set of questions to which the chatbot was not able to find an answer. Thus, in order to make the framework usable, we designed a web page that allows the user to upload the corpus of text, run the clustering methodology and visualize the first 10 clusterings with the best evaluation metrics. In order to visualize the results, three main plots are shown on the web page: a bar plot that shows, for each clustering, the number of clusters, the general evaluation score and the mean cardinality of the clusters (Figure 1, top-left); a bar plot that, for a selected clustering, shows the cardinality of each cluster (Figure 1, bottom-left); and a word cloud plot that, for a specific cluster selected from the previous bar plot, shows the main words and the main sentences of that cluster (Figure 1, top/bottom-right).
    After running the framework on the input dataset we analyzed the 10 clusterings produced. It is clear that the evaluation of the entire methodology cannot be done in a quantitative way as for supervised machine learning algorithms. Indeed, even if the dataset used for the experiment has a ground truth (each sentence has been manually associated with an intent), we cannot state that the only acceptable result produced by the clustering algorithms is the one defined by the ground truth, since the goal of unsupervised algorithms is precisely that of finding hidden structure in the original dataset. Thus, we manually explored each clustering produced to check whether the results are coherent, and it emerged that, apart from some clusterings with a very small number of large clusters, the clusterings produced actually manage to divide the dataset into groups of text with the same meaning or related to the same topic.




       Fig. 1. Bar plot of the meaningful clustering produced by the pipeline.




Of the 10 clusterings kept, 5 are meaningful and contain specific concepts related to the use and maintenance of the machinery. These 5 clusterings have approximately 30 to 100 clusters: three have fewer than 50 clusters, while two have 78 and 103 clusters respectively. These last two clusterings are even finer grained, since they are able to distinguish sentences with similar words but different meanings. The other clusterings, with fewer than 30 clusters, instead contain some very large clusters which are noisy and some small clusters which are meaningful; therefore, they seem to be less useful than the others. Even though the evaluations are empirical, we can state that the methodology is able to produce meaningful clusterings and that the framework created allows such clusterings to be exploited to detect topics and derive intents.
    To conclude, we also performed some experiments to define the hardware required to run the methodology. In particular, we ran the methodology on a Google Cloud Compute Engine instance [64] with 8 vCPUs and 32 GB of RAM. We tested the performance of the framework in terms of execution time, CPU and RAM utilization with two different datasets, one with 5000 sentences and one with 20000 sentences. The results are shown in Table 2. It can be seen that, in the first case, the duration of the pipeline and the use of resources are limited and almost never reach 100%. In the second case, the use of resources increases dramatically and reaches 100% of CPU utilization in two cases: Kernel PCA and Spectral Clustering. Moreover, for these two algorithms the execution time increases dramatically. In the created framework, in order to limit the use of resources, we set a limit of 20000 rows on the input dataset, to avoid exhausting the resources. This limit is actually enough for the partner company and for the specific use case.

                          Table 2. Resource consumption.

                                                    Dataset 5k                  Dataset 20k
Step                       Algorithm          Duration (h) %CPU %RAM    Duration (h) %CPU %RAM
Embedding                  BERT                   0.2       82    12        0.82      80    14
Dimensionality Reduction   PCA                    0.02      80    10        0.02      76     4
                           Kernel PCA             0.13     100    15        6.25     100    50
                           UMAP                   0.02      13     3        0.03      15     3
Clustering                 K-means                0.07      90     6        0.6       91     2
                           OPTICS                 0.07      16     5        0.57     100     8
                           Spectral Clustering    1.07      13     6       12.3       17    12



5   Conclusion and future works


This study presents a methodology based on machine learning for the realization of an unsupervised clustering system. The identified methodology consists of an embedding step, composed of the non-auto-tunable models BERT, AlBERTo and USE; a dimensionality reduction step, consisting of the non-auto-tunable PCA and UMAP models and the auto-tunable Kernel PCA model; a clustering step, composed of the auto-tunable K-Means, OPTICS and Spectral Clustering models; and a final evaluation step, where the metric used to select the best clusterings provides a score with respect to both cohesion and separation of clusters (further evaluations are obtained through metrics that focus only on cohesion and on separation of clusters). Ultimately, the developed methodology shows good performance with respect to the clustering goal, even though the performance considerations are empirical. In conclusion, the methodology is effective in dividing datasets into clusters, provides more than one clustering and accompanies the results with the metadata of the evaluators used. In general, it appears to be a system that performs well enough to be used in an industrial context. Its main potential and innovativeness lie in the automatic system for selecting the hyperparameters and the best clusterings: through these mechanisms, the sequences of models and hyperparameters that are most adherent and suitable for the dataset are chosen. Thanks to this mechanism it is, indeed, possible to automatically obtain the clustering of data never explored before and to extract useful information for the business logic. Even though the created framework works well, some enhancements can be made. In particular, from the clustering logic perspective, the pipeline can be enriched with other algorithms for the three different steps. However, the main enhancement to extend the framework is the creation of an auto-scaling infrastructure able to keep the training time low by increasing the resources, so that the framework is not limited in the size of the input and does not require the user to wait long for the creation of the clusterings.

References
1. Hindle, A., Barr, E., Gabel, M., Su, Z., Devanbu, P.: On the naturalness of software.
   Communications of the ACM 59, 122–131 (2016)
2. Otter, D. W., Medina, J. R., Kalita, J. K.: A Survey of the Usages of Deep Learn-
   ing for Natural Language Processing. IEEE Transactions on Neural Networks and
   Learning Systems 32(2), 604–624 (2021)
3. Cai, T., Giannopoulos, A., Yu, S., Kelil, T., Ripley, B., Kumamaru, K., Rybicki, F.,
   Mitsouras, D.: Natural Language Processing Technologies in Radiology Research
   and Clinical Applications. Radiographics: a review publication of the Radiological
   Society of North America 36, 176–191 (2016)
4. Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Presa, R. M., Shyu, M.-L.,Chen,
   S.-C., Iyengar, S.: A Survey on Deep Learning: Algorithms, Techniques, and Appli-
   cations. ACM Computing Surveys 51, 1–36 (2018)
5. Zhang, X., Zhao, J., Lecun, Y.: Character-level Convolutional Networks for Text
   Classification. In: Proceedings of the 28th International Conference on Neural Infor-
   mation Processing Systems, pp. 649–657. MIT Press, Cambridge, MA, USA (2015)
6. Plank, B.,Søgaard, A., Goldberg, Y.: Multilingual Part-of-Speech Tagging with Bidi-
   rectional Long Short-Term Memory Models and Auxiliary Loss. in: Proceedings of
   the 54th Annual Meeting of the Association for Computational Linguistics, pp.
   412–418. Association for Computational Linguistics, Berlin, Germany (2016)
7. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural Ar-
   chitectures for Named Entity Recognition. In: Proceedings of the 2016 Conference
   of the North American Chapter of the Association for Computational Linguistics:
   Human Language Technologies, pp. 260–270. Association for Computational Lin-
   guistics, San Diego, California (2016)
8. Ait-Mlouk, A., Jiang, L.: KBot: A Knowledge Graph Based ChatBot for Natural
   Language Understanding Over Linked Data. IEEE Access (2020)
9. Xu, D., Tian, Y.: A Comprehensive Survey of Clustering Algorithms. Annals of
   Data Science 2, 165–193 (2015)
10. Abualigah, L., Khader, A. T., Al-Betar, M., Hanandeh, E.: Unsupervised Text
   Feature Selection Technique Based on Particle Swarm Optimization Algorithm for
   Improving the Text Clustering. In: Proceedings of the First EAI International Con-
   ference on Computer Science and Engineering, EAI (2017)
11. Felipe, A., Geraldo, X.: Word Embeddings: A Survey. arXiv preprint,
   arXiv:1901.09069 (2019)
12. Qaiser, S., Ali, R.: Text Mining: Use of TF-IDF to Examine the Relevance of Words
   to Documents. International Journal of Computer Applications 181(1) (2018)
13. Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A.: A Latent Variable Model Ap-
   proach to PMI-based Word Embeddings. Transactions of the Association for Com-
   putational Linguistics 4 385-399 (2016)
14. Pennington, J., Socher, R., Manning, C.: Glove: Global Vectors for Word Repre-
   sentation. EMNLP 14 1532-1543 (2014)
15. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
16. Google Code Archive - Long-term storage for Google Code Project Hosting.
   https://code.google.com/archive/p/word2vec/. Last accessed 14 Sept 2021.
17. Word2vec - Wikipedia. https://it.wikipedia.org/wiki/Word2vec. Last accessed 14
   Sept 2021.
18. FastText - Facebook Research. https://research.fb.com/blog/2016/08/fasttext/.
   Last accessed 14 Sept 2021.

19. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with
   Subword Information. Transactions of the Association for Computational Linguistics
   5, 135–146 (2016)
20. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of Tricks for Efficient Text
   Classification. In: Proceedings of the 15th Conference of the European Chapter of
   the Association for Computational Linguistics, pp. 427–431. Association for Com-
   putational Linguistics, Valencia, Spain (2017)
21. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer,
   L.: Deep contextualized word representations. In: Proceedings of the 2018 Confer-
   ence of the North American Chapter of the Association for Computational Linguis-
   tics: Human Language Technologies, pp. 2227–2237. Association for Computational
   Linguistics, New Orleans, Louisiana (2018)
22. Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X.: Pre-trained Models for Nat-
   ural Language Processing: A Survey. Science China Technological Sciences 63,1872–
   1897 (2020)
23. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep
   Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019
   Conference of the North American Chapter of the Association for Computational
   Linguistics: Human Language Technologies, pp. 4171–4186. Association for Com-
   putational Linguistics, Minneapolis, Minnesota (2018)
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.
   N., Kaiser, L., Polosukhin, I.: Attention is All You Need. ArXiv preprint,
   arXiv:1706.03762 (2017)
25. Hu, D.: An Introductory Survey on Attention Mechanisms in NLP Problems. In:
   Bi Y., Bhatia R., Kapoor S. (eds) Intelligent Systems and Applications. IntelliSys
   2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham.
   https://doi.org/10.1007/978-3-030-29513-4
26. Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., Basile, V.: ALBERTO:
   Italian BERT Language Understanding Model for NLP Challenging Tasks Based
   on Tweets. In: Proceedings of the Sixth Italian Conference on Computational Lin-
   guistics (CLiC-it 2019), volume 2481. CEUR (2019)
27. Cunningham, J., Ghahramani, Z.: Linear Dimensionality Reduction: Survey, In-
   sights, and Generalizations. Journal of Machine Learning Research 16, 2859–2900
   (2015)
28. Xie, H., Li, J., Xue, H.: A survey of dimensionality reduction techniques based on
   random projection. ArXiv preprint, arXiv:1706.04371 (2017)
29. Gadekallu, T., Reddy, P., Lakshman, K., Kaluri, R., Rajput, D., Srivastava, G.,
   Baker, T.: Analysis of Dimensionality Reduction Techniques on Big Data. IEEE
   Access 8, 54776–54788 (2020)
30. Jolliffe, I., Cadima, J.: Principal component analysis: A review and recent develop-
   ments. Philosophical Transactions of the Royal Society A: Mathematical, Physical
   and Engineering Sciences 374 (2016)
31. Kim, C., Klabjan, D.: A Simple and Fast Algorithm for L1-norm Kernel PCA.
   IEEE Transactions on Pattern Analysis and Machine Intelligence 42(8), 1842 -
   1855 (2019)
32. Van der Maaten, L., Postma, E., Herik, H.: Dimensionality Reduction: A Compar-
   ative Review. Journal of Machine Learning Research 10 (2009)
33. Yang, B., Xiang, M., Zhang, Y.: Multi-manifold Discriminant Isomap for visual-
   ization and classification. Pattern Recognition 55 (2016)
34. McInnes, L., Healy, J.: UMAP: Uniform Manifold Approximation and Projection
   for Dimension Reduction. ArXiv preprint, arXiv:1802.03426 (2018)

35. Tschannen, M., Bachem, O., Lucic, M.: Recent Advances in Autoencoder-Based
   Representation Learning. ArXiv preprint, arXiv:1812.05069 (2018)
36. Xu, D., Tian, Y.: A Comprehensive Survey of Clustering Algorithms. Annals of
   Data Science 2 (2015)
37. Wong, K.-C.: A Short Survey on Data Clustering Algorithms. 2015 Second In-
   ternational Conference on Soft Computing and Machine Intelligence (ISCMI), pp.
   64-68. IEEE (2015)
38. Singh, S., Srivastava, S.: Review of Clustering Techniques in Control System. Pro-
   cedia Computer Science 173, 272–280 (2020)
39. Xu, J., Lange, K.: Power k-Means Clustering. In: Proceedings of the 36th Interna-
   tional Conference on Machine Learning (ICML) (2019)
40. Lorbeer, B., Kosareva, A., Deva, B., Softić, D., Ruppel, P., Küpper, A.: Variations
   on the Clustering Algorithm BIRCH. Big Data Research 11 (2017)
41. Gosain, A., Dahiya, S.: Performance Analysis of Various Fuzzy Clustering Algo-
   rithms: A Review. Procedia Computer Science 79, 100-111. (2016)
42. Yongli, L., Hengda W., Tianyi D., Jingli C., Hao, C.: Incremental fuzzy clustering
   based on a fuzzy scatter matrix. J. Inf. Process. Syst. 15, 359-373 (2019)
43. Viroli, C., Mclachlan, G.: Deep Gaussian Mixture Models. Statistics and Comput-
   ing 29 (2019)
44. Chen, Y., Tang, S., Bouguila, N., Wang, C., Du, J., Li, H.: A Fast Clustering
   Algorithm based on pruning unnecessary distance computations in DBSCAN for
   High-Dimensional Data. Pattern Recognition 83 (2018)
45. Schubert, E., Gertz, M.: Improving the Cluster Structure Extracted from OPTICS
   Plots. LWDA (2018)
46. Pulkit, M., Abhishek, M., Saket, A., Sushil, M.: Deep mean shift clustering. 2015
   Second International Conference on Soft Computing and Machine Intelligence (IS-
   CMI), pp. 64-68. IEEE (2019)
47. Suman, S., Rani, P.: A Survey on STING and CLIQUE Grid Based Clustering
   Methods. International Journal of Advanced Research in Computer Science 8, 1510-
   1512 (2017)
48. Cer, D., Yang, Y., Kong, S.-Y. Hua, N., Limtiaco, N., John, R., Constant, N.,
   Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., Kurzweil, R.:
   Universal Sentence Encoder. ArXiv preprint, arXiv:1803.11175 (2018)
49. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler,
   S.: Aligning Books and Movies: Towards Story-Like Visual Explanations by Watch-
   ing Movies and Reading Books. 2015 IEEE International Conference on Computer
   Vision (ICCV), 19-27 (2015)
50. Sennrich, R., Haddow, B., Birch, A.: Neural Machine Translation of Rare Words
   with Subword Units. ArXiv preprint, arXiv:1508.07909 (2016)
51. Welcome to Python.org. https://www.python.org/. Last accessed 14 Sept 2021.
52. TensorFlow. https://www.tensorflow.org/. Last accessed 14 Sept 2021.
53. TensorFlow Hub. https://www.tensorflow.org/hub. Last accessed 14 Sept 2021.
54. TensorFlow Hub. https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3.
   Last accessed 14 Sept 2021.
55. TensorFlow Hub. https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/1.
   Last accessed 14 Sept 2021.
56. GitHub - marcopoli/AlBERTo-it: AlBERTo the first italian BERT model for Twit-
   ter language understanding. https://github.com/marcopoli/AlBERTo-it. Last ac-
   cessed 14 Sept 2021.
57. TensorFlow Hub. https://tfhub.dev/google/universal-sentence-encoder-multilingual/3.
   Last accessed 14 Sept 2021.

58. Ross, D.,Lim, J., Lin, R.-S., Yang, M.-H.: Incremental Learning for Robust Visual
   Tracking. International Journal of Computer Vision 77, 125-141 (2008)
59. Md Ashad, A., Kenji, F.: Hyperparameter selection in kernel principal component
   analysis. Journal of Computer Science 10, 1139-1150 (2014)
60. Satopaa, V., Albrecht, J., Irwin, D., Raghavan, B.: Finding a Kneedle in a
   Haystack: Detecting Knee Points in System Behavior. 2011 31st International Con-
   ference on Distributed Computing Systems Workshops, 166–171 (2011)
61. Zelnik-Manor, L., Perona, P.: Self-Tuning Spectral Clustering. Adv. Neural Inf.
   Process. Syst. 17 (2004)
62. García-Pedrajas, N., Romero, J., Cerruela, G.: A Proposal for Local k Values for
   k-Nearest Neighbor Rule. IEEE Transactions on Neural Networks and Learning
   Systems 28, 1–6 (2015)
63. Wang, F., Franco-Penya, H.-H., Kelleher, J., Pugh, J., Ross, R.: An Analysis of
   the Application of Simplified Silhouette to the Evaluation of k-means Clustering
   Validity. MLDM (2017)
64. Compute Engine documentation — Compute Engine Documentation.
   https://cloud.google.com/compute/docs. Last accessed 14 Sept 2021.
65. Feine, J., Morana, S., Maedche, A.: Designing a Chatbot Social Cue Configuration
   System. In: International Conference on Information Systems. Munich (2019)