Guiding Users by Dynamically Generating Questions in a Chatbot System

Jannis Pilgrim¹, Jakob Kemmler¹, Moritz Wassmer¹, Silvio Echsle¹ and Andreas Lommatzsch²
¹ Technische Universität Berlin, Straße des 17. Juni 135, D-10623 Berlin, Germany
² DAI-Labor, TU Berlin, Ernst-Reuter-Platz 7, D-10587 Berlin, Germany

LWDA'22: Lernen, Wissen, Daten, Analysen, October 05–07, 2022, Hildesheim, Germany
Contact: jannis.pilgrim@campus.tu-berlin.de (J. Pilgrim); jakob.kemmler@campus.tu-berlin.de (J. Kemmler); moritz.wassmer@campus.tu-berlin.de (M. Wassmer); silvio.echsle@campus.tu-berlin.de (S. Echsle); andreas.lommatzsch@dai-labor.de (A. Lommatzsch)

Abstract
Chatbots efficiently support users in finding relevant answers in complex domains. They aggregate data from different sources and provide information in an interactive dialog. In a conversation, chatbots mimic human experts by providing information in well-consumable pieces, and they try to guide users towards their predicted information needs. One challenge for chatbots is generating counter-questions when user inputs are ambiguous or incomplete. Computing good counter-questions requires an understanding of the user's intentions and a structuring of the data that provides the needed information in a suitable format. In this work we present a solution for generating clarification questions based on dynamic data collections, applying semantic clustering and flexible question trees. We optimize and evaluate our approach for a chatbot tailored to answering questions related to services offered by the local public administration. We show that our approach efficiently helps users find the relevant information in a natural conversation, avoiding long lists of potentially interesting search results. The approach is based on a data enrichment and knowledge extraction pipeline that enables the adaptation of the components to different knowledge sources and the specific requirements of new domains.

1. Introduction

The rapid advances of NLP techniques in recent years and the great popularity of social media chat systems have led to a growing interest in chatbots. Chatbots mimic the behavior of a human expert and provide relevant information and answers to users in suitable "pieces". Compared with complex web documents or the long lists provided by search engines, chatbots guide users to the needed information in a natural dialog. Providing an adequate answer to a complex user question is a challenging task, since user questions can relate to a huge range of topics and aspects. Moreover, user questions are often imprecise and ambiguous due to limited knowledge of the domain. Thus, a chatbot must determine the most relevant information based on the context and the knowledge about the user (making sure that all potentially relevant cases are considered). The computation of potentially relevant responses can be done efficiently with Information Retrieval methods, optimized for finding potentially relevant information in large data collections. Determining the response best fitting the user's intention usually requires additional information, which chatbots must collect by asking the right questions.
The generation of good questions requires a precise understanding of the user question, a deeper analysis of potentially matching answers, and NLP techniques for generating questions and understanding given text snippets. In this paper we analyze the scenario of optimizing a chatbot tailored to providing answers related to the public services of a major German city. In our scenario, citizens need information about the services offered by the public administration (e.g., how to get a residential parking permit or how to get a passport for a baby). The conversation is usually started by the user with an initial question. Based on this question, the chatbot predicts potentially relevant services. In order to find the demanded information, the chatbot generates questions optimized to reduce the ambiguity of the user question and to reach the requested information in a minimal number of steps. We develop a component that deploys language models and dynamic decision trees to guide the user to the intended answers.

The remainder of this paper is organized as follows. Section 2 summarizes related research and explains how it contributed to our work. Section 3 describes our approach and the structure of the underlying data. Subsequently, the evaluation of our approach is presented in Section 4, through a quantitative and qualitative analysis of the chosen methods and the developed chatbots. Finally, Section 5 discusses the accomplishments of this paper and provides an outlook on future work.

2. Related Work

In this section we review prior work on asking clarifying questions for conversational Information Retrieval (IR) and related methods. We first look at the usefulness of clarifying questions; then we review related approaches for systems asking clarifying questions. We conclude this section by analyzing clustering methods for IR and reviewing literature on decision trees as knowledge extraction methods that can be used for generating clarifying questions.

Usefulness of Clarification Questions
Presenting information on small-screen devices or in voice-only situations is challenging. Information access can be improved by systems that actively support the interaction with the user [1]. It has been shown that users like to be asked for clarification [2, 3]. Information obtained through clarifying questions can substantially improve retrieval performance [4].

Clarification Question Models
Zamani et al. [3] proposed three models to generate clarification questions and candidate answers for open-domain web search. The authors combined a rule-based slot-filling model using question templates, a supervised model, and a reinforcement learning-based model. By aggregating huge collections of log data, weak supervision signals are extracted from query reformulation data and used for improving the models. This approach is not applicable in our scenario due to the small amount of log data available for learning query reformulations. Rosset et al. [5] built two conversational question suggestion models based on a BERT-based ranker and a GPT-2-based generator. They trained the ranking model in a multi-task fashion, mainly on weak supervision labels obtained from past user behavior, such as clicks on "People Also Ask" (PAA) panes, but also on human-annotated relevance labels. Their natural language generation model is trained on the PAA questions that were clicked after the user issued a query.
In our scenario a PAA approach cannot be used due to the lack of sufficient logged user questions. Aliannejadi et al. [4] collected a crowd-sourced dataset named "Qulac" to foster research on clarifying questions in open-domain IR. They proposed a conversational search system that selects a clarifying question from a pool of questions and ranks stored documents based on the user's answer. The researchers split the task into question retrieval, question selection, and document retrieval. This approach requires predefined clarifying questions as well as query-question-answer-target mappings to train all components of the system. Due to the resource constraints of this project, an equivalent dataset cannot be generated, making this approach a poor fit for our use case.

Datasets
Even though many datasets related to conversational search exist [6, 4, 7, 8, 9, 2, 10], most of them are either too domain-specific or unsuitable in their structure for our task. In addition, most of the datasets are in English, while our domain is the language of German public administration, which involves many unique words. Therefore, we found these datasets unhelpful for solving our problem, for example through transfer learning of existing models for our use case.

Clustering Text
For an efficient refinement of a set of potential topics, the resources must first be categorized. This categorization might be based on the annotations provided by the dataset, which mostly follow a textual format. In the following, we discuss two papers investigating different aspects of document clustering. Leouski and Croft [11] compared different clustering techniques for analyzing textual retrieval results. They reported good results with agglomerative hierarchical clustering algorithms in combination with frequency-based embeddings. Additionally, their results showed that human-oriented evaluation should be preferred over an artificial one. Mohammed et al. [12] analyzed document clustering by comparing two strategies based on a variety of popular evaluation methods. The researchers show that an approach based on semantic embeddings in combination with a density-based clustering algorithm (DBSCAN) outperforms a frequency-based embedding in combination with K-Means, especially on large datasets.

Discussion
Most of the related works have in common that data and feedback signals were either crowd-sourced or obtained by creating weakly supervised labels from massive log data. This allowed the respective models to be trained with supervised methods. In many conversational IR settings outside the major web search industry, there is most likely not sufficient data for weak supervision available, or there are not enough resources to obtain such data (as in our use case). Approaches built upon clustering methods or decision tree-based keyword selection combined with question templates may overcome these problems, since they only require keywords for the documents to be retrieved. Therefore, it is worthwhile to investigate such approaches.

3. Approach

In this section we give a comprehensive explanation of the three approaches developed in the context of this paper. First, we explain the dataset underlying our approaches. Then, a description of the fundamental algorithm shared by all approaches is provided. After that, we give a detailed description of the implemented approaches as well as the technologies used.
3.1. Data

The basis of our approaches is an annotated dataset provided by the city of Berlin (https://service.berlin.de/). The dataset consists of 881 descriptions of services offered by the city administration, such as the renewal of a personal ID card or getting a residential parking permit. Each service entry is annotated by a human expert ("editor") with a list of keywords describing the respective service. These keywords consist of nouns, verbs, and numbers. A simplified example is shown in Table 1.

Table 1
Example of the keywords assigned to offered services.

Service                      Keywords
Parking ID Application       Apply, Parking
ID Card Application          Apply
Parking ID Lost              Lost, Parking
Info about Pet ID Card       Pet
Change Address on ID Card    Address

3.2. General Approach

Our three approaches share the first step with the chatbot already in place: the first user interaction takes the form of an initial question. The user question is sent to an Apache Solr (https://solr.apache.org/) server, which provides efficient full-text access to the aforementioned dataset. Given the query, the server retrieves a list of relevant resources. This list is iteratively refined until the desired resource is found. The refinement process consists of four steps. First, all relevant documents are categorized and grouped based on their respective keyword annotations. Then, based on a heuristic, one group is selected and a superior term representing all resources in that group is inferred. Based on this superior term, a binary counter question is constructed from a question template and presented to the user ("Does your question revolve around topic X?"). If the user answers affirmatively, all resources except the ones contained in the selected group are removed from the list of possible resources. Otherwise, all resources contained in the selected group are removed. This procedure is repeated until only one resource is left, which is then presented to the user as the final answer. The implementation of this algorithm raises two main challenges: the categorization of the resources and the inference of the superior term describing the selected group. In the following, we describe three approaches tackling these challenges using different strategies and technologies.

3.3. Service Clustering

We name the first approach Service Clustering after the method it applies for grouping resources. For a visual depiction, see Fig. 1. After the initial list of relevant resources has been retrieved from the indexed dataset, the resources are categorized through clustering. This requires a meaningful representation that makes resources directly comparable to each other. Such a representation is created by encoding the keyword annotations of each resource with a text embedding. We used a TF-IDF-based embedding, as annotations commonly overlap between resources. The TF-IDF embedding emphasizes keywords that are distinct between resources. As those keywords carry the most information for differentiating between resources, this embedding provides vectors optimized for our setting. For clustering the resulting vector representations, two algorithms were tested: K-Means and DBSCAN. After clustering, one cluster is selected for generating a counter question. This selection should be made such that the information gained from the user is maximized, independent of the answer. Here, always picking the largest cluster is the best strategy, as it maximizes the number of resources that can be eliminated.
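A minimal sketch of this grouping step using scikit-learn, assuming each resource carries its editor-assigned keyword list. Function and variable names are illustrative; the min_samples value and the default Euclidean metric for DBSCAN are our assumptions, since the paper only reports the two clustering algorithms and (further below) an epsilon of 1.3.

```python
# Sketch of the Service Clustering grouping step: embed keyword annotations
# with TF-IDF, cluster the vectors, and select the largest cluster.
from collections import Counter

from sklearn.cluster import DBSCAN, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def largest_cluster(resources, use_kmeans=True):
    """Return the indices of the resources in the largest cluster.

    resources: list of dicts with a "keywords" list, e.g.
               {"service": "Parking ID Lost", "keywords": ["Lost", "Parking"]}
    """
    docs = [" ".join(r["keywords"]) for r in resources]
    vectors = TfidfVectorizer().fit_transform(docs).toarray()

    if use_kmeans:
        # Two clusters approximate the ideal 50/50 split for binary questions.
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
    else:
        # eps=1.3 is the value reported in the paper; min_samples=1 (so no
        # points are marked as noise) is our assumption.
        labels = DBSCAN(eps=1.3, min_samples=1).fit_predict(vectors)

    biggest_label, _ = Counter(labels).most_common(1)[0]
    return [i for i, label in enumerate(labels) if label == biggest_label]
```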
Once the largest cluster has been selected, a superior term needs to be inferred. To achieve this, a semantic embedding was used. Semantic embeddings encode words such that the distance between their respective vectors corresponds to the semantic similarity between the words. Such a representation was created for every keyword contained in the annotation of at least one resource in the selected cluster. For this, the large German model (https://spacy.io/models/de#de_core_news_lg) from the spaCy library was used. As the annotation vocabulary is highly domain-specific, a reliable semantic embedding could not be computed for all potentially relevant keywords using the employed language model. Only keywords for which an embedding could be created were used for the inference. After the encoding step, the superior term was selected as the keyword whose representation has the smallest summed-up distance to all other representations. This keyword forms the cluster centroid and is semantically closest to all words in the cluster, thus representing it best. As the measure of distance between vectors, cosine similarity was used.

Clustering Algorithms
For clustering the services, two algorithms were tested: K-Means [13] and DBSCAN [14]. The following gives a short description of the reasoning behind this decision. These two algorithms are commonly used in different problem settings surrounding text clustering [15, 16, 17]. As both are based on different ideas and therefore come with different drawbacks, they are often compared to each other to find the best-performing approach in a given domain [18, 19].

K-Means
K-Means is one of the most popular clustering algorithms, with a variety of applications including document classification. In K-Means, the number of desired clusters needs to be specified, which is beneficial with the binary question template used in this approach. In each iteration, the maximum percentage by which the set of relevant documents can be guaranteed to be pruned is 50%. This is the case when the clustering results in two equally sized clusters: no matter the user's answer, half of the resources can be eliminated, resulting in a logarithmic convergence speed. K-Means, which allows fixing the number of clusters to two, might give a good approximation of these ideal conditions. On the other hand, large clusters come with a significant drawback. As the number of resources per cluster increases, the complexity of finding a representative superior term increases. This might result in imprecise counter questions and therefore an error-prone retrieval performance.

DBSCAN
DBSCAN is a density-based algorithm, commonly used in the context of document classification. In contrast to K-Means, DBSCAN does not require a fixed number of clusters. Instead, a parameter epsilon is defined, specifying the maximum distance between two data points for them to be considered part of the same cluster. The method results in a dynamically adapted number of clusters, which allows us to control the in-cluster similarity and facilitates the inference of superior terms. In our experiments we found an epsilon of 1.3 to result in the best quality of superior terms. A high in-cluster similarity also results in smaller clusters and therefore a larger number of clusters. This is likely to limit the number of resources that can be eliminated in each iteration, resulting in a slower convergence speed.
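Returning to the superior-term inference described at the beginning of this subsection, the following is a minimal sketch under the stated assumptions (spaCy's de_core_news_lg model, cosine distance); the helper name is illustrative, not from the paper's code.

```python
# Sketch of the superior-term inference: among the keywords of the selected
# cluster, pick the one whose embedding minimizes the summed cosine distance
# to all other keyword embeddings.
import numpy as np
import spacy

nlp = spacy.load("de_core_news_lg")  # large German model


def superior_term(keywords):
    # Keep only keywords the language model has a usable vector for, since
    # the administrative vocabulary is highly domain-specific.
    vectors = {kw: nlp(kw).vector for kw in keywords}
    vectors = {kw: v for kw, v in vectors.items() if np.linalg.norm(v) > 0}

    def summed_distance(kw):
        v = vectors[kw]
        return sum(
            1.0 - np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
            for other, w in vectors.items()
            if other != kw
        )

    # The minimizer acts as the cluster centroid and best represents the group.
    return min(vectors, key=summed_distance)
```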
Figure 1: Visual depiction of the Service Clustering approach. (Flow: the initial query yields a result set; while more than one resource remains, the services are clustered, a superior term is inferred, a counter question is asked, and the result set is refined based on the user's answer; once a single resource is left, it is returned.)

3.4. Question Tree

The next approach, which we call the Question Tree approach, is based on the common decision tree algorithm ID3 [20]. Contrary to its usual usage in classification, it is applied here to information retrieval. The tree is constructed at run-time for every single query. Each node represents a question to the user, and the target variable is the service name, implying that purity is reached only if the sample size in a branch is one.

The keyword annotations (Table 1) are used as the decision variables of the tree. Each keyword is mapped to a Boolean variable indicating whether a service has it in its keyword list (1) or not (0). This results in an n × m matrix, where n is the number of services in a result set and m is the number of keywords associated with any of these services. The result of the transformation is shown in Table 2.

Table 2
Keywords transformed to decision variables.

Service                      Address  Apply  Lost  Parking  Pet
Parking ID Application       0        1      0    1        0
ID Card Application          0        1      0    0        0
Parking ID Lost              0        0      1    1        0
Info about Pet ID Card       0        0      0    0        1
Change Address on ID Card    1        0      0    0        0

Fig. 2 depicts a fully constructed tree for the fictitious example of an initial query of "identification", based on the tables above. Each node, shown in orange, represents a question about the keyword in its title. The heuristic for finding the group (i.e., the keyword) to ask the user about greedily chooses the variable that maximizes the information gain. The representation is always the keyword itself.

Figure 2: Graph visualizing all possible paths after the fictional initial query "identification". Nodes (orange) represent the question "Does it have something to do with X?"; grey boxes describe the assigned services (answers).

Due to the keywords being assigned by human annotators, imperfections could be found. These imperfections come in the form of different keywords that carry the same semantic information. This leads to unwanted side effects, as connections between corresponding resources cannot be made. In some cases, this makes for a frustrating user experience, as sequential counter questions might ask for information already provided by the user. Additionally, it artificially inflates the decision tree. These mistakes in the annotation fall into four categories: synonyms, different spellings of the same word, different grammatical surface forms of the same word, and words very close in semantic meaning. To tackle this problem, we grouped such "similar" words and determined a representation for each group. In order to find these groups, an initial clustering was applied before the other steps were executed. As the number of semantically unique words, and therefore the number of clusters, was not known beforehand, DBSCAN [21] was used again. We set epsilon to 0.2 to ensure a high semantic similarity within each cluster. spaCy's (https://spacy.io/) large German model is used for defining the semantic embedding; the cosine similarity in the vector space is used for computing the similarity.
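A minimal sketch of one greedy selection step of this tree, operating on the Boolean matrix of Table 2; helper names are illustrative. With the service name as the target variable, every service forms its own class, so the ID3 information gain of a binary keyword split reduces exactly to the entropy of the split itself, and the greedy choice is the keyword that divides the result set most evenly.

```python
# Sketch of the greedy keyword selection in the Question Tree approach.
import math


def split_entropy(n_yes, n_no):
    """Entropy of a binary split; maximal (1 bit) for an even split."""
    total = n_yes + n_no
    entropy = 0.0
    for n in (n_yes, n_no):
        if n:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


def best_keyword(services):
    """services: dict mapping a service name to its set of keywords,
    e.g. {"Parking ID Lost": {"Lost", "Parking"}, ...}"""
    all_keywords = set().union(*services.values())

    def gain(keyword):
        # With each service as its own target class, the ID3 information
        # gain equals the split entropy, which rewards even splits.
        n_yes = sum(keyword in kws for kws in services.values())
        return split_entropy(n_yes, len(services) - n_yes)

    return max(all_keywords, key=gain)
```

For the matrix in Table 2, "Apply" and "Parking" both split the five services 2/3 and therefore tie; Fig. 2 resolves this tie by asking about "Apply" first.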
Fig. 3 shows the program flow of the chatbot using the Question Tree approach.

Figure 3: Chatbot program flow using the Question Tree approach. (Flow: after the initial query, synonyms are clustered; while more than k resources remain, keyword occurrences are mapped, a keyword is chosen, a question is asked, and the result set is refined based on the user's answer; otherwise the result set is returned.)

4. Evaluation

In this section we describe our evaluation procedure and present the results obtained for the different approaches.

4.1. Quantitative Analysis

Procedure
The main objective of the chatbot is to help users find the desired information quickly and to suggest only the service fitting the user's intent. To ensure reproducibility and comparability in the evaluation, user interactions were simulated while various measures were logged in the background. The test dataset consists of about 6,500 real user dialogues. We connected every initial query q to a service s the user clicked on at some point of their dialog with the chatbot and assumed this to be their actual search intent, i.e., the ground truth. As can be seen in Algorithm 1, the result set R initially returned from the Solr system is iteratively refined based on simulated answers to the chatbot's questions. We reduced complexity by assuming that users are always able to answer all questions correctly. The evaluation is biased towards actual chatbot usage, as only 363 of the total 881 services could be mapped to initial user queries.

Algorithm 1 Evaluation Procedure Pseudocode
for (q, s) in Q do
    R ← solr.getResults(q)
    while length(R) > k do
        question ← chatbot.getQuestion(R)
        answer ← findCorrectAnswer(R, question, s)    ▷ simulate correct user answer
        R ← chatbot.refineResultset(R, question, answer)
    end while
end for

If the initial query yields a result set of any length, all three approaches are guaranteed to find the intended user service, or more generally a result set of length k. This is due to the fact that the Service Clustering approaches re-cluster at every iteration and that every combination of keywords is unique. We introduce two measures to compare the approaches:

• Mean Turns: the mean number of turns needed to find a service
• Mean Information Gain: the mean information gain of the answer to a question

The Information Gain (IG) of an iteration n (choose question, answer, refine) is defined as the difference between the natural logarithms of the lengths of the result set R before and after the answer, as depicted in the following equation:

    IG(n) = ln(length(R_{n-1})) − ln(length(R_n))    (1)
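A compact sketch of Algorithm 1 together with Eq. (1). The solr and chatbot objects and the find_correct_answer helper stand in for the paper's components; they are placeholders, not a concrete API.

```python
# Sketch of the simulated evaluation loop (Algorithm 1) with the
# information gain of Eq. (1) logged for every turn.
import math


def information_gain(len_before, len_after):
    # Eq. (1): IG(n) = ln(length(R_{n-1})) - ln(length(R_n))
    return math.log(len_before) - math.log(len_after)


def evaluate(queries, solr, chatbot, find_correct_answer, k=1):
    """queries: list of (initial query q, ground-truth service s) pairs."""
    turns, gains = [], []
    for q, s in queries:
        results = solr.get_results(q)
        n_turns = 0
        while len(results) > k:
            question = chatbot.get_question(results)
            # Simulate a user who always answers correctly for the target s.
            answer = find_correct_answer(results, question, s)
            refined = chatbot.refine_resultset(results, question, answer)
            gains.append(information_gain(len(results), len(refined)))
            results, n_turns = refined, n_turns + 1
        turns.append(n_turns)
    return sum(turns) / len(turns), sum(gains) / len(gains)  # mean turns, mean IG
```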
Results
We evaluate the different approaches quantitatively with the procedure just described and iterate until k = 1. Fig. 4 shows the distribution of how many turns are needed until a conversation converges, and Table 3 lists all results of the quantitative analysis. The distributions vary significantly. SC KMeans takes the longest to converge, with over five turns on average, but is also the least prone to outliers (e.g., services that might be hard to find) and very evenly distributed. SC DBSCAN converges faster on average but includes some conversations taking over 25 turns. The Question Tree approach using DBSCAN outperforms all other approaches in terms of speed, with less than four turns and an information gain of 0.950 per question on average. None of the three approaches dominates all measures; however, the Question Tree seems to be the best fit for the use case, at least from the quantitative point of view. In the following, we highlight the qualitative point of view.

Figure 4: Distribution of the turns needed until the service is found, per algorithm (SC DBSCAN, SC KMeans, Question Tree DBSCAN). The figure shows that SC DBSCAN has the highest variance; on average, Question Tree DBSCAN needs the smallest number of turns.

Table 3
Obtained performance scores for the implemented methods.

Algorithm                    Mean Turns  Max Turns  Information Gain
Service Clustering DBSCAN    4.69        27         0.904
Service Clustering KMeans    5.07        15         0.826
Question Tree DBSCAN         3.88        18         0.950

4.2. Quality of Questions

In this section we analyze the quality of the generated questions, focusing on the keywords chosen by the approaches. The keywords used as the basis for the generated questions should come from the user's vocabulary (to ensure that the user knows the terms), and the keywords should be unique and simple (to minimize ambiguity). When analyzing the approaches, an interplay between two characteristics of the chatbots can be observed. Chatbots whose largest cluster comprises almost 50% of the services converge particularly quickly to a result: clustering methods that group around 50% of the services in each iteration also halve the result set by about 50%, regardless of the user's response. Here one can observe parallels to bisection, where the interval width is halved with each step, yielding a runtime of O(log(n)). The second relevant property is the representativeness of the keyword. Some approaches tend to find very general keywords or ask about the same keyword multiple times. This can be attributed to overly large clusters being formed, for which it is difficult to find a common keyword. The qualitative experiments confirm our assumption that chatbots with higher convergence times ask qualitatively better questions. The specific characteristics of the algorithms have been studied on several examples. The main observations are explained in the subsequent paragraphs.

Question Tree
The Question Tree is one of the faster-converging approaches. This can be explained by the fact that the algorithm focuses on selecting particularly good keywords, while the clusters result from the user's decisions.

Service Clustering with K-Means
The approach of Service Clustering with K-Means as the clustering method was convincing with particularly short dialogues in the studied examples. The reason this approach converges quickly to a solution can be attributed to the size of the largest cluster being close to 50% of the result set. In some situations, this forces services into large clusters without their having any influence on the keyword of these clusters. These assignments ultimately lead to less precise counter questions due to an inaccurately chosen keyword.

Service Clustering with DBSCAN
The third approach uses the clustering algorithm DBSCAN. This algorithm determines the number of clusters at runtime, which allows it to create new clusters according to the number of topics. By adjusting the epsilon parameter, this effect can be controlled. This control allows balancing the convergence speed against the quality of the questions. Overall, our analysis shows that the Question Tree approach provides the best questions.
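As a back-of-the-envelope check connecting the convergence discussion with Table 3 (our reading, not a computation from the paper): Eq. (1) implies that the mean information gain determines how quickly the result set shrinks, with ln 2 ≈ 0.693 marking a perfectly halving question.

```python
# Relating mean information gain (Table 3) to convergence speed.
import math

mean_ig = 0.950                      # Question Tree DBSCAN, Table 3
kept_fraction = math.exp(-mean_ig)   # ~0.39 of the result set survives a turn
halving_ig = math.log(2)             # ~0.693, the IG of a perfectly halving question


def expected_turns(initial_size, ig=mean_ig):
    # A result set of n services converges to one in about ln(n)/IG turns.
    return math.log(initial_size) / ig


# An illustrative (assumed) initial result set of 40 services yields ~3.9 turns,
# in line with the measured mean of 3.88.
print(kept_fraction, halving_ig, expected_turns(40))
```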
5. Conclusion

In this work, we presented a solution for generating counter questions based on dynamic data collections. We developed and evaluated three approaches combining semantic clustering and decision trees. The methods have been optimized for the specific requirements of a chatbot for the German public administration. Our experiments show that our method generates reasonable questions, effectively guiding users to the desired resources in an intuitive conversational style. Our solutions provide the basis for a well-working interactive information retrieval system. Our approach can also be applied to similar scenarios, since it can be used with text collections of answers or documents. With our findings, we contribute to research on chatbot systems and information retrieval. As future work, we plan to improve the generation of questions with respect to their natural soundness and their adaptation to the context-specific language style.

Acknowledgment
We thank the ITDZ Berlin for supporting the development of the chatbot framework.

References
[1] W. B. Croft, The importance of interaction for information retrieval, in: Procs. of the 42nd Intl. ACM SIGIR Conf., ACM, NY, USA, 2019, pp. 1–2. doi:10.1145/3331184.3331185.
[2] J. Kiesel, A. Bahrami, B. Stein, A. Anand, M. Hagen, Toward voice query clarification, 2018. URL: https://dl.acm.org/doi/pdf/10.1145/3209978.3210160.
[3] H. Zamani, S. Dumais, N. Craswell, P. Bennett, G. Lueck, Generating clarifying questions for information retrieval, ACM, NY, NY, USA, 2020, pp. 418–428.
[4] M. Aliannejadi, H. Zamani, F. Crestani, W. Croft, Asking clarifying questions in open-domain information-seeking conversations, 2019.
[5] C. Rosset, C. Xiong, X. Song, D. Campos, N. Craswell, S. Tiwary, P. Bennett, Leading conversational search by suggesting useful questions, in: The Web Conference '20, 2020.
[6] H. Zamani, G. Lueck, E. Chen, R. Quispe, F. Luu, N. Craswell, MIMICS: A large-scale data collection for search clarification, in: Proc. of the 29th ACM CIKM, CIKM '20, ACM, New York, NY, USA, 2020, pp. 3189–3196. doi:10.1145/3340531.3412772.
[7] C. Qu, L. Yang, W. B. Croft, J. R. Trippas, Y. Zhang, M. Qiu, Analyzing and characterizing user intent in information-seeking conversations, in: The 41st Intl. ACM SIGIR Conf., ACM, 2018. doi:10.1145/3209978.3210124.
[8] R. Lowe, N. Pow, I. Serban, J. Pineau, The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems, 2015. doi:10.18653/v1/W15-4640.
[9] F. Radlinski, K. Balog, B. Byrne, K. Krishnamoorthi, Coached conversational preference elicitation: A case study in understanding movie preferences, in: Procs. of the 20th SIGdial Meeting on Discourse and Dialogue, ACL, Stockholm, Sweden, 2019, pp. 353–360.
[10] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, L. Zettlemoyer, QuAC: Question answering in context, 2018. URL: https://arxiv.org/abs/1808.07036. doi:10.48550/ARXIV.1808.07036.
[11] A. V. Leouski, W. B. Croft, An evaluation of techniques for clustering search results, Technical Report, 1996.
[12] S. M. Mohammed, K. Jacksi, S. R. M. Zeebaree, GloVe word embedding and DBSCAN algorithms for semantic document clustering, in: Intl. Conf. on Advanced Science and Engineering, 2020, pp. 1–6. doi:10.1109/ICOASE51841.2020.9436540.
[13] E. W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21 (1965) 768–769.
[14] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, volume 96, 1996, pp. 226–231.
[15] C. Xiong, Z. Hua, K. Lv, X. Li, An improved k-means text clustering algorithm by optimizing initial cluster centers, in: 7th Intl. Conf. on Cloud Comp. and Big Data, 2016, pp. 265–268.
[16] R. G. Cretulescu, D. Morariu, M. Breazu, D. Volovici, DBSCAN algorithm for document clustering, Intl. Journal of Adv. Statistics and IT&C for Economics and Life Sciences 9 (2019).
[17] R. N. G. Indah, R. Novita, O. B. Kharisma, R. Vebrianto, S. Sanjaya, T. Andriani, W. P. Sari, Y. Novita, R. Rahim, et al., DBSCAN algorithm: Twitter text clustering of trend topic Pilkada Pekanbaru, in: Journal of Physics, volume 1363, IOP Publishing, 2019, p. 012001.
[18] M. A. Ahmed, H. Baharin, P. N. Nohuddin, Analysis of k-means, DBSCAN and OPTICS cluster algorithms on al-Quran verses, Intl. Journ. of Adv. Computer Science and Apps. 11 (2020).
[19] D. Xu, Y. Tian, A comprehensive survey of clustering algorithms, Annals of Data Science 2 (2015) 165–193.
[20] I. H. Witten, E. Frank, M. A. Hall, C. J. Pal (Eds.), Data Mining: Practical Machine Learning Tools and Techniques, 4th ed., Morgan Kaufmann, 2017. doi:10.1016/B978-0-12-804291-5.00014-3.
[21] S. Mohammed, K. Jacksi, S. Zeebaree, A state-of-the-art survey on semantic similarity for document clustering using GloVe and density-based algorithms, Journal of Electrical Engineering and Computer Science 22 (2021) 552–562. doi:10.11591/ijeecs.v22.i1.