Dask-Based Efficient Clustering of Educational Texts⋆

Fail Gafarov1[0000−0003−4704−154X], Dmitriy Minullin1[0000−0001−7713−5251], and Viliuza Gafarova2[0000−0001−7215−4007]

1 Kazan Federal University, 18 Kremlyovskaya street, Kazan 420008, Russian Federation, fgafarov@yandex.ru
2 TR AS Institute of Applied Semiotics, Levo-Bulachnaya str., 36A, Kazan, 420111, Russian Federation

Abstract. Document clustering is a long-running and computationally demanding process. The need for systems that allow fast document clustering is especially relevant for processing large volumes of text data (Big Data). In this work we present a distributed text clustering framework based on Dask, an open-source library for parallel and distributed computing. The Dask-based processing system developed in this work executes all operations related to the clustering of text documents in parallel. We implemented a parallel agglomerative clustering algorithm operating on cosine similarity matrices computed from term frequency-inverse document frequency (TF-IDF) feature matrices of the input texts. The system was applied to the intellectual analysis of educational data accumulated in the system "Electronic Education of the Republic of Tatarstan" from 2015 to 2020. Specifically, using the developed system we clustered the text documents describing lesson planning, and performed a comparative analysis of the average marks of students whose training was carried out according to lesson plans belonging to different clusters.

Keywords: Big Data, Dask, educational data mining, python, document clustering, TF-IDF, ANOVA

⋆ The reported study was funded by RFBR, project number 19-29-14082.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Big data analytics describes methods of processing large amounts of data in order to discover hidden patterns [1], market trends, customer preferences and other information useful for making the right decisions [11], [30]. It has been adopted by a wide variety of industries and has become an industry in its own right [26]. Using Big Data methods in the educational system includes measuring, collecting, analyzing and presenting huge volumes of structured and unstructured data about students and the educational environment in order to understand the peculiarities of the functioning and development of the educational system [25], [12]. For the successful management of the educational process, it is necessary to promptly process numerous incoming data online, so the use of Big Data technologies becomes a necessity [19]. Working with large amounts of data requires not only modern hardware, but also mathematical algorithms that reduce the number of computing operations required [3], [37].

Text mining is an emerging powerful tool for the analysis of unstructured textual data in big data analytics. These methods are used to extract new knowledge and to identify significant patterns and correlations hidden in the data [16]. A method for extracting associative feature information by using text mining from health big data is proposed in [21]; social network analysis algorithms are utilized for identifying emerging trends in the big data domain [18]; and text mining has gained popularity with big data resources when analyzing data in the financial sector [6].
Document clustering is the process of finding groups of similar documents in a collection of documents. Clustering algorithms are among the most popular data mining methods and are widely used for processing text data. They have a wide range of applications, such as classification [5, 4], visualization [8] and organization of documents [13]. Various methods of text clustering are used, the most popular of which are Latent Semantic Analysis/Indexing (LSA/LSI) [10], Suffix Tree Clustering [44], and Scatter/Gather [9]. Recently, methods based on neural networks have gained popularity [27], together with classical clustering methods such as the k-means algorithm [28]. In document clustering tasks the problem of determining the optimal number of clusters arises [23], [36]. When using hierarchical clustering, researchers often have to work with dendrograms, analyze them manually and trim them at the required number of clusters. This approach suffers from several problems: first, the obvious slowness of such a study; second, the human factor and the possibility of making mistakes. In cases where the threshold is not easy to determine, two different researchers may come to different conclusions about the correct number of clusters. There are various approaches for determining the optimal number of clusters: the gap statistic, the elbow method, the mode, and the maximum difference [43], [20], [45].

This work demonstrates the possibilities of using the Dask cluster computing system [32] for analyzing texts describing lesson planning in schools. The study is based on the data collected in the state information system "Electronic Education in the Republic of Tatarstan." This system includes a large-scale database of educational information on all students and all teachers of schools located in the Republic of Tatarstan. Teachers fill in lesson topics and homework in the electronic journal through their personal accounts. They also enter into the system the current grades given to their students.

We believe that the success of the educational process depends on the teaching material used by the teacher. The main goal of this work is to build an analytical system for finding and studying the differences between the academic performance of schoolchildren studying with different educational and methodological complexes or different curricula. We can obtain information about the differences in teaching and methodological materials only on the basis of an analysis of textual data on the conducted lessons, which were filled in by teachers through their personal accounts. The main problem is that these data are not marked up or annotated, i.e. teachers do not indicate which curriculum they use for their lessons. Therefore, to solve this problem, we need an automatic text clustering system. For this research, depersonalized datasets were provided in comma-separated values (CSV) format. These files contain information about teachers, pupils, lessons and marks for all subjects and all grades. The total amount of data is over 60 GB. Because the original data consist of a large number of texts (Big Data) of different sizes that would take a lot of time to process using traditional methods, we need high-performance computing (HPC) based on cluster computing to process the text data efficiently. In this work we developed an HPC computational framework based on the Dask distributed library [32].
To convert a document into a structured format, the TF-IDF (Term Frequency - Inverse Document Frequency) weighting scheme is used. To measure the distance between texts we used the cosine similarity measure. The division of texts into clusters was carried out using an agglomerative clustering algorithm, with the elbow method used to find the optimal number of clusters. Using the results of text clustering, we carried out a comparative analysis of the average academic performance of students whose lesson planning texts belong to different clusters. The developed system makes it possible to efficiently and quickly carry out a full-fledged analysis of this type for all grades and subjects.

2 Dask-Based Parallel Processing Framework

To efficiently and quickly process a large amount of unstructured data we have to use Big Data technologies, and the power of computing clusters is also required. For high-performance calculations a computing cluster containing 4 virtual machines was deployed (each VM has a 1 TB HDD, 32 GB RAM and 16 CPU cores), with the Python-based library for parallel computing, Dask, installed. Dask is a flexible parallel big data processing library, designed to provide scalability and to extend the capabilities of existing Python packages and libraries [32]. Dask allows users to integrate and run in parallel existing Python-based scripts written using popular libraries such as NumPy, SciPy, Pandas and others. The main advantage of using the Dask library for processing large texts is the ability of the system to perform computations on data volumes that are larger than the available memory of a single computer [39], [17], [14]. The functionality of Dask can be divided into two parts:
1. Dynamic task scheduling optimized for cluster-based HPC computation.
2. "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

Fig. 1. Schematic representation of the system's architecture

All data, initially obtained in the form of CSV files, were combined into more convenient data structures (DataFrames) using Dask's merge() method, and stored on the cluster computers' hard drives in the parquet binary data format [42]. This dataframe, in addition to the texts describing the topics of the lessons, also contains information about lesson dates, grades, subjects and teachers. The distributed data processing framework is schematically drawn in the left panel of Figure 1. The right side of Figure 1 shows the sequential steps of the text processing pipeline, executed in parallel on the computational cluster. In the first processing step, we pre-processed the entire text corpus to remove noisy and less useful words. Next, we applied TF-IDF and cosine similarity calculation, followed by hierarchical agglomerative clustering, using the "elbow" method to determine the number of clusters. Such parallel execution of processing methods is started by calling the dataframe's apply method and specifying, as a parameter, the name of the method to be executed in parallel.
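As described above, the source CSV files were combined with merge() and stored in parquet format before processing. A minimal sketch of this ingestion step with dask.dataframe is shown below; the scheduler address, file paths, column names and the join key are assumptions for illustration, not the exact values used in the system:

import dask.dataframe as dd
from dask.distributed import Client

# Connect to the cluster scheduler (address is an assumption)
client = Client("tcp://scheduler:8786")

# Lazily read the raw CSV files; nothing is loaded into memory yet
lessons = dd.read_csv("data/lessons_*.csv")
marks = dd.read_csv("data/marks_*.csv")

# Combine the tables into one DataFrame (join key is an assumption)
lessons_texts = lessons.merge(marks, on="lessonID")

# Persist the combined DataFrame in the parquet binary format
lessons_texts.to_parquet("data/lessons_texts.parquet")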
To perform parallel processing of school subjects (ProcSubjects), the following call to the apply method is used:

lessons_texts.groupby(['subjectID']).apply(ProcSubjects).compute()

The method for processing the texts of particular grades (ProcessGrades) is launched in a similar way:

def ProcSubjects(df):
    df.groupby(['grade_number']).apply(ProcessGrades)

Data preprocessing. As the first step, the texts describing lessons (originally 95,786,285 texts) were combined into 580,698 text documents, one for each subject and each teacher. These texts were subjected to the following processing:
1. The document texts are divided into tokens.
2. From the array of tokens we removed punctuation marks, empty lines and stop words, which do not carry any special meaning in the sentence.
These steps are necessary to improve the accuracy of the text comparison.

Document text vectorization by the term frequency-inverse document frequency (TF-IDF) method. Traditional text similarity measurements use the TF-IDF (term frequency (TF) x inverse document frequency (IDF)) method (first introduced in [35]) to compute similarity between text documents using cosine similarity [7], to examine the relevance of words to documents [31], for short-text clustering [38], text categorization [41], and pattern mining on text data [2]. TF-IDF reflects how important a word is to a document in a collection or corpus, and therefore it is often used as a weighting factor in information retrieval and text mining. The TF-IDF value is proportional to how often a word occurs in a particular document, but is compensated by the frequency of the word in the entire corpus. This property of TF-IDF helps to take into account the fact that some words are usually much more common than others [7]. For a collection of words t ∈ T that appear in a set of N documents d ∈ D with lengths n_d, TF-IDF is computed by the formulas

TF = \frac{f_{t,d}}{n_d}, (1)

IDF = \log\frac{N}{df_t}, (2)

W = TF \cdot IDF, (3)

where f_{t,d} is the frequency of word t in document d, and df_t is the number of documents in which the word t appears.

Cosine similarity and distance matrix. At the next step we constructed a text similarity matrix by performing a similarity analysis of the compared texts. The cosine similarity method [40] was chosen as the comparison method for the TF-IDF vectorized texts. Cosine similarity is a measure of the similarity between two vectors that measures the cosine of the angle between them. For two feature vectors a and b, the cosine similarity cos(θ) can be represented using the dot product and the norm [15]:

\cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}, (4)

where a are the coordinates of the first vector (the TF-IDF vector of the first text) and b the coordinates of the second vector (the TF-IDF vector of the second text). By calculating the cosine similarity for pairs of vectors in a loop, a text similarity matrix is constructed. Based on the similarity matrix, we calculated the distance matrix (distance matrix = 1 − similarity matrix) to use as input data for the agglomerative clustering algorithm.
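A minimal sketch implementing formulas (1)-(4) directly with NumPy is shown below. The toy corpus is an assumption for illustration only; in the production pipeline these steps run in parallel over the Dask dataframe:

import numpy as np

# Toy tokenized corpus (an assumption for illustration)
docs = [["addition", "subtraction", "numbers"],
        ["numbers", "multiplication"],
        ["reading", "writing"]]

vocab = sorted({t for d in docs for t in d})
N = len(docs)

# Formula (1): TF = f_{t,d} / n_d
tf = np.array([[d.count(t) / len(d) for t in vocab] for d in docs])

# Formula (2): IDF = log(N / df_t), df_t = number of docs containing t
df = np.array([sum(t in d for d in docs) for t in vocab])
idf = np.log(N / df)

# Formula (3): W = TF * IDF (one row per document)
W = tf * idf

# Formula (4): cosine similarity matrix of the TF-IDF vectors
norms = np.linalg.norm(W, axis=1, keepdims=True)
similarity_matrix = (W @ W.T) / (norms * norms.T)

# Distance matrix used as input for agglomerative clustering
distance_matrix = 1.0 - similarity_matrix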
Agglomerative clustering. Clustering is one of the most common unsupervised machine learning problems. For text clustering we used the most common type of hierarchical clustering, agglomerative clustering. This method groups objects into clusters based on their similarity. The algorithm starts by treating each object as a single cluster. In the next steps, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects [33]. The results of agglomerative clustering are usually presented as a dendrogram (an example is shown in Figure 2). The dendrogram is convenient because it allows one to visually observe the clustering process. To calculate the distance between clusters, Ward's method was used. Ward's minimum variance method calculates the distance between cluster members and the centroid. The centroid of a cluster is defined as the point at which the sum of squared Euclidean distances between the point itself and each other point in the cluster is minimised. The increment in the sum of squared distances of objects to the center of the cluster, obtained as a result of their union, is taken as the distance between the clusters. As a result, we get a complete clustering tree of our initial set, from which we can obtain a clustering at any level [29].

Fig. 2. Example of a dendrogram for the subject mathematics in the 3rd grade

Determining the number of clusters by the "elbow" method. Determining the optimal number of clusters is a fundamental issue in partitioning clustering, which requires the user to specify the number of clusters k to be generated. A simple and popular approach consists of visual analysis of the dendrogram to see if it suggests a particular number of clusters. This approach is very subjective, and its main drawback is the impossibility of visually analyzing hundreds of dendrograms within the framework of automatic processing. To find the optimal number of clusters in automatic mode we use the "elbow" method. This method is based on calculating the Within-Cluster Sum of Squared Errors (WSS) for different numbers of clusters k and selecting the k at which the change in WSS starts to diminish. Based on the numerical experiments carried out in [43], the elbow method and the maximum difference seem to capture the simulated data perfectly. To demonstrate the "elbow" method we plot a line chart of the SSE for each value of k (see Figure 3). Our goal is to choose a small value of k that still has a low SSE; the elbow point usually represents the point where we start to get diminishing returns by increasing k (Figure 3). We used the KElbowVisualizer class from the yellowbrick Python library to select the optimal number of clusters. Knowing the number of clusters from the "elbow" method, we can split our set of texts into k clusters. In Figure 4 we show the numbers of texts per cluster for the subject "mathematics" in the 3rd grade.
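A minimal sketch of this clustering step with SciPy is shown below. The random stand-in for the TF-IDF matrix W and the value k = 5 are assumptions; note also that Ward linkage is strictly defined for Euclidean distances, so feeding it a cosine-based distance matrix mirrors the pipeline described above rather than a general recipe:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Stand-in for the TF-IDF matrix from the previous step (an assumption)
rng = np.random.default_rng(42)
W = rng.random((20, 50))

# distance_matrix = 1 - cosine similarity, as in the pipeline above
norms = np.linalg.norm(W, axis=1, keepdims=True)
distance_matrix = 1.0 - (W @ W.T) / (norms * norms.T)
np.fill_diagonal(distance_matrix, 0.0)  # remove floating-point noise

# Condensed form expected by scipy's linkage
condensed = squareform(distance_matrix, checks=False)

# Build the full agglomerative clustering tree with Ward linkage
Z = linkage(condensed, method="ward")

# Cut the dendrogram into k flat clusters;
# k is chosen with the elbow method (e.g. yellowbrick's KElbowVisualizer)
labels = fcluster(Z, t=5, criterion="maxclust")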
For checking the reliability of the results obtained using the Dask-based cluster computational system, as well as to assess the performance of the parallel processing pipeline, we performed the same computations on a desktop computer without the Dask system. It should be noted that when running the same text processing pipelines on ordinary desktop computers we ran into problems, mainly because processing large data files requires large amounts of RAM. To work around this, we had to split huge text files into smaller files (small enough to fit into the RAM of a desktop computer) and process them sequentially, constantly reading and writing intermediate results to hard drives, which slows down data processing by a dozen times. Data processing that takes about an hour on the Dask-based cluster system (4 VMs, each with a 1 TB HDD, 32 GB RAM and 16 CPU cores) took several days on desktop computers.

Fig. 3. Example of the "elbow" point for the subject "mathematics" in the 3rd grade

3 Results

Using the computational framework presented above, we performed a cluster analysis of the texts describing lessons for all school subjects and all grades (2-11). To confirm the reliability of the clustering results obtained by our system, some of the texts (mathematics, 3rd grade) were analyzed by hand by teachers, who confirmed the correctness of the automatic division into clusters. These teachers also confirmed that our system correctly identifies the educational and methodological complexes that are used in the primary grades. We estimated that the accuracy of correctly clustered texts on the manually tested subset reached 93%. The incorrectly clustered texts (7%) are due to the fact that some teachers entered information into the system incorrectly or incompletely.

A comparative analysis of WSS for the mathematics and Russian language subjects across grades 2-11 (Fig. 5) shows that in the primary grades the values of WSS are very high, which means there is a strong difference between lesson texts. This difference gradually disappears by grades 10-11. A similar picture is observed for all academic years (in Figure 5 we present the results only for the 2015-2016 and 2016-2017 academic years). This is due to the fact that after primary school, at the middle school level, students learn according to a certain common program. The almost identical similarity of the lines in the 10th and 11th grades is because all students in all schools follow the same curriculum.

Fig. 4. The number of texts in each cluster for the texts of lessons on the subject of mathematics in the 3rd grade

These features discovered by our system are confirmed by teachers who work in schools, which supports the reliability of our conclusions. Since there is a significant difference between the texts of lesson planning in the primary grades, we conducted a study of the dependence of students' average grades on the clusters of lesson planning texts. For the primary preprocessing of students' marks, which consisted of grouping the data and calculating the average marks of students, we also used the capabilities of the Dask library. As a result, we obtained the marks shown in Figure 6, on the basis of which the analysis of variance was carried out. Analysis of variance (ANOVA) was used to find and determine the differences between the mean marks of students belonging to different clusters [22]. The ANOVA test was used because it allows simultaneous comparison of a larger number of samples, in contrast to Student's t-test, which allows only pairwise comparison. The results of this analysis do not show the exact differences between groups, but can only tell where they exist and where they do not. In other words, if the hypothesis of the equality of all groups is rejected, this does not mean that all groups are different, but only that some of the groups differ: maybe all, but maybe only a few. The p-value is calculated using the sampling distribution of the test statistic under the null hypothesis, the sample data, and the type of test being done (lower-tailed, upper-tailed or two-sided).
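A minimal sketch of this one-way ANOVA test with SciPy is shown below; the per-cluster mark arrays are synthetic stand-ins for the grouped averages computed with Dask:

import numpy as np
from scipy.stats import f_oneway

# Average marks of students grouped by lesson-text cluster
# (synthetic stand-ins, an assumption for illustration)
cluster_a = np.array([4.1, 3.9, 4.3, 4.0, 4.2])
cluster_b = np.array([3.6, 3.8, 3.5, 3.7, 3.9])
cluster_c = np.array([4.0, 4.1, 3.8, 4.2, 4.0])

# One-way ANOVA: H0 = all cluster means are equal
statistic, p_value = f_oneway(cluster_a, cluster_b, cluster_c)

# p < 0.05 suggests at least one cluster mean differs from the others
print(f"F = {statistic:.3f}, p = {p_value:.4f}")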
Fig. 5. WSS for different values of the cluster count k (a - mathematics 2015-2016, b - mathematics 2016-2017, c - Russian language 2015-2016, d - Russian language 2016-2017)

Fig. 6. Box-plots of students' mean marks by cluster in mathematics for the 2015-2016 academic year

After the analysis for different subjects at the primary school level (grades 2-4), we obtained the following p-values by grade and subject (Table 1):

Table 1. One-way ANOVA test p-values for some primary school subjects for the 2015-2016 academic year

Subject              2nd grade  3rd grade  4th grade
Physical education   0.100      0.045      0.041
Mathematics          0.000      0.003      0.000
Russian language     0.000      0.000      0.000
Literature           0.000      0.000      0.000
Technology           0.010      0.009      0.005
Music                0.896      0.901      0.000
Arts                 0.000      0.001      0.000
English language     0.726      0.006      0.561

Based on the results obtained, for the primary school level, subjects can be divided into two groups:
1. Subjects that do not depend on the educational-methodical complex (p-values > 0.05): music, English language, physical education.
2. Subjects that depend on the educational and methodological complex (p-values < 0.05): mathematics, Russian language, literature, technology, arts.

For the middle and senior levels of the school, the subjects fall into three groups (data not shown):
1. Subjects that do not depend on the educational and methodological complex: English language, computer science, technology.
2. Subjects that depend on the educational and methodological complex: mathematics, physical education, Russian language, literature, physics, Tatar language, Tatar literature.
3. Floating subjects: biology, geography, history, social studies, chemistry.

Compared with the results of other years, at the primary school level the situation remains the same, with the exception of music, which moved to the other group. For the middle and senior school levels, the situation remains largely unchanged except for the history subject, which moved into the first group.

4 Conclusion

By now, huge amounts of data of various types have been accumulated in the databases of governmental and non-governmental organizations. Because of the large volume of these datasets, Big Data methods are needed for their processing and analysis. In this work, we have built an information-analytical system for processing data describing the school educational process in detail. The system is built on the basis of the distributed computing library Dask. The use of this system allows efficient and high-performance analysis (clustering) of text data. We demonstrated the application of the developed system to data stored in the "Electronic Education of the Republic of Tatarstan" system, and have obtained interesting and important conclusions in the field of educational analytics. The approach developed in this work can be used not only in education systems; it can be applied to any system where it is necessary to conduct cluster analysis of large volumes of text (or non-text) data.

The academic success of schoolchildren depends on many different factors associated with the educational environment, the individual characteristics of students, and so on. Naturally, the textual data of educational and methodological complexes are not the single factor that completely determines the academic success of schoolchildren. But we believe it is one of the most important factors.
In this work, on the basis of quantitative measures, we have found that in different grades and for different disciplines, educational-methodical complexes have different levels of influence on student performance.

Apache Spark [34] based tools are also quite often used for big data processing. However, Dask is smaller, lighter and simpler than Spark, and as flexible as Pandas. Program scripts are easy to debug on personal computers before deploying them on a cluster, because Dask also works on personal computers. In our project, we deployed our system on a static computing cluster, but in the future we will use Kubernetes-like technology [24] for elastic deployment in the Amazon or Google clouds.

Acknowledgements. The reported study was funded by RFBR, project number 19-29-14082.

References

1. Adnan, K., Akbar, R.: An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data 6(1), 91 (2019). https://doi.org/10.1186/s40537-019-0254-8
2. Agnihotri, D., Verma, K., Tripathi, P.: Pattern and cluster mining on text data. In: 2014 Fourth International Conference on Communication Systems and Network Technologies. pp. 428–432 (2014). https://doi.org/10.1109/CSNT.2014.92
3. Aljawarneh, S., Yassein, M.B., Talafha, W.A.: A resource-efficient encryption algorithm for multimedia big data. Multimedia Tools and Applications 76(21), 22703–22724 (2017). https://doi.org/10.1007/s11042-016-4333-y
4. Angelova, R., Siersdorfer, S.: A neighborhood-based approach for clustering of linked document collections. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. pp. 778–779. CIKM '06, Association for Computing Machinery (2006). https://doi.org/10.1145/1183614.1183726
5. Anick, P.G., Vaithyanathan, S.: Exploiting clustering and phrases for context-based information retrieval. SIGIR Forum 31, 314–323 (1997). https://doi.org/10.1145/278459.258601
6. Bach, M., Krstic, Z., Seljan, S., Turulja, L.: Text mining for big data analysis in financial sector: A literature review. Sustainability (Switzerland) 11(5) (2019). https://doi.org/10.3390/su11051277
7. Bafna, P., Pramod, D., Vaidya, A.: Document clustering: Tf-idf approach. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). pp. 61–66 (2016). https://doi.org/10.1109/ICEEOT.2016.7754750
8. Chakrabarti, K., Mehrotra, S.: Local dimensionality reduction: A new approach to indexing high dimensional spaces. In: Proceedings of the 26th VLDB Conference. pp. 89–100 (2000)
9. Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: A cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 318–329. SIGIR '92, Association for Computing Machinery (1992). https://doi.org/10.1145/133160.133214
10. Dumais, S.: Latent semantic analysis. Annual Review of Information Science and Technology 38, 188–230 (2004). https://doi.org/10.1002/aris.1440380105
11. Elgendy, N., Elragal, A.: Big data analytics: A literature review paper. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. pp. 214–227. Springer International Publishing, Cham (2014)
12. Elia, G., Solazzo, G., Lorenzo, G., Passiante, G.: Assessing learners' satisfaction in collaborative online courses through a big data approach. Computers in Human Behavior 92, 589–599 (2019). https://doi.org/10.1016/j.chb.2018.04.033
13. Fisher, D.: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2, 139–172 (1987). https://doi.org/10.1023/A:1022852608280
14. Ford, A.S., Weitzner, B.D., Bahl, C.D.: Integration of the rosetta suite with the python software stack via reproducible packaging and core programming interfaces for distributed simulation. Protein Science 29(1), 43–51 (2020). https://doi.org/10.1002/pro.3721
15. Grishin, V.: Method of analysis and search for borrowings in the text. Problems of science 7, 31 (2018)
16. Hassani, H., Beneki, C., Unger, S., Mazinani, M.T., Yeganegi, M.R.: Text mining in big data analytics. Big Data and Cognitive Computing 4(1) (2020). https://doi.org/10.3390/bdcc4010001
17. Henriques, J., Caldeira, F., Cruz, T., Simões, P.: Combining k-means and xgboost models for anomaly detection using log datasets. Electronics 9(7) (2020). https://doi.org/10.3390/electronics9071164
18. Jalali, S.M.J., Park, H.W., Vanani, I.R., Pho, K.H.: Research trends on big data domain using text mining algorithms. Digital Scholarship in the Humanities (04 2020). https://doi.org/10.1093/llc/fqaa012
19. Javidi, G., Rajabion, L., Sheybani, E.: Educational data mining and learning analytics: Overview of benefits and challenges. In: 2017 International Conference on Computational Science and Computational Intelligence (CSCI). pp. 1102–1107 (2017). https://doi.org/10.1109/CSCI.2017.360
20. Jung, Y., Park, H., Du, D.Z., Drake, B.L.: A decision criterion for the optimal number of clusters in hierarchical clustering. Journal of Global Optimization 25(1), 91–111 (2003). https://doi.org/10.1023/A:1021394316112
21. Kim, J.C., Chung, K.: Associative feature information extraction using text mining from health big data. Wireless Personal Communications 105(2), 691–707 (Mar 2019). https://doi.org/10.1007/s11277-018-5722-5
22. Kim, T.: Understanding one-way anova using conceptual figures. Korean Journal of Anesthesiology 70, 22 (02 2017). https://doi.org/10.4097/kjae.2017.70.1.22
23. Kothari, R., Pitts, D.: On finding the number of clusters. Pattern Recognition Letters 20(4), 405–416 (1999). https://doi.org/10.1016/S0167-8655(99)00008-2
24. Kristiani, E., Yang, C.T., Wang, Y.T., Huang, C.Y.: Implementation of an edge computing architecture using openstack and kubernetes. In: Kim, K.J., Baek, N. (eds.) Information Science and Applications 2018. pp. 675–685. Springer Singapore, Singapore (2019)
25. Logica, B., Magdalena, R.: Using big data in the academic environment. Procedia Economics and Finance 33, 277–286 (2015). https://doi.org/10.1016/S2212-5671(15)01712-8
26. Lu, Y.: Industry 4.0: A survey on technologies, applications and open research issues. Journal of Industrial Information Integration 6, 1–10 (2017). https://doi.org/10.1016/j.jii.2017.04.005
27. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (eds.) 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings (2013), http://arxiv.org/abs/1301.3781
28. Mogotsi, I.: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to information retrieval. Inf. Retr. 13, 192–195 (2010). https://doi.org/10.1007/s10791-009-9115-y
29. Murtagh, F., Legendre, P.: Ward's hierarchical agglomerative clustering method: Which algorithms implement Ward's criterion? Journal of Classification 31(3), 274–295 (2014). https://doi.org/10.1007/s00357-014-9161-z
30. Poleto, T., de Carvalho, V.D.H., Costa, A.P.C.S.: The roles of big data in the decision-support process: An empirical investigation. In: Delibašić, B., Hernández, J.E., Papathanasiou, J., Dargam, F., Zaraté, P., Ribeiro, R., Liu, S., Linden, I. (eds.) Decision Support Systems V – Big Data Analytics for Decision Making. pp. 10–21. Springer International Publishing, Cham (2015)
31. Qaiser, S., Ali, R.: Text mining: Use of tf-idf to examine the relevance of words to documents. International Journal of Computer Applications 181(1), 25–29 (Jul 2018). https://doi.org/10.5120/ijca2018917395
32. Rocklin, M.: Dask: Parallel computation with blocked algorithms and task scheduling. In: Python in Science Conference. pp. 126–132 (2015). https://doi.org/10.25080/Majora-7b98e3ed-013
33. Roux, M.: A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification 35(2), 345–366 (2018). https://doi.org/10.1007/s00357-018-9259-9
34. Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on apache spark. International Journal of Data Science and Analytics 1(3), 145–164 (Nov 2016). https://doi.org/10.1007/s41060-016-0027-9
35. Salton, G., Yang, C.: On the specification of term values in automatic indexing. Journal of Documentation 29, 351–372 (1973)
36. Salvador, S., Chan, P.: Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In: 16th IEEE International Conference on Tools with Artificial Intelligence. pp. 576–584 (2004). https://doi.org/10.1109/ICTAI.2004.50
37. Scott, S.L., Blocker, A.W., Bonassi, F.V., Chipman, H.A., George, E.I., McCulloch, R.E.: Bayes and big data: the consensus monte carlo algorithm. International Journal of Management Science and Engineering Management 11(2), 78–88 (2016). https://doi.org/10.1080/17509653.2016.1142191
38. Seifzadeh, S., Farahat, A.K., Kamel, M.S., Karray, F.: Short-text clustering using statistical semantics. In: Proceedings of the 24th International Conference on World Wide Web. pp. 805–810. WWW '15 Companion, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2740908.2742474
39. Signell, R.P., Pothina, D.: Analysis and visualization of coastal ocean model data in the cloud. Journal of Marine Science and Engineering 7(4) (2019). https://doi.org/10.3390/jmse7040110
40. Singhal, A.: Modern information retrieval: a brief overview. IEEE Data Engineering Bulletin 24 (2001)
41. Trstenjak, B., Mikac, S., Donko, D.: Knn with tf-idf based framework for text categorization. Procedia Engineering 69, 1356–1364 (2014). https://doi.org/10.1016/j.proeng.2014.03.129, 24th DAAAM International Symposium on Intelligent Manufacturing and Automation, 2013
42. Vohra, D.: Apache Parquet, pp. 325–335. Apress, Berkeley, CA (2016)
43. Zambelli, A.: A data-driven approach to estimating the number of clusters in hierarchical clustering. F1000Research 5 (2016). https://doi.org/10.12688/f1000research.10103.1
44. Zamir, O., Etzioni, O.: Web document clustering: A feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 46–54. SIGIR '98, Association for Computing Machinery, New York, NY, USA (1998). https://doi.org/10.1145/290941.290956
45. Zhou, S., Xu, Z., Liu, F.: Method for determining the optimal number of clusters based on agglomerative hierarchical clustering. IEEE Transactions on Neural Networks and Learning Systems 28(12), 3007–3017 (2017). https://doi.org/10.1109/TNNLS.2016.2608001