Discussion Paper

Prompting the data transformation activities for cluster analysis on collections of documents

Tania Cerquitelli, Evelina Di Corso, Francesco Ventura and Silvia Chiusano
Politecnico di Torino - Dipartimento di Automatica e Informatica
Corso Duca Degli Abruzzi, 24 - 10129 Torino - ITALY
{name.surname}@polito.it

Abstract. In this work we argue for a new self-learning engine able to suggest to the analyst good transformation methods and weighting schemas for a given data collection. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on an engine capable of exploring different data weighting schemas (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), evaluating and comparing solutions through different quality indices (e.g., weighted Silhouette), and presenting the top-3 solutions to the analyst. SELF-DATA will also include a knowledge database storing the results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. SELF-DATA's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. The preliminary validation performed on 4 collections of documents highlights that the TF-IDF and logarithmic entropy weighting methods are effective in measuring item relevance in sparse datasets, and that the LSI method outperforms PCA in the presence of a larger feature domain.

Keywords: Text mining; parameter-free technique; Big data framework.

1 Introduction

Data-driven analysis is a multi-step process in which data scientists tackle the complex task of configuring the analytics system to transform data into actionable knowledge. Since collections of textual data are usually characterized by high variability, the knowledge extraction process is challenging and requires considerable expertise. Text mining studies algorithms to find implicit, previously unknown, and potentially high-quality information in large collections of documents. Text mining activities include: (i) grouping data/documents with similar properties or similar content [3], (ii) topic detection [1], and (iii) document summarization [5].

Effectively performing the text mining process on textual data requires a multi-step pipeline involving different algorithms, each with specific parameters that must be manually set by the end-user. Furthermore, different methods exist for each step, and the selection of the optimal ones is guided by the expertise of the analyst. Innovative and scalable solutions need to be devised to suggest to the analyst how to configure the mining process, relieving the end-user of the burden of selecting proper methods for the overall cluster analysis process [2].

This paper proposes SELF-DATA (SELF-learning DAta TrAnsformation), a distributed engine that suggests to the analyst a good configuration of the mining process to cluster a collection of documents into cohesive and well-separated groups. SELF-DATA relies on different building components that characterize the data distribution of collections of documents and learn how to configure the analytics process through self-tuning and self-learning strategies.
Based on previously processed datasets, SELF-DATA will also be able, in the future, to suggest to the analyst the complete configuration of the mining process after characterizing the data distribution of a new, unexplored dataset. SELF-DATA's current implementation runs on the Apache Spark [8] framework, supporting parallel and scalable processing for analytics activities. A preliminary set of experiments has been performed on four textual data collections to show the potential of some components of SELF-DATA. The experiments highlighted the ability of these components to automatically identify a good weighting schema, a good transformation method, and good values for algorithm-specific parameters.

This paper is organized as follows. Section 2 presents the envisioned SELF-DATA architecture and briefly describes its main building components, while Section 3 discusses the preliminary experiments. Finally, Section 4 draws conclusions and presents future developments of this work.

2 The SELF-DATA engine

SELF-DATA (SELF-learning DAta TrAnsformation) is a distributed engine whose aim is to suggest to the analyst a good configuration of the whole mining process to cluster a textual data collection into correlated groups of documents with a similar topic. The components of the envisioned SELF-DATA architecture, as well as the interactions between them, are shown in Figure 1. The engine relies on 5 main components exploited in 2 phases, i.e., the learning phase and the prediction phase.

Fig. 1: The SELF-DATA architecture

Learning phase. During the learning phase, SELF-DATA learns good configurations of the whole mining process for specific data distributions of textual data collections, based on previously processed datasets. To this aim, SELF-DATA relies on 4 main components able to (i) automatically configure the complete clustering activity (e.g., selecting a good weighting schema and transformation method, automatically setting the algorithm-specific input parameters) by exploiting self-tuning strategies; (ii) characterize different data distributions through the computation of various indices; (iii) store the results of steps (i) and (ii) into a knowledge base; and (iv) exploit the knowledge base to learn which subsets of data distribution indices, with their corresponding values, best forecast the configuration of the cluster analysis process for textual data collections. Specifically, the following components characterize the SELF-DATA learning phase.

– PASTA (PArameter-free Solutions for Textual Analysis) [4] is a distributed self-tuning engine including parameter-setting procedures to relieve the end-user of the burden of selecting proper values for the overall cluster analysis process on a collection of documents. The PASTA engine, reported in Figure 1, includes all the analytics blocks needed to make the overall analysis problem more tractable and to minimize user intervention. Specifically, PASTA includes three main building blocks: (i) textual data processing, (ii) document modeling and transformation, including a self-tuning data reduction algorithm, and (iii) self-tuning textual data clustering. PASTA explores different suitable data weighting strategies, based on local (Term Frequency (TF) and logarithmic term frequency (Log)) and global (Inverse Document Frequency (IDF)) weights, and data transformation methods (e.g., PCA, LSI) before applying a clustering algorithm, to gain better insights from the analysis; a minimal sketch of these weighting schemas is given below.
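To make the explored weighting schemas concrete, the following sketch computes the four local/global combinations from their standard definitions [6]. It is an illustrative Python reconstruction written for this description, not the PASTA/Spark code; the function name weight_matrix and the tokenized input format are our assumptions.

```python
import math
from collections import Counter

def weight_matrix(docs, local="tf", glob="idf"):
    """Illustrative document-term weighting (not the PASTA implementation).

    docs:  list of tokenized documents (each a list of terms); at least two.
    local: "tf"  -> raw term frequency tf_ij
           "log" -> log(1 + tf_ij)
    glob:  "idf"     -> log(N / df_i), with df_i the document frequency
           "entropy" -> log-entropy 1 + sum_j p_ij * log(p_ij) / log(N),
                        with p_ij = tf_ij / gf_i (gf_i: global frequency)
    Returns one {term: weight} dictionary per document.
    """
    n = len(docs)
    tf = [Counter(d) for d in docs]            # local frequencies tf_ij
    df = Counter(t for c in tf for t in c)     # document frequencies df_i
    gf = Counter()                             # global frequencies gf_i
    for c in tf:
        gf.update(c)

    if glob == "idf":
        g = {t: math.log(n / df[t]) for t in df}
    else:
        g = {t: 1.0 + sum((c[t] / gf[t]) * math.log(c[t] / gf[t])
                          for c in tf if t in c) / math.log(n)
             for t in gf}

    f = (lambda x: x) if local == "tf" else (lambda x: math.log(1 + x))
    return [{t: f(c[t]) * g[t] for t in c} for c in tf]
```

For instance, weight_matrix(corpus, local="log", glob="entropy") yields the Log-Entropy weighting evaluated in Section 3.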
To streamline the analytics process, PASTA includes two procedures to relieve the end-user of the burden of selecting proper values for algorithm-specific parameters. These procedures evaluate and compare solutions through different quality indices (e.g., Silhouette-based indices, rand index, f-measure) to identify a few good configurations (top-3) of the mining process that yield a good partition of the original dataset. PASTA's current implementation runs on the Apache Spark framework, supporting parallel and scalable processing for analytics activities.

– The Feature computation component is an engine capable of characterizing the data distribution of a collection of documents (corpus) through various indices (a minimal sketch of their computation is given at the beginning of Section 3), including:
• # categories: the original number of topics in the dataset under analysis (if known);
• Avg frequency terms: the average frequency of a term's occurrence in the corpus;
• Max frequency: the maximum frequency of a term's occurrence in the corpus;
• Min frequency: the minimum frequency of a term's occurrence in the corpus;
• # documents: the number of documents in the corpus;
• # terms: the number of terms in the corpus, with repetitions;
• Dictionary: the number of distinct terms in the corpus (without repetitions);
• TTR: the relation between the dictionary variety and the total number of terms in the corpus, calculated as the ratio between Dictionary and # terms;
• Hapax %: the ratio between the number of hapaxes (terms with exactly one occurrence) and the cardinality of the dictionary;
• Guiraud Index: the ratio between the cardinality of the dictionary and the square root of # terms; it measures lexical richness.

– K-DB is a knowledge base storing the top-3 solutions identified by PASTA on previously processed datasets, together with the corresponding data distribution characterization. Specifically, for each processed dataset, K-DB stores one record for each solution selected by PASTA. Each record includes the values of the various indices characterizing the data distribution, the selected weighting schema and transformation method, the suggested input parameter value for either the LSI or PCA transformation method, and the suggested value for the input parameter of the clustering algorithm (i.e., the desired number of clusters).

– Building a prediction model addresses a classification problem, trained on the knowledge base content, to generate a prediction model that suggests to the analyst a good configuration of the cluster analysis process on a given unexplored dataset. Many different approaches (e.g., data mining approaches, logic programming, statistical models) can be exploited for this complex and challenging task.

Prediction phase. Given a new document collection, SELF-DATA first applies the document processing step and then computes the set of features characterizing the data distribution. Given the values of the data distribution indices, the prediction model is exploited to forecast a good configuration of the cluster analysis process. This step suggests to the analyst a good weighting schema (both local and global), a transformation method, and good values for algorithm-specific input parameters to effectively cluster the data collection into a few well-separated and cohesive groups of documents addressing a similar topic.

3 Preliminary development and results

This section presents a preliminary implementation of the SELF-DATA system together with its preliminary results.
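As a concrete illustration of the Feature computation component of Section 2, the following minimal Python sketch derives the listed indices from a tokenized corpus. The function name corpus_features and the tokenized-input assumption are ours; # categories is omitted since it is dataset metadata rather than a computed statistic.

```python
import math
from collections import Counter

def corpus_features(docs):
    """Data-distribution indices of Section 2 for a tokenized corpus.

    docs: list of documents, each a list of terms.
    """
    freq = Counter(t for d in docs for t in d)   # term occurrence counts
    n_terms = sum(freq.values())                 # # terms, with repetitions
    dictionary = len(freq)                       # distinct terms
    hapax = sum(1 for f in freq.values() if f == 1)
    return {
        "# documents": len(docs),
        "# terms": n_terms,
        "Dictionary": dictionary,
        "Avg frequency terms": n_terms / dictionary,
        "Max frequency": max(freq.values()),
        "Min frequency": min(freq.values()),
        "TTR": dictionary / n_terms,
        "Hapax %": hapax / dictionary,
        "Guiraud Index": dictionary / math.sqrt(n_terms),
    }
```

On dataset D1 of Table 1, for example, these definitions reproduce TTR = 67,539 / 1,138,570 ≈ 0.059 and Guiraud Index = 67,539 / √1,138,570 ≈ 63.3.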
Preliminary development. A preliminary implementation of some components of SELF-DATA has been developed. The current implementation includes: (1) the feature computation building block, able to characterize the data distribution through some basic indices (see Section 2 for further details), and (2) an extended version of the PASTA component presented in [4]. PASTA includes:

– two local weights (TF and Log) and two global weights (IDF and Entropy) [6] to measure term relevance;
– two data transformation methods, LSI and PCA;
– two self-tuning algorithms to automatically select the algorithm-specific parameters of the data transformation method (input parameter K, the number of relevant dimensions to consider) and of the K-means algorithm (input parameter kcls, the desired number of clusters), respectively [4].

PASTA exploits various quality metrics to evaluate different configurations of the mining process.

Silhouette-based indices: purified and weighted Silhouette. The Silhouette index [7] measures both intra-cluster cohesion and inter-cluster separation. The weighted Silhouette index WS [4], computed on a given collection of documents, is the ratio between (i) the sum of the percentages of documents falling in each positive Silhouette bin, each weighted with an integer value ranging from wmax = 10, assigned to the first bin (0.9, 1], down to wmin = 1, and (ii) the overall sum of the weights; a minimal sketch of this computation is reported later in this section. The higher the weighted Silhouette index, the better the identified partition. The purified Silhouette index PS [4] disregards documents appearing in singleton clusters to reduce the impact of these documents on the overall Silhouette index.

Classification technique to assess the robustness of a given document partition. PASTA builds a classifier using the same input features as the clustering algorithm, with the class labels assigned by the clustering algorithm as the target. PASTA integrates the random forest method [6] to create the classification model. Quality metrics such as accuracy and f-measure (the weighted harmonic mean of precision and recall) are then computed. The higher the classification metrics, the better the overall quality of the clusters.

Preliminary results. We performed preliminary experiments on four real collections of documents (see Table 1) to evaluate (i) the ability of the proposed indices to characterize and differentiate data distributions and (ii) the ability of the PASTA component to correctly identify good configurations for the cluster analysis.

Table 1: Datasets

                       Wikipedia              Twitter   Reuters
ID                     D1         D2          D3        D4
# categories           5          10          -         2
Avg frequency terms    17         23          2         12
Max frequency          8,642      19,437      86        2,605
Min frequency          1          1           1         1
# documents            989        2,468       978       1,000
# terms                1,138,570  3,118,499   3,997     80,060
Dictionary             67,539     13,649      1,727     6,453
TTR                    0.059      0.044       0.432     0.081
Hapax %                0.505      0.519       0.713     0.371
Guiraud Index          63.3       77.3        27.3      22.8

Characterization of the data distribution. Table 1 reports the proposed data distribution indices for the considered datasets. The Guiraud index is a good parameter to distinguish datasets with a very sparse data distribution (D1 and D2 in Table 1) from less sparse ones (D3 and D4). Datasets with high values of lexical richness are usually characterized by high variability in the data dictionary, and thus by a more sparse data distribution. The TTR index also allows sparse data distributions to be distinguished from denser ones; in fact, the TTR value for D3 is one order of magnitude higher than the TTR values of the other datasets.
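The PS and WS columns in the result tables below report the purified and weighted Silhouette indices defined earlier. As an illustration, here is a minimal sketch of the WS computation, assuming the per-document Silhouette values are already available (e.g., from sklearn.metrics.silhouette_samples); the binning follows the textual description in [4], and the code is our reconstruction, not PASTA's.

```python
def weighted_silhouette(sil):
    """Weighted Silhouette WS (our reconstruction of the index in [4]).

    sil: per-document Silhouette values in [-1, 1], e.g. obtained with
         sklearn.metrics.silhouette_samples(X, labels).
    The positive range (0, 1] is split into ten bins of width 0.1; the
    fraction of documents in each bin is weighted from w_max = 10 for the
    first bin (0.9, 1] down to w_min = 1 for (0, 0.1], and WS is the
    weighted sum divided by the total weight.
    """
    n = len(sil)
    total = 0.0
    for i, w in enumerate(range(10, 0, -1)):     # weights 10, 9, ..., 1
        lo, hi = (9 - i) / 10, (10 - i) / 10     # bin (lo, hi]
        total += w * sum(1 for s in sil if lo < s <= hi) / n
    return total / 55                            # 55 = 10 + 9 + ... + 1
```

Documents with non-positive Silhouette values fall in no bin, so partitions with many poorly assigned documents are penalized; this is consistent with the small WS values (roughly 0.04 to 0.08) reported in Tables 2 and 3.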
Selecting good configurations for the whole cluster analysis process. For each dataset, PASTA has been run eight times, once for each combination of a weighting function and a transformation method. Table 2 shows the top-3 solutions identified by PASTA for dataset D1 with the LSI method and all weighting functions; the optimal solution is marked with an asterisk. For D1, PASTA identifies K = 34, 37 and 112 (i.e., the number of relevant dimensions for LSI) as candidate dimensionalities for the cluster analysis. Given these numbers of dimensions, PASTA selects kcls = 10 (the input parameter of the K-means algorithm) as the optimal partition. We observed that PASTA usually selects as optimal partition the experiment exploiting a low-to-medium number of dimensions (terms). The higher K is, the more variable the data distribution and the more complex the clustering activity become. Thus, the Silhouette-based indices (purified PS and weighted WS) tend to decrease slightly when a large number of terms featuring in each document is analyzed. Furthermore, the results reported in Table 2 highlight that the TF-IDF, Log-IDF and TF-Entropy weighting functions better differentiate the term weights, thus identifying a larger number of clusters (associated with different topics in the same category) than the one discovered by Log-Entropy.

Table 2: Results for dataset D1 (the optimal solution is marked with *)

DId  Trans  Weight       K     kcls  Accuracy  WA-F-Measure  PS   WS
D1   LSI    TF-IDF       34    6     0.99      0.99          0.3  0.058
D1   LSI    TF-IDF       37*   10    0.96      0.96          0.3  0.062
D1   LSI    TF-IDF       112   12    0.91      0.91          0.2  0.039
D1   LSI    Log-IDF      19    7     0.95      0.95          0.3  0.068
D1   LSI    Log-IDF      27    5     0.98      0.98          0.3  0.049
D1   LSI    Log-IDF      76    6     0.97      0.97          0.2  0.043
D1   LSI    TF-Entropy   38    10    0.91      0.91          0.3  0.056
D1   LSI    TF-Entropy   44    16    0.89      0.89          0.3  0.052
D1   LSI    TF-Entropy   90    6     0.97      0.96          0.2  0.037
D1   LSI    Log-Entropy  21    5     0.99      0.99          0.4  0.076
D1   LSI    Log-Entropy  27    5     0.98      0.98          0.4  0.066
D1   LSI    Log-Entropy  84    6     0.97      0.97          0.2  0.048

We also compared the optimal configurations selected by PASTA when exploring LSI with those obtained with PCA. Table 3 shows the best configuration selected by PASTA for both LSI and PCA and for all weighting functions, on the D3 and D4 datasets. For datasets D1 and D2, the LSI method outperforms PCA (detailed results are omitted due to lack of space), whereas for datasets D3 and D4, PCA yields a better final partition than LSI. Thus, we can conclude that the LSI method outperforms PCA in the presence of a larger feature domain and, usually, a more sparse data distribution.

Table 3: Results for the D3 and D4 datasets

DId  Trans  Weight       K   kcls  Accuracy  WA-F-Measure  PS   WS
D3   LSI    TF-IDF       16  20    0.89      0.89          0.4  0.063
D3   LSI    Log-IDF      16  18    0.86      0.86          0.4  0.062
D3   LSI    TF-Entropy   13  20    0.90      0.90          0.4  0.061
D3   LSI    Log-Entropy  12  15    0.86      0.86          0.4  0.059
D3   PCA    TF-IDF       16  17    0.87      0.87          0.3  0.061
D3   PCA    Log-IDF      16  17    0.92      0.92          0.4  0.064
D3   PCA    TF-Entropy   13  20    0.88      0.88          0.3  0.064
D3   PCA    Log-Entropy  12  18    0.90      0.88          0.4  0.062
D4   LSI    TF-IDF       30  3     0.96      0.96          0.3  0.045
D4   LSI    Log-IDF      18  8     0.94      0.94          0.3  0.064
D4   LSI    TF-Entropy   29  7     0.92      0.92          0.3  0.059
D4   LSI    Log-Entropy  18  8     0.94      0.94          0.3  0.063
D4   PCA    TF-IDF       30  3     0.99      0.99          0.3  0.047
D4   PCA    Log-IDF      18  13    0.90      0.90          0.3  0.069
D4   PCA    TF-Entropy   29  3     0.98      0.98          0.3  0.052
D4   PCA    Log-Entropy  18  5     0.95      0.95          0.3  0.058

4 Conclusions and future work

This paper presents a challenging vision of a self-learning engine capable of predicting how to configure the cluster analysis process for a collection of textual data. A preliminary implementation of some of its components has been presented and evaluated on four real collections of documents. We plan to develop the complete SELF-DATA system in future work.
References

1. Luca Cagliero, Tania Cerquitelli, Paolo Garza, and Luigi Grimaudo. Twitter data analysis by means of strong flipping generalized itemsets. Journal of Systems and Software, 94:16-29, 2014.
2. Tania Cerquitelli, Elena Baralis, Lia Morra, and Silvia Chiusano. Data mining for better healthcare: A path towards automated data analysis? In 32nd IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2016, Helsinki, Finland, May 16-20, 2016, pages 60-63, 2016.
3. Tania Cerquitelli, Silvia Chiusano, and Xin Xiao. Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario. Expert Systems with Applications, 55:297-312, 2016.
4. Evelina Di Corso, Tania Cerquitelli, and Francesco Ventura. Self-tuning techniques for large scale cluster analysis on textual data collections. In Proceedings of the 32nd Annual ACM Symposium on Applied Computing, Marrakesh, Morocco, April 3-7, 2017, pages 771-776, 2017.
5. Oskar Gross, Antoine Doucet, and Hannu Toivonen. Language-independent multi-document text summarization with document-specific word associations. In 31st Annual ACM Symposium on Applied Computing, Pisa, Italy, 2016. ACM.
6. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
7. Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.
8. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI'12, 2012.