AnalyzeLab: a Tool to Help Machine Learning Developers Evaluate their Models

Wenhan Yang, Université de Toulouse, Toulouse, France, wenhan.yang53@gmail.com
Jing Zhai, Université de Toulouse, Toulouse, France, jing.zhai@ut-capitole.fr

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Supported by Institut de Recherche en Informatique de Toulouse, CNRS UMR5505, France.

ABSTRACT
We have developed a tool that aims at helping machine learning developers when defining the parameters of their models. This tool allows users to provide (pre-processed) training data in the form of sets of extracted features and labels, as well as test data sets on which the model should be trained/tested. The users get a visual view of the accuracy of the obtained model. The users can select the features they want to include in the model, as well as the examples to consider during training and the examples to use in the evaluation. Several machine learning algorithms are available and can be chosen from (K-Nearest Neighbors, Naive Bayes, SVM, and Decision Tree). This makes it a very useful, interactive and visual tool for developers. This paper presents this tool.

CCS CONCEPTS
• Information systems → Information retrieval; Content analysis and feature selection; Clustering and classification.

KEYWORDS
Machine learning evaluation; information check-worthiness; interactive evaluation; development

1 INTRODUCTION
Machine learning is widely used for many tasks, including information retrieval (IR) tasks. When developing such models, an important part is the features used to train the model. Indeed, training is based on labelled examples that are represented by features or characteristics. Those examples are automatically analysed by the machine to elaborate a model that is trained to predict the appropriate decision (the label), both for training examples and for unseen data. Another important part of machine learning is the algorithm used; many algorithms have been developed in the literature, either to predict a value (e.g. regression) or to predict a class (e.g. Random Forest, Support Vector Machine) for any input.
In many cases, developing and selecting features as well as choosing an ML algorithm also implies an evaluation process in which the designers experiment with features and algorithms. Evaluation measures how accurate the model is, either on the training data set or on a test data set.
In this paper, we present a tool we developed in order to help designers and researchers when elaborating features. This tool allows a researcher to select the features, algorithms, and data sets (from files the user provides in a directory) s/he wants to evaluate. The application then trains the ML model and evaluates it. It shows the results to the user in a visual way that helps her/him to quickly analyze the impact of the model parameters.
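To make this workflow concrete, the following minimal sketch illustrates the kind of train-and-evaluate loop the tool automates. It assumes a scikit-learn back end and pre-extracted feature/label arrays; the file names and column indices are hypothetical placeholders, not the actual files used by AnalyzeLab.

```python
# Minimal sketch of the train/evaluate loop AnalyzeLab automates.
# Assumptions: scikit-learn back end; features and labels are pre-processed
# and stored as NumPy arrays (the file names below are hypothetical).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# User-selected training and test sets.
X_train, y_train = np.load("train_features.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_features.npy"), np.load("test_labels.npy")

# User-selected feature columns (e.g. only the linguistic features).
selected = [0, 1, 2, 3]
X_train, X_test = X_train[:, selected], X_test[:, selected]

# User-selected algorithms among those the tool offers.
algorithms = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
}

for name, model in algorithms.items():
    model.fit(X_train, y_train)            # train on the selected examples
    y_pred = model.predict(X_test)         # predict labels of the test set
    print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.3f}")
```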
The rest of the paper is organized as follows: Section 2 presents the use case we use to illustrate the system functionalities. Section 3 presents the different system functionalities through commented screen shots. Section 4 concludes this paper and mentions some possible future developments.

2 INFORMATION CHECK-WORTHINESS AS A USE CASE
Information check-worthiness task. As an illustrative use case, we consider in this paper the information check-worthiness task introduced in the CLEF 2018 evaluation forum [2]. The systems that answer this shared task predict whether a piece of information (a sentence from a political discourse) should be prioritized for truthfulness checking [6].
Information check-worthiness data sets. The data sets we use in this illustrative use case were provided by (1) the task organizers, with regard to the sentences to check and the ground truth [6], and (2) Lespagnol et al., on demand, regarding the features extracted from the above-mentioned collection. Each short sentence is represented by heterogeneous types of features: an information nutritional label based on [3], linguistics, category hierarchy, and word embeddings based on the Word2Vec model [5]. These features have been used in the CheckThat! shared task [1] and are described in [4]; they cannot be described here because of the page limit.
The Information Nutrition Label [3] aims at helping the online information consumer by proposing a label resembling the nutrition fact labels on food packages. Such a label describes, along a range of agreed-upon dimensions, the contents of the product (an information object) in order to help the consumer (reader) in deciding about the consumption of the object.
Both AnalyzeLab and the Information Nutrition Label were created to make it less difficult to judge the trustworthiness of news found on the Web, given the proliferation of online information sources. The difference between these tools is that the Information Nutrition Label relies more specifically on natural language processing, whereas AnalyzeLab mostly concentrates on several machine learning algorithms. In the future, we intend to extend AnalyzeLab with NLP to support further functions.

3 SYSTEM FUNCTIONALITIES AND ILLUSTRATIVE EXAMPLES
The tool allows users to choose (1) the features, (2) the training datasets, (3) the test datasets and (4) the algorithms to be used for training/testing. At each run, the user can select one or several items of each category. Figure 1 presents the user interface.

Figure 1: User interface.

This interface helps developers figure out which features and algorithm suit the chosen dataset(s) better. Indeed, as an output, the interface provides the user with two types of results:
• a colored confusion matrix, which helps her/him to quickly get an overview of the accuracy of the run (see Figure 2);
• numerical results in terms of precision, recall, f1-score, support (the number of examples per class), accuracy, macro avg and weighted avg for each class (see Figure 3), as sketched below.

Figure 2: Example of a heat map obtained: (a) using the SVC algorithm; (b) using the DT algorithm.

Figure 3: Detailed measures of the results corresponding to the heat map from Figure 2 (a).
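Both result views correspond to standard outputs of common libraries. The following is a minimal, self-contained sketch assuming scikit-learn and matplotlib, with hypothetical gold labels and predictions in place of those produced by an actual run.

```python
# Sketch of the two result views: a colored confusion matrix (as in Figure 2)
# and the detailed per-class measures (as in Figure 3).
# The labels below are hypothetical; in the tool they come from a run.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

y_true = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]   # "1" = check-worthy, "0" = not
y_pred = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]   # predictions of a classifier

# Colored confusion matrix (heat map).
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, cmap="Blues")
plt.show()

# Precision, recall, f1-score and support per class, plus accuracy,
# macro avg and weighted avg.
print(classification_report(y_true, y_pred, digits=3))
```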
We considered the case where the user chose the linguistic features among the Linguistic, Entity, Category and Word-embedding based features that are available. The user also chose the 'Vice presidential' dataset as the training dataset among the three that are available, and 'Donald Trump's address to congress' as the test dataset among the seven available for testing. Finally, the user selected Linear SVC and Decision Tree as the algorithms to use. The confusion matrix and the detailed measures of the results of the applied machine learning methods are then displayed. In the confusion matrix, "1" (resp. "0") for the predicted values corresponds to predicted check-worthy (resp. predicted not check-worthy). Similarly, "1" (resp. "0") for the true values corresponds to labeled as check-worthy (resp. not labeled as check-worthy). In the interface, the user can of course choose different features, training data sets, test data sets and algorithms in order to compare the results.
We have also implemented an option for custom datasets (see Figure 4). In order to use her/his own datasets to train and test the model, the user needs to provide data that have the same structure as ours: the first line is the full name of the user's training file and the second line is the full name of the user's labeled data set. Depending on the ML problem, developers will use different features and different data sets. In the current interface the boxes are not yet dynamically created, but this is an extension that can be implemented in the future.

Figure 4: Custom dataset input.

4 CONCLUSION AND FUTURE WORK
In this paper we have presented a tool that we believe can be very useful to developers when finalizing their ML models and the features to include in a model. By interactively selecting the model parameters and visualizing the results, we make development simpler.
This tool could be expanded in various ways. First, we could add other visual evaluation measures than the confusion matrix, among which the user could choose. Second, we could build a dynamic list of features automatically detected from custom data sets and allow users to choose which features they want to use to train the model. Third, we could add indicators that developers want for their model. For example, they could set an acceptable goal score for one or several indicators (e.g. Recall > 0.75) and the application could then go through all the possible combinations of features, algorithms and data sets in order to find the best model that matches the requirements, if any; a possible sketch of such a search is given below.
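As an illustration of this last point, the following sketch, based on the same assumed scikit-learn setting as above and on hypothetical feature groups and data set names, exhaustively tries every combination and keeps those that reach the required recall.

```python
# Hypothetical sketch of the proposed exhaustive search: try every combination
# of feature group, training set and algorithm, and keep those whose recall on
# the test set meets the developer's requirement (e.g. Recall > 0.75).
from itertools import product
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

def search(feature_groups, train_sets, algorithms, X_test, y_test, min_recall=0.75):
    """feature_groups: {name: column indices}; train_sets: {name: (X, y)};
    algorithms: {name: classifier factory}. Returns matching runs, best first."""
    matches = []
    for (f, cols), (t, (X_tr, y_tr)), (a, make) in product(
            feature_groups.items(), train_sets.items(), algorithms.items()):
        model = make()
        model.fit(X_tr[:, cols], y_tr)                       # train this combination
        recall = recall_score(y_test, model.predict(X_test[:, cols]))
        if recall > min_recall:                              # user-defined requirement
            matches.append((recall, f, t, a))
    return sorted(matches, reverse=True)

# Example call (feature_groups, train_sets, X_test, y_test are user-provided):
# best = search(feature_groups, train_sets,
#               {"Linear SVC": LinearSVC, "Decision Tree": DecisionTreeClassifier},
#               X_test, y_test, min_recall=0.75)
```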
ACKNOWLEDGMENTS
We would like to express our very great appreciation to Josiane Mothe for her valuable and constructive suggestions and supervision during the planning and development of this research work. We would also like to thank Mickey Fraanje, Reynaldo Quintero, Manish Adhikari, Elijah Adeogun, Patrick Siekmeier, and Amrutha Thalappan for their initial contribution to this tool.

REFERENCES
[1] Romain Agez, Clément Bosc, Cédric Lespagnol, Noémie Petitcol, and Josiane Mothe. 2018. IRIT at CheckThat! 2018. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France.
[2] Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric SanJuan, Linda Cappellato, and Nicola Ferro. 2018. Experimental IR Meets Multilinguality, Multimodality, and Interaction. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). Lecture Notes in Computer Science, Vol. 11018. Springer.
[3] Norbert Fuhr, Anastasia Giachanou, Gregory Grefenstette, Iryna Gurevych, Andreas Hanselowski, Kalervo Järvelin, Rosie Jones, Yiqun Liu, Josiane Mothe, Wolfgang Nejdl, et al. 2018. An Information Nutritional Label for Online Documents. In ACM SIGIR Forum, Vol. 51. ACM, 46–66.
[4] Cédric Lespagnol, Josiane Mothe, and Md Zia Ullah. 2019. Information Nutritional Label and Word Embedding to Estimate Information Check-Worthiness. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). ACM, New York, NY, USA, 941–944.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 3111–3119.
[6] Preslav Nakov et al. 2018. Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. In Proceedings of the Ninth International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction (Lecture Notes in Computer Science). Springer.