Small Data and Data Centric AI: Case Study from the Master's Program in Artificial Intelligence at Sofia University

Maria Nisheva-Pavlova 1, 2 and Bilyana Dobreva 1

1 Faculty of Mathematics and Informatics – Sofia University St. Kliment Ohridski, 5 James Bourchier Blvd., Sofia, 1164, Bulgaria
2 Institute of Mathematics and Informatics – Bulgarian Academy of Sciences, 8 Acad. Georgi Bonchev Str., Sofia, 1113, Bulgaria

Abstract
Recently, the term "small data" has become essential in the field called "data centric AI". While big data is used for different types of correlation analysis, small data is the real source for finding causal relationships between the objects studied. The paper discusses the experience in creating small datasets and in applying transfer learning gained in the Master's program in Artificial Intelligence at the Faculty of Mathematics and Informatics at Sofia University, focusing on some good examples of student projects.

Keywords
Big data, small data, data centric AI, transfer learning, question answering system

Information Systems & Grid Technologies: Fifteenth International Conference ISGT'2022, May 27–28, 2022, Sofia, Bulgaria
EMAIL: marian@fmi.uni-sofia.bg (M. Nisheva-Pavlova); bddobreva@uni-sofia.bg (B. Dobreva)
ORCID: 0000-0002-9917-9535 (M. Nisheva-Pavlova)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

After the initial wave of research and technological developments related to big data, interest in the so-called small data, and especially in methodologies for creating appropriate small datasets and using them in the field of data centric artificial intelligence, is constantly growing. Correctly constructed small data are commonly used by people in decision-making in various areas of particular public importance. The creation and use of suitable small datasets, along with the application of proper kinds of transfer learning, is the basis of data centric artificial intelligence. In recent years, a number of successful projects (mostly pre-diploma and diploma projects) of students from the Master's program in Artificial Intelligence at the Faculty of Mathematics and Informatics at Sofia University have addressed this issue.

2. Small data and its importance for the creation of decision support systems

According to the most popular informal definition, "small data is data that is 'small' enough for human comprehension. It is data in a volume and format that makes it accessible, informative and actionable" [1]. A more formal definition of small data has been given by Allen Bonde: "Small data connects people with timely, meaningful insights (derived from big data and/or 'local' sources), organized and packaged – often visually – to be accessible, understandable, and actionable for everyday tasks" [2].

As many authors note, small data is what people usually think of as data. While big data can be understood as high-volume raw data coming from heterogeneous sources (e.g. social media publications, customer transactions, etc.) which is difficult to comprehend and manage, small data is produced from raw data by cleaning and reducing it into small, visually appealing objects representing particular aspects of large datasets.
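As a simple illustration of this cleaning-and-reducing process, the following minimal sketch, assuming the pandas library and a hypothetical transaction log (the file name and column names are illustrative, not from any project described here), condenses high-volume raw records into a small, human-readable summary table:

```python
# A minimal sketch of producing "small data" from raw big data:
# clean noisy transaction records and reduce them to a compact,
# actionable per-customer summary. File and schema are hypothetical.
import pandas as pd

# Load raw, heterogeneous transaction records.
raw = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# Clean: drop incomplete records and obvious noise.
raw = raw.dropna(subset=["customer_id", "amount"])
raw = raw[raw["amount"] > 0]

# Reduce: one small, readable table instead of millions of raw rows.
small = (
    raw.groupby("customer_id")
       .agg(purchases=("amount", "count"),
            total_spent=("amount", "sum"),
            last_purchase=("timestamp", "max"))
       .reset_index()
)

print(small.head())  # small enough for human comprehension and decision-making
```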
From the point of view of their applicability for the creation of different types of data analytics systems, the clearest dividing line between 'big' and 'small' data can be formulated as follows [3]: "Big Data is all about finding correlations, but Small Data is all about finding the causation, the reason why". More precisely, big data is important in all cases of building medium-term and long-term policies and strategic decisions. On the other hand, small data refers to definite and specific attributes of datasets, which can be used to analyze the current situation in depth and to make adequate personalized decisions. Therefore, small data is best placed to support decision-making at the current time. For example, clinicians favor small data over big data for healthcare assessments as well as for building personalized prediction and decision-making models (see Figure 1).

Figure 1: Comparison of the applicability of Big Data and Small Data models in healthcare [4]

Also, there are various types of cases in which a particular person or organization needs quick and instant analysis of the available data, and there is no need to use big data analytical tools for the purpose.

3. Data centric AI

The concept of data centric artificial intelligence, which has recently been actively involved in research and applied development, refers to building AI systems with quality data. The data centric AI approach is based on the idea of focusing on ensuring that the data used clearly show what the developed AI system needs to learn.

As Andrew Ng notes in his popular interview for IEEE Spectrum [5], "data centric AI is the discipline of systematically engineering the data needed to successfully build an AI system". So, if until recently the dominant idea was to focus on improving the code, nowadays it is more effective for a lot of applications to consider the quality of code a generally solved problem and to move the focus to finding approaches to improve the data [5]. In particular, instead of working directly with a large amount of raw and noisy data, it is better to make appropriate efforts at the beginning to improve the consistency of the data and in this way to achieve a significant improvement in productivity. Especially for big data applications, the common approach has been: "If the data is noisy, let's just get a lot of data and the algorithm will average over it" [5]. The data centric approach, in contrast, tries to develop tools that point out data inconsistencies and give an effective way to overcome most of them in order to get a truly high performing system.

Following the data centric AI paradigm, a significant number of pre-diploma and diploma projects in various application areas are being developed in the Master's program in Artificial Intelligence at the Faculty of Mathematics and Informatics at Sofia University St. Kliment Ohridski. Among the most significant of them is the project for a virtual health assistant called Medico-Help [6], developed in 2021. Medico-Help is a web-based expert system that functions as an intelligent chatbot, capable of:
• automatically collecting data from trusted websites,
• automatically building and extending a medical knowledge base and searching in it,
• generating hypotheses for medical diagnoses based on symptoms (see the sketch below).

As an initial version of the knowledge base of Medico-Help, a small standardized ontology for human diseases (https://disease-ontology.org), developed at the School of Medicine at the University of Maryland, has been used.
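The following minimal sketch, which is an illustration rather than the actual Medico-Help implementation, shows how diagnosis hypotheses can be generated from reported symptoms against a small symptom-to-disease knowledge base; the entries and the overlap-based ranking heuristic are illustrative assumptions:

```python
# A toy in-memory knowledge base (not the real Disease Ontology):
# each disease is mapped to its set of known symptoms.
knowledge_base = {
    "influenza":   {"fever", "cough", "fatigue", "muscle pain"},
    "common cold": {"cough", "sneezing", "sore throat"},
    "covid-19":    {"fever", "cough", "fatigue", "loss of taste or smell"},
}

def diagnosis_hypotheses(symptoms):
    """Rank diseases by the fraction of their known symptoms the user reports."""
    reported = set(symptoms)
    scored = []
    for disease, known in knowledge_base.items():
        overlap = reported & known
        if overlap:
            # Keep the unmatched symptoms so they can be suggested to the user.
            scored.append((disease, len(overlap) / len(known), known - reported))
    return sorted(scored, key=lambda item: item[1], reverse=True)

for disease, score, missed in diagnosis_hypotheses(["fever", "cough", "fatigue"]):
    print(f"{disease}: score {score:.2f}; "
          f"other related symptoms to check: {', '.join(sorted(missed)) or 'none'}")
```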
The system has a module for automated collection of specialized data from trusted sources on the Internet. The role of such a source in the pilot version of Medico-Help is played by MedIndia (https://www.medindia.net). The new data retrieved from the documents provided by MedIndia are analyzed and used to gradually enrich the domain knowledge base of Medico-Help. Information about new drugs and additional symptoms is also periodically added for this purpose. The available version of the knowledge base is used to generate answers to the user questions, most often in the form of assumptions about diagnoses corresponding to the indicated symptoms, as well as suggestions about possible treatment regimens. Each diagnosis assumption includes information about the disease such as description, related symptoms, synonyms and drugs. The virtual assistant can also draw the user's attention to possible other related symptoms that might have been missed.

4. Transfer learning

A popular approach in deep learning that supports the implementation of the principles of data centric AI is transfer learning, where pre-trained models are used as a first approximation of the solution of primarily computer vision and natural language processing (NLP) tasks. Jason Brownlee characterizes transfer learning as a "machine learning method where a model developed for a task is reused as the starting point for a model on a second task" [7].

There are many advantages of using transfer learning instead of a machine learning model built from scratch. The most significant of them are [8]:
• a transfer learning model needs less data as compared to a model built from scratch,
• a transfer learning model needs less computation power,
• a transfer learning model requires less time because most of the heavy work is already done on the pre-trained model and only a relatively small part is done by the new model.

The common approach to transfer learning in the field of deep learning is the Pre-trained Model Approach [9]. Its implementation consists of three main stages (a code sketch of the three stages is given at the end of this section):
• Select Source Model. A proper pre-trained model is chosen from the set of available models. Many research institutions now release freely available models trained on challenging datasets, which can be included in the pool of candidate models from which to choose.
• Reuse Model. The chosen pre-trained model is then used as the starting point for a model on the current task of interest.
• Tune Model. The new model may need to be refined on the input-output data pairs available for the task of interest.

Nowadays it is popular to perform transfer learning on natural language processing problems in which text is used as input or output. For these types of problems, an appropriate word embedding – a mapping of words to a high-dimensional real-valued vector space where different words with a similar meaning have a similar vector representation – is usually constructed and used [10]. There are many efficient techniques for learning this kind of word representations, e.g. Embedding Layer, Word2Vec, GloVe [7]. It is a common practice for research and development organizations to release models, pre-trained on large corpora of text documents, under a permissive license.
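As a concrete illustration of the three-stage Pre-trained Model Approach described above, the following minimal sketch selects a freely released pre-trained image model, reuses it as a frozen feature extractor, and tunes a new task-specific layer. It assumes PyTorch and torchvision; the five-class target task is a hypothetical placeholder, not taken from any project in this paper:

```python
# A minimal sketch of the Pre-trained Model Approach with PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision import models

# 1. Select Source Model: a freely available model pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Reuse Model: freeze the pre-trained feature extractor so the small
#    target dataset only trains the new task-specific head.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)  # hypothetical 5-class task

# 3. Tune Model: refine on the input-output pairs of the task of interest.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def training_step(images, labels):
    """One fine-tuning step on a batch from the (small) target dataset."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```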
A good illustration of the principles of transfer learning is the natural language processing technology supported by the recently popular Bidirectional Encoder Representations from Transformers (BERT) [10]. BERT is a method for generating a general language model that can understand natural language. The generated language model can then be used even without additional training. BERT has achieved some of the best results in many NLP tasks.

BERT is pre-trained on a very large corpus of non-annotated texts on the task of language modeling (15% of the words are masked and BERT is trained to predict them from the context). The other task on which the model is pre-trained is the task of predicting the next sentence. As a result of the training process, BERT learns appropriate contextual embeddings of words. After the preliminary training with non-annotated data on these tasks, BERT can be fine-tuned with fewer resources and smaller datasets to optimize its work on specific tasks. For fine-tuning, the model is first initialized with the pre-trained parameters, after which all parameters are fine-tuned using annotated data from the downstream tasks. There are particular fine-tuned models for each of these tasks, although they are initialized with the same pre-training parameters (https://github.com/google-research/bert), e.g. BERT-Large, Uncased (Whole Word Masking); BERT-Large, Cased (Whole Word Masking); BERT-Base, Multilingual Cased (New); BERT-Base, Chinese, etc.

5. An example: Intelligent system for answering specialized questions about COVID-19

The intelligent system for answering specialized questions about COVID-19 was designed and implemented in 2021–2022 as a diploma project for the completion of the Master's program in Artificial Intelligence at Sofia University [11]. It may be considered a good example of application of the principles of data centric AI, particularly of the transfer learning methodology, in solving problems in information retrieval, natural language processing and knowledge discovery in text.

The development of the system was motivated by the popular COVID-19 Open Research Dataset Challenge of Kaggle (https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge). It uses the COVID-19 Open Research Dataset (CORD-19), released in 2020 by the Allen Institute for AI (AI2) in cooperation with other leading institutions [12]. CORD-19 is a large and growing collection of more than 1,000,000 publications and preprints on COVID-19 and previous coronaviruses such as SARS and MERS. It integrates papers and preprints from several sources, collected by Semantic Scholar (see Figure 2). Paper documents are processed to extract full text. Metadata are harmonized by the Semantic Scholar team at AI2.

Figure 2: Data sources and structure of CORD-19 [12]

In the process of developing the system, preliminary preparation of the data was performed, which includes recognition of the language of each of the available papers and selection of those in English, followed by tokenization of the abstracts and texts of the selected papers. This results in the actual working version of the dataset, the content of which is used to generate the answers to the user questions.

When it receives a question from the user, the system first determines the rank of each paper in the dataset relative to the user question. The Okapi BM25 best matching ranking algorithm (https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html) is used for this purpose and the five papers with the highest ranks are selected. The next step is to use the BERT Large Uncased Whole Word Masking model, pre-trained and fine-tuned on the Stanford Question Answering Dataset (https://rajpurkar.github.io/SQuAD-explorer). The texts of the five selected papers and the user question are submitted to it; this two-stage pipeline is sketched below.
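The following minimal sketch, an illustration rather than the actual thesis code, shows such a retrieve-then-read pipeline. It assumes the rank_bm25 package for Okapi BM25 ranking and the Hugging Face transformers pipeline with a publicly released BERT Large Uncased Whole Word Masking model fine-tuned on SQuAD; the three-sentence corpus is a toy stand-in for CORD-19:

```python
# Stage 1: BM25 retrieval over the corpus; Stage 2: a SQuAD-fine-tuned
# BERT reader extracts answers from the top-ranked documents.
from rank_bm25 import BM25Okapi
from transformers import pipeline

corpus = [
    "Several vaccine candidates against SARS-CoV-2 entered clinical trials in 2020.",
    "Remdesivir was evaluated as a therapeutic for hospitalized COVID-19 patients.",
    "MERS and SARS are earlier coronaviruses related to SARS-CoV-2.",
]

# Stage 1: rank documents with Okapi BM25 and keep the top matches.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
question = "What do we know about vaccines and therapeutics?"
scores = bm25.get_scores(question.lower().split())
top_docs = [corpus[i] for i in sorted(range(len(corpus)),
                                      key=lambda i: scores[i], reverse=True)[:2]]

# Stage 2: extract answer spans with BERT fine-tuned on SQuAD.
reader = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")
for doc in top_docs:
    result = reader(question=question, context=doc)
    print(f"{result['score']:.3f}  {result['answer']}")
```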
As a result of the execution of BERT, the generated answers to the user question are returned. Each answer contains data about the author(s) and the title of the respective paper, its estimated BERT score and BM25 score, and a brief description of the essence of the results presented in it, in their most general and abstract formulation.

The system was successfully tested on the questions from Round #1 of the cited Kaggle competition. Figure 3 shows the results of the search for answers to the question "What do we know about vaccines and therapeutics?" (Task 3 from Round #1 of the CORD-19 Challenge) and Figure 4 shows the results for the question "What has been published about medical care?" (Task 5 from Round #1 of the CORD-19 Challenge).

Figure 3: Results for Task 3 from Round #1 of the CORD-19 Challenge of Kaggle

Figure 4: Results for Task 5 from Round #1 of the CORD-19 Challenge of Kaggle

The analysis of the obtained experimental results shows that the system is relatively good at generating answers to specific questions, but it is advisable to improve its algorithm by using additional NLP techniques such as lemmatization and dividing the texts of the CORD-19 papers into separate paragraphs. It would also be useful to enrich the dataset with which the system works with other types of documents related to COVID-19, such as technical reports and messages from governmental institutions and public organizations. Although a domain-specific corpus of data was used to create the system, the approach developed is general enough and can be applied in other areas.

6. Conclusion

Our experience in teaching AI and in research and development activities in various areas of AI suggests that one of the significant challenges for data centric AI is the lack of validated methodologies – both domain-independent and domain-specific ones – for connecting small data to big data. The development of such methodologies and appropriate supporting software tools, along with the availability of a sufficient number of pre-trained machine learning models for different areas, would contribute to the rapid creation of intelligent software systems with great impact on large target groups, providing personalized services and reliable content.

7. Acknowledgements

This research is supported by Project BG05M2OP001-1.001-0004 "Universities for Science, Informatics and Technologies in the e-Society (UNITe)", financed by Operational Program "Science and Education for Smart Growth" and co-financed by the European Regional Development Fund.

8. References

[1] R. Pollock, "Forget big data, small data is the real revolution". The Guardian, 25 April 2013. URL: https://www.theguardian.com/news/datablog/2013/apr/25/forget-big-data-small-data-revolution (last visit on 31 March 2022).
[2] Small Data Group, Defining Small Data. URL: https://smalldatagroup.com/2013/10/18/defining-small-data (last visit on 31 March 2022).
[3] C. Sarkar, "Small Data, Big Impact!" – An Interview with Martin Lindstrom. The Marketing Journal, 1 May 2016. URL: https://www.marketingjournal.org/small-data-big-impact-an-interview-with-martin-lindstrom (last visit on 31 March 2022).
[4] R. Kannan, "The Importance of Small Data vs Big Data for Healthcare". TRIGENT, 25 June 2019.
URL: https://blog.trigent.com/the-importance-of-small-data-vs-big-data-for-healthcare (last visit on 31 March 2022).
[5] A. Ng, "Unbiggen AI". IEEE Spectrum, 9 February 2022. URL: https://spectrum.ieee.org/andrew-ng-data-centric-ai (last visit on 31 March 2022).
[6] R. Tsanova, Virtual Health Assistant. Master Thesis, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, 2021 (in Bulgarian).
[7] J. Brownlee, "A Gentle Introduction to Transfer Learning for Deep Learning". Machine Learning Mastery, 16 September 2016. URL: https://machinelearningmastery.com/transfer-learning-for-deep-learning (last visit on 31 March 2022).
[8] R. Barman, S. Deshpande, S. Agarwal, U. Inamdar, "Transfer Learning for Small Dataset". Proceedings of the National Conference on Machine Learning, 26 March 2019, ISBN 978-93-5351-521-8, pp. 132–137.
[9] P. Marcelino, "Transfer learning from pre-trained models". Towards Data Science, 23 October 2018. URL: https://towardsdatascience.com/transfer-learning-from-pre-trained-models-f2393f124751 (last visit on 31 March 2022).
[10] R. Horev, "BERT Explained: State of the art language model for NLP". Towards Data Science, 27 September 2021. URL: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270 (last visit on 31 March 2022).
[11] B. Dobreva, Intelligent System for Answering Specialized Questions about COVID-19. Master Thesis, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, 2022 (in Bulgarian).
[12] L. Wang et al., "CORD-19: The COVID-19 Open Research Dataset". Preprint, arXiv:2004.10706v2, 2020. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7251955 (last visit on 31 March 2022).