<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Systematic Mapping Study on Use of Pre-Trained Open Machine Learning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riku Alho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikko Raatikainen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jukka K. Nurminen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lalli Myllyaho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucy Ellen Lwakatare</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science</institution>
          ,
          <addr-line>PL 68 (Pietari Kalmin katu 5)</addr-line>
          ,
          <institution>University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Accurate understanding of pre-trained open source machine learning models, their frameworks, and their datasets can help software engineers simplify application development, reduce its costs, and improve its quality in different domains. This paper investigates how pre-trained Open Machine Learning (ML) models, their frameworks, and datasets are shared and used in different domains. A systematic mapping study is used to identify published studies. Statistical and qualitative results are formed from 499 studies which provide sufficient information regarding the use of open source pre-trained models, frameworks, and datasets. Based on a relatively large sample, the reviewed 499 studies provide a listing of Open ML models, frameworks, and datasets used in research as well as their relative popularity. The selected studies covered a large number of different domains, which saw benefits ranging from a minor decline to a moderate improvement when compared to previously used state-of-the-art machine learning methods. Most of the models in the studies were used under the TensorFlow framework with ImageNet as the dataset. The majority of studies were made in laboratory environments. Pre-trained Open ML models show positive promise for improvement in machine learning. Additional diversity of available open source models pre-trained with different datasets would improve this effect. More comparable studies are needed, especially from industry, that use and apply open source machine learning and report their context, methodology, and performance comprehensively.</p>
      </abstract>
      <kwd-group>
<kwd>Machine Learning</kwd>
        <kwd>Open Source</kwd>
        <kwd>Reuse</kwd>
        <kwd>Systematic Literature Review</kwd>
        <kwd>Systematic Mapping Study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Although there are important differences, machine learning system developers can learn a lot from traditional software engineering [1]. Both begin their task by familiarizing themselves with the problem domain. Software engineers explore existing and similar solutions, software, and databases, whereas machine learning engineers explore available machine learning options, models, frameworks, and datasets for the problem domain.</p>
      <p>The goal of this study is to assess the shared usage, adoptability, and evolvability of pre-trained open source machine learning models in different application domains. The study is carried out as a systematic mapping study [3], now an established research method in computer science to systematically collect an overview of the research state of the art.</p>
<p>Traditionally, many technologies related to machine learning have been hidden behind technology industry walls. Only recently have we seen a large and systematic introduction of multiple new open source machine learning technologies. Such shared off-the-shelf pre-trained open source machine learning models, frameworks, and datasets can provide competitive state-of-the-art capabilities in terms of performance, cost-effectiveness, and adaptability in different application domains when applied to new machine learning problems. This may provide affordable new avenues for software engineers, researchers, students, businesses, and private enthusiasts to help reap the benefits of available data without requiring them to invest work in reinventing the wheel [2].</p>
      <p>This review paper is structured as follows. First, Section 2 introduces the terminology and background of this study. In Section 3, the research questions and applied research method are presented. The results are introduced and analysed in Section 4. The analysed results and findings are discussed in Section 5. Finally, Section 6 concludes the paper.</p>
      <p>2. Background</p>
      <p>To properly understand the terminology and background for this study, we briefly describe the different terms and fields related to it.</p>
      <p>2.1. Basic concepts</p>
      <p>Large amounts of training or input data can be challenging to acquire. Transfer learning can be used to mitigate these challenges [9]. Using a pre-trained model taught with large amounts of data that even slightly overlaps the targeted domain may achieve comparable or even better results than using only the small datasets available to the targeted domain in question.</p>
      <p>Specifically, in this paper, we use the term Open ML
model for pre-trained open-source machine learning
models that can be reused as such or after retraining as
components in other systems.</p>
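<p>To make this reuse idea concrete, the following toy sketch (our own illustration, not from any reviewed study) treats a fixed feature extractor as the frozen "pre-trained" component and fits only a small task-specific head on new data; in practice a real Open ML model such as VGG-16 would play the extractor's role.</p>

```python
# Toy sketch of reusing a pre-trained component. The fixed feature
# extractor below is a stand-in for a large pre-trained network whose
# weights are reused as-is; only a small task-specific head is trained
# on the new domain's data (hypothetical example, not from the paper).

def pretrained_features(x):
    # Frozen "pre-trained" component: in practice, a network trained
    # on a large dataset such as ImageNet. Its "weights" never change.
    return (x, x * x)

def train_head(samples, labels, lr=0.1, epochs=200):
    # Fit only a tiny perceptron head on top of the frozen features.
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f0, f1 = pretrained_features(x)
            pred = 1 if w0 * f0 + w1 * f1 + b > 0 else 0
            err = y - pred
            w0 += lr * err * f0
            w1 += lr * err * f1
            b += lr * err
    return w0, w1, b

def predict(params, x):
    w0, w1, b = params
    f0, f1 = pretrained_features(x)
    return 1 if w0 * f0 + w1 * f1 + b > 0 else 0

# Small labelled target-domain dataset: label 1 iff |x| is large.
xs = [-3.0, -1.0, 0.0, 1.0, 3.0]
ys = [1, 0, 0, 0, 1]
params = train_head(xs, ys)
```

<p>Only the three head parameters are fitted here; the extractor stays untouched, which is what allows a small target-domain dataset to suffice.</p>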
      <sec id="sec-1-1">
<title>Artificial Intelligence</title>
        <p>Artificial Intelligence (AI) consists of all technical aspects that aim to get computers to imitate intelligent behaviour observed in humans [4]. This includes machine learning, natural language processing (NLP), language synthesis, computer vision, robotics, sensor analysis, optimization, and simulation.</p>
<p>A subset of AI is Machine Learning (ML), which consists of techniques that enable computers to change their functionality based on given information (e.g., sensor data), thus improving their behaviour to achieve the goal [5]. ML techniques include decision trees, neural networks, support vector machines, and many more.</p>
<p>Neural Networks (NNs) are a part of ML. They are computer programs inspired by biological neural network processes [6]. These consist of perceptrons, convolutional neural networks, recurrent neural networks, Boltzmann machines, deep neural networks, and many more. Basic NNs with one to a few layers of neurons usually require user assistance in forming classification classes.</p>
        <p>2.3. Frameworks</p>
        <p>The majority of Open ML models are used inside dedicated frameworks, such as Caffe [10], Keras [11], Weka [12], PyTorch [13], TensorFlow [14] or MatConvNet [15]. Frameworks work as the interface between an ML model, users, and hardware and can thus affect how hardware calculations and values are given to the model during training and use.</p>
<p>Deep Neural Networks (DNNs) are under the NN category. They are neural networks which consist of multiple layers, providing them the ability to form new classification classes.</p>
        <p>2.4. Datasets</p>
<p>Machine learning can be categorized into supervised learning, unsupervised learning, and reinforcement learning [7]. Supervised learning utilizes training data for classification and regression. Unsupervised learning constructs predictions of classification based on the given input data. Reinforcement learning uses trial and error based on an oracle, such as a repeatable simulation or a game, to find the optimal outputs.</p>
        <p>Most off-the-shelf open ML models offered by different frameworks are made available pre-trained on a certain dataset. There are many different datasets, with scales ranging from tens of thousands of entries to a billion. Datasets such as ImageNet [16], Places365 [17], CIFAR [18] and Pascal VOC [19] are offered in pre-trained open ML models. The benefit of pre-trained models is that training a model comprehensively on a large dataset takes a very long time and a lot of computing resources.</p>
<p>For instance, one study [20] reports that "it takes a Nvidia M40 GPU 14 days to finish just one 90-epoch ResNet-50 training execution on the ImageNet-1k dataset". Transfer learning makes it possible to train useful models with just a fraction of the original computing effort and with only a small amount of training data. Off-the-shelf pre-trained machine learning models have gained academic interest, especially after the ImageNet Large Scale Visual Recognition Challenge 2014 (ILSVRC 2014) [16], which provided a noticeable leap in the performance of Open ML models used for image classification.</p>
        <p>Open source refers to a computer program for which the source code is available to the general public for use or modification from its original design [8]. Open source code is a collaborative effort where programmers improve upon the source code and share the changes within the community. Code is released with a license specifying the conditions under which others may download, use, modify, and publish their versions to the community. This view of open source is not restricted to any particular license.</p>
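<p>A back-of-the-envelope calculation makes the scale of this saving concrete. The 14 GPU-days / 90 epochs figure is the one quoted from [20]; the fine-tuning workload below (5 epochs on a dataset one hundredth the size, with cost taken as naively proportional to epochs times data) is our own illustrative assumption, not a measurement.</p>

```python
# Rough cost comparison: full pre-training vs. fine-tuning a reused model.
# The full-training figures are the ones quoted from [20]; the fine-tuning
# workload is a hypothetical assumption for illustration only.

GPU_HOURS_FULL = 14 * 24   # 14 GPU-days for 90-epoch ResNet-50 on ImageNet-1k
EPOCHS_FULL = 90

FT_EPOCHS = 5              # assumed fine-tuning epochs
FT_DATA_FRACTION = 0.01    # assumed target-domain dataset size vs. ImageNet-1k

hours_per_epoch = GPU_HOURS_FULL / EPOCHS_FULL
ft_hours = hours_per_epoch * FT_EPOCHS * FT_DATA_FRACTION  # cost ~ epochs x data
speedup = GPU_HOURS_FULL / ft_hours

print(f"{hours_per_epoch:.2f} h/epoch, fine-tuning ~{ft_hours * 60:.0f} min, "
      f"~{speedup:.0f}x cheaper")
```

<p>Under these assumptions, reuse turns a two-week training run into minutes of fine-tuning, which is the economic argument behind sharing pre-trained models.</p>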
        <sec id="sec-1-1-1">
          <title>2.2. Open ML Models</title>
          <p>ML models are computer programs or components that
have formed statistical and mathematical insights from
data, such as a trained neural network. Although ML
models usually refer to something already trained, they
have also been used to refer to untrained, manually tuned,
or default-valued programs. Statistical and mathematical
insights can be formed with machine learning or manual
tuning.</p>
<p>Training an effective ML model requires large amounts of training or input data, which can be challenging to acquire.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.5. Raw and Tuned models</title>
<p>Off-the-shelf open ML models have been used as templates for modifications and changes that may alter the resource cost, performance, and accuracy of models on the same task. In this paper, we use the term raw model for a reused pre-trained ML model from an open-source provider. We use the term tuned model for those pre-trained ML models that have had their parameters manually tuned or changed, their structure modified, or that have been amalgamated together for the same task.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.6. Related Studies</title>
<p>Amalgamations may consist of structural merging or even majority voting between multiple models of the same type trained with different subsets of the same retraining dataset.</p>
          <p>Table 1: Electronic sources searched, and the numbers of papers found and finally included.</p>
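<p>The majority-voting amalgamation described above can be sketched as follows; the threshold "models" are trivial stand-ins, invented for illustration, for same-type models hypothetically retrained on different subsets of the same dataset.</p>

```python
# Sketch of a majority-voting amalgamation: several models of the same
# type (here trivial stand-in classifiers) vote on each input, and the
# most common label wins. Purely illustrative, not from a reviewed study.
from collections import Counter

def make_threshold_model(threshold):
    # Stand-in for one retrained model variant: classifies by a threshold.
    return lambda x: 1 if x > threshold else 0

# Three variants of the "same" model, differing slightly as if each had
# been retrained on a different subset of the dataset.
ensemble = [make_threshold_model(t) for t in (0.4, 0.5, 0.6)]

def majority_vote(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

<p>For an input of 0.55, two of the three variants vote 1, so the amalgamated prediction is 1 even though one variant disagrees.</p>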
<p>Electronic source | Number of papers found (duplicates) | Number of selected results per search:</p>
          <p>SpringerLink Journals: 854 (76), 392 selected. IEEE Xplore: 233 (0), 107 selected. Total: 1087 (76), 499 selected.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>We are not familiar with other systematically conducted</title>
<p>reviews directly related to shared Open ML usage. The closest related study was done by Nguyen et al. [21] in the form of an expertly opinioned and observed survey. It provides information regarding different statistically popular Open ML models, frameworks, and hardware. In addition, there are also lists of open-source ML libraries, such as the one curated by The Institute for Ethical Machine Learning, available on GitHub [22].</p>
        <p>• RQ3: What evidence is available on the performance and evolvability of pre-trained Open ML model solutions?</p>
        <p>3.3. Search strategy</p>
        <p>The automatic search was conducted by executing search strings on the search engines of the following digital libraries:</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Research approach</title>
      <p>The systematic mapping study as a form of a
systematic review is a well-defined research method to identify,
analyze, and synthesize all relevant studies regarding a
particular research question or topic area [23, 3]. The
systematic mapping study method was chosen for this paper
because it aims at a holistic, credible, and fair overview
of studies on shared pre-trained Open ML model usage.</p>
      <sec id="sec-2-1">
        <title>3.1. Protocol</title>
        <p>An important step when performing a systematic review
is the development of a protocol. The protocol specifies
all steps performed during the review, increasing its rigor
and reliability. The protocol was constructed following
the systematic review guidelines [24]. The protocol used
in this study was also inspired and adapted from the
procedure introduced by Mahdavi-Hezavehi et al. [25] in
their review.</p>
        <p>The procedures start with the research question
definition, search strategy identification, and search scope
selection. After that, study inclusion and exclusion
criteria were formed based on the research questions. An
empirical data extraction form was created based on the
research questions. The data collection was conducted
by filling out the data extraction form from the analyzed
studies found and included in searches.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Research questions</title>
        <sec id="sec-2-2-1">
          <title>This study covers the following research questions:</title>
<p>• RQ1: What solutions are used for shared pre-trained Open ML models?
• RQ2: How does research compare different Open ML models, datasets, and frameworks?</p>
          <p>The following search strings were used:
• IEEE Xplore: ("machine learning" OR "Deep learning") AND ("pre-trained model" OR "pre-trained models")
• SpringerLink: ("machine learning" OR "Deep learning") AND ("pre-trained model" OR "pre-trained models")</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>The following study inclusion criteria were used for</title>
<p>the inclusion of the papers:
• I1: The paper experiments with the usage of pre-trained ML models. Experiments are required to collect information in order to analyse solutions, adoptions, and evolvability.</p>
<p>The following study exclusion criteria were used:
• E1: The paper does not feature the usage of pre-trained Open ML models. If the focus of a paper was on other than Open ML models, the paper was excluded.
• E2: The Open ML model used in the paper is presented as a novel one. The model is not yet shared if it is novel.
• E3: The paper is an editorial, technical report, position paper, abstract, keynote, opinion, tutorial summary, panel discussion, or a book chapter.
• E4: The paper is grey literature. Grey literature is argued to be of lower quality than papers published in journals and conferences, as it usually is not thoroughly peer-reviewed [26].</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>The numbers of papers found and included during the</title>
<p>search phase are shown in Table 1. The publication date of searched papers was limited to 2013-2020. On SpringerLink, the search was limited to journal articles due to a high number of conference papers (over 1900) requiring identification. We decided to prefer journals over conference papers for their more meticulous peer-review process compared to the shorter review time of conference papers. IEEE Xplore yielded only 41 journal articles in total, so we decided to include conference papers in order to provide a more comprehensive sample.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.4. Data Extraction</title>
<p>Data was extracted using the data extraction form (Table 2). For the evidence levels (F10), the classification system proposed by Alves et al. [27] was used, consisting of six levels:
• 1. No evidence.
• 2. Evidence obtained from demonstration or working out toy examples.
• 3. Evidence obtained from expert opinions or observations.
• 4. Evidence obtained from academic studies (e.g., controlled lab experiments).</p>
<p>• 5. Evidence obtained from industrial studies (i.e., studies done in industrial environments, e.g., causal case studies).
• 6. Evidence obtained from industrial application (i.e., actual use of a method in industry).</p>
        <p>Table 5: Tasks addressed by the studies and the number of papers (%): Image classification 395 (79.2%); Image object detection 159 (31.9%); Text classification 48 (9.5%); Video object detection 14 (2.8%); Video classification 13 (2.6%); Face recognition 13 (2.6%); Image translation 9 (1.8%); Text detection 7 (1.4%); Data mining 5 (1.0%); Speech recognition 4 (0.8%); Music generation 3 (0.6%); Texture categorization 1 (0.2%); Human activity recognition 1 (0.2%).</p>
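<p>For data extraction tooling, the six-level scale above can be encoded as a simple lookup; the function below is our own illustration for tagging studies during extraction (F10), not part of the classification by Alves et al. [27].</p>

```python
# Lookup encoding of the six evidence levels used for field F10.
EVIDENCE_LEVELS = {
    1: "No evidence",
    2: "Demonstration or toy examples",
    3: "Expert opinions or observations",
    4: "Academic studies (e.g., controlled lab experiments)",
    5: "Industrial studies (e.g., causal case studies)",
    6: "Industrial application (actual use of a method in industry)",
}

def evidence_label(level):
    # Returns the description for a level, or "Unknown" for invalid input.
    return EVIDENCE_LEVELS.get(level, "Unknown")
```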
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and analysis</title>
      <sec id="sec-3-1">
        <title>This section first gives an overview of the identified stud</title>
        <p>ies and extracted information. After that, the research
questions are answered by representing the extracted
data and summarizing the data as an answer to each
question.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Results overview and demographics</title>
          <p>After performing the search and selection described
above in Section 3, we included 499 papers in the data
analysis.</p>
<p>Figure 2: Citation counts of the included studies (x-axis: number of citations, y-axis: number of papers).</p>
<p>The number of papers published each year between January 2013 and October 2020 is shown in Figure 1. The first papers started to appear only in 2016, and the highest number of studies was published in 2020. Figure 1 also indicates incremental interest following the ILSVRC 2014 competition, taking into account writing and publishing delays of over a year. In particular, the increase in interest has been exponential rather than linear over recent years.</p>
          <p>Of the domains, medical was addressed by 19.4% (i.e., 97 papers) of the studies and text classification by 9.5%. The rest of the domains had less than 15 studies addressing them. Some studies addressed more than one domain, so the total number of papers in the table is more than the amount reviewed. In summary, the domains related to images or videos are clearly the most prevalent.</p>
<p>The number of papers at different evidence levels is shown in Table 3. Almost all papers (97.8%, i.e., 488 papers) provided Level 4 evidence (academic studies) of their findings. The few remaining papers present Level 2 (demonstration) or Level 5 (industrial study) evidence. As this review focused on existing Open ML models, it is unsurprising that at least Level 4 is achieved. However, there need to be more practice-oriented studies at Levels 5 and 6. All studies provide mostly academic or industrial-level research, but most do not offer enough comparative evidence to adopt their used models. Only a few studies critically examined the potential influence of different actors, such as the researchers' bias, sponsors, and the quality of the tests used to validate their study.</p>
          <p>4.2. RQ1: Open ML models, datasets, and frameworks</p>
          <p>To answer this research question, the data of F6 (used models), F7 (used datasets), and F8 (used frameworks) were analyzed from the data extraction form and summarized in what follows. Because some studies used more than one model in their comparisons, the total numbers of papers are more than the amount reviewed. Table 6 presents the pre-trained Open ML models used by the studies and the number of studies applying each pre-trained model. In total, 149 different Open ML models were identified. The most popular pre-trained Open ML model, VGG-16, is used by 168 (33.7%) studies.</p>
<p>The next most popular, AlexNet and ResNet-50, are included in 100 (20.2%) and 99 (19.8%) studies, respectively. The majority of the models have less than six studies using them. Table 7 shows the datasets used for training and testing in the studies. A total of 49 different datasets were identified; in 114 studies, the dataset was not specified. ImageNet is used by 60.7% of the studies (i.e., 303 papers). In contrast, the second most popular, MS COCO and Google News Word2Vec, are included in a significantly smaller number of studies, i.e., 19 studies each.</p>
          <p>Figure 2 shows the citation counts for the studies. As can be seen, the lowest and highest citation counts are 0 and 1641, respectively. 464 papers (around 93.0%) have a citation count in the range of 0–20, and 35 papers (7.0%) have high citation counts in the range of 21–126. A few significant outliers were S111 (review of deep learning for time series classification), S203 (new simple approach for batch normalization), and S31 (new edge detection algorithm), with citation counts of 281, 697, and 1641, respectively.</p>
<p>The domains and tasks addressed by the studies are shown in Tables 4 and 5. 79.2% of the studies (i.e., 395 papers) addressed image classification, while the second most popular class was image object detection with 31.9% of the studies (i.e., 159 papers). Biology was addressed by 24.0% of the studies.</p>
          <p>The majority of the datasets have less than three studies using them. 114 studies do not mention their datasets explicitly, and thus they could not be extracted. Table 8 lists the frameworks used in the studies. We identified in total 37 different frameworks, and the framework is not specified in 153 studies. 21.8% of the studies (i.e., 109 studies) used the TensorFlow framework. The second most often used is Keras, with 16.2% of the studies (i.e., 81 studies). Over half of the frameworks have less than four studies using them. 30.7% of the studies did not explicitly mention their frameworks and were unresolvable.</p>
          <p>4.2.1. Summary to RQ1</p>
          <p>There is a large diversity in shared Open ML model usage across studies, although VGG-16 stood out as the most popular. Many different Open ML models have been used in different studies. A large part of the variety appears due to the multiple application domains addressed, as seen in Table 4, and their requirements for models with specialized application domain capabilities, such as word relation recognition (Parsey McParseface, word2vec) and music recognition (MidiNet, etc.). With frameworks and especially datasets, we see less disparity between the studies. Especially in the case of datasets, although there is a larger number of different datasets than Open ML models, ImageNet stands out as dominant, appearing in 303 (60.7%) studies, and most datasets appear in at most a couple of studies. This lack of disparity may be due to a lack of interest in and usefulness of the less widely appearing datasets, or the significant time and work effort required to train new models based on them.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.3. RQ2: Comparisons of open ML models, datasets, and frameworks</title>
<p>In addition to the data analyzed in RQ1 above, the number of different Open ML models that were compared within each study was analysed in order to form a conceivable adoption preference based on the comparison. In Table 9, the number of Open ML models which are explicitly compared by studies is listed. Many studies that do not compare Open ML models with other open models, however, make a comparison to differently licensed ones, such as copyrighted or proprietary models and frameworks without openly available source code. In total, 59.6% of studies compare their results to other Open ML models, and 35.8% compare to more than one Open ML model. 40.4% of the papers do not compare their results to other Open ML models.</p>
          <p>4.3.1. Summary to RQ2</p>
          <p>The number of Open ML models used per study was counted to see how well they are represented between comparisons. Around six out of ten studies compared Open ML models to other Open ML models, which can be considered quite a large amount. The numbers of dataset and framework comparisons were also counted, but they provided significantly less instrumental results. Only one-tenth of the studies compared the results of different datasets, and slightly over one-tenth different frameworks. Those that did compare datasets and frameworks were also mainly limited to comparing only two.</p>
<p>Likewise, an analysis was carried out on how many different datasets were compared in each study. Table 10 lists the number of datasets that were compared in the studies. 10.4% of the studies (i.e., 52 papers) compare the use of other datasets, and 15 of them to more than one. As also seen in Table 10, at least 66.1% of the papers do not compare their results to other datasets. Due to the unresolved datasets used by 22.8% of the studies, an unspecified category was added to the table.</p>
          <p>However, unlike Open ML models, datasets and frameworks have only a few dominant designs that are widely applied (cf. RQ1). It should also be taken into account that the dataset and framework results are inaccurate because many studies do not explicitly mention what was used by name. The limited number of contrastive studies did not offer enough information for reliable results on domain-specific or general Open ML model adoption. The scale of dataset and framework adoption is also unclear.</p>
<p>Finally, Table 11 lists the number of frameworks compared in the studies. 15.0% of the studies (i.e., 75 studies) compare their results with other frameworks, and 13 of them to more than one. As also visible in Table 11, at least 54.5% of the studies do not compare their results with other frameworks. The results are not very accurate due to the unresolved frameworks used by nearly a third (30.5%) of the studies. These were categorized as unspecified in the table.</p>
          <p>However, the scientific literature does provide some evidence on the popular adoption rates of certain Open ML models, such as VGG-16 and AlexNet, frameworks such as TensorFlow, and datasets like ImageNet.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>4.4. RQ3: Evidence available on the performance and evolvability of pre-trained Open ML model solutions</title>
<p>The studies used different evaluation approaches to evaluate the raw and tuned models.</p>
<p>The model type used for performance measurement studies is shown in Table 12. 272 studies (53.8%) give results for raw model performance that only use transfer learning. 112 studies (22.1%) provided results for models that were ensembled, modified, or tuned by researchers. Tuning ranges from individual value changes to layer replacement. 122 (24.1%) studies provided both raw and tuned performance results. The performance measurement results provided by the studies are not directly comparable and are thus not listed.</p>
          <p>The studies could not directly be used to evaluate the evolvability of models, but gave positive promise for their evolvability.</p>
          <p>5. Discussion</p>
          <p>This section summarizes and discusses the main findings, limitations of the review, and threats to validity.</p>
          <p>5.1. Main findings</p>
<p>Table 13 shows the performance of pre-trained Open ML model solutions described in the studies when the studies compare their results to previous state-of-the-art solutions, such as NN, SVM, and constant feature classification algorithms. Improvement in Table 13 consists of studies describing at least one of the Open ML models outperforming the state-of-the-art solutions. Partial improvement consists of studies describing Open ML models being competitive and partially outperforming in specific categories, such as lower computational cost without much performance disadvantage compared to others. Decline consists of studies where Open ML models performed worse than other solutions. As seen from Table 13, the majority of studies, 74.1% (i.e., 370 papers), report minor to moderate improvement compared to previous methods. 24 (4.8%) studies found a decline compared to state-of-the-art performance when using Open ML models. 108 (21.0%) studies do not show improvement, lack comparison, or have mixed results when using Open ML models.</p>
<p>The main goal of this study was to investigate how pre-trained Open ML models, frameworks, and datasets are shared and used in different domains through a systematic mapping study research method. Based on a relatively large sample, the reviewed 499 studies provide a listing of Open ML models, frameworks, and datasets used in research as well as their relative popularity. The studies cover many different domains, which saw benefits ranging from a minor decline to a moderate improvement compared to previously used machine learning methods. This indicated that pre-trained Open ML</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>5.3. Reflection of review</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>To objectively evaluate a systematic review, Kitchenham et al. proposed four quality questions for systematic reviews [27]:</title>
        <p>Are inclusion and exclusion criteria described? It is
considered that this review meets this criterion as it explicitly</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <sec id="sec-4-1">
        <title>This work was partly funded by Business Finland under</title>
<p>IML4E and IVVES of the ITEA programme.</p>
        <p>models and frameworks show positive promise for improvement in machine learning. Most of the models in the studies were used under the TensorFlow framework with ImageNet as the pre-training dataset. Most studies were academic, and only a few industrial studies were identified. More industrial-level studies need to be reviewed in order to have more reliable and accurate representations of Open ML model performance in the real world.</p>
<p>A suggestion for the future is to increase the coverage of studies and modify the review inclusion criteria for study extraction when assessing the usage of shared pre-trained Open ML models. Another, more conclusive option is to prototype and create a constantly updated open curated database of results for different Open ML models running on different frameworks using different pre-training datasets. These configurations would then be run and tested on different domain datasets. The different possible combinations and results could then be calculated on a cloud platform, with the only requirement being to insert the new model, dataset, or framework addition through a curated application form.</p>
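<p>Such a curated results database could, for example, key each entry by its model, framework, and dataset configuration; the record layout and field names below are purely hypothetical, and the example value is a placeholder rather than a reported result.</p>

```python
# Hypothetical record layout for the proposed curated results database.
# Field names are illustrative assumptions, not from the paper.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BenchmarkRecord:
    model: str             # e.g. "VGG-16"
    framework: str         # e.g. "TensorFlow"
    pretrain_dataset: str  # dataset the shared model was pre-trained on
    target_dataset: str    # domain dataset the configuration was tested on
    metric: str            # e.g. "top-1 accuracy"
    value: float           # measured result for this configuration

# Placeholder entry (the 0.93 value is invented for illustration).
record = BenchmarkRecord("VGG-16", "TensorFlow", "ImageNet",
                         "CIFAR-10", "top-1 accuracy", 0.93)
```

<p>Making the record immutable and keyed by its configuration would let new submissions from the curated application form be deduplicated and compared directly.</p>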
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>