<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Which Neural Network? Finding the Right Neural Network Architecture for a Research Problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Färber</string-name>
          <email>michael.faerber@kit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Weber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Heidelberg University, Natural Language Processing Group</institution>
          ,
          <addr-line>Im Neuenheimer Feld 325, 69120 Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology (KIT), Web Science Group</institution>
          ,
          <addr-line>Kaiserstr. 89, 76133 Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Considering the increasing rate of scientific papers published in recent years, for researchers throughout all disciplines it has become a challenge to keep track of which latest scientific methods are suitable for which applications. In particular, an unmanageable amount of neural network architectures has been published. In this paper, we propose the task of recommending neural network architectures based on textual problem descriptions. We frame the recommendation as a text classification task and develop appropriate text classification models for this task. In experiments based on three data sets, we find that an SVM classifier outperforms a more complex model based on BERT. Overall, we give evidence that neural network architecture recommendation is a nontrivial but gainful research topic.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>machine learning</kwd>
        <kwd>neural network architectures</kwd>
        <kwd>open science</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A multitude of neural network architectures has been proposed, with many more to come. The knowledge about these architectures and their typical applications is scattered across a rapidly growing body of literature. Machine learning researchers and practitioners, such as data scientists and software developers, therefore face the recurring question: When to use which neural network architecture?²</p>
      <p>
        So far, approaches to neural architecture search and search engines for research data management have been proposed. Neural architecture search [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is concerned with the task of automatically finding the optimal neural network architecture design for a specific task. However, neural architecture search approaches usually restrict themselves to a specific architecture type (e.g., RNN or CNN) and target finding the optimal configuration within that type. Instead, the focus of this paper is on a different level of granularity: the idea is to create a model that finds the most suitable neural network architecture for a research problem described in natural language. Furthermore, neural network search engines and ontologies, such as FAIRnets [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], differ from our approach because they allow only keyword queries.
      </p>
      <p>¹See https://www.wikidata.org/.</p>
      <p>²See https://datascience.stackexchange.com/questions/20222/how-to-decide-neural-network-architecture.</p>
      <p>
        Chen et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] found that real information needs are most often formulated as phrases and not as keywords; keyword queries constitute only 32% of the investigated queries. In addition, such search systems return, rather than recommend, neural network architectures.
      </p>
      <p>In this paper, we propose the task of neural network architecture recommendation. It differs from other text classification tasks in the fact that research problem descriptions as input are largely not available and first need to be created. To this end, we propose two methods that extract problem descriptions from papers’ abstracts. In addition, the usage of neural network architectures is highly imbalanced in the literature, making the recommendation task a nontrivial challenge. We train and evaluate two state-of-the-art machine-learning-based approaches for neural network architecture recommendation, using the extracted problem descriptions and neural network architectures derived from Wikidata. Our proposed approach can benefit students as well as researchers of various domains. For researchers with little expertise in the field of machine learning in particular, our approach simplifies the process of selecting a suitable neural network model and presumably reduces the time spent on preliminary research on appropriate neural architectures.</p>
      <p>To summarize, we make the following contributions:
1. We create evaluation data sets for neural network
architecture recommendation, consisting of 66
unique architectures and 284,337 textual problem
descriptions.</p>
      <p>2. We train and evaluate several classifiers capable of predicting neural network architectures based on textual problem descriptions.³</p>
      <p>Table 1: The 66 neural network architectures retrieved from Wikidata: Kernel Perceptron, Multilayer Perceptron, Restricted Boltzmann Machine, winner-take-all, Hopfield Network, Neural Abstraction Pyramid, Shift Invariant NN, Spatial Transformer Network, Neural History Compressor, Kohonen NN, Radial Basis Function Network, Connectionist Expert System, Boltzmann Machine, Bidirectional Associative Memory, Neural Turing Machine, Self-organizing Map, ResNet, RNTN, TDNN, BCPNN, MCDNN, HONN, Elman Network, RecCC, Jordan Network, ADALINE, LSTM, CPPN, CMAC, PCNN, DRPNN, NNPDA, MANN, RecNN, RoBERTa, Neocognitron, Cresceptron, Modular NN, Deep NN, Feedforward NN, Perceptron, Highway Network, Transformer, AlexNet, Text-CNN, EntNet, Hamming NN, LeNet-5, Stochastic NN, CapsNet, 3D-CNN, GCN, CNN, GRU, SNN, DNC, PNN, DBN, GAN, ESN, ELM, VAE, HTM, LSM, RNN, DQN.</p>
      <p>The paper is structured as follows: In Section 2, we describe the creation of the neural network architecture set as well as of two data sets with scientific problem descriptions. Section 3 discusses the methods to predict the neural network architectures based on textual descriptions. In Section 4, we present our experiments. We conclude in Section 5 with a summary.</p>
    </sec>
    <sec id="sec-data">
      <title>2. Data</title>
      <sec id="sec-1-1">
        <title>2.1. Neural Network Architectures</title>
        <sec id="sec-1-1-1">
          <title>2.2.1. Extraction by Abstract Splitting</title>
          <p>
            Our approach utilizes the knowledge graph Wikidata to
obtain a list of neural network architectures. The follow- The first approach of creating a data set is based on the
ing aspects are taken into consideration: (1) all subclasses observation of Jiang et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. The main idea is that
of artificial neural networks; (2) the hierarchical struc- abstracts can often be conceptually split into an
introducture of these subclasses; (3) aliases and abbreviations. tion and a solution part. After manually checking 500
Our query returns 67 results, of which 66 (see Table 1) randomly selected papers from four conferences (SIGIR,
are appropriate for the task at hand (the additional item SIGKDD, RecSys, and CIKM), the result indicates that
returned is the “artificial neural network” item itself). 71% of the abstracts adhere to this structure [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
We observe that the key phrases “in this paper” and
2.2. Problem Descriptions “this paper” play an important role in the transition
between the problem statement and solution parts (see
TaOur aim is to recommend neural network architectures ble 2). We therefore check for each sentence in the
abbased on problem descriptions. However, problem de- stracts whether these key phrases occur. If there is a
scriptions are, to the best of our knowledge, not available match, we mark the sentence as the beginning of the
to a large degree. However, we argue that parts of papers’ solution part and all prior sentences as the problem
deabstracts are a good approximation of textual research scription part. Table 2 provides an illustration of our
problems. Thus, we use the paper abstracts and metadata abstract-splitting approach.
from the Microsoft Academic Graph (MAG; [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]).
          </p>
          <p>3All data and source code is available online at https://
github.com/michaelfaerber/NNARec.
4[BERT, GPT-2, GPT-3, natural language, self-attention]</p>
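        <p>The list can be reproduced with a SPARQL query against the public Wikidata endpoint. The following minimal Python sketch assumes the Wikidata identifiers Q43080 (“artificial neural network”) and P279 (“subclass of”); the exact query used for the paper may differ.</p>
        <preformat>
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers: Q43080 = "artificial neural network", P279 = "subclass of".
QUERY = """
SELECT DISTINCT ?arch ?archLabel ?alias WHERE {
  ?arch wdt:P279* wd:Q43080 .
  OPTIONAL { ?arch skos:altLabel ?alias . FILTER (lang(?alias) = "en") }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def fetch_architectures():
    """Return a mapping from architecture label to its set of English aliases."""
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "nnarec-example/0.1"})
    resp.raise_for_status()
    names = {}
    for row in resp.json()["results"]["bindings"]:
        label = row["archLabel"]["value"]
        names.setdefault(label, set())
        if "alias" in row:
            names[label].add(row["alias"]["value"])
    return names
        </preformat>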
        </sec>
        <sec id="sec-1-1-2">
          <title>Example Problem Description: The prediction of fail</title>
          <p>ures in rotating machines is an important issue in
industries to improve safety, to reduce the cost of maintenance
and to prevent accidents.</p>
          <p>Example Solution: In this paper a predictive maintenance
algorithm, based on the analysis of the orbits shape of the
rotor shaft is proposed. It is based on an autonomous image
pattern recognition algorithm, implemented by using a</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>Convolutional Neural Network (CNN).[...]</title>
        </sec>
        <sec id="sec-1-1-4">
          <title>Example Target Label: CNN</title>
          <p>To evaluate the effectiveness of this method, we let two experienced researchers classify 500 randomly selected splits into the following categories: (1) the split is correct, (2) the split is incorrect, but a correct split is possible, and (3) the abstract cannot be split into an introduction part and a solution part. The differences between the annotators lie mostly in the annotators’ conceptions of where to set a split, rather than in whether a split is possible at all. Inter-annotator agreement amounts to a Cohen’s kappa of 0.7538, which indicates good agreement for this task. Overall, based on our analysis, 88.6% of the randomly sampled splits are evaluated as being correct.</p>
          <p>Once the abstracts have been split, only the parts of abstracts with mentions of neural network architectures in their respective solution part are included in the data set, with the introduction parts as problem descriptions and the neural network architectures as the labels. We refer to the resulting data set as the Abstract Splitting (AS) data set.</p>
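          <p>As a minimal sketch of the splitting heuristic, the following Python function marks the first sentence containing one of the key phrases as the start of the solution part; the sentence segmentation here is simplified and not necessarily the one used for the paper.</p>
          <preformat>
import re

KEY_PHRASES = ("in this paper", "this paper")

def split_abstract(abstract: str):
    """Split an abstract into (problem, solution) at the first sentence
    containing a key phrase; return None if no valid split is found."""
    # Naive sentence segmentation; a proper tokenizer may be preferable.
    sentences = re.split(r"(?&lt;=[.!?])\s+", abstract.strip())
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in KEY_PHRASES):
            if i == 0:  # No problem description precedes the solution part.
                return None
            return " ".join(sentences[:i]), " ".join(sentences[i:])
    return None
          </preformat>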
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Extraction by Key Phrase Templates</title>
          <p>The aforementioned method has the drawback that only the neural network mentioned in the solution part of an abstract is assumed to be directly related to the problem description outlined in the first part of this abstract; problem descriptions in other parts of the abstract are ignored. To combat this issue, we create a method that identifies problem descriptions more precisely.</p>
          <p>In a first step, we analyze the abstracts that contain neural network architecture mentions to obtain an understanding of recurring phrases in problem descriptions. From these phrases, we then create templates to extract problem descriptions from all abstracts. Table 3 illustrates an example of a template and a match. Overall, we came up with 44 templates that are based on regular expressions.</p>
          <p>As we can see in Table 3, this method generally results in shorter problem descriptions than the plain abstract-splitting method proposed above. As only the problem descriptions and the neural network architecture names are of interest, and not the long method descriptions, we additionally identify the neural network architecture names mentioned in METHOD (in the example in Table 3: CNN), given our list of neural network architecture names. To minimize redundancy for extractions made in the same abstract, if one string is a substring of the other, the longer one is chosen and the other one is dismissed.</p>
          <p>A last step to reduce noise is to filter out common phrases in the texts that carry no information (e.g., “solve this problem” given the template “we use METHOD to solve this PROBLEM”). While the quality of the extracted problem descriptions is overall satisfying, from the 284,337 abstracts mentioning neural network architectures, only 35,829 problem descriptions remain based on this method. The resulting data set is designated the Key Phrase Extraction (KE) data set.</p>
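          <p>The 44 templates themselves are not reproduced in this excerpt; the following sketch shows one hypothetical regular-expression template in the spirit of “we use METHOD to solve this PROBLEM”, including the substring-based deduplication described above.</p>
          <preformat>
import re

# One illustrative template; the paper's 44 actual regular expressions
# are not published in this excerpt.
TEMPLATE = re.compile(
    r"we (?:use|apply|propose) (?P&lt;method&gt;.+?) "
    r"to (?:solve|address|tackle) (?P&lt;problem&gt;.+?)[.;]",
    re.IGNORECASE,
)

def extract(abstract: str, architecture_names: set):
    """Yield (problem, architecture) pairs for matches whose METHOD span
    mentions a known neural network architecture name."""
    for match in TEMPLATE.finditer(abstract):
        method = match.group("method").lower()
        mentioned = [n for n in architecture_names if n.lower() in method]
        # Keep only the longest name when one is a substring of another.
        mentioned = [m for m in mentioned
                     if not any(m != o and m.lower() in o.lower()
                                for o in mentioned)]
        for name in mentioned:
            yield match.group("problem"), name
          </preformat>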
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Neural Network Architecture Mentions</title>
        <p>Due to the differences in the data set creation, the distribution of neural network architectures differs between our AS and KE data sets. To make them comparable, we take two steps. First, to avoid losing all instances of sparse classes, the hierarchical structure of some neural network architectures allows for the inclusion of sparse classes into their parent classes (e.g., GRU is integrated into RNN). We perform this step for all classes with fewer than 200 instances, given there is a hierarchy to exploit. Second, because some architectures are rarely mentioned, only classes with at least 200 instances in both data sets are considered. This leads to both data sets containing the same classes. From the initial 66 neural network architectures retrieved from Wikidata, only 15 remain; they are listed in Figure 1.</p>
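        <p>A compact sketch of the two normalization steps follows; the parent mapping shown is a hypothetical excerpt, while the 200-instance threshold is the one stated above.</p>
        <preformat>
from collections import Counter

MIN_INSTANCES = 200

def merge_sparse_classes(labels, parent):
    """Map labels with fewer than MIN_INSTANCES occurrences to their parent
    class where a hierarchy exists, e.g. parent = {"GRU": "RNN"}."""
    counts = Counter(labels)
    return [parent.get(label, label) if counts[label] &lt; MIN_INSTANCES else label
            for label in labels]

def shared_frequent_classes(as_labels, ke_labels):
    """Classes with at least MIN_INSTANCES instances in both data sets."""
    as_counts, ke_counts = Counter(as_labels), Counter(ke_labels)
    return {c for c in as_counts
            if as_counts[c] &gt;= MIN_INSTANCES and ke_counts[c] &gt;= MIN_INSTANCES}
        </preformat>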
        <p>[Figure 1: Class distribution of the mod-AGENDA data set over the 15 remaining neural network architecture classes: ELM, RBF, Spiking NN, PNN, GAN, SOM, Feedforward NN, Deep Belief Network, Autoencoder, Perceptron, MLP, LSTM, RNN, Deep NN, and CNN.]</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Preparing AGENDA as Test Set</title>
        <p>
          The Abstract GENeration data set (AGENDA; Koncel-Kedziorski et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) has been used for automatic text generation based on knowledge graphs and consists of
knowledge graphs paired with paper titles and paper
abstracts from the AI domain. As mentions of tasks and
methods are also labeled in these paper abstracts, we
can use this data set for an additional, complementary
evaluation, particularly as an additional test data set
considering its size.</p>
        <p>It is important to note that the text spans labeled as problem descriptions in this data set are rather short, so as to be compatible with knowledge graph entities. We therefore increase the context by considering whole sentences as problem descriptions. The resulting modified data set, designated mod-AGENDA, has 1,327 instances distributed over 15 classes, as Figure 1 shows.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>The task in this paper falls into the realm of supervised classification. The overwhelming majority of instances in each of our data sets has only a single label. Thus, in the following evaluation, we consider the task as a multiclass, single-label classification task. For this paper, we consider the following widely used text classification approaches.</p>
      <p>TF-IDF + SVM. One approach is based on an SVM, using TF-IDF for representing the text as vectors. As this can lead to very high-dimensional sparse vectors, it makes sense to filter out stopwords for the vector representation.</p>
      <p>BERT + Classification Layer. As our second approach, we use a fine-tuned BERT model with an additional classification layer.</p>
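      <p>A minimal sketch of the first approach with standard scikit-learn components is shown below; the concrete SVM variant and hyperparameters are not specified in this excerpt, so LinearSVC with default settings is an assumption.</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# TF-IDF vectors (English stopwords removed) feeding a one-vs-rest linear SVM.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LinearSVC()),
)

# Usage: model.fit(train_texts, train_labels)
#        predictions = model.predict(test_texts)
      </preformat>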
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Evaluation Settings</title>
        <p>We use a train-test split of 80:20 for the AS and KE data sets. Each of the methods is trained and tested on either the AS data set or the KE data set. In addition, the models trained on the KE and AS data sets are evaluated on the modified AGENDA data set to assess the generalizability of the approaches.</p>
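        <p>Concretely, such a split can be realized with scikit-learn; the toy data and the random_state value below are illustrative, not taken from the paper.</p>
        <preformat>
from sklearn.model_selection import train_test_split

# Toy stand-ins for the extracted problem descriptions and labels.
texts = ["predict failures in rotating machines", "translate text",
         "classify images", "detect objects in video"]
labels = ["CNN", "RNN", "CNN", "CNN"]

# 80:20 train-test split, as used for the AS and KE data sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
        </preformat>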
        <p>
          We consider the following methods: (1) SVM. We use scikit-learn's TfidfVectorizer for numeric representations and an SVM implemented via a one-vs-rest classification scheme. (2) Fine-tuned SciBERT. We use SciBERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a scientific domain-specific, pretrained BERT model, and fine-tune it on the classification task with the Adam optimizer [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. (3) Most frequent class (MFC). We consider the MFC as a baseline.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Evaluation Results</title>
        <p>Precision, recall, F1-score⁵ (all macro-averaged), and accuracy for the MFC baseline, the SVM, and fine-tuned SciBERT are reported in Table 4. The results show that the SVM classifier trained and tested on the KE data set is the most successful with respect to recall, F1 score, and accuracy. It beats the more complex SciBERT classifier by more than 100% in accuracy (0.5908 vs. 0.2576) and F1 score (0.4629 vs. 0.1793). However, we note that accuracy is not an excellent metric for unbalanced data sets.</p>
        <p>Regarding the classifiers trained and tested on the AS data set, the SVM also beats the SciBERT model with respect to precision, F1 score, and accuracy, but by a smaller margin: here, the accuracy of the SVM is 0.17 higher and the F1-score 0.1 higher than those of SciBERT.</p>
        <p>
          ⁵The F1-score is calculated as the arithmetic mean over the individual F1 scores [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <sec id="sec-3-1-1">
          <title>Method</title>
          <p>MFC
SVM
SciBERT
MFC
SVM
SciBERT
MFC
SVM
SciBERT
SVM
SciBERT</p>
          <p>Precision
(Macro)
the F1-score is 0.1 higher than that of SciBERT.</p>
        <p>The SVM and SciBERT models trained on the AS and KE data sets perform better than the MFC baseline in most cases. Notably, MFC achieves a higher accuracy than SciBERT on the KE data set.</p>
        <p>When evaluating the approaches on the mod-AGENDA data, the results drop significantly. Nonetheless, the SVM classifier still achieves the best results, with only a small difference between using the AS and the KE data set as training data. SciBERT still outperforms the MFC baseline.</p>
        <p>The methods trained on the AS data set generalize somewhat better than the methods trained on the KE data set, despite the simpler creation process of the AS data set. A likely reason for this phenomenon is that the AS data set is more similar to the AGENDA data set than the KE data set is. In particular, the research problem descriptions in the KE data set are much shorter than those in the AS data set.</p>
        <p>Overall, given 0.59 and 0.57 as the best accuracy scores and 0.46 and 0.44 as the top F1 scores, we conclude that neural network recommendation based on textual task descriptions is a nontrivial task (motivating our paper), while the results indicate that users (e.g., early-career researchers) might nevertheless find such recommender systems helpful.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This paper introduced the task of recommending neural network architectures based on textual problem descriptions. To this end, we created two data sets of labeled problem descriptions. The first method splits abstracts by means of signaling phrases and labels the problem parts by matching neural network architecture names. The second method uses recurring phrases to extract shorter and more precise problem descriptions via regular expressions. We used both data sets to train and evaluate classifiers. We identified the SVM-based approach as a promising method, outperforming a BERT-based approach.</p>
      <p>
        In the future, we will extend our recommender system to machine learning methods in general and combine it with the recommendation of other scholarly entities, such as data sets [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Furthermore, we plan to provide a running system for neural network architecture recommendation, accompanied by a user study.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Metzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Neural architecture search: A survey</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <volume>55</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          :
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <article-title>FAIRnets Search - A Prototype Search Service to Find Neural Networks</article-title>
          ,
          <source>in: Proceedings of the International Conference on Semantic Systems, SEMANTiCS'19</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sure-Vetter</surname>
          </string-name>
          ,
          <article-title>Making Neural Networks FAIR</article-title>
          ,
          <source>in: Proceedings of the Second Iberoamerican Conference and First IndoAmerican Conference</source>
          ,
          <source>KGSWC'20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Towards More Usable Dataset Search: From Query Characterization to Snippet Generation</article-title>
          ,
          <source>in: Proceedings of 28th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM'19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2445</fpage>
          -
          <lpage>2448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>An Overview of Microsoft Academic Service (MAS) and Applications</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on World Wide Web Companion, WWW'15</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Recommending Academic Papers via Users' Reading Purposes</article-title>
          ,
          <source>in: Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys'12</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caponetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Russotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xibilia</surname>
          </string-name>
          ,
          <article-title>Deep Learning Algorithm for Predictive Maintenance of Rotating Machines Through the Analysis of the Orbits Shape of the Rotor Shaft</article-title>
          ,
          <source>in: Proceedings of the 1st International Conference on Smart Innovation, Ergonomics and Applied Human Factors, SEAHF'19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Koncel-Kedziorski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bekal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Text Generation from Knowledge Graphs with Graph Transformers</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT'19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2284</fpage>
          -
          <lpage>2293</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP'19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3613</fpage>
          -
          <lpage>3618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          ,
          <source>in: Proceedings of the 3rd International Conference on Learning Representations, ICLR'15</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Opitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Burst</surname>
          </string-name>
          ,
          <article-title>Macro F1 and macro F1</article-title>
          ,
          <source>CoRR abs/1911.03347</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Leisinger</surname>
          </string-name>
          ,
          <article-title>Recommending Datasets for Scientific Problem Descriptions</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM'21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3014</fpage>
          -
          <lpage>3018</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>