=Paper=
{{Paper
|id=Vol-2699/paper08
|storemode=property
|title=Simplifying Architecture Search for Graph Neural Network
|pdfUrl=https://ceur-ws.org/Vol-2699/paper08.pdf
|volume=Vol-2699
|authors=Huan Zhao,Lanning Wei,Quanming Yao
|dblpUrl=https://dblp.org/rec/conf/cikm/ZhaoWY20
}}
==Simplifying Architecture Search for Graph Neural Network==
Simplifying Architecture Search for Graph Neural Network

Huan Zhao^a, Lanning Wei^b and Quanming Yao^c
^a 4Paradigm Inc., Shenzhen
^b Institute of Computing Technology, Chinese Academy of Sciences
^c Hong Kong

Abstract

Recent years have witnessed the popularity of Graph Neural Networks (GNN) in various scenarios. To obtain optimal data-specific GNN architectures, researchers turn to neural architecture search (NAS) methods, which have made impressive progress in discovering effective architectures for convolutional neural networks. Two preliminary works, GraphNAS and Auto-GNN, have made the first attempt to apply NAS methods to GNN. Despite the promising results, there are several drawbacks in the expressive capability and search efficiency of GraphNAS and Auto-GNN due to the designed search space. To overcome these drawbacks, we propose the SNAG framework (Simplified Neural Architecture search for Graph neural networks), consisting of a novel search space and a reinforcement learning based search algorithm. Extensive experiments on real-world datasets demonstrate the effectiveness of the SNAG framework compared to human-designed GNNs and NAS methods, including GraphNAS and Auto-GNN.

1. Introduction

In recent years, Graph Neural Networks (GNN) [1, 2] have become a hot topic due to their promising results on various graph-based tasks, e.g., recommendation [3, 4, 5], fraud detection [6], and chemistry [7]. In the literature, various GNN models [8, 9, 10, 11, 12, 13, 6, 5] have been designed for graph-based tasks. Despite the success of these GNN models, they face two challenges. The first is that no single architecture behaves well in all scenarios. For example, in our experiments (Tables 3 and 4), we can see that the best GNN architectures vary across datasets and tasks. This means that huge computational and expertise resources have to be spent on designing and tuning a well-behaved GNN architecture for a specific task, which limits the application of GNN models. Secondly, existing GNN models do not make full use of the best architecture design practices from other established areas, e.g., computer vision (CV). For example, existing multi-layer GNN models tend to stack multiple layers with the same aggregator (see the bottom left of Figure 1), which aggregates hidden features of multi-hop neighbors. However, it remains to be seen whether combinations of different aggregators in a multi-layer GNN model can further improve the performance. These challenges lead to a straightforward question: can we obtain well-behaved data-specific GNN architectures?

To address the above question, researchers turn to neural architecture search (NAS) [14, 15, 16, 17] approaches, which have shown promising results in automatically designing architectures for convolutional neural networks (CNN) and recurrent neural networks (RNN). Very recently, two preliminary works, GraphNAS [18] and Auto-GNN [19], made the first attempt to apply NAS to GNN architecture design. Though GraphNAS and Auto-GNN show some promising results, they suffer from drawbacks in expressive capability and search efficiency due to the designed search space. In the NAS literature [14, 15, 16, 17], a good search space should be both expressive and compact. That is, the search space should be large enough to subsume existing human-designed architectures, so that the performance of a search method can be guaranteed. However, searching becomes extremely costly if the search space is too general, which is impractical for any search method. The search spaces of GraphNAS and Auto-GNN are the same, and neither satisfies the requirements of a good search space well. On the one hand, they fail to include several recent GNN models, e.g., GeniePath [6], for which we give a more detailed analysis in Section 3.1 (Table 1). On the other hand, the search space includes too many choices, making it too complicated to search efficiently.
In this work, to overcome the drawbacks of GraphNAS and Auto-GNN and push forward the research on NAS approaches for GNN, we propose the SNAG framework (Simplified Neural Architecture search for Graph neural networks), consisting of a simpler yet more expressive search space and an RL-based search algorithm. By revisiting extensive existing works, we unify state-of-the-art GNN models in a message passing framework [7], based on which a much more expressive yet simpler search space is designed. The simplified search space can not only emulate a series of existing GNN models, but is also flexible enough to use the weight sharing mechanism, a widely used technique to accelerate search algorithms in the NAS literature. We conduct extensive experiments to demonstrate the effectiveness of the SNAG framework compared to various baselines, including GraphNAS and Auto-GNN.

Figure 1: The whole framework of the proposed SNAG (best viewed in color). (a) Upper left: an example graph with five nodes; the gray rectangles represent the input features attached to each node. (b) Bottom left: a typical 2-layer GNN architecture following the message passing (neighborhood aggregation) schema, which computes the embedding of node "2". (c) Upper right: the reinforcement learning pipeline for NAS. (d) Bottom right: an illustration of the search space of the proposed SNAG using a 2-layer GNN as backbone, which includes the two key components of existing GNN models: node and layer aggregators.

To summarize, the contributions of this work are the following:

• To automatically obtain well-behaved data-specific GNN architectures, we propose the SNAG framework, which overcomes the drawbacks of existing NAS approaches, i.e., GraphNAS and Auto-GNN. To better utilize NAS techniques, we design a novel and effective search space, which can emulate more existing GNN architectures than previous works.

• We design an RL-based search algorithm and a variant adopting the weight sharing mechanism (SNAG-WS). By comparing the performance of these two variants, we show that the weight sharing mechanism is not as empirically useful as imagined, which aligns with the latest research in the NAS literature [20].

• Extensive experiments on real-world datasets are conducted to evaluate the proposed SNAG framework against human-designed GNNs and NAS methods. The experimental results demonstrate the superiority of SNAG in terms of effectiveness and efficiency compared to extensive baseline models.
2. Related Works

2.1. Graph Neural Network (GNN)

GNN was first proposed in [1], and in the past five years many different variants [8, 9, 10, 11, 12, 13, 6] have been designed, all relying on a neighborhood aggregation (or message passing) schema [7]. As shown in the left part of Figure 1, a GNN learns the representation of a given node in a graph by iteratively aggregating the hidden features ("messages") of its neighbors, and the messages can propagate to farther neighborhoods in the graph; e.g., the hidden features of two-hop neighbors can be aggregated in a two-step iteration process.

Let G = (V, E) be a simple graph with node features X ∈ R^{N×d}, where V and E represent the node and edge sets, respectively, N is the number of nodes, and d is the dimension of the node features. We use N(v) to denote the first-order neighbors of a node v in G, i.e., N(v) = {u ∈ V | (v, u) ∈ E}, and \tilde{N}(v) to denote the neighbor set including the node itself, i.e., \tilde{N}(v) = {v} ∪ {u ∈ V | (v, u) ∈ E}.

A K-layer GNN can then be written as follows: the l-th layer (l = 1, ..., K) updates h_v for each node v by aggregating its neighborhood as

h_v^{(l)} = \sigma\big( W^{(l)} \cdot \Phi_n(\{ h_u^{(l-1)}, \forall u \in \tilde{N}(v) \}) \big),   (1)

where h_v^{(l)} ∈ R^{d_l} represents the hidden features of node v learned by the l-th layer and d_l is the corresponding dimension. W^{(l)} is a trainable weight matrix shared by all nodes in the graph, and σ is a non-linear activation function, e.g., a sigmoid or ReLU. Φ_n is the key component, i.e., a pre-defined aggregation function, which varies across different GNN models. For example, in [8] a weighted summation function is designed as the node aggregator, and in [9] different functions, e.g., mean and max pooling, are proposed as aggregators. Further, to weigh the importance of different neighbors, an attention mechanism is incorporated to design the aggregator [10]. A minimal sketch of one such layer is given below.
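To make the aggregation schema of Eq. (1) concrete, the following is a minimal PyTorch sketch of a single node-aggregation layer. It is an illustration under our own assumptions, not the authors' implementation: the class name, the explicit per-node neighbor lists, and the small set of aggregator choices standing in for Φ_n are all assumptions made for readability.

```python
import torch
import torch.nn as nn

class NodeAggregationLayer(nn.Module):
    """One GNN layer following Eq. (1): h_v = sigma(W * Phi_n({h_u, u in N~(v)})).

    A minimal sketch; `aggregator` plays the role of Phi_n (e.g., mean or max),
    and self-loops are assumed to be already included in `neighbors`.
    """
    def __init__(self, in_dim, out_dim, aggregator="mean"):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # trainable W^(l), shared by all nodes
        self.aggregator = aggregator
        self.act = nn.ReLU()                             # sigma

    def forward(self, h, neighbors):
        # h: [N, in_dim] hidden features from the previous layer
        # neighbors: list of index tensors, neighbors[v] = indices of N~(v) (node v included)
        out = []
        for v in range(h.size(0)):
            msgs = h[neighbors[v]]                       # messages from N~(v)
            if self.aggregator == "mean":
                agg = msgs.mean(dim=0)                   # Phi_n = mean (GraphSAGE-MEAN style)
            elif self.aggregator == "max":
                agg = msgs.max(dim=0).values             # Phi_n = element-wise max
            else:
                agg = msgs.sum(dim=0)                    # Phi_n = sum
            out.append(agg)
        return self.act(self.W(torch.stack(out)))        # sigma(W^(l) * Phi_n(...))
```

Stacking K such layers lets information from K-hop neighborhoods reach each node, which is exactly the iterative aggregation described above.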
Usually, the output of the last layer is used as the final representation of each node, denoted as z_v = h_v^{(K)}. In [12], skip-connections [21] are incorporated to propagate messages from intermediate layers to an extra layer, and the final representation of node v is computed by a layer aggregation as

z_v = \Phi_l\big( h_v^{(1)}, \cdots, h_v^{(K)} \big),

where Φ_l can also have different options, e.g., max-pooling or concatenation. Based on the node and layer aggregators, we can define the two key components of existing GNN models, i.e., the neighborhood aggregation function and the range of the neighborhood, which tend to be tuned depending on the task. In Table 1, we list all node and layer aggregators used in this work, which lays the basis for the proposed SNAG framework.

Table 1: Comparison of the search spaces of existing NAS methods and SNAG. For more details of the "Others" column of GraphNAS/Auto-GNN, we refer readers to the corresponding papers.

| Method | Node aggregators | Layer aggregators | Others |
| --- | --- | --- | --- |
| GraphNAS/Auto-GNN | GCN, SAGE-SUM/-MEAN/-MAX, MLP, GAT, GAT-SYM/-COS/-LINEAR/-GEN-LINEAR | - | Hidden embedding size, attention head, activation function |
| Ours | All above plus SAGE-LSTM and GeniePath | CONCAT, MAX, LSTM | IDENTITY, ZERO |

2.2. Neural Architecture Search (NAS)

Neural architecture search (NAS) [14, 15, 16, 17] aims to automatically find better and smaller architectures compared to expert-designed ones, and has shown promising results in architecture design for CNNs and recurrent neural networks (RNN) [22, 23, 24, 25, 26]. In the literature, one family of representative NAS approaches is based on reinforcement learning (RL) [14, 15, 27], which trains an RNN controller in a loop: the controller first generates a candidate architecture by sampling a list of actions (operations) from a pre-defined search space, and the candidate is then trained to convergence to obtain its performance on the given task. The controller uses the performance as the guiding signal to update the RNN parameters, and the whole process is repeated for many iterations to find more promising architectures. GraphNAS [18] and Auto-GNN [19] are the first two RL-based NAS methods for GNN.

The search space is a key component of NAS approaches, and its quality directly affects the final performance and the search efficiency. As mentioned in [14, 15, 23, 28, 27, 22, 29, 26, 25], a good search space should include existing human-designed models, so that the performance of the designed search algorithm can be guaranteed. In this work, by unifying existing GNN models in the message passing framework [7] with the proposed node and layer aggregators, we design a more expressive yet simpler search space, which is also flexible enough to incorporate the weight sharing mechanism into our RL-based method.

3. The Proposed Framework

3.1. The design of search space

As introduced in Section 2.1, most existing GNN architectures rely on a message passing framework [7], which constitutes the backbone of the search space designed in this work. Besides, motivated by JK-Network [13], to further improve the expressive capability, we modify the message passing framework by adding an extra layer which can adaptively combine the outputs of all node aggregation layers. We argue, and demonstrate in the experiments, that these two components are the key parts of a well-behaved GNN model; we denote them as node aggregators and layer aggregators. The former focuses on how to aggregate the neighborhood features, while the latter focuses on the range of the neighborhood to use. Here we introduce the backbone of the proposed search space, shown in the bottom right part of Figure 1, which consists of two key components:

• Node aggregators: we choose 12 node aggregators based on popular GNN models, as presented in Table 1.

• Layer aggregators: we choose 3 layer aggregators, as shown in Table 1. Besides, we have two more operations, IDENTITY and ZERO, related to skip-connections. Instead of requiring skip-connections between all intermediate layers and the final layer as in JK-Network, we generalize this option by searching for the existence of a skip-connection between each intermediate layer and the last layer: to connect, we choose IDENTITY, and ZERO otherwise.

To further inject domain knowledge from existing GNN architectures, when searching for the skip-connections of each GNN layer, we add one more constraint: the last layer should always be used for the final output. Thus, for a K-layer GNN architecture, we need to search K − 1 IDENTITY or ZERO choices for the skip-connection options. An illustrative encoding of one candidate architecture from this search space is sketched below.
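As an illustration of how a candidate in this search space can be described, the snippet below encodes a 3-layer architecture as a list of actions: one node aggregator per layer, one IDENTITY/ZERO decision per intermediate layer, and one layer aggregator. The operation names follow Table 1, but the encoding itself is a sketch under our own assumptions, not the exact representation used by the authors.

```python
import random

# Candidate operations, following Table 1 (names illustrative).
NODE_AGGREGATORS = [
    "GCN", "SAGE-SUM", "SAGE-MEAN", "SAGE-MAX", "SAGE-LSTM", "MLP",
    "GAT", "GAT-SYM", "GAT-COS", "GAT-LINEAR", "GAT-GEN-LINEAR", "GeniePath",
]
LAYER_AGGREGATORS = ["CONCAT", "MAX", "LSTM"]
SKIP_OPTIONS = ["IDENTITY", "ZERO"]   # skip-connection from an intermediate layer to the output, or not

def sample_architecture(num_layers=3, rng=None):
    """Sample one architecture from the SNAG-style search space.

    For a K-layer backbone: K node aggregators, K-1 skip decisions
    (the last layer always feeds the output), and one layer aggregator.
    """
    rng = rng or random.Random()
    return {
        "node_aggregators": [rng.choice(NODE_AGGREGATORS) for _ in range(num_layers)],
        "skip_connections": [rng.choice(SKIP_OPTIONS) for _ in range(num_layers - 1)],
        "layer_aggregator": rng.choice(LAYER_AGGREGATORS),
    }

print(sample_architecture())
# e.g. {'node_aggregators': ['GAT', 'SAGE-MEAN', 'GeniePath'],
#       'skip_connections': ['IDENTITY', 'ZERO'],
#       'layer_aggregator': 'CONCAT'}
```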
3.2. Problem formulation

After designing the search space, denoted as A, the search process implies a bi-level optimization problem [30, 31]:

\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}(\alpha, w^*), \quad \text{s.t.} \quad w^* = \arg\min_{w} \mathcal{L}_{train}(\alpha, w),   (2)

where L_train and L_val represent the training and validation losses, respectively, A represents the search space introduced in Section 3.1, and α and w represent the architecture and model parameters. Eq. (2) denotes a trial-and-error process for the NAS problem, which selects an architecture α from the search space and then trains it from scratch to obtain its best performance. This process is repeated within the given time budget, and the optimal α* found so far is tracked and returned after the search process finishes.

In this work, motivated by the pioneering NAS works [14, 15], we design an RL method to execute the search process. To be specific, during the search phase we use a recurrent neural network (RNN) controller, parameterized by θ_c, to sample a candidate architecture from the search space. The architecture is represented by a list of actions (OPs), including the node aggregators, layer aggregators and IDENTITY/ZERO, as shown in Table 1. The candidate architecture is then trained until convergence, and its accuracy on a held-out validation set D_val is returned. The parameters of the RNN controller are optimized to maximize the expected validation accuracy E_{P(α;θ_c)}[R] on D_val, where P(α; θ_c) is the distribution over architectures parameterized by θ_c and R is the validation accuracy. In this way, the RNN controller generates better architectures over time and can obtain an optimal one at the end of the search phase. After the search process finishes, we derive the searched architectures: we first sample n architectures under the trained distribution P(α; θ_c), and train each of them from scratch with some hyper-parameter tuning, e.g., of the embedding size and learning rate. We then select the best architecture as the searched one, which aligns with the process in previous works [15, 27]. In our experiments, we empirically set n = 10 for simplicity. For more technical details, we refer readers to [15, 18].

Besides, in this work we also incorporate the weight sharing mechanism into our framework and propose the SNAG-WS variant. The key difference between SNAG and SNAG-WS is that in SNAG-WS we create a dictionary to load and save the trained parameters of all OPs (Table 1) of the sampled architectures during the search process. The overall search loop is sketched below.
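The trial-and-error process of Eq. (2), together with the SNAG-WS weight-sharing dictionary, can be summarized by the schematic loop below. The callables passed to `search` (architecture sampling, training/evaluation, controller update) are placeholders we introduce for illustration, not the authors' API.

```python
def search(sample_arch, train_and_eval, update_controller,
           num_iterations=200, weight_sharing=False):
    """Schematic trial-and-error search loop for Eq. (2).

    sample_arch():                   samples an architecture (list of OPs) from the controller
    train_and_eval(arch, ws):        trains the architecture (optionally reusing the shared
                                     weights dict `ws`) and returns its validation accuracy
    update_controller(arch, reward): REINFORCE-style update of the controller parameters

    All three callables stand in for the corresponding SNAG components.
    """
    shared_weights = {}              # SNAG-WS: OP name -> trained parameters
    best_arch, best_acc = None, 0.0
    for _ in range(num_iterations):
        arch = sample_arch()                             # alpha ~ P(alpha; theta_c)
        ws = shared_weights if weight_sharing else None
        acc = train_and_eval(arch, ws)                   # inner level: w* = argmin_w L_train(alpha, w)
        update_controller(arch, acc)                     # maximize E_{P(alpha; theta_c)}[R]
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc

# Toy usage with a random "controller" and a dummy evaluator (illustration only):
# best, acc = search(sample_architecture,                # from the earlier sketch
#                    lambda arch, ws: 0.5,               # dummy train/eval
#                    lambda arch, r: None)
```

After this loop, as described above, n = 10 architectures are sampled from the trained controller, retrained from scratch, and the best one is returned as the searched architecture.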
4. Experiments

4.1. Experimental Settings

4.1.1. Datasets and Tasks

Here we introduce two tasks and the corresponding datasets (Table 2), which are standard ones in the literature [8, 9, 13].

Table 2: Statistics of the datasets in the experiments.

| | Cora | CiteSeer | PubMed | PPI |
| --- | --- | --- | --- | --- |
| Task | Transductive | Transductive | Transductive | Inductive |
| #nodes | 2,708 | 3,327 | 19,717 | 56,944 |
| #edges | 5,278 | 4,552 | 44,324 | 818,716 |
| #features | 1,433 | 3,703 | 500 | 121 |
| #classes | 7 | 6 | 3 | 50 |

Transductive task. Only a subset of the nodes in one graph are used as training data, and the remaining nodes are used as validation and test data. For this setting, we use three benchmark datasets: Cora, CiteSeer and PubMed. They are all citation networks, provided by [32]. Each node represents a paper, and each edge represents a citation relation between two papers. The datasets contain bag-of-words features for each paper (node), and the task is to classify papers into different subjects based on the citation network. For all datasets, we split the nodes of each graph into 60%/20%/20% for training, validation and test. For the transductive task, we use classification accuracy as the evaluation metric.

Inductive task. In this task, we use a number of graphs as training data, and other, completely unseen graphs as validation/test data. For this setting, we use the PPI dataset, provided by [9], on which the task is to classify protein functions. PPI consists of 24 graphs, each corresponding to a human tissue. Each node has positional gene sets, motif gene sets and immunological signatures as features, and gene ontology sets as labels. 20 graphs are used for training, 2 for validation and the rest for testing. For the inductive task, we use Micro-F1 as the evaluation metric.

4.1.2. Compared Methods

We compare SNAG with two groups of state-of-the-art methods: human-designed GNN architectures and NAS methods for GNN.

Human-designed GNNs. We use the following popular GNN architectures: GCN [8], GraphSAGE [9], GAT [10], GIN [12], LGCN [11], and GeniePath [6]. For models with variants, such as different aggregators in GraphSAGE or different attention functions in GAT, we report the best performance across the variants. Besides, we extend the idea of JK-Network [13] to all models except LGCN and obtain 5 more baselines: GCN-JK, GraphSAGE-JK, GAT-JK, GIN-JK and GeniePath-JK, which add an extra layer-aggregation layer. For LGCN, we use the code released by the authors (https://github.com/HongyangGao/LGCN). For the other baselines, we use the popular open source library PyTorch Geometric (PyG) [33] (https://github.com/rusty1s/pytorch_geometric), which implements various GNN models; an illustrative PyG-based construction is sketched at the end of this subsection. For all baselines, we train them from scratch with the best hyper-parameters obtained on the validation datasets and report the test performance. We repeat this process 5 times and report the mean accuracy with standard deviation.

NAS methods for GNN. We consider the following methods: random search (denoted "Random") and Bayesian optimization [34] (denoted "Bayesian"), which search the designed space directly with random sampling and Bayesian optimization, respectively. Besides, GraphNAS [18] (https://github.com/GraphNAS/GraphNAS) is chosen as a NAS baseline.

Note that for human-designed GNNs and NAS methods, for a fair comparison and a good balance between efficiency and performance, we set the number of GNN layers to 3, which is an empirically good choice in the literature [10, 6].
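As a concrete example of how the human-designed baselines and their -JK variants can be built on top of PyTorch Geometric, the sketch below assembles a 3-layer GCN with an optional JK-style layer aggregator using PyG's GCNConv and JumpingKnowledge layers. This is an illustrative construction under our assumptions, not the authors' training code; the class name, hidden sizes and other hyper-parameters are placeholders.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, JumpingKnowledge

class GCNJK(torch.nn.Module):
    """3-layer GCN with an optional JK-style layer aggregator (mode='max' here)."""
    def __init__(self, in_dim, hidden_dim, num_classes, use_jk=True):
        super().__init__()
        self.convs = torch.nn.ModuleList([
            GCNConv(in_dim, hidden_dim),
            GCNConv(hidden_dim, hidden_dim),
            GCNConv(hidden_dim, hidden_dim),
        ])
        self.use_jk = use_jk
        if use_jk:
            self.jk = JumpingKnowledge(mode="max")   # layer aggregator over all layer outputs
        self.out = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        xs = []
        for conv in self.convs:
            x = F.relu(conv(x, edge_index))
            xs.append(x)
        h = self.jk(xs) if self.use_jk else xs[-1]   # -JK variant vs. plain last-layer output
        return self.out(h)
```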
Table 3: Performance comparisons on the transductive tasks. We show the mean classification accuracy (with standard deviation). We categorize baselines into human-designed GNNs and NAS methods.

| Group | Method | Cora | CiteSeer | PubMed |
| --- | --- | --- | --- | --- |
| Human-designed GNN | GCN | 0.8761 (0.0101) | 0.7666 (0.0202) | 0.8708 (0.0030) |
| | GCN-JK | 0.8770 (0.0118) | 0.7713 (0.0136) | 0.8777 (0.0037) |
| | GraphSAGE | 0.8741 (0.0159) | 0.7599 (0.0094) | 0.8784 (0.0044) |
| | GraphSAGE-JK | 0.8841 (0.0015) | 0.7654 (0.0054) | 0.8822 (0.0066) |
| | GAT | 0.8719 (0.0163) | 0.7518 (0.0145) | 0.8573 (0.0066) |
| | GAT-JK | 0.8726 (0.0086) | 0.7527 (0.0128) | 0.8674 (0.0055) |
| | GIN | 0.8600 (0.0083) | 0.7340 (0.0139) | 0.8799 (0.0046) |
| | GIN-JK | 0.8699 (0.0103) | 0.7651 (0.0133) | 0.8828 (0.0054) |
| | GeniePath | 0.8670 (0.0123) | 0.7594 (0.0137) | 0.8796 (0.0039) |
| | GeniePath-JK | 0.8776 (0.0117) | 0.7591 (0.0116) | 0.8818 (0.0037) |
| | LGCN | 0.8687 (0.0075) | 0.7543 (0.0221) | 0.8753 (0.0012) |
| NAS methods | Random | 0.8694 (0.0032) | 0.7820 (0.0020) | 0.8888 (0.0009) |
| | Bayesian | 0.8580 (0.0027) | 0.7650 (0.0021) | 0.8842 (0.0005) |
| | GraphNAS | 0.8840 (0.0071) | 0.7762 (0.0061) | 0.8896 (0.0024) |
| | GraphNAS-WS | 0.8808 (0.0101) | 0.7613 (0.0156) | 0.8842 (0.0103) |
| Ours | SNAG | 0.8826 (0.0023) | 0.7707 (0.0064) | 0.8877 (0.0012) |
| | SNAG-WS | 0.8895 (0.0051) | 0.7695 (0.0069) | 0.8942 (0.0010) |

Table 4: Performance comparisons on the inductive task. We show the Micro-F1 (with standard deviation). We categorize baselines into human-designed GNNs and NAS methods.

| Group | Method | PPI |
| --- | --- | --- |
| Human-designed GNN | GCN | 0.9333 (0.0019) |
| | GCN-JK | 0.9344 (0.0007) |
| | GraphSAGE | 0.9721 (0.0010) |
| | GraphSAGE-JK | 0.9718 (0.0014) |
| | GAT | 0.9782 (0.0005) |
| | GAT-JK | 0.9781 (0.0003) |
| | GIN | 0.9593 (0.0052) |
| | GIN-JK | 0.9641 (0.0029) |
| | GeniePath | 0.9528 (0.0000) |
| | GeniePath-JK | 0.9644 (0.0000) |
| NAS methods | Random | 0.9882 (0.0011) |
| | Bayesian | 0.9897 (0.0008) |
| | GraphNAS | 0.9698 (0.0128) |
| | GraphNAS-WS | 0.9584 (0.0415) |
| Ours | SNAG | 0.9887 (0.0010) |
| | SNAG-WS | 0.9875 (0.0006) |

4.2. Performance comparison

In this part, we analyze the performance comparisons on the different datasets. From Table 3, we can see that the SNAG models, including SNAG-WS, win over all baselines on most datasets except CiteSeer. Considering that the performance of SNAG on CiteSeer is very close to the best result (Random), this demonstrates the effectiveness of NAS methods for GNN. In other words, with SNAG we can obtain well-behaved GNN architectures for a new task. When comparing SNAG with the GraphNAS methods, the performance gain is evident. We attribute this to the superiority of the expressive yet simple search space.

From Table 4, we can see that the performance trend is very similar to that of the transductive task: the NAS methods obtain better or competitive performance compared to human-designed GNNs. When looking at the NAS methods, we can see that our proposed SNAG, as well as Random and Bayesian, outperforms GraphNAS. This also demonstrates the superiority of the designed search space. Taking the results of these two tasks together, we demonstrate the effectiveness of the SNAG models, and especially the superiority of the search space.

4.3. Understanding the search space of SNAG

In this section, we show the simplicity and expressiveness of the designed search space of SNAG from two aspects: the speedup in searching and the performance gain from the layer aggregators.

4.3.1. Speedup in searching

In this part, to show the simplicity of the designed search space, we compare the efficiency of SNAG and GraphNAS by plotting the validation accuracy w.r.t. the running time; the results are shown in Figure 2.

Figure 2: The validation accuracy w.r.t. search time (in seconds, log base 10) on (a) Cora, (b) CiteSeer, (c) PubMed and (d) PPI.

The accuracy is obtained by evaluating each sampled architecture on the validation set after training it from scratch until convergence, which reflects the capability of a NAS method to discover better architectures as time elapses. From Figure 2, we can see that SNAG speeds up the search process significantly compared to GraphNAS, i.e., it obtains better GNN architectures earlier in the search process. Considering that both GraphNAS and SNAG adopt the same RL framework, this advantage is attributed to the simpler and smaller search space; a rough count of its size is given below.
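To give a rough sense of how compact the SNAG space is, the back-of-the-envelope count below (our illustration, not a number reported in the paper) enumerates the architectures in the 3-layer backbone used in the experiments, with and without the layer aggregators of Table 1. GraphNAS and Auto-GNN additionally search per-layer hidden sizes, attention heads and activation functions, which multiplies their space further.

```python
# Rough count of architectures in the SNAG-style search space for a K-layer backbone,
# assuming 12 node aggregators, 3 layer aggregators and binary (IDENTITY/ZERO)
# skip decisions for the K-1 intermediate layers, as described in Section 3.1.
K = 3
n_node_agg, n_layer_agg, n_skip = 12, 3, 2

with_layer_agg = n_node_agg**K * n_skip**(K - 1) * n_layer_agg
node_agg_only = n_node_agg**K

print(with_layer_agg)   # 12^3 * 2^2 * 3 = 20736
print(node_agg_only)    # 12^3           = 1728
```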
4.3.2. Influence of layer aggregators

In this part, to show the stronger expressive capability of the designed search space, we conduct experiments on all datasets using a search space with only the node aggregators, i.e., removing the layer aggregators, as comparison. The results are shown in Table 5; we report the test accuracies of both SNAG and SNAG-WS. From Table 5, we can see that the performance consistently drops on all datasets when removing the layer aggregators, which demonstrates the importance of the layer aggregators for the final performance and aligns with the observation in Section 4.2 that the performance of human-designed GNNs can be improved by adopting the JK-Network architecture.

Table 5: Performance comparisons of SNAG and SNAG-WS using different search spaces (with and without layer aggregators).

| | SNAG (w/ layer agg.) | SNAG (w/o layer agg.) | SNAG-WS (w/ layer agg.) | SNAG-WS (w/o layer agg.) |
| --- | --- | --- | --- | --- |
| Cora | 0.8826 (0.0023) | 0.8822 (0.0071) | 0.8895 (0.0051) | 0.8892 (0.0062) |
| CiteSeer | 0.7707 (0.0064) | 0.7335 (0.0025) | 0.7695 (0.0069) | 0.7530 (0.0034) |
| PubMed | 0.8877 (0.0012) | 0.8756 (0.0016) | 0.8942 (0.0010) | 0.8800 (0.0013) |
| PPI | 0.9887 (0.0010) | 0.9849 (0.0040) | 0.9875 (0.0006) | 0.9861 (0.0009) |

5. Conclusion and Future work

In this work, to overcome the drawbacks in expressive capability and search efficiency of two existing NAS approaches for GNN, i.e., GraphNAS [18] and Auto-GNN [19], we propose the SNAG framework, i.e., Simplified Neural Architecture search for GNN. By revisiting existing works, we unify state-of-the-art GNN models in a message passing framework [7] and design a simpler yet more expressive search space than that of GraphNAS and Auto-GNN. An RL-based search algorithm is designed, and a variant (SNAG-WS) is also proposed by incorporating the weight sharing mechanism. Through extensive experiments on real-world datasets, we not only demonstrate the effectiveness of the proposed SNAG framework compared to various baselines, including GraphNAS and Auto-GNN, but also provide a better understanding of the different components of the proposed SNAG. For future work, we will explore the SNAG framework in more graph-based tasks beyond node classification.

References

[1] M. Gori, G. Monfardini, F. Scarselli, A new model for learning in graph domains, in: IJCNN, volume 2, 2005, pp. 729–734.
[2] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv:1806.01261 (2018).
[3] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: KDD, 2018, pp. 974–983.
[4] J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, D. L. Lee, Billion-scale commodity embedding for e-commerce recommendation in alibaba, in: KDD, 2018, pp. 839–848.
[5] W. Xiao, H. Zhao, H. Pan, Y. Song, V. W. Zheng, Q. Yang, Beyond personalization: Social content recommendation for creator equality and consumer satisfaction, in: KDD, 2019, pp. 235–245.
[6] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, L. Song, Y. Qi, Geniepath: Graph neural networks with adaptive receptive paths, in: AAAI, volume 33, 2019, pp. 4424–4431.
[7] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural message passing for quantum chemistry, in: ICML, 2017, pp. 1263–1272.
[8] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, ICLR (2016).
[9] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: NeurIPS, 2017, pp. 1024–1034.
[10] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, ICLR (2018).
[11] H. Gao, Z. Wang, S. Ji, Large-scale learnable graph convolutional networks, in: KDD, 2018, pp. 1416–1424.
[12] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: ICLR, 2019.
[13] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, S. Jegelka, Representation learning on graphs with jumping knowledge networks, in: ICML, 2018, pp. 5449–5458.
[14] B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural network architectures using reinforcement learning, ICLR (2017).
[15] B. Zoph, Q. V. Le, Neural architecture search with reinforcement learning, ICLR (2017).
[16] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: A survey, JMLR (2018).
[17] Q. Yao, M. Wang, Taking human out of learning applications: A survey on automated machine learning, arXiv preprint arXiv:1810.13306 (2018).
[18] Y. Gao, H. Yang, P. Zhang, C. Zhou, Y. Hu, Graph neural architecture search, in: IJCAI, 2020, pp. 1403–1409.
[19] K. Zhou, Q. Song, X. Huang, X. Hu, Auto-GNN: Neural architecture search of graph neural networks, Technical Report, arXiv preprint arXiv:1909.03184, 2019.
[20] C. Sciuto, K. Yu, M. Jaggi, C. Musat, M. Salzmann, Evaluating the search phase of neural architecture search, ICLR (2020).
[21] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
[22] H. Liu, K. Simonyan, Y. Yang, DARTS: Differentiable architecture search, ICLR (2019).
[23] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: CVPR, 2018, pp. 8697–8710.
[24] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: ICML, 2019, pp. 6105–6114.
[25] Q. Yao, J. Xu, W.-W. Tu, Z. Zhu, Efficient neural architecture search via proximal iterations, in: AAAI, 2020, pp. 6664–6671.
[26] Y. Zhang, Q. Yao, L. Chen, Neural recurrent structure search for knowledge graph embedding, Technical Report, International Workshop on Knowledge Graph@KDD, 2019.
[27] H. Pham, M. Guan, B. Zoph, Q. Le, J. Dean, Efficient neural architecture search via parameter sharing, in: ICML, 2018, pp. 4092–4101.
[28] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, Q. V. Le, Understanding and simplifying one-shot architecture search, in: ICML, 2018, pp. 549–558.
[29] L. Li, A. Talwalkar, Random search and reproducibility for neural architecture search, arXiv preprint arXiv:1902.07638 (2019).
[30] B. Colson, P. Marcotte, G. Savard, An overview of bilevel optimization, Annals of Operations Research 153 (2007) 235–256.
[31] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, M. Pontil, Bilevel programming for hyperparameter optimization and meta-learning, in: ICML, 2018, pp. 1568–1577.
[32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (2008) 93–93.
[33] M. Fey, J. E. Lenssen, Fast graph representation learning with PyTorch Geometric, in: ICLRW, 2019.
[34] J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: NeurIPS, 2011, pp. 2546–2554.