=Paper= {{Paper |id=Vol-2699/paper08 |storemode=property |title=Simplifying Architecture Search for Graph Neural Network |pdfUrl=https://ceur-ws.org/Vol-2699/paper08.pdf |volume=Vol-2699 |authors=Huan Zhao,Lanning Wei,Quanming Yao |dblpUrl=https://dblp.org/rec/conf/cikm/ZhaoWY20 }} ==Simplifying Architecture Search for Graph Neural Network== https://ceur-ws.org/Vol-2699/paper08.pdf
Simplifying Architecture Search for Graph Neural Network
Huan Zhaoᵃ, Lanning Weiᵇ and Quanming Yaoᶜ
ᵃ 4Paradigm Inc., Shenzhen
ᵇ Institute of Computing Technology, Chinese Academy of Sciences
ᶜ Hong Kong


Abstract

Recent years have witnessed the popularity of Graph Neural Networks (GNN) in various scenarios. To obtain optimal data-specific GNN architectures, researchers turn to neural architecture search (NAS) methods, which have made impressive progress in discovering effective architectures for convolutional neural networks. Two preliminary works, GraphNAS and Auto-GNN, have made the first attempt to apply NAS methods to GNN. Despite the promising results, there are several drawbacks in the expressive capability and search efficiency of GraphNAS and Auto-GNN due to their designed search space. To overcome these drawbacks, we propose the SNAG framework (Simplified Neural Architecture search for Graph neural networks), consisting of a novel search space and a reinforcement learning based search algorithm. Extensive experiments on real-world datasets demonstrate the effectiveness of the SNAG framework compared to human-designed GNNs and NAS methods, including GraphNAS and Auto-GNN.


Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland.
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).


1. Introduction

In recent years, Graph Neural Networks (GNN) [1, 2] have been a hot topic due to their promising results on various graph-based tasks, e.g., recommendation [3, 4, 5], fraud detection [6], and chemistry [7]. In the literature, various GNN models [8, 9, 10, 11, 12, 13, 6, 5] have been designed for graph-based tasks. Despite the success of these GNN models, two challenges remain. The first is that there is no single architecture that always behaves well in different scenarios. For example, in our experiments (Tables 3 and 4), we can see that the best GNN architectures vary across datasets and tasks. This means that we have to spend huge computational and expertise resources designing and tuning a well-behaved GNN architecture for a specific task, which limits the application of GNN models. Secondly, existing GNN models do not make full use of the best architecture design practices from other established areas, e.g., computer vision (CV). For example, existing multi-layer GNN models tend to stack multiple layers with the same aggregator (see the bottom left of Figure 1), which aggregates hidden features of multi-hop neighbors. However, it remains to be seen whether combinations of different aggregators in a multi-layer GNN model can further improve the performance. These challenges lead to a straightforward question: can we obtain well-behaved data-specific GNN architectures?

To address this question, researchers turn to neural architecture search (NAS) [14, 15, 16, 17] approaches, which have shown promising results in automatically designing architectures for convolutional neural networks (CNN) and recurrent neural networks (RNN). Very recently, two preliminary works, GraphNAS [18] and Auto-GNN [19], made the first attempt to apply NAS to GNN architecture design. Though GraphNAS and Auto-GNN show some promising results, they have drawbacks in expressive capability and search efficiency due to their designed search space. In the NAS literature [14, 15, 16, 17], a good search space should be both expressive and compact. That is, the search space should be large enough to subsume existing human-designed architectures, so that the performance of a search method can be guaranteed. However, the search will be extremely costly if the search space is too general, which is impractical for any search method. The search spaces of GraphNAS and Auto-GNN are the same, and neither satisfies the requirements of a good search space well. On one hand, they fail to include several of the latest GNN models, e.g., GeniePath [6], for which we give a more detailed analysis in Section 3.1 (Table 1). On the other hand, the search space includes too many choices, making it too complicated to search efficiently.

In this work, to overcome the drawbacks of GraphNAS and Auto-GNN and push forward the research of NAS approaches for GNN, we propose the SNAG framework (Simplified Neural Architecture search for Graph neural networks), consisting of a simpler yet more expressive search space and an RL-based search algorithm. By revisiting extensive existing works, we unify state-of-the-art GNN models in a message passing framework [7], based on which a much more expressive yet simpler search space is designed. The simplified search space can not only emulate a series of existing GNN models, but is also flexible enough to use the weight sharing mechanism, a widely used technique to accelerate the search algorithm in the NAS literature. We conduct extensive experiments to demonstrate the effectiveness of the SNAG framework compared to various baselines, including GraphNAS and Auto-GNN.
Figure 1: The whole framework of the proposed SNAG (best viewed in color). (a) Upper left: an example graph with five nodes; the gray rectangles represent the input features attached to each node. (b) Bottom left: a typical 2-layer GNN architecture following the message passing neighborhood aggregation schema, which computes the embedding of node "2". (c) Upper right: the reinforcement learning pipeline for NAS. (d) Bottom right: an illustration of the search space of the proposed SNAG using a 2-layer GNN as the backbone, which includes the two key components of existing GNN models: node and layer aggregators.




To summarize, the contributions of this work are as follows:

• To automatically obtain well-behaved data-specific GNN architectures, we propose the SNAG framework, which overcomes the drawbacks of existing NAS approaches, i.e., GraphNAS and Auto-GNN. To better utilize NAS techniques, we design a novel and effective search space, which can emulate more existing GNN architectures than previous works.

• We design an RL-based search algorithm and a variant adopting the weight sharing mechanism (SNAG-WS). By comparing the performance of these two variants, we show that the weight sharing mechanism is not as empirically useful as imagined, which aligns with the latest research in the NAS literature [20].

• Extensive experiments on real-world datasets are conducted to evaluate the proposed SNAG framework against human-designed GNNs and NAS methods. The experimental results demonstrate the superiority of SNAG in terms of effectiveness and efficiency compared to extensive baseline models.


2. Related Works

2.1. Graph Neural Network (GNN)

GNN was first proposed in [1], and in the past five years many different variants [8, 9, 10, 11, 12, 13, 6] have been designed, all of which rely on a neighborhood aggregation (or message passing) schema [7]. As shown in the left part of Figure 1, it tries to learn the representation of a given node in a graph by iteratively aggregating the hidden features ("messages") of its neighbors, and the message can propagate to farther neighborhoods in the graph, e.g., the hidden features of two-hop neighbors can be aggregated in a two-step iteration process. Let 𝒢 = (𝒱, ℰ) be a simple graph with node features X ∈ ℝ^{N×d}, where 𝒱 and ℰ represent the node and edge sets, respectively, N represents the number of nodes, and d is the dimension of node features. We use N(v) to represent the first-order neighbors of a node v in 𝒢, i.e., N(v) = {u ∈ 𝒱 | (v, u) ∈ ℰ}. We also define Ñ(v) as the neighbor set of v including itself, i.e., Ñ(v) = {v} ∪ {u ∈ 𝒱 | (v, u) ∈ ℰ}.

Then a K-layer GNN can be written as follows: the l-th layer (l = 1, · · · , K) updates h_v for each node v by aggregating its neighborhood as

    h_v^(l) = σ( W^(l) · Φ_n({ h_u^(l−1), ∀u ∈ Ñ(v) }) ),        (1)

where h_v^(l) ∈ ℝ^{d_l} represents the hidden features of a node v learned by the l-th layer, and d_l is the corresponding dimension.
Table 1
Comparison of the search spaces of existing NAS methods and SNAG. For more details of the "Others" column of GraphNAS/Auto-GNN, we refer readers to the corresponding papers.

• GraphNAS/Auto-GNN — Node aggregators: GCN, SAGE-SUM/-MEAN/-MAX, MLP, GAT, GAT-SYM/-COS/-LINEAR/-GEN-LINEAR; Layer aggregators: none; Others: hidden embedding size, attention head, activation function.
• Ours — Node aggregators: all of the above plus SAGE-LSTM and GeniePath; Layer aggregators: CONCAT, MAX, LSTM; Others: IDENTITY, ZERO.



W^(l) is a trainable weight matrix shared by all nodes in the graph, and σ is a non-linear activation function, e.g., a sigmoid or ReLU. Φ_n is the key component, i.e., a pre-defined aggregation function, which varies across different GNN models. For example, in [8], a weighted summation function is designed as the node aggregator, and in [9], different functions, e.g., mean and max pooling, are proposed as the aggregators. Further, to weigh the importance of different neighbors, the attention mechanism is incorporated to design the aggregators [10].

Usually, the output of the last layer is used as the final representation for each node, which is denoted as z_v = h_v^(K). In [12], skip-connections [21] are incorporated to propagate messages from intermediate layers to an extra layer, and the final representation of node v is computed by a layer aggregation as z_v = Φ_l(h_v^(1), · · · , h_v^(K)), where Φ_l can also have different options, e.g., max-pooling or concatenation. Based on the node and layer aggregators, we can define the two key components of existing GNN models, i.e., the neighborhood aggregation function and the range of the neighborhood, which tend to be tuned depending on the task. In Table 1, we list all node and layer aggregators used in this work, which lays the basis for the proposed SNAG framework.
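To make the node and layer aggregators above concrete, the following minimal NumPy sketch implements Eq. (1) with a mean node aggregator Φ_n and a JK-style CONCAT layer aggregator Φ_l. The function names, the toy graph, and the choice of ReLU for σ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mean_aggregator(h, neighbors):
    """Node aggregator Phi_n: average the hidden features over N~(v) (node v plus its neighbors)."""
    return np.stack([h[nbrs].mean(axis=0) for nbrs in neighbors])

def gnn_forward(x, neighbors, weights):
    """Run a K-layer GNN following Eq. (1), keeping every layer's output for the layer aggregator.

    x:         [N, d] input node features
    neighbors: list of index arrays, neighbors[v] = N~(v) (including v itself)
    weights:   list of K weight matrices W^(l)
    """
    h, layer_outputs = x, []
    for W in weights:                      # l = 1, ..., K
        h = mean_aggregator(h, neighbors)  # Phi_n({h_u^(l-1), u in N~(v)})
        h = np.maximum(h @ W, 0.0)         # sigma(W^(l) * ...), ReLU here
        layer_outputs.append(h)
    # Layer aggregator Phi_l: CONCAT as in JK-Network (MAX and LSTM are the other options in Table 1)
    return np.concatenate(layer_outputs, axis=1)

# Toy usage on a random 5-node graph (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
neighbors = [np.array([0, 1]), np.array([1, 0, 2, 3]), np.array([2, 1, 4]),
             np.array([3, 1]), np.array([4, 2])]
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16))]
print(gnn_forward(x, neighbors, weights).shape)  # (5, 32): two layers of size 16, concatenated
```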
2.2. Neural Architecture Search (NAS)

Neural architecture search (NAS) [14, 15, 16, 17] aims to automatically find better and smaller architectures compared to expert-designed ones, and has shown promising results in architecture design for CNN and recurrent neural networks (RNN) [22, 23, 24, 25, 26]. In the literature, one representative family of NAS approaches is based on reinforcement learning (RL) [14, 15, 27], which trains an RNN controller in a loop: the controller first generates a candidate architecture by sampling a list of actions (operations) from a pre-defined search space, and then trains it to convergence to obtain its performance on the given task. The controller then uses the performance as the guiding signal to update the RNN parameters, and the whole process is repeated for many iterations to find more promising architectures. GraphNAS [18] and Auto-GNN [19] are the first two RL-based NAS methods for GNN.

The search space is a key component of NAS approaches, and its quality directly affects the final performance and search efficiency. As mentioned in [14, 15, 23, 28, 27, 22, 29, 26, 25], a good search space should include existing human-designed models, so that the performance of a designed search algorithm can be guaranteed. In this work, by unifying existing GNN models in the message passing framework [7] with the proposed node and layer aggregators, we design a more expressive yet simpler search space, which is also flexible enough to incorporate the weight sharing mechanism into our RL-based method.


3. The Proposed Framework

3.1. The design of search space

As introduced in Section 2.1, most existing GNN architectures rely on a message passing framework [7], which constitutes the backbone of the designed search space in this work. Besides, motivated by JK-Network [13], to further improve the expressive capability, we modify the message passing framework by adding an extra layer which can adaptively combine the outputs of all node aggregation layers. In this work, we argue and demonstrate in the experiments that these two components are the key parts of a well-behaved GNN model, denoted as node aggregators and layer aggregators. The former focuses on how to aggregate the neighborhood features, while the latter focuses on the range of the neighborhood to use. Here we introduce the backbone of the proposed search space, as shown in the bottom right part of Figure 1, which consists of two key components:

• Node aggregators: We choose 12 node aggregators based on popular GNN models; they are presented in Table 1.

• Layer aggregators: We choose 3 layer aggregators, as shown in Table 1. Besides, we have two more operations, IDENTITY and ZERO, related to skip-connections. Instead of requiring skip-connections between all intermediate layers and the final layer as in JK-Network, in this work we generalize this option by proposing to search for the existence of a skip-connection between each intermediate layer and the last layer. To connect, we choose IDENTITY, and ZERO otherwise.

To further inject domain knowledge from existing GNN architectures, when searching for the skip-connections for each GNN layer, we add one more constraint that the last layer should always be used as the final output; thus for a K-layer GNN architecture, we need to search K − 1 IDENTITY or ZERO choices for the skip-connection options.
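As a sketch of how a candidate in this search space can be represented, the snippet below encodes one architecture for a K-layer backbone as a list of OPs: one node aggregator per layer, one IDENTITY/ZERO decision for each of the K − 1 intermediate layers, and one layer aggregator. The operation names abbreviate the entries of Table 1, and the encoding itself is an illustrative assumption rather than the paper's code.

```python
import random

NODE_AGGREGATORS = [  # the 12 node aggregators of Table 1 (names abbreviated here)
    "GCN", "SAGE-SUM", "SAGE-MEAN", "SAGE-MAX", "SAGE-LSTM", "MLP", "GAT",
    "GAT-SYM", "GAT-COS", "GAT-LINEAR", "GAT-GEN-LINEAR", "GENIEPATH",
]
LAYER_AGGREGATORS = ["CONCAT", "MAX", "LSTM"]
SKIP_OPS = ["IDENTITY", "ZERO"]

def sample_architecture(k=3):
    """Uniformly sample one candidate architecture (a list of actions/OPs) from the search space."""
    return {
        "node_aggregators": [random.choice(NODE_AGGREGATORS) for _ in range(k)],
        "skip_connections": [random.choice(SKIP_OPS) for _ in range(k - 1)],  # last layer always kept
        "layer_aggregator": random.choice(LAYER_AGGREGATORS),
    }

# For k = 3 this space has 12^3 * 2^2 * 3 = 20,736 candidates, far fewer choices than
# GraphNAS/Auto-GNN, which additionally search hidden sizes, attention heads, activations, etc.
print(sample_architecture())
```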
3.2. Problem formulation

After designing the search space, denoted as 𝒜, the search process implies a bi-level optimization problem [30, 31], as shown in the following:

    min_{α∈𝒜}  ℒ_val(α, w*),                                     (2)
      s.t.  w* = argmin_w  ℒ_train(α, w),

where ℒ_train and ℒ_val represent the training and validation loss, respectively, and 𝒜 represents the search space introduced in Section 3.1. α and w represent the architecture and model parameters. Eq. (2) denotes a trial-and-error process for the NAS problem, which selects an architecture α from the search space and then trains it from scratch to obtain the best performance. This process is repeated during the given time budget, and the optimal α* is kept track of and returned after the search process finishes.

In this work, motivated by the pioneering NAS works [14, 15], we design an RL method to execute the search process. To be specific, during the search phase, we use a recurrent neural network (RNN) controller, parameterized by θ_c, to sample a candidate architecture from the search space. The architecture is represented by a list of actions (OPs), including the node aggregators, layer aggregators, and IDENTITY/ZERO, as shown in Table 1. The candidate architecture is then trained till convergence, and the accuracy on a held-out validation set 𝒟_val is returned. The parameters of the RNN controller are then optimized in order to maximize the expected validation accuracy E_{P(α;θ_c)}[ℛ] on 𝒟_val, where P(α; θ_c) is the distribution of architectures parameterized by θ_c, and ℛ is the validation accuracy. In this way, the RNN controller will generate better architectures over time and can obtain an optimal one at the end of the search phase. After finishing the search process, we need to derive the searched architectures. We first sample n architectures under the trained distribution P(α; θ_c), and for each architecture, we train it from scratch with some hyper-parameter tuning, e.g., of the embedding size and learning rate. We then select the best architecture as the searched one, which aligns with the process in previous works [15, 27]. In our experiments, we empirically set n = 10 for simplicity. For more technical details, we refer readers to [15, 18].
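The following is a simplified sketch of this RL-based search loop. For brevity it replaces the RNN controller of [15, 18] with independent learnable logits per decision slot, and train_and_eval() stands in for "train the sampled GNN from scratch and return its accuracy on the held-out validation set"; it is an illustrative REINFORCE loop under those assumptions, not the exact SNAG implementation.

```python
import torch

K = 3
slot_sizes = [12] * K + [2] * (K - 1) + [3]      # node aggregators, IDENTITY/ZERO, layer aggregator
logits = [torch.zeros(n, requires_grad=True) for n in slot_sizes]
optimizer = torch.optim.Adam(logits, lr=0.05)
baseline = 0.0                                   # moving-average baseline to reduce variance

def train_and_eval(actions):
    # Placeholder: build the GNN described by `actions`, train it to convergence,
    # and return the validation accuracy R.
    return torch.rand(()).item()

for step in range(200):
    dists = [torch.distributions.Categorical(logits=l) for l in logits]
    actions = [d.sample() for d in dists]
    reward = train_and_eval([a.item() for a in actions])
    baseline = 0.9 * baseline + 0.1 * reward
    log_prob = sum(d.log_prob(a) for d, a in zip(dists, actions))
    loss = -(reward - baseline) * log_prob       # REINFORCE: ascend E_{P(alpha; theta_c)}[R]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After such a loop, the derivation step described above would sample n = 10 architectures from the learned distribution and retrain each from scratch to pick the best one.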
Besides, in this work, we also incorporate the weight sharing mechanism into our framework and propose the SNAG-WS variant. The key difference between SNAG and SNAG-WS lies in that we create a dictionary to load and save the trained parameters of all OPs (Table 1) of a sampled architecture during the search process.
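A sketch of what such a weight-sharing dictionary could look like is given below: parameters are keyed by (layer index, OP name), loaded before training a newly sampled architecture, and saved back afterwards. The model.layers attribute and the architecture dictionary are illustrative assumptions following the encoding sketched in Section 3.1.

```python
shared_weights = {}  # (layer_idx, op_name) -> trained parameters (a state_dict)

def load_shared(model, architecture):
    """Before training a sampled architecture, reuse OP parameters trained in earlier samples."""
    for layer_idx, op_name in enumerate(architecture["node_aggregators"]):
        key = (layer_idx, op_name)
        if key in shared_weights:
            model.layers[layer_idx].load_state_dict(shared_weights[key])

def save_shared(model, architecture):
    """After training, store the updated OP parameters for later sampled architectures."""
    for layer_idx, op_name in enumerate(architecture["node_aggregators"]):
        shared_weights[(layer_idx, op_name)] = model.layers[layer_idx].state_dict()
```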
4. Experiments

4.1. Experimental Settings

4.1.1. Datasets and Tasks.

Here, we introduce two tasks and the corresponding datasets (Table 2), which are standard ones in the literature [8, 9, 13].

Table 2
Dataset statistics of the datasets in the experiments.

                  Transductive                          Inductive
                  Cora        CiteSeer     PubMed       PPI
    #nodes        2,708       3,327        19,717       56,944
    #edges        5,278       4,552        44,324       818,716
    #features     1,433       3,703        500          121
    #classes      7           6            3            50

Transductive Task. Only a subset of nodes in one graph are used as training data, and the other nodes are used as validation and test data. For this setting, we use three benchmark datasets: Cora, CiteSeer, and PubMed. They are all citation networks, provided by [32]. Each node represents a paper, and each edge represents the citation relation between two papers. The datasets contain bag-of-words features for each paper (node), and the task is to classify papers into different subjects based on the citation networks.

For all datasets, we split the nodes in all graphs into 60%, 20%, and 20% for training, validation, and test. For the transductive task, we use the classification accuracy as the evaluation metric.

Inductive Task. In this task, we use a number of graphs as training data, and other completely unseen graphs as validation/test data. For this setting, we use the PPI dataset, provided by [9], on which the task is to classify protein functions. PPI consists of 24 graphs, each corresponding to a human tissue. Each node has positional gene sets, motif gene sets, and immunological signatures as features, and gene ontology sets as labels. 20 graphs are used for training, 2 graphs for validation, and the rest for testing. For the inductive task, we use Micro-F1 as the evaluation metric.
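For reference, the datasets in Table 2 are available in PyTorch Geometric, the library used for the baselines below; the snippet sketches how they could be loaded, and the random 60%/20%/20% node split is illustrative rather than the exact split used in the paper.

```python
import torch
from torch_geometric.datasets import Planetoid, PPI

cora = Planetoid(root="data/Planetoid", name="Cora")[0]   # likewise "CiteSeer" and "PubMed"
ppi_train = PPI(root="data/PPI", split="train")           # 20 training graphs
ppi_val = PPI(root="data/PPI", split="val")               # 2 validation graphs
ppi_test = PPI(root="data/PPI", split="test")             # 2 test graphs

# Illustrative 60%/20%/20% split of the nodes for the transductive datasets.
perm = torch.randperm(cora.num_nodes)
n_train, n_val = int(0.6 * cora.num_nodes), int(0.2 * cora.num_nodes)
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]
```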
Table 3
Performance comparisons in transductive tasks. We show the mean classification accuracy (with standard deviation). We categorize baselines into human-designed GNNs and NAS methods. The best results in different groups of baselines are underlined, and the best result on each dataset is in boldface.

    Methods               Cora              CiteSeer          PubMed
Human-designed GNN
    GCN                   0.8761 (0.0101)   0.7666 (0.0202)   0.8708 (0.0030)
    GCN-JK                0.8770 (0.0118)   0.7713 (0.0136)   0.8777 (0.0037)
    GraphSAGE             0.8741 (0.0159)   0.7599 (0.0094)   0.8784 (0.0044)
    GraphSAGE-JK          0.8841 (0.0015)   0.7654 (0.0054)   0.8822 (0.0066)
    GAT                   0.8719 (0.0163)   0.7518 (0.0145)   0.8573 (0.0066)
    GAT-JK                0.8726 (0.0086)   0.7527 (0.0128)   0.8674 (0.0055)
    GIN                   0.8600 (0.0083)   0.7340 (0.0139)   0.8799 (0.0046)
    GIN-JK                0.8699 (0.0103)   0.7651 (0.0133)   0.8828 (0.0054)
    GeniePath             0.8670 (0.0123)   0.7594 (0.0137)   0.8796 (0.0039)
    GeniePath-JK          0.8776 (0.0117)   0.7591 (0.0116)   0.8818 (0.0037)
    LGCN                  0.8687 (0.0075)   0.7543 (0.0221)   0.8753 (0.0012)
NAS methods
    Random                0.8694 (0.0032)   0.7820 (0.0020)   0.8888 (0.0009)
    Bayesian              0.8580 (0.0027)   0.7650 (0.0021)   0.8842 (0.0005)
    GraphNAS              0.8840 (0.0071)   0.7762 (0.0061)   0.8896 (0.0024)
    GraphNAS-WS           0.8808 (0.0101)   0.7613 (0.0156)   0.8842 (0.0103)
Ours
    SNAG                  0.8826 (0.0023)   0.7707 (0.0064)   0.8877 (0.0012)
    SNAG-WS               0.8895 (0.0051)   0.7695 (0.0069)   0.8942 (0.0010)


Table 4
Performance comparisons in inductive tasks. We show the Micro-F1 (with standard deviation). We categorize baselines into human-designed GNNs and NAS methods. The best results in different groups of baselines are underlined, and the best result is in boldface.

    Methods               PPI
Human-designed GNN
    GCN                   0.9333 (0.0019)
    GCN-JK                0.9344 (0.0007)
    GraphSAGE             0.9721 (0.0010)
    GraphSAGE-JK          0.9718 (0.0014)
    GAT                   0.9782 (0.0005)
    GAT-JK                0.9781 (0.0003)
    GIN                   0.9593 (0.0052)
    GIN-JK                0.9641 (0.0029)
    GeniePath             0.9528 (0.0000)
    GeniePath-JK          0.9644 (0.0000)
NAS methods
    Random                0.9882 (0.0011)
    Bayesian              0.9897 (0.0008)
    GraphNAS              0.9698 (0.0128)
    GraphNAS-WS           0.9584 (0.0415)
Ours
    SNAG                  0.9887 (0.0010)
    SNAG-WS               0.9875 (0.0006)



4.1.2. Compared Methods

We compare SNAG with two groups of state-of-the-art methods: human-designed GNN architectures and NAS methods for GNN.

Human-designed GNNs. We use the following popular GNN architectures: GCN [8], GraphSAGE [9], GAT [10], GIN [12], LGCN [11], and GeniePath [6]. For models with variants, like different aggregators in GraphSAGE or different attention functions in GAT, we report the best performance across the variants. Besides, we extend the idea of JK-Network [13] to all models except LGCN, and obtain 5 more baselines: GCN-JK, GraphSAGE-JK, GAT-JK, GIN-JK, and GeniePath-JK, which add an extra layer. For LGCN, we use the code released by the authors¹. For the other baselines, we use the popular open-source library PyTorch Geometric (PyG) [33]², which implements various GNN models. For all baselines, we train them from scratch with the best hyper-parameters obtained on the validation datasets and report the test performance. We repeat this process 5 times and report the final mean accuracy with standard deviation.

NAS methods for GNN. We consider the following methods: random search (denoted as "Random") and Bayesian optimization [34] (denoted as "Bayesian"), which search directly in the designed search space with random sampling and Bayesian optimization, respectively. Besides, GraphNAS³ [18] is chosen as a NAS baseline.

Note that for human-designed GNNs and NAS methods, for a fair comparison and a good balance between efficiency and performance, we set the number of GNN layers to 3, which is an empirically good choice in the literature [10, 6].

¹ https://github.com/HongyangGao/LGCN
² https://github.com/rusty1s/pytorch_geometric
³ https://github.com/GraphNAS/GraphNAS

4.2. Performance comparison

In this part, we give the analysis of the performance comparisons on different datasets.

From Table 3, we can see that the SNAG models, including SNAG-WS, win over all baselines on most datasets except CiteSeer. Considering that the performance of SNAG on CiteSeer is very close to the best one (Random), this demonstrates the effectiveness of NAS methods on GNN. In other words, with SNAG, we can obtain well-behaved GNN architectures given a new task. When comparing SNAG with the GraphNAS methods, the performance gain is evident. We attribute this to the superiority of the expressive yet simple search space.

From Table 4, we can see that the performance trend is very similar to that in the transductive task, namely that the NAS methods obtain better or competitive performance compared to human-designed GNNs. When looking at the NAS methods, we can see that our proposed SNAG, Random, and Bayesian outperform GraphNAS. This also demonstrates the superiority of the designed search space.

Taking into consideration the results of these two tasks, we demonstrate the effectiveness of the SNAG models, especially the superiority of the search space.

4.3. Understanding the search space of SNAG

In this section, we show the simplicity and expressiveness of the designed search space of SNAG from two aspects: the speedup in searching and the performance gain from the layer aggregators.

4.3.1. Speedup in searching

In this part, to show the simplicity of the designed search space, we compare the efficiency of SNAG and GraphNAS by showing the validation accuracy w.r.t. the running time; the results are shown in Figure 2. The accuracy is obtained by evaluating the sampled architecture on the validation set after training it from scratch till convergence, which reflects the capability of NAS methods in discovering better architectures as time elapses. From Figure 2, we can see that SNAG speeds up the search process significantly compared to GraphNAS, i.e., the model can obtain better GNN architectures earlier during the search process. Considering the fact that both GraphNAS and SNAG adopt the same RL framework, this advantage is attributed to the simpler and smaller search space.

Figure 2: The validation accuracy w.r.t. search time (in seconds, log base 10) on (a) Cora, (b) CiteSeer, (c) PubMed, and (d) PPI.

4.3.2. Influence of layer aggregators
Table 5
Performance comparisons of SNAG and SNAG-WS using different search spaces (with and without layer aggregators).

                SNAG                                       SNAG-WS
                w/ layer agg.      w/o layer agg.          w/ layer agg.      w/o layer agg.
    Cora        0.8826 (0.0023)    0.8822 (0.0071)         0.8895 (0.0051)    0.8892 (0.0062)
    CiteSeer    0.7707 (0.0064)    0.7335 (0.0025)         0.7695 (0.0069)    0.7530 (0.0034)
    PubMed      0.8877 (0.0012)    0.8756 (0.0016)         0.8942 (0.0010)    0.8800 (0.0013)
    PPI         0.9887 (0.0010)    0.9849 (0.0040)         0.9875 (0.0006)    0.9861 (0.0009)



In this part, to show the stronger expressive capability of the designed search space, we conduct experiments on all datasets using a search space with only the node aggregators, i.e., removing the layer aggregators, as a comparison. The results are shown in Table 5, where we report the test accuracies of both SNAG and SNAG-WS. From Table 5, we can see that the performance consistently drops on all datasets when removing the layer aggregators, which demonstrates the importance of the layer aggregators for the final performance and aligns with the observation in Section 4.2 that the performance of human-designed GNNs can be improved by adopting the JK-Network architecture.
5. Conclusion and Future work

In this work, to overcome the drawbacks in expressive capability and search efficiency of two existing NAS approaches for GNN, i.e., GraphNAS [18] and Auto-GNN [19], we propose the SNAG framework, i.e., Simplified Neural Architecture search for GNN. By revisiting existing works, we unify state-of-the-art GNN models in a message passing framework [7] and design a simpler yet more expressive search space than that of GraphNAS and Auto-GNN. An RL-based search algorithm is designed, and a variant (SNAG-WS) is also proposed by incorporating the weight sharing mechanism. Through extensive experiments on real-world datasets, we not only demonstrate the effectiveness of the proposed SNAG framework compared to various baselines including GraphNAS and Auto-GNN, but also give a better understanding of the different components of the proposed SNAG. For future work, we will explore the SNAG framework in more graph-based tasks besides node classification.
References

[1] M. Gori, G. Monfardini, F. Scarselli, A new model for learning in graph domains, in: IJCNN, volume 2, 2005, pp. 729–734.
[2] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv:1806.01261 (2018).
[3] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: KDD, 2018, pp. 974–983.
[4] J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, D. L. Lee, Billion-scale commodity embedding for e-commerce recommendation in Alibaba, in: KDD, 2018, pp. 839–848.
[5] W. Xiao, H. Zhao, H. Pan, Y. Song, V. W. Zheng, Q. Yang, Beyond personalization: Social content recommendation for creator equality and consumer satisfaction, in: KDD, 2019, pp. 235–245.
[6] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, L. Song, Y. Qi, GeniePath: Graph neural networks with adaptive receptive paths, in: AAAI, volume 33, 2019, pp. 4424–4431.
[7] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural message passing for quantum chemistry, in: ICML, 2017, pp. 1263–1272.
[8] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, ICLR (2016).
[9] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: NeurIPS, 2017, pp. 1024–1034.
[10] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, ICLR (2018).
[11] H. Gao, Z. Wang, S. Ji, Large-scale learnable graph convolutional networks, in: KDD, 2018, pp. 1416–1424.
[12] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: ICLR, 2019.
[13] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, S. Jegelka, Representation learning on graphs with jumping knowledge networks, in: ICML, 2018, pp. 5449–5458.
[14] B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural network architectures using reinforcement learning, ICLR (2017).
[15] B. Zoph, Q. V. Le, Neural architecture search with reinforcement learning, ICLR (2017).
[16] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: A survey, JMLR (2018).
[17] Q. Yao, M. Wang, Taking human out of learning applications: A survey on automated machine learning, arXiv preprint arXiv:1810.13306 (2018).
[18] Y. Gao, H. Yang, P. Zhang, C. Zhou, Y. Hu, Graph neural architecture search, in: IJCAI, 2020, pp. 1403–1409.
[19] K. Zhou, Q. Song, X. Huang, X. Hu, Auto-GNN: Neural architecture search of graph neural networks, Technical Report, arXiv preprint arXiv:1909.03184, 2019.
[20] C. Sciuto, K. Yu, M. Jaggi, C. Musat, M. Salzmann, Evaluating the search phase of neural architecture search, ICLR (2020).
[21] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
[22] H. Liu, K. Simonyan, Y. Yang, DARTS: Differentiable architecture search, ICLR (2019).
[23] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: CVPR, 2018, pp. 8697–8710.
[24] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: ICML, 2019, pp. 6105–6114.
[25] Q. Yao, J. Xu, W.-W. Tu, Z. Zhu, Efficient neural architecture search via proximal iterations, in: AAAI, 2020, pp. 6664–6671.
[26] Y. Zhang, Q. Yao, L. Chen, Neural recurrent structure search for knowledge graph embedding, Technical Report, International Workshop on Knowledge Graph@KDD, 2019.
[27] H. Pham, M. Guan, B. Zoph, Q. Le, J. Dean, Efficient neural architecture search via parameter sharing, in: ICML, 2018, pp. 4092–4101.
[28] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, Q. V. Le, Understanding and simplifying one-shot architecture search, in: ICML, 2018, pp. 549–558.
[29] L. Li, A. Talwalkar, Random search and reproducibility for neural architecture search, arXiv preprint arXiv:1902.07638 (2019).
[30] B. Colson, P. Marcotte, G. Savard, An overview of bilevel optimization, Annals of Operations Research 153 (2007) 235–256.
[31] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, M. Pontil, Bilevel programming for hyperparameter optimization and meta-learning, in: ICML, 2018, pp. 1568–1577.
[32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (2008) 93–93.
[33] M. Fey, J. E. Lenssen, Fast graph representation learning with PyTorch Geometric, in: ICLRW, 2019.
[34] J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: NeurIPS, 2011, pp. 2546–2554.