=Paper= {{Paper |id=Vol-1743/paper7 |storemode=property |title=The Impact of Network Sampling on Relational Classification |pdfUrl=https://ceur-ws.org/Vol-1743/paper7.pdf |volume=Vol-1743 |authors=Lilian Berton,Didier A. Vega-Oliveros,Jorge Valverde-Rebaza,Andre Tavares da Silva,Alneu de Andrade Lopes |dblpUrl=https://dblp.org/rec/conf/simbig/BertonVVSL16 }} ==The Impact of Network Sampling on Relational Classification== https://ceur-ws.org/Vol-1743/paper7.pdf
           The impact of network sampling on relational classification

            Lilian Berton2 Didier A. Vega-Oliveros1 Jorge Valverde-Rebaza1
                    Andre Tavares da Silva2 Alneu de Andrade Lopes1

      1 Department of Computer Science              2 Technological Sciences Center
        ICMC, University of São Paulo              University of Santa Catarina State
   CEP 13560-970, São Carlos - SP - Brazil         CEP 89219-710, Joinville - SC - Brazil
      {davo,jvalverr,alneu}@icmc.usp.br            lilian.2as@gmail.com, andre.silva@udesc.br



                     Abstract

    Many real-world networks, such as the Internet, social networks, and biological networks, are massive in size, which makes different processing and analysis tasks difficult. For this reason, it is necessary to apply a sampling process that reduces the network size without losing relevant network information. In this paper, we propose a new and intuitive sampling method based on exploiting the following centrality measures: degree, k-core, clustering, eccentricity and structural holes. For our experiments, we delete 30% and 50% of the vertices from the original network and evaluate our proposal on six real-world networks in the relational classification task using six different classifiers. Classification results achieved on the sampled graphs generated by our proposal are similar to those obtained on the entire graphs. In most cases, our proposal reduced the original graphs by up to 50% of their original number of edges. Moreover, the execution time for the learning step of the classifiers is shorter on the sampled graphs.

keywords: network sampling, relational classification, centrality measures, complex networks

1   Introduction

Networks are relational structures with a high level of order and organization, even though they display big inhomogeneities (Fortunato, 2010). Furthermore, networks are extremely useful as a representation of a wide variety of complex systems in many real-world contexts, such as social, information, biological and technological domains (Newman, 2010). Formally, a network is denoted by a graph G = (V, E), where V is the set of vertices representing objects in a specific context, and E is the set of edges representing the interactions among these objects. For instance, in a social network, vertices are individuals and edges are the friendships existing among them (Newman, 2010).
   Since analyzing and modeling data in a relational representation is relevant for different domains, several applications have been studied to obtain more benefits from the network structure, such as community detection (Valejo et al., 2014), link prediction (Valverde-Rebaza and Lopes, 2013; Valverde-Rebaza and Lopes, 2014; Valverde-Rebaza et al., 2015), topic extraction (Faleiros and Lopes, 2015), information diffusion (Vega-Oliveros and Berton, 2015; Vega-Oliveros et al., 2015), and others. Recently, there has been a lot of interest in relational learning, especially in relational classification techniques (Lu and Getoor, 2003; Macskassy and Provost, 2003; Macskassy and Provost, 2007; Lopes et al., 2009). Relational classifiers have shown better performance than conventional classifiers (Valverde-Rebaza et al., 2014).
   However, most of these networks are massive in size, which makes it difficult to study them in their entirety. In some cases, the network is not totally available or is hard to collect; and even when we have the complete graph, it can be very expensive to run the algorithms on it. Hence, it is necessary to perform network sampling, i.e., to select a subset of vertices and edges from the full graph such that we obtain G′ = (V′, E′) ⊆ G = (V, E) (Leskovec and Faloutsos, 2006; Ahmed et al., 2012; Ahmed et al., 2013).
   Since the assumption that network data fits in memory is not realistic for many real-world domains (e.g., online social networks), different sampling strategies have been proposed aiming to reduce the number of vertices or edges
of a network. The state-of-the-art technique is the random subsampling method, which selects nodes uniformly at random. However, while this technique is intuitive and relatively straightforward, it does not accurately capture properties of networks with power-law degree distributions (Stumpf et al., 2005). To cope with this problem, researchers have also considered other sampling methods based on breadth-first search or random walks. The snowball sampling method, for instance, adds nodes and edges using breadth-first search from a randomly selected seed node, accurately maintaining the network connectivity within the snowball (Lee et al., 2006). On the other hand, the Forest Fire Sampling (FFS) method uses partial breadth-first search, where only a fraction of the neighbors are followed for each node (Leskovec and Faloutsos, 2006), and the degree-based sampling method selects nodes with a probability of being visited that is proportional to the node degree (Adamic et al., 2001).
   Other techniques have been proposed in the literature that consider different sources (e.g., disk-resident/static or streaming) and scales (e.g., small or large). Although there is previous work on evaluating the performance of sampling methods by comparing network statistics, i.e., measuring the representativeness of the sampled subgraph structure against the full input network structure (Ahmed et al., 2012; Ahmed et al., 2013), to the best of our knowledge there is no extensive research focused on studying the impact of using sampled networks in a specific machine learning task, such as classification, across many classifiers and datasets. Thus, in this paper, we propose an intuitive sampling method and use different configurations of it to perform an empirical evaluation on six real-world networks and six classifiers. We evaluate the quality of our proposal by analyzing: i) how much of the full network structure is preserved in the generated sampled graphs; ii) the accuracy obtained by six relational classifiers on entire and sampled graphs; and iii) the execution time of the learning step of the classifiers.
   The main contributions of this paper are: i) we propose a new and intuitive method for sampling based on centrality properties of networks, such as node degree, k-core, and others; these measures can be calculated on only part of the graph and have low computational cost; ii) we perform an empirical evaluation showing that the relational classification accuracy obtained on the different sampled graphs generated by our proposal is as good as the classification accuracy obtained on entire graphs, while taking less time in the learning phase; iii) we also analyze the network topology to identify in which cases the sampled graphs are similar to the full graphs.
   The remainder of this paper is organized as follows. Section 2 presents some concepts used in the paper, encompassing centrality measures and relational classification. Section 3 presents the proposed approach for network sampling. Section 4 presents the experimental evaluation, which analyzes the impact of sampling on relational classification and on the network topology. Finally, Section 5 presents the conclusions and future works.

2   Background

In this section, we describe the main centrality measures used as conventional parameters in different sampling methods existing in the literature. We also briefly introduce the main concepts of relational classification and six of the most popular relational classifiers.

2.1   Centrality measures

In complex networks, researchers have proposed different measures to analyze the importance of central vertices (Newman, 2010; Dorogovtsev and Mendes, 2002). The centrality measures indicate how important a vertex is in some scope. Considering that n = |V| and m = |E|, the centrality measures applied in this work are described as follows.

   • Degree (DG): The degree or connectivity of a vertex i, denoted ki, is the number of edges or connections that leave (ki^out) or arrive at (ki^in) vertex i. The average degree ⟨k⟩ of a directed network is the average number of input or output edges. When the network is undirected, the average degree is ⟨k⟩ = 2m/n, i.e., the sum of all the edges per vertex of the network over the number of vertices. The ki values can be calculated as follows:

                    ki = Σ_{j∈N} aij .          (1)

     Vertices with very high ki values are called hubs; they represent strongly connected instances that impact the dynamics of the network (Barabasi and Bonabeau, 2003;
Vega-Oliveros and Berton, 2015). For instance, in social networks, hubs are the most popular individuals, like famous actors, politicians, etc. The time complexity for calculating the degree of all the vertices is O(n·⟨k⟩).

   • K-core (KC): The network can be decomposed in terms of sub-networks or cores (Seidman, 1983), where each core of order k (Hk) is the set of vertices that have ki ≥ k. Therefore, a vertex i belongs to Kc(i) = k if Hk is the largest core it can be part of (Seidman, 1983). The principal core is the set of vertices with the largest k-core value, and they are the most central (Kitsak et al., 2010). In general, vertices with lower KC values are located at the periphery of the network. The KC centrality is obtained by an iterative and incremental process (Batagelj and Zaversnik, 2003) that begins with k = 1: (i) all the vertices with degree lower than or equal to k are removed; then, (ii) the remaining vertices are evaluated repeatedly, in order to remove those with ki lower than or equal to k; after that, (iii) the removed vertices become part of the set Kc(i) = k, k is incremented, and the process continues with step (i). The final set of vertices is the main core of the network, which has the largest KC centrality. Notice that the hubs do not necessarily have the highest k-core values; for instance, hubs located in the periphery have small k-core centrality (Kitsak et al., 2010). The algorithm has low computational complexity, O(n + m), for calculating the centrality of all vertices.

   • Clustering coefficient (CT): In topological terms, it captures the presence of triangles (cycles of order three) in the network. The clustering coefficient (Watts and Strogatz, 1998) of a vertex i is defined as the number of triangles centered on i over its maximum number of possible connections, i.e.,

                    CTi = 2ei / (ki(ki − 1)) .          (2)

     In the case of ki ∈ {0, 1}, a centrality value of zero is assumed, and CTi = 1 only if all the neighbors of i are interconnected. The running time complexity of the measure is O(n·⟨k⟩²).

   • Eccentricity (EC): The shortest path between two vertices is the shortest sequence of edges that connects them, and the distance is the number of edges contained in the path. This problem can be solved by employing different algorithms, like Dijkstra, Bellman-Ford, Floyd-Warshall, or breadth-first search (Cormen et al., 2009). In the case that i and j belong to different components, it is assumed that ℓij = n. The eccentricity value of a vertex i is then the largest distance over all the shortest paths to the other vertices, as follows:

                    ECi = max_{j ≠ i} {ℓij} ,          (3)

     where ℓij is the distance of the shortest path between vertices i and j. This measure evaluates how close a vertex is to its most distant vertex. Lower values of EC indicate that the vertex is more central and closer to the others; therefore, vertices located at the network center have the lowest eccentricity values. For unweighted graphs, the running time complexity of this measure is O(n·m).

   • Structural Holes (HO): Some vertices in the network act as bridges between clusters or other vertices, and if they are removed a structural hole occurs. Structural hole vertices act as spanners among communities or groups of vertices without direct connections; these individuals are important to the connectivity of local regions. We calculate Burt's constraint scores (Burt, 1992) as the structural holes centrality. The algorithm considers each vertex as an ego network, where connections not related to it have no direct effect. For each vertex, the score is the fraction of isolated holes that exist in its ego network. The higher the fraction of structural holes associated with the vertex, the more central it is. Therefore, vertices with higher degree centrality tend to have low HO values, given that their ego networks are larger and more densely interconnected, which diminishes the fraction of isolated holes. The time complexity for calculating the measure for all the vertices is O(n + n·⟨k⟩²).
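As a concrete illustration, all five measures can be computed with an off-the-shelf graph library. The sketch below uses networkx on a toy graph (Zachary's karate club, which is not one of the paper's datasets; `constraint` is networkx's implementation of Burt's scores):

```python
import networkx as nx

# Toy undirected, connected graph standing in for one of the paper's networks.
G = nx.karate_club_graph()

centrality = {
    "DG": dict(G.degree()),    # degree k_i
    "KC": nx.core_number(G),   # k-core index Kc(i)
    "CT": nx.clustering(G),    # clustering coefficient CT_i
    "EC": nx.eccentricity(G),  # eccentricity EC_i (requires a connected graph)
    "HO": nx.constraint(G),    # Burt's constraint, used here as the HO score
}

# The vertex with the highest degree is a hub of the network.
hub = max(centrality["DG"], key=centrality["DG"].get)
```

Each measure returns one score per vertex, which is all the sampling strategies of Section 3 need in order to rank the vertices.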
2.2   Relational classification

Conventional classification algorithms learn from a training set formed by independent and identically distributed (i.i.d.) data (Mitchell, 1997). Nevertheless, as previously mentioned, a lot of real-world data are relational in nature and can be represented by graphs. Conventional classifiers do not work properly on graphs because they ignore the pairwise dependency relations between vertices, i.e., the relational information. To cope with that, different relational classifiers have been proposed (Lu and Getoor, 2003; Macskassy and Provost, 2003; Macskassy and Provost, 2007; Lopes et al., 2009). Relational classifiers require a fully described graph (vertices and edges) with known labels for some of the vertices in order to predict the labels of the remaining vertices.
   For the domain of relational classification, we redefine the network as the graph G = (V, E, W), where V = {v1, v2, ..., vn} is the set of n vertices, each describing an object, E = {e1, e2, ..., em} is the set of m edges representing some similarity between pairs of vertices, and W is a matrix of weights that associates to each edge a weight wij determining the strength of the connection. For this work, we consider three relational classifiers: weighted vote relational neighbor (wvrn), network-only Bayes (no-Bayes) and network-only link-based (no-lb).
   The wvrn classifier estimates class membership probabilities by assuming that linked nodes tend to belong to the same class, taking the weighted mean of the class-membership probabilities of the neighborhood of each analyzed node (Macskassy and Provost, 2007), according to Equation 4:

     P(vi = c|Ni) = (1/N) Σ_{vj∈Ni} w(vi, vj) P(vj = c|Nj) .          (4)

   The no-Bayes classifier employs a multinomial naïve Bayes classifier based on the classes of the neighborhood of each vertex (Macskassy and Provost, 2007). The no-Bayes classifier is defined as in Equation 5,

     P(vi = c|Ni) = P(Ni|c) P(c) / P(Ni) ,          (5)

where P(Ni|c) = (1/N) Π_{vj∈Ni} P(vj = cj|vi = c)^w(vi,vj).
   Furthermore, these two relational classifiers use relaxation labeling as the collective inference method. The no-lb classifier creates a feature vector for a vertex by aggregating the labels of its neighborhood and then uses logistic regression to build a discriminative model based on those feature vectors (Lu and Getoor, 2003). This learned model is then applied to estimate P(vi = c|Ni). For the no-lb classifier, three aggregation methods have been considered: binary-link (no-lb-binary), mode-link (no-lb-mode), and count-link (no-lb-count). Another aggregation method considered is class-distribution link (no-lb-distrib) (Macskassy and Provost, 2007). All the no-lb aggregations use iterative classification as the collective inference method.

3   Proposal

As previously mentioned, our proposal is an intuitive approach that exploits the centrality measures of a network to remove some vertices and edges while trying to preserve the equivalence between the sampled and the entire network. We aim to obtain a sample of G in such a way that it does not affect the performance of any learning task. Thus, our proposal generates a sample G′ from G, i.e., G′ = f(G), where f is the function representing our proposal. It is important to note that G′ is a sub-graph of G, so V′ ⊂ V and E′ ⊂ E. The size of the sample is relative to the graph size.
   The proposed approach is illustrated in Figure 1 and follows these steps: 1) calculate a specific centrality measure for all vertices of the network; in this paper we use the DG, KC, CT, EC and HO measures; 2) select some percentage of the vertices with the highest (H) or lowest (L) centrality values; in this paper we experiment with selecting 30% and 50% of the vertices; 3) remove all selected vertices and all their corresponding edges from G, obtaining G′. The generated sampled graph G′ should be equivalent to the entire graph, so learning algorithms should have a similar performance on both the sampled and the entire graph.
   All the measures used for sampling the graph can be calculated considering only a fraction of the graph, either directly or by employing statistical methods. The measures DG, HO and CT, for example, can be calculated for each vertex directly. In the case of EC and KC, there are very precise approaches that consider only the vertex community (part of the network). These measures have a low computational cost and can be applied to very large networks.
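The three steps above can be sketched in a few lines. The function below is our own illustration of the procedure (using networkx and a toy graph), not the authors' code:

```python
import networkx as nx

def centrality_sample(G, measure, fraction=0.3, remove="highest"):
    """Score every vertex with a centrality measure, select the given
    fraction with the highest (H) or lowest (L) values, and delete the
    selected vertices together with all their edges."""
    scores = measure(G)                                   # step 1: centrality per vertex
    ranked = sorted(scores, key=scores.get,
                    reverse=(remove == "highest"))        # step 2: rank and select
    to_remove = ranked[: int(fraction * G.number_of_nodes())]
    H = G.copy()                                          # step 3: remove from a copy of G
    H.remove_nodes_from(to_remove)
    return H

# e.g. the DG-30H strategy: drop the 30% of vertices with the highest degree.
G = nx.karate_club_graph()
G_dg30h = centrality_sample(G, lambda g: dict(g.degree()), 0.3, "highest")
```

Swapping the `measure` argument for any of the functions of Section 2.1 yields the corresponding KC, CT, EC or HO strategy.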
Moreover, in the experimental results they achieved good accuracy.

Figure 1: Proposed approach: (a) Select a percentage of the vertices with the lowest (red) or highest (green) centrality measure values from the original graph. (b) Remove all the selected vertices and all their edges to obtain the sampled network; in this case, we remove the vertices with the lowest centrality measure values. (c) Apply some learning task on the sampled network, for instance, relational classification.

4   Experimental results

In this section, we present extensive empirical experiments focused on evaluating the quality of the sampled graphs generated by different configurations of our proposal, compared with the original graph. We use six real-world networks and apply six relational classifiers (see Section 2.2) on the full and sampled graphs. We perform two types of evaluation: Section 4.2 shows the classification accuracy results, and Section 4.3 shows the topological analysis of the sampled and original graphs.

4.1   Data sets and experimental setup

We consider six benchmark data sets¹, which represent real networks and are described in Table 1. We consider all networks to be undirected.

                Table 1: Data sets description.
         Datasets       |V|      |E|     # Classes     ⟨k⟩
         Cora          4240    35912         7        17.84
         Cornell        351     1393         6         3.98
         Imdb          1441    51481         2        66.99
         Industry      2189     6531        12        10.65
         Texas          338     1002         6         3.44
         Washington     434     1941         6         4.41

   We sample a subgraph G′ from a graph G using the centrality measures presented, considering 30% and 50% of the vertices with the smallest and highest centrality values. For each sample size, we performed 10-fold cross validation and applied the following relational classifiers: weighted vote relational neighbor (wvrn), network-only Bayes (no-Bayes), and network-only link-based (no-lb), in their Netkit-SRL implementations with the standard configuration. For the network-only link-based classifier we employed the models mode-link (no-lb-mode), count-link (no-lb-count), binary-link (no-lb-binary) and class-distribution-link (no-lb-distrib). The area under the ROC curve (AUC) was used as the evaluation measure to compare the accuracy on the graph G and the sampled graph G′.

4.2   Impact of sampling on classification accuracy

The classification results for the entire graph and the 30% and 50% sampled networks are shown in Figures 2 and 3, respectively, with the accuracy for the six datasets (panels (a)-(f)), the six classifiers (bars) and the ten proposed sampling strategies, in addition to the classification with the entire graph (FULL). For all the sampling strategies, the datasets Cora and Imdb achieved the highest accuracy, and the best classifiers, in general, were no-lb-count and no-lb-distrib.
   The Nemenyi post-hoc test (Demšar, 2006) was executed to verify the possibility of detecting statistical differences among the sampling strategies. The results for the 30% and 50% sampled networks are shown in Figures 4 and 5, respectively. At the top of each diagram is the critical difference (CD), and on the axis are plotted the average ranks of the evaluated techniques, where the lowest (best) ranks are on the left side. When the analyzed methods have no significant difference, they are connected by a black line in the diagram.

   ¹ http://netkit-srl.sourceforge.net/data.html
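The critical difference reported below (6.16) can be reproduced from Demšar's formula CD = q_α·√(k(k+1)/(6N)). The values k = 11 (ten sampling strategies plus FULL) and N = 6 datasets are our reading of the setup, and the q value is the 0.05 critical value of the Studentized range statistic (divided by √2) for 11 groups:

```python
import math

k = 11        # compared methods: ten sampling strategies + FULL (our assumption)
N = 6         # datasets
q_005 = 3.219 # Studentized range critical value / sqrt(2), alpha = 0.05, 11 groups

# Nemenyi critical difference (Demsar, 2006): two methods differ significantly
# only if their average ranks differ by more than cd.
cd = q_005 * math.sqrt(k * (k + 1) / (6.0 * N))
```

With these values, `cd` comes out at about 6.16, matching the critical value reported in the text.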
Figure 2: Classification results for the entire graph (FULL) and sampling strategies that remove 30% of vertices for the following datasets: (a) Cora, (b) Cornell, (c) Imdb, (d) Industry, (e) Texas and (f) Washington.
Figure 3: Classification results for the entire graph (FULL) and sampling strategies that remove 50% of vertices for the following datasets: (a) Cora, (b) Cornell, (c) Imdb, (d) Industry, (e) Texas and (f) Washington.
                             (a)                                                                   (a)




                             (b)                                                                   (b)




                             (c)                                                                   (c)




                             (d)
                                                                                                   (d)




                             (e)                                                                   (e)




                             (f)
                                                                                                   (f)

Figure 4: Nemenyi post-hoc test for the entire graph and              Figure 5: Nemenyi post-hoc test for the entire graph and
sampling strategies that removes 30% of vertices for the fol-
                                                                      sampling strategies that removes 50% of vertices for the fol-
lowing relational classifiers: (a) no-Bayes, (b) no-lb-binary,
                                                                      lowing relational classifiers: (a) no-Bayes, (b) no-lb-binary,
(c) no-lb-count, (d) no-lb-distrib, (e) no-lb-mode and (f)
                                                                      (c) no-lb-count, (d) no-lb-distrib, (e) no-lb-mode and (f)
wvrn.
                                                                      wvrn.


   According to the Nemenyi statistics, the critical value for comparing the average rankings of two algorithms, considering the sampling strategies that remove 30% of vertices (Figure 4) or 50% of vertices (Figure 5) at the 95% confidence level for all classifiers (no-Bayes, nolb-binary, no-lb-count, no-lb-distrib, no-lb-mode, wvrn), is 6.16.
   For every classifier, some sampling strategies show no statistically significant difference with respect to the entire graph. This is the case for CT-30H, CT-30L, EC-30H and HO-30H when 30% of vertices are sampled, and for CT-50L, EC-50H, HO-50H, KC-50L and DG-50L when 50% of vertices are sampled. In particular, CT-50H differed significantly only from the no-lb-distrib classifier. In terms of accuracy, this result indicates that the CT centrality, for all the analyzed parameters, was the most robust and suitable sampling strategy.
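The critical value above follows the Nemenyi critical-difference formula of Demšar (2006), CD = q_α · sqrt(k(k+1)/(6n)). A minimal sketch, assuming k = 11 compared methods (ten sampling strategies plus the full graph), n = 6 datasets, and q_0.05 ≈ 3.219 for eleven methods (taken from extended Studentized-range tables; treat it as an assumption):

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference (CD) for the Nemenyi post-hoc test:
    two methods differ significantly if their average ranks
    differ by at least q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Assumed setup: k = 11 ranked methods over n = 6 datasets;
# q_alpha = 3.219 is the 0.05-level constant for k = 11.
print(round(nemenyi_cd(q_alpha=3.219, k=11, n=6), 2))  # prints 6.16
```

Under these assumptions the formula reproduces the 6.16 threshold used in Figures 4 and 5.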



                     Table 2: Execution time (ms) for classification training models.
                 Datasets     Classifiers    CT-30H   EC-30H    HO-30H   KC-30H    DG-30H   FULL
                              no-Bayes         756      824       960      552       511     1395
                             nolb-binary      13498    13398     13857    13380     13313   14547
                  CoRa       no-lb-count      12213    12585     13897    12386     12765   13759
                             no-lb-distrib    14218    14310     14569    14452     14401   15396
                             no-lb-mode       13991    13837     14097    13304     12924   17516
                                 wvrn          552      622       713      403       378     1032
                              no-Bayes          74      52        52        50        39      228
                             nolb-binary       724      775       789      694       694      936
                 Cornell     no-lb-count       718      809       754      697       667     1250
                             no-lb-distrib     753      759       772      690       719      918
                             no-lb-mode        729      703       726      690       712     2018
                                 wvrn           38      49        45        31        36      145
                              no-Bayes         381      327       558      214       175     1042
                             nolb-binary       959      810      1260      659       587     1538
                  Imdb       no-lb-count       998      843      1258      693       605     1786
                             no-lb-distrib     947      868      1232      662       595     1484
                             no-lb-mode        970      830      1086      621       600     2720
                                 wvrn          482      387       695      253       214     1050
                              no-Bayes         529      635       744      314       290     2841
                             nolb-binary      23140    21688     22061    26874     26101   23196
                 Industry    no-lb-count      23751    23934     27333    25806     26370   27111
                             no-lb-distrib    27827    26984     26721    29931     28834   28148
                             no-lb-mode       27854    25516     24045    28688     27881   31027
                                 wvrn          281      323       346      187       180      541
                              no-Bayes          54      47        52        44        40      206
                             nolb-binary       809      730       703      680       601      914
                  Texas      no-lb-count       826      773       765      703       605     1125
                             no-lb-distrib     766      681       701      680       633      895
                             no-lb-mode        657      642       666      674       664     2283
                                 wvrn           42      38        42        37        40      153
                              no-Bayes          68      72        66        45        42      277
                             nolb-binary       967      963       950      888       804     1016
                Washington   no-lb-count      1016      977       957      899       868     1124
                             no-lb-distrib    1043      955       945      880       794     1190
                             no-lb-mode        870      948       877      897       834     2300
                                 wvrn           51      54        80        44        46      218



   Table 2 shows the time comparison of the learning step for all classifiers and all datasets. We notice that all the proposed sampling strategies, considering 30% of sampling, achieve smaller times compared with the original graph, especially the DG and KC strategies. The lowest times are in bold.

4.3 Impact of sampling on network topology

We have analyzed the impact of the sampling methods on the structure of the original network. Table 3 presents the fraction of remaining edges after applying the sampling methods, according to the target vertices (with highest (H) or lowest (L) centrality value) and the removal percentage (30% or 50%). The bold values highlight the techniques and parameters that achieve accuracy results similar to the full network, i.e., with no significant difference for all the classifiers. We have observed that the EC and HO measures are inversely proportional to the final fraction of remaining edges. This occurs because, for EC, the most central or closest vertices have the lowest values, and, for the HO measure, hubs tend to have larger ego-networks; hence, their centrality values are lower.
   We notice that there exist diverse fractions of edges removed from the original network without strongly affecting the accuracy of the classifiers (in bold). This variation of removed edges, some larger than 50%, suggests that, depending on the expected requirements, the sampling process can privilege:

  1. The maximal removal of edges by removing a low proportion of vertices.

  2. An equivalent removal proportion of edges and vertices.

  3. The minimal removal of edges by removing a high proportion of vertices.
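The fractions reported in Table 3 can be illustrated with a small degree-based (DG) sketch: rank vertices by degree, drop the top or bottom fraction, and count the surviving edges. The function name and the toy edge list below are our own illustration, not the paper's implementation:

```python
from collections import defaultdict

def remaining_edge_fraction(edges, frac, highest=True):
    """Remove `frac` of the vertices ranked by degree (highest or
    lowest, mirroring the DG-H / DG-L strategies) and return the
    fraction of edges that survive in the sampled graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Stable sort: ties keep their first-seen order.
    ranked = sorted(degree, key=degree.get, reverse=highest)
    removed = set(ranked[: int(len(ranked) * frac)])
    kept = [e for e in edges if e[0] not in removed and e[1] not in removed]
    return len(kept) / len(edges)

# Toy hub-plus-path graph: removing the hub deletes most edges,
# removing low-degree leaves keeps most of them.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (5, 6)]
print(remaining_edge_fraction(edges, 0.3, highest=True))
print(remaining_edge_fraction(edges, 0.3, highest=False))
```

Even on this toy graph, removing 30% of the highest-degree vertices deletes far more edges than removing the same proportion of the lowest-degree ones, which is the asymmetry the H/L columns of Table 3 exhibit.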



In the first case, by removing 30% of vertices, we have the sampling method CT-30H. In the third case, we have the methods DG-50L, KC-50L, and HO-50H. The remaining bold sampling strategies fall into the second case.
   Although reducing the number of vertices and edges of the original network does not statistically impact the classification results, the topological properties are noticeably affected by the removal. For instance, removing the 30% of vertices with the highest degree centrality (ki) produces a more homogeneously distributed network (tending to a Poisson or regular graph), and the average degree decays. On the other hand, with the same proportion, removing the least connected vertices produces networks with a more heterogeneous degree distribution than the original graph.

Table 3: Fraction of remaining edges on sampled graphs.
    Datasets     Measures   #edges 30H   #edges 30L   #edges 50H   #edges 50L
                   DG          0.216        0.916        0.077        0.780
                   KC          0.266        0.918        0.093        0.785
      CoRa         HO          0.912        0.231        0.767        0.090
                   EC          0.718        0.389        0.477        0.191
                   CT          0.555        0.600        0.293        0.368
                   DG          0.067        0.853        0.017        0.666
                   KC          0.140        0.849        0.040        0.670
     Cornell       HO          0.853        0.080        0.659        0.020
                   EC          0.662        0.326        0.479        0.114
                   CT          0.500        0.702        0.051        0.553
                   DG          0.226        0.881        0.069        0.648
                   KC          0.284        0.881        0.086        0.653
      Imdb         HO          0.872        0.252        0.637        0.096
                   EC          0.499        0.347        0.264        0.146
                   CT          0.601        0.553        0.314        0.290
                   DG          0.079        0.952        0.029        0.883
                   KC          0.090        0.951        0.039        0.887
    Industry       HO          0.952        0.094        0.882        0.035
                   EC          0.758        0.354        0.354        0.152
                   CT          0.575        0.934        0.203        0.438
                   DG          0.083        0.835        0.026        0.634
                   KC          0.182        0.829        0.070        0.643
      Texas        HO          0.835        0.083        0.626        0.034
                   EC          0.685        0.529        0.492        0.196
                   CT          0.474        0.720        0.085        0.559
                   DG          0.082        0.857        0.017        0.680
                   KC          0.178        0.863        0.074        0.692
    Washington     HO          0.855        0.084        0.685        0.028
                   EC          0.724        0.268        0.524        0.114
                   CT          0.467        0.745        0.081        0.616

5   Conclusion

In this paper, we proposed a strategy for network sampling by exploring five centrality measures, DG, KC, CT, EC and HO, and eliminating the 30% or 50% of vertices with the lowest or highest centrality values. All the centrality measures considered have a low order of complexity and are computationally applicable in real network scenarios. Moreover, they can be calculated on part of the graph.
   The proposed approach reduces the original graph by 50% or even more, while the accuracy results remain statistically similar to those obtained with the entire network, i.e., the impact on the classification results obtained with the entire network is minimal when compared with those obtained with the sampled networks. We applied the proposed strategy to six real networks considering six different relational classifiers. The CT measure was the most robust in accuracy, with no statistically significant difference for all classifiers and on all networks. Moreover, the execution times of the learning step of the classifiers are smaller for the proposed sampling strategies than for the entire graph.

Acknowledgments

This work was partially supported by the São Paulo Research Foundation (FAPESP), grants 2013/12191-5 and 2015/14228-9, the National Council for Scientific and Technological Development (CNPq), grants 302645/2015-2 and 140688/2013-7, and the Coordination for the Improvement of Higher Education Personnel (CAPES).

References

L.A. Adamic, R.M. Lukose, A.R. Puniyani, and B.A. Huberman. 2001. Search in power-law networks. Physical Review E, 64(046135).

N.K. Ahmed, J. Neville, and R. Kompella. 2012. Network sampling designs for relational classification. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media.

N.K. Ahmed, J. Neville, and R. Kompella. 2013. Network sampling: From static to streaming graphs. ACM Trans. Knowl. Discov. Data, 8(2):1–56.

A. Barabasi and E. Bonabeau. 2003. Scale-free networks. Scientific American, pages 50–59.

V. Batagelj and M. Zaversnik. 2003. An O(m) algorithm for cores decomposition of networks. Arxiv preprint cs/0310049.

R.S. Burt. 1992. Structural holes: The social structure of competition. Harvard University Press, Cambridge, MA.

T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. 2009. Introduction to Algorithms. The MIT Press, 3rd edition.

J. Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. JMLR, 7:1–30.

S.N. Dorogovtsev and J.F.F. Mendes. 2002. Evolution of networks. Adv. Phys., pages 1079–1187.




T. Faleiros and A. Lopes. 2015. Bipartite graph for topic extraction. In IJCAI 2015, pages 4363–4364.

S. Fortunato. 2010. Community detection in graphs. CoRR, abs/0906.0612v2.

M. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H.E. Stanley, and A. Makse. 2010. Identification of influential spreaders in complex networks. Nature Physics, 6(11):888–893.

S. Lee, P. Kim, and H. Jeong. 2006. Statistical properties of sampled networks. Physical Review E, 73(016102).

J. Leskovec and C. Faloutsos. 2006. Sampling from large graphs. SIGKDD 2006, pages 631–636.

A. Lopes, J.R. Bertini, R. Motta, and L. Zhao. 2009. Classification based on the optimal k-associated network. In Complex Sciences, volume 4 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 1167–1177. Springer Berlin Heidelberg.

Q. Lu and L. Getoor. 2003. Link-based classification. In ICML, pages 496–503.

S.A. Macskassy and F.J. Provost. 2003. A simple relational classifier. In 2nd Workshop on Multi-Relational Data Mining.

S.A. Macskassy and F.J. Provost. 2007. Classification in networked data: A toolkit and a univariate case study. JMLR, 8:935–983.

T.M. Mitchell. 1997. Machine Learning. McGraw-Hill, New York.

M.E.J. Newman. 2010. Networks: an introduction. Oxford University Press.

S. Seidman. 1983. Network structure and minimum degree. Social Networks, 5(3):269–287.

M. Stumpf, C. Wiuf, and R. May. 2005. Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 102:4221–4224.

A. Valejo, J. Valverde-Rebaza, and A. Lopes. 2014. A multilevel approach for overlapping community detection. BRACIS 2014, pages 390–395. IEEE.

J. Valverde-Rebaza and A. Lopes. 2013. Exploiting behaviors of communities of Twitter users for link prediction. Social Network Analysis and Mining, pages 1–12.

J. Valverde-Rebaza and A. Lopes. 2014. Link prediction in online social networks using group information. In ICCSA 2014, volume 8584, pages 31–45.

J. Valverde-Rebaza, A. Soriano, L. Berton, M.C.F. de Oliveira, and A. Lopes. 2014. Music genre classification using traditional and relational approaches. BRACIS 2014, pages 259–264. IEEE.

J. Valverde-Rebaza, A. Valejo, L. Berton, T. Faleiros, and A. Lopes. 2015. A naïve Bayes model based on overlapping groups for link prediction in online social networks. In ACM SAC '15, pages 1136–1141.

D. Vega-Oliveros and L. Berton. 2015. Spreader selection by community to maximize information diffusion in social networks. In SIMBig 2015, pages 73–82.

D. Vega-Oliveros, L. Berton, A. Lopes, and F. Rodrigues. 2015. Influence maximization based on the least influential spreaders. In SocInf 2015, co-located with IJCAI 2015, volume 1398, pages 3–8.

D.J. Watts and S. Strogatz. 1998. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442.



