=Paper=
{{Paper
|id=Vol-1743/paper7
|storemode=property
|title=The Impact of Network Sampling on Relational Classification
|pdfUrl=https://ceur-ws.org/Vol-1743/paper7.pdf
|volume=Vol-1743
|authors=Lilian Berton,Didier A. Vega-Oliveros,Jorge Valverde-Rebaza,Andre Tavares da Silva,Alneu de Andrade Lopes
|dblpUrl=https://dblp.org/rec/conf/simbig/BertonVVSL16
}}
==The Impact of Network Sampling on Relational Classification==
Lilian Berton², Didier A. Vega-Oliveros¹, Jorge Valverde-Rebaza¹, Andre Tavares da Silva², Alneu de Andrade Lopes¹

¹ Department of Computer Science, ICMC, University of São Paulo, CEP 13560-970, São Carlos - SP - Brazil
² Technological Sciences Center, University of Santa Catarina State, CEP 89219-710, Joinville - SC - Brazil
{davo,jvalverr,alneu}@icmc.usp.br, lilian.2as@gmail.com, andre.silva@udesc.br

Abstract

Many real-world networks, such as the Internet, social networks, and biological networks, are massive in size, which hinders different processing and analysis tasks. For this reason, it is necessary to apply a sampling process that reduces the network size without losing relevant network information. In this paper, we propose a new and intuitive sampling method based on exploiting the following centrality measures: degree, k-core, clustering, eccentricity, and structural holes. In our experiments, we delete 30% and 50% of the vertices from the original network and evaluate our proposal on six real-world networks on the relational classification task using six different classifiers. Classification results achieved on the sampled graphs generated by our proposal are similar to those obtained on the entire graphs. In most cases, our proposal reduced the original graphs by up to 50% of their original number of edges. Moreover, the execution time of the learning step of the classifiers is shorter on the sampled graphs.

Keywords: network sampling, relational classification, centrality measures, complex networks

1 Introduction

Networks are relational structures with a high level of order and organization, despite displaying big inhomogeneities (Fortunato, 2010). Furthermore, networks are extremely useful as a representation of a wide variety of complex systems in many real-world contexts, such as social, information, biological, and technological domains (Newman, 2010). Formally, a network is denoted by a graph G = (V, E), where V is the set of vertices representing objects in a specific context, and E is the set of edges representing the interactions among these objects. For instance, in a social network, vertices are individuals and edges are the friendships existing among them (Newman, 2010).

Since analyzing and modeling data in relational representation is relevant for different domains, several applications have been studied to obtain more benefits from the network structure, such as community detection (Valejo et al., 2014), link prediction (Valverde-Rebaza and Lopes, 2013; Valverde-Rebaza and Lopes, 2014; Valverde-Rebaza et al., 2015), topic extraction (Faleiros and Lopes, 2015), information diffusion (Vega-Oliveros and Berton, 2015; Vega-Oliveros et al., 2015), and others. Recently, there has been a lot of interest in relational learning, especially related to relational classification techniques (Lu and Getoor, 2003; Macskassy and Provost, 2003; Macskassy and Provost, 2007; Lopes et al., 2009). Relational classifiers have shown better performance than conventional classifiers (Valverde-Rebaza et al., 2014).

However, most of these networks are massive in size, which makes it difficult to study them in their entirety. In some cases, the network is not totally available or is hard to collect; even when we have the complete graph, it can be very expensive to run the algorithms on it. Hence, it is necessary to perform and study network sampling, i.e., to select a subset of vertices and edges from the full graph in such a way that we obtain G' = (V', E') ⊆ G = (V, E) (Leskovec and Faloutsos, 2006; Ahmed et al., 2012; Ahmed et al., 2013).

Since the assumption that network data fits in memory is not realistic for many real-world domains (e.g., online social networks), different sampling strategies have been proposed aiming to reduce the number of vertices or edges of a network. The state-of-the-art technique is the random subsampling method, which selects nodes uniformly at random. However, while this technique is intuitive and relatively straightforward, it does not accurately capture the properties of networks with power-law degree distributions (Stumpf et al., 2005). To cope with this problem, researchers have also considered sampling methods based on breadth-first search or random walks. The snowball sampling method, for instance, adds nodes and edges using breadth-first search from a randomly selected seed node, accurately maintaining the network connectivity within the snowball (Lee et al., 2006). The Forest Fire Sampling (FFS) method uses partial breadth-first search, where only a fraction of the neighbors are followed for each node (Leskovec and Faloutsos, 2006), and the degree-based sampling method selects nodes with probability proportional to the node degree (Adamic et al., 2001).

Other techniques have been proposed in the literature that consider different sources (e.g., disk-resident/static or streaming) and scales (e.g., small or large). Although there is previous work focused on evaluating the performance of sampling methods by comparing network statistics, i.e., measuring the representativeness of the sampled subgraph structure by comparing it with the full input network structure (Ahmed et al., 2012; Ahmed et al., 2013), to the best of our knowledge there is no extensive research focused on studying the impact of using sampled networks on a specific machine learning task, such as classification, across many classifiers and datasets. Thus, in this paper, we propose an intuitive sampling method and use different configurations of it to perform an empirical evaluation on six real-world networks with six classifiers. We evaluate the quality of our proposal by analyzing: i) how much of the full network structure is preserved in the generated sampled graphs; ii) the accuracy obtained by six relational classifiers on the entire and sampled graphs; and iii) the execution time of the learning step of the classifiers.

The main contributions of this paper are: i) we propose a new and intuitive method for sampling based on centrality properties of networks, such as node degree, k-core, and others; these measures can be calculated on only part of the graph and have low computational cost; ii) we perform an empirical evaluation showing that the relational classification accuracy obtained on the different sampled graphs generated by our proposal is as good as the classification accuracy obtained on the entire graphs, while taking less time in the learning phase; iii) we analyze the network topology to identify in which cases the sampled graphs are similar to the full graphs.

The remainder of this paper is organized as follows. Section 2 presents concepts used in the paper, encompassing centrality measures and relational classification. Section 3 presents the proposed approach for network sampling. Section 4 presents the experimental evaluation, which analyzes the impact of sampling on relational classification and on the network topology. Finally, Section 5 presents the conclusions and future work.

2 Background

In this section, we describe the main centrality measures used as conventional parameters in different sampling methods in the literature. We also briefly introduce the main concepts of relational classification and six of the most popular relational classifiers.

2.1 Centrality measures

In complex networks, researchers have proposed different measures to analyze the importance of central vertices (Newman, 2010; Dorogovtsev and Mendes, 2002). Centrality measures indicate how important a vertex is in some scope. Considering that n = |V| and m = |E|, the centrality measures applied in this work are described as follows.
• Degree (DG): The degree or connectivity of a vertex i, denoted k_i, is the number of edges or connections that leave (k_i^out) or arrive at (k_i^in) vertex i. The average degree ⟨k⟩ for directed networks is the average of the input or output edges. When the network is undirected, the average degree is ⟨k⟩ = 2m/n, i.e., the sum of all edges per vertex over the number of vertices. The k_i values can be calculated as follows:

k_i = Σ_{j∈V} a_ij ,   (1)

where a_ij are the entries of the adjacency matrix. Vertices with very high k_i values are called hubs; they represent strongly connected instances that impact the dynamics of the network (Barabasi and Bonabeau, 2003; Vega-Oliveros and Berton, 2015). For instance, in social networks, hubs are the most popular individuals, like famous actors, politicians, etc. The time complexity for calculating the degree of all vertices is O(n⟨k⟩).

• K-core (KC): The network can be decomposed in terms of sub-networks or cores (Seidman, 1983), where each core of order k (H_k) represents the set of vertices that have k_i ≥ k. A vertex i has Kc(i) = k if H_k is the largest core it can be part of (Seidman, 1983). The principal core is the set of vertices with the largest k-core value, and they are the most central ones (Kitsak et al., 2010). In general, vertices with lower KC values are located at the periphery of the network. The KC centrality is obtained by an iterative and incremental process (Batagelj and Zaversnik, 2003) that begins with k = 1: (i) all vertices with degree lower than or equal to k are removed; (ii) the remaining vertices are re-evaluated several times, in order to remove those whose k_i has become lower than or equal to k; (iii) the removed vertices receive Kc(i) = k, k is incremented, and the process continues with step (i). The final set of vertices is the main core of the network, which has the largest KC centrality. Notice that the hubs do not necessarily have the highest k-core values; for instance, hubs located in the periphery have small k-core centrality (Kitsak et al., 2010). The algorithm has low computational complexity, O(n + m), for calculating the centrality of all vertices.

• Clustering coefficient (CT): In topological terms, this measure captures the presence of triangles (cycles of order three) in the network. The clustering coefficient (Watts and Strogatz, 1998) of a vertex i is defined as the number of triangles centered on i over its maximum number of possible connections, i.e.,

CT_i = 2e_i / (k_i (k_i − 1)) ,   (2)

where e_i is the number of edges among the neighbors of i. In the case of k_i ∈ {0, 1}, a centrality value of zero is assumed, and CT_i = 1 only if all the neighbors of i are interconnected. The running time complexity of the measure is O(n⟨k⟩²).

• Eccentricity (EC): The shortest path between two vertices is the shortest sequence of edges that connects them, and the distance is the number of edges contained in the path. This problem can be solved by employing different algorithms, like Dijkstra, Bellman-Ford, Floyd-Warshall, or breadth-first search (Cormen et al., 2009). In the case where i and j belong to different components, it is assumed that ℓ_ij = n. The eccentricity of a vertex i is the largest distance over all shortest paths to the other vertices, as follows:

EC_i = max_{j≠i} |ℓ_ij| ,   (3)

where |ℓ_ij| is the distance of the shortest path between vertices i and j. This measure evaluates how close a vertex is to its most distant vertex. Lower values of EC indicate that the vertex is more central and closer to the others; therefore, vertices located at the network center have the lowest eccentricity values. For unweighted graphs, the running time complexity of this measure is O(nm).

• Structural holes (HO): Some vertices in the network work as bridges between clusters or other vertices, and if they are removed, a structural hole occurs. Structural hole vertices act as spanners among communities or groups of vertices without direct connections; these individuals are important to the connectivity of local regions. We calculate Burt's constraint scores (Burt, 1992) as the structural holes centrality. The algorithm considers each vertex as an ego network, where connections not related to it have no direct effect. For each vertex, the score is the fraction of isolated holes associated with it according to its ego network. The higher the fraction of structural holes associated with the vertex, the more central it is. Therefore, vertices with higher degree centrality tend to have low HO values, given that their ego networks are larger and more densely interconnected, which diminishes the fraction of isolated holes. The time complexity for calculating the measure for all vertices is O(n + n⟨k⟩²).
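All five measures above are available in standard graph libraries. As an illustration (not the implementation used in the paper), a minimal sketch with networkx on a toy graph, where `constraint` stands in for Burt's structural-holes score:

```python
import networkx as nx

# Toy undirected graph: a triangle {0, 1, 2} with a pendant vertex 3.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (0, 3)])

dg = dict(G.degree())   # DG: number of incident edges
kc = nx.core_number(G)  # KC: largest k such that the vertex survives in the k-core
ct = nx.clustering(G)   # CT: fraction of closed triangles around each vertex
ec = nx.eccentricity(G) # EC: largest shortest-path distance to any other vertex
ho = nx.constraint(G)   # HO: Burt's constraint score (structural holes)

for v in G:
    print(v, dg[v], kc[v], ct[v], ec[v], round(ho[v], 3))
```

On this graph, vertex 0 is the hub (degree 3, eccentricity 1), vertices 1 and 2 close a triangle (CT = 1), and the pendant vertex 3 sits in the 1-core periphery.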
2.2 Relational classification

Conventional classification algorithms learn from a training set formed by independent and identically distributed (i.i.d.) data (Mitchell, 1997). Nevertheless, as previously mentioned, a lot of real-world data is relational in nature and can be represented by graphs. Conventional classifiers do not work properly on graphs because they ignore pairwise dependency relations between vertices, i.e., the relational information. To cope with that, different relational classifiers have been proposed (Lu and Getoor, 2003; Macskassy and Provost, 2003; Macskassy and Provost, 2007; Lopes et al., 2009). Relational classifiers require a fully described graph (vertices and edges) with known labels for some of the vertices in order to predict the labels of the remaining vertices.

For the domain of relational classification, we redefine the network as the graph G = (V, E, W), where V = {v_1, v_2, ..., v_n} is the set of n vertices describing objects, E = {e_1, e_2, ..., e_m} is the set of m edges representing some similarity between pairs of vertices, and W is a matrix of weights, which associates to each edge a weight w_ij that determines the strength of the connection.

In this work, we consider three relational classifiers: weighted-vote relational neighbor (wvrn), network-only Bayes (no-Bayes), and network-only link-based (no-lb).

The wvrn classifier estimates class-membership probabilities by assuming that linked nodes tend to belong to the same class, taking the weighted mean of the class-membership probabilities in the neighborhood N_i of each analyzed node (Macskassy and Provost, 2007), according to Equation 4:

P(v_i = c | N_i) = (1/Z) Σ_{v_j ∈ N_i} w(v_i, v_j) P(v_j = c | N_j) ,   (4)

where Z is a normalizing constant.

The no-Bayes classifier employs a multinomial naïve Bayes classifier based on the classes of the neighborhood of each vertex (Macskassy and Provost, 2007). The no-Bayes is defined as Equation 5:

P(v_i = c | N_i) = P(N_i | c) P(c) / P(N_i) ,   (5)

where P(N_i | c) = (1/Z) Π_{v_j ∈ N_i} P(v_j = c_j | v_i = c)^{w(v_i, v_j)}. Furthermore, these two relational classifiers use relaxation labeling as the collective inference method.

The no-lb classifier creates a feature vector for a vertex by aggregating the labels of its neighborhood and then uses logistic regression to build a discriminative model based on those feature vectors (Lu and Getoor, 2003). This learned model is then applied to estimate P(v_i = c | N_i). For the no-lb classifier, three aggregation methods have been considered: binary-link (no-lb-binary), mode-link (no-lb-mode), and count-link (no-lb-count). Another aggregation method considered is class-distribution link (no-lb-distrib) (Macskassy and Provost, 2007). All the no-lb aggregations use iterative classification as the collective inference method.

3 Proposal

As previously mentioned, our proposal consists of an intuitive approach based on exploring the centrality measures of a network to remove some vertices and edges while trying to conserve the equivalence between the sampled and the entire network. We aim to obtain a sample from G in such a way that it does not affect the performance of any learning task. Thus, our proposal generates a sample G' from G, i.e., G' = φ(G), where φ is the function representing our proposal. It is important to note that G' is a subgraph of G, so V' ⊂ V and E' ⊂ E. The size of the sample is relative to the graph size.

The proposed approach is illustrated in Figure 1 and follows these steps: 1) calculate a specific centrality measure for all vertices of the network; in this paper we use the DG, KC, CT, EC, and HO measures; 2) select some percentage of the vertices with the highest (H) or lowest (L) centrality values; in this paper we experiment with selecting 30% and 50% of the vertices; 3) remove all selected vertices and all their corresponding edges from G, obtaining G'. The sampled graph G' should be equivalent to the entire graph, so learning algorithms should have a similar performance on both the sampled and the entire graph.

All measures used for sampling the graph can be calculated considering only a fraction of the graph, either directly or by employing statistical methods. The measures DG, HO, and CT, for example, can be calculated for each vertex directly. In the case of EC and KC, there are very precise approaches that consider only the vertex community (part of the network). These measures have a low computational cost and can be applied to very large networks; moreover, in our experiments they achieved good accuracy.
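The three steps of the proposal can be sketched as follows. This is a minimal illustration with networkx, using degree as the chosen centrality; the function and parameter names are ours, not the paper's:

```python
import networkx as nx

def sample_by_centrality(G, centrality, fraction=0.3, lowest=True):
    """Step 1: score all vertices; step 2: pick `fraction` of them ranked by
    lowest (or highest) centrality; step 3: remove them and their edges."""
    scores = centrality(G)
    ranked = sorted(scores, key=scores.get, reverse=not lowest)
    to_remove = ranked[:int(fraction * G.number_of_nodes())]
    H = G.copy()
    H.remove_nodes_from(to_remove)  # incident edges are removed too
    return H

G = nx.karate_club_graph()  # 34 vertices, 78 edges
# DG-30L: remove the 30% of vertices with the lowest degree.
H = sample_by_centrality(G, lambda g: dict(g.degree()), fraction=0.3, lowest=True)
print(H.number_of_nodes(), H.number_of_edges())
```

With 34 vertices, 30% corresponds to 10 removed vertices, leaving a 24-vertex sampled graph on which a learning task can then be run.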
Figure 1: Proposed approach: (a) select a percentage of vertices with the lowest (red) or highest (green) centrality values from the original graph; (b) remove all selected vertices and all their edges to obtain the sampled network (in this case, we remove the vertices with the lowest centrality values); (c) run some learning task on the sampled network, for instance relational classification.

4 Experimental results

In this section, we present extensive empirical experiments focused on evaluating the quality of the sampled graphs generated by different configurations of our proposal when compared with the original graph. We use six real-world networks and apply six relational classifiers (see Section 2.2) on the full and sampled graphs. We perform two types of evaluation: Section 4.2 shows the classification accuracy results, and Section 4.3 shows the topological analysis of the sampled and original graphs.

4.1 Data sets and experimental setup

We consider six benchmark data sets (available at http://netkit-srl.sourceforge.net/data.html), which represent real networks and are described in Table 1. We consider all networks as undirected.

Table 1: Data sets description.

Datasets     |V|    |E|    # Classes   ⟨k⟩
Cora         4240   35912  7           17.84
Cornell      351    1393   6           3.98
Imdb         1441   51481  2           66.99
Industry     2189   6531   12          10.65
Texas        338    1002   6           3.44
Washington   434    1941   6           4.41
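The evaluation protocol described next (10-fold cross-validation scored by AUC) can be illustrated with a short sketch. The paper uses the Java toolkit Netkit-SRL with relational classifiers; here scikit-learn with synthetic features merely stands in to show the protocol itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the paper runs relational classifiers on the six networks.
X, y = make_classification(n_samples=600, n_features=20, n_classes=2, random_state=0)

# 10-fold cross-validation scored by area under the ROC curve (AUC),
# the measure used to compare accuracy on G and on the sampled graph G'.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print(round(scores.mean(), 3))
```

The same per-fold AUC scores, averaged, are what each bar in Figures 2 and 3 summarizes for one (dataset, classifier, sampling strategy) combination.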
pling proposed strategies moreover the classifi- We sample a subgraph G0 from a graph G using cation with the entire graph (FULL). For all the the centrality measures presented and considering sampling strategies the datasets Cora and Imdb 30% and 50% of vertices with smallest and high- achieved the highest accuracy. And the better clas- est centralities values. For each sample size, we sifiers, in general, was nolb-lb-count and nolb-lb- perform 10-fold cross validation and applied the distrib. following relational classifiers: weighted vote re- The Nemenyi post-hoc test (Demšar, 2006) was lational neighbor (wvrn), network-only Bayes (no- executed to verify the possibility of detecting sta- Bayes), and network-only link-based (no-lb) clas- tistical differences among the sampling strategies. sifiers, in their Netkit-SRL implementations with The results for 30% and 50% of sampled networks standard configuration. For the network-only link- are shown in Figures 4 and 5 respectively. On the based classifier we employed models modelink top of the diagrams is the critical difference (CD) (no-lb-mode), count-link (no-lb-count), binary- and in the axis are plotted the average ranks of link (nolb-binary) and class-distribution-link (no- the evaluated techniques, where the lowest (best) lb-distrib). The area under the ROC curve (AUC) ranks are on the left side. When the methods ana- 1 http://netkit-srl.sourceforge.net/ lyzed have no significant difference, they are con- data.html nected by a black line in the diagram. 66 (a) (b) (c) (d) (e) (f) Figure 2: Classification results for the entire graph (FULL) and sampling strategies that remove 30% of vertices for the following datasets: (a) Cora, (b) Cornell, (c) Imdb, (d) Industry, (e) Texas and (f) Washington. 
67 (a) (b) (c) (d) (e) (f) Figure 3: Classification results for the entire graph (FULL) and sampling strategies that remove 50% of vertices for the following datasets: (a) Cora, (b) Cornell, (c) Imdb, (d) Industry, (e) Texas and (f) Washington. 68 (a) (a) (b) (b) (c) (c) (d) (d) (e) (e) (f) (f) Figure 4: Nemenyi post-hoc test for the entire graph and Figure 5: Nemenyi post-hoc test for the entire graph and sampling strategies that removes 30% of vertices for the fol- sampling strategies that removes 50% of vertices for the fol- lowing relational classifiers: (a) no-Bayes, (b) no-lb-binary, lowing relational classifiers: (a) no-Bayes, (b) no-lb-binary, (c) no-lb-count, (d) no-lb-distrib, (e) no-lb-mode and (f) (c) no-lb-count, (d) no-lb-distrib, (e) no-lb-mode and (f) wvrn. wvrn. According to the Nemenyi statistics, the crit- the entire graph. It is the case of CT-30H, CT-30L, ical value for comparing the average-ranking of EC-30H and HO-30H for 30% of vertices sam- two different algorithms considering the sampling pled, and CT-50L, EC-50H, HO-50H, KC-50L, strategy that removes 30% of vertices (Figure 4) and DG-50L for 50% of vertices sampled. In par- or 50% of vertices (Figure 5) at 95 percentile in ticular, the CT-50H only had significance differ- all classifiers (no-Bayes, nolb-binary, no-lb-count, ence with the no-lb-distrib classifier. In terms of no-lb-distrib, no-lb-mode, wvrn) is 6.16. accuracy, this result indicates that the CT central- In all the classifiers there are some sampling ity, for all the analyzed parameters, was more ro- strategies that have no statistical difference with bust and suitable as a sampling strategy. 69 Table 2: Execution time (ms) for classification training models. 
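The critical difference above follows from the Nemenyi formula CD = q_α · sqrt(k(k+1)/(6N)). A short check, assuming k = 11 compared techniques (FULL plus the ten sampling strategies), N = 6 datasets, and the tabulated value q_0.05 ≈ 3.219 for 11 groups:

```python
import math

k = 11           # compared techniques: FULL + 10 sampling strategies
N = 6            # datasets
q_alpha = 3.219  # q_0.05 for k = 11 (standard studentized-range table value)

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 2))  # prints 6.16
```

This reproduces the critical value 6.16 quoted in the text, which supports reading k as the eleven techniques ranked across the six datasets.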
Table 2: Execution time (ms) for classification training models.

Datasets     Classifiers     CT-30H  EC-30H  HO-30H  KC-30H  DG-30H  FULL
Cora         no-Bayes        756     824     960     552     511     1395
             no-lb-binary    13498   13398   13857   13380   13313   14547
             no-lb-count     12213   12585   13897   12386   12765   13759
             no-lb-distrib   14218   14310   14569   14452   14401   15396
             no-lb-mode      13991   13837   14097   13304   12924   17516
             wvrn            552     622     713     403     378     1032
Cornell      no-Bayes        74      52      52      50      39      228
             no-lb-binary    724     775     789     694     694     936
             no-lb-count     718     809     754     697     667     1250
             no-lb-distrib   753     759     772     690     719     918
             no-lb-mode      729     703     726     690     712     2018
             wvrn            38      49      45      31      36      145
Imdb         no-Bayes        381     327     558     214     175     1042
             no-lb-binary    959     810     1260    659     587     1538
             no-lb-count     998     843     1258    693     605     1786
             no-lb-distrib   947     868     1232    662     595     1484
             no-lb-mode      970     830     1086    621     600     2720
             wvrn            482     387     695     253     214     1050
Industry     no-Bayes        529     635     744     314     290     2841
             no-lb-binary    23140   21688   22061   26874   26101   23196
             no-lb-count     23751   23934   27333   25806   26370   27111
             no-lb-distrib   27827   26984   26721   29931   28834   28148
             no-lb-mode      27854   25516   24045   28688   27881   31027
             wvrn            281     323     346     187     180     541
Texas        no-Bayes        54      47      52      44      40      206
             no-lb-binary    809     730     703     680     601     914
             no-lb-count     826     773     765     703     605     1125
             no-lb-distrib   766     681     701     680     633     895
             no-lb-mode      657     642     666     674     664     2283
             wvrn            42      38      42      37      40      153
Washington   no-Bayes        68      72      66      45      42      277
             no-lb-binary    967     963     950     888     804     1016
             no-lb-count     1016    977     957     899     868     1124
             no-lb-distrib   1043    955     945     880     794     1190
             no-lb-mode      870     948     877     897     834     2300
             wvrn            51      54      80      44      46      218

Table 2 shows the time comparison of the learning step for all classifiers and all datasets. We notice that all proposed sampling strategies, considering 30% of sampling, achieve a smaller time compared with the original graph, especially the DG and KC strategies. The lowest times are in bold.

4.3 Impact of sampling on network topology

We have analyzed the impact of the sampling methods on the structure of the original network. In Table 3, we show the fraction of remaining edges after applying the sampling methods, according to the target vertices (with highest (H) or lowest (L) centrality values) and the removal percentage (30% or 50%). The bold values highlight the techniques and parameters that achieve accuracy results similar to the full network, i.e., with no significant difference for all classifiers. We have observed that the EC and HO measures are inversely proportional to the final fraction of remaining edges. This occurs because, for EC, the most central or closest vertices have the lowest values, and for the HO measure, hubs tend to have larger ego networks and, hence, lower centrality values.

We notice that diverse amounts of edges can be removed from the original network without strongly affecting the accuracy of the classifiers (in bold). This variation of removed edges, some larger than 50%, suggests that, depending on the expected requirements, the sampling process can privilege:

1. The maximal removal of edges by removing a low proportion of vertices.

2. An equivalent removal proportion of edges and vertices.

3. The minimal removal of edges by removing a high proportion of vertices.
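The fraction of remaining edges reported in Table 3 is straightforward to measure for any strategy. A sketch with networkx (our own helper, using degree as the centrality) contrasting the H and L variants on the same graph:

```python
import networkx as nx

def remaining_edge_fraction(G, centrality, fraction, lowest):
    """Fraction of edges surviving after removing `fraction` of the vertices
    ranked by `centrality` (lowest=True removes the lowest-ranked ones)."""
    scores = centrality(G)
    ranked = sorted(scores, key=scores.get, reverse=not lowest)
    H = G.copy()
    H.remove_nodes_from(ranked[:int(fraction * G.number_of_nodes())])
    return H.number_of_edges() / G.number_of_edges()

G = nx.karate_club_graph()
# Removing 30% of the highest-degree hubs (DG-30H) destroys far more edges
# than removing 30% of the lowest-degree vertices (DG-30L).
hi = remaining_edge_fraction(G, lambda g: dict(g.degree()), 0.3, lowest=False)
lo = remaining_edge_fraction(G, lambda g: dict(g.degree()), 0.3, lowest=True)
print(hi < lo)
```

This reproduces the qualitative pattern of the DG rows in Table 3: the 30H column retains a much smaller fraction of edges than the 30L column.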
More- topological properties are sensibly affected by the over, the execution time for the learning step of the removal. For instance, removing 30% of vertices classifiers are smaller in the sampling strategies with the highest degree centrality (ki ) it produces proposed when compared with the entire graph. a more homogeneous distributed network (tending to a Poisson or regular graph) and the average de- Acknowledgments gree decays. On the other hand with the same pro- This work was partially supported by the São portion, removing the least connected vertices pro- Paulo Research Foundation (FAPESP) grants: duce networks with more heterogeneous degree 2013/12191 5 and 2015/14228 9, Na- distribution than the original graph. tional Council for Scientific and Technological Development (CNPq) grants: 302645/2015 Table 3: Descriptions of edges on sampled graphs. 2 and 140688/2013 7, and Coordination for Datasets Measures DG #edges 30H 0.216 #edges 30L 0.916 #edges 50H 0.077 #edges 50L 0.780 the Improvement of Higher Education Personnel KC 0.266 0.918 0.093 0.785 (CAPES). CoRa HO 0.912 0.231 0.767 0.090 EC 0.718 0.389 0.477 0.191 CT 0.555 0.600 0.293 0.368 DG 0.067 0.853 0.017 0.666 KC 0.140 0.849 0.040 0.670 References Cornell HO 0.853 0.080 0.659 0.020 EC 0.662 0.326 0.479 0.114 CT 0.500 0.702 0.051 0.553 L.A. Adamic, R.M. Lukose, A.R. Puniyani, and B.A. DG 0.226 0.881 0.069 0.648 Huberman. 2001. Search in power-law networks. Imdb KC HO 0.284 0.872 0.881 0.252 0.086 0.637 0.653 0.096 Physical Review E, 64(046135). EC 0.499 0.347 0.264 0.146 CT 0.601 0.553 0.314 0.290 Nesreen K. Ahmed, Jennifer Neville, and Ramana DG 0.079 0.952 0.029 0.883 KC 0.090 0.951 0.039 0.887 Kompella. 2012. Network sampling designs for re- Industry HO 0.952 0.094 0.882 0.035 lational classification. In In Proceedings of the 6th EC 0.758 0.354 0.354 0.152 CT 0.575 0.934 0.203 0.438 International AAAI Conference on Weblogs and So- DG 0.083 0.835 0.026 0.634 cial. 
KC 0.182 0.829 0.070 0.643 Texas HO 0.835 0.083 0.626 0.034 EC 0.685 0.529 0.492 0.196 N.K. Ahmed, J. Neville, and T. Kompella. 2013. Net- CT 0.474 0.720 0.085 0.559 work sampling: From static to streaming graphs. DG 0.082 0.857 0.017 0.680 KC 0.178 0.863 0.074 0.692 ACM Trans. Knowl. Discov. Data, 8(2):1–56. Washington HO 0.855 0.084 0.685 0.028 EC CT 0.724 0.467 0.268 0.745 0.524 0.081 0.114 0.616 A. Barabasi and E. Bonabeau. 2003. Scale-free net- works. Scientific American, pages 50–59. V. Batagelj and M. Zaversnik. 2003. An O(m) algo- 5 Conclusion rithm for cores decomposition of networks. Arxiv preprint cs/0310049. In this paper, we proposed a strategy for network R.S. Burt. 1992. Structural holes: The social struc- sampling by exploring five centrality measures: ture of competition. Harvard University Press, Cam- DG, KC, CT, EC, HO and eliminating vertices bridge, MA. with 30% or 50% of lowest or highest centrality values. All centrality measures considered have a T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. 2009. Introduction to Algorithms. The low order of complexity and are computationally MIT Press, 3 edition. applicable in real networks scenarios. Moreover, they can be calculated in part of the graph. J. Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. JMLR, 7:1–30. The proposed approach reduces the original graph in 50% or even more and the accuracy re- S. N. Dorogovtsev and J. F. F. Mendes. 2002. Evolu- sults remain statistically similar to the obtained tion of networks. In Adv. Phys, pages 1079–1187. 71 T. Faleiros and A. Lopes. 2015. Bipartite graph for M. Stumpf, C. Wiuf, and R. May. 2005. Subnets topic extraction. In IJCAI 2015, pages 4363–4364. of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National S. Fortunato. 2010. Community detection in graphs. Academy of Sciences, 102:4221–4224. CoRR, abs/0906.0612v2. M. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. 
Much- A. Valejo, J. Valverde-Rebaza, and A. Lopes. 2014. nik, H.E. Stanley, and A. Makse. 2010. Identifi- A multilevel approach for overlapping community cation of influential spreaders in complex networks. detection. BRACIS 2014, pages 390–395. IEEE. Nature Physics, 6(11):888–893. J. Valverde-Rebaza and A. Lopes. 2013. Exploiting S. Lee, P. Kim, and H. Jeong. 2006. Statistical prop- behaviors of communities of Twitter users for link erties of sampled networks. Physical Review E, prediction. Social Network Analysis and Mining, 73(016102). pages 1–12. J. Leskovec and C. Faloutsos. 2006. Sampling from large graphs. SIGKDD 2006, pages 631–636. J. Valverde-Rebaza and A. Lopes. 2014. Link predic- tion in online social networks using group informa- A. Lopes, J.R. Bertini, R. Motta, and L. Zhao. 2009. tion. In ICCSA 2014, volume 8584, pages 31–45. Classification based on the optimal k-associated net- work. In Complex Sciences, volume 4 of Lecture J. Valverde-Rebaza, A. Soriano, L. Berton, M.C.F. Notes of the Institute for Computer Sciences, So- de Oliveira, and A. Lopes. 2014. Music genre cial Informatics and Telecommunivations Engineer- classification using traditional and relational ap- ing, pages 1167–1177. Springer Berlin Heidelberg. proaches. BRACIS 2014, pages 259–264. IEEE. J. Valverde-Rebaza, A. Valejo, L. Berton, T. Faleiros, Q. Lu and L. Getoor. 2003. Link-based classification. and A. Lopes. 2015. A naı̈ve bayes model based on In ICML, pages 496–503. overlapping groups for link prediction in online so- cial networks. In ACM SAC’ 15, pages 1136–1141. S.A. Macskassy and F.J. Provost. 2003. A simple relational classifier. In 2nd Workshop on Multi- Relational Data Mining. D. Vega-Oliveros and L. Berton. 2015. Spreader se- lection by community to maximize information dif- S.A. Macskassy and F.J. Provost. 2007. Classification fusion in social networks. In SIMBig 2015, pages in networked data: A toolkit and a univariate case 73–82. study. JMLR, 8:935–983. D. 
Vega-Oliveros, L. Berton, A. Lopes, and F. Ro- T.M. Mitchell. 1997. Machine Learning. McGraw- drigues. 2015. Influence maximization based on Hill, New York. the least influential spreaders. In SocInf 2015, co- M. E. J. Newman. 2010. Networks: an introduction. located with IJCAI 2015, volume 1398, pages 3–8. Oxford University Press. D.J. Watts and S. Strogatz. 1998. Collective dynamics S. Seidman. 1983. Network structure and minimum of ’small-world’ networks. Nature, 393(6684):440– degree. Social networks, 5(3):269–287. 442. 72