Strength of Co-authorship Ties in Clusters: a Comparative Analysis Michele A. Brandão and Mirella M. Moro Universidade Federal de Minas Gerais, Belo Horizonte, Brazil micheleabrandao@dcc.ufmg.br, mirella@dcc.ufmg.br Abstract. We analyze the strength of ties through three different clus- tering algorithms applied to co-authorship social networks from three different research areas. This study reveals if tie strength metrics can be used to evaluate clusters quality. We obtain different results for each algorithm and observe that Markov cluster algorithm provides the best results for co-authorship social networks. Also, researchers in overlapped communities detected by clique percolation method work as bridges. Keywords: Social Networks, Tie Strength, Clustering Algorithms 1 Introduction Clustering algorithms represent a classical problem of data mining and has many applications over a plethora of domains. Then, identifying which algorithm is proper to one such domain is a challenge per se. Likewise, evaluating the qual- ity of the created clusters is hard due to its problem-driven nature, as a good clustering algorithm for a problem may not be as good for another [1]. In the context of social networks (SN), clustering algorithms are useful for detecting (finding) communities. Examples of studies include to explore regional innovation systems, clustering effect in scientific communities and concentration of developers in a country [5]. Specially in academic SN, detecting clusters helps to discovery patterns that may increase the researchers’ productivity, reveal the impact in research policy and understand group formation [11]. However, once again, the problem is how to verify the quality of the created clusters. Here, we apply clustering techniques in co-authorship SN, a type of academic SN in which the nodes are researchers and there are edges between those who have published together. By definition, a cluster in SN is a collection of individu- als with dense interactions patterns internally and sparse interactions externally [13]. Therefore, to evaluate the quality of the created clusters, we use existing metrics to assess the strength of co-authorship ties intra and inter clusters. In summary, when the strength of ties is measured by metrics that consider the neighborhood of nodes, the strength of ties intra cluster should be higher than inter clusters. Hence, we measure tie strength using two metrics that provide such information [3,7,15]: Neighborhood Overlap (N O), the collaboration between two nodes regarding their neighbors; and co-authorship frequency (W ), the absolute number of publications between two persons. Brief Related Work. There are many clustering techniques and they are applied to different types of networks, for example, similarity graphs [9], directed networks [12], social professional networks [5] and mobile SN [10]. From these techniques, we have chosen three that are commonly applied to undirected graphs and represent good strategies commonly used to detect communities in SN [13]. Also, there are different ways to measure clustering quality, such as BetaCV, C-index and modularity [5]. However, identifying whether such metrics give the expected answer for a graph is very difficult [1]. Moreover, most of these metrics are biased and unreliable in larger real graphs. Indeed, in this work, we inves- tigate whether tie strength metrics can be used to evaluate clustering quality. This study represents a new direction in the evaluation of clustering algorithms and may help to fill this gap in the state-of-the-art. Contributions. Overall, our contributions are the analyses of: the distribution of strong and weak ties intra and inter clusters and the dynamism of the strength of ties through different clustering algorithms. Such analyses reveal whether tie strength metrics can be used to evaluate clusters quality based on the definition that ties intra a community should be strong and inter should be weak. Next, we present the analysis setup that includes creating reals SNs (Section 2). Then, we analyze three clustering methods: Louvain method (Section 3.1), clique percolation method (Section 3.2) and Markov cluster algorithm (Section 3.3), and compare their results (Section 4). We have chosen such algorithms because they are important to detect core groups (a.k.a. clusters or communi- ties) on SN [10]. Thus, we use both terms interchangeably and maintain the nomenclature of the clustering methods’ authors. 2 Analyses Setup A co-authorship social network can be modeled as a weighted graph G w = (V, E w ), with V the set of nodes and E w the set of non-directed weighted links. Nodes are researchers (or authors), a tie between any two researchers exists if they have published together, and the tie weight represents the absolute number of publications between them, called as co-authorship frequency or simply W , which has been applied to measure the strength of ties [15]. Another topological property to measure the strength of ties is Neighborhood Overlap – N O [3,7]. The N O of an edge connecting researchers vi and vj is given |N (v )∩N (vj )| by the equation: |N (vi )∪Ni (vj )|−{v i ,vj } , where N (vi ) represents the co-authors of researcher vi , and N (vj ) the co-authors of vj . Here, we build three co-authorship SNs using the CiênciaBrasil datasets1 . The publications available in CiênciaBrasil are from Brazilian researchers and have been collected from Lattes, an online platform for archiving researchers’ curriculum vitae, in November 2013. Each network represents the co-authorships among researchers from three areas: computer science, medicine and sociology. Table 1 has the datasets statistics: number of authors (researchers), number of 1 Datasets available at http://www.dcc.ufmg.br/∼mirella/projs/apoena Table 1: Dataset statistics per research area. Area # authors # publications AvgPubA # pairs (# dist) Computer Science 543 48,706 89.69 16,312 (1,563) Medicine 368 75,553 205.30 16,089 (778) Sociology 96 7,195 74.95 322 (39) (a) Neighborhood Overlap (N O) (b) Co-authorship Frequency (W ) Fig. 1: The strength of ties intra-communities in a perfect clustering. In each box plot, the central rectangle spans the first to the third quartiles, the segment inside is the median, traits above and below the rectangle represent the minimum and maximum values. The clusters’ identifiers order the box plots. publications, average number of publications per author and number of pairs of co-authors (and number of distinct pairs of co-authors). Considering these datasets, we apply three clustering algorithms: Louvain method (LM), clique percolation method (CPM) and Markov cluster algorithm (MCL). Then, we measure the strength of ties for each pair of researchers in each cluster detected by the algorithms. Such strength is measured by using N O and W . Following Brandão and Moro [3], we consider that a tie is weak when N O is in the range [0; 0.2] and strong otherwise. Likewise, a tie is weak when W is in the range [1; 5] and strong otherwise. Overall, analyses provide insights whether the strength of ties metrics can be used to evaluate clustering quality. By clusters definition [2,14,16], ties intra- clusters should be strong and ties inter-clusters should be weak. Therefore, a cluster should have most pairs of researchers (ties) classified as strong and most ties that connect different clusters as weak. One of the problems in evaluating clustering quality is the absence of a ground truth for comparison [1]. Thus, we now verify the strength of ties in a synthetic data that represents a situation with perfect clustering. According to Harman et al. [8], a perfect clustering has a perfect modularization, i.e., all modules in a cluster are connected to all other modules and there are no inter-cluster connections. Thus, we build a graph with 17 nodes and 23 edges (two randomly chosen prime numbers). We link the nodes in a way to form four clusters and there are no connections among nodes from different clusters. Figures 1a and 1b present N O and W of a perfect clustering, respectively. Cluster #1 is the largest one (7 nodes and 12 edges), cluster #2 is the second largest (4 nodes and 5 edges), clusters #3 and #4 have the same size (3 nodes and 3 edges). Note the minimum value of N O is 0.2, i.e., most communities are composed by strong ties. The smallest clusters have N O equal to 1 (i.e., all ties are strongly connected), because all nodes are connected to each other, but in a real social network this hardly happens. We emphasize that a high N O indicates that pairs of researchers are more connected to each other intra a cluster. Also, W of all clusters has the median higher than 20. This is a property strictly related to the frequency of nodes interactions – not always found in real networks. However, co-authorship SN with a high degree of collaboration tend to have a high W [3]. Hence, most detected clusters should have more strong ties than weak ones. 3 Evaluated Clustering Techniques In this section, considering three real co-authorship networks, we analyze three clustering techniques: Louvain method (Section 3.1), Clique Percolation method (Section 3.2) and Markov Cluster algorithm (Section 3.3). For space constraint, all graphs are presented in [4] and we discuss only the main findings next. 3.1 Community Detection Using Louvain Method The Louvain method (LM) [2] is a simple, efficient and a very common method for detecting communities in large networks. It makes greedy seeks to optimize the modularity of a partition of the network, where modularity is a topological property and designed to measure the density of links intra communities [2]. As it works over unweighted networks, it allows to study the links between researchers in clusters that are formed by the modularity and the network topology. Considering only computer science (CS), Figure 2 presents the results for measuring intra and inter-communities created by LM using both N O and W . Overall, medicine has more communities with smaller mean and median N O values than computer science, but W of such communities are higher than com- puter science. Also, in both areas, the communities with highest N O do not indicate communities with highest W . In sociology, N O and W of researchers in each community is small, but communities with the highest N O do not have the highest W . Such aspects suggest that the strength of the intensity of co- authorships among researchers measured by W does not always correspond to the strength of the interactions among researchers’ neighbors measured by N O. Moreover, there are communities with N O equal to zero, few edges compose all such communities in the three SN, and all edges have one node in common. These smaller communities are detected by LM because the researchers are not (a) CS: LM Intra-community with N O (b) CS: LM Intra-community with W (c) CS: LM Inter-community with N O (d) CS: LM Inter-community with W Fig. 2: LM intra/inter-communities with N O and W for CS. densely connected to other communities. Also, considering the outliers, the com- munities have less outliers to N O than to W . For future work, the study of such outliers might reveal interesting properties about co-authorships. 3.2 Uncovering Communities with Clique Percolation Method The clique percolation method (CPM) locates the k-clique communities of net- works and considers that a typical node in a community is linked to many others, but not necessarily to all other nodes in the community [14]. Overall, a commu- nity is a union of smaller fully connected subgraphs that share nodes. Such complete subgraphs are called k-cliques, where k refers to the number of nodes in the subgraph. Then, k-clique-community is defined as the union of all k-cliques that can be reached from each other through adjacent k-cliques [14]. We apply this method using the algorithm implemented in CFinder2 and k=3. By definition, a community is actually a connected graph when k=2 and 2 CFinder: http://www.cfinder.org (a) Med: CPM Intra-community with N O (b) Med: CPM Intra-community with W Fig. 3: CPM intra/inter-communities with N O and W for medicine. a set of disconnected nodes without any edge when k=1. The parameter k de- termines the nature of the communities. Using different values for k reveals the nature of the communities [6]. We have chosen k=3 in order to discover trian- gles and because such a value is also used in most general cases [14]. Finally, the CPM allows overlap, i.e., a node can be a member of different communities at the same time, and communities overlap with each other by sharing nodes. Figure 3a show the communities uncovered by the CPM (k=3) applied to the medicine SN. Note that it considers the social network as unweighted. The N O values reveal that although the communities are formed by cliques, some of them have only weak ties (i.e., pairs of researchers weakly connected regarding N O): seven in medicine, ten in computer science, and two in sociology. In other words, cliques formed by co-authorship of researchers do not have only strong in- teractions. Additionally, each community may have ties linking different cliques, and such ties are also weak in communities with only weak ties. Other com- munities have only strong ties: eight in medicine, four in computer science, and two in sociology. It is also interesting to investigate these communities in order to identify patterns in the high cooperativeness. In the remaining communities, there is a mix of strong and weak ties. Furthermore, most communities have the median and mean different from the others, meaning that researchers have distinct behavior of co-authorship in each community. Regarding W as a measure of the strength of tie, Figure 3b shows that most communities are composed by ties between researchers with small W . In medicine and computer science, only one community has tie with W greater than 10. In sociology, W does not reach five. Although the cliques compose such communities, the high connectivity among researchers groups does not indicate a strong intensity of co-authorship. According to CPM definition, one researcher may be in more than one com- munity. The number of overlaps is small in the three networks: in computer science, only one researcher is in four communities; in sociology, there is no overlap; and in medicine, one researcher is in three communities. An analysis (a) CS: MCL Intra-communitity with NO (b) Soc: MCL Intra-communitity with NO (c) CS: MCL Intra-communitity with W (d) Soc: MCL Intra-communitity with W Fig. 4: MCL intra-communities with N O and W for computer science (CS), soci- ology (soc) – clusters’ identifiers in x axis are ordered by the size of communities. suggests that researchers in overlapped communities have weak ties with other researchers and work as a bridge (details available in [4]). 3.3 Clustering with MCL Algorithm The Markov Cluster Algorithm (MCL) is an unsupervised clustering algorithm for graphs based on simulation of stochastic flow in graphs (known as network) [16]. MCL deterministically finds cluster structures by computing the probability of random walks though the network. We use the algorithm available in Micans3 and keep the default values of the parameters. One input to MCL is a file describing the graph edges: the source and target nodes, and W as edges weight. The MCL interprets W of the edges as similar- ity to cluster the nodes. In order to understand how N O and W influence on clustering formation, we run the algorithm twice changing the value of the edge weight (one time weights equal to N O and another to W ). Using N O, the MCL 3 Micans: http://micans.org/mcl (a) Computer science (b) Sociology Fig. 5: Results of clustering methods and measuring tie strength with N O (each cluster as a color, and each edge as a point). has found 140 clusters in computer science, 35 in sociology and 139 in medicine. Then, using W , the MCL has detected 82 clusters in computer science, 16 in sociology and 68 in medicine. Some clusters are composed of only one node, and they are more present in clusters formed with N O as edge weight. This result indicates that the similarity among researchers is lower considering N O than W . Figure 4 shows the results ordered by the size of communities when N O and W between researchers are considered as edge weight. The graphs do not include clusters with only one node for clarity. There are communities formed only by weak ties and only by strong ties, for example, the communities #25 and #34 in computer science, respectively. However, most communities include both types of tie. Considering the clusters size, the biggest communities are in the beginning of each graphic. For example, clusters #0 and #1 are the largest in sociology. The number of nodes in the largest clusters for N O and W as edges weight is respectively 30 and 27 for computer science, four nodes (two communities of the same size) and four nodes (four communities of the same size) for sociology, 22 and 17 for medicine. Figure 4 also presents that the largest clusters are more formed by strong ties than weak ties, because the first quartile of these clusters is higher or equal to 0.2. Then, the largest communities have high W , because the third quartile pass 30 in the three areas. Lastly, MCL does not find ties connecting researchers from different commu- nities for the three co-authorships SN. This reveals that MCL provides a good clustering result since clustering algoritms minimizes inter-cluster edges [12]. 4 Comparative Analyses We now compare the results of the three clustering methods. Figures 5 and 6 contrast the clusters of each method regarding N O and W , respectively. We ob- (a) Computer science (b) Sociology Fig. 6: Results of clustering methods and measuring tie strength with W (each cluster as a color, and each edge as a triangle). serve that LM tends to find less and larger clusters than the other two methods. Also, MCL detects a huge number of clusters, and some of them are singleton. In the co-authorship context, although CPM allows community overlaps, as a researcher may publish with researchers from others communities, MCL provides the best clusters because most of the detected ones are composed by strong ties. Moreover, Figure 5 shows a high concentration of edges until N O reaches 0.6 in computer science and medicine. In sociology, the maximum value of N O is 0.5, and there is more concentration of edges between 0.2 to 0.3. Also, note that CPM and MCL exclude edges with N O equal to 0. Some strong ties are also removed in CPM, probably because these edges are in a 2-clique (we choose k=3 for CPM). Here, MCL better differentiates the relationships putting them in distinct clusters. On the other hand, the concentration of points in CPM and LM does not show the same for these methods. Additionally, Figure 6 shows a high concentration of edges for W less than 100 in computer science and medicine, and less than 10 in sociology. We also note that co-authorship frequencies equal to zero are not removed in any clustering method. Overall, the three algorithms form clusters with weak and strong ties. However, MCL mostly detects clusters with more strong ties than weak ones. 5 Concluding Remarks We applied three clustering algorithms in three co-authorship SN. For the un- weighted LM, its evaluation results showed it identifies less clusters than the others. When applying CPM, there was a small number of overlaps between communities and researchers in the overlaps are weak ties (they work as bridges). For MCL, we have applied it twice in each algorithm: one with N O as weight and another with W . MCL identified a larger number of clusters than the other methods. Furthermore, the tie strength inter-communities tends to be weak for LM and CPM; whereas MCL algorithm does not find edges inter-communities. A main conclusion of using N O and W in clustering evaluation is: MCL is the best clustering algorithm to be applied in co-authorship SN when compared to LM and CPM. Nevertheless, we also conclude that considering only the strength of ties metrics is not enough to define clustering qualities. Therefore, in the next steps, we plan to apply internal measures (like BetaCV, C-index, and so on) to compare with the results generated by the tie strength metrics. Also, we plan to investigate how the network structure affects the clustering results. Acknowledgments. Work funded by CAPES, CNPq and FAPEMIG, Brazil. References 1. Almeida et al., H.: Is there a best quality metric for graph clusters? In: ECML- PKDD. (2011) 44–59 2. Blondel et al., V.D.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10) (2008) P10008 3. Brandão, M.A., Moro, M.M.: Analyzing the strength of co-authorship ties with neighborhood overlap. In: DEXA. (2015) 527–542 4. Brandão, M.A., Moro, M.M.: A comparative analysis of the strength of co-authorship ties in clusters. Technical Report 4, UFMG (March 2017) http://www.dcc.ufmg.br/˜mirella/projs/apoena. 5. Brandão, M.A., Moro, M.M.: Social professional networks: A survey and taxonomy. Computer Communications 100 (2017) 20 – 31 6. Deb, D., Vishveshwara, S., Vishveshwara, S.: Understanding protein structure from a percolation perspective. Biophysical journal 97(6) (2009) 1787–1794 7. Easley, D., Kleinberg, J.: Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press (2010) 8. Harman, M., Swift, S., Mahdavi, K.: An empirical study of the robustness of two module clustering fitness functions. In: GECCO. (2005) 1029–1036 9. Hassanzadeh et al., O.: Framework for evaluating clustering algorithms in duplicate detection. Proc. VLDB Endow. 2(1) (2009) 1282–1293 10. Kim, P., Kim, S.: A detection of overlapping community in mobile social network. In: Procs. of ACM SAC. (2014) 11. Kshitij, A., Ghosh, J., Gupta, B.M.: Embedded information structures and func- tions of co-authorship networks: Evidence from cancer research collaboration in india. Scientometrics 102(1) (2015) 285–306 12. Malliaros, F.D., Vazirgiannis, M.: Clustering and community detection in directed networks: A survey. Physics Reports 533(4) (2013) 95–142 13. Mishra et al., N.: Clustering social networks. In Bonato, A., Chung, F.R.K., eds.: Algorithms and Models for the Web-Graph. Springer (2007) 56–67 14. Palla et al., G.: Uncovering the overlapping community structure of complex net- works in nature and society. Nature 435(7043) (2005) 814–818 15. Silva et al., T.H.P.: Community-based endogamy as an influence indicator. In: JCDL. (2014) 16. Van Dongen, S.M.: Graph clustering by flow simulation. PhD thesis, Utrecht University (2000)