Measuring Homophily Matteo Cristani, Diana Fogoroasi, and Claudio Tomazzoli University of Verona {matteo.cristani, diana.fogoroasi.studenti, claudio.tomazzoli}@univr.it Abstract. Social Network Analysis is employed widely as a means to compute the probability that a given message flows through a social net- work. This approach is mainly grounded upon the correct usage of three basic graph- theoretic measures: degree centrality, closeness centrality and betweeness centrality. We show that, in general, those indices are not adapt to foresee the flow of a given message, that depends upon indices based on the sharing of interests and the trust about depth in knowledge of a topic. We provide new definitions for measures that over- come the drawbacks of general indices discussed above, using Semantic Social Network Analysis, and show experimental results that show that with these measures we have a different understanding of a social network compared to standard measures. 1 Introduction Social Networks are considered, on the current panorama of web applications, as the principal virtual space for online communication. Therefore, it is of strong relevance for practical applications to understand how strong a member of the network is with respect to the others. Traditionally, sociological investigations have dealt with problems of defining properties of the users that can value their relevance (sometimes their impor- tance, that can be considered different, the first denoting the ability to emerge, and the second the relevance perceived by the others). Scholars have developed several measures and studied how to compute them in different types of graphs, used as models for social networks. This field of research has been named Social Network Analysis. Sometimes the same name is attributed to a wider context, where we also mean to include analysis of the ways in which such values arise (for instance, processes able to change importance of members), or to provide methods for employing these measures in applications. Majorly, scholars dealt with the Social Network Analysis from the viewpoint of information flow, namely they provide models of importance (and other as- pects as well) to understand how probable would be that a piece of information passed through a given node. Mainly, the information flow has been studied for propagation of viruses (both in medical and in computer security contexts), news spread-out (and hence, studies about viral marketing as well), and message passing in certain application contexts. Three basic measures have been developed that belong to the family of cen- trality measures: degree centrality, closeness centrality and betweeness central- ity. In this paper we criticize the models of social network analysis developed for these measures, showing that there are cases in which these measures are not adapt. The criticism arises mainly as related to the absence of semantic aspects in measures. To show what we mean with these limits, let us introduce a general example. Example 1. Consider two users of facebook, Alice and Bob, and assume that the measures of importance is settled to coincide with the number of friends, the distance to non-friends, and the probability of being in common between two non-friends. Alice results to be much more important in the network than Bob, under all the three measures. However, to a closer observation we notice that this result is definitely true for certain topics whilst it results false for other ones. In particular, Alice is much more expert than Bob about Geography and History, equivalent with respect to Sport and weaker for Cuisine. When someone passes a message to Alice and the message regards Geography, she is much more likely to pass the message than Bob. Conversely, when a message regards Sport the opposite case holds. Cuisine information flow is better when passes through Bob. The above described example shows that it can be the case that two members of a social network can exhibit different orders of prevalence in terms of centrality depending on the topic we refer the prevalence to. This may produce effects that cannot be reproduced by a single index, as shown in the example below. Example 2. Consider five individuals: Alice, Bob, John, Annie and Charlie. Alice is connected to Bob and John; John connects also to Bob and Charlie and also Annie is connected to Bob and Charlie, while Bob is connected directly to everyone and is person who loathes gossips when the others like or accept it. If we don’t consider topics we would say that dropping a gossip in the net- work, the right person to deliver it to have it spread is of course Bob. Unfortunately, the message has contents of a topic which probably will see Bob cancel it, instead of forwarding it, while both John and Charlie are good choices because they are directly connected to three people each and they have a different attitude toward gossip than Bob. The purpose of this paper is to give account to the aspects showed in the example above. We provide a model of Social Network Analysis that takes into account topics, and show that it can foresee information flow for message treating those topics in a more accurate way than classical topic-free social network analysis. We also name Semantic Social Network Analysis the techniques we studied in this investigation to cover a part of research that some previous studies did not cover satisfactorily. The rest of the paper is organised as follows: in Section 2 we discuss related work on the subject. Further we employ Section 2 to provide the actual technical part of the paper and in Section 4 . Finally Section 5 takes some conclusions and sketches further work. 2 Related Work The reference literature can be considered as articulated in three themes: – Studies about implicit social links that exist among users of the internet (or of an internet application), or about enrichment of social web; – Investigations of the semantics of social networks; – Research about Social Network Analysis and relationships to semantic issues. Regarding the first topic, we can look at methods for social link extraction, as discussed below, as one of the best structured investigations on the theme. This specific method for extracting social networks from the web using similar- ity between collective contexts is proposed in [2]. The authors construct three social networks on the same set of named entities. They use Jaccard, overlap and Normalized Google Distance (NGD) [4] coefficients to retrieve degree of close- ness between entities. They show how actors may be assigned different relevance degrees and that actors having higher ranking results may be assigned lower ranks and inversely by choosing another measure to perform the ranking. In our perspective their work is solid, but lacks in one important aspect, the authors build homophily on the based of the contents. This is a technique to build a net- work, and not an analysis of the network itself, as we do in this work. Suffering the same issue is the work of [9], where the authors present a new framework for applying Social Network Analysis to RDF representations of social data. In particular, the use of graph models underlying RDF and SPARQL extensions enables us to extract efficiently and to parametrize the classic Social Network Analysis features directly from these representations. The main criticisms to the proposed approach lie on the fact that, as already shown in many practical cases, it makes a lot of difference, in terms of understanding of the structure of similar- ity between nodes, to know the relevance of the two nodes. In fact, similarity can be used, as done, for instance in [6], for community detection, where members are related to each other based on their similarity in semantic terms. This is different in terms of relationship, with respect to measuring the relevance and study attractivity. Clearly, being interested in Football lies on liking it, but the community is formed around authoritative persons, for instance journalists. A more practical research has been documented in [16] where an application of semantic social networks and attraction theory to web based services is carried out. The relation between trust and Social Network Analysis has been investi- gate in [17] and specified as a means for understanding deeply the meaning of centrality and other measures as related to authority. The same concept is em- ployed to provide a framework for the general interpretation of the logic bases of recommendation systems in [7]. The studies cited above all aim at discovering network links by means of mining techniques. On the other hand, the introduc- tion of notions derived from semantic web into social networks is the core quest of many recent studies, including [18]. As a complete reference to the current literature about meaning of social links, and relationships between social web and semantics, readers can look at [12]. More deeply, in [14] a direct and explicit comparison between social networks and the semantic web is carried out. This paper proposes a parallel between networked knowledge of members in a network and the basic notions of semantic web. The same issue is dealt with, with the specificity of a known technique, the semantic networks, in [8]. More generally, the semantic web methods are employed for understanding the meaning of so- cial networks as sharing platforms for common knowledge, in [15]. The idea of using Social Network Analysis as a means for forecasting the probability of a message to pass through a given member of the network itself is not novel at all. Base of our analysis is the criticisms to the roughness of the employed measures, criticisms that are not novel anyhow. This has been dealt in two distinct ways: by using semantic methods for habilitating the forecast processes: in particular in [19], authors use semantic networks for foreseeing the behaviour in facebook. On the other hand, many criticisms are applied to centrality measures ([11], [10]). The main criticisms, that are met by the above mentioned investigations as well as by researches tending to correct the flaws of the general methods for centrality measures, and the measures themselves, lie on the weakness of the notion of similarity derived from the notion of centrality. The above mentioned notion of similarity as derived from centrality measures, and its applications to the notion of reciprocity, a concept that has a crucial importance, for instance, in asymmetric social networks (Instagram, Twitter) are dealt with in [1]. Authors show that centrality measures as used so far are unsuccessful in forecasting the information flows. 3 Measures in Semantic Social Network Analysis In this section, we focus in computing two of the basic centrality measures: Degree centrality and Closeness centrality including topic and subtopic. 3.1 Topics and Subtopics In order to define a different structure of a Social Network that includes the interests of an actor, we here introduce concepts like ”topic” and ”subtopic” and the relationships between them as they are defined in [3, pp. 1–77]. Let us consider a simple example that represent knowledge concerning Cuisine (Figure 1). The structure is meant to represent the generality/specificity of the concepts involved, therefore is called a terminology. For example, a link between Cuisine and Italian Cuisine says that ”Italian cuisine is a cuisine”. When a concept is more specific than other concepts, like seen in the previous example, it inherits the property of the more general one. Messages in a social network can be classified as belonging to a set of topics and subtopics. 3.2 Semantic Degree centrality In classical Social Network Analisys actors with the highest Degree centrality are considered key members on the network. In other words, actors with many CUISINE Italian French Indian Japanese Cuisine Cuisine Cuisine Cuisine Fig. 1. Knowledge concerning Cuisine direct connections are considered more important. As discussed earlier, this may not be true if we take into account semantics. Consequently, we try to compute the Degree centrality of an actor, in the network, based on his depth (interest) in specific subtopics. We here consider an actor more important or more central in a network if he has a high value of depth in a specific topic. In order to distinguish the measures we name this measure Semantic User Profiling. We compute the measure for one actor for all five topics described earlier. The depth value can vary from 0.0 to 1.0. Intuitively, a depth of 1.0 means that the person is maximum interested in a specific subtopic and a depth of 0.0 means that the person is not interested at all in that subtopic. The same stands also for topics. It is possible to compute Semantic User Profiling for a node vp for each topic th as follow: l X SU P (vp ) th = αi · sphi i=1 where h = {1, ..., r} with r the number of topics, l represents the number of subtopics for each topic th and αi is the weight of each subtopic shi . 3.3 Semantic Closeness centrality Applying the existing measure Closeness centrality, actors with highest Closeness centrality are considered more central as they require only few intermediaries to reach other actors. The closeness is defined as the reciprocal value of the sum of all shortest paths from one actor to all others actors in the network. This could not be completely true if we take into account semantics. We here consider the closeness of an actor with respect to subtopics, as the minimum value between the depth of the actor and all others actors depth, in the network, that are on the shortest path from the actor to them. If there is more than one shortest path from the actor to another actor, then we consider the minimum value of depth between all shortest path. Thus, we compute the Closeness centrality of an actor, in the network based on the minimum between his depth and the depths to all other actors. To distinguish the measure from the standard Closeness centrality we call this measure Semantic Closeness centrality. The distance between two nodes, taking into account subtopics, that we named reversetopicdistance, is defined as: l X Xl l X l X dt(vp , vq ) = min(min(( αi ·shi )p , ( αi ·shi )k ), min(( αi ·shi )k , ( αi ·shi )q )) i=1 i=1 i=1 i=1 Pl where ( i=1 αi · shi )k represents the depth of the actor, denoted by the node k, in the topic th and: – h = {1, ..., r} with r the number of topics – l is the number of subtopics that defines the topic th – shi is the subtopic included in the topic th (shi v th ) – αi is the weight of the subtopic shi The distance denotes the minimum depth on the shortest path form the node p to the node q, for which all intermediary nodes are in the set {1, 2, ..., k}. At this point it is possible to compute Semantic Closeness centrality for a node vp on a topic th as follows: Pn th p=1,q6=p dt(vp , vq ) CSC (vp ) = n−1 where, once more h = {1, ..., r} with r the number of topics and n is the number of actors in the network. It is reasonable to think that the Semantic Closeness centrality of an actor in a certain topic may be the sum of all reversetopicdistances weighted by the number of nodes, excluding the node for which we compute the measure. Intuitively, this measure represents the minimum depth in each topic, between one actor and all other actors, following the shortest path in the network. 3.4 Semantic Actuation From the Semantic Closeness centrality of a node on various topics is possible to derive a new measure, that we named Semantic Actuation, in order to quantify the actuation of an actor considering all topics. We can define it as a cross-topic measure. To explain, the actuation can be defined as the good chance that an actor becomes engaged when a message regarding different topics arrives to him. For a node v, we can express this as: r X SA(v) = CSC (v)th h=1 where r is the number of topics. The actuation of an actor is defined as the sum of the Semantic Closeness cen- tralities in all topics. Intuitively, the formula finds the actors that are closer to all other actors in the network for all topics. 3.5 Computational analysis of Semantic Social Network Analysis Computation of degree centrality is straightforwardly obtained from the basic algorithm for Social Networks, that simply counts the number of incident edges of each vertex, and then derives the consequent computations for relative and graded variants. In Semantic Social Network Analysis, we sum the weights, in- stead of counting the incident edges. The base for computing both closeness and betweeness centrality is the la- belling of edges by the graph distance, meant as shortest path. We extend here the method known as Floyd Warshall Algorithm. In indirected unlabelled graphs, the algorithm computes just the incident edges to obtain the correct value of dis- tances. We use the vectorial min() function both to initialize the distance matrix and to give a the value of the distance between two vertices in the core of the algorithm and we extend the algorithm in [5] to take into account topics and subtopics. – A n × n Topic Matrix T representing the minimum depth of all nodes in a graph. T = (tij ), where tij is the minimum depth between the depth of the node i and the depth of the node j on topic t in a graph G = (V, E): 0 if i = j ( tij = min(ti , tj ) if i 6= j and (i, j) ∈ E ∞ if i 6= j and (i, j) ∈/E – A n × n Inverse Topic Distance T D representing the reversetopicdistance as (k) (k) described in the previous section. T D = (tdij ), where tdij is the minimum depth on topic t on the shortest path from node i to node j for which (k) all intermediary nodes are in the set {1, 2, . . . , k}R Note that tdij can be recursively defined as: ( (k) tij if k = 0 tdij = (k−1) (k−1) (k−1) min(tdij , min(tdik , tdkj )) if k ≥ 1 – A n × n Weight Matrix W representing the edge weights of all nodes in a graph. W = (wij ), where wij is the weight of an edge between node i and node j in a graph G = (V, E).In our case the weight of each edge is 1 0 if i = j ( wij = 1 if i 6= j and (i, j) ∈ E ∞ if i 6= j and (i, j) ∈/E – A n × n Path Matrix D representing the path distance between nodes, where D = (dkij ). dkij is the length of the shortest path from node i to node j for which all intermediate nodes are in the set {1, 2, . . . , k}. (k) Note that dij can be recursively defined as: ( (k) wij if k = 0 dij = (k−1) (k−1) (k−1) min(dij , dik + dkj ) if k ≥ 1 Algorithm 1 ALGORITHM FWSTST. given a graph D returns matrix of distances 1: Input: a graph D of vertices V = v1 , . . . , vn and a n × n matrix of Inverse Topic Distance T D; 2: Output: a matrix of distances ∆; 3: i ← 1, j ← 1, k ← 1; 4: ∆ ← T D; 5: for k < |V | do 6: for j < |V | do 7: for i < |V | do 8: if dij ≥ dij + dkj then dij = dik + dkj ; 9: end if (k−1) (k−1) (k−1) 10: δ[i, j] ← min(δij , min(δik , δkj )); 11: end for 12: end for 13: end for In the non extent algorithm we used ”>” because we were no interested in a path with the same length as the previous path (only in shortest), but here we want to find the minimum depth on all the shortest path from a node to another, thus we use ”≥”. The Semantic Floyd-Warshall Algorithm solves the problem of the shortest path between all pair of nodes and the minimum depth between all pair of nodes in a graph in time O(n3 ), where n is the order of the graph. At each step incrementally improves an estimate on the shortest path between two nodes and an estimate on the minimum depth between two nodes. 4 Experiments To test the methods introduced above in section 2, in order to find the Semantic centrality of an actor, we conduct an experiment on a small, but real, part of the giant network Facebook. With the experiment we want to examine the way in which the centrality of an actor changes when semantic information is included. (a) Network pointing highest Degree (b) Network pointing highest Closeness centrality centrality Fig. 2. Representation of social network with node size proportional to considered measure 4.1 Experimental setup To perform the experiment, we take advantage on a relative small, but realistic dataset from Stanford Dataset Collection 1 [13]. We choose the social network Facebook because we want to lead the experi- ment on a representation whit an undirected graph. Other Social Networks, like Twitter, are usually represented with direct graphs. This, because are based on concepts like ”followers” or ”following”, contrarily on Facebook, where we can found the desired notion of ”friend”. The dataset consists of a list of friends from Facebook, collected from sur- vey participants using the Facebook app 2 . As one can expect, the data has been anonymized by replacing the id-s with new values. From the 10 networks avail- able we choose one with 347 nodes and 5038 edges. At this point we have a connected and undirect graph that represent 347 actors and 5038 friend rela- tionships between them. The two figures (Figures 2(a) and 2(b) ) display the earlier described network, where in the first one we point out the nodes with the highest Degree centrality and in the second one the nodes with the highest Closeness centrality. Lacking data about topic interest, a gaussian distribution has been assumed in order to simulate the depths of interests for each person in various subtopics. 1 http://snap.stanford.edu/ 2 https://www.facebook.com/apps/application.php?id=201704403232744 We used five topics with four subtopic. Each person has been assigned a value of depth between 0.0 and 1.0, where 0.0 can be interpreted as ”the person have no interest in the subtopic” and 1.0 can be interpreted as ”the person is maximum interested in that specific subtopic”. 4.2 Results The two tables below (Tables 1 and 2) evidence the first 10 results when the stan- dard formulas of centrality measure, Degree centrality and Closeness centrality, were applied on the network described earlier. Node Degree Closeness Node Degree Closeness Centrality Centrality Centrality Centrality 56 77 0.001175 277 64 0.001316 67 75 0.001185 25 68 0.001212 271 72 0.001172 322 71 0.001198 322 71 0.001198 67 75 0.001185 25 68 0.001212 119 61 0.001181 26 67 0.00116 56 77 0.001175 21 64 0.001163 271 72 0.001172 252 64 0.001149 315 55 0.001172 277 64 0.001316 21 64 0.001163 122 62 0.001155 26 67 0.00116 Table 1. Standard Centrality Table 2. Standard Centrality Measures (Ordered by Degree cen- Measures (Ordered by Closeness trality) centrality) The first table points out the first 10 actors with the highest Degree centrality. The maximum value of Degree centrality can be 346, but note that the actor with id56 has degree centrality 77 which is the maximum value on this network. Let us now focus on the results obtained on the semantic network. The next two tables (Tables 3 and 4) reveal the first 10 actors with higher Semantic Degree centrality, where in the first one are ordered by their Semantic Degree centrality in Movie and in the second one are ordered by the Semantic Degree centrality in Music. It is clear that actors, considered important applying the existing measure, in the second case, applying the new measure, may not results important (key- persons) in the network showing that semantic measures can capture different information -and we say also important ones- compared to standard ones 5 Conclusions In this paper we investigated an extension to Social Network Analysis based upon the usage of a network model that includes the notion of topic. This leads Node Movie Cousine Clothing Music Sport Node Movie Cousine Clothing Music Sport 193 0.89 0.46 0.49 0.28 0.7 79 0.33 0.48 0.31 0.89 0.49 254 0.87 0.5 0.34 0.42 0.43 300 0.29 0.38 0.35 0.88 0.43 271 0.8 0.59 0.59 0.27 0.42 74 0.57 0.37 0.56 0.86 0.2 199 0.79 0.33 0.71 0.44 0.35 273 0.38 0.24 0.67 0.85 0.41 89 0.77 0.6 0.41 0.48 0.3 21 0.4 0.34 0.71 0.82 0.1 179 0.77 0.47 0.39 0.35 0.61 191 0.7 0.18 0.56 0.82 0.53 237 0.76 0.61 0.34 0.57 0.44 145 0.41 0.7 0.42 0.81 0.49 8 0.75 0.62 0.63 0.39 0.84 180 0.63 0.4 0.42 0.81 0.67 31 0.75 0.63 0.38 0.48 0.62 43 0.75 0.57 0.38 0.79 0.43 43 0.75 0.57 0.38 0.79 0.43 62 0.42 0.37 0.7 0.79 0.59 Table 3. Semantic Degree central- Table 4. Semantic Degree central- ity ( Ordered by depth in Movie) ity (Ordered by depth in Music) to a further model that incorporates the notion of sensitivity, by means of a value, called activation that is meant to denote the probability of a member of the network to be active in an information flow. Algorithms for computing ex- tended notions of centrality are provided, and proved to be correct, complete and computationally efficient. We provide experiments that show that our approach can fruitfully solve few evident drawbacks of the general model, as applied to information flow forecast. There are at least three different ways in which this investigation can be extended. First of all we aim at formalising a problem of dissemination of infor- mation pieces throughout a network. The problem can be formulated as follows: given a social network, a number k and a probability value p, select k members in such a way that the set of members reached by an information piece sent to the members in the selection and disseminated by them and the chains of members generated therefore, has a probability of being total (namely to cover the entire network) of at least p. A second study investigates ways of providing reacher models of topics. In particular, we aim at investigating topics with sub-topics. References 1. Suraj Bandyopadhyay A. Ramachandra Rao. Measures of reciprocity in a so- cial network. Sankhy: The Indian Journal of Statistics, Series A (1961-2002), 49(2):141–188, 1987. 2. Rasim Alguliev, Ramiz Aliguliyev, and Fadai Ganjaliyev. Investigation of the role of similarity measure and ranking algorithm in mining social networks. Journal of Information Science, 37(3):229–234, 2011. 3. Franz Baader. The description logic handbook: Theory, implementation and appli- cations. Cambridge university press, 2003. 4. Rudi Cilibrasi and Paul M. B. Vitányi. The google similarity distance. IEEE Trans. Knowl. Data Eng., 19(3):370–383, 2007. 5. Matteo Cristani, Claudio Tomazzoli, and Francesco Olivieri. Semantic social net- work analysis foresees message flows. In Proceedings of the 8th International Con- ference on Agents and Artificial Intelligence, pages 296–303, 2016. 6. Juan David Cruz, Cécile Bothorel, and François Poulet. Community detection and visualization in social networks: Integrating structural and semantic information. ACM Trans. Intell. Syst. Technol., 5(1):11:1–11:26, January 2014. 7. Elnaz Davoodi, Keivan Kianmehr, and Mohsen Afsharchi. A semantic social network-based expert recommender system. Applied Intelligence, 39(1):1–13, 2013. 8. Stephen Downes. Semantic networks and social networks. The Learning Organi- zation, 12(5):411–417, 2005. 9. Guillaume Erétéo, Fabien L. Gandon, Olivier Corby, and Michel Buffa. Semantic social network analysis. CoRR, abs/0904.3701, 2009. 10. Soong Moon Kang. A note on measures of similarity based on centrality. Social Networks, 29(1):137 – 142, 2007. 11. Andrea Landherr, Bettina Friedl, and Julia Heidemann. A critical review of central- ity measures in social networks. Business and Information Systems Engineering, 2(6):371–385, 2010. 12. Loet Leydesdorff. Advances in science visualization: Social networks, semantic maps, and discursive knowledge. CoRR, abs/1206.3746, 2012. 13. Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In NIPS, volume 2012, pages 548–56, 2012. 14. P. Mika. Social networks and the semantic web. In Web Intelligence, 2004. WI 2004. Proceedings. IEEE/WIC/ACM International Conference on, pages 285–291, Sept 2004. 15. Peter Mika. Flink: Semantic web technology for the extraction and analysis of social networks. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2):211 – 223, 2005. Selcted Papers from the International Semantic Web Conference, 2004 ISWC, 20043rd. International Semantic Web Conference, 2004. 16. Soe-Tysr Yuan and Yan-Lin Fei. A synthesis of semantic social network and at- traction theory for innovating community-based e-service. Expert Systems with Applications, 37(5):3588 – 3597, 2010. 17. Yu Zhang, Huajun Chen, and Zhaohui Wu. A social network-based trust model for the semantic web. In LaurenceT. Yang, Hai Jin, Jianhua Ma, and Theo Un- gerer, editors, Autonomic and Trusted Computing, volume 4158 of Lecture Notes in Computer Science, pages 183–192. Springer Berlin Heidelberg, 2006. 18. Lina Zhou, Li Ding, and Tim Finin. How is the semantic web evolving? a dynamic social network perspective. Computers in Human Behavior, 27(4):1294 – 1302, 2011. Social and Humanistic Computing for the Knowledge Society. 19. Jolene Zywica and James Danowski. The faces of facebookers: Investigating social enhancement and social compensation hypotheses; predicting facebook and offline popularity from sociability and self-esteem, and mapping the meanings of pop- ularity with semantic networks. Journal of Computer-Mediated Communication, 14(1):1–34, 2008.