Reconstructing Social Reconstructing SocialNetworks fromEmails Networks from Emails Marcel Kvassay1, Michal Laclavík1, Štefan Dlugolinský1 Marcel Kvassay, Michal Laclavı́k, and Štefan Dlugolinský 1 Institute of Informatics, Slovak Academy of Sciences, Dúbravská cesta 9, Institute of Informatics, Slovak Academy of Sciences Dúbravská 845 cesta079,Bratislava 45, Slovak 45, 845 07 Bratislava Republic Slovak Republic marcel.kvassay@savba.sk marcel.kvassay@savba.sk Abstract. The article provides a brief overview of Social Network Analysis (SNA) and its potential for exploiting the wealth of information buried in the email archives of business and private entities. Within the scope of the COMMIUS1 project, we built a proof-of-concept prototype in Java, which used the spreading activation algorithm to reconstruct various aspects of multidimen- sional social network from emails. Two different variants of the spreading acti- vation algorithm are discussed and compared. 1 Introduction 1.1 Social Network Analysis History. The development of Social Network Analysis (SNA) is instructively mapped out by Freeman in [1] and, more briefly, by Fararo in [2]. Fig. 1 depicts the area at the intersection of sociology and psychology where SNA emerged as an alternative to traditional sociology. This area can be broadly classified as social psychology (SP), though it overlaps with other fields, such as anthropology or ethnology. Psychology typically focuses on the individual. It studies mental functions – per- ception, cognition, attention, emotion, motivation, etc. – and their role in individual and social behavior. Sociology, in contrast, focuses on collective formations and ana- lyzes human social activity starting from the micro level (agency and interaction) to the macro level (systems and social structures). In the area where sociology overlaps with psychology, the focus is on small groups. According to Wikipedia,2 there are differences between social psychology as practised by psychologists (SPp), and by sociologists (SPs). Psychologists retain their individual focus and study how the thoughts, feelings, and behaviors of individuals are influenced by other members of the group. Sociologists focus more on the group itself and study group dynamics, crowd-phenomena, etc., often in the context of larger social structures (race, class, gender). Based on this distinction, the origins of SNA can be traced specifically to social psychology as practiced by sociologists and anthropologists (SPs). 1 http://www.commius.eu/ 2 http://en.wikipedia.org/wiki/Social_psychology J. Pokorný, V. Snášel, K. Richta (Eds.): Dateso 2010, pp. 50–59, ISBN 978-80-7378-116-3. Reconstructing Social Networks from Emails 51 Early forms of SNA, such as sociometry, appeared in the 1930s but for various rea- sons did not catch on. SNA was eventually accepted as a separate discipline in the 1970s, and has been on the rise ever since. With the spread of internet and cheap computing power, it penetrated mainstream sociology to such an extent that Wikipe- dia3 now considers it “a key technique in modern sociology.” Fig. 1. SNA originated in the 1930s at the intersection of sociology and psychology, in the area shared by social psychology (SP), anthropology and ethnology. With the spread of internet and cheap computing power, it entered mainstream sociology (SNA+) According to Freeman [1], SNA has four defining features: 1. Structural intuition (that patterning of social ties influences actors); 2. Systematic collection of empirical data; 3. Use of rigorous mathematical and computational models; 4. Graphic imagery. Methods. To the use of statistics, which had a long tradition in sociology, SNA added new methods grounded in algebra and graph theory. SNA also enriched these discip- lines with new concepts and techniques, such as block-modeling or the notions of bridge, centrality, structural balance, etc. More recently the focus has shifted to com- putational sociology and multi-agent simulations of social phenomena. 1.2 Social Networks in Emails Email has become the most widespread Internet application. It is a tool supporting not only communication but also cooperation, task management, archiving, or information and knowledge management. Furthermore Email is a source of information on person- al, enterprise or community network of an individual or an organization. Email com- munication analysis allows extraction of social networks with further connection to people, organizations, locations, topics or time. Social Networks included in email archives are becoming increasingly valuable as- sets in organizations, enterprises and communities, though to date they have been little explored. While social networks in social network site such as Facebook are owned by 3 http://en.wikipedia.org/wiki/Social_network_analysis 52 Marcel Kvassay, Michal Laclavı́k, Štefan Dlugolinský third parties, email social network data are owned by individual or organization in- cluding many useful connections hidden in emails. On personal archives, Xobni4 ex- ploits social networks to help the user manage contacts and attachments, but at the enterprise or community level, social networks can be exploited to improve email search, manage customers and suppliers, prioritise emails or improve inference me- chanisms when connected with other detected semantic information from the email. Social networks within email communication have been studied to some extent. For example, communication on the Apache Web Server mailing lists and its relation to CVS activity was studied in [6]. This work also introduces the problem of identifying email users’ aliases. Extracting social networks and contact information from email and the Web and combining this information is discussed in [7]. Similarly, new email clients (e.g. Postbox) or plug-ins (Xobni) try to connect email social networks with web social networks like LinkedIn or Facebook. We have also performed some expe- riments on extraction of social networks from large email archives and network trans- formations using a semantic model [4]. Another research effort [8] exploits social networks to identify relations and tests proposed approaches on the Enron corpus. To conclude, there is much research work done on social networks in the area of web social network applications, but email social networks are a bit different since in the email you can discover the level of interactions (number of messages exchanged, time, relation to content and possibly discovered semantics), and the influence of these differences on better information and knowledge management still needs to be ex- plored. We would like to use similar approach as IBM Galaxy [10] in Nepomuk5 project, where concept of multidimensional social network was introduced. In this paper we show initial results of exploiting email social network in order to support better understanding of email content as well as allowing applications such as partner or supplier search within organization or community. 1.3 COMMIUS6 project COMMIUS project is part of the 7th Framework Programme of the European Com- mission. Its acronym stands for “Community-based Interoperability Utility for Small and Medium Enterprises.” The consortium comprises partners from Austria, Germany, Greece, Great Britain, Italy, Slovakia and Spain. Their objective is to provide SMEs (Small and Medium Enterprises) with “a zero, or very low-cost, entry into interopera- bility, based on non-proprietary protocols.” To this end, a flexible architecture based on the open-source software was designed and implemented in Java. The COMMIUS system is aimed primarily at companies that already conduct part of their business through email, i.e. send and receive orders, invoices, questionnaires, forms, etc. These documents are often manually retyped so as to enter them in the company’s order-management or accounting system. Such companies would directly benefit from COMMIUS, since it is designed to automate these tasks. COMMIUS 4 http://www.xobni.com/ 5 http://nepomuk.semanticdesktop.org/ 6 http://www.commius.eu/ Reconstructing Social Networks from Emails 53 scans the incoming emails, recognizes certain entities and documents (suppliers, cus- tomers, orders, invoices, telephone numbers, etc.), and proposes appropriate actions to the user. In case of an incoming order, for instance, the appropriate actions could be to check whether the requested items can actually be supplied, to send a confirmation to the customer, or to invoice and ship the order. These recommendations are presented to the user as HTML links that COMMIUS inserts into each incoming email. The user is then free to accept the recommendation (by clicking on the link) or proceed diffe- rently. A more detailed description of COMMIUS can be found in [3] and [4]. Social Networks in COMMIUS. The core of COMMIUS functionality deals with the business interoperability: detection of orders, invoices, payments, etc. in the incoming emails, and their semi-automated inclusion in the company’s order-management and accounting system. Social network functionality is an add-on to this core. Its main purpose is to smoothen the process of COMMIUS adoption by new users, for example by pre-populating their product, customer and supplier databases based on the infor- mation extracted from their emails. On this basis, more advanced functions can be built, e.g. search for potential business partners. Integration with the COMMIUS core. At the moment of installing COMMIUS, the user's email archives will be processed and the results stored in the form of multidi- mensional social network graph. After the installation, each incoming email will be added to this graph. Social network queries will search in the graph so that users get response in reasonable time. It should be noted that social network functions provide probabilistic results and would be offered to the user as recommendations only. 2 Implementation We implemented our “social network extractor” in Java on top of the open-source graphical library JUNG.7 The novelty of our approach is in the application of the spreading activation algorithm to two principal tasks: 1. reconstructing the social network from emails; 2. efficiently searching the social network. The prototype implementation described below is a work in progress. So far only the initial version of the prototype has been tested on the first type of task; details are provided in section 3. Evaluation. 2.1 Initial Version: Simple Cumulative Scorer In the initial version, the information extraction module (IE) passes on a collection of strings (objects) matched by regular expressions. The strings acquire a “type” accord- ing to the regular expression that matched them. In this way, they are categorized as email addresses, personal names, names of organizations, telephone numbers, etc. The 7 http://jung.sourceforge.net/ 54 Marcel Kvassay, Michal Laclavı́k, Štefan Dlugolinský email message itself is represented by a “message ID” object to which the other ob- jects are connected in a star-like fashion. If the same string is found in several messag- es, it is connected to all of them. If the same string is found multiple times in the same message, it has multiple links to that particular message ID, which is our way of re- cording the strength of the bond. The resulting network graph can be represented as a three-tier structure or tripartite graph (Fig. 2). Fig. 2. Spread of activation as implemented in the initial version of the prototype. The activa- tion starts from the left by assigning the initial activation value (val=1) to one attribute instance in Tier 1. In the next step, this initial value flows towards Tier 2 via the four red-colored links. There it accumulates in the three messages in which this attribute instance was found. One of them is connected via the double parallel link (which means it contained two occurrences of this attribute instance), so it accumulates a higher value (val=2) than the other two messages. In the final step, the activation values flow separately from each activated message (with val>0) towards Tier 3, where they further accumulate in the primary entities found in these messages. The primary entity with the highest accumulated value (val=3) is declared the “owner” of this attribute instance. The whole process is repeated for all the attribute instances in Tier 1 Our Simple Cumulative Scorer is inspired by the spreading activation algorithm in the sense that it separately “activates” each attribute instance with a uniform value of 1 and cumulatively spreads this activation (in the breadth-first manner) via the message IDs to the primary entities that will “own” the attribute. Each node can only fire once. It is an extremely simple implementation – we omitted such features of the standard spreading activation algorithm as attenuation, activation threshold or limit on the max- imum activated value –nevertheless it allowed us to establish the utility of spreading activation in social networks. Reconstructing Social Networks from Emails 55 In the three-tier structure depicted on Fig. 2, the middle tier (message IDs) links the attributes in Tier 1 to their “primary entities” in Tier 3. In general, the “primary enti- ty” can be any of the objects identified by the Information Extractor (IE) if it can “own” (or be composed of) some other objects (attributes) likewise identified by the IE. In this sense, the “date” as a complex data type can be the primary entity with respect to year, month and day of which it is composed (provided the IE identifies these as separate objects), but it can itself be the attribute of a more complex data type, such as “event,” “conference,” etc. The number of such scenarios is limited only by the capabilities of the Information Extractor. In our case – since we are trying to reconstruct the social network – the primary entities are persons and organizations represented either by their email addresses or by their proper names. The Information Extractor collects both the email addresses and proper names from all the parts of the email message (the headers as well as the body). For our test task, we have chosen the telephone numbers as the sample attribute that we wish to assign correctly to persons and organizations. Fig. 3. Hierarchical graph of objects extracted by the enhanced information extractor (IE) from one email. Nodes representing sentences, paragraphs and blocks of the message are omitted. Graph was built from our personal email. Due to privacy reasons we cannot show the graph built from the enterprise emails of our Commius partners 56 Marcel Kvassay, Michal Laclavı́k, Štefan Dlugolinský 2.2 Cumulative Edge Scorer with Attenuation The improved version of the prototype relies on the enhanced output from the infor- mation extractor (IE). There are two major improvements: 1. IE produces a hierarchical tree of complex objects as shown on Fig. 3; 2. Objects are not linked directly to message IDs, but to the nodes representing the sentences, paragraphs and blocks of the message in which they were found. In this way the information about the physical proximity of the objects in the original email is preserved for further analysis as shown on Fig. 4. In this richer and deeper tree-like structure, it makes sense to use a more sophisticated variant of spreading activation with attenuation and activation threshold. The structure can still be visualized as a multipartite graph, but the objects of each data type now require a separate partition (Fig. 4). This applies to candidate attributes as well as to primary entities. When partitioned in this way, the objects in each partition (i.e. the objects of the same data type) still have no connections among themselves, only to objects in other partitions, which is advantageous from the point of view of computa- tional complexity. Fig. 4. A more complex variant of spreading activation in a multipartite graph. Activation again starts at one attribute instance (phone number) indicated by the red arrow (center left) and flows towards candidate primary entities (email IDs) by a variety of ways. Since the activation gets attenuated each time it passes through an edge, the shortest path (indicated in red) will carry over the greatest increment. But the longer paths (such as those indicated in purple and blue) can be so numerous that – depending on the value of attenuation and other parameters – their accumulated contribution ultimately prevails Reconstructing Social Networks from Emails 57 Even in the enhanced version of our prototype, the spreading activation always starts from a single node. This allowed us to keep the simple and elegant breadth-first va- riant of the algorithm, in which each node can only fire once. Applications that start the initial activation from more than one node may need more sophisticated imple- mentation of spreading activation. 3 Evaluation We created a test email archive consisting of 28 representative sample messages sup- plied by our partners in COMMIUS. We then run the prototype with the task to assign telephone numbers to people and organizations (represented by their proper names) that were found in the emails. Though our primary goal was to evaluate the spreading activation algorithm, we could not completely insulate it from the effects caused by the Information Extractor. These are seen primarily in the figures for the recall. Out of 17 unique relevant tele- phone numbers in the test emails, the IE identified 13. These 13 numbers were then passed on to the spreading activation algorithm. As can be seen from Table 1, the initial version of the spreading activation algorithm correctly assigned 8 of them, which resulted in the precision of 61.5% and the recall of 47%. Table 1. Quantitative evaluation of the initial version of the spreading activation algorithm Total Total Correctly Wrongly Recall Precision relevant found assigned assigned [%] [%] 17 13 8 5 47 61.5 At present, we are still in the early stages of experimenting with the attenuated variant of the spreading activation. The first tentative results that we obtained are presented in Table 2. Table 2. Quantitative evaluation of the attenuated variant of spreading activation Total Total Correctly Wrongly Recall Precision relevant found assigned assigned [%] [%] 17 13 10 3 58.8 76.9 As expected, the enhanced version of the prototype gave significantly better results (the precision of 76.9% and the recall of 58.8%), though we still need to investigate the remaining problems. Similarity in the tabular way of presenting the data actually masks deep differences between the two implementations. In the initial version, the Information Extractor identified 32 candidate names of persons and organizations against which the phone numbers had to be matched. The candidate names were found purely by matching against regular expressions. 58 Marcel Kvassay, Michal Laclavı́k, Štefan Dlugolinský In the next version, it was decided to enhance the capabilities of the Information Extractor by adding gazetteers. Regular expressions were adapted so as to increase the recall and return a much larger number of candidate names, which would be subse- quently filtered by the gazetteers. However, at the time of our experiment the gazet- teers were not yet ready, so the enhanced version of the prototype had to match the phone numbers against a much larger set of 152 candidate names. That it still outper- formed the initial version is therefore doubly significant and promising. Moreover, the 3 phone numbers that were wrongly assigned had very low frequen- cies (one or two occurrences in the test corpus), so the basic assumptions of the algo- rithm were not met. Nevertheless, our initial experimentation with the prototypes was very useful and provided us with important hints for future work, which we discuss in the next section. 4 Conclusions and Future Work This article is a report of a work in progress, and our experimentation with the proto- types still continues. The most obvious need is to further test the enhanced prototype on a larger set of representative emails. Here the main challenge is to get access to relevant and representative email sets. Though the email correspondence grows at an accelerating rate, we have learned that not all emails were created equal. They fall into distinct groups which differ significantly with respect to the ways and kinds of infor- mation that can be mined from them. Each application needs to be tested on the emails that are representative of the area in which it will be actually deployed. The second lesson was that there were several alternative ways in which the Infor- mation Extractor could present the data extracted from the emails, and each had its pros and cons. There is a scope for deeper theoretical analysis here – either to find a general “canonical” structure suitable for most purposes or, alternatively, an easy way of transforming the data from one form to another depending on the task. In general, graph and data transformations may be necessary in order to filter out the irrelevant information. Certain algorithms may require it for correct functioning; in others it will help to reduce the complexity of computation. The “Social Network Extractor” component that we developed is able to process either mailboxes in mbox format or directories with email (.eml) messages, and thus extract multidimensional social network information contained in the email archive. In such a graph or network it is possible to see and exploit the links among objects such as people, time, email addresses, subjects, URLs, contact details or recipients. The preliminary results of the extraction of social networks from email archives show that it is possible to deliver Xobni-like functionality in the enterprise or organi- zational context. Our approach is based on the concept of spreading activation similar to IBM Galaxy [10]. We have shown inferring relations between people and phone numbers on a small set of emails using a simple algorithm. The success rate (precision) of the experiment is 76.9%. In future, we would like to infer the relations such as those between custom- Reconstructing Social Networks from Emails 59 ers and services, suppliers, products and transactions, organizations and people, people and address details, and others. The extracted graph data from the email archives together with a well defined and tuned spreading activation algorithm can deliver the data needed for the adaptation of Commius or other enterprise systems. Such data can also be used to fill in the enter- prise system database upon installation and thus help it to offer full functionality from the beginning. For example, we can populate a system database with a list of potential suppliers, organizations, contacts and their expertise. Acknowledgements This work is partially supported by projects Commius FP7-213876, APVV DO7RP- 0005-08, AIIA APVV-0216-07, VEGA 2/0184/10 and VEGA 2/0211/09. We would also like to thank the anonymous reviewers, based on whose comments we have sig- nificantly reworked and enhanced the paper. References 1. Freeman, L.: The Development of Social Network Analysis. Empirical Press, Vancouver (2006) (URL: http://aris.ss.uci.edu/~lin/book.pdf) 2. Fararo, T.J.: “Theoretical Sociology in the 20th Century.” Journal of Social Structure 2. (2001) (URL: http://www.cmu.edu/joss/content/articles/volume2/Fararo.html) 3. Laclavík, M., Šeleng, M., Gatial, E., Hluchý, L.: Future email services and applications. In: CEUR-WS: Proceedings of the poster and demonstration paper track of the 1st Future Inter- net Symposium (FIS´08), Vol. 399. Telecon Res. Center, Vienna (2008) 33-35. ISSN 1613- 0073. (URL: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-399). 4. Laclavík, M., Šeleng, M., Ciglan, M., Hluchý, L.: Supporting Collaboration by Large Scale Email Analysis. In: Cracow´08 Grid Workshop: Proceedings. Academic Computer Centre CYFRONET AGH, Kraków (2009) 382-387. ISBN 978-83-61433-00-2 5. Balzert, S., Burkhart, T., Kalaboukas, K., Carpenter, M., Laclavik, M., Marin, C., Mehand- jiev, N., Sonnhalter, K., Ziemann, J.: Appendix to D2.1.2: State of the Art in Interoperabili- ty Technology, Commius project deliverable (2009) 6. Bird, C., Gourley, A., Devanbu, P., Gertz, M., Swaminathan, A.: Mining Email Social Networks. In: MSR ’06: Proceedings of the 2006 International Workshop on Mining Soft- ware Repositories. ACM, New York (2006) 137–143. 7. Culotta, A., Bekkerman, R., McCallum, A.: Extracting Social Networks and Contact In- formation from Email and the Web. In: CEAS ’04: Proceedings of the First Conference on Email and Anti-Spam, 2004. http://www.ceas.cc/papers-2004/176.pdf 8. Diehl, C. P., Namata, G., Getoor, L.: Relationship Identification for Social Network Dis- covery. In: The AAAI 2008 Workshop on Enhanced Messaging (2008) 10. Judge, J., Sogrin, M., Troussov, A.: Galaxy: IBM Ontological Network Miner. In: Proceed- ings of the 1st Conference on Social Semantic Web, Volume P-113 of Lecture Notes in In- formatics (LNI) series (ISSN 16175468, ISBN 9783-88579207-9). (2007)