Identifying Interesting Postings on Social Media Sites Swathi Seethakkagari Anca Ralescu School for Electronics & Computer Systems Machine Learning & Computational Intelligence Lab University of Cincinnati School of Computing Sciences & Informatics Cincinnati, OH 45221-0030 University of Cincinnati seethasi@mail.uc.edu Cincinnati, Oh 45221-0030 anca.ralescu@uc.edu Abstract tions, and reveals global phenomena at scales that may be hard to identify when looking at a finer-grained resolution” (Leskovec, Huttenlocher, and Kleinberg 2010). This paper considers the classification of messages posted on social networking sites as a step towards identifying interesting/non-interesting messages. As a first approxima- Predicting the network evolution in time is central to such tion, a message is represented by two attributes – the message studies. In particular, detecting communities in a network, length (number of words), posting frequency (time difference predicting the links between nodes in the network, have be- between consecutive messages) for the same sender. A clas- come much studied subjects in social computing, and other sifier, trained according to a user’s perception of whether a domains based on network representations. ”Prediction can message is interesting or not, is used to label each message. be used to recommend new relationships such as friends in a Facebook is considered for illustration purposes. social network or to uncover previously unknown links such as regulatory interactions among genes” (Tan, Chen, and Esfahanian 2008). Keywords By contrast with studies to reveal ”global phenomena” one can consider the local, self-centric social network to which Social networks, classification, k-nearest neighbors. an individual belongs. The ability of anytime anywhere communication that online social networks provides to users has lead to an explosion of user generated data. Therefore, Introduction extracting patterns, global or local, from social networks, is necessary if we are to make sense of what a social network Social networking has long been an activity within social conveys about its users, and society at large. In this paper communities. Whether through relatives, friends, or acquin- we consider the setting of a social network (such as Face- tances people are routinely using their social connections to book, for example) where each user is free to post various further their careers, and improve and enjoy their lives. With messages (in Facebook this is done via the user status which the advent of online social networks and sites this type of ac- the user updates. Some users are inclined to post frequent tivity has increased, making possible networking on a large and often relatively uninteresting updates, others post them scale between people at great physical distances. Friend- more rarely and their contents are more interesting. Then ship and contacts can now be maintained over longer period again, the evaluation ”interesting/uninteresting” is subjec- of time, idea can be exchanged between massive groups of tive and varies from user to user. Therefore, to support a people. Social networks have become an excellent commu- user whose community (friends) is large, filtering or classi- nication source. fication of postings which can take into account this user’s preference is necessary. Analysis of the networking sites has led to many interesting research issues, in a field that is rapidly growing of social In the remainder of this paper we explore the idea of classi- computing and cultural modeling. A natural, and often used fying postings/updates in a user’s centered community based way to represent a network is through graphs, in which a on the user’s perception of their contents as interesting or vertex corresponds to an entity in the network, usually an not. individual, and an edge connecting two vertices represents some form of relationship between the corresponding in- dividuals (Al Hasan et al. 2006). ”Social network analy- sis provides a significant perspective on a range of social computing applications. The structure of networks arising in such applications offers insights into patterns of interac- Analysis of messages on the social network Algorithm 1 Pseudo code for labeling a message using the k-NN Classifier Require: 2-class training data set of of size n In the setting of a Facebook-like network, let I denote a Require: Test data point and k (odd values to avoid ties) generic individual, and F(I) the collection of friends (direct Require: classification technique: k Nearest Neighbors or indirect) of I. A snapshot of such Facebook community Classifier is illustrated in Figure 1. Ensure: The test data point is labeled with its class based on classifier output. Labels are set to -1 or +1. Compute the classifier based on the distance of a test da- tum to the k nearest neighbors. for i = 1, . . . , n do Calculate the distance, dist(i) with the ith data point in the training set Sort the distances Extract the k nearest neighbors if the sum of the top k labels is positive then label of test data point is set to 1; else label of test data point is set to -1; end if end for Figure 1: Snapshot from a Facebook friends network show- A small real example ing a group of clusters. Table 1 shows a small set of updates posted on one of au- thors (S. Seethakkagari) Facebook page. The labels, ±, are For each friend f ∈ F(I), uI (f ) is an update of posted by assigned according to her subjective evaluation of each post- f . The extent to which an update is interesting is, of course, ing. a matter of its contents. This means that in principle, a text analysis of its contents should be done. However, while this would not doubt provide a deeper understanding of the ac- Table 1: A small set of postings extracted from S. Seethakk- tual content, other characteristics, such as message length, agari’s wall on facebook. N is the message length, T , the and frequency of messages from a particular f might give a time from the last message of the same sender. The Label is close enough idea of how interesting a message is. For the assigned according her subjective evaluation of the posting purpose of this paper then, uI (f ; n, t) denotes the update content. of length n (words), posted by the friend f at time inter- val t. The attributes, n and t are used to classify the update ID 1 2 3 4 5 6 7 8 9 uI (f ; n, t) as interesting or not. N 23 16 26 20 30 22 32 16 12 T 272 81 149 287 10 26 4 1 36 Label 1 -1 1 1 1 1 -1 1 1 k-Nearest Neighbors Classification of Postings ID 10 11 12 13 14 15 16 17 18 N 6 4 7 1 32 4 17 15 3 T 0 64 39 558 199 72 52 32 216 Label -1 -1 -1 1 -1 1 1 1 1 The well known k-nearest neighbor classifier (Hart 1967) is used to classify a newly posted message. The classification rule used by the k-nearest algorithm is very simple: using ID 19 20 21 22 23 24 25 a set of labeled examples, a new example is classified ac- N 61 2 38 13 2 6 23 cording to its k-nearest neighbors, where k is a parameter T 27 594 63 0 80 39 57 of the algorithm. The k nearest neighbors ”vote” each for Label -1 1 1 -1 -1 1 1 the class with it has been labeled. Variations of the algo- rithm make possible to weigh a vote by the actual distance from each of these neighbors to the new example(Al Hasan et al. 2006): the vote of a closer neighbor counts more than Experimental Results that of a neighbor farther away. In the small experiment de- scribed below the simpler version of the algorithm is used. The data of labeled postings, shown in Table 1 are plotted in Algorithm 1 describes the steps for this classification. the length × time space as shown in Figure 2. 600 as the probability of classification accuracy greater than or T: time interval from last posting by the same person + : interesting postings equal to 50% is over 83%. As a future study, a larger at- 500 o : uninteresting postings tribute set may be used. For example, the comments (their length and/or contents) received for previous message, the 400 ID of a message sender, can be considered. However, the 300 tradeoff between classification accuracy and computational efficiency. For example, as the number of attributes, or the 200 number of neighbors k increase, the complexity in calcula- tion increases. Feedback from the user may be used to adapt 100 5000 5000 0 4500 4500 4000 4000 −100 frequency of each accuracy 0 10 20 30 40 50 60 70 3500 3500 frequency of accuracy N: length of posting 3000 3000 2500 2500 Figure 2: Plot of the 25 postings from S. Seethakkagari’s 2000 2000 site. 1500 1500 1000 1000 The following experiment was carried out: all the possible 500 500 test data sets of four postings were generated for a total of 0 0 20 40 60 80 100 0 5 15 25 35 45 55 65 75 85 95 12650 sets. The corresponding training sets were obtained accuracy of predicted labels (%) accuracy of predicted labels as percentage by eliminating the test data sets from the set of postings. The number of neighbors was selected to be k = 3. Table 2 and Figure 3: Accuracy of prediction when 21 messages are used Figure 3 show the classification results of all test postings. to predict labels of a subset of four messages. Table 2: Results of classification for all postings of four mes- the classifier so as to achieve a better tradeoff between speed sages with respect to 21 postings used as training data. and accuracy. accuracy(%) 0 25 50 75 100 frequency 252 1865 4544 4455 1534 average 60.1858 mode 50 median 50 References Al Hasan, M.; Chaoji, V.; Salem, S.; and Zaki, M. 2006. Link prediction using supervised learning. In SDM06: Workshop on Link Analysis, Counter-terrorism and Secu- rity. Conclusion and future work Hart, P. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1):21–27. We explored the use of classification of postings on a so- Leskovec, J.; Huttenlocher, D.; and Kleinberg, J. 2010. cial media site into two classes: interesting versus non- Signed networks in social media. In Proceedings of the interesting. Each message was encoded using two at- 28th international conference on Human factors in com- tributes,length, expressed as the N the number of words in puting systems, 1361–1370. ACM. the message and the frequency with which a its sender posts Tan, P.; Chen, F.; and Esfahanian, A. 2008. A Matrix messages. Experiments were run for the set shown in Ta- Alignment Approach for Link Prediction. In Proceedings ble 1, a small, but real data set, using a k-nearest neighbor of ICPR 2008. classifier, with k = 3. We consider the results encouraging,