Identifying Interesting Postings on Social Media Sites
                   Swathi Seethakkagari                                             Anca Ralescu
          School for Electronics & Computer Systems                Machine Learning & Computational Intelligence Lab
                    University of Cincinnati                          School of Computing Sciences & Informatics
                  Cincinnati, OH 45221-0030                                     University of Cincinnati
                seethasi@mail.uc.edu                                          Cincinnati, Oh 45221-0030
                                                                              anca.ralescu@uc.edu


                           Abstract                                  tions, and reveals global phenomena at scales that may be
                                                                     hard to identify when looking at a finer-grained resolution”
                                                                     (Leskovec, Huttenlocher, and Kleinberg 2010).
  This paper considers the classification of messages posted
  on social networking sites as a step towards identifying
  interesting/non-interesting messages. As a first approxima-        Predicting the network evolution in time is central to such
  tion, a message is represented by two attributes – the message     studies. In particular, detecting communities in a network,
  length (number of words), posting frequency (time difference       predicting the links between nodes in the network, have be-
  between consecutive messages) for the same sender. A clas-         come much studied subjects in social computing, and other
  sifier, trained according to a user’s perception of whether a      domains based on network representations. ”Prediction can
  message is interesting or not, is used to label each message.      be used to recommend new relationships such as friends in a
  Facebook is considered for illustration purposes.                  social network or to uncover previously unknown links such
                                                                     as regulatory interactions among genes” (Tan, Chen, and
                                                                     Esfahanian 2008).
Keywords
                                                                     By contrast with studies to reveal ”global phenomena” one
                                                                     can consider the local, self-centric social network to which
Social networks, classification, k-nearest neighbors.                an individual belongs. The ability of anytime anywhere
                                                                     communication that online social networks provides to users
                                                                     has lead to an explosion of user generated data. Therefore,
                       Introduction                                  extracting patterns, global or local, from social networks, is
                                                                     necessary if we are to make sense of what a social network
Social networking has long been an activity within social            conveys about its users, and society at large. In this paper
communities. Whether through relatives, friends, or acquin-          we consider the setting of a social network (such as Face-
tances people are routinely using their social connections to        book, for example) where each user is free to post various
further their careers, and improve and enjoy their lives. With       messages (in Facebook this is done via the user status which
the advent of online social networks and sites this type of ac-      the user updates. Some users are inclined to post frequent
tivity has increased, making possible networking on a large          and often relatively uninteresting updates, others post them
scale between people at great physical distances. Friend-            more rarely and their contents are more interesting. Then
ship and contacts can now be maintained over longer period           again, the evaluation ”interesting/uninteresting” is subjec-
of time, idea can be exchanged between massive groups of             tive and varies from user to user. Therefore, to support a
people. Social networks have become an excellent commu-              user whose community (friends) is large, filtering or classi-
nication source.                                                     fication of postings which can take into account this user’s
                                                                     preference is necessary.
Analysis of the networking sites has led to many interesting
research issues, in a field that is rapidly growing of social        In the remainder of this paper we explore the idea of classi-
computing and cultural modeling. A natural, and often used           fying postings/updates in a user’s centered community based
way to represent a network is through graphs, in which a             on the user’s perception of their contents as interesting or
vertex corresponds to an entity in the network, usually an           not.
individual, and an edge connecting two vertices represents
some form of relationship between the corresponding in-
dividuals (Al Hasan et al. 2006). ”Social network analy-
sis provides a significant perspective on a range of social
computing applications. The structure of networks arising
in such applications offers insights into patterns of interac-
Analysis of messages on the social network                        Algorithm 1 Pseudo code for labeling a message using the
                                                                  k-NN Classifier
                                                                  Require: 2-class training data set of of size n
In the setting of a Facebook-like network, let I denote a         Require: Test data point and k (odd values to avoid ties)
generic individual, and F(I) the collection of friends (direct    Require: classification technique: k Nearest Neighbors
or indirect) of I. A snapshot of such Facebook community            Classifier
is illustrated in Figure 1.                                       Ensure: The test data point is labeled with its class based
                                                                    on classifier output. Labels are set to -1 or +1.
                                                                    Compute the classifier based on the distance of a test da-
                                                                    tum to the k nearest neighbors.
                                                                    for i = 1, . . . , n do
                                                                      Calculate the distance, dist(i) with the ith data point in
                                                                      the training set
                                                                      Sort the distances
                                                                      Extract the k nearest neighbors
                                                                      if the sum of the top k labels is positive then
                                                                          label of test data point is set to 1;
                                                                      else
                                                                          label of test data point is set to -1;
                                                                      end if
                                                                    end for


Figure 1: Snapshot from a Facebook friends network show-                           A small real example
ing a group of clusters.
                                                                  Table 1 shows a small set of updates posted on one of au-
                                                                  thors (S. Seethakkagari) Facebook page. The labels, ±, are
For each friend f ∈ F(I), uI (f ) is an update of posted by       assigned according to her subjective evaluation of each post-
f . The extent to which an update is interesting is, of course,   ing.
a matter of its contents. This means that in principle, a text
analysis of its contents should be done. However, while this
would not doubt provide a deeper understanding of the ac-         Table 1: A small set of postings extracted from S. Seethakk-
tual content, other characteristics, such as message length,      agari’s wall on facebook. N is the message length, T , the
and frequency of messages from a particular f might give a        time from the last message of the same sender. The Label is
close enough idea of how interesting a message is. For the        assigned according her subjective evaluation of the posting
purpose of this paper then, uI (f ; n, t) denotes the update      content.
of length n (words), posted by the friend f at time inter-
val t. The attributes, n and t are used to classify the update      ID       1       2     3      4       5    6     7     8        9
uI (f ; n, t) as interesting or not.                                 N      23      16    26     20      30   22    32    16       12
                                                                     T     272      81   149    287      10   26     4     1       36
                                                                   Label     1      -1     1      1       1    1    -1     1        1

k-Nearest Neighbors Classification of Postings                      ID       10     11     12    13     14    15    16    17        18
                                                                     N        6      4      7     1     32     4    17    15         3
                                                                     T        0     64     39   558    199    72    52    32       216
                                                                   Label     -1     -1     -1     1     -1     1     1     1         1
The well known k-nearest neighbor classifier (Hart 1967) is
used to classify a newly posted message. The classification
rule used by the k-nearest algorithm is very simple: using          ID       19    20      21     22     23   24    25
a set of labeled examples, a new example is classified ac-           N       61     2      38     13      2    6    23
cording to its k-nearest neighbors, where k is a parameter           T       27   594      63      0     80   39    57
of the algorithm. The k nearest neighbors ”vote” each for          Label     -1     1       1     -1     -1    1     1
the class with it has been labeled. Variations of the algo-
rithm make possible to weigh a vote by the actual distance
from each of these neighbors to the new example(Al Hasan
et al. 2006): the vote of a closer neighbor counts more than                      Experimental Results
that of a neighbor farther away. In the small experiment de-
scribed below the simpler version of the algorithm is used.       The data of labeled postings, shown in Table 1 are plotted in
Algorithm 1 describes the steps for this classification.          the length × time space as shown in Figure 2.
                                                              600
                                                                                                                                                    as the probability of classification accuracy greater than or
     T: time interval from last posting by the same person
                                                                                                           + : interesting postings                 equal to 50% is over 83%. As a future study, a larger at-
                                                              500                                          o : uninteresting postings               tribute set may be used. For example, the comments (their
                                                                                                                                                    length and/or contents) received for previous message, the
                                                              400
                                                                                                                                                    ID of a message sender, can be considered. However, the
                                                              300
                                                                                                                                                    tradeoff between classification accuracy and computational
                                                                                                                                                    efficiency. For example, as the number of attributes, or the
                                                              200                                                                                   number of neighbors k increase, the complexity in calcula-
                                                                                                                                                    tion increases. Feedback from the user may be used to adapt
                                                              100
                                                                                                                                                                                 5000                                                                                    5000

                                                                0                                                                                                                4500                                                                                    4500

                                                                                                                                                                                 4000                                                                                    4000
                                                             −100


                                                                                                                                                    frequency of each accuracy
                                                                    0     10      20         30        40         50        60          70                                       3500                                                                                    3500


                                                                                                                                                                                                                                                 frequency of accuracy
                                                                                         N: length of posting
                                                                                                                                                                                 3000                                                                                    3000

                                                                                                                                                                                 2500                                                                                    2500
Figure 2: Plot of the 25 postings from S. Seethakkagari’s                                                                                                                        2000                                                                                    2000
site.
                                                                                                                                                                                 1500                                                                                    1500

                                                                                                                                                                                 1000                                                                                    1000

The following experiment was carried out: all the possible                                                                                                                       500                                                                                     500

test data sets of four postings were generated for a total of                                                                                                                      0
                                                                                                                                                                                        0   20          40             60           80     100
                                                                                                                                                                                                                                                                           0
                                                                                                                                                                                                                                                                                5   15     25 35 45 55 65 75 85                       95
12650 sets. The corresponding training sets were obtained                                                                                                                                        accuracy of predicted labels (%)                                                        accuracy of predicted labels as percentage

by eliminating the test data sets from the set of postings. The
number of neighbors was selected to be k = 3. Table 2 and                                                                                           Figure 3: Accuracy of prediction when 21 messages are used
Figure 3 show the classification results of all test postings.                                                                                      to predict labels of a subset of four messages.

Table 2: Results of classification for all postings of four mes-
                                                                                                                                                    the classifier so as to achieve a better tradeoff between speed
sages with respect to 21 postings used as training data.
                                                                                                                                                    and accuracy.
  accuracy(%)                                                                        0          25                50         75               100
  frequency                                                                        252        1865              4544       4455              1534
  average                                                                      60.1858
  mode                                                                              50
  median                                                                            50                                                                                                                                                   References
                                                                                                                                                                                 Al Hasan, M.; Chaoji, V.; Salem, S.; and Zaki, M. 2006.
                                                                                                                                                                                 Link prediction using supervised learning. In SDM06:
                                                                                                                                                                                 Workshop on Link Analysis, Counter-terrorism and Secu-
                                                                                                                                                                                 rity.
                                                                        Conclusion and future work                                                                               Hart, P. 1967. Nearest neighbor pattern classification.
                                                                                                                                                                                 IEEE Transactions on Information Theory 13(1):21–27.
We explored the use of classification of postings on a so-                                                                                                                       Leskovec, J.; Huttenlocher, D.; and Kleinberg, J. 2010.
cial media site into two classes: interesting versus non-                                                                                                                        Signed networks in social media. In Proceedings of the
interesting. Each message was encoded using two at-                                                                                                                              28th international conference on Human factors in com-
tributes,length, expressed as the N the number of words in                                                                                                                       puting systems, 1361–1370. ACM.
the message and the frequency with which a its sender posts                                                                                                                      Tan, P.; Chen, F.; and Esfahanian, A. 2008. A Matrix
messages. Experiments were run for the set shown in Ta-                                                                                                                          Alignment Approach for Link Prediction. In Proceedings
ble 1, a small, but real data set, using a k-nearest neighbor                                                                                                                    of ICPR 2008.
classifier, with k = 3. We consider the results encouraging,