=Paper= {{Paper |id=None |storemode=property |title=Towards Personalized Offers by Means of Life Event Detection on Social Media and Entity Matching |pdfUrl=https://ceur-ws.org/Vol-1210/SP2014_06.pdf |volume=Vol-1210 |dblpUrl=https://dblp.org/rec/conf/ht/CavalinGP14 }} ==Towards Personalized Offers by Means of Life Event Detection on Social Media and Entity Matching== https://ceur-ws.org/Vol-1210/SP2014_06.pdf
       Towards Personalized Offers by Means of Life Event
         Detection on Social Media and Entity Matching

               Paulo Cavalin                           Maíra Gatti                     Claudio Pinhanez
            IBM Research - Brazil                 IBM Research - Brazil              IBM Research - Brazil
            pcavalin@br.ibm.com                   mairacg@br.ibm.com                 csantosp@br.ibm.com



ABSTRACT
In this paper we present a system for personalized offers based on two main components: a) a hybrid
method, combining rules and machine learning, to find users that post life events on social media
networks; and b) an entity matching algorithm to find possible relations between the detected
social media users and current clients. The main assumption is that, if one can detect the life
events of these users, a personalized offer can be made to them even before they look for a product
or service. The proposed solution was implemented on the IBM InfoSphere BigInsights platform to
take advantage of the MapReduce programming framework for large scale capability, and was tested
on a dataset containing 9 million posts from Twitter. In this set, 42K life event posts sent by 19K
different users were detected, with an overall accuracy of 89% and precision of about 65% in
detecting life events. The entity matching of these 19K social media users against an internal
database of 1.6M users returned 983 users, with accuracy of about 90%.

Keywords
Social Media Networks, Life Event Detection, Natural Language Processing, Machine Learning,
Entity Matching

1.   INTRODUCTION
Social Media Networks (SMNs), such as Twitter and Facebook, engage thousands of people who post,
on a daily basis, a huge amount of content in the form of texts, images, videos, etc. [5, 10]. Often
the content can be intimately related to the person who publishes it, in such a way that it can
expose behavioral traits or events that are happening in the individual’s life. As a consequence,
the proper exploration of this type of content not only can be a way to better understand the users
on SMNs, but also can leverage many applications that require adequate user profiling, for instance
credit risk analysis, marketing campaigns, and personalized product and/or service offers.

One way to find potential customers for services or products is by detecting life events from
public user activities on SMNs, especially microblogs. Generally, a life event can be defined as
something important that happened, is happening, or will happen in a particular individual’s life,
such as getting married, graduating, having a baby, buying a house, and so forth. That is, if a life
event is properly detected, a product or service can be offered to someone even before she looks
for it, anticipating her needs. For instance, if a person posts on the SMN that her marriage will be
happening in a few days (or weeks or months), a loan or an insurance policy (for the honeymoon trip,
for example) can be offered to her in advance. Furthermore, as stated in [6], marketers know that
people mostly shop based on habits, but that among the most likely times to break those habits is
when a major life event happens.

For this reason, this work focuses on presenting a system that can detect life events from textual
posts on SMNs, and can match the corresponding users with an existing database, i.e. entity
matching with current clients, using basic information such as the name and the location available
on the SMN. Entity matching is important to understand whether a given user of an SMN is already a
customer or not, and to adapt the way the person can be approached.

Both life event detection and entity matching are complex tasks which are the subject of research
in fields such as artificial intelligence, machine learning [6], natural language processing and
large scale analysis of unstructured data (popularly known as Big Data) [12]. Performing natural
language processing on microblog posts presents several challenges, such as dealing with the short
and asynchronous nature of the messages, which makes it difficult to extract contextual
information, and dealing with a very unnormalized vocabulary due to the frequent use of slang,
acronyms, abbreviations, and informal language, often with misspellings [1, 7, 13]. Nonetheless,
one study that supports the possibility of detecting life events from textual posts has been
presented in [4]. In that work, the authors conducted a study on the behavior of mothers during
pregnancy, and they observed that these mothers can be distinguished by linguistic changes captured
by shifts in a relatively small number of words in their social media posts.

In the light of this, in this work we describe and evaluate our proposed solution to tackle the
life event detection problem and the entity matching. For the first task, we propose a
hybrid system combining rules and machine learning (ML). In contrast to the system specifically
focused on life event detection presented in [6] (the only one for this problem to the best of our
knowledge), which uses only ML, our system allows for dealing with the life event classes
independently. The rule-based phase acts as a mechanism to filter out most posts that do not
contain life events, since all posts not matching the desired rules are eliminated. Then, binary
classifiers (one for each type of life event) are applied to validate the possible life events.
Greater detail is provided in Section 3.1. For entity matching, a combination of string distance
functions is used to compare the names and locations of the users. This method is better described
in Section 3.2.

The entire system has been implemented on the IBM InfoSphere BigInsights platform [9], to take
advantage of the MapReduce programming paradigm for large scale data processing. A dataset
containing 9 million posts in Portuguese, extracted from Twitter, has been used to evaluate the
system. To evaluate the entity matching, a database with 1.6 million users has been constructed.
More details about the experiments are presented in Section 4.

2.   BACKGROUND AND RELATED WORK
Since the work proposed in this paper is a hybrid solution in which we integrate an ML-based
classifier with an Entity Matching solution, the background and related work are presented
separately for each as follows:

Life Event Detection: as already mentioned, a life event can be defined as something important
regarding the users’ lives in SMNs. It is important to differentiate it from some related work
which uses the expression event detection to refer to the problem of detecting unexpected events
exposed by several users in SMNs, like a rumor, a trend, or an emergent topic. In the case of the
work proposed in this paper, detection means to classify a short post, like Twitter’s or Facebook’s
status messages, in one of the life event categories, which could be considered, for instance,
topics. Therefore, as related work, any approach to topic classification of short messages could be
considered, like [6], which is the most related to our work. Regarding ML-based solutions, other
supervised or unsupervised methods for topic classification are also related, although they have
not yet been used for short messages but for long documents. And regarding semantic-rule-based
solutions, AQL rules combined with dictionaries are known approaches for topic classification with
the usage of templates. Ontologies have also been applied for long documents.

Entity Matching: in SMNs there are two problems one can find Entity Matching solutions for. One
is, given a set containing user features on SMNs, like user information and activities, and another
set containing real people information, the goal is to try to match the users within both sets. The
second problem is, given two sets containing user features on two different SMNs, the goal is to
try to find corresponding users, i.e., the biggest possible number of social profiles that refer to
the same person between both social networks. The latter can also be called the Entity Resolution
(ER) problem, and in the past few years some work has been proposed to solve this problem. For
instance, [14] proposed supervised learning techniques and extracted features to build different
classifiers, which were then trained and used to rank the probability that two user profiles from
two different OSNs belong to the same individual.

The former problem can be considered a subset of the latter if we ignore the fact that the second
set contains real people information rather than SMN profiles. And generally, as summarized by
[15], there are two approaches for handling this: (i) syntactic-based similarity approaches:
providing exact or approximate lexicographical matching of two values; and (ii) semantic-based
similarity approaches: used to measure how two values, lexicographically different, are
semantically similar. For instance, the Foaf-o-matic (http://www.foaf-o-matic.org/) and OKKAM
(http://www.okkam.org/) projects aim at social profile integration by means of formal FOAF
(Friend-of-a-friend) semantics.

Regarding syntactic-based similarity approaches, we summarize here the ones most used for URIs,
numeric-based attributes and, in the context of SMNs, two users’ full names. Levenshtein or Edit
Distance [11] is defined to be the smallest number of edit operations (inserts, deletes, and
substitutions) required to change one string into another. In addition, Jaro is an algorithm
commonly used for name matching in data linkage systems. A similarity measure is calculated using
the number of common characters (i.e., same characters that are within half the length of the
longer string) and the number of transpositions. Winkler (or Jaro-Winkler) improves upon Jaro’s
algorithm by applying ideas based on empirical studies which found that fewer errors typically
occur at the beginning of names [3][2].

Another approach is the N-Gram name similarity, in which n-grams are sub-strings of length n and an
n-gram similarity between two strings is calculated by counting the number of n-grams in common
(i.e., n-grams contained in both strings) and dividing by either the number of n-grams in the
shorter string (called Overlap coefficient), or the number of n-grams in the longer string (called
Jaccard similarity), or the average number of n-grams in both strings. 2-grams and 3-grams have
been used to calculate the similarity between the two users’ full names. Finally, the VMN name
similarity approach proposed by [18] was designed for full and partial matches of names consisting
of one or more words. VMN supports the case of swapped names and the cases of partial matches.

In this paper, we use two versions of ED preceded by Jaro’s similarity as described in the next
section.

3.     METHODOLOGY
In this section we describe in detail both systems, for life event detection and for entity
matching.

3.1      Hybrid Life Event Detection System
Given a social media network, the life event detection system has as its main goal to return a list
of users that posted life events within a given time window. This task involves a crawler to gather
data, and a system to search for life events on the data. Note that not only accuracy is important
in this case, to find the largest list of users with a high precision, but also performance is
important since the system is likely
to face a large amount of data. In addition, in a production environment, the system must allow for
easy fine-tuning, addition and removal of life event classes.

  Figure 1: Hybrid Life Event Detection System.

To cope with the aforementioned issues, we propose a hybrid life event detection approach,
combining both rules and machine learning (ML). Such a system, depicted in Figure 1, is basically
composed of three consecutive phases or modules, namely Ingest, Filter, and Detect. The first
phase, i.e. Ingest, captures a database of posts to be used for the search for life events. This is
done by considering a set of words that can possibly relate to all life events of the system. We
assume that the larger this dataset, the larger the set of users that will be returned. Once the
set of posts has been totally crawled, the Filter module selects the set of posts that is more
likely to contain life events. That is, by considering a set of simple rules such as words and
combinations of words (but more elaborate rules than those of Ingest), in this case with one set of
rules for each type of life event, the posts that match these rules are marked with the
corresponding possible life events.

Although these rules can indicate a possible life event, a large portion of these messages can be
false candidates. For this reason, the Detect phase is then carried out to validate the possible
life events with their corresponding probability. For each post found in the Filter phase, we apply
the ML classifier of the corresponding possible life events and compute the probability that the
post contains the given life events. With this information, all posts with life event probability
above the threshold θ are selected and the users of the corresponding posts are generated as the
output of the system.

It is worth noting that currently ML is well-known to produce the best solutions to deal with
ambiguous and noisy texts such as microblog posts. However, the proposed hybrid solution takes
advantage of the rule-based filtering to reduce the search space for the ML classifier, which can
reduce both the number of errors and processing time. Moreover, treating types of life events
independently makes fine-tuning, addition and removal of life event classes easy. For instance, to
add a new type of life event, one needs to append the corresponding keywords for the Ingest phase,
the rules for Filter, and a binary classifier in the Detect phase. This can be done with no impact
on the accuracy of existing life events.

3.2      Entity Matching System
Given the output of the life event detection system, i.e. users (aka entities) that posted life
events on social media, the main goal of the entity matching system is to find corresponding people
in a database of real names. To achieve this task accurately, the system must use as much
information as possible to decrease the level of uncertainty.

Dealing with users found on SMNs, though, is very challenging. First of all, on most SMNs the basic
information about the user (e.g. name, location, age) is very limited (on Twitter only the name and
location of the user are available). In addition, such personal information may be lacking or not
relevant since filling it in may not be mandatory, and the content filled in is not verified.
Besides that, even when the information is seriously provided by the user, other difficulty factors
can appear, such as the use of simplified names (Claudio Pinhanez instead of Claudio Santos
Pinhanez), the use of social media pen-names (@cinhanez instead of Claudio Santos Pinhanez), or the
use of nicknames (Darth Vader instead of Claudio Santos Pinhanez).

To deal with some of the aforementioned difficulty factors, for this work we have developed a
system to match names and locations of users using three different string distance functions:

     1. Exact matching (EM): a match is found if all the names of an SMN user are identical to
        those of a client.

     2. Entity Distance 1 (ED1): designed to consider misspellings and transpositions between
        adjacent characters as a match. For instance, the user “Jooa Paulo” matches the client
        “Joao Paulo”, and the user “Carolina” matches “Carolina”. In this case, the threshold σ1 is
        used to define a match only if the similarity value is above this threshold.

     3. Entity Distance 2 (ED2): designed to match abbreviations and some nicknames. For example,
        the user “Joseph S.” matches the client “Joseph Salem”; the user “Fabinho” matches the
        client “Fábio”, and “Mari” matches “Mariana”. Similarly to ED1, the threshold σ2 is used to
        define a match.

The execution of the three aforementioned matching algorithms results in three distinct sets of
users, denoted ΩEM, ΩED1 and ΩED2. The resulting set of users ΩAll corresponds to the union of
those individual sets. That is, ΩAll = ΩEM ∪ ΩED1 ∪ ΩED2, where ΩEM ∩ ΩED1 ∩ ΩED2 ≠ ∅ or
ΩEM ∩ ΩED1 ∩ ΩED2 = ∅, depending on the data.

It is worth mentioning that the Jaro-Winkler similarity filtering [20] is used prior to calling
ED1 and ED2, to eliminate weak matches such as ’Maria’ and ’Maria das Graças Silva’. Furthermore,
ED1 and ED2 may return more than one match for the same user, whenever the result is above the
given threshold. In this work, only the match with the highest value is considered.

4.     EXPERIMENTS
In this section we present the results of applying the proposed system on a dataset containing 9
million posts
from Twitter, which have been produced by about 1.4 mil-
lion users. This data has been gathered by means of the
GNIP social media data provider [8].

Mainly, these experiments have two different purposes. First
we aim at evaluating the numbers related to applying the
system on this 9 million dataset, i.e. how many posts and
users are returned by using the system. And second, we
focus on a quality analysis to validate those numbers by
means of a manual inspection of samplings of this dataset.

The life event detection system has been implemented for
six types of life events: Marriage, Graduation, Travel, Birthday,
Birth, and Death. For each one, a training dataset of
about 2 thousand samples has been manually labeled as ei-
ther life event or non life event, and a distinct classifier has
been trained. The training data has been obtained with the
Twitter Search API [17]. For this work we make use of Naive
Bayes classifiers using bag-of-words features [19]. The main
parameters, i.e. θ, σ1 and σ2 , have been set to 0.5, 0.95 and
0.95, respectively.                                                        Figure 2: Results on the 9M dataset.
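
The Filter and Detect phases with the threshold θ = 0.5 can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: the regular-expression rules, the cue-word scorer that stands in for the trained Naive Bayes classifiers, and the example posts are all hypothetical. Only the phase structure (one rule set and one binary classifier per life event type, followed by thresholding on θ) comes from the text.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

THETA = 0.5  # detection threshold reported in the experiments

# Hypothetical Filter rules: one regular expression per life event type.
FILTER_RULES: Dict[str, "re.Pattern[str]"] = {
    "Marriage": re.compile(r"\b(my wedding|getting married)\b", re.I),
    "Travel": re.compile(r"\b(my trip|travell?ing to)\b", re.I),
}


def toy_classifier(cue_words: Set[str]) -> Callable[[str], float]:
    """Stand-in for a trained binary classifier (assumption, not Naive Bayes)."""
    def predict_proba(text: str) -> float:
        tokens = set(re.findall(r"\w+", text.lower()))
        # Pseudo-probability that grows with the number of cue words found.
        return min(1.0, 0.3 + 0.3 * len(tokens & cue_words))
    return predict_proba


# One binary classifier per life event type, as in the Detect phase.
CLASSIFIERS: Dict[str, Callable[[str], float]] = {
    "Marriage": toy_classifier({"my", "tomorrow", "fiance"}),
    "Travel": toy_classifier({"my", "flight", "tomorrow"}),
}


def filter_phase(posts: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Mark each (user, text) post with every life event whose rule it matches."""
    candidates = []
    for user, text in posts:
        for event, rule in FILTER_RULES.items():
            if rule.search(text):
                candidates.append((user, text, event))
    return candidates


def detect_phase(candidates, theta: float = THETA) -> List[Tuple[str, str]]:
    """Keep the candidates whose per-event probability is above theta."""
    return [(user, event) for user, text, event in candidates
            if CLASSIFIERS[event](text) > theta]


posts = [
    ("ana", "Getting married tomorrow, my fiance is nervous!"),
    ("shop", "Getting married? Buy rings here"),
    ("bob", "Nice weather today"),
]
detected = detect_phase(filter_phase(posts))
print(detected)
```

In this toy run two posts pass the Marriage rule, but only the first one is kept by the classifier, which mirrors the role of Detect as a validator of the candidates produced by Filter. Adding a new life event type amounts to adding one entry to each of the two dictionaries.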

4.1    Quantitative Results
As we mentioned, the first experiment has as main purpose          Table 1: Detailed results on the 9 million dataset.
to evaluate how many posts and users are returned after                Life event     Filter (% of Detect (%
carrying out each phase of the proposed system. The results                           350k)          of 42k)
of applying the implemented life event detection system on the         Marriage       182,096 (52.4) 19,457 (46.5)
9-million-tweet dataset are summarized in Figure 2. In this            Graduation     25,676 (7.4)   11,097 (26.5)
case, the Filter phase has returned 347 thousand posts from            Travel         22,596 (6.5)   1,868 (4.5)
about 220 thousand users. Then, after going through the                Birthday       33,305 (9.6)   3,604 (8.6)
Detect module, 42 thousand posts, from about 19 thousand               Birth          48,687 (14.0)  3,881 (9.3)
users, have been detected as life events. It is worth noting           Death          35,242 (10.1)  1,929 (4.6)
the large difference in terms of proportion from one phase             Total (% of 347,602 (3.7)     41,836 (0.45)
to another. The Ingest phase captures a very large dataset,            9M)
i.e. 9 million posts. Then, Filter finds out that only 3.7%
of these posts can be of interest. However, the Detect phase
shows that from these 347K posts, only 42 thousand                 containing 1.6 million users using publicly-available data.
(0.45% of 9M or 12% of 347K) are really those that the             The users on this dataset have been matched against the
application is looking for. Considering that many of the           19 thousand users that have been detected as the ones that
current search systems are rule-based, these results indicate      posted life events in the 9M dataset. The results and this
that our proposed system can avoid a useless search on about       process are illustrated in Figure 3. Note that we have con-
88% of the posts returned, 307 thousand posts in this case.        ducted two different experiments. The first one matches
                                                                   these users by taking into account only their names, since
In Table 1 we present the results of the experiment above          we consider this as the minimum information we will be able
for each type of life event. We can observe that about 12%         to obtain from the SMN. In this case, 983 users have been
of the posts filtered have been generally confirmed as life        found as probable matches. In the second experiment, where
events, but this proportion can vary according to the type         both names and locations are considered, only 5 users have
of life event. For instance, for the Marriage class, from the      been found. This shows that the precision of entity matching
182,096 posts that the filter considered as possible life events,  can be increased by considering more information in this process. On
the machine learning algorithm detected 19,457 (10.6%) as          the other hand, this will also reduce the size of the resulting
being actually life events, which is close to the average. The     matching set.
Graduation type, on the other hand, presented a much larger
proportion (43.21%), while Death and Travel smaller ones           In order to validate the above results, we performed a ran-
(5.47% and 8.26% respectively). We believe that this dif-          dom sampling of 23 thousand posts (from the 9 million set)
ference can happen either due to the period of the year in         focusing on quality analysis. The number of posts filtered
which the data is gathered (Graduation supposedly has more         and detected are shown in Figure 4. The total of posts
posts in certain periods of the year), or even due to the type     filtered is 1,008, from which 105 have been detected as life
of life event that may contain more non life events (Travel        events. Similar to the results on the 9 million set, only about
for example, which may present many posts from marketing           10% of the filtered posts have been detected as life events.
agencies) or even less life event posts (for instance Death,       Detailed numbers, for each type of life event, are presented
where people might be more introspective).                         in the columns Filtered and Detected in Table 2.
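
The per-class proportions discussed for Table 1 can be recomputed from the Filter and Detect counts. The short sketch below only copies the counts from Table 1; the dictionary and function names are illustrative assumptions.

```python
# (Filter count, Detect count) per life event type, copied from Table 1.
TABLE_1 = {
    "Marriage": (182_096, 19_457),
    "Graduation": (25_676, 11_097),
    "Travel": (22_596, 1_868),
    "Birthday": (33_305, 3_604),
    "Birth": (48_687, 3_881),
    "Death": (35_242, 1_929),
}


def confirmation_rate(filtered: int, detected: int) -> float:
    """Percentage of filtered posts that the Detect phase confirmed."""
    return 100.0 * detected / filtered


total_filtered = sum(f for f, _ in TABLE_1.values())
total_detected = sum(d for _, d in TABLE_1.values())

for event, (filtered, detected) in TABLE_1.items():
    print(f"{event:<11}{confirmation_rate(filtered, detected):6.2f}%")
print(f"{'Overall':<11}{confirmation_rate(total_filtered, total_detected):6.2f}%")
```

The totals reproduce the last row of Table 1, and the overall rate is the roughly 12% quoted in the text.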

To evaluate the entity matching, we have built a dataset           Those 1,008 posts resulting from the Filter module have
                                                                   contains the total number of true positives, true negatives,
                                                                   false positives and false negatives. This has allowed us to
                                                                   compute the values for accuracy, precision and recall [16],
                                                                   which were at about 89%, 65% and 48%, respectively. In
                                                                   this case, a true positive consists of a post that contains a
                                                                   life event (according to the manual inspection) and is
                                                                   correctly detected by the system, a true negative is not a life
                                                                   event and is correctly ignored by the system, a false
                                                                   positive is not a life event but is detected by the system, and
                                                                   a false negative is a life event but is not detected by the
                                                                   system. As a consequence, the precision represents the pro-
                                                                   portion of detected posts that contain life events, and the
                                                                   recall the proportion of life events that have been found by
                                                                   the system. It is worth noting that there is a trade-off be-
          Figure 3: Entity matching results.                       tween precision and recall that is set according to the value
                                                                   of θ, where lower values can increase recall and higher values
                                                                   increase the precision (see Figure 5).


                                                                   Table 3: Confusion matrix of the 1008 filtered posts
                                                                   found on the 23K sampling.
                                                                                               Manual labeling
                                                                                            Life Events Non Life
                                                                                                           Events
                                                                      Life        Positive   68 (6.7%)     37 (3.7%)
                                                                      Event
                                                                      Detection Negative     74 (7.3%)     829
                                                                      System                               (82.4%)
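
The accuracy, precision and recall reported for the Detect phase follow directly from the four cells of Table 3. A minimal sketch of the computation, using the standard definitions and illustrative variable names:

```python
# Cells of Table 3 (1,008 filtered posts from the 23K sampling).
tp = 68   # life event, detected by the system
fp = 37   # not a life event, but detected by the system
fn = 74   # life event, missed by the system
tn = 829  # not a life event, correctly ignored

total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of correct decisions
precision = tp / (tp + fp)     # detected posts that really are life events
recall = tp / (tp + fn)        # life events that were found

print(f"accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%}")
# prints: accuracy=89.0% precision=64.8% recall=47.9%
```

The denominators 105 (detected posts) and 142 (manually confirmed life events) are the same totals that head the Detected and Ground-Truth columns of Table 2.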




       Figure 4: Results on the 23K sampling.


Table 2: Number of posts returned per life event
type on the 23K sampling.
   Life event Filtered     Detected     Ground-
               (%       of (%        of Truth (%
               1008)       105)         of 142)
   Marriage    162 (16.2)  8 (7.6)      7 (4.9)
   Graduation 70 (7.0)     26 (24.7)    15 (10.6)
   Travel      474 (47.4)  55 (52.4)    99 (69.7)                  Figure 5: Precision/Recall trade-off by varying θ
   Birthday    102 (10.2)  11 (10.5)    12 (8.5)                   from 0.1 to 0.9.
   Birth       107 (10.7)  4 (3.8)      7 (4.9)                    Similarly, to validate the quality of the entity matching
   Death       93 (9.3)    1 (0.95)     2 (1.4)                    algorithm, we have done a random sampling of 500 users and
   Total (% of 1008 (4.4)  105 (0.45)   142 (0.6)                  manually inspected the correctness of the matchings found.
   23K)                                                            In this case, the entity matching algorithm returned 72 users,
                                                                   being 43 found by EM, 13 by ED1 and 16 by ED2. But, as
                                                                   we mentioned, both ED1 and ED2 can return more than one
been then manually inspected in order to verify whether the        matching per user if the matching algorithm returns a value
Detect phase has assigned the correct probability or not.          above the threshold σ1 and σ2 . For a better analysis of the
The total of posts for each type of life event are listed in the   algorithm, in Table 4 and Table 5 we present the confusion
Ground-Truth column in Table 2. It can be observed that            matrices of both ED1 and ED2 considering all matches. The
our system presents numbers that are close to what was             former has found a total of 476 matches, with an accuracy of
found by the manual inspection. By comparing the manual            about 91%, precision of 10.4% and recall of 71.4%, while the
inspection with the results of the system, we have been able       latter has returned a total of 452 matches, 94% of accuracy,
to compute the confusion matrix presented in Table 3, which        precision of 50% and recall of 94%.
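
The one-to-many behavior of a thresholded matcher can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the matcher scores name pairs with a normalized Levenshtein similarity and keeps every candidate whose score reaches a threshold `sigma`. The function names, the normalization and the sample names are all hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def candidate_matches(profile_name: str, client_names: List[str],
                      sigma: float) -> List[Tuple[str, float]]:
    """Return every client whose similarity reaches sigma, best score first."""
    key = profile_name.casefold()
    scored = [(name, similarity(key, name.casefold())) for name in client_names]
    kept = [pair for pair in scored if pair[1] >= sigma]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    clients = ["Maria Silva", "Mario Silva", "Joao Souza"]  # made-up records
    for name, score in candidate_matches("maria silva", clients, sigma=0.8):
        print(f"{name}: {score:.2f}")
```

With a threshold of 0.8, the made-up profile above is paired with both "Maria Silva" and "Mario Silva". Near-duplicate names like these are one way a single user can contribute several matches, some of them false positives.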
Table 4: Confusion matrix for ED1 on 500 users.

                                   Manual labeling
                               Match         Non Match
   Entity Matching  Positive   5 (1.10%)     43 (9.0%)
   System           Negative   2 (0.4%)      426 (89.5%)

Table 5: Confusion matrix for ED2 on 500 users.

                                   Manual labeling
                               Match         Non Match
   Entity Matching  Positive   17 (3.7%)     17 (3.7%)
   System           Negative   1 (3.9%)      427 (96.1%)

5.   CONCLUSIONS
In this work we presented a system for personalized offers
based on life event detection. Once the system detects users
posting life events on a social media network, these users
are matched against an internal database of clients to decide
on the best approach to offer them a service or product. We
described a way to implement the entire system, and presented
the results of applying the system on a dataset of 9 million
posts. From this set, a total of 42 thousand life events have
been found, with a projected accuracy of 88.90% and precision
of 65%. This indicates that, on a normal day of 20 million posts
published by Brazilian users, for instance, the system would be
able to detect around 91 thousand posts a day, about 60 thousand
of them correct. Besides that, it is worth mentioning that the
system is scalable, since it has been implemented with the
MapReduce programming paradigm.

Future work can follow many different and complementary
paths. Accuracy is important and could be improved by
evaluating other types of classifiers and features, as well as
by increasing the training data. The addition and evaluation of
other types of life events could be important to better
understand the way people behave on the SMNs. Furthermore,
the adaptation to a real-time streaming platform such as
IBM InfoSphere Streams would allow the system to react very
quickly (in near real time) once the users post life events.

6.   REFERENCES
 [1] Atefeh, F., and Khreich, W. A survey of techniques for
     event detection in twitter. Computational Intelligence
     (2013), n/a–n/a.
 [2] Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and
     Fienberg, S. Adaptive name matching in information
     integration. IEEE Intelligent Systems 18, 5 (Sept. 2003),
     16–23.
 [3] Cohen, W. W., Ravikumar, P., and Fienberg, S. E. A
     comparison of string distance metrics for name-matching
     tasks. pp. 73–78.
 [4] De Choudhury, M., Counts, S., and Horvitz, E. Major life
     changes and behavioral markers in social media: Case of
     childbirth. In Proceedings of the 2013 Conference on
     Computer Supported Cooperative Work (New York, NY, USA,
     2013), CSCW ’13, ACM, pp. 1431–1442.
 [5] Ehrlich, K., and Shami, N. S. Microblogging inside and
     outside the workplace. In ICWSM (2010).
 [6] Eugenio, B. D., Green, N., and Subba, R. Detecting life
     events in feeds from twitter. 2012 IEEE Sixth International
     Conference on Semantic Computing 0 (2013), 274–277.
 [7] Felt, A. P., and Wagner, D. Phishing on mobile devices.
     In W2SP (2011).
 [8] GNIP. GNIP, 2014. [Online; accessed 28-May-2014].
 [9] IBM. IBM InfoSphere BigInsights, 2014. [Online; accessed
     28-May-2014].
[10] Kwak, H., Lee, C., Park, H., and Moon, S. What is twitter,
     a social network or a news media? In Proceedings of the
     19th international conference on World wide web (New York,
     NY, USA, 2010), WWW ’10, ACM, pp. 591–600.
[11] Levenshtein, V. Binary Codes Capable of Correcting
     Deletions, Insertions and Reversals. Soviet Physics
     Doklady 10 (1966), 707.
[12] Lin, J., and Dyer, C. Data-Intensive Text Processing with
     MapReduce. Claypool Publishers, 2010.
[13] Liu, F., Weng, F., and Jiang, X. A broad-coverage
     normalization system for social media language. In
     Proceedings of the 50th Annual Meeting of the Association
     for Computational Linguistics: Long Papers - Volume 1
     (Stroudsburg, PA, USA, 2012), ACL ’12, Association for
     Computational Linguistics, pp. 1035–1044.
[14] Peled, O., Fire, M., Rokach, L., and Elovici, Y. Entity
     matching in online social networks. In Social Computing
     (SocialCom), 2013 International Conference on (Sept 2013),
     pp. 339–344.
[15] Raad, E., Chbeir, R., and Dipanda, A. User profile matching
     in social networks. In Network-Based Information Systems
     (NBiS), 2010 13th International Conference on (Sept 2010),
     pp. 297–304.
[16] Sokolova, M., and Lapalme, G. A systematic analysis of
     performance measures for classification tasks. Information
     Processing and Management 45 (2009), 427–437.
[17] Twitter. Using the Twitter Search API, 2014. [Online;
     accessed 28-May-2014].
[18] Vosecky, J., Hong, D., and Shen, V. User identification
     across multiple social networks. In Networked Digital
     Technologies, 2009. NDT ’09. First International Conference
     on (July 2009), pp. 360–365.
[19] Weiss, S. M., Indurkhya, N., and Zhang, T. Fundamentals of
     Predictive Text Mining. Springer London, 2010.
[20] Winkler, W. E. String comparator metrics and enhanced
     decision rules in the fellegi-sunter model of record
     linkage. In Proceedings of the Section on Survey Research
     Methods (American Statistical Association) (1990),
     pp. 354–359.