<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Personalized Offers by Means of Life Event Detection on Social Media and Entity Matching</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Claudio Pinhanez IBM Research -</institution>
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Paulo Cavalin IBM Research -</institution>
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present a system for personalized offers based on two main components: a) a hybrid method, combining rules and machine learning, to find users that post life events on social media networks; and b) an entity matching algorithm to find possible relations between the detected social media users and current clients. The main assumption is that, if one can detect the life events of these users, a personalized offer can be made to them even before they look for a product or service. The proposed solution was implemented on the IBM InfoSphere BigInsights platform to take advantage of the MapReduce programming framework for large-scale capability, and was tested on a dataset containing 9 million posts from Twitter. In this set, 42K life event posts sent by 19K different users were detected, with an overall accuracy of 89% and a precision of about 65% in detecting life events. The entity matching of these 19K social media users against an internal database of 1.6M users returned 983 users, with an accuracy of about 90%.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Media Networks</kwd>
        <kwd>Life Event Detection</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Entity Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Social Media Networks (SMNs), such as Twitter and
Facebook, engage thousands of people that post, on a daily
basis, a huge amount of content in the form of text, images,
videos, etc. [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ]. Often the content can be intimately
related to the person who publishes it, in such a way that it
can expose behavioral traits or events that are happening
in the individual's life. As a consequence, the proper
exploration of this type of content not only can be a way to
better understand the users on SMNs, but can also
support many applications that require adequate user profiling,
for instance credit risk analysis, marketing campaigns, and
personalized product and/or service offers.
      </p>
      <p>
        One way to find potential customers for services or
products is by detecting life events from public user activities
on SMNs, especially microblogs. Generally, a life event
can be defined as something important that happened, is
happening, or will happen in a particular individual's
life, such as getting married, graduating, having a baby,
buying a house, and so forth. That is, if a life event is
properly detected, a product or service can be offered to
someone even before she looks for it, anticipating her needs.
For instance, if a person posts on the SMN that her marriage
will take place in a few days (or weeks or months), a loan
or insurance (for the honeymoon trip, for example) can
be offered to her in advance. Furthermore, as stated in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
marketers know that people mostly shop based on habits,
but that among the most likely times to break those habits
is when a major life event happens.
      </p>
      <p>
        For this reason, this work presents a system
that can detect life events from textual posts on SMNs,
and can match the corresponding users with an existing
database, i.e. entity matching with current clients, using
basic information such as the name and the location available
on the SMN. Entity matching is important to understand
whether a given user of an SMN is already a customer or not,
and to adapt the way the person is approached.
Both life event detection and entity matching are complex
tasks that are the subject of research in fields such
as artificial intelligence, machine learning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], natural
language processing and large scale analysis of unstructured
data (popularly known as Big Data) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Performing
natural language processing on microblog posts presents
several challenges, such as dealing with the short and
asynchronous nature of the messages, which makes it difficult to
extract contextual information, and dealing with a highly
unnormalized vocabulary due to the frequent use of slang,
acronyms, abbreviations, and informal language, often with
misspellings [
        <xref ref-type="bibr" rid="ref1 ref13 ref7">1, 7, 13</xref>
        ]. Nonetheless, one study that
supports the possibility of detecting life events from textual
posts has been presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In that work, the authors
conducted a study on the behavior of mothers during pregnancy,
and observed that these mothers can be distinguished
by linguistic changes captured by shifts in a relatively small
number of words in their social media posts.
      </p>
      <p>
        In light of this, in this work we describe and evaluate our
proposed solution to the life event detection and
entity matching problems. For the first task, we propose a
hybrid system combining rules and machine learning (ML).
In contrast to the system specifically focused on life event
detection presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (the only one for this problem to
the best of our knowledge), which uses only ML, our system
allows the life event classes to be handled independently.
The rule-based phase acts as a mechanism to filter out most
posts that do not contain life events, since all posts
not matching the desired rules are eliminated. Then,
binary classifiers (one for each type of life event) are applied
to validate the possible life events. Greater detail is
provided in Section 3.1. For entity matching, a combination of
string distance functions is used to compare the names and
locations of the users. This method is described in
Section 3.2.
      </p>
      <p>
        The entire system has been implemented on the IBM
InfoSphere BigInsights platform [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], to take advantage of the
MapReduce programming paradigm for large-scale data
processing. A dataset containing 9 million posts in Portuguese,
extracted from Twitter, has been used to evaluate the
system. To evaluate the entity matching, a database with 1.6
million users has been constructed. More details about the
experiments are presented in Section 4.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND AND RELATED WORK</title>
      <p>
        Since the work proposed in this paper is a hybrid solution
in which we integrate an ML-based classifier with an Entity
Matching solution, the background and related work are
presented separately for each, as follows.
Life Event Detection: as already mentioned, a life event
can be defined as something important regarding the users'
lives in SMNs. It is important to differentiate it from
related work that uses the expression "event detection" to
refer to the problem of detecting unexpected events exposed
by several users in SMNs, such as a rumor, a trend, or an emergent
topic. In the case of the work proposed in this paper,
detection means classifying a short post, like Twitter's or
Facebook's status messages, into one of the life event categories,
which could be considered, for instance, topics. Therefore,
as related work, any approach to topic classification of short
messages could be considered, like [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is the most
related to our work. Regarding ML-based solutions, other
supervised or unsupervised methods for topic classification are
also related, although they have so far been applied to long
documents rather than short messages. Regarding semantic-rule-based
solutions, AQL rules combined with dictionaries are known
approaches for template-based topic classification.
Ontologies have also been applied to long documents.
Entity Matching: in SMNs there are two problems for which
Entity Matching solutions can be found. The first is, given a set
containing user features on SMNs, like user information and
activities, and another set containing real people's
information, to match the users across both sets.
The second is, given two sets containing user
features on two different SMNs, to find
corresponding users, i.e., the largest possible number of social
profiles that refer to the same person across both social
networks. The latter can also be called the Entity Resolution
(ER) problem, and in the past few years some work has
been proposed to solve it. For instance, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
proposed supervised learning techniques and extracted features
to build different classifiers, which were then trained and
used to rank the probability that two user profiles from two
different online social networks (OSNs) belong to the same individual.
      </p>
      <p>
        The former problem can be considered a subset of the latter
if we ignore the fact that the second set contains real people's
information rather than SMN profiles. Generally, as
summarized by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], there are two approaches for handling
this: (i) syntactic-based similarity approaches, which provide
exact or approximate lexicographical matching of two values;
and (ii) semantic-based similarity approaches, which
measure how semantically similar two lexicographically
different values are. For instance, the Foaf-o-matic<sup>1</sup> and OKKAM<sup>2</sup>
projects aim at social profile integration by means of formal
FOAF (Friend-of-a-friend) semantics.
      </p>
      <p>
        Regarding syntactic-based similarity approaches, we
summarize here the ones most used for URIs, numeric
attributes and, in the context of SMNs, users' full names.
Levenshtein or Edit Distance [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is defined as the
smallest number of edit operations (insertions, deletions, and
substitutions) required to change one string into another. In addition,
Jaro is an algorithm commonly used for name matching in
data linkage systems. A similarity measure is calculated
using the number of common characters (i.e., same characters
that are within half the length of the longer string) and the
number of transpositions. Winkler (or Jaro-Winkler)
improves upon Jaro's algorithm by applying ideas based on
empirical studies which found that fewer errors typically
occur at the beginning of names [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
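      <p>As a rough illustration of the edit distance described above, the following Python sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. The function names and the length-normalized similarity are assumptions made for this example only; they are not taken from the system described in this paper.</p>
      <preformat>
```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: maps the distance to a score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("Jooa Paulo", "Joao Paulo"))
    print(normalized_similarity("Jooa Paulo", "Joao Paulo"))
```
      </preformat>
      <p>Note that plain Levenshtein distance counts a swap of two adjacent characters as two operations, which is one reason why transposition-aware measures such as Jaro are often preferred for names.</p>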
      <p>
        Another approach is N-gram name similarity, in which
n-grams are substrings of length n, and the n-gram
similarity between two strings is calculated by counting the
number of n-grams in common (i.e., n-grams contained in both
strings) and dividing by either the number of n-grams in the
shorter string (called the Overlap coefficient), or the number of
n-grams in the longer string (called Jaccard similarity), or
the average number of n-grams in both strings. 2-grams and
3-grams have been used to calculate the similarity between
two users' full names. Finally, the VMN name similarity
approach proposed by [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] was designed for full and partial
matches of names consisting of one or more words. VMN
supports the case of swapped names and the cases of partial
matches.
      </p>
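      <p>The n-gram similarity variants described above can be sketched in Python as follows. This is a minimal illustration, assuming character n-grams without padding and counting repeated n-grams at most as often as they occur in both strings; the function and parameter names are hypothetical, and the three denominators simply follow the definitions given in the text.</p>
      <preformat>
```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> list:
    """All contiguous substrings of length n, in order (duplicates kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 2, denominator: str = "average") -> float:
    """Common n-grams divided by the n-gram count of the shorter string,
    the longer string, or the average of both, as described in the text."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    # Count n-grams in common, consuming each occurrence in b only once.
    remaining_b = Counter(grams_b)
    common = 0
    for gram in grams_a:
        if remaining_b[gram] > 0:
            remaining_b[gram] -= 1
            common += 1
    sizes = {
        "shorter": min(len(grams_a), len(grams_b)),
        "longer": max(len(grams_a), len(grams_b)),
        "average": (len(grams_a) + len(grams_b)) / 2,
    }
    return common / sizes[denominator]


if __name__ == "__main__":
    for kind in ("shorter", "longer", "average"):
        print(kind, ngram_similarity("mari", "mariana", n=2, denominator=kind))
```
      </preformat>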
      <p>In this paper, we use two versions of ED preceded by Jaro's
similarity as described in the next section.</p>
    </sec>
    <sec id="sec-3">
      <title>3. METHODOLOGY</title>
      <p>In this section we describe in detail both the life
event detection system and the entity matching system.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Hybrid Life Event Detection System</title>
      <p>Given a social media network, the main goal of the life event detection system
is to return a list of users that posted life
events within a given time window. This task involves a
crawler to gather data, and a system to search for life events
in the data. Note that not only is accuracy important in this
case, in order to find the largest list of users with high precision,
but performance is also important, since the system is likely
to face a large amount of data. In addition, in a production
environment, the system must allow for easy fine-tuning,
addition and removal of life event classes.</p>
      <p>To cope with the aforementioned issues, we propose a hybrid
life event detection approach, combining both rules and
machine learning (ML). Such a system, depicted in Figure 1, is
composed of three consecutive phases or modules,
namely Ingest, Filter, and Detect. The first phase,
Ingest, captures a database of posts to be searched
for life events. This is done by considering a set of words
that can possibly relate to any life event of the system. We
assume that the larger this dataset, the larger the set of
users that will be returned. Once the set of posts has been
fully crawled, the Filter module selects the posts
that are more likely to contain life events. To do so, it applies
a set of simple rules for each type of life event, such as words
and combinations of words (more elaborate than those of Ingest),
and the posts that match these rules are marked with the corresponding
possible life events.</p>
      <p>Footnotes: 1. http://www.foaf-o-matic.org/ 2. http://www.okkam.org/</p>
      <p>Although these rules can indicate a possible life event, a large
portion of these messages can be false candidates. For this
reason, the Detect phase is then carried out to validate
the possible life events with their corresponding
probabilities. For each post found in the Filter phase, we apply the
ML classifier of each corresponding possible life event and
compute the probability that the post contains the given
life event. With this information, all posts with a life event
probability above the threshold are selected, and the users of
the corresponding posts are generated as the output of the
system.</p>
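      <p>The Filter and Detect phases described above can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: the rule sets, the example posts and the toy classifiers are hypothetical, and each classifier is modeled as a plain function returning a probability, standing in for the trained ML classifiers used by the actual system.</p>
      <preformat>
```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical Filter rules: each life event maps to word combinations; a post
# is a candidate when it contains every word of at least one combination.
FILTER_RULES: Dict[str, List[Set[str]]] = {
    "marriage": [{"my", "wedding"}, {"getting", "married"}],
    "travel": [{"my", "trip"}],
}


def filter_post(text: str) -> List[str]:
    """Filter phase: return the possible life events matched by the rules."""
    words = set(text.lower().split())
    return [
        event
        for event, combinations in FILTER_RULES.items()
        if any(combination.issubset(words) for combination in combinations)
    ]


def detect(
    posts: Iterable[Tuple[str, str]],
    classifiers: Dict[str, Callable[[str], float]],
    threshold: float = 0.5,
) -> Set[Tuple[str, str]]:
    """Detect phase: keep (user, event) pairs whose probability exceeds the threshold."""
    detected = set()
    for user, text in posts:
        for event in filter_post(text):
            if classifiers[event](text) > threshold:
                detected.add((user, event))
    return detected


# Toy stand-ins for per-event binary classifiers (hypothetical behavior).
TOY_CLASSIFIERS: Dict[str, Callable[[str], float]] = {
    "marriage": lambda text: 0.9 if "next month" in text else 0.1,
    "travel": lambda text: 0.2,
}

EXAMPLE_POSTS = [
    ("ana", "my wedding is next month"),
    ("bob", "best wedding cake recipes"),
    ("carl", "book my trip deals from the agency"),
]

if __name__ == "__main__":
    print(detect(EXAMPLE_POSTS, TOY_CLASSIFIERS))
```
      </preformat>
      <p>Because each life event has its own rules and its own classifier, adding or removing an event type in this sketch only touches the corresponding dictionary entries.</p>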
      <p>It is worth noting that ML is currently well known to
produce the best solutions for dealing with ambiguous and noisy
texts such as microblog posts. However, the proposed
hybrid solution takes advantage of the rule-based filtering
to reduce the search space for the ML classifier, which can
reduce both the number of errors and the processing time.
Moreover, treating types of life events independently makes
fine-tuning, addition and removal of life event
classes easy. For instance, to add a new type of life event, one
needs to append the corresponding keywords for the
Ingest phase, the rules for Filter, and a binary classifier in
the Detect phase. This can be done with no impact on the
accuracy of existing life events.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Entity Matching System</title>
      <p>Given the output of the life event detection system, i.e. users
(aka entities) that posted life events on social media, the
main goal of the entity matching system is to find the
corresponding people in a database of real names. To achieve
this task accurately, the system must use as much
information as possible to decrease the level of uncertainty.
Dealing with users found on SMNs, though, is very
challenging. First of all, on most SMNs the basic information about
the user (e.g. name, location, age) is very limited (on
Twitter only the name and location of the user are available). In
addition, such personal information may be missing or not
relevant, since filling it in may not be mandatory, and the
content filled in is not verified. Besides that, even when the
information is seriously provided by the user, other difficulty factors
can appear, such as the use of simplified names (Claudio
Pinhanez instead of Claudio Santos Pinhanez), the use of
social media pen-names (@cinhanez instead of Claudio
Santos Pinhanez), or the use of nicknames (Darth Vader instead
of Claudio Santos Pinhanez).</p>
      <p>To deal with some of the aforementioned difficulty factors,
for this work we have developed a system to match names
and locations of users using three different string distance
functions:
        <list list-type="order">
          <list-item><p>Exact matching (EM): a match is found if all the names
of an SMN user are identical to those of a client.</p></list-item>
          <list-item><p>Entity Distance 1 (ED1): designed to consider
misspellings and transpositions between adjacent
characters as a match. For instance, the user “Jooa Paulo”
matches the client “Joao Paulo”, and the user
“Carolina” matches “Carolina”. In this case, the ED1 threshold
is used to define a match only if the similarity value
is above this threshold.</p></list-item>
          <list-item><p>Entity Distance 2 (ED2): designed to match
abbreviations and some nicknames. For example, the user
“Joseph S.” matches the client “Joseph Salem”; the
user “Fabinho” matches the client “Fabio”, and “Mari”
matches “Mariana”. Similarly to ED1, the ED2 threshold
is used to define a match.</p></list-item>
        </list>
      </p>
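      <p>The execution of the three aforementioned matching algorithms
results in three distinct sets of users, denoted EM, ED1
and ED2. The resulting set of users, All, corresponds to
the union of those individual sets. That is,
<inline-formula><tex-math>\mathit{All} = \mathit{EM} \cup \mathit{ED1} \cup \mathit{ED2}</tex-math></inline-formula>,
where the intersection
<inline-formula><tex-math>\mathit{EM} \cap \mathit{ED1} \cap \mathit{ED2}</tex-math></inline-formula>
may be non-empty or empty, depending on the data.</p>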
      <p>
        It is worth mentioning that Jaro-Winkler similarity
filtering [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is used prior to calling ED1 and ED2, to
eliminate weak matches such as 'Maria' and 'Maria das Gracas
Silva'. Furthermore, ED1 and ED2 may return more than
one match for the same user, whenever the result is above
the given threshold. In this work, only the match with
the highest value is considered.
      </p>
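      <p>The rule of keeping only the highest-scoring match above a threshold can be sketched in Python as follows. Since the ED1 and ED2 functions are not specified in detail here, the sketch uses the ratio of Python's difflib.SequenceMatcher purely as a stand-in similarity; the function names and the example client list are hypothetical.</p>
      <preformat>
```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; NOT the ED1 or ED2 function of the paper."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(
    user_name: str,
    client_names: Iterable[str],
    threshold: float = 0.95,
) -> Optional[Tuple[str, float]]:
    """Return the client with the highest score above the threshold, or None."""
    best = None
    for client in client_names:
        score = similarity(user_name, client)
        # Several clients may exceed the threshold; only the best one is kept.
        if score > threshold and (best is None or score > best[1]):
            best = (client, score)
    return best


if __name__ == "__main__":
    clients = ["Joao Paulo", "Joana Paula", "Maria das Gracas Silva"]
    print(best_match("Joao Paulo", clients))
    print(best_match("Jooa Paulo", clients, threshold=0.8))
    print(best_match("Maria", clients))
```
      </preformat>
      <p>With a strict threshold such as 0.95, a weak candidate like 'Maria' against 'Maria das Gracas Silva' is rejected in this sketch, which mirrors the purpose of the pre-filtering step described above.</p>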
    </sec>
    <sec id="sec-6">
      <title>4. EXPERIMENTS</title>
      <p>
        In this section we present the results of applying the
proposed system to a dataset containing 9 million posts
from Twitter, produced by about 1.4
million users. This data has been gathered by means of the
GNIP social media data provider [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        These experiments have two main purposes. First,
we aim at evaluating the numbers related to applying the
system to this 9-million-post dataset, i.e. how many posts and
users are returned by the system. Second, we
focus on a quality analysis to validate those numbers by
means of a manual inspection of samples of this dataset.
The life event detection system has been implemented for
six types of life events: Marriage, Graduation, Travel,
Birthday, Birth, and Death. For each one, a training dataset of
about 2 thousand samples has been manually labeled as
either life event or non-life event, and a distinct classifier has
been trained. The training data has been obtained with the
Twitter Search API [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. For this work we make use of Naive
Bayes classifiers using bag-of-words features [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The main
parameters, i.e. the life event probability threshold and the ED1 and ED2
matching thresholds, have been set to 0.5, 0.95 and
0.95, respectively.
      </p>
    </sec>
    <sec id="sec-7">
      <title>4.1 Quantitative Results</title>
      <p>As mentioned, the main purpose of the first experiment is
to evaluate how many posts and users are returned after
carrying out each phase of the proposed system. The results
of applying the implemented life event detection system to the
9-million-tweet dataset are summarized in Figure 2. In this
case, the Filter phase has returned 347 thousand posts from
about 220 thousand users. Then, after going through the
Detect module, 42 thousand posts, from about 19 thousand
users, have been detected as life events. It is worth noting
the large difference in proportion from one phase
to another. The Ingest phase captures a very large dataset,
i.e. 9 million posts. Then, Filter finds that only 3.7%
of these posts can be of interest. However, the Detect phase
shows that, of these 347K posts, only 42 thousand
(0.45% of 9M or 12% of 347K) are really those that the
application is looking for. Considering that many
current search systems are rule-based, these results indicate
that our proposed system can avoid a useless search on about
88% of the posts returned, 307 thousand posts in this case.
In Table 1 we present the results of the experiment above
for each type of life event. We can observe that about 12%
of the filtered posts have been confirmed as life
events overall, but this proportion can vary according to the type
of life event. For instance, for the Marriage class, of the
182,096 posts that the filter considered as possible life events,
the machine learning algorithm detected 19,475 (10.6%) as
actual life events, which is close to the average. The
Graduation type, on the other hand, presented a much larger
proportion (43.21%), while Death and Travel presented smaller ones
(5.47% and 8.26% respectively). We believe that this
difference can happen either due to the period of the year in
which the data is gathered (Graduation supposedly has more
posts in certain periods of the year), or due to the type
of life event, which may contain more non-life events (Travel,
for example, which may present many posts from marketing
agencies) or fewer life event posts (for instance Death,
where people might be more introspective).
To evaluate the entity matching, we have built a dataset
containing 1.6 million users using publicly available data.
The users in this dataset have been matched against the
19 thousand users that have been detected as the ones that
posted life events in the 9M dataset. The results and this
process are illustrated in Figure 3. Note that we have
conducted two different experiments. The first one matches
these users by taking into account only their names, since
we consider this the minimum information we will be able
to obtain from the SMN. In this case, 983 users have been
found as probable matches. In the second experiment, where
both names and locations are considered, only 5 users have
been found. This shows that the precision of entity
matching can be increased by considering more information in this process. On
the other hand, this will also reduce the size of the resulting
matching set.</p>
      <p>
        In order to validate the above results, we performed a
random sampling of 23 thousand posts (from the 9 million set)
focusing on quality analysis. The numbers of posts filtered
and detected are shown in Figure 4. The total number of posts
filtered is 1,008, of which 105 have been detected as life
events. Similar to the results on the 9 million set, only about
10% of the filtered posts have been detected as life events.
Detailed numbers, for each type of life event, are presented
in the columns Filtered and Detected in Table 2.
Those 1,008 posts resulting from the Filter module have
then been manually inspected in order to verify whether the
Detect phase has assigned the correct probability or not.
The total number of posts for each type of life event is listed in the
Ground-Truth column in Table 2. It can be observed that
our system presents numbers that are close to what was
found by the manual inspection. By comparing the manual
inspection with the results of the system, we have been able
to compute the confusion matrix presented in Table 3, which
contains the total numbers of true positives, true negatives,
false positives and false negatives. This has allowed us to
compute the values for accuracy, precision and recall [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
which were about 89%, 65% and 48%, respectively. In
this case, a true positive is a post that contains a
life event (according to the manual inspection) and is
correctly detected by the system, a true negative is not a life
event and is correctly ignored by the system, a false
positive is not a life event but is detected by the system, and
a false negative is a life event but is not detected by the
system. As a consequence, the precision represents the
proportion of detected posts that contain life events, and the
recall the proportion of life events that have been found by
the system. It is worth noting that there is a trade-off
between precision and recall that is set according to the value
of the life event probability threshold, where lower values can
increase recall and larger values
increase precision (see Figure 5).
Similarly, to validate the quality of the entity matching
algorithm, we have taken a random sample of 500 users and
manually inspected the correctness of the matches found.
In this case, the entity matching algorithm returned 72 users,
of which 43 were found by EM, 13 by ED1 and 16 by ED2. But, as
mentioned, both ED1 and ED2 can return more than one
match per user if the matching algorithm returns a value
above the respective ED1 and ED2 thresholds. For a better analysis of the
algorithm, in Table 4 and Table 5 we present the confusion
matrices of both ED1 and ED2 considering all matches. The
former has found a total of 476 matches, with an accuracy of
about 91%, precision of 10.4% and recall of 71.4%, while the
latter has returned a total of 452 matches, with an accuracy of 94%,
precision of 50% and recall of 94%.
      </p>
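      <p>For reference, the accuracy, precision and recall definitions given above can be computed from a confusion matrix as in the following Python sketch. The counts in the usage example are purely illustrative placeholders and are not the values of Table 3, Table 4 or Table 5.</p>
      <preformat>
```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    detected = tp + fp        # posts the system flagged as life events
    actual = tp + fn          # posts that really contain life events
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / detected if detected else 0.0,
        "recall": tp / actual if actual else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(classification_metrics(tp=8, tn=80, fp=2, fn=10))
```
      </preformat>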
    </sec>
    <sec id="sec-8">
      <title>5. CONCLUSIONS</title>
      <p>In this work we presented a system for personalized offers
based on life event detection. Once the system detects users
posting life events on a social media network, these users
are matched against an internal database of clients to
decide on the best approach to offer them a service or
product. We described a way to implement the entire
system, and presented the results of applying the system to
a dataset of 9 million posts. From this set, a total of 42
thousand life events have been found, with a projected
accuracy of 88.90% and precision of 65%. This indicates that,
on a normal day with 20 million posts published by Brazilian
users, for instance, the system would be able to detect
around 91 thousand posts a day, about 60 thousand
of them correct. Besides that, it is worth mentioning that
the system is scalable since it has been implemented with the
MapReduce programming paradigm.</p>
      <p>Future work can follow many different and complementary
paths. Accuracy is important and could be improved by
evaluating other types of classifiers and features, as well as
by increasing the training data. The addition and evaluation of
other types of life events could be important to better
understand the way people behave on SMNs. Furthermore,
adaptation to a real-time streaming platform such as
IBM InfoSphere Streams would allow the system to react very
quickly (in near real time) once users post life events.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Atefeh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Khreich</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <article-title>A survey of techniques for event detection in twitter</article-title>
          .
          <source>Computational Intelligence</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bilenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravikumar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fienberg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Adaptive name matching in information integration</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>18</volume>
          ,
          <issue>5</issue>
          (Sept.
          <year>2003</year>
          ),
          <fpage>16</fpage>
          –
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravikumar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fienberg</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <article-title>A comparison of string distance metrics for name-matching tasks</article-title>
          . pp.
          <fpage>73</fpage>
          –
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>De Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Counts</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Horvitz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>Major life changes and behavioral markers in social media: Case of childbirth</article-title>
          .
          <source>In Proceedings of the 2013 Conference on Computer Supported Cooperative Work</source>
          (New York, NY, USA,
          <year>2013</year>
          ),
          <source>CSCW '13</source>
          , ACM, pp.
          <fpage>1431</fpage>
          –
          <lpage>1442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ehrlich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Shami</surname>
            ,
            <given-names>N. S.</given-names>
          </string-name>
          <article-title>Microblogging inside and outside the workplace</article-title>
          .
          <source>In ICWSM</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Eugenio</surname>
            ,
            <given-names>B. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Subba</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>Detecting life events in feeds from Twitter</article-title>
          .
          <source>2012 IEEE Sixth International Conference on Semantic Computing</source>
          <volume>0</volume>
          (
          <year>2013</year>
          ),
          <fpage>274</fpage>
          –
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Felt</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Phishing on mobile devices</article-title>
          . In W2SP (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <collab>GNIP</collab>
          .
          <source>GNIP</source>
          ,
          <year>2014</year>
          . [Online; accessed 28-May-2014].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>IBM</collab>
          .
          <source>IBM InfoSphere BigInsights</source>
          ,
          <year>2014</year>
          . [Online; accessed 28-May-2014].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Moon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>What is Twitter, a social network or a news media?</article-title>
          <source>In Proceedings of the 19th International Conference on World Wide Web</source>
          (New York, NY, USA,
          <year>2010</year>
          ),
          <source>WWW '10</source>
          , ACM, pp.
          <fpage>591</fpage>
          –
          <lpage>600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Binary Codes Capable of Correcting Deletions, Insertions and Reversals</article-title>
          .
          <source>Soviet Physics Doklady</source>
          <volume>10</volume>
          (
          <year>1966</year>
          ),
          <fpage>707</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Data-Intensive Text Processing with MapReduce</article-title>
          . Claypool Publishers,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <article-title>A broad-coverage normalization system for social media language</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1</source>
          (Stroudsburg, PA, USA,
          <year>2012</year>
          ), ACL '12, Association for Computational Linguistics, pp.
          <fpage>1035</fpage>
          –
          <lpage>1044</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Peled</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Elovici</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Entity matching in online social networks</article-title>
          .
          <source>In Social Computing (SocialCom), 2013 International Conference on</source>
          (Sept.
          <year>2013</year>
          ), pp.
          <fpage>339</fpage>
          –
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Raad</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chbeir</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dipanda</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>User profile matching in social networks</article-title>
          .
          <source>In Network-Based Information Systems (NBiS), 2010 13th International Conference on</source>
          (Sept.
          <year>2010</year>
          ), pp.
          <fpage>297</fpage>
          –
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Sokolova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lapalme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>A systematic analysis of performance measures for classification tasks</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>45</volume>
          (
          <year>2009</year>
          ),
          <fpage>427</fpage>
          –
          <lpage>437</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <collab>Twitter</collab>
          .
          <source>Using the Twitter Search API</source>
          ,
          <year>2014</year>
          . [Online; accessed 28-May-2014].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Vosecky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>User identification across multiple social networks</article-title>
          .
          <source>In Networked Digital Technologies, 2009. NDT '09. First International Conference on</source>
          (July
          <year>2009</year>
          ), pp.
          <fpage>360</fpage>
          –
          <lpage>365</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Indurkhya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <source>Fundamentals of Predictive Text Mining</source>
          . Springer London,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          <article-title>String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage</article-title>
          .
          <source>In Proceedings of the Section on Survey Research Methods (American Statistical Association)</source>
          (
          <year>1990</year>
          ), pp.
          <fpage>354</fpage>
          –
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>