=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards
Personalized Offers by Means of Life Event Detection on Social Media
and Entity Matching
|pdfUrl=https://ceur-ws.org/Vol-1210/SP2014_06.pdf
|volume=Vol-1210
|dblpUrl=https://dblp.org/rec/conf/ht/CavalinGP14
}}
==Towards Personalized Offers by Means of Life Event Detection on Social Media and Entity Matching==
Paulo Cavalin (pcavalin@br.ibm.com), Maíra Gatti (mairacg@br.ibm.com), Claudio Pinhanez (csantosp@br.ibm.com)

IBM Research - Brazil

===ABSTRACT===
In this paper we present a system for personalized offers based on two main components: a) a hybrid method, combining rules and machine learning, to find users that post life events on social media networks; and b) an entity matching algorithm to find possible relations between the detected social media users and current clients. The main assumption is that, if one can detect the life events of these users, a personalized offer can be made to them even before they look for a product or service. The proposed solution was implemented on the IBM InfoSphere BigInsights platform to take advantage of the MapReduce programming framework for large scale capability, and was tested on a dataset containing 9 million posts from Twitter. In this set, 42K life event posts sent by 19K different users were detected, with an overall accuracy of 89% and a precision of about 65% in detecting life events. The entity matching of these 19K social media users against an internal database of 1.6M users returned 983 users, with accuracy of about 90%.

===Keywords===
Social Media Networks, Life Event Detection, Natural Language Processing, Machine Learning, Entity Matching

===1. INTRODUCTION===
Social Media Networks (SMN), such as Twitter and Facebook, engage thousands of people that post, on a daily basis, a huge amount of content represented by texts, images, videos, etc. [5, 10]. Often the content can be intimately related to the person that publishes it, in such a way that it can expose behavioral traits or events that are happening in the individual's life. As a consequence, the proper exploration of this type of content not only can be a way to better understand the users on SMNs, but also can leverage many applications that require adequate user profiling, for instance credit risk analysis, marketing campaigns, and personalized product and/or service offers.

One way to find potential customers for services or products is by detecting life events from public user activities on SMNs, in particular microblogs. Generally, a life event can be defined as something important that happened, is happening, or will be happening in a particular individual's life, such as getting married, graduating, having a baby, buying a house, and so forth. That is, if a life event is properly detected, a product or service can be offered to someone even before she looks for it, anticipating her needs. For instance, if a person posts on the SMN that her marriage will be happening in a few days (or weeks or months), a loan or an insurance (for the honeymoon trip, for example) can be offered to her in advance. Furthermore, as stated in [6], marketers know that people mostly shop based on habits, but that among the most likely times to break those habits is when a major life event happens.

For this reason, this work focuses on presenting a system that can detect life events from textual posts on SMNs, and can match the corresponding users with an existing database, i.e. entity matching with current clients, using basic information such as the name and the location available on the SMN. Entity matching is important to understand whether a given user of an SMN is already a customer or not, and to adapt the way the person can be approached.

Both life event detection and entity matching are complex tasks which are the subject of research in fields such as artificial intelligence, machine learning [6], natural language processing and large scale analysis of unstructured data (popularly known as Big Data) [12]. Performing natural language processing on microblog posts presents several challenges, such as dealing with the short and asynchronous nature of the messages, which makes it difficult to extract contextual information, and dealing with a very unnormalized vocabulary due to the frequent use of slang, acronyms, abbreviations, and informal language, often with misspellings [1, 7, 13]. Nonetheless, one study that supports the possibility of detecting life events from textual posts has been presented in [4]. In that work, the authors conducted a study on the behavior of mothers during pregnancy, and they observed that these mothers can be distinguished by linguistic changes captured by shifts in a relatively small number of words in their social media posts.

In the light of this, in this work we describe and evaluate our proposed solution to tackle the life event detection problem and the entity matching. For the first task, we propose a hybrid system combining rules and machine learning (ML). In contrast to the system specifically focused on life event detection presented in [6] (the only one for this problem to the best of our knowledge), which uses only ML, our system allows for dealing with the life event classes independently. The rule-based phase acts as a mechanism to filter out most posts that do not contain life events, since all posts not matching the desired rules are eliminated. Then, binary classifiers (one for each type of life event) are applied to validate the possible life events. Greater detail is provided in Section 3.1. For entity matching, a combination of string distance functions is used to compare the names and locations of the users. This method is described in Section 3.2.

The entire system has been implemented on the IBM InfoSphere BigInsights platform [9], to take advantage of the MapReduce programming paradigm for large scale data processing. A dataset containing 9 million posts in Portuguese, extracted from Twitter, has been used to evaluate the system. To evaluate the entity matching, a database with 1.6 million users has been constructed. More details about the experiments are presented in Section 4.

===2. BACKGROUND AND RELATED WORK===
Since the work proposed in this paper is a hybrid solution in which we integrate an ML-based classifier with an Entity Matching solution, the background and related work are presented separately for each, as follows.

Life Event Detection: as already mentioned, a life event can be defined as something important regarding the users' lives in SMNs. It is important to differentiate it from related work which uses the expression "event detection" to refer to the problem of detecting an unexpected event exposed by several users in SMNs, like a rumor, a trend, or an emergent topic. In the case of the work proposed in this paper, detection means classifying a short post, like Twitter's or Facebook's status messages, into one of the life event categories, which could be considered, for instance, topics. Therefore, as related work, any approach to topic classification of short messages could be considered, like [6], which is the most related to our work. Regarding ML-based solutions, other supervised or unsupervised methods for topic classification are also related, although they have so far been used for long documents rather than short messages. Regarding semantic-rule-based solutions, AQL rules combined with dictionaries are known approaches for topic classification with the usage of templates. Ontologies have also been applied to long documents.

Entity Matching: in SMNs there are two problems one can find Entity Matching solutions for. The first is, given a set containing user features on SMNs, like user information and activities, and another set containing real people information, to match the users within both sets. The second is, given two sets containing user features on two different SMNs, to find corresponding users, i.e., the largest possible number of social profiles that refer to the same person across both social networks. The latter can also be called the Entity Resolution (ER) problem, and in the past few years some work has been proposed to solve it. For instance, [14] proposed supervised learning techniques and extracted features to build different classifiers, which were then trained and used to rank the probability that two user profiles from two different OSNs belong to the same individual.

The former problem can be considered a subset of the latter if we ignore the fact that the second set contains real people information rather than SMN profiles. Generally, as summarized by [15], there are two approaches for handling this: (i) syntactic-based similarity approaches, providing exact or approximate lexicographical matching of two values; and (ii) semantic-based similarity approaches, used to measure how two values, lexicographically different, are semantically similar. For instance, the Foaf-o-matic (http://www.foaf-o-matic.org/) and OKKAM (http://www.okkam.org/) projects aim at social profile integration by means of formal FOAF (Friend-of-a-friend) semantics.

Regarding the syntactic-based similarity approach, we summarize here the measures most used for URIs, numeric-based attributes and, in the context of SMNs, two users' full names. Levenshtein or Edit Distance [11] is defined to be the smallest number of edit operations (inserts, deletes, and substitutions) required to change one string into another. In addition, Jaro is an algorithm commonly used for name matching in data linkage systems. A similarity measure is calculated using the number of common characters (i.e., same characters that are within half the length of the longer string) and the number of transpositions. Winkler (or Jaro-Winkler) improves upon Jaro's algorithm by applying ideas based on empirical studies which found that fewer errors typically occur at the beginning of names [3][2].

Another approach is the N-Gram name similarity, in which N-grams are sub-strings of length n and an n-gram similarity between two strings is calculated by counting the number of n-grams in common (i.e., n-grams contained in both strings) and dividing by either the number of n-grams in the shorter string (called Overlap coefficient), or the number of n-grams in the longer string (called Jaccard similarity), or the average number of n-grams in both strings. 2-grams and 3-grams have been used to calculate the similarity between two users' full names. Finally, the VMN name similarity approach proposed by [18] was designed for full and partial matches of names consisting of one or more words. VMN supports the case of swapped names and the cases of partial matches.

In this paper, we use two versions of ED preceded by Jaro's similarity, as described in the next section.

===3. METHODOLOGY===
In this section we describe in detail both the life event detection system and the entity matching system.

====3.1 Hybrid Life Event Detection System====
Given a social media network, the main goal of the life event detection system is to return a list of users that posted life events within a given time window. This task involves a crawler to gather data, and a system to search for life events in the data. Note that not only is accuracy important in this case, to find the largest list of users with a high precision, but performance is also important, since the system is likely to face a large amount of data. In addition, in a production environment, the system must allow for easy fine-tuning, addition and removal of life event classes.

Figure 1: Hybrid Life Event Detection System.

To cope with the aforementioned issues, we propose a hybrid life event detection approach, combining both rules and machine learning (ML). Such a system, depicted in Figure 1, is basically composed of three subsequent phases or modules, namely Ingest, Filter, and Detect. The first phase, i.e. Ingest, captures a database of posts to be used for the search for life events. This is done by considering a set of words that can possibly relate to all life events of the system. We assume that the larger this dataset, the larger the set of users that will be returned. Once the set of posts has been completely crawled, the Filter module selects the set of posts that is more likely to contain life events. That is, by considering a set of simple rules such as words and combinations of words (more elaborate than those of Ingest, and in this case one set of rules for each type of life event), the posts that match these rules are marked with the corresponding possible life events.

Although these rules can indicate a possible life event, a large portion of these messages can be false candidates. For this reason, the Detect phase is then carried out to validate the possible life events with their corresponding probability. For each post found in the Filter phase, we apply the ML classifier of the corresponding possible life event and compute the probability that the post contains the given life event. With this information, all posts with life event probability above the threshold θ are selected and the users of the corresponding posts are generated as the output of the system.

It is worth noting that ML is currently well known to produce the best solutions for dealing with ambiguous and noisy texts such as microblog posts. However, the proposed hybrid solution takes advantage of the rule-based filtering to reduce the search space for the ML classifier, which can reduce both the number of errors and the processing time. Moreover, treating types of life events independently makes fine-tuning, addition and removal of life event classes easy. For instance, to add a new type of life event, one needs to append the corresponding keywords for the Ingest phase, the rules for Filter, and a binary classifier in the Detect phase. This can be done with no impact on the accuracy of existing life events.

====3.2 Entity Matching System====
Given the output of the life event detection system, i.e. users (aka entities) that posted life events on social media, the main goal of the entity matching system is to find corresponding people in a database of real names. To achieve this task accurately, the system must use as much information as possible to decrease the level of uncertainty.

Dealing with users found on SMNs, though, is very challenging. First of all, on most SMNs the basic information about the user (e.g. name, location, age) is very limited (on Twitter only the name and location of the user are available). In addition, such personal information may be lacking or not relevant, since filling it in may not be mandatory, and the content filled in is not verified. Besides that, even when the information is seriously provided by the user, other difficulty factors can appear, such as the use of simplified names (Claudio Pinhanez instead of Claudio Santos Pinhanez), the use of social media pen-names (@cinhanez instead of Claudio Santos Pinhanez), or the use of nicknames (Darth Vader instead of Claudio Santos Pinhanez).

To deal with some of the aforementioned difficulty factors, for this work we have developed a system to match names and locations of users using three different string distance functions:

# Exact matching (EM): a match is found if all the names of an SMN user are identical to those of a client.
# Entity Distance 1 (ED1): designed to consider misspellings and transpositions between adjacent characters as a match. For instance, the user "Jooa Paulo" matches the client "Joao Paulo", and the user "Carolina" matches "Carolina". In this case, the threshold σ<sub>1</sub> is used to define a match only if the similarity value is above this threshold.
# Entity Distance 2 (ED2): designed to match abbreviations and some nicknames. For example, the user "Joseph S." matches the client "Joseph Salem"; the user "Fabinho" matches the client "Fábio", and "Mari" matches "Mariana". Similarly to ED1, the threshold σ<sub>2</sub> is used to define a match.

The execution of the three aforementioned matching algorithms results in three distinct sets of users, denoted Ω<sub>EM</sub>, Ω<sub>ED1</sub> and Ω<sub>ED2</sub>. The resulting set of users Ω<sub>All</sub> corresponds to the union of those individual sets. That is, Ω<sub>All</sub> = Ω<sub>EM</sub> ∪ Ω<sub>ED1</sub> ∪ Ω<sub>ED2</sub>, where Ω<sub>EM</sub> ∩ Ω<sub>ED1</sub> ∩ Ω<sub>ED2</sub> ≠ ∅ or Ω<sub>EM</sub> ∩ Ω<sub>ED1</sub> ∩ Ω<sub>ED2</sub> = ∅, depending on the data.

It is worth mentioning that Jaro-Winkler similarity filtering [20] is used prior to calling ED1 and ED2, to eliminate weak matches such as 'Maria' and 'Maria das Graças Silva'. Furthermore, ED1 and ED2 may return more than one match for the same user, whenever the result is above the given threshold. In this work, only the match with the highest value is considered.

===4. EXPERIMENTS===
In this section we present the results of applying the proposed system to a dataset containing 9 million posts from Twitter, which have been produced by about 1.4 million users. This data has been gathered by means of the GNIP social media data provider [8].

These experiments have two main purposes. First, we aim at evaluating the numbers related to applying the system to this 9 million dataset, i.e. how many posts and users are returned by the system. Second, we focus on a quality analysis to validate those numbers by means of a manual inspection of samples of this dataset.

The life event detection system has been implemented for six types of life events: Marriage, Graduation, Travel, Birthday, Birth, and Death. For each one, a training dataset of about 2 thousand samples has been manually labeled as either life event or non life event, and a distinct classifier has been trained. The training data has been obtained with the Twitter Search API [17]. For this work we make use of Naive Bayes classifiers using bag-of-words features [19]. The main parameters, i.e. θ, σ<sub>1</sub> and σ<sub>2</sub>, have been set to 0.5, 0.95 and 0.95, respectively.

====4.1 Quantitative Results====
As mentioned, the main purpose of the first experiment is to evaluate how many posts and users are returned after carrying out each phase of the proposed system. The results of applying the implemented life event detection system to the 9-million-tweet dataset are summarized in Figure 2. In this case, the Filter phase has returned 347 thousand posts from about 220 thousand users. Then, after going through the Detect module, 42 thousand posts, from about 19 thousand users, have been detected as life events. It is worth noting the large difference in terms of proportion from one phase to another. The Ingest phase captures a very large dataset, i.e. 9 million posts. Then, Filter finds that only 3.7% of these posts can be of interest. However, the Detect phase shows that from these 347K posts, only 42 thousand (0.45% of 9M or 12% of 347K) are really those that the application is looking for. Considering that many of the current search systems are rule-based, these results indicate that our proposed system can avoid a useless search on about 88% of the posts returned, 307 thousand posts in this case.

Figure 2: Results on the 9M dataset.

In Table 1 we present the results of the experiment above for each type of life event. We can observe that about 12% of the posts filtered have been generally confirmed as life events, but this proportion can vary according to the type of life event. For instance, for the Marriage class, from the 182,096 posts that the filter considered as possible life events, the machine learning algorithm detected 19,457 (10.6%) as being actual life events, which is close to the average. The Graduation type, on the other hand, presented a much larger proportion (43.21%), while Death and Travel presented smaller ones (5.47% and 8.26% respectively). We believe that this difference can happen either due to the period of the year in which the data is gathered (Graduation supposedly has more posts in certain periods of the year), or due to the type of life event, which may contain more non life events (Travel for example, which may present many posts from marketing agencies) or fewer life event posts (for instance Death, where people might be more introspective).

{| class="wikitable"
|+ Table 1: Detailed results on the 9 million dataset.
! Life event !! Filter (% of 350k) !! Detect (% of 42k)
|-
| Marriage || 182,096 (52.4) || 19,457 (46.5)
|-
| Graduation || 25,676 (7.4) || 11,097 (26.5)
|-
| Travel || 22,596 (6.5) || 1,868 (4.5)
|-
| Birthday || 33,305 (9.6) || 3,604 (8.6)
|-
| Birth || 48,687 (14.0) || 3,881 (9.3)
|-
| Death || 35,242 (10.1) || 1,929 (4.6)
|-
| Total (% of 9M) || 347,602 (3.7) || 41,836 (0.45)
|}

To evaluate the entity matching, we have built a dataset containing 1.6 million users using publicly-available data. The users in this dataset have been matched against the 19 thousand users that have been detected as the ones that posted life events in the 9M dataset. The results and this process are illustrated in Figure 3. Note that we have conducted two different experiments. The first one matches these users by taking into account only their names, since we consider this the minimum information we will be able to obtain from the SMN. In this case, 983 users have been found as probable matches. In the second experiment, where both names and locations are considered, only 5 users have been found. This shows that the precision of entity matching can be increased by considering more information in this process. On the other hand, this will also reduce the size of the resulting matching set.

Figure 3: Entity matching results.

In order to validate the above results, we performed a random sampling of 23 thousand posts (from the 9 million set) focusing on quality analysis. The numbers of posts filtered and detected are shown in Figure 4. The total number of posts filtered is 1,008, of which 105 have been detected as life events. Similar to the results on the 9 million set, only about 10% of the filtered posts have been detected as life events. Detailed numbers, for each type of life event, are presented in the columns Filtered and Detected in Table 2.

Figure 4: Results on the 23K sampling.

{| class="wikitable"
|+ Table 2: Number of posts returned per life event type on the 23K sampling.
! Life event !! Filtered (% of 1008) !! Detected (% of 105) !! Ground-Truth (% of 142)
|-
| Marriage || 162 (16.2) || 8 (7.6) || 7 (4.9)
|-
| Graduation || 70 (7.0) || 26 (24.7) || 15 (10.6)
|-
| Travel || 474 (47.4) || 55 (52.4) || 99 (69.7)
|-
| Birthday || 102 (10.2) || 11 (10.5) || 12 (8.5)
|-
| Birth || 107 (10.7) || 4 (3.8) || 7 (4.9)
|-
| Death || 93 (9.3) || 1 (0.95) || 2 (1.4)
|-
| Total (% of 23K) || 1008 (4.4) || 105 (0.45) || 142 (0.6)
|}

Those 1,008 posts resulting from the Filter module have then been manually inspected in order to verify whether the Detect phase has assigned the correct probability or not. The totals of posts for each type of life event are listed in the Ground-Truth column in Table 2. It can be observed that our system presents numbers that are close to what was found by the manual inspection. By comparing the manual inspection with the results of the system, we have been able to compute the confusion matrix presented in Table 3, which contains the total number of true positives, true negatives, false positives and false negatives. This has allowed us to compute the values for accuracy, precision and recall [16], which were about 89%, 65% and 48%, respectively. In this case, a true positive is a post that contains a life event (according to the manual inspection) and is correctly detected by the system, a true negative is not a life event and is correctly ignored by the system, a false positive is not a life event but is detected by the system, and a false negative is a life event but is not detected by the system. As a consequence, the precision represents the proportion of detected posts that contain life events, and the recall the proportion of life events that have been found by the system. It is worth noting that there is a trade-off between precision and recall that is set according to the value of θ, where lower values can increase recall and larger values increase the precision (see Figure 5).

{| class="wikitable"
|+ Table 3: Confusion matrix of the 1008 filtered posts found on the 23K sampling.
! Life event detection system !! Manual labeling: life events !! Manual labeling: non life events
|-
| Positive || 68 (6.7%) || 37 (3.7%)
|-
| Negative || 74 (7.3%) || 829 (82.4%)
|}

Figure 5: Precision/Recall trade-off by varying θ from 0.1 to 0.9.

Similarly, to validate the quality of the entity matching algorithm, we have done a random sampling of 500 users and manually inspected the correctness of the matches found. In this case, the entity matching algorithm returned 72 users, 43 found by EM, 13 by ED1 and 16 by ED2. But, as mentioned, both ED1 and ED2 can return more than one match per user if the matching algorithm returns a value above the thresholds σ<sub>1</sub> and σ<sub>2</sub>. For a better analysis of the algorithm, in Table 4 and Table 5 we present the confusion matrices of both ED1 and ED2 considering all matches. The former has found a total of 476 matches, with an accuracy of about 91%, precision of 10.4% and recall of 71.4%, while the latter has returned a total of 452 matches, with 94% accuracy, precision of 50% and recall of 94%.

{| class="wikitable"
|+ Table 4: Confusion matrix for ED1 on 500 users.
! Entity matching system !! Manual labeling: match !! Manual labeling: non match
|-
| Positive || 5 (1.10%) || 43 (9.0%)
|-
| Negative || 2 (0.4%) || 426 (89.5%)
|}

{| class="wikitable"
|+ Table 5: Confusion matrix for ED2 on 500 users.
! Entity matching system !! Manual labeling: match !! Manual labeling: non match
|-
| Positive || 17 (3.7%) || 17 (3.7%)
|-
| Negative || 1 (3.9%) || 427 (96.1%)
|}

===5. CONCLUSIONS===
In this work we presented a system for personalized offers based on life event detection. Once the system detects users posting life events on a social media network, these users are matched against an internal database of clients to decide what is the best approach to offer them a service or product. We described a way to implement the entire system, and presented the results of applying the system to a dataset of 9 million posts. From this set, a total of 42 thousand life events have been found, with a projected accuracy of 88.90% and precision of 65%. This indicates that, in a normal day of 20 million posts published by Brazilian users, for instance, the system would be able to detect around 91 thousand posts a day, about 60 thousand of them correct. Besides that, it is worth mentioning that the system is scalable since it has been implemented with the MapReduce programming paradigm.

Future work can follow many different and complementary paths. Accuracy is important and could be improved by evaluating other types of classifiers and features, as well as by increasing the training data. The addition and evaluation of other types of life events could be important to better understand the way people behave on SMNs. Furthermore, the adaptation to a real-time streaming platform such as IBM InfoSphere Streams would allow the system to react very quickly (in near real time) once the users post life events.

===6. REFERENCES===
* [1] Atefeh, F., and Khreich, W. A survey of techniques for event detection in twitter. Computational Intelligence (2013), n/a–n/a.
* [2] Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. Adaptive name matching in information integration. IEEE Intelligent Systems 18, 5 (Sept. 2003), 16–23.
* [3] Cohen, W. W., Ravikumar, P., and Fienberg, S. E. A comparison of string distance metrics for name-matching tasks. pp. 73–78.
* [4] De Choudhury, M., Counts, S., and Horvitz, E. Major life changes and behavioral markers in social media: Case of childbirth. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (New York, NY, USA, 2013), CSCW '13, ACM, pp. 1431–1442.
* [5] Ehrlich, K., and Shami, N. S. Microblogging inside and outside the workplace. In ICWSM (2010).
* [6] Eugenio, B. D., Green, N., and Subba, R. Detecting life events in feeds from twitter. 2012 IEEE Sixth International Conference on Semantic Computing 0 (2013), 274–277.
* [7] Felt, A. P., and Wagner, D. Phishing on mobile devices. In W2SP (2011).
* [8] GNIP. GNIP, 2014. [Online; accessed 28-May-2014].
* [9] IBM. IBM InfoSphere BigInsights, 2014. [Online; accessed 28-May-2014].
* [10] Kwak, H., Lee, C., Park, H., and Moon, S. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web (New York, NY, USA, 2010), WWW '10, ACM, pp. 591–600.
* [11] Levenshtein, V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10 (1966), 707.
* [12] Lin, J., and Dyer, C. Data-Intensive Text Processing with MapReduce. Claypool Publishers, 2010.
* [13] Liu, F., Weng, F., and Jiang, X. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (Stroudsburg, PA, USA, 2012), ACL '12, Association for Computational Linguistics, pp. 1035–1044.
* [14] Peled, O., Fire, M., Rokach, L., and Elovici, Y. Entity matching in online social networks. In Social Computing (SocialCom), 2013 International Conference on (Sept 2013), pp. 339–344.
* [15] Raad, E., Chbeir, R., and Dipanda, A. User profile matching in social networks. In Network-Based Information Systems (NBiS), 2010 13th International Conference on (Sept 2010), pp. 297–304.
* [16] Sokolova, M., and Lapalme, G. A systematic analysis of performance measures for classification tasks. Information Processing and Management 45 (2009), 427–437.
* [17] Twitter. Using the Twitter Search API, 2014. [Online; accessed 28-May-2014].
* [18] Vosecky, J., Hong, D., and Shen, V. User identification across multiple social networks. In Networked Digital Technologies, 2009. NDT '09. First International Conference on (July 2009), pp. 360–365.
* [19] Weiss, S. M., Indurkhya, N., and Zhang, T. Fundamentals of Predictive Text Mining. Springer London, 2010.
* [20] Winkler, W. E. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research Methods (American Statistical Association, 1990), pp. 354–359.
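As a worked check of the evaluation in Section 4.1, the sketch below recomputes accuracy, precision and recall from the confusion-matrix counts printed in Table 3 (life event detection) and Table 4 (ED1). It is only an illustration, not the authors' evaluation code: the `ConfusionMatrix` class and the constant names are hypothetical, and only the four counts per table are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion-matrix counts (system output vs. manual labels).

    Hypothetical helper for illustration; not part of the described system.
    """
    tp: int  # flagged by the system and confirmed by manual inspection
    fp: int  # flagged by the system but rejected by manual inspection
    fn: int  # missed by the system but found by manual inspection
    tn: int  # correctly ignored by the system

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def accuracy(self) -> float:
        # Share of all decisions that agree with the manual labels.
        return (self.tp + self.tn) / self.total

    def precision(self) -> float:
        # Share of system positives that are true positives.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of manually confirmed positives that the system found.
        return self.tp / (self.tp + self.fn)


# Counts as printed in Table 3 and Table 4 of the paper.
TABLE_3 = ConfusionMatrix(tp=68, fp=37, fn=74, tn=829)
TABLE_4_ED1 = ConfusionMatrix(tp=5, fp=43, fn=2, tn=426)

if __name__ == "__main__":
    for name, cm in [("Table 3", TABLE_3), ("Table 4 (ED1)", TABLE_4_ED1)]:
        print(
            f"{name}: n={cm.total} "
            f"accuracy={cm.accuracy():.1%} "
            f"precision={cm.precision():.1%} "
            f"recall={cm.recall():.1%}"
        )
```

With these counts the Table 3 figures round to the reported 89%, 65% and 48%, and the Table 4 figures to about 91%, 10.4% and 71.4%. The same helper can be applied to Table 5, whose printed counts sum to 462 rather than the 452 matches stated in the text, so its derived values should be read with that discrepancy in mind.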