Novel Location De-identification for Machine and Human

Katsuya Taguchi
Nara Institute of Science and Technology
Ikoma, Nara
taguchi.katsuya.tb3@is.naist.jp

Eiji Aramaki
Nara Institute of Science and Technology
Ikoma, Nara
aramaki@is.naist.jp

ABSTRACT

In recent years, the protection of personal information has drawn much attention, requiring advanced de-identification technology to remove personal information from data. Among various kinds of personal information such as personal names, phone numbers, and so forth, this study focuses on location information. The conventional approach to protecting location information is to remove address expressions. However, there are complicated cases in which location information can be guessed from unexpected combinations of non-address words. For example, we can guess that 'the most traditional city in Japan' is Kyoto. To our knowledge, such location-inferable expressions have not been dealt with. This study handles this phenomenon by using a location classifier. In addition, we assume two levels of location inference: (1) inferable by machine and (2) inferable by human. To build the first-level inference, we employed a collection of tweets with geo-tags. To build the second-level inference, we created a new corpus with a flag indicating whether each tweet is location-inferable by humans or not. By using the two types of corpora, we classified texts into several categories, such as machine-inferable but human-non-inferable tweets, and so on. We were also able to obtain de-identified tweets by iteratively removing the words weighted most highly by the classifiers. We believe our novel concepts of de-identification are essential for various kinds of privacy protection.

CCS CONCEPTS

• Security and privacy → Privacy protections; • Social and professional topics → Identity theft; Social engineering attacks; • Computing methodologies → Learning linear models;

KEYWORDS

De-identification, Location inference, SNS, Twitter, Natural language processing

1 INTRODUCTION

In recent years, de-identification techniques to delete sensitive personal information have been studied actively because of the growing interest in privacy protection. In most automatic de-identification technologies, sensitive personal information is regarded as identical to proper expressions such as personal names, organization names, phone numbers, ID numbers, and addresses. Therefore, Named Entity Recognition (NER) techniques have been applied to de-identification. In this paper, this conventional approach is designated as NER-based de-identification.

NER-based de-identification has an important limitation: an address can be identifiable from non-named-entity expressions. Sometimes, a combination of general terms can be a strong clue for identifying a specific location. Consider the following sentence: 'I'm excited to have dinner with my colleague on the riverbed!' Because the riverbed is a famous spot in Kyoto and its location in Kyoto is well known, most Japanese people can guess that the person behind the tweet is located in Kyoto. This limitation of NER-based approaches becomes an important issue because many people unintentionally expose their location information to others. Sometimes this knowledge might be used illegally.

This study specifically examines automatic de-identification of messages in Twitter in terms of their location information. Our de-identification method has two novel features.

• This study handles location-inferable expressions (not only proper expressions but also non-proper expressions).
• This study assumes two levels of location inference: (1) inferable by machine and (2) inferable by humans.

Using the two viewpoints of inference, we were able to design several levels of de-identification: for example, a level for machine-inferable but human-non-inferable tweets, and so on.
It is noteworthy that the proposed method is independent of any specific language.

The remainder of this paper is organized as follows. First, we construct a classifier to infer tweet locations using geo-tagged tweets in Twitter (Section 4). Next, we investigate whether the classifier can infer tweet locations that are de-identified by humans (Section 5). Then, we tag the tweets with whether a human can infer the locations (what we call the feasibility of location inference), and we compare the difference between the classifier and humans (Section 6). Finally, we present a de-identification method considering the combination of words (Section 7).

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes.
UISTDA '18, March 11, 2018, Tokyo, Japan

2 RELATED WORK

Location Inference

Many methods for location inference have been proposed to date. They are classifiable by two aspects: location types to be estimated and material types to be used for location estimation, as presented in Table 1.

Table 1: Work related to location inference

                 Home location             Tweet location            Mentioned location
Human network    Kong et al. [1]           Sadilek et al. [2]        Hua et al. [3]
Tweet content    Yamaguchi et al. [4]      Flatow et al. [6]         Li et al. [8]
                 (word-centric)            (word-centric)
                 Cha et al. [5]            Kinsella et al. [7]
                 (location-centric)        (location-centric)
Tweet context    Efstathiades et al. [9]   Dredze et al. [10]        Fang et al. [11]

As for location types, roughly three types of locations have been considered to date, as shown in Table 1: user home locations, tweet locations, and mentioned locations. A home location is a location where a user lives or spends much time, including the address of a user's home or office. A tweet location is one from which a user has posted a tweet. A mentioned location is one that a user has described in a tweet. This paper represents an attempt to estimate tweet locations, which are our target of de-identification.

For location inference, three types of materials have been used, as shown in Table 1: human network, tweet content, and tweet context. A human network is a relation between users in social networking services, such as follower or followee in Twitter. Tweet content represents the content of a tweet message. Tweet context is information associated with a tweet, such as a time stamp, geo-tag, or time zone.

When inferring locations using tweet content, there are two major approaches distinguished by their probabilistic models. One is called the word-centric model, which calculates the probability p(l|W) that a location l is labeled to a set of words W. The other is called the location-centric model, which calculates the probability p(d|l) that each location label l outputs a tweet document d. In this paper, the word-centric model is applied to analyze tweet contents and to construct a classifier to estimate a tweet's location.

The study by Flatow et al. [6] is similar to ours in that they attempted to infer tweet locations with the word-centric model and tweet content. However, their method cannot estimate locations that are identifiable by unexpected word combinations because the classifier is constructed using a word list that is appropriate to each area. By contrast, we propose a method to infer locations by considering word combinations.

De-identification

In the medical field, de-identification of patient data has been studied actively. A conventional approach, Named Entity Recognition (NER) based de-identification, deletes proper expressions that are capable of specifying individuals, such as proper nouns, phone numbers, and addresses. However, NER-based de-identification is insufficient for de-identifying location information. Moreover, in the medical field, a law exists to protect individuals' medical records and other personal health information: the Health Insurance Portability and Accountability Act (HIPAA, https://www.hhs.gov/hipaa/index.html), which was approved in the U.S.A. in 1996. As for de-identification of social media contents including messages with location information, however, no criteria correspond to HIPAA. This paper therefore sets criteria for the de-identification of tweet locations by conducting experiments related to manual de-identification.

3 DATASET

This section describes our dataset consisting of tweets with location information and area division.

Tweets

Tweet data consist of 298,711 Japanese messages with geo-tags (hereinafter called 'tweets') posted within the central region of Kyoto City, Japan (latitude range = [34.93, 35.12] and longitude range = [135.67, 135.83]). This region includes popular landmarks, train stations, castles, shrines, temples, and so on, yielding a diverse mix of tweets. The tweets were collected over about one year, between 2011/7/14 and 2012/7/31. The tweet data are divided into training data and test data. Training data consisting of 179,227 tweets (60% of all data) are used to construct a classifier as described in Section 4. The test data consist of 119,484 tweets (40% of all data) and are used to evaluate the classifier's performance in Section 4. Some test data are used for experiments in Sections 5 and 6.

Area Division

The region described in Section 3.1, the central region of Kyoto City, Japan, was divided into 200 (= 20 × 10) areas (a_{1,1}, ..., a_{20,10} ∈ A_{kyoto}), as presented in Figure 1. Each area is 501 m × 547 m. This division was useful to separate two consecutive stations (Hankyu Kawaramachi Station and Hankyu Karasuma Station) into two areas. Both areas are located near the Hankyu Kawaramachi Station, which is well known as the busiest downtown area in Kyoto. Therefore, this manner of division is reasonable.

Figure 1: Geographical distribution of tweets in the central region of Kyoto City. The region is divided into 200 areas.

Figure 1 presents the geographic distribution of the 298,711 tweets for the selected areas. 39,078 tweets (13.1% of all tweets) were posted around the area a_{15,5}, where Kyoto Station (Kyoto's largest train station) is located. By contrast, only two tweets were posted in area a_{17,10}, which is located southeast of Miterasennyuji Temple.

4 CONSTRUCTION OF LOCATION CLASSIFIER

This section describes a method to construct a classifier that estimates a tweet location, and it shows the classifier's performance. Each text is split into words using a Japanese morphological analyzer, MeCab (http://taku910.github.io/mecab/). All uni-grams and bi-grams are used as features for a bag-of-words representation. They are converted into vectors and are used for the training data. Each element of a vector is one or zero according to whether the corresponding feature appears or not. Noise such as URLs (e.g. "https://XXX"), hashtags (e.g. "#hashtag"), or mentions of other users (e.g. "@username") was removed from each text (https://github.com/s/preprocessor). Correct answer labels are set to each area (200 classes in total) and are attached to each tweet based on its geo-tag. The classifier is constructed based on a linear model trained by logistic regression.

To evaluate the constructed classifier, the test data were classified into 200 classes. Results show that the accuracy for the test data was 47.2%. If the classifier always outputs the area a_{15,5}, which has the highest tweet density in the training data, then the accuracy for the test data is 11.6%. Compared with this baseline, 47.2% is modestly high considering the classifier's simple structure.

5 PRELIMINARY EXPERIMENT: MANUAL DE-IDENTIFICATION

When using the classifier constructed in Section 4, it is necessary to define the state: 'a tweet is de-identified.' This section describes an experiment by which the state is defined.

Materials and Procedure

To define the state that a tweet is de-identified, a manually annotated corpus was created. 500 tweets from the test data were de-identified manually. First, participants observe each tweet and infer its location as precisely as possible. The participants are allowed to use search engines, etc. Then, they delete the minimum number of morphemes in a tweet until they ascertain that the tweet's location becomes ambiguous.

In this preliminary experiment, two annotators with knowledge about Kyoto City independently annotated 500 tweets of the test data. The tweet below is an example of the annotated tweets. Words to be deleted are crossed off. In this example, the annotators considered that Tweet (1) was de-identified by deleting '御池 (Oike)' and 'マザーズハローワーク (Mother's Hello Work)'.

(1) 烏丸御池プラザが本チャンやないんか? @マザーズハローワーク鳥丸御池
(Is not the Karasuma Oike Plaza main? @Mother's Hello Work Karasuma Oike)

Then, a threshold determining whether a tweet is de-identified or not is defined using the annotated tweets. Given a de-identified tweet, the classifier calculates the probability of location inference for each of the 200 areas to which the tweet could be assigned. The maximum of the 200 probability values can be regarded as a reference for the tweet's de-identification. Finally, the average of the maximum probability values over all annotated tweets is used as the threshold. The tweets for which the maximum probability is below the threshold were regarded as being de-identified.

Results and Discussion

The threshold value was set to 0.37 from the preliminary experiment's result. Therefore, the tweets for which the probability was less than 0.37 were regarded as being de-identified.

However, for some tweets, the classifier outputs a high probability but the annotators were uncertain about their location, or vice versa. Because of such a discrepancy, one can probably distinguish two types of inference to be prevented by de-identification. One is to prevent inference of the location itself. The other is to prevent the assumption that a location can be inferred. In the next section, we examine another classifier to infer the feasibility of location inference.

6 FEASIBILITY OF LOCATION INFERENCE

This section describes a preliminary experiment to construct a classifier that infers the feasibility of location inference, and the actual construction.

Materials and Procedures

To construct a classifier that infers the feasibility of location inference, a corpus annotated with the feasibility of location inference is generated. We first used 1,000 tweets selected randomly from the test data in Section 3.1. Then, binary classification tasks were conducted according to whether or not the locations can be inferred. To gather a large number of participants, the tasks were conducted through crowdsourcing. 100 participants answered, for each tweet, whether or not the location can be inferred. The tweets for which 10% or more of the participants answered that the location can be inferred were defined as tweets with feasibility of location inference. The others were treated as those without feasibility of location inference.

Results and Discussion

246 of 1,000 tweets showed the feasibility of location inference. Considering the two classification criteria, whether the classifier in Section 4 can infer the locations of tweets and whether tweets have feasibility of location inference, the 1,000 tweets were classified into four classes. The results are presented in Table 2.

Table 2: Inference of feasibility of location inference by the constructed classifier (machine) and human

                          Classifier: inferable   Classifier: not inferable
Human: inferable          216                     30
Human: not inferable      258                     496

Examples of tweets for which the classifier can infer the location, but which are without feasibility of location inference, are presented below.

(2) 風が強いです (>_<) 今日も明るく元気にお昼の営業開始です!
(The wind is so strong (>_<). I am about to start my lunch-hour business brightly and cheerfully as usual!)

(3) おはようございます (^^) 今日の日中は雨予報ですね。 気温も 20 ℃まで行かないようです。 今日も明るく元気に!忙しく楽しい一日になるよう頑張ります p(^_^)q
(Good morning (^^). It is supposed to rain during the day. The temperature will not reach 20°C. Let's be bright and cheerful! I try to be busy and enjoy my day p(^_^)q.)

These tweets include fixed phrases for advertising stores, e.g. the latter part of Tweet (2), 'I am about to start my lunch-hour business brightly and cheerfully as usual!' It seems that several tweets with typical phrases by a specific store are included in the training data. However, humans cannot read and learn so many tweets. Therefore, they believe that such tweets have no feasibility of location inference. The tweets below are examples for which the classifier cannot infer the location, but humans determine that they have feasibility of location inference.

(4) やっとお昼ご飯。つばめ
(Finally, lunch time. Tsubame)

(5) 河合塾の向かいのサブウェイなう!
(I am at the subway station across the street from Kawaijuku now!)

Tweet (4) is a case in which the proper noun 'つばめ (Tsubame)' is also a common noun. Considering such cases, data tagged with feasibility of location inference are apparently necessary. Tweet (5) represents a case in which the location is inferable by a combination of '河合塾 (Kawaijuku)' and 'サブウェイ (Subway)'.

Construction of Classifier for Inferring Location Inference Feasibility

A classifier was constructed with the 1,000 tweets tagged with feasibility of location inference. Of the 1,000 tweets, 900 were training data. The other 100 were test data. As described in Section 4, each tweet was analyzed using MeCab. All uni-grams and bi-grams were used as features of a Bag-of-Words. Training was performed using logistic regression. Accuracy obtained using the test data was 86.0%.

7 DISCUSSION

From the results, de-identification of two kinds apparently exists. For that reason, it is necessary to use the system properly according to the purpose of de-identification. For example, the classifier in Section 4 would de-identify tweets posted for sales purposes. Because it is not necessary to de-identify such tweets, tweets can be de-identified using both of the classifiers in Sections 4 and 6. Here we propose a method to de-identify tweets related to a combination of words.

Method

Below is the algorithm of de-identification using the classifier constructed in Section 4.

Step 0: Substitute 1 for m, the number of morphemes to be deleted.

Step 1: Delete m morpheme(s) from an original tweet s_{org}. When the number of morphemes of s_{org} is n, the number of possible patterns is {}_nC_m. Group the {}_nC_m tweets into one group, S (= {s_1, ..., s_{{}_nC_m}}).

Step 2: For each tweet s_i in S, find the maximum value of the probabilities the classifier outputs for its location (a_1, ..., a_{200}).

prob(s_i, a_j) = p(a_j | s_i)
maxprob(s_i) = max(prob(s_i, a_1), ..., prob(s_i, a_{200}))

Step 3: Let the tweet with the least maxprob(s_i) be s_{new}, where the following holds.

s_{new} = arg min_{s_i} maxprob(s_i)

Return s_{new} if maxprob(s_{new}) is below the threshold (= 0.37). Otherwise, increment m by 1 and go back to Step 1 when m is less than n.

Results and Discussion

We present a part of the result of de-identification by the proposed method. The tweets below are samples of the de-identified tweets. Words to be deleted are crossed off.

(6) まだまだ 新幹線京都駅
(It is still a long way to the Kyoto Shinkansen Station.)

(7) 5年ぶり 京都タワー
(It has been five years since I came to Kyoto tower.)

(8) 清水の舞台から 1 枚 京都の街が一望だね
(I took a picture from the top of Kiyomizu. It has a full view of Kyoto.)

(9) ランチ (at なか卯 河原町五条店) 折田先生なう
(I am having lunch at the Nakau Gojo branch in Kawaramachi with Orita-sensei now.)

(10) 京都御所一般公開中
(Kyoto Imperial Palace is now open to the public.)

(11) 阪急河原町なう
(I am at Hankyu Kawaramachi now.)

'新幹線京都駅 (Shinkansen Kyoto Station)' is a proper noun, but there are many stations in Kyoto City. Therefore, ideal de-identification is achieved by deleting '新幹線京都 (Shinkansen Kyoto)'. In the case of '京都タワー (Kyoto tower)', an ideal de-identification system will delete 'タワー (tower)' because it is the only tower in Kyoto City. The result (5) is a successful example. Using the proposed method, ideal de-identification can be achieved in that this algorithm does not delete the whole proper noun.

With adequate training data, the method would work ideally, but a failure example exists as follows.

(12) 撮り飽きもせず 撮り足りもせず 京都御苑
(I never get tired of and never get enough of taking pictures in Kyoto Gyoen.)

The algorithm should delete '御苑 (Gyoen)', but it actually deletes '京都 (Kyoto)'.

Furthermore, we investigated the relation between the size of the training data and the accuracy. Figure 2 shows that more training data are necessary. Obtaining a sufficient amount of data is difficult because making these data involves manual labeling.

Figure 2: Relation between the number of training samples and the accuracy.

8 APPLICATION

A system for inference and de-identification of tweets can be built based on the proposed de-identification method. Figure 3 presents screenshots of the system. Given any tweet as input, this system infers its tweet location and de-identifies it according to a selected number of morphemes to be deleted. This process supports both machine and human inference. For both (a) raw and (b) de-identified tweets, the location inference results are presented on each map.

Figure 3: Images of the system's user interface. (a) Before: Original tweet and its location inference. (b) After: De-identified tweet and its location inference.

9 CONCLUSION

This paper proposed a novel de-identification method to anonymize tweet locations. Two kinds of tweet location inference were presented. One is inference of a location itself. The other is inference of the feasibility of location inference. These location inferences are based on the respective definitions of de-identification. The former tends to regard tweets from stores as identifiable because such tweets are posted from only one place many times. The latter tends to regard tweets in which common nouns are used as proper nouns as identifiable. Therefore, in practical use, it would not be sufficient to apply common concepts for de-identification of location. Our de-identification algorithm, based on the hypothesis that locations of tweets are inferable from combinations of words, partially brought the expected results. Future tasks involve how to incorporate consideration of contexts. The analyses described in this paper treated each tweet as a Bag-of-Words and did not use information on relations between morphemes. This problem is expected to be resolved by consideration of the syntactic structures of tweets.

ACKNOWLEDGEMENTS

This work is supported in part by Japan Agency for Medical Research and Development (16768699), Strategic Information and Communications R&D Promotion Programme (SCOPE), the Ministry of Internal Affairs and Communications of Japan.

REFERENCES

[1] L. Kong, Z. Liu, and Y. Huang. Spot: Locating social media users based on social network context. In Proc. of the VLDB Endowment, 7(13): pp. 1681–1684, 2014.
[2] A. Sadilek, H. Kautz, and J. P. Bigham. Finding your friends and following them to where you are. In Proc. of the Fifth Intl. Conference on Web Search and Web Data Mining, pp. 723–732, 2012.
[3] W. Hua, K. Zheng, and X. Zhou. Microblog entity linking with social temporal context. In Proc. of the 2015 ACM SIGMOD Intl. Conference on Management of Data, pp. 1761–1775, 2015.
[4] Y. Yamaguchi, T. Amagasa, H. Kitagawa, and Y. Ikawa. Online user location inference exploiting spatiotemporal correlations in social streams. In Proc. of the 23rd ACM Intl. Conference on Information and Knowledge Management, pp. 1139–1148, 2014.
[5] M. Cha, Y. Gwon, and H. T. Kung. Twitter geolocation and regional classification via sparse coding. In Proc. of the Ninth Intl. Conference on Web and Social Media, pp. 582–585, 2015.
[6] D. Flatow, M. Naaman, K. E. Xie, Y. Volkvich, and Y. Kanza. On the accuracy of hyper-local geotagging of social media content. In Proc. of the Eighth ACM Intl. Conference on Web Search and Data Mining, pp. 127–136, 2015.
[7] S. Kinsella, V. Murdock, and N. O'Hare. I'm eating a sandwich in Glasgow: modeling locations with tweets. In Proc. of the Workshop on Search and Mining User-Generated Contents, pp. 61–68, 2011.
[8] G. Li, J. Hu, J. Feng, and K.-l. Tan. Effective location identification from microblogs. In Proc. of the 30th Intl. Conference on Data Engineering, pp. 880–891, 2014.
[9] H. Efstathiades, D. Antoniades, G. Pallis, and M. D. Dikaiakos. Identification of key locations based on online social network activity. In Proc. of the 2015 IEEE/ACM Intl. Conference on Advances in Social Networks Analysis and Mining, pp. 218–225, 2015.
[10] M. Dredze, M. Osborne, and P. Kambadur. Geolocation for twitter: Timing matters. In Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1064–1069, 2016.
[11] Y. Fang and M. Chang. Entity linking on microblogs with spatial and temporal signals. Transactions of the Association for Computational Linguistics, 2: pp. 259–272, 2014.
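
As a supplementary illustration, the threshold definition of Section 5 and the deletion procedure of Section 7 (Steps 0 to 3) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The classifier is abstracted as a hypothetical function `predict_proba` that maps a list of morphemes to one probability per area; `toy_predict_proba`, its three areas, and its numbers are invented purely for demonstration. The loop stops at m = n - 1 so that an empty tweet is never returned, which is one possible reading of 'when m is less than n'.

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a list of morphemes in, one probability per area out.
ProbFn = Callable[[Sequence[str]], Sequence[float]]


def max_prob(predict_proba: ProbFn, morphemes: Sequence[str]) -> float:
    """maxprob(s_i): the largest area probability for one tweet."""
    return max(predict_proba(morphemes))


def threshold_from_annotations(
    predict_proba: ProbFn, deidentified: Sequence[Sequence[str]]
) -> float:
    """Section 5: average maxprob over manually de-identified tweets."""
    values = [max_prob(predict_proba, tweet) for tweet in deidentified]
    return sum(values) / len(values)


def deidentify(
    predict_proba: ProbFn, morphemes: Sequence[str], threshold: float = 0.37
) -> Optional[List[str]]:
    """Section 7: delete m morphemes, keep the candidate with least maxprob."""
    n = len(morphemes)
    for m in range(1, n):  # Step 0 (m = 1) and the increment of m
        # Step 1: all nCm ways of deleting m morphemes from the original tweet.
        candidates = [
            [tok for i, tok in enumerate(morphemes) if i not in removed]
            for removed in combinations(range(n), m)
        ]
        # Steps 2 and 3: candidate whose maximum area probability is smallest.
        s_new = min(candidates, key=lambda s: max_prob(predict_proba, s))
        if max_prob(predict_proba, s_new) < threshold:
            return s_new
    return None  # no non-empty candidate fell below the threshold


def toy_predict_proba(morphemes: Sequence[str]) -> List[float]:
    """Toy stand-in for the 200-class classifier (3 areas, invented numbers)."""
    words = set(morphemes)
    if {"Kyoto", "Tower"} <= words:
        return [0.90, 0.05, 0.05]
    if "Tower" in words:
        return [0.60, 0.20, 0.20]
    if "Kyoto" in words:
        return [0.34, 0.33, 0.33]
    return [1 / 3, 1 / 3, 1 / 3]


if __name__ == "__main__":
    print(deidentify(toy_predict_proba, ["five-years", "Kyoto", "Tower"]))
```

With the toy classifier, the call in the last line keeps 'five-years' and 'Kyoto' and drops 'Tower', mirroring the ideal behavior described for Tweet (7). Because Step 1 enumerates every combination, the number of candidates grows quickly with tweet length and m, so a practical system would probably need to cap m or prune candidates.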