                       Novel Location De-identification for
                              Machine and Human
                       Katsuya Taguchi                                                      Eiji Aramaki
            Nara Institute of Science and Technology                          Nara Institute of Science and Technology
                           Ikoma, Nara                                                       Ikoma, Nara
                taguchi.katsuya.tb3@is.naist.jp                                          aramaki@is.naist.jp

ABSTRACT                                                               1 INTRODUCTION
In recent years, the protection of personal information has            In recent years, de-identification techniques to delete sen-
drawn much attention, requiring an advanced technology                 sitive personal information have been studied actively be-
on de-identification to remove personal information from               cause of the growing interest in privacy protection. In most
data. Among various personal information such as personal              automatic de-identification technologies, sensitive personal
names, phone numbers, and so forth, this study focuses on              information is regarded as identical to proper expressions
location information. The conventional approaches to pro-              such as personal names, organization names, phone num-
tect location information are to remove address expressions.           bers, ID numbers, and addresses. Therefore, Named En-
However, there are complicated cases in which location in-             tity Recognition (NER) techniques have been applied to de-
formation can be guessed with unexpected combinations of               identification. As described in this paper, this conventional
non-address words. For example, we can guess ‘the most                 approach is designated as NER-based de-identification.
traditional city in Japan’ is Kyoto. To our knowledge, such               Actually, NER-based de-identification has an important
location-inferable expressions have not been dealt with.               limitation: an address is identifiable from non-named entity
This study handles this phenomenon by using a location                 expressions. Sometimes, the combination of general terms
classifier. In addition, we assume two levels of location              can be a strong clue for identifying a specific location. Con-
inference: (1) inferable by machine and (2) inferable by human.        sider the following sentence: ‘I’m excited to have dinner with
To build the first-level inference, we employed a collection           my colleague on the riverbed!’ Because the riverbed is a fa-
of tweets with geo-tags. To build the second-level inference,          mous spot in Kyoto and the location of riverbed in Kyoto is
we created a new corpus with a flag for whether tweets are             well-known, most Japanese people can guess that the per-
location-inferable by human or not. By using the two types             son behind the tweet is located in Kyoto. This limitation of
of corpora, we classified texts into several categories such as        NER-based approaches becomes an important issue because
a machine-inferable but human-non-inferable tweet, and so              many people unintentionally expose their location informa-
on. We could also obtain de-identified tweets by iteratively           tion to others. Sometimes the knowledge might be used il-
removing the words with the highest classifier weights. We             legally.
believe our novel concepts of de-identification are essential             This study specifically examines automatic de-
for many forms of privacy protection.                                  identification of messages in Twitter in terms of their
                                                                       location information. Our de-identification method has two
CCS CONCEPTS                                                           novel features.
• Security and privacy → Privacy protections; • Social                    • This study handles location-inferable expressions
and professional topics → Identity theft; Social engi-                      (not only proper expressions but also non-proper ex-
neering attacks; • Computing methodologies → Learning                       pressions).
linear models;                                                            • This study assumes two levels of location inference:
KEYWORDS                                                                    (1) inferable by machine and (2) inferable by humans.
De-identification, Location inference, SNS, Twitter, Natural             Using the two viewpoints of inference, we were able
language processing                                                    to design several levels of de-identification, such as a
                                                                       level for machine-inferable but human-non-inferable tweets.
                                                                       It is noteworthy that the proposed method is indepen-
                                                                       dent of any specific language.
                                                                         The remainder of this paper is organized as follows. First,
UISTDA ’18, March 11, 2018, Tokyo, Japan                               we construct a classifier to infer tweet locations using geo-
                                                                       tagged tweets in Twitter (Section 4). Next, we investigate

                                          Table 1: Work related to location inference

                                            Home location             Tweet location         Mentioned location
                       Human network        Kong et al. [1]           Sadilek et al. [2]     Hua et al. [3]
                       Tweet content        Yamaguchi et al. [4]      Flatow et al. [6]      Li et al. [8]
                                            (word-centric)            (word-centric)
                                            Cha et al. [5]            Kinsella et al. [7]
                                            (location-centric)        (location-centric)
                       Tweet context        Efstathiades et al. [9]   Dredze et al. [10]     Fang et al. [11]

whether the classifier can infer tweet locations that are de-          model and tweet content. However, the method cannot es-
identified by humans (Section 5). Then, we tag the tweets              timate locations that are identifiable by unexpected word
with whether a human can infer the locations (what we call             combinations because a classifier is constructed using a
feasibility of location inference). We compare the differ-             word list that is appropriate to each area. By contrast, we
ence between the classifier and human (Section 6). Finally,            propose a method to infer locations by considering word
we present a de-identification method considering the com-             combinations.
bination of words (Section 7).
2   RELATED WORK                                                       In the medical field, de-identification of patient data has
Location Inference                                                     been studied actively. A conventional approach, Named
Many methods for location inference have been proposed                 Entity Recognition (NER) based de-identification, deletes
to date. They are classifiable by two aspects: location types          proper expressions that are capable of specifying individ-
to be estimated and material types to be used for location             uals, such as proper nouns, phone numbers, and addresses.
estimation, as presented in Table 1.                                   However, NER-based de-identification is insufficient for de-
   As for location types, roughly three types of locations             identifying location information. Moreover, in the medical
have been considered to date as shown in Table 1: user home            field, a law exists to protect individuals’ medical records
locations, tweet locations, and mentioned locations. Home lo-          and other personal health information: Health Insurance
cation is a location where a user lives or spends much time,           Portability and Accountability Act (HIPAA) 1 , which was
including the address of a user’s home or office. Tweet lo-            approved in the U.S.A. in 1996. As for de-identification of so-
cation is one from which a user has posted a tweet. A men-             cial media contents including messages with location infor-
tioned location is one that a user has described in a tweet.           mation, however, no criteria correspond to HIPAA. This pa-
This paper represents an attempt to estimate tweet loca-               per therefore sets criteria for the de-identification of tweet
tions, which are our target of de-identification.                      locations by conducting experiments related to manual de-
   For location inference, three types of materials have been          identification.
used as shown in Table 1: human network, tweet content,
                                                                       3 DATASET
and tweet context. A human network is a relation between
users in social networking services such as follower or fol-           This section describes our dataset consisting of tweets with
lowee in Twitter. Tweet content represents the content of              location information and area division.
a tweet message. Tweet context is information associated
with a tweet such as a time stamp, geo-tag, or time zone.
When inferring locations using tweet content, there are                Tweet data consist of 298,711 Japanese messages with geo-
two major approaches distinguished by probabilistic mod-               tags (hereinafter called ‘tweets’) posted within the central
els. One is called the word-centric model, calculating the             region of Kyoto City, Japan (latitude range = [34.93, 35.12]
probability p(l|W ) that a location l is labeled to a set of           and longitude range = [135.67, 135.83]). This region includes
words W . The other is called a location-centric model. It             popular landmarks, train stations, castles, shrines, temples,
calculates the probability p(d|l) that each location’s label           and so on, yielding a diverse mix of tweets. The tweets were
l outputs a tweet document d. In this paper, the word-centric          collected for about a year, between 2011/7/14 and 2012/7/31.
model is applied to analyze tweet contents and to construct              The tweet data are divided into training data and test data.
a classifier to estimate a tweet’s location.                           Training data consisting of 179,227 tweets (60% of all data)
   The study by Flatow et al. [6] is similar to ours in that they      are used to construct a classifier as described in Section 4.
attempted to infer tweet locations with the word-centric               1 https://www.hhs.gov/hipaa/index.html
                                                                    performance. Each text is split into words using a Japan-
                                                                    ese morphological analyzer, MeCab2 . All uni-grams and bi-
                                                                    grams are used as features for a bag-of-words representa-
                                                                    tion. They are converted into vectors and are used for the
                                                                    training data. Each element of a vector is one or zero ac-
                                                                    cording to whether the corresponding feature appears in the
                                                                    tweet. Noise such as URLs (e.g. “https://XXX”), hashtags
                                                                    (e.g. “#hashtag”), and mentions of other users (e.g.
                                                                    “@username”) was removed from each text3 . Each of the
                                                                    200 areas is a class label, and each tweet is labeled with
                                                                    the area that contains its geo-tag. The classifier is a
                                                                    linear model trained by logistic regression.
                                                                       To evaluate the constructed classifier, the test data were
                                                                    classified into the 200 classes. The accuracy on the test
                                                                    data was 47.2%. By comparison, a baseline that always
                                                                    outputs the area $a_{15,5}$, which has the highest tweet
                                                                    density in the training data, achieves an accuracy of 11.6%.
                                                                    Thus, 47.2% is reasonably high for such a simple classifier.
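
The feature construction described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace splitting stands in for MeCab, the pattern `NOISE` and all function names are hypothetical, and the two sample texts are invented. Training the logistic regression model on the resulting vectors is omitted.

```python
import re

# Hypothetical noise pattern covering URLs, hashtags, and mentions (Section 4).
NOISE = re.compile(r"https?://\S+|[#@]\S+")


def tokenize(text):
    """Stand-in for MeCab (assumption): strip noise, then split on whitespace."""
    return NOISE.sub(" ", text).split()


def ngram_features(tokens):
    """Return the set of uni-grams and bi-grams that occur in a token list."""
    unigrams = {(t,) for t in tokens}
    bigrams = set(zip(tokens, tokens[1:]))
    return unigrams | bigrams


def build_vocabulary(token_lists):
    """Assign a column index to every feature seen in the training tweets."""
    vocab = {}
    for tokens in token_lists:
        for feat in sorted(ngram_features(tokens)):
            vocab.setdefault(feat, len(vocab))
    return vocab


def binary_vector(tokens, vocab):
    """Element is 1 if the feature appears in the tweet, 0 otherwise."""
    vec = [0] * len(vocab)
    for feat in ngram_features(tokens):
        if feat in vocab:
            vec[vocab[feat]] = 1
    return vec


# Invented sample texts, used only to exercise the functions.
train = [
    tokenize("lunch near Kyoto Station https://XXX"),
    tokenize("@username walking along Kamo river #hashtag"),
]
vocab = build_vocabulary(train)
vectors = [binary_vector(t, vocab) for t in train]
```

Each resulting vector would then be paired with one of the 200 area labels derived from the tweet's geo-tag.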

                                                                    5 PRELIMINARY EXPERIMENT: MANUAL
                                                                    When using the classifier constructed in Section 4, it is nec-
                                                                    essary to define the state: ‘a tweet is de-identified.’ This sec-
Figure 1: Geographical distribution of tweets in the central        tion describes an experiment by which the state is defined.
region of Kyoto City. The region is divided into 200 areas.
                                                                    Materials and Procedure
                                                                    To define the state in which a tweet is de-identified, a
The test data consist of 119,484 tweets (40% of all data) used      manually annotated corpus was created: 500 tweets from the
to evaluate the classifier’s performance in Section 4. Some         test data were de-identified manually. First, participants
test data are used for experiments in Sections 5 and 6.             observed each tweet and inferred its location as precisely as
                                                                    possible. They were allowed to use search engines and other
Area Division                                                       resources. Then, they deleted the minimum number of morphemes
The region described in Section 3.1, the central region of          from the tweet until they judged its location to be ambiguous.
Kyoto City, Japan, was divided into 200 (= 20×10) areas                In this preliminary experiment, two annotators with
($a_{1,1}, \dots, a_{20,10} \in A_{kyoto}$), as presented in Figure 1. Each   knowledge about Kyoto City independently annotated 500
area is 501 m × 547 m. This division was useful to sepa-            tweets of the test data. The tweet below is an example of
rate two consecutive stations (Hankyu Kawaramachi Sta-              the annotated tweets. Words to be deleted are crossed off.
tion and Hankyu Karasuma Station) into two areas. Both              In this example, the annotators considered that Tweet (1)
areas are located near the Hankyu Kawaramachi Station,              was de-identified by deleting ‘御池 (Oike)’ and ‘マザーズハ
which is well known as the busiest downtown area in Kyoto.          ローワーク (Mother’s Hello Work)’.
Therefore, this manner of division is reasonable.
    Figure 1 presents the geographic distribution of 298,711           (1) 烏丸御池プラザが本チャンやないんか?
tweets for the selected areas. 39,078 tweets (13.1% of all             @マザーズハローワーク烏丸御池
tweets) were posted around the area $a_{15,5}$, where Kyoto Sta-       (Is not the Karasuma Oike Plaza main? @Mother’s Hello
tion (Kyoto’s largest train station) is located. By contrast,          Work Karasuma Oike)
only two tweets were posted in area $a_{17,10}$, which is located
southeast of Miterasennyuji Temple.                                   Then, a threshold determining whether a tweet is de-
                                                                    identified or not is defined using the annotated tweets.
4   CONSTRUCTION OF LOCATION CLASSIFIER                             Given a de-identified tweet, the classifier calculates the
This section describes a method to construct a classifier that      2 http://taku910.github.io/mecab/
estimates a tweet location and which shows the classifier           3 https://github.com/s/preprocessor

probability of location inference when the tweet is assigned        Table 2: Inference of feasibility of location inference
to the 200 areas, respectively. The maximum of the 200 prob-        by the constructed classifier (machine) and human
ability values can be regarded as a reference to the tweet’s
de-identification. Finally the average of the maximum val-                                   Classifier
ues of the probability for all annotated tweets is used as the             Human           inferable   not inferable
threshold. The tweets for which the probability is below the               inferable          216            30
threshold were regarded as being de-identified.                            not inferable      258           496
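
The threshold definition above can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and the toy probability vectors cover four areas instead of 200 and are invented so that the arithmetic is easy to follow.

```python
def threshold_from_annotated(prob_vectors):
    """Average, over the annotated tweets, of the maximum area probability."""
    maxima = [max(probs) for probs in prob_vectors]
    return sum(maxima) / len(maxima)


def is_deidentified(area_probs, threshold):
    """A tweet counts as de-identified when its maximum probability is below the threshold."""
    return max(area_probs) < threshold


# Invented classifier outputs for two manually de-identified tweets (4 areas).
annotated = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
threshold = threshold_from_annotated(annotated)  # (0.5 + 0.25) / 2 = 0.375
```

With the real classifier, each vector would hold 200 probabilities, one per area.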

Results and Discussion                                                                                                               
The threshold value was set to 0.37 from the preliminary
experiment’s result. Therefore, the tweets for which the              (2) 風が強いです (>_<) 今日も明るく元気にお昼の営
probability was less than 0.37 were regarded as being de-             業開始です!
identified.                                                           (The wind is so strong (>_<). I am about to start my
   However, for some tweets, the classifier outputs a high            lunch-hour business brightly and cheerfully as usual!)
probability although the annotators were uncertain about
their location, or vice versa. This discrepancy suggests two           (3) おはようございます (^^) 今日の日中は雨予報で
distinct goals of de-identification. One is to prevent                 すね。 気温も 20 ℃まで行かないようです。 今日も
inference of the location itself. The other is to prevent              明るく元気に!忙しく楽しい一日になるよう頑張り
the assumption that a location can be inferred. In the                 ます p(^_^)q
next section, we examine another classifier to infer the               (Good morning (^^). It is supposed to rain during the
feasibility of location inference.                                     day. The temperature will not reach 20 °C. Let’s be bright
                                                                       and cheerful! I try to be busy and enjoy my day p(^_^)q.)
6   FEASIBILITY OF LOCATION INFERENCE                                                                                                
This section describes a preliminary experiment to con-                These tweets include fixed phrases for advertising stores,
struct a classifier that infers the feasibility of location in-     e.g. the latter part of Tweet (2), ‘I am about to start my lunch-
ference and the actual construction.                                hour business brightly and cheerfully as usual!’ It seems that
                                                                    several tweets with typical phrases by a specific store are in-
Materials and Procedures                                            cluded in the training data. However, humans cannot read
To construct a classifier that infers the feasibility of location   and learn so many tweets. Therefore, they believe that such
inference, a corpus annotated with the feasibility of location      tweets have no feasibility of location inference. The tweets
inference is generated. We first used 1,000 tweets selected         below are examples for which the classifier cannot infer
randomly from the test data in Section 3.1. Then, binary            their location, but humans determine that they have feasi-
classification tasks were conducted according to whether            bility of location inference.
or not the locations can be inferred. To gather a large
number of judgments, the tasks were conducted through                 (4) やっとお昼ご飯。つばめ
crowdsourcing. For each tweet, 100 participants answered              (Finally, lunch time. Tsubame)
whether or not its location can be inferred. Tweets for
which 10% or more of the participants answered that the                (5) 河合塾の向かいのサブウェイなう!
location can be inferred were defined as tweets with feasi-            (I am at the Subway across the street from Kawai-
bility of location inference. The others were treated as those         juku now!)
without feasibility of location inference.                                                                                       
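
The labeling rule above, together with the four-way comparison against the classifier, can be sketched in Python as follows. This is a minimal illustration: the function names are hypothetical, and the vote lists are invented.

```python
def has_feasibility(votes, min_percent=10):
    """True when at least min_percent of participants judged the location inferable.

    Integer arithmetic avoids floating-point issues at the 10% boundary.
    """
    inferable = sum(1 for v in votes if v)
    return inferable * 100 >= min_percent * len(votes)


def cross_tabulate(human_flags, machine_flags):
    """Count tweets for each (human inferable, machine inferable) combination."""
    table = {(h, m): 0 for h in (True, False) for m in (True, False)}
    for h, m in zip(human_flags, machine_flags):
        table[(h, m)] += 1
    return table


# Invented votes from 100 participants for two tweets.
votes_a = [True] * 10 + [False] * 90   # exactly 10% -> feasible
votes_b = [True] * 9 + [False] * 91    # below 10% -> not feasible
labels = [has_feasibility(votes_a), has_feasibility(votes_b)]
```

Applying `cross_tabulate` to the human labels and the classifier's decisions would yield a table of the same shape as Table 2.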
                                                                       Tweet (4) is a case in which the proper noun ‘つばめ
Results and Discussion                                              (Tsubame)’ is also a common noun. Considering such cases,
246 of 1,000 tweets showed the feasibility of location infer-       data tagged with feasibility of location inference are appar-
ence. Considering the two classification methods, whether           ently necessary. Tweet (5) represents a case in which the lo-
the classifier in Section 4 can infer locations of tweets and       cation is inferable by a combination of ‘河合塾 (Kawaijuku)’
whether tweets have feasibility of location inference, or not,      and ‘サブウェイ (Subway)’.
the 1,000 tweets were classified into four classes. The results
are presented in Table 2.                                           Construction of Classifier for Inferring Location
  Some tweets whose location the classifier can infer but           Inference Feasibility
that humans judge to have no feasibility of location inference      A classifier was constructed with 1,000 tweets tagged us-
are presented below.                                                ing feasibility of location inference. Of the 1,000 tweets, 900

                                                                         Return $s_{new}$ if maxprob($s_{new}$) is below the threshold
                                                                         (= 0.37). Otherwise, increment m by 1 and go back to Step
                                                                         1 when m is less than n.
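
The deletion procedure of this section (Steps 0 to 3 followed by the threshold check) can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' code: `deidentify` and `toy_predict_proba` are hypothetical names, the toy classifier covers two areas with invented cue weights instead of the 200-area classifier of Section 4, the demonstration uses a threshold of 0.7 rather than 0.37 because a two-area maximum can never fall below 0.5, and the loop is assumed to stop once m reaches n.

```python
from itertools import combinations


def deidentify(tokens, predict_proba, threshold=0.37):
    """Delete as few morphemes as possible until the maximum area
    probability falls below the threshold; return None on failure."""
    n = len(tokens)
    for m in range(1, n):  # Step 0 and the final increment: m = 1, 2, ...
        # Step 1: every tweet obtained by deleting m of the n morphemes.
        candidates = [
            [t for i, t in enumerate(tokens) if i not in removed]
            for removed in map(set, combinations(range(n), m))
        ]
        # Steps 2 and 3: candidate with the smallest maximum probability.
        s_new = min(candidates, key=lambda s: max(predict_proba(s)))
        if max(predict_proba(s_new)) < threshold:
            return s_new
    return None


def toy_predict_proba(tokens):
    """Hypothetical two-area stand-in for the classifier (invented weights)."""
    cues = {"京都": 0.125, "タワー": 0.25}
    p = 0.5 + sum(cues.get(t, 0.0) for t in tokens)
    return [p, 1.0 - p]


result = deidentify(["5年ぶり", "京都", "タワー"], toy_predict_proba, threshold=0.7)
```

The number of candidates in Step 1 grows combinatorially with m, so this exhaustive search is practical only for short texts such as tweets.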

                                                                   Results and Discussion
                                                                   We present a part of the result of de-identification by the
                                                                   proposed method. The tweets below are samples of the de-
                                                                   identified tweets. Words to be deleted are crossed off.
                                                                     (6) まだまだ 新幹線京都駅
                                                                     (It is still a long way to the Kyoto Shinkansen Station.)

Figure 2: Relation between the number of training samples            (7) 5年ぶり 京都 タワー
and the accuracy.
                                                                     (It has been five years since I came to Kyoto tower.)

were training data. The other 100 were test data. As de-             (8) 清水の舞台から 1 枚 京都の街が一望だね
scribed in Section 4, each tweet was analyzed using MeCab.           (I took a picture from the top of Kiyomizu. It has a full
All uni-grams and bi-grams were used as features of Bag-             view of Kyoto.)
of-Words. Training was performed using logistic regression.
Accuracy obtained using the test data was 86.0%.                     (9) ランチ (at なか卯 河原町五条店) 折田先生なう
                                                                     (I am having lunch at the Nakau Gojo branch in Kawara-
7   DISCUSSION                                                       machi with Orita-sensei now.)
From the result, de-identification of two kinds apparently
exists. For that reason, it is necessary to use the system prop-     (10) 京都御所一般公開中
erly according to the purpose of de-identification. For exam-        (Kyoto Imperial Palace is now open to the public.)
ple, the classifier in Section 4 de-identifies tweets for sales
purposes. Because it is not necessary to de-identify such             (11) 阪急河原町なう
tweets, tweets can be de-identified using both the classifiers        (I am at Hankyu Kawaramachi now.)
in Sections 4 and 6. Here we propose a method to de-identify                                                                     
tweets related to a combination of words.                             ‘新幹線京都駅 (Shinkansen Kyoto Station)’ is a proper
                                                                   noun, but there are many stations in Kyoto City. There-
Method                                                             fore ideal de-identification is achieved by deleting ‘新幹線
Below is the algorithm of de-identification using the classi-      京都 (Shinkansen Kyoto)’. In the case of ‘京都タワー (Kyoto
fier constructed in Section 4.                                     tower)’, an ideal de-identification system will delete ‘タワー
                                                                   (tower)’ because it is the only tower in Kyoto City. The re-
    Step 0: Substitute 1 for m, the number of morphemes to         sult (5) is a successful example. Using the proposed method,
      be deleted.                                                  ideal de-identification can be achieved in that this algorithm
    Step 1: Delete m morpheme(s) from an original tweet            does not delete the whole proper noun.
      $s_{org}$. When the number of morphemes in $s_{org}$ is         With adequate training data, the method would work ide-
      n, the number of possible patterns is ${}_nC_m$. Group the    ally, but a failure example exists as follows.
      ${}_nC_m$ tweets into one group, $S = \{s_1, \dots, s_{{}_nC_m}\}$.
    Step 2: For each tweet $s_i$ in S, find the maximum value          (12) 撮り飽きもせず 撮り足りもせず 京都御苑
      of the probabilities the classifier outputs for its loca-        (I never get tired of and never get enough of taking pic-
      tion ($a_1, \dots, a_{200}$).                                    tures in Kyoto Gyoen.)
       $\mathrm{prob}(s_i, a_j) = p(a_j \mid s_i)$
       $\mathrm{maxprob}(s_i) = \max(\mathrm{prob}(s_i, a_1), \dots, \mathrm{prob}(s_i, a_{200}))$
                                                                   The algorithm should delete ‘御苑 (Gyoen)’, but it actually
                                                                   deletes ‘京都 (Kyoto)’.
    Step 3: Let the tweet with the least maxprob($s_i$) be $s_{new}$,   Furthermore, we investigated the relation between the
      where the following holds.                                   size of the training data and the accuracy. Figure 2 shows
       $s_{new} = \arg\min_{s_i \in S} \mathrm{maxprob}(s_i)$      that more training data are necessary. Some difficulty arises

             (a) Before: Original tweet and its location inference   (b) After: De-identified tweet and its location inference

                                           Figure 3: Images of the system’s user interface

in obtaining a sufficient amount of data because creating            Communications R&D Promotion Programme (SCOPE), the Min-
these data involves manual labeling.                                 istry of Internal Affairs and Communications of Japan.

This work is supported in part by Japan Agency for Medical Re-
search and Development (16768699), Strategic Information and