Utilization of Information Interpolation using Geotagged Tweets
Masaki Endo, Polytechnic University, Tokyo, Japan, e-mail endou@uitec.ac.jp
Masaharu Hirota, Okayama University of Science, Okayama, Japan, e-mail hirota@mis.ous.ac.jp
Hiroshi Ishikawa, Tokyo Metropolitan University, Tokyo, Japan, e-mail ishikawa-hiroshi@tmu.ac.jp
ABSTRACT
Along with the spread of social media, it has become possible to extract events occurring in the real world in real time. A benefit of analysis using data with position information is that it can accurately extract an event from a target area to be analyzed. However, because social media data include few data with location information, the amount of data available for analysis is insufficient for almost all areas, and most events cannot be fully extracted. Therefore, efficient analytical methods must be devised for the accurate extraction of events with position information, even in areas with few data. For this study, we use geotagged tweets along with interpolation to estimate the best time to observe biological seasonal phenomena during sightseeing, such as cherry-blossom and autumn-leaf viewing, in areas and at sightseeing spots. Herein, we explain the analysis results obtained using information interpolation and analysis of cherry blossoms in Japan during 2017.

Author Keywords
trend estimation; phenological observation; Twitter

ACM Classification Keywords
H.5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. UISTDA '18, March 11, Tokyo, Japan.

INTRODUCTION
Because of the wide dissemination and rapid performance improvement of various devices such as smart phones and tablets, diverse and vast data are generated on the web. Particularly, social networking services (SNSs) have become popular because users can post data and various messages easily. Twitter [1], an SNS that provides a micro-blogging service, is used as a real-time communication tool. Numerous tweets are posted daily by vast numbers of users. Twitter is therefore a useful medium to obtain, from a large amount of information posted by many users, real-time information corresponding to the real world.

We specifically consider tourist information provision using real-time information from Twitter. According to a survey reported in the Inbound Landing-type Tourism Guide [2] by the Ministry of Economy, Trade and Industry (METI), tourists want real-time information and locally unique seasonal information posted on websites. Current websites provide similar information in the form of guidebooks. Nevertheless, the information update frequency of that medium is low. Because each local government, tourism association, and travel company independently provides information about travel destination locales, it is difficult for tourists to collect up-to-date (“now”) information about tourist spots. Therefore, providing current, useful, real-world information for travelers by capturing changes of information in accordance with the season and relevant time period of the tourism region is important for the travel industry.

Tourist information about best times must identify a peak period: the best time is neither a period after nor before the flowers fall, but a precisely defined period for viewing blooming flowers. Furthermore, the best times differ among regions and locations. Therefore, for each region and location, it is necessary to estimate the best time for phenological observations. Estimating best-time viewing periods requires the collection of large amounts of information having real-time properties. For this study, we use Twitter data obtained from many users throughout Japan. We use Twitter, a typical microblogging service, and also use geotagged tweets that include position information sent in Japan to ascertain the best time (peak period) for biological season observation by region. The geotagged tweets are useful as social indicators reflecting real-world circumstances. They are a useful resource supporting a real-time regional tourist information system in the tourism field. Therefore, our proposed method might be an effective means of estimating the best time to view events other than biological seasonal observations.

To analyze information of each region from Twitter data, it is necessary to specify a location from tweet information. Geotagged tweets can identify places. Therefore, they are effective for analysis. However, because geotagged tweets account for only a very small proportion of the total information content of tweets, it is not possible to analyze all regions. We propose a method to provide tourists with information about sightseeing spots after processing a small amount of data in real time by interpolation using geotagged tweets.

RELATED WORKS
Along with rising SNS popularity, real-time information has increased. Analysis using real-time data has become possible. Many studies have examined efficient methods for
analyzing large amounts of digital data. Some studies have been conducted to predict real-world phenomena using large amounts of social big data. Phithakkitnukoon et al. [3] analyze details of traveler behavior using data from mobile phone GPS location records such as embarkation places, destinations, and traveling mode on a personal level. Mislove et al. [4] develop a system that infers a Twitter user’s feelings from tweet text and visualizes changes of emotion in space–time. Based on research to detect events such as earthquakes and typhoons, Sakaki et al. [5] propose a method to estimate real-time events from tweets. Cheng et al. [6] estimate Twitter users’ geographical positions at the time of their contributions, without the use of geotags, by devoting attention to the geographical locality of words from text information in articles posted on Twitter. Yamada [7] analyzes Japanese blog data and proposes a method to identify seasonal words using simple autocorrelation analysis. Krumm et al. [8] propose a method to detect events using time series analysis of geotagged tweet volumes from localized areas. Various studies have analyzed spatiotemporal data, but research to estimate viewing periods using interlinkage is a new field.

OUR PROPOSED METHOD
We describe a method for estimating the best time to view organisms by analyzing geotagged tweets that include organism names. Best-time estimation, as defined for this paper, is estimation of the period during which creatures at tourist spots are suitable for sightseeing. Such information can be useful reference information when visiting tourist spots. It supports estimation of the period during which a tourist can enjoy the four seasons by viewing cherry blossoms and autumn leaves. However, geotagged tweets are far fewer than tweets without geotags. For that reason, although it is possible to estimate the best time at the prefecture or municipality level, finer-grained analyses have been impossible. Nevertheless, the best time to visit sightseeing spots can be estimated with finer granularity using the method with interpolation proposed in this paper.

In the following subsections, we describe the collection of geotagged tweets to be analyzed, character preprocessing for conducting analysis, and interpolation using Kriging.

Data collection
This section presents data collection. The collection targets are geotagged tweets posted on Twitter within a range covering the Japanese archipelago (120.0°E – 154.0°E and 20.0°N – 47.0°N). The collection of these data was done using a streaming API [9] provided by Twitter Inc.

Next, we describe the number of collected data. According to a report presented by Hashimoto et al. [10], among all tweets originating in Japan, about 0.18% are geotagged tweets: this is a rare characteristic for text data. However, the geotagged tweets we collected are an average of 500 thousand tweets per day. We use about 250 million geotagged tweets from 2015/2/17 through 2017/5/13. In addition, although the author has a Twitter account, which is necessary to use the API, the author never tweets. Moreover, even for personal account holders, tweets do not include location information in some areas. Therefore, we do not consider excluding the tweets of the author’s own account. Using these data, we calculated the best time for flower viewing, as estimated using the processing described in the following sections.

Preprocessing
This section presents preprocessing. Preprocessing includes reverse geocoding and morphological analysis, as well as database storage for data collected through the processing described in the previous subsection.

From latitude and longitude information in the individually collected tweets, reverse geocoding is used to identify prefectures and municipalities by town name. We use a simple reverse geocoding service [11] available from the National Agriculture and Food Research Organization in this process.

Morphological analysis divides the collected geotagged tweets into morphemes. We use the “MeCab” morphological analyzer [12].

Preprocessing accomplishes necessary data storage for best-time viewing, as estimated based on results of the processing of the data collection, reverse geocoding, and morphological analysis. Data used for this study were the tweet ID, tweet post time, tweet text, morphological analysis result, latitude, and longitude.

Interpolation using Kriging
This section presents the method of interpolation, for which we used Kriging [13], an estimation method used for estimating values for points where information was not acquired. When conducting detailed analysis at each sightseeing spot, the number of geotagged tweets at that spot alone is insufficient for estimation. Therefore, geotagged tweets that have seven significant digits and the same latitude and longitude information are judged to have originated from the same spot. As an example, the tweet's position information of (latitude, longitude) = (34.93162536621094, 135.72979736328125) is truncated to (latitude, longitude) = (34.93162, 135.72979). Then, the tweets from the same point were counted for each date. Furthermore, by dividing the total for each point by the total number of tweets on each day, we calculated the weight of each spot.

We attempted estimation by interpolation using data aggregated for each spot. The estimated value of the target data at a point $S_0$ is shown in formula (1) as a weighted average of the measured values $Z(S_i)$ ($i = 1, 2, \ldots, N$) at $N$ points $S_i$ around point $S_0$. Then we assigned value Z to tweets including the target word and Z. Here, $N$ represents the 30 nearby targeted tweets. The weight $\lambda$ follows a spherical model in which influence decreases as distance increases. As described in this paper, weighting is done only by the
number of tweets existing in the same spot. However, consideration of the weight of the tweet itself, such as using the number of retweets to tweets, is also necessary for future studies.

$$\hat{z}(S_0) = \sum_{i=1}^{N} \lambda_i Z(S_i) \tag{1}$$

$Z(S_i)$: Measurement value at $i$-th position
$\lambda_i$: Unknown weighting of measured value at $i$-th position
$S_0$: Predicted position
$N$: Number of measurements

EXPERIMENTS
In this section, we explain the experiment for information interpolation for cherry blossoms in 2017, using the method described in the previous section. We have been conducting estimation experiments for cherry blossoms, autumn leaves, and other phenomena since 2015. As described herein, we used the period of cherry blossoms in 2017 while studying the interpolation method to improve estimation accuracy. The following subsections describe experimental datasets.

Datasets
Datasets used for this experiment were collected using the streaming API, as described for data collection. The data, which include about 250 million items, are geotagged tweets from Japan during 2015/2/17 – 2017/5/13. The estimation experiment conducted to ascertain the best-time viewing of cherry blossoms uses the target word “cherry blossom,” which is written as “桜”, “さくら”, and “サクラ” in Japanese. We analyzed tweet texts that include the target word. About 100,000 tweets during the experiment period included the subject word.

The subject of the experiment was set as tourist spots in Tokyo. In this report, we describe “Takao Mountain,” “Showa Memorial Park,” “Shinjuku Gyoen,” and “Rikugien.” Figure 1 presents the target area locations. A, B, C, and D in the figure respectively denote “Takao Mountain,” “Showa Memorial Park,” “Rikugien,” and “Shinjuku Gyoen.” In this experiment, about 30,000 tweets including the target word in Tokyo were found. In this experiment, all tweets made by the same user are also used as targets for analysis if they are tweets including the target word.

Figure 1. Location of the target area. (Map marking spots A, B, C, and D; distance labels in the figure read 32 km, 16 km, and 6 km.)

We conducted experiments of the following two kinds using these datasets. The first is an experiment using the number of tweets including the target word and the sightseeing spot name without interpolation. This experiment serves as the baseline for confirming the usefulness of the interpolation proposed in this paper. The second is an experiment using interpolation. In this experiment, we used Kriging as described in the earlier section.

We present the example of Mt. Takao in Hachioji city. Figure 2 shows cherry-blossom-related tweets in Hachioji city from 2017/1/1 to 2017/5/13. The point denoted by A is Mt. Takao. Few geotagged tweets are related to cherry blossoms near Mt. Takao. For that reason, one cannot estimate the best time merely using tweets from A. Therefore, for this study, we interpolated the amount of information by Kriging using information from Hachioji city, where Mt. Takao is located. Interpolation is done on a daily basis. The experiment results are presented in the next subsection.

Figure 2. Positions of targets.

Experimental results
This section presents experimentally obtained results for estimating the best time. Figure 3 presents results for the estimated best-time viewing in 2017 using the target word ‘cherry blossoms’ in the target tourist spots. The black part of the figure represents the number of tweets containing a target word and sightseeing spot name. The light gray part represents best-time viewing as determined using the proposed method of interpolation.

At tourist spots targeted for the experiment in 2017, as portrayed in the black part of Figure 3, many data were obtained for B, C, and D. The maximum number of tweets per day was about 20. These results confirmed that some estimation can be accomplished without interpolation.
                                                Figure 3. Experimental results.
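For reference, the per-spot aggregation and the weighted estimate of formula (1) that underlie the interpolated series can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the use of ordinary Kriging, the haversine distance, the 20 km range, the unit sill, and the zero nugget are all hypothetical choices. The paper states only that coordinates are truncated, that each spot's daily count is divided by the daily total, and that Kriging with a spherical model over the 30 nearest points is used.

```python
import math
from collections import Counter, defaultdict

TARGET_WORDS = ("桜", "さくら", "サクラ")  # target words from the Datasets subsection


def truncate(value, places=5):
    """Truncate (not round) a coordinate, as in the (34.93162, 135.72979) example."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor


def daily_spot_weights(tweets):
    """Map each date to {spot: share of that day's target-word tweets}.

    `tweets` is an iterable of (date, latitude, longitude, text) tuples.
    Assumption: the daily total in the denominator counts target-word tweets only.
    """
    counts = defaultdict(Counter)
    for date, lat, lon, text in tweets:
        if any(word in text for word in TARGET_WORDS):
            counts[date][(truncate(lat), truncate(lon))] += 1
    return {
        date: {spot: n / sum(per_spot.values()) for spot, n in per_spot.items()}
        for date, per_spot in counts.items()
    }


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def spherical_variogram(h, sill=1.0, range_km=20.0):
    """Spherical model: rises with distance h and is flat beyond range_km (assumed values)."""
    if h >= range_km:
        return sill
    r = h / range_km
    return sill * (1.5 * r - 0.5 * r ** 3)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def krige(samples, target, n_neighbors=30):
    """Ordinary-Kriging estimate at `target` from {spot: value} samples.

    Returns (estimate, weights); the estimate is sum(lambda_i * Z(S_i)) as in formula (1).
    Spots must be distinct, which the per-spot aggregation above guarantees.
    """
    nearest = sorted(samples, key=lambda s: haversine_km(s, target))[:n_neighbors]
    n = len(nearest)
    # Kriging system: variogram between samples, plus a row and column forcing sum(lambda) = 1.
    a = [[spherical_variogram(haversine_km(p, q)) for q in nearest] + [1.0] for p in nearest]
    a.append([1.0] * n + [0.0])
    b = [spherical_variogram(haversine_km(p, target)) for p in nearest] + [1.0]
    weights = solve(a, b)[:n]
    return sum(w * samples[s] for w, s in zip(weights, nearest)), weights
```

Under these assumptions, calling `krige` on one day's output of `daily_spot_weights` with the coordinates of a spot that has few or no tweets of its own yields an interpolated weight for that spot on that day; repeating the call for every date gives a daily series comparable to the light gray part of Figure 3.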
The light gray part of Figure 3 portrays the experimentally obtained interpolation results for the tourist spots. For A, an estimate could apparently be produced using the proposed method, because interpolation with surrounding tweets increased the number of tweets. For B and C, useful information was contained in tweets that did not co-occur with the tourist spot name. Therefore, we confirmed interpolation related to other kinds of cherry blossoms in early March for B and C and in late April for C. In addition, for D, there are days when the best time can be determined more accurately by interpolating the number of tweets.

Therefore, these results confirmed the possibility of estimating the peak period, even for an area without tweets, using data interpolation and overall tweet number interpolation.

CONCLUSION
As described in this paper, we proposed an interpolation method to improve the accuracy of tourism information related to phenological observations. The results of the cherry blossom experiment conducted at the tourist spots in Tokyo in 2017 confirmed the trend of improved estimation accuracy using information interpolation. We confirmed the possibility of applying this proposed method to the estimation of viewpoints and sightseeing spots with few tweets. However, in regions with no geotagged tweets, another method must be considered. Research can be conducted in the future to verify whether similar results are obtained for other biological seasonal observations.

ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI Grant Nos. 16K00157 and 16K16158, and by a Tokyo Metropolitan University Grant-in-Aid for Research on Priority Areas “Research on Social Big Data.”

REFERENCES
1. Twitter. It's what's happening. 2017. Retrieved September 2, 2017 from https://Twitter.com/
2. Ministry of Economy, Trade and Industry. Inbound Landing-Type Tourism Guide. 2017. Retrieved November 10, 2017 from http://www.mlit.go.jp/common/001091713.pdf (in Japanese).
3. S. Phithakkitnukoon, T. Teerayut Horanont, A. Witayangkurn, R. Siri, Y. Sekimoto, and R. Shibasaki. 2015. Understanding tourist behavior using large-scale mobile sensing approach: A case study of mobile phone users in Japan. Pervasive and Mobile Computing Volume 18: 18–39.
4. A. Mislove, S. Lehmann, Y.Y. Ahn, J-P. Onnela, and J. Niels Rosenquist. Understanding the Demographics of Twitter Users. 2011. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), 554–557.
5. T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. 2010. WWW 2010, 851–860.
6. T. Kaneko and K. Yanai. Visual Event Mining from the Twitter Stream. 2016. WWW '16 Companion, In Proceedings of the 25th International Conference Companion on World Wide Web, 51–52.
7. K. Yamada. Detecting two types of seasonal words using simple autocorrelation analysis. 2017. IEEE Big Data 2017 Workshops. In Proceedings of the Second International Workshop on Application of Big Data for Computational Social Science.
8. J. Krumm and E. Horvitz. Eyewitness: identifying local events via space-time signals in Twitter feeds. 2015. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL ’15). ACM, New York, NY, USA, Article 20, 10 pages. DOI: https://doi.org/10.1145/2820783.2820801
9. Twitter Developers. Twitter Developer official site. 2017. Retrieved April 2, 2017 from https://dev.twitter.com/
10. Y. Hashimoto and M. Oka. Statistics of Geo-Tagged Tweets in Urban Areas (<Special Issue>Synthesis and Analysis of Massive Data Flow). 2012. JSAI vol. 27, 4: 424–431 (in Japanese).
11. National Agriculture and Food Research Organization. Simple reverse geocoding service. 2017. Retrieved November 18, 2017 from https://www.finds.jp/rgeocode/index.html.ja
12. MeCab. Yet Another Part-of-Speech and Morphological Analyzer. 2017. Retrieved November 10, 2017 from http://taku910.github.io/mecab/
13. M. A. Oliver. Kriging: A Method of Interpolation for Geographical Information Systems. 1990. International Journal of Geographic Information Systems vol. 4: 313–332.