Utilization of Information Interpolation using Geotagged Tweets

Masaki Endo
Polytechnic University
Tokyo, Japan
e-mail endou@uitec.ac.jp

Masaharu Hirota
Okayama University of Science
Okayama, Japan
e-mail hirota@mis.ous.ac.jp

Hiroshi Ishikawa
Tokyo Metropolitan University
Tokyo, Japan
e-mail ishikawa-hiroshi@tmu.ac.jp

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. UISTDA '18, March 11, Tokyo, Japan.

ABSTRACT
Along with the spread of social media, it has become possible to extract events occurring in the real world in real time. A benefit of analysis using data with position information is that it can accurately extract an event from a target area to be analyzed. However, because social media data include few data with location information, the amount to analyze is insufficient for almost all areas: we cannot fully extract most events. Therefore, efficient analytical methods must be devised for the accurate extraction of events with position information, even in areas with few data. For this study, we use geotagged tweets along with interpolation to estimate the best time to observe biological seasons during sightseeing, such as cherry-blossom and autumn leaf viewing, in areas and at sightseeing spots. Herein, we explain the analysis results obtained using information interpolation and analysis of cherry blossoms in Japan during 2017.

Author Keywords
trend estimation; phenological observation; Twitter

ACM Classification Keywords
H.5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous

INTRODUCTION
Because of the wide dissemination and rapid performance improvement of various devices such as smart phones and tablets, diverse and vast data are generated on the web. Particularly, social networking services (SNSs) have become popular because users can post data and various messages easily. Twitter [1], an SNS that provides a micro-blogging service, is used as a real-time communication tool. Numerous tweets are posted daily by vast numbers of users. Twitter is therefore a useful medium to obtain, from a large amount of information posted by many users, real-time information corresponding to the real world.

We specifically consider tourist information provision using real-time information from Twitter. According to a survey reported in the Inbound Landing-type Tourism Guide [2] by the Ministry of Economy, Trade and Industry (METI), tourists want real-time information and local unique seasonal information posted on websites. Current websites provide similar information in the form of guidebooks. Nevertheless, the information update frequency of that medium is low. Because each local government, tourism association, and travel company independently provides information about travel destination locales, it is difficult for tourists to collect information about the current state of tourist spots. Therefore, providing current, useful, real-world information for travelers by capturing changes of information in accordance with the season and relevant time period of the tourism region is important for the travel industry.

Tourist information for best times requires a peak period, which means that the best time is neither before nor after the flowers fall, but a precisely defined period for viewing flowers in bloom. Furthermore, the best times differ among regions and locations. Therefore, for each region and location, it is necessary to estimate the best time for phenological observations. Estimating best-time viewing periods requires the collection of large amounts of information having real-time properties. For this study, we use Twitter data obtained from many users throughout Japan. We use Twitter, a typical microblogging service, and specifically geotagged tweets that include position information sent in Japan, to ascertain the best time (peak period) for biological season observation by region. The geotagged tweets are useful as social indicators reflecting real-world circumstances. They are a useful resource supporting a real-time regional tourist information system in the tourism field. Therefore, our proposed method might also be an effective means of estimating the best time to view events other than biological seasonal observations.

To analyze information of each region from Twitter data, it is necessary to specify a location from tweet information. Geotagged tweets can identify places. Therefore, they are effective for analysis. However, because geotagged tweets account for only a very small proportion of all tweets, it is not possible to analyze all regions. We propose a method to provide tourists with information about sightseeing spots by processing a small amount of data in real time, using interpolation of geotagged tweets.

RELATED WORKS
Along with rising SNS popularity, real-time information has increased, and analysis using real-time data has become possible. Many studies have examined efficient methods for analyzing large amounts of digital data. Some studies have been conducted to predict real-world phenomena using large amounts of social big data. Phithakkitnukoon et al. [3] analyze details of traveler behavior using data from mobile phone GPS location records such as embarkation places, destinations, and traveling mode on a personal level. Mislove et al. [4] develop a system that infers a Twitter user’s feelings from tweet text and which visualizes changes of emotion in space–time. Based on research to detect events such as earthquakes and typhoons, Sakaki et al. [5] propose a method to estimate real-time events from Twitter tweets. Cheng et al. [6] estimate Twitter users’ geographical positions at the time of their contributions, without the use of geotags, by devoting attention to the geographical locality of words from text information in articles posted on Twitter. Yamada [7] analyzes Japanese blog data and proposes a method to identify seasonal words using simple autocorrelation analysis. Krumm et al. [8] propose a method to detect events using time series analysis of geotagged tweet volumes from localized areas. Various studies have analyzed spatiotemporal data, but research to estimate viewing periods using interpolation is a new field.

OUR PROPOSED METHOD
We describe a method for estimating the best time to view organisms by analyzing geotagged tweets that include organism names. Best-time estimation, as defined for this paper, is estimation of the period during which organisms at tourist spots are worth viewing for sightseeing. Such information can be useful reference information when visiting tourist spots. It supports estimation of the period during which a tourist can enjoy the four seasons by viewing cherry blossoms and autumn leaves. However, geotagged tweets are far fewer than tweets without geotags. For that reason, although it is possible to estimate the best time at the level of a prefecture or municipality, finer-grained analyses have been impossible. Nevertheless, the best time to visit sightseeing spots can be estimated with finer granularity using the method with interpolation proposed in this paper.

In the following subsections, we describe the collection of geotagged tweets to be analyzed, preprocessing of the tweet text for analysis, and interpolation using Kriging.

Data collection
This section presents data collection. Geotagged tweets sent from Twitter are the collection target. The collection range covers the Japanese archipelago (120.0°E – 154.0°E, and 20.0°N – 47.0°N). These data were collected using a streaming API [9] provided by Twitter Inc.

Next, we describe the number of collected data. According to a report presented by Hashimoto et al. [10], among all tweets originating in Japan, about 0.18% are geotagged tweets: this is a rare characteristic for text data. However, the geotagged tweets we collected average 500 thousand tweets per day. We use about 250 million geotagged tweets from 2015/2/17 through 2017/5/13. In addition, although the author has a Twitter account, which is necessary to use the API, the author never tweets. Moreover, even for personal account holders, tweets do not include location information in some areas. Therefore we do not consider excluding the tweets of the author’s own account. Using these data, we calculated the best time for flower viewing, as estimated using the processing described in the following sections.

Preprocessing
This section presents preprocessing. Preprocessing includes reverse geocoding and morphological analysis, as well as database storage for data collected through the processing described in the previous subsection.

From latitude and longitude information in the individually collected tweets, reverse geocoding identifies prefectures and municipalities down to the town name. We use a simple reverse geocoding service [11] available from the National Agriculture and Food Research Organization in this process.

Morphological analysis divides the text of the collected geotagged tweets into morphemes. We use the “MeCab” morphological analyzer [12].

Preprocessing stores the data necessary for best-time estimation, based on the results of data collection, reverse geocoding, and morphological analysis. Data used for this study were the tweet ID, tweet post time, tweet text, morphological analysis result, latitude, and longitude.

Interpolation using Kriging
This section presents the method of interpolation, for which we used Kriging [13], a method for estimating values at points where information was not acquired. When conducting detailed analysis at each sightseeing spot, it is impossible to estimate from the number of geotagged tweets of that spot alone. Therefore, geotagged tweets that have seven significant digits and the same latitude and longitude information are judged to have originated from the same spot. As an example, the tweet's position information of (latitude, longitude) = (34.93162536621094, 135.72979736328125) is truncated to (latitude, longitude) = (34.93162, 135.72979). Then, the tweets from the same point were counted for each date. Furthermore, by dividing the total for each point by the total number of tweets on each day, we calculated the weight of each spot.

We attempted estimation by interpolation using data aggregated for each spot. The estimated value of the target data at a point $S_0$ is shown in formula (1) as a weighted average of the measured values $Z(S_i)$ ($i = 1, 2, \ldots, N$) at $N$ points $S_i$ around point $S_0$. The value $Z$ is assigned to tweets that include the target word. Here, $N$ is set to 30: the 30 nearby targeted tweets are used. The weights $\lambda_i$ follow a spherical model, in which influence decreases as distance increases. As described in this paper, weighting is done only by the number of tweets existing in the same spot. However, consideration of the weight of the tweet itself, such as using the number of retweets of a tweet, is also necessary for future studies.

$$\hat{z}(S_0) = \sum_{i=1}^{N} \lambda_i Z(S_i) \tag{1}$$

$Z(S_i)$: Measurement value at the $i$-th position
$\lambda_i$: Unknown weighting of the measured value at the $i$-th position
$S_0$: Predicted position
$N$: Number of measurements

EXPERIMENTS
In this section, we explain the experiment for information interpolation for cherry blossoms in 2017, using the method described in the previous section. We have been conducting estimation experiments for cherry blossoms, autumn leaves, and other phenomena since 2015. As described herein, we used the period of cherry blossoms in 2017 while studying the interpolation method to improve estimation accuracy. The following subsections describe the experimental datasets and results.

Datasets
Datasets used for this experiment were collected using the streaming API, as described for data collection. The data, which include about 250 million items, are geotagged tweets from Japan during 2015/2/17 – 2017/5/13. The estimation experiment conducted to ascertain the best-time viewing of cherry blossoms uses the target word “cherry blossom,” which is “桜”, “さくら”, and “サクラ” in Japanese. We analyzed tweet texts that include the target word. About 100,000 tweets during the experiment period included the target word.

The subject of the experiment was set as tourist spots in Tokyo. In this report, we describe “Takao Mountain,” “Showa Memorial Park,” “Shinjuku Gyoen,” and “Rikugien.” Figure 1 presents the target area locations. A, B, C, and D in the figure respectively denote “Takao Mountain,” “Showa Memorial Park,” “Rikugien,” and “Shinjuku Gyoen.” In this experiment, about 30,000 tweets including the target word in Tokyo were found. All tweets made by the same user are also used as targets for analysis if they include the target word.

Figure 1. Location of the target area. (Map labels: spots A, B, C, and D; distances 32 km, 6 km, and 16 km.)

We conducted experiments of the following two kinds using these datasets. The first is an experiment using the number of tweets including the target word and the sightseeing spot name without interpolation. This experiment serves as the Baseline to confirm the usefulness of the interpolation proposed in this paper. The second is an experiment using interpolation. In this experiment, we used Kriging as described in the earlier section. We present the example of Mt. Takao in Hachioji city. Figure 2 shows cherry-blossom-related tweets in Hachioji city from 2017/1/1 to 2017/5/13. The point denoted by A is Mt. Takao. Few geotagged tweets are related to cherry blossoms near Mt. Takao. For that reason, one cannot estimate the best time merely using tweets from A. Therefore, for this study, we interpolated the amount of information by Kriging using information of Hachioji city, which includes Mt. Takao. Interpolation is done on a daily basis. The experiment results are presented in the next subsection.

Figure 2. Positions of targets.

Experimental results
This section presents experimentally obtained results for estimating the best time. Figure 3 presents results for the estimated best-time viewing in 2017 using the target word “cherry blossoms” in the target tourist spots. The black part of the figure represents the number of tweets containing a target word and sightseeing spot name. The light gray part represents best-time viewing as determined using the proposed method of interpolation.

Figure 3. Experimental results.

At tourist spots targeted for the experiment in 2017, as portrayed in the black part of Figure 3, many data were obtained for B, C, and D. The maximum number of tweets per day was about 20. These results confirmed that some estimation can be accomplished without interpolation.

The light gray part of Figure 3 portrays the result obtained by interpolation, which includes the tourist spots. For A, an estimate could be produced using the proposed method, because interpolation with surrounding tweets increased the number of tweets. For B and C, useful information was included in tweets that did not co-occur with the tourist spot name. Therefore, we confirmed interpolation related to other kinds of cherry blossoms in early March for B and C and in late April for C. In addition, for D, there are days when the best time can be determined more accurately by interpolating the number of tweets.

Therefore, these results confirmed the possibility of estimating the peak period, even for an area without tweets, using data interpolation and overall tweet number interpolation.

CONCLUSION
As described in this paper, we proposed an interpolation method to improve the accuracy of tourism information related to phenological observations. The results of the cherry blossom experiment conducted at the tourist spots in Tokyo in 2017 confirmed the trend of improved estimation accuracy using information interpolation. We confirmed the possibility of applying this proposed method to the estimation of viewpoints and sightseeing spots with few tweets. However, in regions with no geotagged tweets, another method must be considered. Research can be conducted in the future to verify whether similar results are obtained for other biological seasonal observations.

ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI Grant Nos. 16K00157 and 16K16158, and by a Tokyo Metropolitan University Grant-in-Aid for Research on Priority Areas “Research on Social Big Data.”

REFERENCES
1. Twitter. It's what's happening. 2017. Retrieved September 2, 2017 from https://Twitter.com/
2. Ministry of Economy, Trade and Industry. Inbound Landing-Type Tourism Guide. 2017. Retrieved November 10, 2017 from http://www.mlit.go.jp/common/001091713.pdf (in Japanese).
3. S. Phithakkitnukoon, T. Horanont, A. Witayangkurn, R. Siri, Y. Sekimoto, and R. Shibasaki. 2015. Understanding tourist behavior using large-scale mobile sensing approach: A case study of mobile phone users in Japan. Pervasive and Mobile Computing Volume 18: 18–39.
4. A. Mislove, S. Lehmann, Y.Y. Ahn, J-P. Onnela, and J. Niels Rosenquist. Understanding the Demographics of Twitter Users. 2011. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011), 554–557.
5. T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. 2010. WWW 2010, 851–860.
6. T. Kaneko and K. Yanai. Visual Event Mining from the Twitter Stream. 2016. WWW '16 Companion, In Proceedings of the 25th International Conference Companion on World Wide Web, 51–52.
7. K. Yamada. Detecting two types of seasonal words using simple autocorrelation analysis. 2017. IEEE Big Data 2017 Workshops. In Proceedings of the Second International Workshop on Application of Big Data for Computational Social Science.
8. J. Krumm and E. Horvitz. Eyewitness: identifying local events via space-time signals in Twitter feeds. 2015. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL ’15). ACM, New York, NY, USA, Article 20, 10 pages. DOI: https://doi.org/10.1145/2820783.2820801
9. Twitter Developers. Twitter Developer official site. 2017. Retrieved April 2, 2017 from https://dev.twitter.com/
10. Y. Hashimoto and M. Oka. Statistics of Geo-Tagged Tweets in Urban Areas (Synthesis and Analysis of Massive Data Flow). 2012. JSAI vol. 27, 4: 424–431 (in Japanese).
11. National Agriculture and Food Research Organization. Simple reverse geocoding service. 2017. Retrieved November 18, 2017 from https://www.finds.jp/rgeocode/index.html.ja
12. MeCab. Yet Another Part-of-Speech and Morphological Analyzer. 2017. Retrieved November 10, 2017 from http://taku910.github.io/mecab/
13. M. A. Oliver. Kriging: A Method of Interpolation for Geographical Information Systems. 1990. International Journal of Geographic Information Systems vol. 4: 313–332.
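The spot aggregation described under "Interpolation using Kriging" (truncate coordinates, count tweets per spot per date, divide by the daily total) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `truncate_coord` and `daily_spot_weights` are hypothetical, truncation to five decimal places is inferred from the worked example (34.93162, 135.72979), and the "total number of tweets on each day" is assumed here to mean the total over the tweets passed in.

```python
from collections import Counter, defaultdict
from datetime import date
from math import trunc


def truncate_coord(value, decimals=5):
    """Cut (not round) a coordinate to `decimals` decimal places.

    Five decimals is an assumption taken from the paper's worked example.
    """
    factor = 10 ** decimals
    return trunc(value * factor) / factor


def daily_spot_weights(tweets, decimals=5):
    """Return {day: {(lat, lon): weight}} for (day, lat, lon) records.

    Tweets whose truncated coordinates coincide are treated as one spot.
    A spot's weight is its tweet count on a day divided by that day's total
    count (assumed: total over the supplied tweets).
    """
    counts = defaultdict(Counter)
    for day, lat, lon in tweets:
        spot = (truncate_coord(lat, decimals), truncate_coord(lon, decimals))
        counts[day][spot] += 1

    weights = {}
    for day, per_spot in counts.items():
        total = sum(per_spot.values())
        weights[day] = {spot: n / total for spot, n in per_spot.items()}
    return weights


# Illustrative records only; not data from the study.
d1, d2 = date(2017, 4, 1), date(2017, 4, 2)
sample = [
    (d1, 34.93162536621094, 135.72979736328125),
    (d1, 34.931629, 135.729791),  # same spot after truncation
    (d1, 35.0, 135.0),
    (d2, 35.0, 135.0),
]
weights = daily_spot_weights(sample)
```

Truncation is used instead of rounding so that the sketch reproduces the paper's example pair exactly; the resulting per-day weights are the kind of per-spot values that the interpolation step would take as input.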
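Formula (1) estimates the value at $S_0$ as a weighted sum of nearby measurements, with weights derived from a spherical model. The paper does not state which Kriging variant or which variogram parameters were used, so the sketch below assumes ordinary Kriging (weights constrained to sum to 1), planar coordinates (for example, kilometres after projection), and made-up values for `nugget`, `psill`, and `rng`. All function names are hypothetical, and only the Python standard library is used.

```python
import math


def spherical_variogram(h, nugget=0.0, psill=1.0, rng=10.0):
    """Spherical semivariogram; parameter values are illustrative assumptions."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return nugget + psill
    r = h / rng
    return nugget + psill * (1.5 * r - 0.5 * r ** 3)


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (duplicate points?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ordinary_kriging(points, values, target, n_neighbors=30,
                     variogram=spherical_variogram):
    """Return (estimate, weights) at `target` from the nearest measurements.

    Solves sum_j lambda_j * gamma(S_i, S_j) + mu = gamma(S_i, S_0) together
    with sum_j lambda_j = 1, then applies formula (1).
    """
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], target))[:n_neighbors]
    pts = [points[i] for i in order]
    vals = [values[i] for i in order]
    n = len(pts)
    a = [[variogram(math.dist(pts[i], pts[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in pts] + [1.0]
    lam = solve_linear(a, b)[:n]  # last unknown is the Lagrange multiplier
    return sum(w * z for w, z in zip(lam, vals)), lam


# Illustrative values only; not data from the study.
est_mid, w_mid = ordinary_kriging([(-1.0, 0.0), (1.0, 0.0)], [2.0, 4.0], (0.0, 0.0))
est_at, w_at = ordinary_kriging([(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)],
                                [2.0, 5.0, 1.0], (0.0, 0.0))
```

With a zero nugget the estimator reproduces a measured value when the target coincides with a measurement point, and two equidistant measurements receive equal weights. The default `n_neighbors=30` mirrors the paper's N = 30; everything else is a stated assumption.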