Comparative Analysis of GDELT Data Using the News Site Contrast System

Masaharu Yoshioka
Hokkaido University
N14 W9, Kita-ku, Sapporo-shi, Hokkaido, 060-0814, Japan
yoshioka@ist.hokudai.ac.jp

Noriko Kando
National Institute of Informatics
2-1-2, Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
kando@nii.ac.jp

Copyright © by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos, and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

Abstract

The News Site Contrast (NSContrast) system analyzes news articles retrieved from multiple news sites based on the concept of contrast set mining. It can extract terms that characterize different topics of interest across news sites, countries, and regions. In this study, we used NSContrast to analyze Global Database of Events, Language, and Tone (GDELT) data by comparing news articles from different regions (e.g., USA, Asia, and the Middle East). We also present examples of analyses performed using this system.

1 Introduction

It has become possible to access a wide variety of news sites from across the world via the Internet. Each news site has its own culture and interpretation of events, so we can obtain a greater diversity of information than ever before by using multiple news sites. Opinions and interests expressed in news articles vary across countries, and we can obtain different points of view regarding a topic if we access news sites from different countries. For example, Asian, European, and American news sites share some common views on diplomatic issues related to North Korea, as well as having their own characteristic opinions. Therefore, it is important to clarify the characteristics of each specific news site when analyzing events reported by multiple sites.

The News Site Contrast (NSContrast) system was developed to analyze the characteristics of news sites [YK12]. However, since it is not easy to construct news databases from different countries, NSContrast only uses small numbers of news sites from East Asian countries (Japan, China, Korea) and the USA to characterize the differences between them.

Recently, the Global Database of Events, Language, and Tone (GDELT) [LS13]¹ was released. This database is based on larger numbers of news sites from all over the world and it contains metadata extracted from news articles. In this paper, we propose a method to utilize GDELT to analyze the characteristics of news articles from different countries and regions by adding country and region information for the news sites in the database. By using these data, we can compare news articles from various countries and regions worldwide (e.g., USA, Asia, South America, and Africa) instead of relying on our original small database of news articles. We also present examples of analyses performed using the NSContrast system.

¹ http://www.gdeltproject.org/

2 NSContrast

2.1 System description

NSContrast employs the following four methods to analyze news articles.

• Burst analysis [Kle02] identifies the daily burst terms and the regional distribution of a specific bursty term (Figure 1).
• A term collocation analysis graph shows relationships among collocated terms and the given query. NSContrast uses highly collocated terms from all regions based on contrast set mining and ordinal collocation analysis. These collocation terms are visualized with a spring model using fdp in Graphviz.²
• A news article retrieval system is used to understand the meanings of the terms in the collocation analysis and the burst analysis.
• A multifaceted interface for analyzing news articles. The system uses multiple facets (e.g., keyword, named entity, polarity, news site, and country) to analyze news articles. The interface supports the construction of structured queries that use one or more facets, where the facet information can be represented using various styles (e.g., time sequence graph, table, or bar chart) (Figure 2).

² http://www.graphviz.org/

2.2 Data conversion

To apply NSContrast to the analysis of GDELT data, it was necessary to convert the GDELT data into news article data. There are two databases in GDELT: GDELT Event and GDELT Global Knowledge Graph (GKG). GDELT GKG is a database based on a raw output format of the original news articles for constructing the GDELT Event database. Because the GDELT Event database does not have detailed information on the original news article sources, GDELT GKG was used for NSContrast.

GDELT GKG was constructed by extracting the following metadata from the original news articles: DATE, THEMES, LOCATIONS, PERSONS, ORGANIZATIONS, TONE (as a real value; 0 means neutral), CAMEOEVENTIDS (references to the GDELT Event database), SOURCES, and SOURCEURLS. When two or more articles share all name sets (THEMES, LOCATIONS, PERSONS, and ORGANIZATIONS), those news articles are aggregated as one datum, and SOURCES and SOURCEURLS have multiple entries. An example of the SOURCES and SOURCEURLS information for one datum from January 19, 2016 is shown below.

SOURCES
  punchng.com; punchng.com; onlinenigeria.com; onlinenigeria.com
SOURCEURLS
  http://www.punchng.com/25909-2/,
  http://www.punchng.com/i-am-resolved-to-better-lagos-ambode/,
  http://news2.onlinenigeria.com/news/general/453949-i-am-resolved-to-better-lagos-%E2%80%93w-ambode.html,
  http://news2.onlinenigeria.com/news/general/453949-i-am-resolved-to-better-lagos-ambode.html

Two types of multiple SOURCEURLS are shown above. In the first type, almost the same content has different URLs on the same news site (the first two URLs and the last two URLs above). In the second type, different URLs belong to different news sites (the first and third URLs).

Most cases of the first type are simply URL variations of the same content; e.g., the first URL is redirected to the second URL, and the third URL is a variation of the fourth URL (“%E2%80%93” is the URL-encoded UTF-8 form of the dash “–”). It is better to select one of them for deduplication. Cases of the second type are meaningful for representing the importance of the content, because different news sites have selected the same content for their sites.

By using these metadata, the following information was constructed for NSContrast.

Date: Date of the article.

Person, Organization, Location: Lists of people, organizations, and locations extracted from the article using the GDELT GKG.

Polarity: We classified articles into three types (positive, negative, and neutral) to simplify the analysis of the polarity information. The tone extracted by the GDELT GKG was used for classification (tone > 1: positive; tone < −1: negative; other: neutral).

Site: Site information extracted from GDELT GKG. To count the number of articles from different news sites, we duplicate one datum for each site. However, if the same news site appears in two or more entries, only one of these entries is kept (deduplication). In the above example, one datum is duplicated for “punchng.com” and “onlinenigeria.com.”

URL: URL for the original news article. When there is one URL for a site, that URL is used for the site. However, when there are two or more URLs for a given news site, the shortest URL is selected for that site (e.g., http://www.punchng.com/25909-2/ for punchng.com).

SiteCountry: We constructed a database of news sites to identify their countries of origin. We used http://www.world-newspapers.com/ to extract these relationships. For “BBC monitoring,” we used “United Kingdom” as the site country. In addition, if a news site used a country code top-level domain (e.g., .jp for Japan), we used this domain information to estimate the site country. Finally, we used a geolocation service³ to estimate the site country from the IP address of the top domain. However, the country was left blank if we could not obtain appropriate location information from the geolocation service.

SiteRegion: Countries were grouped into the following eight regions: USA, Asia, Europe, Middle East, Africa, Oceania, North America (excluding USA), and South America. News articles that lacked site country information were categorized as Unclassified.

We could use all of these information types other than the URL to perform multifaceted analyses.

³ https://freegeoip.net/, http://ip-api.com/, and http://ipinfo.io

3 NSContrast with GDELT

We set up our system based on the GDELT GKG from July 20, 2015 to January 19, 2016. Using the data conversion process described above, we extracted 31,584,327 articles from 70,781 news sites.

[Figure 1: Burst analysis results on 2015-9-27. Annotations in the figure: terms covered with dark red indicate a burst that is a topic of some region; the common burst organizations are “Volkswagen” and “United Nations”.]

First, we present information related to the country and region estimation. Because our manually constructed news site list is small, only 2,201 sites (8,555,263 articles) were identified by using this information. Table 1 shows the number of articles (sites) by the top-level domain of URLs (Top 6). Because 81.2% (47,259/70,781) of news sites and 71.9% (22,716,591/31,584,327) of articles have .com as their top-level domain, only 10,139 sites (5,671,259 articles) were identified by their top-level domain.

Table 1: Number of articles (sites) for top-level domains (Top 6)

  Top-level domain   Articles     Sites
  .com               22,716,591   47,259
  .au                 2,623,813    1,048
  .uk                 1,682,960    2,705
  .org                1,029,232    8,049
  .net                  645,996    3,015
  .ca                   326,184    1,008

Finally, by using the geolocation service, 57,459 sites (16,816,980 articles) were identified. As a result, most of the sites (98.6%: 69,799/70,781) and articles (98.3%: 31,043,442/31,584,327) were classified into countries and regions.

Table 2 shows the number of articles for each region. As the table shows, news articles from the USA were dominant in the database (61.6%: 19,443,005/31,581,063). In contrast, there were only 903,811 articles from North America excluding the USA. With such unbalanced numbers of articles, a category for North America including the USA would be almost equivalent to the USA alone. Therefore, we divided North America into the USA and North America (excluding USA).

Table 2: Number of articles for each region

  Region                          Articles
  USA                             19,443,005
  Europe                           3,696,359
  Oceania                          2,962,792
  Asia                             2,891,865
  North America (excluding USA)      903,811
  Africa                             726,626
  Middle East                        373,411
  South America                       45,573
  Unclassified                       537,621

Our multifaceted analysis interface was used to compare the results with different query conditions. Figure 2 shows a time-sequence graph of polarity in different countries: all countries (upper left), China (upper right), the USA (lower left), and Europe (lower right). These graphs were constructed by adding new query conditions when selecting the data. For example, the graph for China uses news articles that included “Asian Infrastructure Investment Bank” (AIIB) as the organization, an article date ≥ July 20, 2015, and SiteCountry = “China.”

[Figure 2: Multifaceted analysis for the query “AIIB”. Query condition for all graphs: article contains AIIB as organization and article published from 2015/7/20. Panels: no additional query condition (articles from all over the world); SiteCountry = China (articles from China); SiteCountry = USA (articles from USA); SiteRegion = Europe (articles from Europe).]

This figure shows that there were many positive articles about AIIB in China. Europe was slightly more positive than the USA. This information reflects the attitudes to AIIB in these countries (or regions).

4 Conclusion

In this study, we analyzed the characteristics of GDELT data and proposed a data conversion process to utilize this information for NSContrast. In this conversion process, we conducted deduplication of news article URLs and added source country and region information to analyze the characteristic differences between them. Because of the large coverage of news sites, the system can conduct comparative analyses of various countries and regions by using large numbers of news articles from different news sites. In future work, however, it would be advisable to check the appropriateness of the countries estimated by the geolocation service.

Acknowledgement

This work was partially supported by JSPS KAKENHI Grant Number 25280035.

References

[Kle02] Jon Kleinberg. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 91–101, New York, NY, USA, 2002. ACM Press.

[LS13] Kalev Leetaru and Philip A. Schrodt. GDELT: Global data on events, location, and tone, 1979-2012. In ISA Annual Convention 2013, volume 2, page 4, 2013.

[YK12] Masaharu Yoshioka and Noriko Kando. Multifaceted analysis of news articles by using semantic annotated information. In Proceedings of the fifth workshop on Exploiting semantic annotations in information retrieval, ESAIR '12, pages 19–20, New York, NY, USA, 2012. ACM.
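
As an illustration of the conversion rules described in Section 2.2 (Polarity, Site, URL, and the top-level-domain step of SiteCountry), the following Python sketch may be helpful. It is not the NSContrast implementation: the function names, the assumption that SOURCES and SOURCEURLS have already been split into lists, the hostname-suffix matching of URLs to sites, and the small country-code table are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Illustrative subset only; a real table would cover all country code TLDs.
CCTLD_TO_COUNTRY = {"jp": "Japan", "au": "Australia",
                    "uk": "United Kingdom", "ca": "Canada"}


def classify_polarity(tone):
    """Three-way polarity from the GKG tone (thresholds from Section 2.2)."""
    if tone > 1:
        return "positive"
    if tone < -1:
        return "negative"
    return "neutral"


def _belongs_to(url, site):
    """Assumed matching rule: the URL host equals the site or is a subdomain."""
    host = urlparse(url).hostname or ""
    return host == site or host.endswith("." + site)


def shortest_url_per_site(sources, source_urls):
    """Keep one entry per news site and pick its shortest URL."""
    result = {}
    for site in dict.fromkeys(sources):  # drop repeated sites, keep order
        candidates = [u for u in source_urls if _belongs_to(u, site)]
        if candidates:
            result[site] = min(candidates, key=len)
    return result


def country_from_tld(site):
    """Estimate the site country from a country code TLD, else None."""
    tld = site.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


# Example datum from Section 2.2 (January 19, 2016).
sources = ["punchng.com", "punchng.com",
           "onlinenigeria.com", "onlinenigeria.com"]
source_urls = [
    "http://www.punchng.com/25909-2/",
    "http://www.punchng.com/i-am-resolved-to-better-lagos-ambode/",
    "http://news2.onlinenigeria.com/news/general/"
    "453949-i-am-resolved-to-better-lagos-%E2%80%93w-ambode.html",
    "http://news2.onlinenigeria.com/news/general/"
    "453949-i-am-resolved-to-better-lagos-ambode.html",
]
per_site = shortest_url_per_site(sources, source_urls)
print(per_site["punchng.com"])  # http://www.punchng.com/25909-2/
```

In this sketch a site with a generic top-level domain such as .com yields no country, which corresponds to the cases that the paper resolves through the manually constructed site list or the geolocation service.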