

1613-0073

NYTAC-CC: A Climate Change Subcorpus based on New York Times Articles

Francesca Grasso (2), fr.grasso@unito.it

Ronny Patz (1), ronny.patz@uni-potsdam.de

Manfred Stede (1), stede@uni-potsdam.de

(1) University of Potsdam, Karl-Liebknecht-Str. 24-25, 14476 Potsdam, Germany

(2) University of Turin, Corso Svizzera 185, 10149 Turin, Italy

Keywords: Climate Change, Corpora, Topic Modeling


Over the past decade, the analysis of discourses on climate change (CC) has gained increased interest within the social sciences and the NLP community. Textual resources are crucial for understanding how narratives about this phenomenon are crafted and delivered. However, there is still a scarcity of datasets that cover CC in news media. We present a CC-specific subcorpus of 3,630 articles drawn from the New York Times Annotated Corpus, marking the first CC analysis on this data. The subcorpus was created by combining different methods for text selection to ensure representativeness and reliability, which is validated using ClimateBERT. To provide initial insights into the CC subcorpus, we discuss the results of a topic modeling experiment (LDA). These show the diversity of contexts in which CC is discussed in news media over time.

CEUR ceur-ws.org

1. Introduction

We present NYTAC-CC, a topic-specific subcorpus with 3,630 articles addressing climate change (CC), derived from the New York Times Annotated Corpus. This subcorpus covers a 20-year period, drawing from NYTAC’s collection of 1.8 million articles published between 1987 and 2007, which is available through the Linguistic Data Consortium.

The original corpus, and thus also the subcorpus, includes a variety of metadata such as the ‘desk’ (the newspaper branch) and both manually and automatically labeled content categories, with many articles also featuring hand-written summaries. The extensive use of NYTAC in NLP research over the last 15 years (e.g., [1, 2]) benefits CC researchers, allowing for detailed historical analysis of CC discussions in news media. This includes exploring how CC debates were interwoven with topics like domestic and foreign policy, science reporting, and arts and culture coverage. Unlike other CC-focused resources that often contain shorter documents, the NYTAC-CC subcorpus offers a diverse array of articles with varying lengths and complex content, making it a unique resource for investigating the evolution of CC narratives over time.

The contribution of this paper is threefold:

(i) We present the NYTAC-CC subcorpus and its construction.

(ii) We validate the subcorpus using ClimateBERT. As ClimateBERT falsely classifies a number of true positives from our subcorpus as (false) negatives, we demonstrate that our approach achieves better results in ensuring recall of relevant CC articles from the NYTAC corpus.

(iii) To gain initial insights into the CC subcorpus coverage, we use keyword analysis and topic modeling (specifically LDA) to track specifics of CC reporting over the 1987-2007 time span. The results show important trends over time, including key periods of reporting and a large variety of contexts in which CC is discussed.

Thus, our goal is to provide a substantively new and relevant subcorpus, developed and validated in multiple steps, that supports the study of the NYT’s coverage of climate change during the time period covered in our corpus. Although several studies have explored U.S. print media’s reporting on anthropogenic CC, we cover an important 20-year period in which much of today’s climate change discourse evolved.

†These authors contributed equally.

2. Related Work: CC in News

Despite the growing interest in addressing climate change among various academic communities, as pointed out by Luo et al. [5], the topic has so far received limited attention within the ‘core’ NLP community. This is largely due to the NLP field’s focus on standardized datasets and shared tasks, where the topic of CC has been scarcely addressed.

Efforts can be observed within the context of social media, with datasets made available for CC-related tasks [6, 7]. However, there remains a scarcity of work addressing CC at the news article level, which is essential for the NLP community investigating CC narratives in media or performing downstream tasks involving longer texts. In contrast, the analysis of CC discourse on both social media and traditional media has been extensively studied in various social science disciplines [8, 9]. In the following, we will focus on prominent work targeting traditional news media.

A widely-cited early study by Trumbo [10] examined the framing techniques used by various “claim makers” in the online editions of five U.S. newspapers. After querying with different terms and manually filtering the results, the remaining articles were thoroughly investigated. Boykoff [11] later studied the “claims and frames” issue in a similar manner. Legagneux et al. [12] conducted a comparative study of scientific literature and press articles to investigate coverage differences between CC and biodiversity. They analyzed materials from the USA, Canada, and the United Kingdom spanning 1991 to 2016, using representative keywords to query and retrieve relevant content. Similarly, [13] examined how journalistic norms affected CC reporting in U.S. TV and newspapers. Other studies examined the frequency of CC mentions, or the ‘attention cycle’. Brossard et al. [14] compared CC reporting between the NYT and the French Le Monde. Grundmann and Krishnamurthy [15] analyzed newspapers from four countries, enhancing article counts with word frequency and collocation analyses using corpus-linguistic tools, where the outcomes are manually interpreted. The work of [16] highlights one of the few instances where NLP technology is used to analyze CC in newspapers, where authors applied supervised classification to construct a corpus and identify frame categories within four U.S. papers. Continuing in the NLP domain, [4] utilized a specialized corpus that includes CC-related news articles, though details on data retrieval are not available. [17] compiled a dataset of 11k news articles from Science Daily through web scraping.

In conclusion, there remains a scarcity of available corpora containing larger text units like entire articles, which are essential for the NLP community investigating CC narratives in traditional media or performing various downstream tasks involving news articles.

3. Building the NYTAC-CC

3.1. Challenges in CC Text Selection

The New York Times Annotated Corpus (LDC release)¹ contains 1,855,658 articles (1987-2007), each formatted as a single XML file. Metadata include date, author, and newsroom desk. Articles are manually annotated with locations, people, organizations, and key topics. However, topic labels are generally not sufficient for our purpose, that is, finding all CC-related articles, because (i) not all articles are labeled; (ii) some labels of potentially CC-relevant text are overly broad, e.g., ‘weather,’ which also encompasses many non-CC topics; and (iii) some articles we consider CC-relevant are tagged with labels that do not relate to CC.

Our goal is to design a retrieval method that not only ensures validity and reliability but also emphasizes representativeness, ensuring that the corpus adequately covers content related to the specific subject it aims to represent.

Traditional approaches, such as the use of keywords or n-grams, can be inadequate if used alone and can lead to misclassifications due to both false positives and false negatives. Crucially, this holds even with advanced models, particularly when tasked with processing large linguistic units such as entire articles [18]. The changing use of language in time-spanning corpora can further challenge single-method approaches, since they must handle texts that, although consistent in topic, may cover the phenomenon in varied ways over time.

Moreover, we aim for an approach that is reproducible, i.e., that can also be applied to other corpora that do not come with this type of metadata. We have therefore opted for a hybrid approach that combines the advantages of both keyword-based methods and automatic classification, while also aiming to overcome the weaknesses of both.

3.2. Our Hybrid Approach

Our subcorpus construction is built on text retrieval methods previously used in studies on CC discourse (see, e.g., Section 2), but merges them into a hybrid approach to address their strengths and weaknesses. In the literature, we identified the following approaches:

1. Search with bigrams: typically, this involves terms like “climate change,” sometimes accompanied by one or two others, notably “global warming” and “greenhouse effect”; e.g., [10, 12];
2. Search with a longer list of keywords, followed by manual filtering; e.g., [19, 18];
3. Complex Boolean queries with keywords and operators (AND, OR, NOT); e.g., [20];
4. Manual annotation of training data followed by supervised classification; e.g., [16].

As a first exploratory step, we experimented with method (1), obtaining the expected unsatisfactory results.
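To make method (1) concrete, the following is a minimal sketch of a bigram search over article texts. It is an illustration rather than the retrieval code used for the corpus: the `articles` mapping, the function names, and the short bigram list (a subset of Appendix A) are assumptions introduced here.

```python
import re

# Illustrative subset of the bigrams listed in Appendix A.
BIGRAMS = ["climate change", "global warming", "greenhouse effect", "Kyoto Protocol"]


def compile_bigram_pattern(bigrams):
    """Build one case-insensitive regex matching any bigram as whole words.

    Whitespace between the two words may be any run of spaces or newlines.
    """
    alternatives = [
        r"\s+".join(re.escape(word) for word in bigram.split())
        for bigram in bigrams
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def retrieve(articles, bigrams=BIGRAMS):
    """Return the IDs of articles containing at least one bigram.

    `articles` is a hypothetical mapping {doc_id: article_text}.
    """
    pattern = compile_bigram_pattern(bigrams)
    return [doc_id for doc_id, text in articles.items() if pattern.search(text)]
```

Because the match is purely lexical, an article that uses “global warming” metaphorically is retrieved as well, which illustrates why a bigram search on its own tends to produce false positives.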

We subsequently refined our retrieval process from the NYTAC by extending methods (2) and (4). Texts that we consider relevant for the CC topic must not merely mention CC in passing, but should discuss aspects of anthropogenic CC, relate substantial information, or convey a stance on its existence or urgency.

Bigram search. Initially, we experimented with a list of bigrams (see Appendix A) sourced from the BBC Climate Change Glossary². This was done to cover terminologies used over the two decades spanned by the corpus. This method led to the retrieval of 10,707 articles. Upon manual inspection, we found that many were false positives, addressing general environmental issues but not specifically related to CC. Conversely, many articles we regarded as relevant did not contain the bigram “climate change” (searching for this bigram yielded only 2,080 texts). Consequently, this led us to seek a more elaborate approach.

Keyword search. In response to the limited performance of the bigram search, we proceeded to extract CC-related articles using keywords that were employed by [19] to identify topic-relevant articles in Nature and Science (see Appendix B). To these, we added the keyword “Kyoto”, given that in the time period of our corpus the Kyoto conference had an importance similar to that of the later “Paris Agreement”. However, the resulting subcorpus still contained many false positives, primarily from long list-like articles combining various news items. To ensure homogeneity, we excluded these articles, resulting in an intermediate corpus of 12,883 articles.

Text ranking and supervised classification. To overcome the presence of false positives, we implemented an additional, more elaborate filtering step on the intermediate corpus. Initially, we heuristically ranked the articles for topic relevance, using a score based on accumulated keyword weights. This score reflects both the frequency of the keywords and their position within the article, as content in the beginning is generally considered most important. Specifically, we multiply the number of keyword occurrences per sentence by a score representing sentence prominence (1 for the first sentence, 0.9 for the second, 0.8 for the third, and so on). After automatically ranking the articles, we selected 450 articles for manual tagging: the top 150, the last 150, and 150 from the middle. We manually assessed them to determine if they were at least partially about CC, using the labels ‘1’ (CC-related) or ‘0’ (not CC-related).

We used the manually-annotated data to train and test an XGBoost classifier, configured to differentiate between CC-related and non-CC articles. The features used included keyword counts (those from [21], plus ‘Kyoto’), the 50 most frequent ‘topic’ labels from the article metadata, and several binary features: whether an article was published by (i) the ‘Dining’ or ‘Style’ desks or by (ii) other desks; whether it was published on the weekend; whether a keyword appeared in the title or the first paragraph; and whether the article was (i) an opinion piece or a letter versus (ii) another type of article. The classifier achieved a precision score of 1.0 and a recall score of 0.94 on our held-out evaluation set of 100 texts. Subsequently, we used the classifier to label the entire intermediate corpus, labeling 9,253 articles as not CC-related and 3,630 as CC-related, thus forming what we now refer to as our final ‘NYTAC climate change subcorpus’, which we make available as a list of document IDs.³ Figure 1 illustrates the features that had the greatest impact on the classification decisions.

Figure 1: Key features in classifying “climate change” articles

3.3. Evaluation with ClimateBERT

We aim to demonstrate (i) the relevance of our 3,630-article subcorpus in genuinely consisting of climate change (CC)-related articles and, thereby, (ii) the validity of our combined method for retrieving topic-consistent texts from a larger, heterogeneous collection while minimizing false positives. To perform that validation, we employed ClimateBERT [4], a BERT-based model trained on CC-related texts. In particular, we used distilroberta-base-climate-detector from the Hugging Face platform [22], a fine-tuned version with a classification head for detecting climate-related paragraphs. Given its specialization in CC-related texts, we deemed ClimateBERT a very suitable tool to confirm the accuracy of our dataset. In doing so, we are also indirectly assessing the model’s capability in detecting CC-related content within larger portions of texts. As the model’s context length is limited to 512 tokens, we addressed this limitation by adopting two different approaches described below.

² https://www.bbc.com/news/science-environment-11833685

³ https://github.com/discourse-lab/NYTAC-CC
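Looking back at the ranking step of Section 3.2, the position-weighted keyword score can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the keyword set is a placeholder, the sentence splitter is deliberately naive, the weight is assumed to reach zero from the eleventh sentence onward (a detail the description leaves open), and the "middle" sample is assumed to be a centered slice of the ranking. The function names and the `articles` mapping are hypothetical.

```python
import re

# Placeholder keyword set; the actual list is the one referred to in Appendix B.
KEYWORDS = frozenset({"climate", "warming", "greenhouse", "kyoto"})


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def relevance_score(text, keywords=KEYWORDS):
    """Sum keyword hits per sentence, weighted by sentence prominence.

    Weights are 1.0, 0.9, 0.8, ... for sentences 1, 2, 3, ...
    Assumption: the weight is floored at 0 from the 11th sentence onward.
    """
    score = 0.0
    for index, sentence in enumerate(split_sentences(text)):
        weight = max(0, 10 - index) / 10
        if weight == 0:
            break  # later sentences cannot contribute under this assumption
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = sum(1 for token in tokens if token in keywords)
        score += hits * weight
    return score


def sample_for_annotation(articles, n=150):
    """Rank articles by score and return (top n, middle n, bottom n) IDs.

    `articles` is a hypothetical mapping {doc_id: article_text}.
    """
    ranked = sorted(articles, key=lambda d: relevance_score(articles[d]), reverse=True)
    middle_start = (len(ranked) - n) // 2
    return ranked[:n], ranked[middle_start:middle_start + n], ranked[-n:]
```

Sampling from the top, middle, and bottom of the ranking gives the annotators a mix of likely positives, borderline cases, and likely negatives, which is what a classifier needs in its training data.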

In the first approach, longer texts were truncated to the model’s context length. Of the 3,630 instances, the model recognized 3,468 articles as +climate. We manually inspected the remaining 162 texts classified as -climate, i.e., as potential false negatives. We found that the model clearly misclassified 75 texts, many of which included relevant CC content appearing beyond the initial 512 tokens. More qualitative insights on these 162 texts are provided in Section 3.4.

Figure 2: Monthly article count in CC subcorpus

In addition, we attempted a second approach to overcome the context length constraint by using a sliding window technique. This involved creating chunks of longer texts (> 512 tokens), classifying each chunk, and labeling the entire text as +climate if any of the chunks were labeled as such. This second approach led to significantly different results, as only 3 out of 3,630 instances were labeled -climate.
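The two strategies for handling the 512-token limit can be contrasted in a short sketch. The classifier is passed in as a plain callable (`is_climate`), standing in for a model such as ClimateBERT; the function names, the pre-tokenized input lists, and the overlapping stride of 256 are assumptions made for illustration and are not taken from the experiments above.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[str]], bool]


def make_chunks(tokens: Sequence[str], size: int = 512, stride: int = 256) -> List[Sequence[str]]:
    """Split a token sequence into windows of at most `size` tokens.

    Consecutive windows start `stride` tokens apart (assumed overlap).
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(tokens) <= size:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


def classify_truncated(tokens: Sequence[str], is_climate: Classifier, size: int = 512) -> bool:
    """First approach: classify only the first `size` tokens."""
    return is_climate(tokens[:size])


def classify_sliding(tokens: Sequence[str], is_climate: Classifier,
                     size: int = 512, stride: int = 256) -> bool:
    """Second approach: label the text +climate if any window is +climate."""
    return any(is_climate(chunk) for chunk in make_chunks(tokens, size, stride))
```

The any-chunk rule favors recall: a single positive window is enough to label the whole article, so it recovers late-occurring CC content but would also accept articles that mention CC only briefly.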

These results demonstrate both the representativeness of our corpus and the validity of our hybrid subcorpus selection method. In addition, we show how automatic classification models can be limiting when dealing with long text units, therefore reinforcing the need for a combined approach to build topic-relevant (sub)corpora.

3.4. Analysis of the ClimateBERT misclassifications

As discussed in Section 3.3, we manually inspected the 162 articles within our subcorpus that ClimateBERT initially classified as -climate. Of these, 75 were clearly related to CC. Specifically, 48 articles featured significant discussions on CC-related issues beyond the model’s 512-token limit. Additionally, 27 articles contained detailed CC narratives within the first 512 tokens, often intersecting with other topics like politics (e.g., conferences on CC) and population (e.g., CC impacts on specific regions). This misclassification highlights a limitation of the model that extends beyond the mere input token limit, underscoring the challenges in handling topic intersections.

Although not the primary focus, CC was still mentioned in the remaining articles. In particular, 51 articles included CC in contexts marginally related to their main narratives, integrating CC with other discussions. In another 36 articles, CC was a secondary topic, occasionally mentioned only in passing, such as references to the Kyoto Protocol or metaphorical uses of global warming.

4. Overview of NYTAC-CC

In this section, we provide an initial overview of the NYTAC-CC coverage, including the article distribution over time and a preliminary subtopics exploration.

4.1. Temporal and Keyword highlights

We examine the temporal distribution of articles and key lexical features in our corpus to illuminate trends and shifts in CC coverage over time (see Figure 2).

The analysis reveals a peak in articles during 1990, with up to 50 articles per month, followed by a decline to 20 articles per month in the mid-90s. After the Kyoto Protocol in December 1997, the curve shows a steady rise with intermittent bursts in coverage. In the figure, we have marked important ‘climate events’ in the years in which they occurred.

The frequency ratios of the top eight lexical features determined by the classifier (cf. Figure 1) over time in Figure 3 illustrate the dominance of ‘greenhouse’ in the late 1980s. ‘Warming’ remains the most frequent term throughout, but in the final years, ‘climate’ gains prominence, suggesting a shift of term preference from ‘global warming’ to ‘climate change’, a transition noted in various other studies as well. Also, the two ‘Kyoto’ events are clearly visible: the international accord was reached in 1997, and the Bush administration’s decision not to ratify it occurred in 2001.

At the same time, we also find that many articles focused on weather or pollution primarily addressed these issues directly, mentioning climate change only tangentially. This reduces the co-occurrence of other prominent CC terms in these articles.

4.2. Document Structuring with LDA

Building on the basic statistics discussed in the previous subsection, we delved deeper into the range of subtopics within the CC corpus using topic modeling, specifically Latent Dirichlet Allocation (LDA). This approach helps to uncover underlying thematic structures in the data, which are not immediately apparent from simple keyword analysis.

Preprocessing Steps. To prepare the texts for LDA, we performed several preprocessing steps on article titles and bodies, including removing punctuation, lemmatizing words, and converting all text to lowercase to ensure consistency. We also joined frequently co-occurring bigrams into single terms to preserve important phrases. For our topic modeling, we focused on nouns and proper nouns that ranked among the top 10,000 by frequency and had more than two letters. This refinement allowed us to emphasize key entities and their relationships, central to the content of the articles, and avoid the dilution of thematic significance by less informative parts of speech, enhancing consistency through the use of pseudowords.

Model Selection. The best LDA model was chosen based on the coherence score, calculated using the Python Gensim library. This ensures an objective selection process, minimizing subjective interpretation. We prioritized coherence to ensure that the topics generated by the model are interpretable and meaningful. The optimal model identified 18 topics, with a coherence score of 0.56, indicating a reasonable level of interpretability. We chose the highest-ranked term as the ‘name’ of each topic and listed five additional representative terms as follows:

9. plant: coal, company, emission, power, utility
10. water: area, land, river, population, fish
11. state: pollution, air, ozone, epa, smog
12. china: government, people, war, security, country
13. car: vehicle, fuel, gasoline, hydrogen, auto
14. ice: sea, arctic, ocean, glacier, bear
15. forest: tree, plant, species, fire, crop
16. weather: winter, temperature, snow, degree, heat
17. storm: el_nino, drought, hurricane, wind, flood
18. island: bird, beach, garden, long_island, sand

As is common with topic models, some overlap between topics can occasionally be observed when examining the complete top-30 term lists, for example, between topics company and plant. Additionally, we find some apparent ‘outlier’ terms in all the topics.

As a preliminary approximation, we tagged each text in the subcorpus with the predominant topic identified by the model, allowing us to track the evolution of topic coverage over time (see Figure 4). This LDA-based analysis highlights how the context of CC-related coverage in the NYTAC corpus shifts over time, for example from a framing within science and pollution debates to a discourse context in which greenhouse gas emissions were central. Further, our findings complement the manual inspection discussed in Section 3.3, illustrating how climate change discussions, while sometimes secondary in broader articles on government policy (topic ‘administration’), are integral to discussions on foreign policy (‘China’) and cultural topics (‘people’).

5. Conclusion and Future Work

In this paper, we introduced the NYTAC-CC, a specialized subcorpus of 3,630 climate change articles from the New York Times Annotated Corpus spanning 1987 to 2007, marking the first CC analysis with this dataset. Addressing the lack of available news-based textual resources for NLP tasks, we employed a hybrid method combining keyword-based prefiltering and automatic classification to optimize the corpus construction. The representativeness of the subcorpus was confirmed using ClimateBERT, but additional manual inspection of ClimateBERT’s classification of a considerable number of true positives as (false) negatives also showed the model’s limitations and the benefits of the hybrid approach chosen.

Initial analyses of the subcorpus, including statistics, keyword searches, and topic modeling, highlight the corpus’s potential for detailed diachronic and subtopic exploration.

Thus, the NYTAC-CC subcorpus can be a useful resource for examining the historical narrative of climate change in news media. As it builds on the NYTAC corpus, it adds to previous work on this data, providing valuable insights for social science research. It also serves as a beneficial dataset for developing NLP applications that require a deep understanding of climate-related discourse.

While the size of the subcorpus may restrict certain quantitative analyses, its rich, concentrated content is ideal for qualitative studies. Furthermore, it offers the potential for expansion and further integration with additional sources to enhance its utility and relevance for ongoing climate change research. Future work will expand on these findings with advanced topic modeling techniques and integrate more recent articles to enrich the diachronic analysis.
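As an illustration of the per-article tagging used in Section 4.2, the sketch below assigns each document its highest-weighted topic and counts the resulting labels per year. The input format (pairs of a year and a topic-weight dictionary), the function names, and any example weights are hypothetical; in practice the weights would come from the fitted LDA model.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_weights):
    """Return the topic name with the largest weight for one document."""
    return max(topic_weights, key=topic_weights.get)


def topic_counts_by_year(documents):
    """Count dominant topics per publication year.

    `documents` is a hypothetical iterable of (year, {topic_name: weight})
    pairs. Returns {year: {topic_name: count}} with years in ascending order.
    """
    counts = defaultdict(Counter)
    for year, weights in documents:
        counts[year][dominant_topic(weights)] += 1
    return {year: dict(counter) for year, counter in sorted(counts.items())}
```

Reducing each article to a single label discards mixed-topic information, which is why such a tagging is best read as a first approximation of how coverage shifts over time.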

A. List of Bigrams

climate change, global warming, greenhouse effect, acid rain, ozone layer, greenhouse gases, fossil fuels, greenhouse emissions, ice shelves, ice sheets, rising sea, sea levels, Kyoto Protocol, Montreal Protocol, carbon footprint, carbon dioxide, carbon neutral, emission trading, feedback loop, global dimming, renewable energy, Stern Review.

B. List of Keywords

[1] Zhang, Jatowt, S. S. Bhowmick, Tanaka, Omnia mutantur, nihil interit: Connecting past with present by finding corresponding terms across time, in: Annual Meeting of the Association for Computational Linguistics, 2015. URL: https://api.semanticscholar.org/CorpusID:1121386.

[2] Alonso, Berberich, S. J. Bedathur, G. Weikum, Time-based exploration of news archives, 2010. URL: https://api.semanticscholar.org/CorpusID:2353972.

[3] Kantner, Overbeck, Exploring soft concepts with hard corpus-analytic methods, in: N. Reiter, A. Pichler, J. Kuhn (Eds.), Reflektierte algorithmische Textanalyse, De Gruyter, Berlin, 2020.

[4] Webersinke, Kraus, Bingler, M. Leippold, ClimateBERT: A Pretrained Language Model for Climate-Related Text, in: Proceedings of AAAI 2022 Fall Symposium: The Role of AI in Responding to Climate Challenges, 2022. doi:10.48550/arXiv.2212.13631.

[5] Luo, Card, Jurafsky, Detecting stance in media on global warming, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020, pp. 3296-3315.

[6] Effrosynidis, Karasakalidis, Sylaios, Arampatzis, The climate change twitter dataset, Expert Syst. Appl. 204 (2022) 117541. URL: https://api.semanticscholar.org/CorpusID:248807383.

[7] Samantray, Pin, Data and code for: Credibility of climate change denial in social media (2019). URL: https://doi.org/10.7910/DVN/LNNPVD. doi:10.7910/DVN/LNNPVD.

[8] Diehl, Huber, H. G. de Zúñiga, J. H. Liu, Social media and beliefs about climate change: A cross-national analysis of news use, political ideology, and trust in science, International Journal of Public Opinion Research (2019). URL: https://api.semanticscholar.org/CorpusID:214067785.

[9] Shehata, Johansson, Andersen, Climate change frame acceptance and resistance: Extreme weather, consonant news, and personal media orientations, Mass Communication and Society 25 (2021) 51-76. URL: https://api.semanticscholar.org/CorpusID:238720934.

[10] Trumbo, Constructing climate change: claims and frames in US news coverage of an environmental issue, Publ. Underst. Science 5 (1996) 269-283.

[11] Boykoff, The cultural politics of climate change discourse in UK tabloids, Political Geography 27 (2008) 549-569.

[12] Legagneux, Casajus, Cazelles, Chevallier, Chevrinais, Guéry, Jacquet, Jaffré, M.-J. Naud, F. Noisette, P. Ropars, S. Vissault, P. Archambault, J. Bêty, D. Berteaux, D. Gravel, Our house is burning: Discrepancy in climate change vs. biodiversity coverage in the media as compared to scientific literature, Frontiers in Ecology and Evolution 5 (2018). URL: https://api.semanticscholar.org/CorpusID:39805874.

[13] Boykoff, Boykoff, Climate Change and Journalistic Norms: A Case-Study of US Mass-Media Coverage, Geoforum 38 (2007) 1190-1204.

[14] Brossard, Shanahan, McComas, Are issue-cycles culturally constructed? A comparison of French and American coverage of global climate change, Mass Communication and Society 7 (2004) 359-377.

[15] Grundmann, Krishnamurthy, The Discourse of Climate Change: A Corpus-based Approach, Critical Approaches to Discourse Analysis across Disciplines 4 (2010) 113-133.

[16] D. A. Stecula, E. Merkley, Framing Climate Change: Economics, Ideology, and Uncertainty in American News Media Content From 1988 to 2014, Frontiers in Communication 4 (2019).

[17] Mishra, Mittal, Neuralnere: Neural named entity relationship extraction for end-to-end climate change knowledge graph construction, in: ICML 2021 Workshop on Tackling Climate Change with Machine Learning, 2021. URL: https://www.climatechange.ai/papers/icml2021/76.

[18] Leippold, F. S. Varini, Climatext: A dataset for climate change topic detection, in: NeurIPS 2020 Workshop on Tackling Climate Change with Machine Learning, 2020. URL: https://www.climatechange.ai/papers/neurips2020/69.

[19] Hulme, Obermeister, Randalls, M. Borie, Framing the challenge of climate change in Nature and Science editorials, Nature Climate Change 8 (2018) 515-521.

[20] Schmidt, Ivanova, M. S. Schäfer, Media Attention for Climate Change around the World: A Comparative Analysis of Newspaper Coverage in 27 Countries, Global Environmental Change 23 (2013) 1233-1248.

[21] Hulme, Why we disagree about climate change: Understanding controversy, inaction and opportunity, Cambridge UP, Cambridge, 2009.

[22] Bingler, Kraus, Leippold, Webersinke, How Cheap Talk in Climate Disclosures Relates to Climate Initiatives, Corporate Emissions, and Reputation Risk, Working paper, Available at SSRN 3998435, 2023.