NYTAC-CC: A Climate Change Subcorpus based on New York Times Articles
Francesca Grasso1,∗,†, Ronny Patz2,† and Manfred Stede2,†
1 University of Turin, Corso Svizzera 185, 10149, Turin, Italy
2 University of Potsdam, Karl-Liebknecht-Str. 24-25, 14476, Potsdam, Germany


Abstract
Over the past decade, the analysis of discourses on climate change (CC) has gained increased interest within the social sciences and the NLP community. Textual resources are crucial for understanding how narratives about this phenomenon are crafted and delivered. However, there is still a scarcity of datasets that cover CC in news media in a representative way. This paper presents a CC-specific subcorpus of 3,630 articles extracted from the 1.8 million articles of the New York Times Annotated Corpus, marking the first CC analysis on this data. The subcorpus was created by combining different methods for text selection to ensure representativeness and reliability, which is validated using ClimateBERT. To provide initial insights into the CC subcorpus, we discuss the results of a topic modeling experiment (LDA). These show the diversity of contexts in which CC is discussed in news media over time.

Keywords
Climate Change, Corpora, Topic Modeling


1. Introduction

We present NYTAC-CC, a topic-specific subcorpus with 3,630 articles addressing climate change (CC), derived from the New York Times Annotated Corpus. This subcorpus covers a 20-year period, drawing from NYTAC’s collection of 1.8 million articles published between 1987 and 2007, which is available through the Linguistic Data Consortium. The original corpus, and thus also the subcorpus, includes a variety of metadata such as the ‘desk’ (the newspaper branch) and both manually- and automatically-labeled content categories, with many articles also featuring hand-written summaries. The extensive use of NYTAC in NLP research over the last 15 years (e.g., [1, 2]) benefits CC researchers, allowing for detailed historical analysis of CC discussions in news media. This includes exploring how CC debates were interwoven with topics like domestic and foreign policy, science reporting, and arts and culture coverage. Unlike other CC-focused resources that often contain shorter documents, the NYTAC-CC subcorpus offers a diverse array of articles with varying lengths and complex content, making it a unique resource for investigating the evolution of CC narratives over time.
   The contribution of this paper is threefold:
   (i) We present the NYTAC-CC subcorpus and its construction using a blend of dictionary-based and supervised methods in order to ensure representativeness as well as validity and reliability, which are key in social science research [3]. This hybrid approach addresses the challenges of refining a topic-specific subcorpus from a larger corpus, aiming to mitigate the limitations of traditional keyword-based sampling that often results in false positives.
   (ii) To demonstrate the validity of the subcorpus, and thus its reliability for further downstream tasks, we illustrate the results of a classification experiment using ClimateBERT [4]. While this experiment further validates that the articles in our NYTAC-CC subcorpus are, indeed, true positives, it also shows limitations of ClimateBERT. As ClimateBERT falsely classifies a number of true positives from our subcorpus as (false) negatives, we demonstrate that our approach achieves better results in ensuring recall of relevant CC articles from the NYTAC corpus.
   (iii) To gain initial insights into the CC subcorpus coverage, we use keyword analysis and topic modeling (specifically LDA) to track specifics of CC reporting over the 1987-2007 time span. The results show important trends over time, including key periods of reporting and a large variety of contexts in which CC is discussed.
   Thus, our goal is to provide a substantively new and relevant subcorpus, developed and validated in multiple iterations, and to then provide a first overview of the NYT’s coverage of climate change during the time period covered in our corpus. Although several studies have explored U.S. print media’s reporting on anthropogenic CC, we cover an important 20-year period in which much of today’s climate change discourse evolved.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
† These authors contributed equally.
Email: fr.grasso@unito.it (F. Grasso); ronny.patz@uni-potsdam.de (R. Patz); stede@uni-potsdam.de (M. Stede)
ORCID: 0000-0001-8473-9491 (F. Grasso); 0000-0002-0761-086X (R. Patz); 0000-0001-6819-2043 (M. Stede)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Related Work: CC in News

Despite the growing interest in addressing climate change among various academic communities, as pointed out by Luo et al. [5], the topic has so far received limited attention within the ‘core’ NLP community. This is largely due to the NLP field’s focus on standardized datasets and shared tasks, where the topic of CC has been scarcely addressed.
   Efforts can be observed within the context of social media, with datasets made available for CC-related tasks [6, 7]. However, there remains a scarcity of work addressing CC at the news article level, which is essential for the NLP community investigating CC narratives in media or performing downstream tasks involving longer texts. In contrast, the analysis of CC discourse on both social media and traditional media has been extensively studied in various social science disciplines [8, 9]. In the following, we will focus on prominent work targeting traditional news media.
   A widely-cited early study by Trumbo [10] examined the framing techniques used by various “claim makers” in the online editions of five U.S. newspapers. After querying with different terms and manually filtering the results, the remaining articles were thoroughly investigated. Boykoff [11] later studied the “claims and frames” issue in a similar manner. Legagneux et al. [12] conducted a comparative study of scientific literature and press articles to investigate coverage differences between CC and biodiversity. They analyzed materials from the USA, Canada, and the United Kingdom spanning 1991 to 2016, using representative keywords to query and retrieve relevant content. Similarly, [13] examined how journalistic norms affected CC reporting in U.S. TV and newspapers. Other studies examined the frequency of CC mentions, or the ‘attention cycle’. Brossard et al. [14] compared CC reporting between the NYT and the French Le Monde. Grundmann and Krishnamurthy [15] analyzed newspapers from four countries, enhancing article counts with word frequency and collocation analyses using corpus-linguistic tools, where the outcomes are manually interpreted. The work of [16] highlights one of the few instances where NLP technology is used to analyze CC in newspapers, where authors applied supervised classification to construct a corpus and identify frame categories within four U.S. papers. Continuing in the NLP domain, [4] utilized a specialized corpus that includes CC-related news articles, though details on data retrieval are not available. [17] compiled a dataset of 11k news articles from Science Daily through web scraping.
   In conclusion, there remains a scarcity of available corpora containing larger text units like entire articles, which are essential for the NLP community investigating CC narratives in traditional media or performing various downstream tasks involving news articles.

3. Building the NYTAC-CC

3.1. Challenges in CC Text Selection

The New York Times Annotated Corpus (LDC release)1 contains 1,855,658 articles (1987-2007), each formatted as a single XML file. Metadata include date, author, and newsroom desk. Articles are manually annotated with locations, people, organizations, and key topics. However, topic labels are generally not sufficient for our purpose, that is, finding all CC-related articles, because (i) not all articles are labeled; (ii) some labels of potentially CC-relevant text are overly broad, e.g., ‘weather,’ which also encompasses many non-CC topics; and (iii) some articles we consider CC-relevant are tagged with labels that do not relate to CC.
   Our goal is to design a retrieval method that not only ensures validity and reliability but also emphasizes representativeness, ensuring that the corpus adequately covers content related to the specific subject it aims to represent. Traditional approaches, such as the use of keywords or n-grams, can be inadequate if used alone and can lead to misclassifications due to both false positives and false negatives. Crucially, this holds even with advanced models, particularly when tasked with processing large linguistic units such as entire articles [18]. The changing use of language in time-spanning corpora can further challenge single-method approaches, since they must handle texts that, although consistent in topic, may cover the phenomenon in varied ways over time.
   Moreover, we aim for an approach that is reproducible, i.e., that can also be applied to other corpora that do not come with this type of metadata. We have therefore opted for a hybrid approach that combines the advantages of both keyword-based methods and automatic classification, while also aiming to overcome the weaknesses of both.

1 https://www.ldc.upenn.edu

3.2. Our Hybrid Approach

Our subcorpus construction is built on text retrieval methods previously used in studies on CC discourse (see, e.g., Section 2), but merges them into a hybrid approach to address their strengths and weaknesses. In the literature, we identified the following approaches:

       1. Search with bigrams: typically, this involves terms like “climate change,” sometimes accompanied by one or two others, notably “global warming” and “greenhouse effect”; e.g., [10, 12];
       2. Search with a longer list of keywords, followed by manual filtering; e.g., [19, 18];
       3. Complex Boolean queries with keywords and operators (AND, OR, NOT); e.g., [20];
       4. Manual annotation of training data followed by supervised classification; e.g., [16].
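Approaches (1) to (3) all reduce to matching phrases against article text. As a rough illustration only, the following Python sketch shows what such a phrase search with optional AND/OR/NOT conditions might look like. The helper names `contains_phrase` and `boolean_match` and the toy documents are hypothetical and are not taken from the cited studies or from our own pipeline.

```python
import re

def contains_phrase(text, phrase):
    """Case-insensitive whole-phrase match on word boundaries (illustrative helper)."""
    normalized = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, normalized, flags=re.IGNORECASE) is not None

def boolean_match(text, any_of=(), all_of=(), none_of=()):
    """OR over any_of, AND over all_of, NOT over none_of (illustrative helper)."""
    if any_of and not any(contains_phrase(text, p) for p in any_of):
        return False
    if not all(contains_phrase(text, p) for p in all_of):
        return False
    return not any(contains_phrase(text, p) for p in none_of)

# The three bigrams named in approach (1).
BIGRAMS = ("climate change", "global warming", "greenhouse effect")

# Toy documents, invented for this example.
docs = {
    "a": "Scientists warn that global warming is accelerating.",
    "b": "The political climate may change after the election.",
    "c": "A warming trend in the housing market.",
}

hits = [doc_id for doc_id, text in docs.items() if boolean_match(text, any_of=BIGRAMS)]
print(hits)
```

Document "b" contains both words of a bigram but not the phrase itself, so it is not retrieved; a looser single-keyword search would return it, which hints at how keyword lists can produce false positives.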

   As a first exploratory step, we experimented with
method (1), obtaining the expected unsatisfactory results.
We subsequently refined our retrieval process from the
NYTAC by extending methods (2) and (4). Texts that we
consider relevant for the CC topic must not merely
mention CC in passing, but should discuss aspects of
anthropogenic CC, relate substantial information, or convey
a stance on its existence or urgency.
   Bigram search. Initially, we experimented with a list of bigrams (see Appendix A) sourced from the BBC Climate Change Glossary2. This was done to cover terminologies used over the two decades spanned by the corpus. This method led to the retrieval of 10,707 articles. Upon manual inspection, we found that many were false positives, addressing general environmental issues but not specifically related to CC. Conversely, many articles we regarded as relevant did not contain the bigram “climate change” (searching for this bigram yielded only 2,080 texts). Consequently, this led us to seek a more elaborate approach.

2 https://www.bbc.com/news/science-environment-11833685

   Keyword search. In response to the limited performance of the bigram search, we proceeded to extract CC-related articles using keywords that were employed by [19] to identify topic-relevant articles in Nature and Science (see Appendix B). To these, we added the keyword “Kyoto”, given that, in the period covered by our corpus, the Kyoto conference had an importance similar to that of the later “Paris agreement”. However, the resulting subcorpus still contained many false positives, primarily from long list-like articles combining various news items. To ensure homogeneity, we excluded these articles, resulting in an intermediate corpus of 12,883 articles.
   Text ranking and supervised classification. To overcome the presence of false positives, we implemented an additional, more elaborate filtering step on the intermediate corpus. Initially, we heuristically ranked the articles for topic relevance, using a score based on accumulated keyword weights. This score reflects both the frequency of the keywords and their position within the article, as content in the beginning is generally considered most important. Specifically, we multiply the number of keyword occurrences per sentence by a score representing sentence prominence (1 for the first sentence, 0.9 for the second, 0.8 for the third, and so on). After automatically ranking the articles, we selected 450 articles for manual tagging: the top 150, the bottom 150, and 150 from the middle. We manually assessed them to determine if they were at least partially about CC, using the labels ‘1’ (CC-related) or ‘0’ (not CC-related).
   We used the manually-annotated data to train and test an XGBoost classifier, configured to differentiate between CC-related and non-CC articles. The features used included keyword counts (those from [21], plus ‘Kyoto’), the 50 most frequent ‘topic’ labels from the article metadata, and several binary features: whether an article was published by (i) the ‘Dining’ or ‘Style’ desks or by (ii) other desks; whether it was published on the weekend; whether a keyword appeared in the title or the first paragraph; and whether the article was (i) an opinion piece or a letter versus (ii) another type of article. The classifier achieved a precision score of 1.0 and a recall score of 0.94 on our held-out evaluation set of 100 texts. Subsequently, we used the classifier to label the entire intermediate corpus, labeling 9,253 articles as not CC-related and 3,630 as CC-related, thus forming what we now refer to as our final ‘NYTAC climate change subcorpus’, which we make available as a list of document IDs.3 Figure 1 illustrates the features that had the greatest impact on the classification decisions.

Figure 1: Key features in classifying “climate change” articles

3 https://github.com/discourse-lab/NYTAC-CC

3.3. Evaluation with ClimateBERT

We aim to demonstrate (i) the relevance of our 3,630-article subcorpus in genuinely consisting of climate change (CC)-related articles and, thereby, (ii) the validity of our combined method for retrieving topic-consistent texts from a larger, heterogeneous collection while minimizing false positives. To perform that validation, we employed ClimateBERT, specifically ClimateBert_F [4], a BERT-based model trained on CC-related texts. In particular, we used distilroberta-base-climate-detector from the
Hugging Face platform [22], a fine-tuned version with
a classification head for detecting climate-related para-
graphs. Given its specialization in CC-related texts, we
deemed ClimateBERT a very suitable tool to confirm the
accuracy of our dataset. In doing so, we are also indirectly
assessing the model’s capability in detecting CC-related
content within larger portions of texts. As the model’s
context length is limited to 512 tokens, we addressed
this limitation by adopting two different approaches de-
scribed below.
   In the first approach, texts longer than 512 tokens were
truncated to fit the model’s context length. Of the 3,630
instances, the model recognized 3,468 articles as +climate.
We manually inspected the remaining 162 texts classified
as -climate, i.e., potential false negatives. We found that
the model clearly misclassified 75 of them, including texts
whose relevant CC content appears only beyond the initial
512 tokens. More qualitative insights on these 162 texts are
provided in the subsection below.
Figure 2: Monthly article count in CC subcorpus

   In addition, we attempted a second approach to overcome
the context length constraint by using a sliding
window technique. This involved creating chunks of
longer texts (> 512 tokens), classifying each chunk, and
labeling the entire text as +climate if any of the chunks
were labeled as such. This second approach led to
significantly different results, as only 3 out of 3,630 instances
were labeled -climate.
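The difference between the truncation and sliding-window strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in our experiments: tokens are plain list items rather than subword tokens, the `stride` parameter (and hence whether windows overlap) is an assumption, and `toy_is_climate` is a hypothetical stand-in for a call to the actual climate detector.

```python
from typing import Callable, List

def make_chunks(tokens: List[str], max_len: int = 512, stride: int = 512) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    stride == max_len gives non-overlapping chunks; a smaller stride gives
    overlapping windows. The stride value is an assumption of this sketch.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

def classify_truncated(tokens: List[str], is_climate: Callable[[List[str]], bool],
                       max_len: int = 512) -> bool:
    """First strategy: only the first max_len tokens are seen by the classifier."""
    return is_climate(tokens[:max_len])

def classify_sliding(tokens: List[str], is_climate: Callable[[List[str]], bool],
                     max_len: int = 512, stride: int = 512) -> bool:
    """Second strategy: the text is +climate if any chunk is +climate."""
    return any(is_climate(chunk) for chunk in make_chunks(tokens, max_len, stride))

def toy_is_climate(chunk: List[str]) -> bool:
    """Hypothetical stand-in for the real model: flags chunks containing 'warming'."""
    return "warming" in chunk

# Toy article whose only relevant content appears after the first 512 tokens.
article = ["filler"] * 600 + ["global", "warming"]
print(classify_truncated(article, toy_is_climate))  # False
print(classify_sliding(article, toy_is_climate))    # True
```

Because the document-level label is a logical OR over chunks, this aggregation favors recall: a single positive chunk is enough to label a long article as +climate.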
   These results demonstrate both the representativeness of our corpus and the validity of our hybrid subcorpus selection method. In addition, we show how automatic classification models can be limiting when dealing with long text units, therefore reinforcing the need for a combined approach to build topic-relevant (sub)corpora.

3.4. Analysis of the ClimateBERT misclassifications

As discussed in Section 3.3, we manually inspected the 162 articles from our subcorpus that ClimateBERT initially classified as -climate, i.e., potential false negatives. Of these, 75 were clearly related to CC. Specifically, 48 articles featured significant discussions on CC-related issues beyond the model’s 512-token limit. Additionally, 27 articles contained detailed CC narratives within the first 512 tokens, often intersecting with other topics like politics (e.g., conferences on CC) and population (e.g., CC impacts on specific regions). This misclassification highlights a limitation of the model that extends beyond the input token limit, underscoring the challenges in handling topic intersections.
   Although not the primary focus, CC was still mentioned in the remaining articles. In particular, 51 articles included CC in contexts marginally related to their main narratives, integrating CC with other discussions. In another 36 articles, CC was a secondary topic, occasionally mentioned only in passing, such as references to the Kyoto Protocol or metaphorical uses of global warming.

4. Overview of NYTAC-CC

In this section, we provide an initial overview of the NYTAC-CC coverage, including the article distribution over time and a preliminary subtopics exploration.

4.1. Temporal and Keyword highlights

We examine the temporal distribution of articles and key lexical features in our corpus to illuminate trends and shifts in CC coverage over time (see Figure 2).
   The analysis reveals a peak in articles during 1990, with up to 50 articles per month, followed by a decline to 20 articles per month in the mid-90s. After the Kyoto Protocol in December 1997, the curve shows a steady rise with intermittent bursts in coverage. In the figure, we have marked important ‘climate events’ corresponding to the years they occurred.
   The frequency ratios of the top eight lexical features determined by the classifier (cf. Figure 1) over time in Figure 3 illustrate the dominance of ‘greenhouse’ in the late 1980s. ‘Warming’ remains the most frequent term throughout, but in the final years, ‘climate’ gains prominence, suggesting a shift of term preference from ‘global warming’ to ‘climate change’, a transition noted in various other studies as well. Also, the two ‘Kyoto’ events
are clearly visible: the international accord was reached in 1997, and the Bush administration’s decision not to ratify it occurred in 2001.

Figure 3: Keyword distributions over time

   At the same time, we also find that many articles focused on weather or pollution primarily addressed these issues directly, mentioning climate change only tangentially. This reduces the co-occurrence of other prominent CC terms in these articles.

4.2. Document Structuring with LDA

Building on the basic statistics discussed in the previous subsection, we delved deeper into the range of subtopics within the CC corpus using topic modeling, specifically Latent Dirichlet Allocation (LDA). This approach helps to uncover underlying thematic structures in the data, which are not immediately apparent from simple keyword analysis.
   Preprocessing Steps. To prepare the texts for LDA, we performed several preprocessing steps on article titles and bodies, including removing punctuation, lemmatizing words, and converting all text to lowercase to ensure consistency. We also joined frequently co-occurring bigrams into single terms to preserve important phrases. For our topic modeling, we focused on nouns and proper nouns that ranked among the top 10,000 by frequency and had more than two letters. This refinement allowed us to emphasize key entities and their relationships, central to the content of the articles, and avoid the dilution of thematic significance by less informative parts of speech, enhancing consistency through the use of pseudowords.
   Model Selection. The best LDA model was chosen based on the coherence score, calculated using the Python Gensim library. This ensures an objective selection process, minimizing subjective interpretation. We prioritized coherence to ensure that the topics generated by the model are interpretable and meaningful. The optimal model identified 18 topics, with a coherence score of 0.56, indicating a reasonable level of interpretability. We chose the highest-ranked term as the ‘name’ of each topic and listed five additional representative terms as follows:

       1. emission: country, world, greenhouse_gas, carbon_dioxide, global_warming
       2. administration: president, policy, white_house, bill, congress
       3. people: time, life, book, world, earth
       4. scientist: temperature, climate, study, research, university
       5. energy: oil, fuel, gas, production, power
       6. city: new_york, people, park, town, mayor, manhattan
       7. company: business, project, program, group, director
       8. global_warming: report, climate_change, scientist, panel, editor
       9. plant: coal, company, emission, power, utility
      10. water: area, land, river, population, fish
      11. state: pollution, air, ozone, epa, smog
      12. china: government, people, war, security, country
      13. car: vehicle, fuel, gasoline, hydrogen, auto
      14. ice: sea, arctic, ocean, glacier, bear
      15. forest: tree, plant, species, fire, crop
      16. weather: winter, temperature, snow, degree, heat
      17. storm: el_nino, drought, hurricane, wind, flood
      18. island: bird, beach, garden, long_island, sand

   As is common with topic models, some overlap between topics can occasionally be observed when examining the complete top-30 term lists, for example, between topics company and plant. Additionally, we find some apparent ‘outlier’ terms in all the topics.
   As a preliminary approximation, we tagged each text in the subcorpus with the predominant topic identified by the model, allowing us to track the evolution of topic coverage over time (see Figure 4). This LDA-based analysis highlights how the context of CC-related coverage in the NYTAC corpus shifts over time, for example from a framing within science and pollution debates to a discourse context in which greenhouse gas emissions were central. Further, our findings complement the manual inspection discussed in Section 3.3, illustrating how climate change discussions, while sometimes secondary in broader articles on government policy (topic ‘administration’), are integral to discussions on foreign policy (‘China’) and cultural topics (‘people’).

Figure 4: Topic coverage over the 20-year period

5. Conclusion and Future Work

In this paper, we introduced the NYTAC-CC, a specialized subcorpus of 3,630 climate change articles from the New York Times Annotated Corpus spanning 1987 to 2007,


marking the first CC analysis with this dataset. Address-          with present by finding corresponding terms across
ing the lack of available news-based textual resources             time, in: Annual Meeting of the Association
for NLP tasks, we employed a hybrid method combining               for Computational Linguistics, 2015. URL: https:
keyword-based prefiltering and automatic classification            //api.semanticscholar.org/CorpusID:1121386.
to optimize the corpus construction. The representative-       [2] O. Alonso, K. Berberich, S. J. Bedathur, G. Weikum,
ness of the subcorpus was confirmed using ClimateBERT,             Time-based exploration of news archives, 2010.
but additional manual inspection of ClimateBERT’s clas-            URL: https://api.semanticscholar.org/CorpusID:
sification of a relevant amount of true positives as (false)       2353972.
negatives also showed the model’s limitations and the          [3] C. Kantner, M. Overbeck, Exploring soft concepts
benefits of the hybrid approach chosen.                            with hard corpus-analytic methods, in: N. Reiter,
   Initial analyses of the subcorpus, including statistics,        A. Pichler, J. Kuhn (Eds.), Reflektierte algorithmis-
keyword searches, and topic modeling, highlight the cor-           che Textanalyse, De Gruyter, Berlin, 2020.
pus’s potential for detailed diachronic and subtopic ex-       [4] N. Webersinke, M. Kraus, J. Bingler, M. Leippold,
ploration.                                                         ClimateBERT: A Pretrained Language Model for
   Thus, the NYTAC-CC subcorpus can be a useful resource
for examining the historical narrative of climate change
in news media. As it builds on the NYTAC corpus, it adds
to previous work on this data and can provide valuable
insights for social science research. It also serves as a
dataset for developing NLP applications that require a deep
understanding of climate-related discourse. While the size
of the subcorpus may restrict certain quantitative analyses,
its rich, concentrated content is well suited to qualitative
studies. Furthermore, it can be expanded and integrated with
additional sources to enhance its utility and relevance for
ongoing climate change research. Future work will expand on
these findings with advanced topic modeling techniques and
integrate more recent articles to enrich the diachronic
analysis.
                                                               [8] T. Diehl, B. Huber, H. G. de Zúñiga, J. H. Liu, So-
                                                                   cial media and beliefs about climate change: A
References                                                         cross-national analysis of news use, political ide-
                                                                   ology, and trust in science, International Jour-
 [1] Y. Zhang, A. Jatowt, S. S. Bhowmick, K. Tanaka,
                                                                   nal of Public Opinion Research (2019). URL: https:
     Omnia mutantur, nihil interit: Connecting past
                                                                   //api.semanticscholar.org/CorpusID:214067785.
 [9] A. Shehata, J. Johansson, B. Johansson, K. Ander-             27 Countries, Global Environmental Change 23
     sen, Climate change frame acceptance and re-                  (2013) 1233–1248.
     sistance: Extreme weather, consonant news, and           [21] M. Hulme, Why we disagree about climate change:
     personal media orientations, Mass Communica-                  Understanding controversy, inaction and opportu-
     tion and Society 25 (2021) 51 – 76. URL: https:               nity, Cambridge UP, Cambridge, 2009.
     //api.semanticscholar.org/CorpusID:238720934.            [22] J. Bingler, M. Kraus, M. Leippold, N. Webersinke,
[10] C. Trumbo, Constructing climate change: claims                How Cheap Talk in Climate Disclosures Relates
     and frames in US news coverage of an environmen-              to Climate Initiatives, Corporate Emissions, and
     tal issue, Publ. Underst. Science 5 (1996) 269–283.           Reputation Risk, Working paper, Available at SSRN
[11] M. Boykoff, The cultural politics of climate change           3998435, 2023.
     discourse in UK tabloids, Political Geography 27
     (2008) 549–569.
[12] P. Legagneux, N. Casajus, K. Cazelles, C. Chevallier,
     M. Chevrinais, L. Guéry, C. Jacquet, M. Jaffré, M.-J.

A. List of Bigrams
climate change, global warming, greenhouse effect, acid
rain, ozone layer, greenhouse gases, fossil fuels,
greenhouse emissions, ice shelves, ice sheets, rising sea,
sea levels, Kyoto Protocol, Montreal Protocol, carbon
footprint, carbon dioxide, carbon neutral, emission trading,
feedback loop, global dimming, renewable energy, Stern
Review.

B. List of Keywords

climate, atmosphere, weather, warming, carbon, greenhouse,
pollution.
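
To illustrate how term lists like those in Appendices A and B could drive a keyword-based prefilter, the following is a minimal sketch and not the procedure used to build NYTAC-CC. The decision rule (at least one bigram match, or at least three keyword matches) and the names `count_matches`, `is_candidate`, `min_bigram_hits`, and `min_keyword_hits` are assumptions made for this example only.

```python
import re

# Term lists copied from Appendices A and B.
BIGRAMS = [
    "climate change", "global warming", "greenhouse effect", "acid rain",
    "ozone layer", "greenhouse gases", "fossil fuels", "greenhouse emissions",
    "ice shelves", "ice sheets", "rising sea", "sea levels", "Kyoto Protocol",
    "Montreal Protocol", "carbon footprint", "carbon dioxide", "carbon neutral",
    "emission trading", "feedback loop", "global dimming", "renewable energy",
    "Stern Review",
]
KEYWORDS = [
    "climate", "atmosphere", "weather", "warming", "carbon", "greenhouse",
    "pollution",
]


def count_matches(text, terms):
    """Return {term: count} for case-insensitive whole-word matches in text."""
    # Collapse whitespace so bigrams split across line breaks still match.
    normalized = re.sub(r"\s+", " ", text.lower())
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        hits = len(re.findall(pattern, normalized))
        if hits:
            counts[term] = hits
    return counts


def is_candidate(text, min_bigram_hits=1, min_keyword_hits=3):
    """Hypothetical rule: keep an article if it has enough bigram OR keyword hits.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    bigram_hits = sum(count_matches(text, BIGRAMS).values())
    keyword_hits = sum(count_matches(text, KEYWORDS).values())
    return bigram_hits >= min_bigram_hits or keyword_hits >= min_keyword_hits
```

Articles that pass such a prefilter would then go to a classifier for the second, automatic stage of a hybrid pipeline. Raising `min_keyword_hits` would trade recall for precision, because single keywords such as "weather" are far more ambiguous than the bigrams.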
     change, Mass Communication and Society 7 (2004)
     359–377.
[15] R. Grundmann, R. Krishnamurthy, The Discourse of
     Climate Change: A Corpus-based Approach, Criti-
     cal Approaches to Discourse Analysis across Disci-
     plines 4 (2010) 113–133.
[16] D. A. Stecula, E. Merkley, Framing Climate Change:
     Economics, Ideology, and Uncertainty in American
     News Media Content From 1988 to 2014, Frontiers
     in Communication 4 (2019).
[17] P. Mishra, R. Mittal, Neuralnere: Neural named
     entity relationship extraction for end-to-end cli-
     mate change knowledge graph construction, in:
     ICML 2021 Workshop on Tackling Climate Change
     with Machine Learning, 2021. URL: https://www.
     climatechange.ai/papers/icml2021/76.
[18] M. Leippold, F. S. Varini, Climatext: A dataset
     for climate change topic detection, in: NeurIPS
     2020 Workshop on Tackling Climate Change
     with Machine Learning, 2020. URL: https://www.
     climatechange.ai/papers/neurips2020/69.
[19] M. Hulme, N. Obermeister, S. Randalls, M. Borie,
     Framing the challenge of climate change in Nature
     and Science editorials, nature climate change 8
     (2018) 515–521.
[20] A. Schmidt, A. Ivanova, M. S. Schäfer, Media At-
     tention for Climate Change around the World: A
     Comparative Analysis of Newspaper Coverage in