Aggregating and Visualizing Collocation Data for Humanitarian
Concepts (Short Paper)
Loryn Isaacs, Pilar León-Araúz
University of Granada, C/ Puentezuelas, 55, Granada, Spain


                                   Abstract
                                   Analyzing a term’s collocations often offers insight into domain-specific usage, yet manually
                                   comparing large data sets of collocations can be unfeasible. This necessitates programmatic
                                   techniques that aggregate large quantities of collocation data and condense results into
                                   manageable visualizations. This paper presents a method to quickly process hundreds of
                                   thousands of corpus queries with a combination of the Sketch Engine API and related open-
                                   source software. A preliminary web application is offered to explore aggregated collocation
                                   data for the humanitarian concepts that make up the Humanitarian Encyclopedia. Potential
                                   applications are discussed with regard to the study of conceptual variation in the humanitarian
                                   sector.

                                   Keywords
                                   Humanitarian terminology, collocation, visualization, corpus, Sketch Engine API

1. Introduction

   Collocations are a well-known source of information regarding a term’s meaning and usage in a
specialized context. The analysis of co-occurring lexical items has been formalized in corpus
management systems, including via user interfaces that present summaries of collocational behavior
and allow for further data exploration. Sketch Engine’s word sketch feature is one example, offering a
summary of how strongly and frequently a term is associated with various types of collocates [1]. This
tool has been utilized to conduct concept analyses for the Humanitarian Encyclopedia, an open platform
for linguistic data and expert discussion on humanitarian terminology [2]. To better facilitate the
development of the encyclopedia’s concept entries, a data exploration method was developed to
condense collocation data from many queries into an interactive visualization. This paper summarizes
the workflow used to extract and explore bulk frequency data with the Sketch Engine API.
   The following sections give an overview of the Humanitarian Encyclopedia and its corpus of
domain-specific texts (Section 2), the API-based data collection method using Python (Section 3), the
Dash (Flask-based) web application designed to explore the data set (Section 4), and areas for future
research (Section 5).

2. The Humanitarian Encyclopedia

   The Humanitarian Encyclopedia is an ongoing collaborative project from the Geneva Centre of
Humanitarian Studies that focuses on studying conceptual variation and elucidating an internationally
shared understanding of key humanitarian concepts. It aims at defining and documenting the dynamics
of concepts that are particularly controversial, fuzzy, or ill-defined within the humanitarian action
domain. It currently focuses on 129 concepts, including HUMANITARIANISM itself, as well as a range of


1
 2nd International Conference on “Multilingual digital terminology today. Design, representation formats and management systems” (MDTT)
2023, June 29–30, 2023, Lisbon, Portugal
EMAIL: lisaacs@ugr.es (L. Isaacs); pleon@ugr.es (P. León-Araúz)
ORCID: 0000-0003-0267-4853 (L. Isaacs); 0000-0002-8520-2749 (P. León-Araúz)
                               © 2023 Copyright for this paper by its authors. Use permitted under Creative
                               Commons License Attribution 4.0 International (CC BY 4.0).
                               CEUR Workshop Proceedings (CEUR-WS.org)
events, strategies, entities, and other phenomena related to humanitarian activities (FOOD SECURITY, DO
NO HARM, INDEPENDENCE, etc.).
    Each concept entry is created according to an approach that combines corpus-driven knowledge
provided by terminologists with expert knowledge from humanitarian practitioners or academics. Each
entry offers a blend of quantitative and qualitative data that describe a concept’s primary characteristics,
the degree to which its usage is homogeneous, and debates among humanitarian actors as to its meaning
or institutional value (see the example for DO NO HARM in Figure 1). The technologies and procedures
used to generate content for the encyclopedia’s concept entries are summarized in [3].


 Figure 1: Humanitarian Encyclopedia entry overview and visualization
    Linguistic data for concept entries are extracted from the Humanitarian Encyclopedia corpus, which
consists of texts compiled from humanitarian websites and public databases, such as UN Women2 and
ReliefWeb3. The corpus contains 71 million words and comprises annual reports, strategy
documents, and general documents in English published from 2005 to 2019. Authors include a variety
of organizations associated with humanitarian efforts, from international actors to local groups. These
are classified into 26 organization subtypes, with a majority representing nongovernmental
organizations, the United Nations, intergovernmental organizations, and the Red Cross. Its 4,814
documents originate mostly from Europe, North America, and Asia, and are tagged by region, year of
publication, document type, and organization type.
    For the Humanitarian Encyclopedia, the phenomenon of collocation plays a key role in identifying
the semantic content of a concept. This can include identifying hypernyms, hyponyms, causes, effects,
term variants and antonyms, as well as controversies, often in reference to a concept’s shared definition
(or lack thereof) and its implementation in humanitarian response. These units tend to be extracted in
Sketch Engine through statistical association with an emphasis on nouns, adjectives, and verbs.
Extraction may be based on collocational strength with the logDice score [4], the identification of
multiword terms, or the analysis of relevant semantic relations (most often hyponymy, meronymy, and
causality) [5]. For instance, classifying the multiword terms in which epidemic is the head reveals the
multiple conceptual dimensions of EPIDEMIC: pathogen (e.g., HIV
epidemic, Zika virus epidemic), cause (e.g., obesity epidemic, tobacco epidemic), morbidity (e.g., global
epidemic, localized epidemic), time (e.g., recurrent epidemic, seasonal epidemic), severity (e.g., deadly
epidemic, severe epidemic, acute epidemic), etc. Other (more distant) collocations highlight the most
mentioned countries struck by epidemics. Verbs, in turn, point to the effects of epidemics or the actions
that can be undertaken before, during, and after an epidemic outbreak. Most verbs occurring with

2
    https://www.unwomen.org/en
3
    https://reliefweb.int/
epidemic as a subject indicate its sudden and violent nature, as they are impact-related verbs (hit, strike,
rage, sweep, break out, devastate), whereas most verbs occurring with epidemic as an object are
response-related (contain, reverse, avert, combat, fight, prevent, curb, stop, etc.), indicating a
subsequent phase in the event of an epidemic. More rarely, there are also verbs indicating anticipation
(prevent, avert, detect).
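    For reference, the logDice score [4] mentioned above can be computed from three corpus frequencies. The following minimal sketch assumes the standard definition, logDice = 14 + log2(2 * f_xy / (f_x + f_y)); the function name and the example frequencies are illustrative and are not taken from the Humanitarian Encyclopedia corpus.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return the logDice association score for a node/collocate pair.

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node
    f_y:  frequency of the collocate
    """
    if min(f_xy, f_x, f_y) <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical frequencies: the pair co-occurs 50 times, the node occurs
# 1,200 times and the collocate 800 times.
score = log_dice(50, 1200, 800)
print(round(score, 2))
```

Because the score depends only on the frequencies of the two items and of their co-occurrence, it is not affected by corpus size, which makes scores from differently sized subcorpora easier to compare.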
    Collocations also help in identifying knowledge-rich contexts (KRCs), such as “Every year, millions of
people around the world experience the devastating effects of disasters such as floods, droughts and
epidemics” or “Health ministers and other experts from the region exchanged views on strengthening
the resilience of health-care systems against epidemics, armed conflicts and other emergencies”. The
analysis of large collections of KRCs shows that depending on how EPIDEMIC is categorized (e.g.,
disaster, emergency), it has slightly different clusters of sibling concepts. For example, when
categorized as disaster, threat, or calamity, sibling concepts are mostly related to natural hazards,
whereas when categorized as emergency, shock, or factor, epidemic is part of a larger humanitarian
frame, including siblings such as food insecurity, nutritional crises, conflicts, displacement, political
crises, poverty, etc.
    All of these collocation-related analyses point to a dynamic conceptualization that can also be
correlated with corpus metadata (e.g., organization type and publication date), which forms the
foundation of conceptual variation analysis. For instance, IGOs show more collocates related to
epidemic types (SARS, Marburg, dengue, polio, hepatitis, influenza) or their impact/origin-related
attributes (deadly, waterborne, devastating), whereas NGOs focus on attributes (devastating, deadly,
lethal) but especially impact (rage, break, hit) and response-related verbs (contain, prevent, reverse).
    For the development of the Humanitarian Encyclopedia, data derived from the above methods are
visualized in each concept analysis as a way to interpret data, correlate variables, and transfer
knowledge to humanitarian experts. Plots can include standard visualizations, such as histograms of
text types and maps of document source countries, as well as bespoke visualizations when merited by
a unique linguistic phenomenon. While this modus operandi is necessary and helpful for conducting
individual analyses, a contrastive approach could offer a more global perspective on the relationships
between terms, collocations, and communicative contexts. Such an approach, however, is not practical
without automating queries and developing a convenient means to explore data. In response, the
following method was developed to visualize the entirety of the Humanitarian Encyclopedia’s
collocation data in one resource.

3. API-based data collection
    Collecting collocation data for each of the Humanitarian Encyclopedia’s concepts required utilizing
Sketch Engine’s API, which allows developers to programmatically execute queries. The software
employed to manage API calls was Sketch Grammar Explorer (SGEX), an API wrapper written in
Python [6]. This tool was employed in conjunction with NoSketch Engine, the software’s open-source
variant [7], in the form of a Docker container maintained by the Eötvös Loránd University Department
of Digital Humanities [8]. Together they provided a means to locally query the corpus and store results
as a single data set.
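    As a rough illustration of what a request to a local NoSketch Engine instance might look like, the sketch below only builds a request URL with the Python standard library. It does not reproduce the SGEX interface used in this work. The host and port, the corpus name, the `collx` endpoint, and the parameter names (`q`, `cmaxitems`, `csortfn`, `format`) are assumptions that should be checked against the API documentation of the installed version.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed address of a locally running NoSketch Engine container.
BASE_URL = "http://localhost:10070/bonito/run.cgi"


def build_collocation_url(corpname: str, cql: str, max_items: int = 20) -> str:
    """Build a URL requesting the top collocates for a CQL query.

    Endpoint and parameter names are assumptions, not confirmed by the paper.
    """
    params = {
        "corpname": corpname,
        "q": "q" + cql,          # assumed convention: "q" prefix + CQL string
        "cmaxitems": max_items,  # assumed name for the maximum number of collocates
        "csortfn": "d",          # assumed code for sorting by logDice
        "format": "json",
    }
    return f"{BASE_URL}/collx?{urlencode(params)}"


# "hum_corpus" is a hypothetical corpus name.
url = build_collocation_url("hum_corpus", '[lemma_lc="epidemic"]')
query = parse_qs(urlsplit(url).query)
print(query["q"][0])
```

Keeping URL construction separate from the network call makes it easy to inspect or log every query before sending many thousands of requests.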
    The primary benefit of using a local instance of NoSketch Engine's API, as opposed to manual file
download or Sketch Engine’s rate-limited API, is to substantially increase data collection rates for large
numbers of queries. The present data collection task, as described below, would have required over 5
months of continual operation to execute API requests to Sketch Engine’s main server at the allowed
rate. In contrast, making requests locally on a consumer desktop with a recent Core i5 Intel processor
reduced this figure to under 10 hours.
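    The scale of this difference can be made concrete with some simple arithmetic. The sketch below uses the total of 298,764 API calls reported later in this section together with the two durations given above. Treating a month as 30 days is an assumption, and the results are bounds implied by "over 5 months" and "under 10 hours", not measured rates.

```python
TOTAL_CALLS = 298_764  # total API calls reported in this section
LOCAL_HOURS = 10       # local run finished in under 10 hours
REMOTE_DAYS = 5 * 30   # over 5 months, assuming 30-day months

# Local throughput was at least this many calls per second.
local_calls_per_second = TOTAL_CALLS / (LOCAL_HOURS * 3600)

# Remote throughput would have been at most this many calls per day.
remote_calls_per_day = TOTAL_CALLS / REMOTE_DAYS

# Lower bound on the speed-up obtained by querying locally.
speedup = (REMOTE_DAYS * 24) / LOCAL_HOURS

print(f"local: at least {local_calls_per_second:.1f} calls/s")
print(f"remote: at most {remote_calls_per_day:.0f} calls/day")
print(f"speed-up: at least {speedup:.0f}x")
```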
    To allow for more granular data manipulation, collocation frequency data were collected by
combinations of text types. While previous visualizations, such as in Figure 1, summarize data for
single concepts and text types (e.g., occurrences of DO NO HARM by year or by region or by organization
type), the current method accepts multiple restrictions. Users could, for example, select multiple
concepts at once by year, region, and organization type (e.g., occurrences of INDEPENDENCE
and IMPARTIALITY for 2013-2015 in European NGOs). Doing so required making 2,316 API calls
for each concept, or 298,764 in total, although such granular text type constraints returned no
concordances in almost half (135,929) of these calls. When API calls did retrieve hits, up to 20 of the
top collocations by logDice were included. Where possible, CQL rules incorporated common
abbreviations and variations. Example query syntax and results with combinations of text type
restrictions are shown in (1) and Table 1.

    (([ lemma_lc = "gender" ] [ word = "(B|b)ased" ] | [ lemma_lc = "gender-based" ])
    [ lemma = "violence" ]) | [ lemma_lc = "GBV" ] within < class ( DATE = "2004-2005|2005" )
    & ( REGION = "Europe" ) & ( TYPE = "General_Document" ) />                                (1)
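
    The way text type restrictions multiply the number of queries can be sketched with the Python standard library. The structure and attribute names (class, DATE, REGION, TYPE) follow example (1); the value lists are shortened, hypothetical samples and do not reproduce the 2,316 combinations used per concept. The helper name `build_query` is likewise illustrative.

```python
from itertools import product

# None stands for "no restriction on this attribute".
# The value lists are hypothetical and far shorter than the real ones.
TEXT_TYPES = {
    "DATE": [None, "2005", "2018"],
    "REGION": [None, "Europe", "Asia"],
    "TYPE": [None, "General_Document"],
}


def build_query(concept_cql, restrictions):
    """Append a `within < class ... />` clause for every active restriction."""
    active = [
        f'( {attr} = "{value}" )'
        for attr, value in restrictions.items()
        if value is not None
    ]
    if not active:
        return concept_cql
    return f"( {concept_cql} ) within < class {' & '.join(active)} />"


concept = '[ lemma_lc = "impartiality" ]'
queries = [
    build_query(concept, dict(zip(TEXT_TYPES, values)))
    for values in product(*TEXT_TYPES.values())
]
print(len(queries))  # 3 * 3 * 2 = 18 combinations
```

Each additional attribute value multiplies the number of queries, which is why fine-grained restrictions quickly reach thousands of calls per concept and why many of them return no concordances.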

Table 1
Random sample of collocation data with text type restrictions
   Row #    Collocate      Concept     logDice  Freq  Date  Region  Org type  Doc type
   585041   negative       impact      5.96     10    2018  Asia    NGO
   996561   environmental  protection  2.36     3     2008          Found(b)  AR(a)
   785890   standardized   monitoring  7.44     4     2019  Europe  IGO
   1108623  hygiene        sanitation  6.09     40    2017  MENA              AR(a)
   502451   crisis         funding     1.12     3     2018  MENA              AR(a)
(a) Activity report
(b) Foundations/funds

   To prepare the data for visualization, API responses were merged into a tabular format and cleaned,
including the removal of unwanted collocates, such as non-words, non-Latin characters, and auxiliary
verbs. This was done in two passes: first by identifying unwanted strings automatically with
Python’s unicodedata module (Appendix A), and then by manually curating an exclusion list of remaining
unwanted items (e.g., xiii, the, be, will). The final data set amounted to 1,236,194 rows with 19,812
unique collocates disaggregated by text type. Among these collocates the most frequent was
humanitarian, at 6,947 cases, followed by disaster, health, community, and development. To verify
the automatically retrieved results, a sample was compared with results retrieved manually via the
user interface.
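
    The second cleaning pass and the summary counts can be sketched as follows. The first three rows reuse values from Table 1, while the rows for the unwanted collocates (the, xiii, will) carry invented values purely for illustration; the field names are hypothetical and do not necessarily match the actual data set.

```python
from collections import Counter

# Rows as they might look after merging API responses into a table.
rows = [
    {"collocate": "negative", "concept": "impact", "logdice": 5.96, "freq": 10},
    {"collocate": "hygiene", "concept": "sanitation", "logdice": 6.09, "freq": 40},
    {"collocate": "crisis", "concept": "funding", "logdice": 1.12, "freq": 3},
    # Invented rows standing in for unwanted collocates.
    {"collocate": "the", "concept": "funding", "logdice": 4.00, "freq": 50},
    {"collocate": "xiii", "concept": "impact", "logdice": 2.00, "freq": 3},
    {"collocate": "will", "concept": "sanitation", "logdice": 3.00, "freq": 12},
]

# Manually curated exclusion list (examples given in the text).
EXCLUDED = {"xiii", "the", "be", "will"}

cleaned = [row for row in rows if row["collocate"].lower() not in EXCLUDED]
unique_collocates = {row["collocate"] for row in cleaned}

# Number of rows in which each collocate appears.
cases = Counter(row["collocate"] for row in cleaned)

print(len(cleaned), len(unique_collocates))
print(cases.most_common(1))
```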

4. Web application design and purpose

    A tool was built with Python and the Dash web application framework [9] to visualize the prepared
Sketch Engine API data. It consists of four elements: an interactive scatter plot, a series of dropdown
and slider components for adjusting parameters, a table of summary statistics, and a URL generator that
redirects user-selected data points to Sketch Engine. These elements are generated automatically based
on the shape of the data set, e.g., adding text filters for corpus structures and numeric filters for
frequencies and logDice scores. Visualizations are generated as users apply filters, offering a
standardized means for analysis and evaluation tasks. The data can be further restricted by sample size,
to show the top n collocates, as well as be displayed with faceting, to show data subsets in separate
plots. Users can then identify areas of interest that would previously have been cumbersome to explore
with disparate queries.
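
    The filtering logic behind the sample size and faceting options can be sketched independently of the Dash layer. The function below is a simplified illustration, not the application's actual code; the field names and the logDice values in the sample rows are invented.

```python
from collections import defaultdict


def top_collocates(rows, n, facet=None, min_logdice=0.0):
    """Group rows by an optional facet column and keep the n strongest per group."""
    groups = defaultdict(list)
    for row in rows:
        if row["logdice"] >= min_logdice:
            key = row[facet] if facet else "all"
            groups[key].append(row)
    return {
        key: sorted(group, key=lambda r: r["logdice"], reverse=True)[:n]
        for key, group in groups.items()
    }


# Illustrative rows only; the scores are invented for this example.
rows = [
    {"collocate": "sanitation", "org_type": "IGO", "logdice": 8.1},
    {"collocate": "internet", "org_type": "IGO", "logdice": 7.4},
    {"collocate": "energy", "org_type": "IGO", "logdice": 6.9},
    {"collocate": "education", "org_type": "NGO", "logdice": 7.8},
    {"collocate": "sanitation", "org_type": "NGO", "logdice": 7.2},
]

# One group of top collocates per organization type (one plot per facet).
facets = top_collocates(rows, n=2, facet="org_type")
igo_top = [row["collocate"] for row in facets["IGO"]]
print(igo_top)
```

In a Dash application, a function of this kind would typically be called from a callback whenever a dropdown or slider value changes, with each returned group drawn as a separate plot.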
    One task facilitated by the visualization is assessing conceptual variation across a specialized
domain. For example, Figure 2 shows a selection of the top collocates for the humanitarian concept of
ACCESS by three organization types: IGO, NGO, and Red Cross. Here an area of interest is the
relationship between the subjects and objects for which access is a challenge. While sanitation appears
in each of the three organization types, internet and energy are exclusive to IGO and education is
exclusive to NGO. Detainee, a type of population that both requires access to resources and which
organizations seek access to, is exclusive to Red Cross. These differences may indicate how parties
represented in the corpus focus their activities on discrete objectives and populations. Discussion of the
meaning of ACCESS, then, could consider commonalities and differences measurable in the corpus
regarding semantic roles: party demanding access, population needing access, authority granting access,
and service being accessed.
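
    The comparison described above amounts to simple set operations. The sketch below encodes only the collocates of ACCESS named in the preceding paragraph, not the full content of Figure 2, and the variable names are illustrative.

```python
# Collocates of ACCESS named in the text, by organization type.
collocates_by_org = {
    "IGO": {"sanitation", "internet", "energy"},
    "NGO": {"sanitation", "education"},
    "Red Cross": {"sanitation", "detainee"},
}

# Collocates found in every organization type.
shared = set.intersection(*collocates_by_org.values())

# Collocates found in one organization type only.
exclusive = {}
for org, words in collocates_by_org.items():
    others = set().union(
        *(w for o, w in collocates_by_org.items() if o != org)
    )
    exclusive[org] = words - others

print(shared)
print(exclusive)
```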


Figure 2: Top collocates for ACCESS by organization type
    As seen in Figure 2 and other examples provided as appendices, the discovery of possible
correlations could aid terminological analysis and help transmit results to humanitarian experts. In that
regard, one consideration is the need to provide sufficient contextual information for guiding proper
data interpretation. Supplementary visualizations describing the shape of the data set and its limitations
would be beneficial for users. For instance, among the 129 concepts there is a wide range of frequencies.
While development and community each appear as terms (as opposed to collocates for other concepts)
over 23,000 times, several polylexical terms have very few cases, like humanitarian-development nexus
(99) and humanitarian imperative (132). Evaluating how combinations of text type restrictions
influence the composition of results will be a necessary next step.

5. Future applications
    The data management approach described here addresses a need for the Humanitarian Encyclopedia
to streamline corpus-based analysis of humanitarian concepts and their collocations. It is an example of
an open-source method for integrating the Sketch Engine API into a workflow for terminological
research. The increased rate and scale of data extraction encourage the development of further
techniques for the exploration and analysis of specialized corpora. The immediate interest for the
Humanitarian Encyclopedia will be
to research how key terms behave across the humanitarian sector, particularly their degree of
standardization among actors and the prevalence of controversies.
    The prototype interface described in this article is part of a larger effort to create an open-source
dashboard for visualizing the Humanitarian Encyclopedia corpus. A central aim is to track
developments in the current usage of humanitarian concepts across the sector. This will be aided by
developing a means to visualize data from multiple sources and integrating other query systems in
addition to Sketch Engine’s. While the automation of many corpus queries allowed for the creation of
a new data set, automation also brings additional challenges for presentation and contextualization.

6. Acknowledgments

   Funding for this work was provided through the Humanitarian Encyclopedia project at the Geneva
Centre of Humanitarian Studies and the research project PROYEXCEL_00369 (VariTermiHum),
funded by the Regional Government of Andalusia (Spain).

7. References

[1] A. Kilgarriff, V. Baisa, J. Bušta et al., The Sketch Engine: Ten years on, Lexicography ASIALEX
    1 (2014) 7–36. doi:10.1007/s40607-014-0009-9.
[2] Humanitarian Encyclopedia. URL: https://humanitarianencyclopedia.org.
[3] S. Chambó, P. León-Araúz, Visualising lexical data for a corpus-driven encyclopaedia, in: I.
    Kosem, M. Cukr, M. Jakubíček, J. Kallas, S. Krek, C. Tiberius (Eds.), Electronic lexicography in
    the 21st century. Proceedings of the eLex 2021 conference, Lexical Computing, Brno, Czech
    Republic, 2021, pp. 29-55.
[4] P. Rychlý, A lexicographer-friendly association score, in: P. Sojka, A. Horák (Eds.), Proceedings
    of the Second Workshop on Recent Advances in Slavonic Natural Languages Processing,
    RASLAN 2008, Masaryk University, Brno, Czech Republic, 2008, pp. 6–9.
[5] P. León-Araúz, A. San Martín, P. Faber, Pattern-based word sketches for the extraction of semantic
    relations, in: P. Drouin, N. Grabar, T. Hamon, K. Kageura, K. Takeuchi (Eds.), Proceedings of the
    5th International Workshop on Computational Terminology (Computerm2016), The COLING
    2016 Organizing Committee, Osaka, Japan, 2016, pp. 73–82.
[6] L. Isaacs, Sketch Grammar Explorer. doi:10.5281/zenodo.6812335.
[7] P. Rychlý, Manatee/Bonito - A modular corpus manager, in: P. Sojka, A. Horák (Eds.), First
    Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2007,
    Masaryk University, Brno, Czech Republic, 2007, pp. 65–70.
[8] Eötvös Loránd University Department of Digital Humanities, NoSketch-Engine-Docker. URL:
    https://github.com/ELTE-DH/NoSketch-Engine-Docker.
[9] Plotly Technologies Inc., Collaborative data science, 2015. URL: https://plot.ly.

8. Appendices

Appendix A: Function for automatic string exclusion

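    import re
    import unicodedata

    # Flags strings containing any character other than ASCII letters, hyphens, periods, or apostrophes
    pattern = re.compile(r".*[^a-zA-Z\-.']+.*")
    regex_drops = set()  # strings flagged for exclusion

    # list_of_unique_strings holds the unique collocates gathered from the API responses
    for collocate in list_of_unique_strings:
        # Decompose accented characters (NFD), then drop the combining marks
        normalized = unicodedata.normalize("NFD", collocate)
        canonical = "".join(
            char for char in normalized if not unicodedata.combining(char))
        regex_drops.update(pattern.findall(canonical))
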
Appendix B: Top collocates for ADAPTATION by region (excluding climate change)


Appendix C: Top collocates for CONFLICT by date in European documents