<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>from Ukraine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxim Mironov</string-name>
          <email>maxim.mironov@campus.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Marquard</string-name>
          <email>a.marquard@campus.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Racek</string-name>
          <email>daniel.racek@stat.uni-muenchen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Heumann</string-name>
          <email>chris@stat.uni-muenchen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul W. Thurner</string-name>
          <email>paul.thurner@gsi.uni-muenchen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Aßenmacher</string-name>
          <email>matthias@stat.uni-muenchen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Location Reference Recognition, Geoparsing, Natural Language Processing, Named Entity Recognition</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Statistics, LMU Munich</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Geschwister Scholl Institute of Political Science (GSI), LMU Munich</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Munich Center for Machine Learning (MCML)</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The dynamics of contemporary social media communication, particularly on platforms like X (formerly Twitter), have significantly evolved, and this data is frequently used for scientific research. However, due to X's API changes in 2019, a tweet's precise geolocation is no longer present in the data, thus preventing a geographical assessment of tweets. This project aims to extract location mentions from tweets' texts and to map them to Ukraine's administrative regions. We have developed a specialized pipeline for geoparsing with specific prebuilt components for the Ukrainian, Russian, and English languages. The main advantage of our pipeline's architecture is the interchangeability of all components, allowing for the integration of custom-developed solutions. Initial tests on our hand-labeled Ukrainian dataset show promising results in accurately identifying and mapping location mentions despite various challenges, such as declension and the presence of multiple languages in a single tweet. Additional experiments using publicly available benchmark data further indicate promising performance when transferring our pipeline to other geographical regions. Both our geoparsing pipeline and its online documentation have been made publicly available.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Geotagged social media posts constitute a valuable resource for researchers across numerous fields,
including hazard management [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], public health [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and politics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, in 2019, X (formerly
Twitter), one of the most important platforms for such research [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], removed the option for precise
geotagging [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This has prompted researchers to increasingly develop methods for extracting
geolocation data from the text of social media posts and mapping it to specific coordinates [7].
This process, commonly referred to in the literature as geoparsing [7], comes with various difficulties.
Social media posts are often multilingual, sometimes even written in multiple languages at once, and
frequently contain misspellings and informal language. Moreover, determining whether a post refers to
a specific location depends on its broader context. Geoparsing consists of two primary steps: first, location
mention recognition (also known as location reference recognition or toponym recognition), which
detects mentioned locations in text, and second, geocoding (also known as toponym resolution), which
identifies these locations and assigns geographic coordinates to them [8]. While numerous studies
have focused on location mention recognition [9], full geoparsing pipelines are scarce, often lack
transparency, and are limited to the most commonly spoken languages such as English, making them
unsuitable for many research projects.
      </p>
      <p>Contributions. In this work, we present a fully transparent geoparsing pipeline designed for tweets
from Ukraine, written in the three most prevalent languages: Ukrainian, Russian, and English. Our
geoparser TBGAT (text-based geographical assignment of tweets) matches each tweet to the geographic
coordinates of the mentioned locations. We compare our method to a spaCy-based geoparser, showcase
its superior performance, and analyze the locations mentioned in tweets posted before and during the
Russian invasion of Ukraine, using the data previously employed to analyze language use in [10].
Although designed for tweets from Ukraine, the
pipeline is highly adaptable and can be used to match any social media posts to Ukrainian locations.
Furthermore, extensions to other regions and languages are easily possible using our publicly available
implementation<sup>1</sup>, as we also showcase on the IDRISI-RE benchmark dataset [11].</p>
      <p>∗Corresponding author.
†These authors contributed equally.</p>
      <p>CEUR, ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>To date, the majority of studies have focused on location mention recognition (LMR). As noted by [9],
these approaches can be broadly categorized into rule-based methods, gazetteer matching, statistical
learning, or hybrids of these techniques. Rule-based approaches rely on predefined rules, such as
regular expressions, to identify recurring patterns that denote location mentions. However, defining
a comprehensive and robust set of rules remains challenging. Gazetteer-based methods match
n-grams from the text with entries in location dictionaries, along with additional heuristics for filtering
and disambiguation. The main challenges for this approach include collecting a complete set of
locations and addressing variations in names and context-specific ambiguities. Statistical models are
trained on annotated text corpora, learning to identify and extract location information from unlabeled,
previously unseen texts. A large strand of literature employs Named Entity Recognition (NER), which
classifies portions of text as specific types of entities, including location entities. In recent years,
with the emergence of large language models (LLMs) such as BERT [12] and its successors, the focus
has shifted towards improving these deep learning-based NER models for LMR. Nonetheless, these
models face obstacles such as limited availability of annotated training data or handling social media’s
frequent misspellings and informal language. Hence, hybrid approaches, which combine any of the
aforementioned techniques, have been designed to overcome some of these issues. For an extensive
review and comparison of LMR methods, we refer to the survey by [9].</p>
      <p>Geoparsing, which combines location mention recognition with geocoding, plays a crucial role across
many disciplines. Applications include, among others, disaster response [13, 14], disease surveillance
[15], traffic control [16], crime management [17], and geographic information retrieval [18]. However,
the number of freely available geoparsing tools is limited and most are not actively maintained. The
complexity of geoparsing arises from the unique characteristics of each use case, which vary by the
type of text (e.g., social media vs. news articles), language(s) present, and the geographic area to be
considered. Each language requires a customized approach to LMR, affecting all methods similarly.
Moreover, handling misspellings typically requires use-case-specific solutions. To achieve optimal
performance, geoparsing pipelines must also be set to a certain geographic area and level of granularity
for the matching process (e.g., administrative zone vs. street level). All of these factors contribute
significantly to the complexity of designing and implementing an effective (open-source) geoparser.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>For the development of our geoparsing pipeline, we considered four key aspects: efficiency, accuracy,
transparency, and customizability. The simplicity of our pipeline facilitates easy customization and
extension. Moreover, all modules within our pipeline are interchangeable, allowing for the integration
of custom-developed solutions. In short, the structure of the pipeline can be described as follows:</p>
      <list list-type="order">
        <list-item><p>Preprocessing the texts to ensure they comply with the requirements of the subsequent modules.
This involves text normalization, clearing of whitespace, and other necessary adjustments.</p></list-item>
        <list-item><p>Detecting the language(s) used in the tweet. In cases where multiple languages are present, the
text is split into appropriate parts for further processing.</p></list-item>
        <list-item><p>Next, we perform location mention recognition (LMR) using:</p>
          <list list-type="bullet">
            <list-item><p>Gazetteers, employing the Aho-Corasick algorithm [19] for efficient pattern matching.</p></list-item>
            <list-item><p>Transformer models, fine-tuned on the NER task.</p></list-item>
          </list>
        </list-item>
        <list-item><p>We then map the locations identified in step three to coordinates obtained using OpenStreetMap (OSM) (geocoding).</p></list-item>
        <list-item><p>Lastly, we conduct various post-processing tasks such as a further mapping of the obtained
coordinates to first-level administrative regions (Ukrainian Oblasts), checking for special cases in
the found locations, and output formatting.</p></list-item>
      </list>
      <p>1 https://github.com/Maxim-M-D/tbgat/</p>
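      <p>The interchangeability of the modules can be illustrated with a minimal sketch in which every step is a plain function that receives and returns the same record. All names here (Doc, preprocess, detect_language, make_gazetteer_step, run_pipeline) are hypothetical and chosen for illustration only; they are not the interface of the TBGAT package. The language step is a toy stand-in, and geocoding and post-processing (steps 4 and 5) are omitted.</p>
      <preformat>
```python
import unicodedata
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Record handed from step to step (hypothetical, not TBGAT's own class)."""
    text: str
    languages: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)


def preprocess(doc):
    # Step 1: collapse runs of whitespace.
    doc.text = " ".join(doc.text.split())
    return doc


def detect_language(doc):
    # Step 2 (toy stand-in): only distinguishes Cyrillic from Latin script.
    cyrillic = any(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in doc.text
    )
    doc.languages = ["uk-or-ru"] if cyrillic else ["en"]
    return doc


def make_gazetteer_step(place_names):
    # Step 3: build an LMR step from any collection of place names.
    def step(doc):
        lowered = doc.text.lower()
        doc.mentions = [n for n in place_names if n.lower() in lowered]
        return doc
    return step


def run_pipeline(text, steps):
    # Apply the steps in order; exchanging a component means
    # exchanging one entry of the list.
    doc = Doc(text=text)
    for step in steps:
        doc = step(doc)
    return doc


steps = [preprocess, detect_language, make_gazetteer_step(["Kyiv", "Lviv"])]
result = run_pipeline("  Greetings   from Kyiv ", steps)
print(result.text, result.languages, result.mentions)
```
      </preformat>
      <p>Because each step only depends on the shared record, a custom language detector or a different gazetteer can be substituted without touching the remaining steps.</p>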
      <sec id="sec-3-1">
        <title>3.1. Dataset &amp; Labelling</title>
        <p>We use the dataset of tweets from [10], who studied language use on X (formerly Twitter)
in Ukraine before and during the Russian invasion. We draw on the ∼2.3M tweets in the three most
common languages (English, Ukrainian, and Russian) and match these to the geographic coordinates of
the mentioned locations using our pipeline. To evaluate quality and performance, we draw a (stratified)
random sample of 3000 tweets in total, consisting of 1000 tweets in each language (English, Ukrainian,
and Russian). This sample was then manually labeled by a native speaker, who extracted each mentioned
location<sup>2</sup> exactly as it appeared in the tweet. Additional geo-information in the form of latitude and
longitude was added, based on the coordinates available in OpenStreetMap (OSM)<sup>3</sup>, a collaborative
project that provides freely accessible geographic data and has previously been employed for similar
applications [20]. These labeled location coordinates are then compared to the coordinates assigned by
our geoparsing pipeline.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pipeline Components</title>
        <p>Having discussed the data labeling process, we now turn our attention to the internal structure of the
pipeline, namely the modules, which are specifically designed to handle geoparsing in the context of
processing tweets from Ukraine in English, Russian, and Ukrainian. However, we encourage
adaptations for a broader range of geoparsing tasks. On a high level, the pipeline components can be
separated into three distinct modules: a processing, an extraction, and a mapping module (cf. Fig. 1).</p>
        <p>2 Only locations in Ukraine were labeled.
3 https://www.openstreetmap.org/, for more details see Appendix B</p>
        <p>The processing module consists of two individual components: the pre-processing module, which
normalizes the data and detects the language in which the tweet is written, and the
“special case matcher module”, which in the context of tweets from Ukraine handles the extraction
and mapping of the occupied territories and of common misspellings that could not otherwise be
mapped. After the pre-processing module has normalized the data and assigned a language to the tweet,
the geographical information present in the text is extracted via the extraction modules. First, pattern
matching is applied via gazetteers, after which a NER model extracts further geographical details. This
information is then mapped by the “mapping module” using the (locally stored) openly available OSM
data. The subsequent “special case matcher module” helps to map edge cases for which no OSM data is
available, providing a final quality check.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Processing Module</title>
        <p>Normalization. While the use of cursive writing and emojis on social media allows for creative
expression, these features pose challenges for LMR, complicating not only pattern matching with
gazetteers and NER but also earlier steps such as language identification.</p>
        <p>
          To combat this, we start by normalizing tweets to use standard ASCII characters. For example,
kharkiv might be encoded as ”1D48C 1D489 1D482 1D493 1D48C 1D48A 1D497” in hexadecimal
representation, which highlights the usage of non-ASCII characters. For reference, the
hex representation of kharkiv using standard ASCII characters is ”6B 68 61 72 6B 69 76”. At
the same time, we would like to keep Cyrillic letters as in Russian or Ukrainian tweets. For this task, we
utilize the Python library unicodedata<sup>4</sup>, which can be used to normalize strings according to the normal
form KD (NFKD, Normalization Form Compatibility Decomposition), i.e. it replaces all compatibility characters
with their equivalents and decomposes precomposed characters (for example, U+00C7, the Latin capital letter C
with cedilla, becomes U+0043, the Latin capital letter C, followed by the combining cedilla U+0327).
Further, we apply normalizations in the form of removing URLs, removing mentions
and @-symbols, removing hashtag symbols, removing emojis, and finally removing extra whitespace.
        </p>
        <p>
          Language Detection. The next step in the pre-processing module is language identification. While
most texts in many applications are monolingual, the scenario shifts significantly for microblogs,
where the phenomenon of code-switching, resulting in code-mixed texts (texts composed of multiple
languages), is prevalent. Research on language identification has traditionally focused on monolingual
texts [see e.g. 21, 22, 23], with comparatively little attention given to code-mixed texts. One challenge
specific to language identification in microblogs, such as tweets, is their shortness, which reduces
the effectiveness of many language identification tools including Google’s CLD2 or CLD3, FastText,
and langid [
          <xref ref-type="bibr" rid="ref7">24</xref>
          ]. To address the issue of potential multilingualism in a tweet, we utilize the Lingua
Python package<sup>5</sup>, which offers not only superior accuracy but
also enhanced (computational) performance over conventional methods. Lingua employs a probabilistic
n-gram model utilizing the character distribution from a training corpus, extending the typical tri-gram
model to include n-grams ranging from 1 to 5 in size. This extension allows for more accurate language
predictions, particularly in shorter texts, where fewer n-grams are present and the probability
estimates derived from them are therefore less reliable. Lingua additionally incorporates a rule-based engine
that complements its statistical model. This engine initially identifies the alphabet of the input text
and searches for characters unique to specific languages. If a single language can be conclusively
identified through this method, the statistical model is not required. The rule-based engine also serves
to eliminate languages that do not meet the criteria of the input text before the probabilistic n-gram
model is considered. This not only saves memory but also improves runtime performance by reducing
the number of language models loaded.
        </p>
        <p>4 https://docs.python.org/3.11/library/unicodedata.html
5 https://github.com/pemistahl/lingua-py
        </p>
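        <p>The normalization steps described in this section can be sketched with the Python standard library alone. This is a minimal illustration rather than the TBGAT implementation: the function name normalize_tweet, the two regular expressions, the recomposition with NFC after NFKD (added here so that Cyrillic letters such as ї and й remain single code points), and the rough emoji filter based on the Unicode category “So” are all assumptions.</p>
        <preformat>
```python
import re
import unicodedata

# Simple illustrative patterns (assumptions, not the pipeline's own rules).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize_tweet(text):
    # Compatibility decomposition maps stylized letters to plain ones.
    text = unicodedata.normalize("NFKD", text)
    # Recompose so that letters like Cyrillic "й" stay one code point.
    text = unicodedata.normalize("NFC", text)
    # Remove URLs and mentions, then leftover @ and hashtag symbols.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "").replace("@", "")
    # Rough emoji filter: drop characters of category "Symbol, other".
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    # Collapse extra whitespace.
    return " ".join(text.split())


# "kharkiv" written with the stylized code points quoted in the text.
styled = "".join(
    chr(cp)
    for cp in (0x1D48C, 0x1D489, 0x1D482, 0x1D493, 0x1D48C, 0x1D48A, 0x1D497)
)
print(normalize_tweet(styled))
print(normalize_tweet("@user Привіт з #Київ https://t.co/abc  \U0001F642"))
```
        </preformat>
        <p>The category-based emoji filter is deliberately coarse: it also removes other symbols and does not handle emoji modifiers, so a dedicated emoji list may be preferable in practice.</p>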
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Extraction Modules</title>
        <p>Pattern Matching: Aho-Corasick FSA. For pattern matching, we employ the Aho-Corasick finite state
automaton (FSA) [19], an algorithm designed to efficiently search for multiple patterns simultaneously. The
Aho-Corasick algorithm constructs a finite state automaton from a set of strings, effectively creating
a trie (pronounced tree, but originates from retrieval) structure with additional failure links. Each
failure link connects a node to the node that represents the longest proper suffix of the string
corresponding to the current node that is itself a prefix of one of the patterns. This structure allows
the algorithm to transition between trie nodes
without requiring a restart from the root, thus avoiding unnecessary reprocessing of the input text [19],
cf. Appendix A.</p>
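        <p>To make the role of the failure links concrete, the following is a compact, self-contained sketch of the Aho-Corasick construction and search. It is a textbook-style illustration under our own naming (build_automaton, find_all) and not the code used in the pipeline, which may rely on an optimized library instead.</p>
        <preformat>
```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # trie transitions, one dict per node
    fail = [0]    # failure link per node (0 is the root)
    out = [[]]    # patterns that end at each node
    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Phase 2: breadth-first pass that sets the failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links instead of restarting from the root.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["kyiv", "kharkiv", "lviv"])
print(find_all("from kyiv to kharkiv", automaton))
```
        </preformat>
        <p>Each hit is a pair of start offset and matched pattern. The text is scanned once regardless of the number of patterns, which is what makes the approach attractive for large gazetteers.</p>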
        <p>
          Pattern Matching: Gazetteers. For our gazetteers, we compile city names from various sources
to enhance the accuracy and comprehensiveness of our location mention recognition system. This
compilation involves a manual collection of city names from Wikipedia<sup>6</sup>, which provides approximately
230 city names across three languages: English, Ukrainian, and Russian. Additionally, we utilize the
Humanitarian Data Exchange Project<sup>7</sup> to obtain a more extensive list, encompassing roughly 100,000
populated places in Ukraine, also available in English, Ukrainian, and Russian. These sources
are integrated to create three distinct gazetteers, each varying in granularity. The smallest gazetteer
focuses solely on locations within the first and second administrative levels, offering a more targeted
dataset. The medium-sized gazetteer expands this scope to include locations from the first, second,
and third administrative levels, providing a broader, yet still manageable, collection of locations.
The largest gazetteer encompasses locations from all four administrative levels, offering the most
comprehensive coverage. This tiered approach allows for flexible application of the gazetteers based on
the specific needs of the task, whether it requires a targeted set of locations or extensive coverage.
        </p>
        <p>
          Named Entity Recognition. Related work [
          <xref ref-type="bibr" rid="ref8">25</xref>
          ] effectively illustrates the limitations inherent in
models that rely solely on gazetteers for geographical coverage. These models can be overly restrictive
and may also lead to mismatches due to their inability to contextualize the data they process. For
instance, during preliminary analysis, we identified villages in Ukraine named “Lazy” and “Smile”.
However, the context-unaware nature of gazetteers led to the erroneous recognition of the common
words “lazy” and “smile” as location names when they appear in standard conversations. To tackle
this context-unawareness, we additionally employ NER. We observed that using individual NER models
trained on specific languages performed better than a single multilingual model. Specifically, for the
“quality” version<sup>8</sup> of the pipeline we utilize the following BERT-based models from Huggingface [
          <xref ref-type="bibr" rid="ref9">26</xref>
          ], each
fine-tuned for NER tasks tailored to different languages and datasets:
• For English, we employ a BERT model fine-tuned for NER on X data [
          <xref ref-type="bibr" rid="ref10">27</xref>
          ]. This model was
trained on a corpus of 154 million tweets, making it highly adept at recognizing and classifying
named entities within the informal and often abbreviated language commonly found on X.
• For Russian tweets, we utilize a BERT model fine-tuned on the AmazonScience MASSIVE dataset
[
          <xref ref-type="bibr" rid="ref11 ref12">28, 29</xref>
          ].
• Ukrainian tweets were processed using another BERT-based model [
          <xref ref-type="bibr" rid="ref13">30</xref>
          ] fine-tuned on the
Slavic NER dataset<sup>9</sup>.
        </p>
        <p>By employing these specialized, fine-tuned models, our pipeline is well-equipped to handle the
intricacies and linguistic variations present in tweets across these three languages.</p>
        <p>6 https://en.wikipedia.org/wiki/List_of_cities_in_Ukraine
7 https://data.humdata.org/dataset/ukraine-populated-places
8 See Results section
9 https://github.com/SlavicNLP/SlavicNER</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Mapping Module</title>
        <p>After a location mention has been extracted via NER and the gazetteers, it is passed on to the mapping
module. For this purpose, we utilize OSM. The integration of OSM in our pipeline allows us to leverage
its comprehensive and up-to-date geographic database to associate each identified location with its
corresponding administrative region. To implement this, once locations are identified and extracted
from the tweets, each location name is queried against the OSM database to retrieve its geographic
coordinates. Employing OSM for this task offers several advantages, including access to a global scope
of geographic information and the ability to receive updates and corrections from a vast community
of users. This ensures that our location mapping remains accurate and reflects current administrative
boundaries, thereby improving the reliability of the subsequent analysis or application that relies on
this geocoded data.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Post-Processing Module</title>
        <p>The “Post-Processing Module” is the last component in the pipeline and is used for output-related tasks
such as output formatting and, in our specific use case, for the admin-level mapper and the special case
matcher.</p>
        <p>Admin-level Mapper. The retrieved geographical coordinates are then used to determine the
administrative region in which a given location lies by comparing them with the geographical properties of the
administrative regions, which we obtained from the Humanitarian Data Exchange project.<sup>10</sup></p>
        <p>Special Case Matcher. As a last step before output formatting, we use the special case processing
module, which addresses some of the issues specific to our task, such as the presence of occupied
territories and common but extreme misspellings of location names. Such problems present substantial
challenges in accurately assigning geographic data to the correct administrative regions, particularly
because our methodology heavily relies on the ability to query geographical information from OSM
successfully. To address these special cases, we implement a dictionary-based approach involving
predefined dictionaries that contain the properties of locations known to be problematic. Each entry
in these dictionaries includes the correct mapping for a location, accounting for its unique circumstances.</p>
        <p>Output Formatting. The last step of the pipeline is the formatting of the output. At the time
of writing, the output provided by the post-processing layer can be
characterized as follows: similar to spaCy, we return a “DOC object” for each individual row of the
data. The DOC object can be thought of as a list that contains all information found by the pipeline.
This includes the found location, its position in the sentence, geographical information, and the obtained
administrative level of the found location, among other information<sup>11</sup>.</p>
        <p>10 https://data.humdata.org/dataset/geoboundaries-admin-boundaries-for-ukraine</p>
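        <p>The admin-level mapping described above amounts to a point-in-polygon test. The sketch below uses the classic ray casting method on toy rectangles; the function names (point_in_polygon, assign_region) and the region outlines are hypothetical and do not correspond to real boundaries. We assume simple polygons given as lists of (longitude, latitude) pairs, whereas real administrative boundaries are usually multipolygons and would typically be handled by a geometry library.</p>
        <preformat>
```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how often a ray going east crosses the outline."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return inside


def assign_region(lon, lat, regions):
    # Return the first region whose outline contains the point, else None.
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Toy outlines (hypothetical, not real administrative boundaries).
regions = {
    "Region A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Region B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
print(assign_region(1, 1, regions))
print(assign_region(3, 1, regions))
print(assign_region(5, 1, regions))
```
        </preformat>
        <p>A location for which no region is found (the None case) is the kind of input that a dictionary-based fallback such as the special case matcher could then handle.</p>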
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <sec id="sec-4-1">
        <title>4.1. Ukraine Benchmark</title>
        <p>
          We evaluate two variants of our pipeline: a performance version, which in the extraction module only
uses the pattern matching component to speed up the geoparsing process, and a quality version, which
in addition utilizes the NER component as described in Section 3.4. We compare our results with the
spaCy-based Python package geoparser<sup>12</sup> [
          <xref ref-type="bibr" rid="ref14">31</xref>
          ] on our labeled Ukrainian dataset described in Section
3.1, using both accuracy and F1 score across all three languages. We additionally report average GPU
runtimes across three runs on a common consumer-level GPU (NVIDIA RTX 3070) as well as CPU
runtimes on an Intel Core i5-1345U CPU.
        </p>
        <p>As shown in Table 1, TBGAT performs well in both predictive accuracy and computational efficiency.
Compared to geoparser, it improves overall accuracy by up to 2.1 percentage points (+4.1%) and increases
the F1 score by 0.19 (+38.8%). Performance increases can be observed across the board for both
pipeline variants. GPU runtime is reduced by a factor of 3.6, reaching up to 11.4 in the performance-optimized
version. This reduction in runtime is essential for making it feasible to process and match millions of
tweets to locations. Additionally, we observe performance differences between the languages, with
the performance-optimized pipeline, which excludes NER, slightly outperforming the quality-focused
version for Russian tweets.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. IDRISI-RE Benchmark</title>
        <p>
          To validate our approach and compare its performance on a well-known benchmark dataset, we
use the publicly available IDRISI-RE dataset [11], selecting all English-language sub-datasets located in
the US. As shown in Table 2, our pipeline, despite being originally designed for tweets from Ukraine
only, can easily be adapted to other countries and regions by exchanging individual components and
achieves competitive performance.<sup>13</sup> Since our framework allows for an easy exchange of components,
refinements in the pipeline could further improve this performance.
        </p>
        <p>11 For more details, please see the GitHub repository.
12 Geoparser is one of the very few geoparsing libraries based on state-of-the-art NER from spaCy that is also actively
maintained. In contrast, most other geoparsing libraries are difficult to set up, often due to complex installation processes or
other technical challenges.
13 In the extraction module, for simplicity, we remove the pattern matching and rely only on a RoBERTa-based NER model [
          <xref ref-type="bibr" rid="ref15">32</xref>
          ]. We additionally set our mapping module, i.e. OSM, to the US.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ukrainian Tweet Analysis</title>
        <p>In addition to the benchmarks, we also applied our pipeline to all ∼ 2.3M tweets, for which we present
our findings below. We plot the geographical distribution of tweets across all first-level administrative
zones in Figure 2, revealing that most tweets mention Kyiv Oblast, followed by Crimea and Kharkiv.</p>
        <p>In Figure 3, we visualize the number of tweets over time for the five administrative zones with the
most location mentions. As evident from the graphs, there is a clear spike with the start of the Russian
invasion. Generally, the spikes for the different locations seem to align with the course of the war. For
example, Kyiv Oblast was mainly targeted in the early stages of the invasion. By the start of April,
most of the Russian troops were successfully forced out, which is also noticeable in the decrease in
tweets. Another example is Kharkiv Oblast. In September, the Ukrainian troops launched a major
counteroffensive, in which they successfully reclaimed multiple cities in the Oblast. In our data, this
offensive coincides with a major increase in tweets mentioning Kharkiv Oblast during that time.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Limitations</title>
      <p>Our TBGAT pipeline successfully matches multilingual tweets from Ukraine to their mentioned locations.
Notably, its use extends beyond tweets, as it is similarly capable of matching any other social media post
to Ukrainian locations. While the flexibility of our pipeline offers many advantages over other available
packages, it still has several limitations. One of the biggest limitations concerns the complexity of
Russian and Ukrainian grammar. Russian and Ukrainian are both characterized by strong declension<sup>14</sup>.
This often hinders the ability to map the locations identified in the text to the OSM layer, as the declined
location names cannot be found in the OSM database. A possible solution would be, for example, the
introduction of a “language normalization” module specific to each language.</p>
      <p>14 ’declension’: the inflection of nouns, pronouns, or adjectives for case, number, and gender, for an example see Appendix</p>
      <p>Another problem is that the pipeline cannot account for heavy misspellings, which regularly occur
in social media posts. Furthermore, the ongoing renaming of Ukrainian cities poses a challenge. While
the renaming can be tackled on an administrative level via OSM, interpersonal communication cannot
be strictly mandated, and some inhabitants still refer to cities by their old name, as we observe in our
data. Due to this, the matching of a mentioned location to coordinates is not always possible, as the old
city name may simply no longer exist in any of the geographical databases. An expansion of our
“special case matcher” in the post-processing module can potentially offset this issue; however, this
requires regular updating in order to guarantee correctness.</p>
      <p>
        Finally, we want to make researchers aware that X (formerly Twitter) has recently, similar to many other
social platforms, severely restricted (research) access to its API. Additionally, data sharing is
usually either restricted or entirely forbidden according to social media platforms’ legal terms [
        <xref ref-type="bibr" rid="ref16">33</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Outlook</title>
      <p>We have identified several goals for future work. First, we would like to implement a custom
module that normalizes Russian and Ukrainian tweets in order to tackle declension. Second, fine-tuning
a NER model for location recognition at a more granular level, e.g., to map specific locations such
as Maidan square, is another avenue for future work, as it is currently not possible to identify these
correctly. Finally, a deeper analysis of the tweets and their corresponding locations in Ukraine with
respect to the invasion should prove promising for political scientists seeking to better understand all facets of
the war.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Matthias Aßenmacher was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research
Foundation) under the National Research Data Infrastructure – NFDI 27/1 - 460037581.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors are not native English speakers; therefore, we used ChatGPT and Grammarly to assist with
grammar corrections, spell-checking, and rewriting certain passages for clarity and conciseness.</p>
      <p>[7] S. E. Middleton, G. Kordopatis-Zilos, S. Papadopoulos, Y. Kompatsiaris, Location extraction
from social media: Geoparsing, location disambiguation, and geotagging, ACM Transactions on
Information Systems (TOIS) 36 (2018) 1–27.
[8] E. Aldana-Bobadilla, A. Molina-Villegas, I. Lopez-Arevalo, S. Reyes-Palacios, V. Muñiz-Sanchez,
J. Arreola-Trapala, Adaptive geoparsing method for toponym recognition and resolution in
unstructured text, Remote Sensing 12 (2020) 3041.
[9] X. Hu, Z. Zhou, H. Li, Y. Hu, F. Gu, J. Kersten, H. Fan, F. Klan, Location reference recognition from
texts: A survey and comparison, ACM Computing Surveys 56 (2023) 1–37.
[10] D. Racek, B. I. Davidson, P. W. Thurner, X. X. Zhu, G. Kauermann, The russian war in ukraine
increased ukrainian language use on social media, Communications Psychology 2 (2024) 1.
[11] R. Suwaileh, T. Elsayed, M. Imran, Idrisi-re: A generalizable dataset with benchmarks for location
mention recognition on disaster tweets, Information Processing and Management 60 (2023)
103340. URL: https://www.sciencedirect.com/science/article/pii/S0306457323000778. doi:10.1016/j.ipm.2023.103340.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, 2019. arXiv:1810.04805.
[13] E. Shook, V. K. Turner, The socio-environmental data explorer (sede): a social media–enhanced
decision support system to explore risk perception to hazard events, Cartography and Geographic
Information Science 43 (2016) 427–441.
[14] A. Kruspe, J. Kersten, F. Klan, Detection of actionable tweets in crisis events, Natural Hazards and
Earth System Sciences 21 (2021) 1825–1845.
[15] P. Scott, M. K.-F. Bader, T. Burgess, G. Hardy, N. Williams, Global biogeography and invasion risk
of the plant pathogen genus phytophthora, Environmental Science &amp; Policy 101 (2019) 175–182.
[16] J. He, W. Shen, P. Divakaruni, L. Wynter, R. Lawrence, Improving traffic prediction with tweet
semantics, in: Twenty-third international joint conference on artificial intelligence, Citeseer, 2013.
[17] T. Dasgupta, A. Naskar, R. Saha, L. Dey, Crimeprofiler: Crime information extraction and
visualization from news media, in: Proceedings of the international conference on web intelligence,
2017, pp. 541–549.
[18] N. Freire, J. Borbinha, P. Calado, B. Martins, A metadata geoparsing system for place name
recognition and resolution in metadata records, in: Proceedings of the 11th annual international
ACM/IEEE joint conference on Digital libraries, 2011, pp. 339–348.
[19] A. V. Aho, M. J. Corasick, Efficient string matching: an aid to bibliographic search, Commun. ACM
18 (1975) 333–340. URL: https://doi.org/10.1145/360825.360855. doi:10.1145/360825.360855.
[20] S. Malmasi, M. Dras, Location mention detection in tweets and microblogs, in: Computational
Linguistics: 14th International Conference of the Pacific Association for Computational Linguistics,
PACLING 2015, Bali, Indonesia, May 19-21, 2015, Revised Selected Papers 14, Springer, 2016, pp.
123–134.
[21] B. Hughes, T. Baldwin, S. Bird, J. Nicholson, A. MacKinlay, Reconsidering language identification
for written language resources, in: N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani,
J. Odijk, D. Tapias (Eds.), Proceedings of the Fifth International Conference on Language Resources
and Evaluation (LREC’06), European Language Resources Association (ELRA), Genoa, Italy, 2006.
URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/459_pdf.pdf.
[22] T. Baldwin, M. Lui, Language identification: The long and the short of the matter, in: R. Kaplan,
J. Burstein, M. Harper, G. Penn (Eds.), Human Language Technologies: The 2010 Annual Conference
of the North American Chapter of the Association for Computational Linguistics, Association for
Computational Linguistics, Los Angeles, California, 2010, pp. 229–237. URL: https://aclanthology.org/N10-1027.
[23] B. King, S. Abney, Labeling the languages of words in mixed-language documents using weakly
supervised methods, in: L. Vanderwende, H. Daumé III, K. Kirchhoff (Eds.), Proceedings of the
2013 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, 2013,
pp. 1110–1119. URL: https://aclanthology.org/N13-1131.</p>
    </sec>
    <sec id="sec-9">
      <title>Appendix</title>
    </sec>
    <sec id="sec-10">
      <title>A. Aho-Corasick FSA</title>
      <p>When processing an input string, the Aho-Corasick FSA moves through the trie according to the
characters of the string. If the current node has no child for the next character, the algorithm
follows the fail link to continue the search. This approach ensures that all potential matches are found
efficiently, as the automaton checks for multiple patterns in a single pass through the text. By
employing this algorithm, our system can rapidly and accurately identify multiple location mentions
in one pass over each text.</p>
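      <p>The traversal described above can be sketched as follows. This is a minimal, self-contained illustration of the general Aho-Corasick technique and not the pipeline's implementation; the function names build_automaton and find_all and the tiny lowercase gazetteer are hypothetical, and patterns are assumed to be non-empty.</p>

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, fail links, and per-node outputs."""
    goto = [{}]   # goto[node][char] gives the child node
    fail = [0]    # fail[node] is the node of the longest proper suffix in the trie
    out = [[]]    # out[node] lists the patterns that end at this node

    # 1) Insert every (non-empty) pattern into the trie.
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)

    # 2) Set fail links breadth-first; children of the root fail to the root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f != 0 and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the fail target (shorter suffixes).
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start, end, pattern) for every occurrence in a single pass."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        # No child for this character: follow fail links until one fits.
        while node != 0 and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            matches.append((i - len(pattern) + 1, i + 1, pattern))
    return matches


if __name__ == "__main__":
    gazetteer = ["kharkiv", "lviv", "kyiv"]  # illustrative entries only
    automaton = build_automaton(gazetteer)
    print(find_all("kharkiv and lviv", automaton))
```

      <p>The automaton is built once from the gazetteer and can then be reused for every text, which is what makes the single-pass search attractive when the pattern list is large.</p>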
    </sec>
    <sec id="sec-11">
      <title>B. Labeling the Ukraine Benchmark Data</title>
      <p>The labels for the dataset were assigned by a single annotator with the help of Google Translate, with a
second person reviewing the labels to ensure accuracy and consistency. One of the labelers was fluent
in Ukrainian and had a good understanding of Russian.</p>
      <list list-type="bullet">
        <list-item>
          <p>Locations were labeled as is, i.e., including misspellings and typos. Furthermore, in cases
where multiple locations are mentioned in the same tweet, all of the mentions were labeled, even
when they reference the same location.</p>
        </list-item>
        <list-item>
          <p>To determine geographic coordinates, we used OSM information. When available,
the coordinates of the OSM text label were taken. Otherwise, we assigned coordinates based on
the centroid of the respective region.</p>
        </list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Crooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Croitoru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Radzikowski</surname>
          </string-name>
          ,
          <article-title>#earthquake: Twitter as a distributed sensor system</article-title>
          ,
          <source>Transactions in GIS</source>
          <volume>17</volume>
          (
          <year>2013</year>
          )
          <fpage>124</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Soltani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Flumapper: A cybergis application for interactive analysis of massive location-based social media</article-title>
          ,
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>2253</fpage>
          -
          <lpage>2265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lajevardi</surname>
          </string-name>
          ,
          <article-title>Effects of divisive political campaigns on the day-to-day segregation of arab and muslim americans</article-title>
          ,
          <source>American Political Science Review</source>
          <volume>113</volume>
          (
          <year>2019</year>
          )
          <fpage>270</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jurdak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>AbouJaoude</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cameron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Newth</surname>
          </string-name>
          ,
          <article-title>Understanding human mobility from twitter</article-title>
          ,
          <source>PloS one</source>
          <volume>10</volume>
          (
          <year>2015</year>
          )
          <elocation-id>e0131469</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Padilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kavak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Gore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Diallo</surname>
          </string-name>
          ,
          <article-title>Temporal and spatiotemporal investigation of tourist attraction visit sentiment on twitter</article-title>
          ,
          <source>PloS one</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <elocation-id>e0198857</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.-Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Understanding the removal of precise geotagging in tweets</article-title>
          ,
          <source>Nature Human Behaviour</source>
          <volume>4</volume>
          (
          <year>2020</year>
          )
          <fpage>1219</fpage>
          -
          <lpage>1221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>I.</given-names>
            <surname>Balazevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Language detection for short text messages in social media</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1608.08515. arXiv:1608.08515.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Imran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <article-title>When a disaster happens, we are ready: Location mention recognition from crisis tweets</article-title>
          ,
          <source>International Journal of Disaster Risk Reduction</source>
          <volume>78</volume>
          (
          <year>2022</year>
          )
          <elocation-id>103107</elocation-id>
          . URL: https://www.sciencedirect.com/science/article/pii/S2212420922003260. doi:10.1016/j.ijdrr.2022.103107.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
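          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>von Platen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,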
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Huggingface's transformers: State-of-the-art natural language processing</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1910.03771. arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Antypas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ushio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rezaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Supertweeteval: A challenging, unified and heterogeneous benchmark for social media nlp research</article-title>
          ,
          in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bastianelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vanzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Swietojanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rieser</surname>
          </string-name>
          ,
          <article-title>SLURP: A spoken language understanding resource package</article-title>
          ,
          in:
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>7252</fpage>
          -
          <lpage>7262</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.588. doi:10.18653/v1/2020.emnlp-main.588.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>FitzGerald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hench</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mackie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rottmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Urbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kakarala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranganath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Crist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Britan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Leeuwis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <article-title>Massive: A 1m-example multilingual natural language understanding dataset with 51 typologically-diverse languages</article-title>
          ,
          <year>2022</year>
          . arXiv:2204.08582.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Piskorski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marcińczuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yangarber</surname>
          </string-name>
          ,
          <article-title>Cross-lingual named entity corpus for Slavic languages</article-title>
          , in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          ,
          <publisher-name>ELRA and ICCL</publisher-name>
          ,
          <publisher-loc>Torino, Italy</publisher-loc>
          ,
          <year>2024</year>
          , pp.
          <fpage>4143</fpage>
          -
          <lpage>4157</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.369.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <article-title>Geoparser: A python package for geoparsing text</article-title>
          , https://pypi.org/project/geoparser/,
          <year>2024</year>
          . Version 0.1.8.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , arXiv preprint arXiv:1911.02116 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>B. I.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wischerath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Racek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Parry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Godwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hinds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Der Linden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Roscoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ayravainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Cork</surname>
          </string-name>
          ,
          <article-title>Platform-controlled social media apis threaten open science</article-title>
          ,
          <source>Nature Human Behaviour</source>
          <volume>7</volume>
          (
          <year>2023</year>
          )
          <fpage>2054</fpage>
          -
          <lpage>2057</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>