<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Geographic Knowledge from Large Language Models: An Experiment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantinos Salmas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Despina-Athanasia Pantazi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manolis Koubarakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We perform an experimental analysis of how the inner architecture of large language models behaves whilst extracting geographic knowledge. Our aim is to determine whether models actually incorporate geospatial information or simply follow statistical patterns in the data, and hence to contribute to the research area of creating knowledge graphs from large language models. To achieve this, we study one specific geospatial relation and explore different techniques that leverage the masked language modeling abilities of BERT and RoBERTa. Our study should be construed as a stepping stone to the general study of the ways large language models encapsulate geospatial knowledge. In addition, it has allowed us to observe important points one should focus on when querying language models, which we discuss in detail.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>geospatial data</kwd>
        <kwd>geospatial knowledge</kwd>
        <kwd>knowledge graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        There has been a lot of interest recently in the development of large language models (LLMs) and
their relationship to knowledge graphs (KGs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LLMs such as BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] were
pre-trained in a self-supervised manner on large textual corpora. In the probing
setting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], LLMs can fill a masked word included in a text sequence to extract a relational
assertion for a given subject [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For example, BERT successfully fills the sentence "Athens
is located in" with "Greece", which can be perceived as a KG subject-predicate-object triple
&lt;Athens, located in, Greece&gt;.
      </p>
      <p>
        Given that LLMs are constantly evolving and have been exposed to a vast
quantity of data, the question arises of whether we can extract their knowledge and automatically
create KGs. An analysis of the factual and commonsense knowledge in publicly available
pretrained language models was published in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the authors concluded that it is not
trivial to extract a KG from them. Another work proposed the injection of factual knowledge
into BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] discussed how we can create structured KBs from LLMs, and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explored
how robustly world knowledge is stored in LLMs.
      </p>
      <p>
        In this paper, we are interested in geospatial KGs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and their relationship to LLMs. Geospatial
KGs include YAGO2 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], YAGO2geo [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], WorldKG [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], KnowWhereGraph [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the geospatial
extension of YAGO4 of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and others. Geospatial KGs contain information about geographic
entities (e.g., the city of Athens) and their geospatial (e.g., Athens’ geometry in the WGS84
coordinate reference system) and thematic characteristics (e.g., Athens’ population or the name
of its mayor). Geospatial KGs can store both qualitative (e.g., Thessaloniki being north of Athens),
and quantitative (e.g., their distance being 504km) knowledge about geographic entities.
      </p>
      <p>LLMs like BERT and RoBERTa have been trained on Wikipedia, book corpora, web text, etc.
As these sources (especially Wikipedia) contain a lot of geospatial knowledge like that stored
in geospatial KGs, it is very interesting to explore the LLMs’ capabilities in this regard. For
example, there is a lot of qualitative knowledge about geographic entities in Wikipedia, such as
cardinal direction knowledge describing the relative position of various entities on the world
map (e.g., “Bulgaria borders Greece to the north”). Additionally, many geospatial properties
included in Wikipedia can be implicitly inferred by combining different texts. For instance, Athens
is in Greece (Europe) and Accra is in Ghana (Africa). The conclusion that Athens is north of
Accra can be inferred from the additional fact that Europe is north of Africa. Understanding such
relations and being able to easily answer qualitative geospatial questions that depend on them
can be an important capability of LLMs.</p>
      <p>Motivated by the above discussion, we seek to broaden our perspective regarding the ways
LLMs can answer geospatial questions. The contributions of the paper are the following:
<list list-type="bullet">
<list-item><p>We carry out an experiment regarding the ways LLMs can answer qualitative geospatial
questions of the form X is a city in Y. To achieve this, we exploit the fill-the-mask pipeline
on the pre-trained models BERT and RoBERTa without fine-tuning them. In this way,
we aim to understand whether LLMs are actually able to answer simple qualitative geospatial
questions correctly.</p></list-item>
<list-item><p>We present our findings on answering geospatial questions using LLMs, and we
experiment with the effect different layers and their attention heads have on the final results.
We also explore different mask-filling techniques for retrieving the answers from
the LLMs. To support our findings, we include detailed results for each experiment. In
addition, in Appendix A we conduct a complementary study to further analyse the results
of our experimental evaluation concerning different variations of LLMs.</p></list-item>
</list></p>
      <p>The rest of the paper is structured as follows: Section 2 discusses related work, while Section 3
gives a detailed overview of the methodology we followed to examine whether the LLMs studied can
answer geospatial questions. Section 4 presents the experimental evaluation we conducted.
Section 5 concludes the paper and makes recommendations for extending this work. In
Appendix A we present our complementary experiments using different variations of the LLMs.
As a contribution to the research community, we release our source code<sup>1</sup>.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent works have studied the possibility that LLMs could be used for KG construction and
augmentation. Generally, the LM-as-KG paradigm embodies three different approaches:
prompt-based retrieval, case-based analogy, and context-based inference [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In prompt-based retrieval,
one masks the desired answer and deems the returned token to be the answer; e.g., Paris is the
capital of [MASK]. In case-based analogy, the prompt includes an example prior to the mask one
wants filled; e.g., Athens is the capital of Greece. Paris is the capital of [MASK]. In context-based
inference, the prompt is enriched with related information; e.g., Athens is in Greece.
      </p>
      <sec id="sec-2-1">
        <title>1 https://github.com/AI-team-UoA/gsp-knowledge-extraction-llms</title>
        <p>
          Petroni et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] introduced the LAMA probe, which is a prompt-based retrieval approach.
Using the LAMA probe, they explored the factual knowledge an LLM encapsulates, simply from
its pre-training procedure. Their contribution consists of a systematic analysis which reaches
the conclusion that BERT-large is better at knowledge extraction compared to its competitors,
that relation extraction performance is not easily improved simply by increasing the data
volume, and finally, that we need better understanding regarding aspects of knowledge that
LLMs capture. Wang et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] moved one step further and proposed a framework that can
construct KGs from LLMs. Their approach requires only a single forward pass of the LLMs over
textual corpora, without fine-tuning. The proposed MaMa (Match and Map) framework consists
of two stages that result in an open KG, with mapped facts in a fixed schema and
unmapped ones in an open schema. They argue that the resulting KGs (with a measured
precision of more than 60%) indicate that their approach is reliable. Hao et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] presented a
more advanced prompt-based retrieval approach by introducing a framework that can construct
a relational KG via an LLM without parsing textual corpora, simply by using some
examples and a prompt. They paraphrase the initial prompt and use the alternatives to find
out which of them can help the LLM to effectively produce valid answers, based on
the said examples and the score they achieve. Their approach leverages the masked language
modeling (MLM) abilities of LLMs and retrieves knowledge via fill-the-mask tasks.
        </p>
        <p>
          Razniewski et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] argue that LLMs should be a means to curate and augment KBs and
not simply replace them. They propose some pragmatic and intrinsic considerations such as a
common bias of the aforementioned techniques, namely the lack of disambiguation between
statistical correlation and explicit knowledge. Considerable attention has also been paid to
the inner workings of the LLMs: their layers and the corresponding attention heads. Clark
et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] hypothesise that some attention heads of BERT-base appear to behave in specific
patterns that could indicate that BERT learns syntactic dependencies of the English language. A
similar study has been conducted by Kovaleva et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], also focusing on BERT’s self-attention
mechanisms, suggesting that BERT can benefit from disabling attention heads in some tasks.
        </p>
        <p>
          In the area of KGs, there is research on the enhancement of KGs along temporal and spatial
dimensions, the latter being the topic of this paper. YAGO2 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is such an example that
extends the classic Subject-Property-Object (SPO) triples adding Time and Location. It was
further extended with richer geospatial knowledge (not just coordinates) in YAGO2geo [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
and YAGO4 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Other recent approaches to geospatial knowledge graphs are WorldKG [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
and KnowWhereGraph [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          As per the geospatial abilities of LLMs, Roberts et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] probed GPT-4, which performed
generally well, although it remains unclear whether it did so by reasoning or simple memorization. Cohn et
al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] conducted dialectical evaluations on state-of-the-art models and suggest that the models do
not always succeed in spatial reasoning. Faisal et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] proposed a framework to examine
LLMs’ geographic knowledge and biases. The models seemed to understand geographic proximity, but
serious limitations exist. Hofman et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] propose geoadaptation, a task-agnostic training
step performed on pretrained LLMs which they argue allows models to learn geographic and
dialectal knowledge. Finally, Mai et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] discuss the development of a foundation model for
geospatial AI and introduce a framework to achieve such a goal.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        We focus on Transformer-based language models that have been pre-trained through the masked
language modeling (MLM) paradigm. BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was trained on BookCorpus (800M words) [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
and the English version of Wikipedia (2,500M words), excluding lists, headers and tables. As for
the MLM task, 15% of the tokens were selected for prediction. More specifically, 80% of the
selected tokens were replaced with the special [MASK] token, 10% were replaced with another random
token, and 10% were left unchanged. BERT was also pre-trained on Next Sentence Prediction
(NSP). In this task, the model should predict whether a sentence B follows a sentence A or not.
RoBERTa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] follows a similar MLM training technique and an almost identical architecture
to BERT. However, the authors changed some major points: the NSP objective is dropped, MLM is
performed via dynamic masking, and tokenization is replaced with byte-level byte-pair encoding (BPE).
Finally, the data upon which it was trained (apart from that used by BERT) include CC-News<sup>2</sup>, OpenWebText<sup>3</sup> and
Stories [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. We use the pretrained versions of BERT and RoBERTa without fine-tuning them.
      </p>
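      <p>The MLM corruption step can be sketched in a few lines of Python. This is a simplified illustration of the 80%/10%/10% rule reported for BERT [2], not the original pre-training code: the function name mask_tokens, the toy vocabulary and the whitespace tokenization are assumptions made for the example.</p>
      <preformat>
```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Corrupt a token list BERT-style; return (corrupted tokens, selected positions)."""
    corrupted = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # position not selected, token stays as it is
        selected.append(i)  # the model must predict the original token here
        roll = rng.random()
        if roll >= 0.9:
            pass  # 10% of selected positions: keep the original token
        elif roll >= 0.8:
            corrupted[i] = rng.choice(vocab)  # 10%: substitute a random token
        else:
            corrupted[i] = MASK  # 80%: substitute the [MASK] token
    return corrupted, selected

# Toy usage with a hypothetical three-word vocabulary.
rng = random.Random(0)
sentence = "athens is a city in greece".split()
print(mask_tokens(sentence, ["paris", "rome", "river"], rng))
```
      </preformat>
      <p>Dynamic masking, as used by RoBERTa, would amount to calling such a function anew each time a sequence is fed to the model, instead of fixing the corrupted positions once during pre-processing.</p>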
      <p>In the rest of the paper we try to answer the question: Can BERT and RoBERTa answer the
very simple geospatial question “Is X a city in Y?” The question is posed as a geospatial phrase
(GSP) of the form X is a city in Y. For example, if an LLM has learned that “Athens is a city in
Greece”, it should be able to answer the GSP Athens is a city in Y.</p>
      <p>
        We chose to carry out this very simple experiment since “in” is an important topological
relation in all qualitative spatial reasoning models [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and, as such, it deserves to be studied
first. Also, “city” is an important class of geographic features and there is plenty of knowledge
about cities and the administrative divisions they belong to (e.g., states of countries) in the data
BERT and RoBERTa have been trained on (e.g., Wikipedia or BookCorpus).
      </p>
      <sec id="sec-3-1">
        <title>3.1. Knowledge extraction settings</title>
        <p>Layers. BERT-like models take a sequence of tokens as an input and pass it through their inner
layers. When they are used for MLM, a specific head is added on top of the models that takes
the contextual embeddings as an input, passes them through a feed-forward neural network
(FNN) and returns a sequence of predicted tokens. We aim to explore how the answers are
constructed at each layer of the model. To record that, we changed the default forward
function of these models so that the answer is taken from a layer of our choice. When asking a model
with K layers to fill the masked tokens from layer N, we actually allow the model to use all
layers <inline-formula><tex-math>1 \le i \le N</tex-math></inline-formula> and then pass the output of layer N directly to the MLM head, bypassing the remaining layers
(i.e., layers <inline-formula><tex-math>N &lt; i \le K</tex-math></inline-formula> are not used at all). If one requests an answer from the K-th
layer, the process is identical to a simple fill-the-mask task in which the model uses all of
its layers to produce the outcome.</p>
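        <p>The idea of answering from an intermediate layer can be illustrated with a toy Python sketch. It is not the modified forward function used in the experiments: the layers and the MLM head are hypothetical stand-ins (plain callables acting on a number), and answer_from_layer is a name chosen only for this example.</p>
        <preformat>
```python
def answer_from_layer(layers, mlm_head, hidden, n):
    """Run only layers 1..n, then feed the result straight to the MLM head."""
    if n not in range(1, len(layers) + 1):
        raise ValueError("n must be between 1 and the number of layers")
    for layer in layers[:n]:
        hidden = layer(hidden)  # the layers after position n are never called
    return mlm_head(hidden)

# Hypothetical stand-ins: every "layer" adds 1, the "head" multiplies by 10.
layers = [lambda h: h + 1 for _ in range(3)]
mlm_head = lambda h: h * 10

print(answer_from_layer(layers, mlm_head, 0, 2))            # exit after layer 2
print(answer_from_layer(layers, mlm_head, 0, len(layers)))  # ordinary full pass
```
        </preformat>
        <p>Requesting the last layer reproduces the ordinary forward pass, which mirrors the statement that an answer from the final layer is identical to a plain fill-the-mask call.</p>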
        <p>Top-K answers. A softmax function is applied to the scores every BERT-like model
returns for a masked position. The resulting probabilities are then sorted and the top-k of them (along with the confidence
of the model) are kept as the most probable answers. We experimented with several different top-k
values and settled on the values 10 and 100. Note that a large model (24 layers) with
top-k=100 would return a total of 2400 answers.</p>
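        <p>A minimal Python sketch of the top-k selection is given below. The scores are invented for illustration and top_k_answers is a hypothetical helper; a real model would produce one score per vocabulary entry for the masked position.</p>
        <preformat>
```python
import math

def top_k_answers(logits, k):
    """Return the k most probable tokens with their softmax probabilities."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Invented scores for the mask in "Athens is a city in [MASK]".
logits = {"Greece": 5.0, "Italy": 3.0, "Athens": 1.0, "the": 0.5}
for token, confidence in top_k_answers(logits, 2):
    print(token, round(confidence, 3))
```
        </preformat>
        <p>The probability attached to each returned token plays the role of the model's confidence in that answer.</p>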
        <sec id="sec-3-1-1">
          <title>2 https://commoncrawl.org/2016/10/news-dataset-available/ 3 https://skylion007.github.io/OpenWebTextCorpus/</title>
          <p>Multi-mask filling methods. Some geospatial relations, like the relation “in” we are working
with, require more than one mask to be filled, e.g., [MASK] is a city in [MASK]. We test two
different approaches as to how the full answer is constructed, as described below:
<list list-type="bullet">
<list-item><p>Left-To-Right (LTR): First, the model fills the leftmost mask and then proceeds to the
remaining ones on the right. For example:
[MASK] is a city in [MASK] → Athens is a city in [MASK] → Athens is a city in Greece.</p></list-item>
<list-item><p>Right-To-Left (RTL): This is the exact opposite of LTR and starts the process of filling
from the rightmost [MASK] token. The above example would be reformulated as:
[MASK] is a city in [MASK] → [MASK] is a city in Greece → Athens is a city in Greece.</p></list-item>
</list></p>
          <p>The reason both approaches were tested lies mainly in the fact that introducing
biases while attempting to extract geospatial knowledge is fairly easy. Depending on the GSP,
the choice of the method can affect the outcome greatly. For instance, using LTR in the relation
"X is a city in Y" creates the following problem: for each answer x of the top-k answers for X, the
model would attempt to produce top-k tokens for the other mask. However, even if the model
were an oracle and could safely predict y as the correct answer (x is a city in the country
y), it would continue to produce k-1 more answers, which would be wrong. As a result,
the percentage of correct answers is severely limited by a human-induced bias. For this reason,
we introduce one more parameter: the cutoff.</p>
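          <p>The two filling orders can be sketched in Python as follows. This is a toy illustration, not the code used in the experiments: fill_masks is a hypothetical helper, and toy_predict is an invented stand-in for a masked language model that knows only two cities and two countries and always ranks the matching candidate first.</p>
          <preformat>
```python
MASK = "[MASK]"

def fill_masks(tokens, predict, k, order="ltr"):
    """Fill every [MASK] in `tokens`, expanding the top-k candidates at each step."""
    positions = [i for i, tok in enumerate(tokens) if tok == MASK]
    if order == "rtl":
        positions.reverse()  # start from the rightmost mask
    partial = [list(tokens)]
    for pos in positions:
        expanded = []
        for sequence in partial:
            for candidate in predict(sequence, pos, k):
                filled = list(sequence)
                filled[pos] = candidate
                expanded.append(filled)
        partial = expanded
    return [" ".join(seq) for seq in partial]

def toy_predict(sequence, pos, k):
    """Invented stand-in for a masked language model's top-k prediction."""
    cities = ["Athens", "Paris"]
    countries = ["Greece", "France"]
    if pos == 0:
        pool, other, partner = cities, sequence[-1], countries
    else:
        pool, other, partner = countries, sequence[0], cities
    # Rank the candidate that matches the already filled mask first.
    ranked = sorted(pool, key=lambda c: partner[pool.index(c)] != other)
    return ranked[:k]

template = [MASK, "is", "a", "city", "in", MASK]
print(fill_masks(template, toy_predict, k=2, order="ltr"))
print(fill_masks(template, toy_predict, k=2, order="rtl"))
```
          </preformat>
          <p>With k=2 and two masks, each order yields four phrases of which only two are correct, even though the toy predictor always ranks the right partner first. This is the bias discussed above.</p>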
          <p>Cutoff. When cutoff is enabled, the model constructs top-k answers for the first mask to be
filled (according to the chosen method), and then returns one token for each of the top-k answers.
Alternatively, when cutoff is disabled, the total number of answers produced is <inline-formula><tex-math>L \cdot K^{M}</tex-math></inline-formula>, where L is the
number of layers, K is the top-k value, and M is the number of masks.</p>
          <p>Layer Drop. In some experiments we explore whether specific layers affect the final results to a
great extent. That is why we drop some of them from the model. Simply freezing a layer would
still allow the tokens to flow through it and be susceptible to its normalization mechanism. We
aim, however, to completely remove a layer and prevent it from influencing the data. In this
regard, when removing the i-th layer, we copy the internal encoder structure except for the i-th
layer, and assign the new layer list to the model. As a result, we are able to keep all the other
layers unaffected by the removal and examine the influence such tweaks have on the results.</p>
          <p>Attention Heads Drop. In a similar vein, we also examine the extent to which specific
attention heads (from specific layers) affect a model’s answers. We utilize the internal mechanisms
of a model that allow us to easily prune said heads; when pruned, they serve as a no-op.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Compatibility Matrices</title>
        <p>We are interested not only in the percentage of correct answers, but also in the
consistency of the models’ results. This is why we construct compatibility matrices,
with which we are able to compare the percentage of compatibility between the layers. They
are 2D heat maps that visually demonstrate how similar the answers yielded at each layer of
the model are while answering a geospatial question. We constructed the following two types of
compatibility matrices:
<list list-type="bullet">
<list-item><p>Self-Compatibility: These matrices are symmetrical and compare a model to itself. By
examining them, we are able to see how much the model changes its answers
throughout the layers. Note that the main diagonal is not always 100% because we count the
compatibility after discarding duplicate answers.</p></list-item>
<list-item><p>Cross-Model Compatibility: Similar to the self-compatibility matrices, these compare
different models (with the same architecture or not). Using these graphs, we note some interesting points
as to how different models behave whilst constructing their answers.</p></list-item>
</list></p>
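        <p>One possible way to compute such a matrix is sketched below in Python. The normalization is an assumption made for the example (shared distinct answers divided by the number of answers returned per layer), and compatibility_matrix is a hypothetical name; the exact formula behind the reported heat maps may differ.</p>
        <preformat>
```python
def compatibility_matrix(answers_a, answers_b):
    """Percentage of shared answers for every pair of layers (rows: a, columns: b)."""
    matrix = []
    for layer_a in answers_a:
        row = []
        for layer_b in answers_b:
            common = set(layer_a).intersection(layer_b)  # duplicates count once
            row.append(100.0 * len(common) / max(len(layer_a), len(layer_b)))
        matrix.append(row)
    return matrix

# Invented answers of a two-layer model; the first layer repeats "Rome".
layers = [["Athens", "Rome", "Rome"], ["Athens", "Paris", "Rome"]]
for row in compatibility_matrix(layers, layers):
    print([round(value, 1) for value in row])
```
        </preformat>
        <p>Passing the same model twice gives a self-compatibility matrix, while passing the per-layer answers of two different models gives a cross-model matrix. Because the first toy layer contains a duplicate, its diagonal entry stays below 100%.</p>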
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Validation</title>
        <p>We utilize the GeoPy Python API<sup>4</sup> to validate the answers to a geospatial question provided
by an LLM. It is a module for geocoding that uses the OpenStreetMap (OSM) Nominatim
service<sup>5</sup>. It takes a name as an input and returns the location matching that name along with
its characteristics such as the feature type (e.g., city). The validation of an answer (X, Y)
to “X is a city in Y.” would proceed as follows. Running a geocoding query for the name X, restricted to the
feature type city and to the country Y, we would get the city that belongs to the country Y and matches the
name X. Note that the feature type of city may also allow towns, villages and communes.
Unfortunately, GeoPy might sometimes mistake a clearly wrong answer for an existing location.
For example, a query for the name 1982 returns Elewijt, Belgium<sup>6</sup>. That is why we further require
GeoPy’s answer and the model’s answer to have an edit distance (i.e., the number of changes needed to match the
two strings) of 0 or 1<sup>7</sup>.</p>
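        <p>The edit-distance filter can be sketched in plain Python as follows. The sketch assumes the standard Levenshtein distance and compares the strings exactly as given; is_plausible_match is a hypothetical helper, and no geocoding request is made here.</p>
        <preformat>
```python
def edit_distance(a, b):
    """Levenshtein distance: single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def is_plausible_match(model_answer, geocoder_name, max_distance=1):
    """Accept a geocoder hit only if its name is within max_distance edits."""
    return max_distance >= edit_distance(model_answer, geocoder_name)

print(edit_distance("1982", "Elewijt"))
print(is_plausible_match("1982", "Elewijt"))
print(is_plausible_match("athens", "Athens"))
```
        </preformat>
        <p>Under this filter the spurious hit for 1982 is rejected, while an answer that differs from the returned name only in the casing of one letter is still accepted.</p>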
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>Settings and evaluation metrics. In our experimental evaluation, we use the two major
size variations of the BERT and RoBERTa models: base and large. As far as BERT is concerned,
we also experiment with the different casing versions. The uncased version (as opposed to the
cased one) was trained with all textual data being lowercased during pre-processing.
For the top-k variable, we selected the values 10 and 100. More values were tested (e.g., 300,
1000), but ultimately we settled on these for the following reasons. Higher top-k values
dramatically reduced P@C (the percentage of correct answers) on each layer, whilst simultaneously
blunting the fluctuations at lower layers. As the answers in the early stages
are not yet sufficiently evaluated by the model, many correct answers are ranked lower than they
should be; hence a high top-k reveals those answers and tends to yield somewhat linear graphs.
Such results would not allow us to focus on the performance of different layers and their deeper
analysis. Moreover, we needed a sufficient, yet small, top-k to be used as a reference for future
experimentation with GSPs that have a finite number of correct answers. For example, there exist
approximately 190 countries, hence “[MASK] is a country.” cannot have 300 correct responses.
To evaluate our experiments, for each layer of a model, we compute the percentage of correct
answers produced by this layer (P@C) and we depict it in graphs. When a model returns a
specific token as an answer, it also assigns a score to it, corresponding to the confidence it has that
the token is the actual answer. For each layer we compute the mean of these
confidence scores for the returned tokens, we normalize it and then feed it to a MinMaxScaler so that
we can depict the different levels of confidence in the graphs. The closer the color of a scatter
point is to black, the more confident the model was on that specific layer.</p>
      <p>4 https://pypi.org/project/geopy/
5 https://nominatim.org/
6 1982 is the zip code of Elewijt, Belgium.
7 The edit distance between 1982 and Elewijt is 7.</p>
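      <p>The two per-layer quantities used in the evaluation, P@C and the rescaled mean confidence, can be sketched in Python as follows. The helper names and the toy data are assumptions made for the example, and min_max_scale is a plain-Python stand-in for the default behaviour of a min-max scaler rather than the scaler used in the experiments.</p>
      <preformat>
```python
def percent_correct(answers, is_correct):
    """P@C: percentage of a layer's answers that pass validation."""
    if not answers:
        return 0.0
    return 100.0 * sum(1 for a in answers if is_correct(a)) / len(answers)

def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant list maps to zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Invented answers of one layer and invented mean confidences of three layers.
valid_cities = {"Athens", "Paris"}
layer_answers = ["Athens", "Paris", "1982", "the"]
print(percent_correct(layer_answers, lambda a: a in valid_cities))
print(min_max_scale([0.25, 0.5, 0.75]))
```
      </preformat>
      <p>In a plot, the first value would give the height of a layer's point and the rescaled confidence its shade.</p>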
      <p>Task 1 - Results. We attempt to construct answers for the GSP "_ is a city in Europe.". In
Figure 1, we present the percentages of correct answers each layer produces for the said
phrase. As we can observe, the BERT models perform adequately in this specific task, while
the RoBERTa models struggle with higher top-k values. This is probably due to
the datasets that were used during pre-training; RoBERTa processed a great volume of data
unrelated to the Wikipedia textual corpus.</p>
      <p>Intuitively, we can assume that the uncased versions of the models would perform worse
for the selected GSP, as we are searching for answers that are cities and almost always appear
capitalized. However, in some cases the uncased versions achieve better scores. We can also
observe that almost all the models appear to be strongly confident at lower layers with moderate
P@C; this could indicate a high level of randomness in the answers. Another difference between
the BERT and RoBERTa models appears in the final layers. RoBERTa seems to rearrange its
answers and performs worse even though it had previously reached a higher score.</p>
      <p>Task 2 - Layer Pruning. In Figure 1 we can see that local minima sometimes appear in the graphs,
indicating layers that affect the general model performance. We therefore remove the said layers
and observe the resulting graphs. Our results are included in Figure 2. What we can see is
that such pruning allows the models to reach similar maximum scores with fewer layers. Even
though a slight drop appears in the maximum P@C, in some cases we have pruned enough
layers to decrease the total model size (a trade-off). This could indicate that not all layers are
necessary for such tasks and that, should we attempt to fine-tune the models for better results, the process
would be less computationally heavy and expensive.</p>
      <p>
        Task 3 - Attention Head Pruning. As shown in Figure 3, we experimented with different
combinations of which heads to keep and which to prune. It has been argued that specific heads
are able to perform better at certain tasks as classifiers [
        <xref ref-type="bibr" rid="ref17 ref18">18, 17</xref>
        ]. We could not, however, identify a
general rule of thumb in our experiments. Through trial and error we were sometimes able to
improve the final scores, or to affect them only moderately, despite having pruned a significant number
of heads. It seems that attention heads are crucial, but sometimes the models come equipped
with more than the necessary number [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. It is also believed that some heads
in a layer often exhibit similar behaviours [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Hence, in some cases we were able to remove
approximately half of a layer’s heads without significant performance reduction.
      </p>
      <p>Task 4 - Multiple Masks. As discussed in Section 3, when more than one mask appears,
we need to specify the order in which we fill them. The results shown
in Figure 4 make clear that cutoff affects the total number of produced answers. This is the reason
cutoff-enabled curves perform better. Moreover, we observe that RTL achieves higher scores,
since multiple cities belong to a country while the reverse is not true. This fact indicates again
the lack of a general rule, as the optimal method has to be chosen with regard to the GSP.
      </p>
      <p>Task 5 - Compatibility Matrices. In some cases, the models seem to return similar answers,
both between their layers and generally between different models. That is why we constructed
compatibility matrices that count the number of common answers per layer, which helps us
gain some insights into the similarities between them. For this experimental setting, we
present in Figure 5 the compatibility matrix for the layers of the uncased BERT-base model
and its corresponding matrix for the negation of the same phrase, with top-k equal to 100. As
we can observe, close layers in a model seem to yield similar answers. This compatibility span
may appear to be slightly higher in upper layers, where answers are more stable with fewer changes.
Regarding the negation matrix, many scores are higher than what we expected. This is an
indication that the models almost discard the negation. An interesting point also lies in the
comparison of base and large models, where the 12 layers of the base models are more compatible with the last
12 of the 24 layers of the large versions. The additional matrices that reflect different model cases can be
found in Appendix A.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Summary and Future Work</title>
      <p>In this work, we carried out an experiment to investigate how LLMs behave whilst constructing
simple geographic knowledge of the form X is a city in Y via the MLM paradigm. We
experimented with the effect different layers and their attention heads have on the final output of
the model and explored different techniques for mask filling. We found that the results of
geographic knowledge extraction from LLMs can vary highly; different methods and/or models
cause the P@C metric to fluctuate greatly.</p>
      <p>In future work, we would like to study whether other kinds of geospatial knowledge can be
extracted from LLMs using appropriate templates; in the introduction we have discussed various
kinds of such knowledge. We also want to study how much geospatial answering and geospatial
reasoning can be done by more recent language models such as ChatGPT, Bard, Claude and
LLaMA. Through these models we would like to extend our research to more relations that are
also important in the geospatial dimension such as cardinal relations. We aim to validate our
future work using geospatial KGs and evaluate the aforementioned models on their geographic
knowledge.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the first call for H.F.R.I. Research Projects to support faculty
members and researchers and the procurement of high-cost research equipment grant
(HFRI-FM17-2351).</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <sec id="sec-7-1">
        <title>A.1. Additional Compatibility Matrices</title>
        <p>In Figure 6, we can see the self-compatibility matrices for the large versions of the models
BERT-uncased and RoBERTa, which are symmetrical, and whose diagonals correspond to a layer being
compared to itself. Moreover, in a comparison between BERT’s base (discussed in Section 4) and
large versions, shown in Figure 7, we can easily observe that the answers the base version
yields are shifted to the last 12 layers of the large versions. The early layers of BERT-large achieve
very low similarity scores compared to the BERT-base layers.</p>
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Highest Scores</title>
        <p>In Table 1, we present the highest scores for top-k ∈ {10, 100} and how they were achieved. The
model column consists of abbreviations of the models (bert-base-cased: bbc, bert-base-uncased:
bbu, bert-large-uncased: blu). Unsurprisingly, a top-k value equal to 10 is able to yield better
scores than the higher value. When we include multiple masks, scores decline dramatically and
only with the cutoff variable enabled are the models able to perform better. Moreover, the LTR
method seems to outperform the RTL one. What is more, bert-large-uncased achieved a higher
score at the 23rd layer.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S. H.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          , in: EMNLP-IJCNLP, Hong Kong, China,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: NAACL-HLT, Minneapolis, MN, USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Language models as or for knowledge bases</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pörner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Waltinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>E-BERT: Efficient-yet-effective entity embeddings for BERT</article-title>
          , in: EMNLP, Online Event,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Language models are open knowledge graphs</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries</article-title>
          , in: EACL, Online,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          (Ed.),
          <article-title>Geospatial Data Science: A Hands-on Approach for Building Geospatial Applications Using Linked Data Technologies</article-title>
          , volume
          <volume>51</volume>
          , 1 ed.,
          <source>Association for Computing Machinery</source>
          , New York, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Berberich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia</article-title>
          ,
          <source>Artificial Intelligence</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Karalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Mandilaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>Extending the YAGO2 knowledge graph with precise geospatial knowledge</article-title>
          , in: ISWC, Auckland, New Zealand,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dsouza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tempelmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gottschalk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Demidova</surname>
          </string-name>
          ,
          <article-title>WorldKG: A world-scale geographic knowledge graph</article-title>
          , in:
          <source>CIKM</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          et al.,
          <article-title>Know, know where, KnowWhereGraph: A densely connected, cross-domain knowledge graph and geo-enrichment service stack for applications in environmental intelligence</article-title>
          ,
          <source>AI Mag</source>
          .
          <volume>43</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Siampou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Karalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>Extending YAGO4 knowledge graph with geospatial knowledge</article-title>
          , in:
          <source>The 5th International Workshop on Geospatial Linked Data at ESWC</source>
          , Hersonissos, Greece,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Knowledgeable or educated guess? Revisiting language models as knowledge bases</article-title>
          , in: ACL/IJCNLP, Online,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>BertNet: Harvesting knowledge graphs from pretrained language models</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>What does BERT look at? An analysis of BERT's attention</article-title>
          , in: ACL Workshop, Florence,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kovaleva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <article-title>Revealing the dark secrets of BERT</article-title>
          , in: EMNLP-IJCNLP, Hong Kong, China,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lüddecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
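          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albanie</surname>
          </string-name>
          ,
          <article-title>GPT4GEO: how a language model sees the world's geography</article-title>
          (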
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernández-Orallo</surname>
          </string-name>
          ,
          <article-title>Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Faisal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <article-title>Geographic and geopolitical biases of language models</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>V.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubesic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Pierrehumbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>Geographic adaptation of pretrained language models</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
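          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,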
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <article-title>On the opportunities and challenges of foundation models for geospatial artificial intelligence</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          , in: ICCV, Santiago, Chile,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Trinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>A simple method for commonsense reasoning</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Renz</surname>
          </string-name>
          ,
          <article-title>Qualitative spatial representation and reasoning</article-title>
          , in:
          <source>Handbook of Knowledge Representation</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E.</given-names>
            <surname>Voita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Talbot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moiseev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Titov</surname>
          </string-name>
          ,
          <article-title>Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned</article-title>
          , in: ACL, Florence, Italy (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>