Using Graph Databases for Historical Language Data:
Challenges and Opportunities
Barbara McGillivray1 , Pierluigi Cassotti2,* , Pierpaolo Basile2 , Davide Di Pierro2 and
Stefano Ferilli2
1
    King’s College London, Strand Campus, Strand, London, WC2R 2LS, United Kingdom
2
    University of Bari Aldo Moro, Department of Computer Science, Via E. Orabona 4, Bari, 70125, Italy


                                         Abstract
                                         The integration of semantic information into language resources has the potential to open up new avenues
                                         of enquiry into the mechanisms of language change. We present the first experiments in integrating data
                                         from Latin textual corpora and language resources into a graph database via the GraphBRAIN Schema
                                         and show the potential of this model for research into the mechanisms of semantic change in Latin.

                                         Keywords
                                         Knowledge Graphs, Latin Corpora, Semantic Change Detection, Graph Data Model


1. Introduction
Research in Historical Linguistics often requires the analysis and support of heterogeneous
data and tools, such as lexical resources, encyclopaedias, and large corpora. Nevertheless,
these resources are often siloed. Graph Databases present an ideal opportunity to combine the
advantages of DataBase Management Systems (DBMSs) for handling individuals (scalability,
storage optimization, efficient handling, mining and browsing of the data, etc.) with the high-
level functionalities available in Knowledge Bases (KBs). Graph DBMS are intrinsically designed
to store schemaless data. Differently from traditional DBMSs like the relational [1] or object-
oriented [2] ones, they lack predefined structures. Following this approach, Neo4j 1 , one of the
most common graph DBMSs, does not provide support for introducing ontology definitions
based on labels and/or arcs. The absence of a schema may lead to ambiguity when reading and
managing data in downstream applications due to the inherent ambiguity of the words used for
expressing concepts. Hence, the semantics becomes blurred.
   To address these issues, we propose the use of GraphBRAIN [3] as a solution. GraphBRAIN
consists of a graph database which follows the Labelled Property Graph (LPG) [4] structure.

19th IRCDL (The Conference on Information and Research science Connecting to Digital and Library science), February
23–24, 2023, Bari, Italy
*
  Corresponding author.
$ barbara.mcgillivray@kcl.ac.uk (B. McGillivray); pierluigi.cassotti@uniba.it (P. Cassotti);
pierpaolo.basile@uniba.it (P. Basile); davide.dipierro@uniba.it (D. Di Pierro); stefano.ferilli@uniba.it (S. Ferilli)
 0000-0003-3426-8200 (B. McGillivray); 0000-0001-7824-7167 (P. Cassotti); 0000-0002-0545-1105 (P. Basile);
0000-0002-8081-3292 (D. Di Pierro); 0000-0003-1118-0601 (S. Ferilli)
                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
             CEUR Workshop Proceedings (CEUR-WS.org)
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073


1
    https://neo4j.com/
This structure stores nodes with specific labels, arcs which represent relationships among
nodes and properties on both nodes and arcs. Properties are stored in the format of key/value
pairs. GraphBRAIN requires KB designers to define a data schema which operates also as an
ontology. GraphBRAIN provides a mapping mechanism for exporting schemes into an SW-
compliant language, the Web Ontology Language (OWL). These schemes guide access through
all the CRUD operations on the database but also ensure interpretability and interoperability
among different applications. Following the schemes, applications become compliant with
each other. Neo4j, in its Enterprise Edition, does not provide any constraint definition process.
In other versions, it supports a few constraints like unique node property constraints, node
property existence constraints, relationship property existence constraints and node key constraints.
Evidently, these tools are not as expressive as ontology definitions.
   In [5], we adopted GraphBRAIN technology to model time-sensitive linguistic knowledge in
a graph database, describing a time-sensitive model of linguistic knowledge that can be used
for graph databases. In this paper, we show an application of this model to the lexical semantic
analysis of Latin data, i.e. the analysis of the meanings of Latin words. Differently from previous
approaches, such as Basile et al. [6], Hamilton et al. [7], and Carlo et al. [8], we exploit graph
database potentialities to detect semantic changes in specific concepts.
   Latin is in a particularly favourable position among historical languages for the large-scale
analysis of semantic change processes, thanks to a number of factors. First, Latin researchers
now enjoy unprecedented access to digital data covering over two thousand years of history.
Thanks to the ERC-funded LiLa project 2 , seven Latin language resources and six corpora have
been linked at the level of word lemmas so far, making Latin a unique case among historical
languages. Second, we have access to extensive computational language resources for Latin,
Latin WordNet [9], and digitised dictionaries of Latin, which provide rich information about
words’ semantics and examples of usage. Finally, focussing on Latin allows us to investigate
semantic change processes over long time spans. Latin has one of the longest recorded histories
of any human language, making it naturally suitable for quantitative studies [10]. The first
inscriptional records date from the sixth century BCE, and Latin continues to be used to the
current day by the Catholic Church and some academic and legal institutions around the world.
Written Latin diverged from the spoken vernaculars in the second half of the first millennium
of the Christian era, but it remained in use as one of the principal channels of communication
across most of Europe for the next thousand years. The humanists’ conscious effort to reproduce
Classical Latin led to a range of interesting developments, particularly affecting the neo-Latin
lexicon to enable the expression of new concepts [11]. This extensive chronological span has
raised the question of the extent to which Latin is seen as a dead or fossilised language (e.g.
Herman [12], Butterfield [13]). However, it remains an open question to what extent this
fossilisation affected the semantics of words, as we know that the Latin lexicon, in this respect,
has remained dynamic (over 4,500 words have acquired new meanings since the Renaissance;
Demo 2022). The extent to which post-classical Latin can really be considered as a “fixed”
language (Leonhardt [14], Roelli [15], Langslow [16]) from the point of view of its ability to
generate new meanings of words is still largely unknown beyond anecdotal evidence.
   In Section 2 we present the Linguistic Knowledge Graph, in Section 3 we describe the Latin

2
    https://lila-erc.eu/
data that we worked on, and in Section 4 we show how we loaded the Latin data into the
Linguistic Knowledge Graph. Finally, in Section 5 we draw some conclusions and outline future
directions of work.


2. The Linguistic Knowledge Graph
The Linguistic Knowledge Graph (LKG) aims to capture different aspects of lexical resources,
such as relations between words and concepts, morphological, and syntactical information.
Moreover, LKG covers diachronic aspects of language, such as the date of publication of a
document, and the birth and death of an author. The schema we designed takes inspiration
from the ontological lexicon model LeMON [17]. For space constraints, we report in Table 1
node types and in Table 2 the relationships adopted for diachronic analysis. The lexical unit is
represented as node of type InflectedWord or Lemma, which are subclass of Word, i.e. Lemma
IS_A Word and InflectedWord IS_A Word. The Lemma can be a multi-word expression (mwe),
in this case, the flag mwe is set to True. The respective lemma of an InflectedWord can be
retrieved exploiting the relationship HAS_LEMMA between InflectedWord and Lemma. The
LexiconConcept is used to represent the word’s meanings, and each instance of LexiconConcept
represents a different meaning. For example, the LexiconConcept can represent the senses
reported on a sense inventory, e.g. synsets in WordNet [18]. The relationship between a
word and its meaning is expressed using the relationship HAS_CONCEPT among instances of
Word and instance of LexiconConcept. Multiple relationships can be defined over couples of
LexiconConcept using the reflexive relationship SEM_RELATION. At the same time, reflexive
relationships over the Word instances can be described by the LEX_RELATION relationship.
   The document structure from which words are extracted can be represented at different
levels of granularity: Sentence,Text, Document, and Corpus. In particular, each excerpt can be
represented as Text or Sentence, which is a subclass of Text. A Text may belong to (BELONG_TO)
a Document and a Document can be part of (BELONG_TO) a Corpus. The occurrences of a word
in a particular Text are traced by the relationship HAS_OCCURRENCE among Word and Text.
In the case of sense-annotated corpora, such as SemCor, is possible to specify the occurrences
of senses using the relationship HAS_EXAMPLE among LexiconConcept and Text. Currently,
the LKG takes into account two types of metadata: author and language. The relationship
HAS_AUTHOR among nodes of type Text and nodes of type Person determines the author of
a Text. The relationship HAS_LANGUAGE among nodes of type Text, Document, Corpus, and
Word to nodes of type Language specifies the respective language.
   The time is modelled using two classes of nodes: TimeInterval, and TimePoint, both subclasses
of TemporalSpecification. The TimeInterval type is used when the date is not precisely stated,
while the TimePoint is used in cases where the date is fixed. The start and end extremes of the
TimeInterval nodes can be specified using the respective relationships startTime and endTime.
In the current version of the LKG, time specification is supported for Person and Text. More
specifically, the date of birth and death of authors is specified using the relationship BORN and
DIED between Person and TemporalSpecification. The publishing date of a text is specified by
the relationship PUBLISHED_IN among Text nodes and TemporalSpecification nodes.
Table 1
LKG classes with their respective superclasses and attributes.
                     Class                   Superclass              Attributes
                     Word                                            value:String
                                                                     value:String
                     Lemma                   Word                    posTag:String
                                                                     mwe:Boolean
                     InflectedWord           Word                    value:String
                     Stem                                            value:String
                                                                     id:String
                     LexiconConcept          Concept
                                                                     resource:String
                     Text                                            value:String
                     Sentence                Text
                     Document                                        title:String
                     Corpus                                          name:String
                                                                     name:String
                     TemporalSpecification
                                                                     description:String
                                                                     Year:Integer
                     TimePoint               TemporalSpecification   Month:Integer
                                                                     day:Integer
                     TimeInterval            TemporalSpecification
                                                                     name:String
                     Person
                                                                     lastname:String
                                                                     iso639-1:String
                     Language                                        iso639-2:String
                                                                     enName:String
                     Category                                        id:String


3. Latin data
The data we loaded into the graph consists of a portion of the LatinISE corpus [19] annotated
at the level of dictionary senses. LatinISE is a Latin corpus covering the period from the fifth
century BCE to the twenty-first century and contains 10 million word tokens, semi-automatically
lemmatised and part-of-speech tagged. The metadata fields in LatinISE indicate text identifier,
author, title, dates, century, genre, url of source, and optionally book title/number and character
names (for plays). The annotated dataset was produced as part of the SemEval shared task on
Unsupervised Lexical Semantic Change Detection [20]. 40 Latin lemmas (“target words”) are
selected, of which 20 are known to have changed their meaning with the advent of Christianity
(for example, beatus, which shifted its meaning from ‘fortunate’ to ‘blessed’) and 20 are known
to not have changed their meaning between the BCE era and the CE era. For each of the 40
lemmas, 60 sentences are randomly extracted from LatinISE, 30 of them are from texts dated in
the BCE era, and 30 from texts dated in the CE era. Each sentence was annotated by at least
one expert annotator, according to the DuReL framework [21]. The annotators were asked to
judge the semantic relatedness of an instance of usage of a target word with respect to the
list of its dictionary definitions using a four-point scale (Unrelated, Distantly Related, Closely
Related, and Identical). The definitions were taken from the Latin portion of the Logeion online
dictionary (https://logeion.uchicago.edu/) containing Lewis and Short’s Latin-English Lexicon
(1879) [22], Lewis’ Elementary Latin Dictionary (1890) [23], and Du Fresne Du Cange et al. [24].
See McGillivray et al. [25] for further details about the dataset and its annotation framework.
Table 2
LKG relationships with their respective subject, object and attributes.
                Relationship     Subject                           Object                  Attributes
                                 Sentence                          Text                    id:Integer
                IS_A
                                 Lemma ∪𝐼𝑛𝑓 𝑙𝑒𝑐𝑡𝑒𝑑𝑊 𝑜𝑟𝑑            Word                    id:Integer
                                 Text                              Document                id:Integer
                BELONG_TO        Document                          Corpus                  id:Integer
                                 Text                              Category
                                                                                           begin:Integer
                HAS_OCCURRENCE   Word                              Text
                                                                                           end:Integer
                {LEX_RELATION}   Word                              Word
                HAS_LEMMA        Word                              Lemma
                HAS_CONCEPT      Word                              LexiconConcept          grade:Float
                HAS_EXAMPLE      LexiconConcept                    Text
                HAS_DEFINITION   LexiconConcept                    Text
                REFER_TO         LexiconConcept                    Concept
                {SEM_RELATION}   LexiconConcept                    LexiconConcept
                PUBLISHED_IN     Text ∪𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡 ∪ 𝐶𝑜𝑟𝑝𝑢𝑠           TemporalSpecification
                HAS_AUTHOR       Text ∪𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡 ∪ 𝐶𝑜𝑟𝑝𝑢𝑠           Person
                BORN             Person                            TemporalSpecification
                DIED             Person                            TemporalSpecification
                startTime        TimeInterval                      TimePoint
                endTime          TimeInterval                      TimePoint
                HAS_LANGUAGE     Text ∪𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡 ∪ 𝐶𝑜𝑟𝑝𝑢𝑠 ∪ 𝑊 𝑜𝑟𝑑   Language

MATCH
(centuryNode:TimeInterval)-[:startTime]->(startCentury:TimePoint),
(centuryNode:TimeInterval)-[:endTime]->(endCentury:TimePoint),
(pubNode:TimeInterval)-[:startTime]->(startPub:TimePoint),
(pubNode:TimeInterval)-[:endTime]->(endPub:TimePoint),
(text:Text)-[:PUBLISHED_IN]->(pubNode)
WHERE
centuryNode.description="century"
WITH text,
centuryNode,
CASE WHEN endPub.Year > endCentury.Year THEN endCentury.Year ELSE endPub.Year END as minEnd,
CASE WHEN startPub.Year > startCentury.Year THEN startPub.Year ELSE startCentury.Year END as maxStart
WITH *,
CASE WHEN minEnd-maxStart+1 > 0 THEN minEnd-maxStart+1 ELSE 0 END as time_overlap
ORDER BY time_overlap DESC
WITH text,
collect({century:centuryNode})[0] AS max
WITH *,
max .century as century
CREATE (text)-[r:CLUSTER]->(century)
RETURN text,century
UNION ALL
MATCH
(centuryNode:TimeInterval)-[:startTime]->(startCentury:TimePoint),
(centuryNode:TimeInterval)-[:endTime]->(endCentury:TimePoint),
(text:Text)-[:PUBLISHED_IN]->(point:TimePoint)
WHERE
centuryNode.description="century" and
point.Year>=startCentury.Year and
point.Year<=endCentury.Year
WITH text, centuryNode as century
CREATE (text)-[r:CLUSTER]->(century)
RETURN text, century;


                          Listing 1: Clustering publishing date by centuries
Figure 1: Graph for the Latin word beatus.


4. Loading the Latin data in the Linguistic Knowledge Graph
For each instance of the target words in the Latin corpus we encode:

    • the author as Person,
    • the manuscript as Document,
    • the year as TimePoint if the date is certain, TimeInterval otherwise,
    • the sentence (left context, target word and right context) as Text,
    • the definitions of the Lewis and Short Dictionary as LexiconConcept,
    • the word lemma as Lemma,
    • the inflected forms of the target words as InflectedWord,
    • the scores associated with each LexiconConcept as properties of the HAS_EXAMPLE and
      HAS_OCCURRENCE relationships.

In order to simplify and make the visualisation more effective, we created the HAS_EXAMPLE
relationship only in cases where the annotation reported a score of 4. In addition, to make
more evident the distribution of senses with respect to centuries, we associate each date of
publication of the texts with the reference century. We do this via the query given in Listing
1. In case a Text is not associated with a specific TimePoint, it will be linked with the century
having the greatest overlap with the TimeInterval of the text itself. On the other hand, for texts
for which a precise date is specified, the query associates the Text with the respective century of
its year. The centuries are represented as TimeInterval, and the description attribute is validated
with “century”. A new relationship, called CLUSTER, is so created among nodes of type Text
and nodes of type TimeInterval to indicate the century.
   A subgraph for the word beatus is shown in Figure 1. The graph shows the nodes representing
the texts from which the word beatus is extracted, the centuries and the senses given in the Lewis
and Short Dictionary. The relationships among these nodes are CLUSTER and HAS_EXAMPLE.
The former connects nodes of type TimeInterval and nodes of type Text, see 1. The latter links
LexiconConcepts and Texts. Most occurrences of the word beatus in the reference corpus are
dated 1st century BCE and 11th century CE. One can immediately notice a difference in the
distribution of the senses: “happy” and “fortunate” on the one hand are associated with the
time period BCE (see the cluster of nodes on the left of Figure 1), and “blessed”, on the other
hand, is associated with the time period CE (see the cluster of nodes on the right of Figure 1).
In fact, only one sentence in the dataset displays the sense “blessed” in the first century BCE.
Similarly, only two sentences dated CE contain the word beatus with the meaning of “fortunate”,
the latter, on the other hand, is dated 1079-1142 CE and is an excerpt from the Sermones of
Petrus Abaelardus.


5. Conclusions
In this work, we introduced an application of LKG for Latin data. It appears to be an interesting
and novel approach to tackling the analysis of diachronic corpora. Furthermore, differently
from previous approaches, it gives rise to explainable results since we take advantage of explicit
relationships modelled as graphs. The LKG seems to lead to promising results, and it is ready
forfurther investigations into Lexical Semantic Change Detection (LSCD). Future developments
include a better visualization of resources, machine-learning-based techniques for automatic
LSCD and an interface for querying and analysing the LKG data.


Acknowledgement
This work fulfils the research objectives of the PNRR project FAIR - Future AI Research, spoke
6 - Symbiotic AI, CUP H97G22000210007, as well as the CHANGES - Cultural Heritage Active
Innovation for Next-Gen Sustainable Society, CUP H53C22000860006.


References
 [1] H.-P. Kriegel, M. Pfeifle, M. Pötke, T. Seidl, The paradigm of relational indexing: A survey,
     in: BTW 2003–Datenbanksysteme für Business, Technologie und Web, Tagungsband der
     10. BTW Konferenz, Gesellschaft für Informatik eV, 2003.
 [2] E. Bertino, L. Martino, Object-oriented database management systems: concepts and
     issues, Computer 24 (1991) 33–47.
 [3] S. Ferilli, Integration strategy and tool between formal ontology and graph database
     technology, Electronics 10 (2021). URL: https://www.mdpi.com/2079-9292/10/21/2616.
     doi:10.3390/electronics10212616.
 [4] C. Sharma, R. Sinha, A schema-first formalism for labeled property graph databases:
     Enabling structured data loading and analytics, in: Proceedings of the 6th ieee/acm
     international conference on big data computing, applications and technologies, 2019, pp.
     71–80.
 [5] P. Basile, P. Cassotti, S. Ferilli, B. McGillivray, A New Time-sensitive Model of Linguistic
     Knowledge for Graph Databases, CEUR Workshop Proceedings, 2022, p. 69.
 [6] P. Basile, A. Caputo, G. Semeraro, Temporal random indexing: a tool for analysing
     word meaning variations in news, in: M. Martinez-Alvarez, U. Kruschwitz, G. Kazai,
     F. Hopfgartner, D. P. A. Corney, R. Campos, D. Albakour (Eds.), Proceedings of the First
     International Workshop on Recent Trends in News Information Retrieval co-located with
     38th European Conference on Information Retrieval (ECIR 2016), volume 1568 of CEUR
     Workshop Proceedings, CEUR-WS.org, 2016, pp. 39–41. URL: http://ceur-ws.org/Vol-1568/
     paper7.pdf.
 [7] W. L. Hamilton, J. Leskovec, D. Jurafsky, Diachronic word embeddings reveal statistical
     laws of semantic change, in: Proceedings of the 54th Annual Meeting of the Association
     for Computational Linguistics, ACL 2016, Volume 1: Long Papers, The Association for
     Computer Linguistics, 2016. URL: https://doi.org/10.18653/v1/p16-1141. doi:10.18653/
     v1/p16-1141.
 [8] V. D. Carlo, F. Bianchi, M. Palmonari, Training temporal word embeddings with a compass,
     in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-
     First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth
     AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI, AAAI Press,
     2019, pp. 6326–6334. URL: https://doi.org/10.1609/aaai.v33i01.33016326. doi:10.1609/
     aaai.v33i01.33016326.
 [9] S. Minozzi, Latin WordNet, una rete di conoscenza semantica per il latino e alcune ipotesi
     di utilizzo nel campo dell’Information Retrieval, Strumenti digitali e collaborativi per le
     Scienze dell’Antichita (2017) 123–134.
[10] H. Pinkster, Sintassi e semantica latina, Rosenberg & Sellier, 1991.
[11] J. Ramminger, Latin and the early modern world: linguistic identity and the polity from
     petrarch to the habsburg novelists, 2016.
[12] J. Herman, Vulgar Latin. Translated by Roger Wright, The Pennsylvania State University,
     2000.
[13] D. Butterfield, A companion to the latin language, 2011.
[14] J. Leonhardt, Latin: Story of a World Language, The Belknap Press of Harvard University
     Press, 2013.
[15] P. Roelli, Latin as the Language of Science and Learning, De Gruyter, 2021.
[16] D. R. Langslow, Bilingualism in ancient society, 2002.
[17] T. Declerck, P. Buitelaar, T. Wunner, J. McCrae, E. Montiel-Ponsoda, G. Aguado de Cea,
     Lemon: An ontology-lexicon model for the multilingual semantic web. (2010).
[18] G. A. Miller, WORDNET: a lexical database for english, in: Speech and Natural Language:
     Proceedings of a Workshop Held at Harriman, New York, USA, February 23-26, 1992,
     Morgan Kaufmann, 1992. URL: https://aclanthology.org/H92-1116/.
[19] B. McGillivray, A. Kilgarriff, Tools for historical corpus research, and a corpus of Latin, in:
     P. Bennett, M. Durrell, S. Scheible, R. J. Whitt (Eds.), New Methods in Historical Corpus
     Linguistics, Narr, Tübingen, 2013, pp. 247–257.
[20] D. Schlechtweg, B. McGillivray, S. Hengchen, H. Dubossarsky, N. Tahmasebi, Semeval-
     2020 task 1: Unsupervised lexical semantic change detection, in: A. Herbelot, X. Zhu,
     A. Palmer, N. Schneider, J. May, E. Shutova (Eds.), Proceedings of the Fourteenth Workshop
     on Semantic Evaluation, SemEval@COLING 2020, Barcelona (online), December 12-13,
     2020, International Committee for Computational Linguistics, 2020, pp. 1–23. URL: https:
     //doi.org/10.18653/v1/2020.semeval-1.1. doi:10.18653/v1/2020.semeval-1.1.
[21] D. Schlechtweg, S. Schulte im Walde, S. Eckmann, Diachronic Usage Relatedness (DURel):
     A framework for the annotation of lexical semantic change, in: Proceedings of the
     2018 Conference of the North American Chapter of the Association for Computational
     Linguistics: Human Language Technologies, New Orleans, Louisiana, 2018, pp. 169–174.
     URL: https://www.aclweb.org/anthology/N18-2027/.
[22] C. T. Lewis, C. Short, A Latin Dictionary, Founded on Andrews’ edition of Freund’s Latin
     dictionary revised, enlarged, and in great part rewritten by Charlton T. Lewis, Ph.D. and
     Charles Short, Clarendon Press, Oxford, 1879.
[23] C. T. Lewis, An Elementary Latin Dictionary, American Book Company, New York, Cincin-
     nati, and Chicago, 1890.
[24] C. Du Fresne Du Cange, G. A. L. Henschel, P. Carpentier, J. C. Adelung, L. Favre, Glossarium
     mediæet infimælatinitatis, L. Favre, Niort, 1883-1887.
[25] B. McGillivray, D. Kondakova, A. Burman, F. Dell’Oro, H. Bermúdez Sabel, P. Marongiu,
     M. Márquez Cruz, A new corpus annotation framework for latin diachronic lexical
     semantics, Journal of Latin Linguistics 21 (2022) 47–105. doi:https://doi.org/10.
     1515/joll-2022-2007.