Using Graph Databases for Historical Language Data: Challenges and Opportunities

Barbara McGillivray¹, Pierluigi Cassotti²,*, Pierpaolo Basile², Davide Di Pierro² and Stefano Ferilli²

¹ King’s College London, Strand Campus, Strand, London, WC2R 2LS, United Kingdom
² University of Bari Aldo Moro, Department of Computer Science, Via E. Orabona 4, Bari, 70125, Italy

Abstract
The integration of semantic information into language resources has the potential to open up new avenues of enquiry into the mechanisms of language change. We present the first experiments in integrating data from Latin textual corpora and language resources into a graph database via the GraphBRAIN Schema and show the potential of this model for research into the mechanisms of semantic change in Latin.

Keywords
Knowledge Graphs, Latin Corpora, Semantic Change Detection, Graph Data Model

1. Introduction

Research in Historical Linguistics often requires the analysis and support of heterogeneous data and tools, such as lexical resources, encyclopaedias, and large corpora. Nevertheless, these resources are often siloed. Graph databases present an ideal opportunity to combine the advantages of DataBase Management Systems (DBMSs) for handling individuals (scalability, storage optimization, efficient handling, mining and browsing of the data, etc.) with the high-level functionalities available in Knowledge Bases (KBs). Graph DBMSs are intrinsically designed to store schemaless data. Unlike traditional DBMSs, such as relational [1] or object-oriented [2] ones, they lack predefined structures. Following this approach, Neo4j¹, one of the most common graph DBMSs, does not provide support for introducing ontology definitions based on labels and/or arcs. The absence of a schema may lead to ambiguity when reading and managing data in downstream applications, because the words used to express concepts are inherently ambiguous. Hence, the semantics becomes blurred.
19th IRCDL (The Conference on Information and Research science Connecting to Digital and Library science), February 23–24, 2023, Bari, Italy
* Corresponding author.
Email: barbara.mcgillivray@kcl.ac.uk (B. McGillivray); pierluigi.cassotti@uniba.it (P. Cassotti); pierpaolo.basile@uniba.it (P. Basile); davide.dipierro@uniba.it (D. Di Pierro); stefano.ferilli@uniba.it (S. Ferilli)
ORCID: 0000-0003-3426-8200 (B. McGillivray); 0000-0001-7824-7167 (P. Cassotti); 0000-0002-0545-1105 (P. Basile); 0000-0002-8081-3292 (D. Di Pierro); 0000-0003-1118-0601 (S. Ferilli)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
¹ https://neo4j.com/

To address these issues, we propose the use of GraphBRAIN [3]. GraphBRAIN consists of a graph database which follows the Labelled Property Graph (LPG) [4] structure. This structure stores nodes with specific labels, arcs which represent relationships among nodes, and properties on both nodes and arcs. Properties are stored as key/value pairs. GraphBRAIN requires KB designers to define a data schema which also operates as an ontology. GraphBRAIN provides a mapping mechanism for exporting schemas into a Semantic Web (SW)-compliant language, the Web Ontology Language (OWL). These schemas guide access through all the CRUD operations on the database and also ensure interpretability and interoperability among different applications: applications that follow the same schema become compliant with each other. At the time of writing, Neo4j in its Community Edition supports only unique node property constraints, while its Enterprise Edition additionally supports node property existence constraints, relationship property existence constraints and node key constraints. Evidently, these tools are not as expressive as ontology definitions.
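To make the idea of a schema guiding access to a Labelled Property Graph more concrete, the following is a minimal sketch in Python. It does not reproduce GraphBRAIN's schema format or API: the `Node`, `Arc` and `Schema` classes and their methods are hypothetical names introduced only for illustration, and the labels `Lemma`, `InflectedWord` and `HAS_LEMMA` are borrowed from the schema presented later in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)  # key/value pairs


@dataclass
class Arc:
    label: str
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)  # key/value pairs


class Schema:
    """Toy schema (hypothetical; not GraphBRAIN's actual format)."""

    def __init__(self, node_types, arc_types):
        self.node_types = node_types  # node label -> {property name: expected type}
        self.arc_types = arc_types    # arc label -> set of (source label, target label)

    def node_errors(self, node):
        """Return the list of schema violations for a node (empty if valid)."""
        if node.label not in self.node_types:
            return [f"unknown node label {node.label}"]
        allowed = self.node_types[node.label]
        errors = []
        for key, value in node.properties.items():
            if key not in allowed:
                errors.append(f"{node.label} has no property {key}")
            elif not isinstance(value, allowed[key]):
                errors.append(f"{node.label}.{key} should be {allowed[key].__name__}")
        return errors

    def arc_errors(self, arc):
        """Return the list of schema violations for an arc (empty if valid)."""
        pairs = self.arc_types.get(arc.label)
        if pairs is None:
            return [f"unknown arc label {arc.label}"]
        if (arc.source.label, arc.target.label) not in pairs:
            return [f"{arc.label} cannot link {arc.source.label} to {arc.target.label}"]
        return []


schema = Schema(
    node_types={
        "Lemma": {"value": str, "posTag": str, "mwe": bool},
        "InflectedWord": {"value": str},
    },
    arc_types={"HAS_LEMMA": {("InflectedWord", "Lemma")}},
)

lemma = Node("Lemma", {"value": "beatus", "mwe": False})
form = Node("InflectedWord", {"value": "beati"})

print(schema.node_errors(lemma))                         # valid node
print(schema.arc_errors(Arc("HAS_LEMMA", form, lemma)))  # valid direction
print(schema.arc_errors(Arc("HAS_LEMMA", lemma, form)))  # wrong direction is reported
```

In a schemaless store, the third call would simply create an arc; with a schema in place the application can reject it before it reaches the database.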
In [5], we adopted GraphBRAIN technology to define a time-sensitive model of linguistic knowledge for graph databases. In this paper, we show an application of this model to the lexical semantic analysis of Latin data, i.e. the analysis of the meanings of Latin words. Unlike previous approaches, such as Basile et al. [6], Hamilton et al. [7], and Carlo et al. [8], we exploit the potential of graph databases to detect semantic changes in specific concepts.

Latin is in a particularly favourable position among historical languages for the large-scale analysis of semantic change processes, thanks to a number of factors. First, Latin researchers now enjoy unprecedented access to digital data covering over two thousand years of history. Thanks to the ERC-funded LiLa project², seven Latin language resources and six corpora have been linked at the level of word lemmas so far, making Latin a unique case among historical languages. Second, we have access to extensive computational language resources for Latin, such as Latin WordNet [9], and to digitised dictionaries of Latin, which provide rich information about words’ semantics and examples of usage. Finally, focussing on Latin allows us to investigate semantic change processes over long time spans. Latin has one of the longest recorded histories of any human language, making it naturally suitable for quantitative studies [10]. The first inscriptional records date from the sixth century BCE, and Latin continues to be used to the current day by the Catholic Church and some academic and legal institutions around the world. Written Latin diverged from the spoken vernaculars in the second half of the first millennium of the Christian era, but it remained in use as one of the principal channels of communication across most of Europe for the next thousand years.
The humanists’ conscious effort to reproduce Classical Latin led to a range of interesting developments, particularly affecting the neo-Latin lexicon to enable the expression of new concepts [11]. This extensive chronological span has raised the question of the extent to which Latin is seen as a dead or fossilised language (e.g. Herman [12], Butterfield [13]). However, it remains an open question to what extent this fossilisation affected the semantics of words, as we know that the Latin lexicon, in this respect, has remained dynamic (over 4,500 words have acquired new meanings since the Renaissance; Demo 2022). The extent to which post-classical Latin can really be considered as a “fixed” language (Leonhardt [14], Roelli [15], Langslow [16]) from the point of view of its ability to generate new meanings of words is still largely unknown beyond anecdotal evidence.

In Section 2 we present the Linguistic Knowledge Graph, in Section 3 we describe the Latin data that we worked on, and in Section 4 we show how we loaded the Latin data into the Linguistic Knowledge Graph. Finally, in Section 5 we draw some conclusions and outline future directions of work.

² https://lila-erc.eu/

2. The Linguistic Knowledge Graph

The Linguistic Knowledge Graph (LKG) aims to capture different aspects of lexical resources, such as relations between words and concepts, and morphological and syntactic information. Moreover, the LKG covers diachronic aspects of language, such as the date of publication of a document, and the birth and death of an author. The schema we designed takes inspiration from the ontological lexicon model LeMON [17]. Owing to space constraints, we report only the node types (Table 1) and the relationships (Table 2) adopted for diachronic analysis. The lexical unit is represented as a node of type InflectedWord or Lemma, which are subclasses of Word, i.e. Lemma IS_A Word and InflectedWord IS_A Word. The Lemma can be a multi-word expression (mwe); in this case, the flag mwe is set to True.
The lemma of an InflectedWord can be retrieved through the relationship HAS_LEMMA between InflectedWord and Lemma. The LexiconConcept is used to represent a word’s meanings, and each instance of LexiconConcept represents a different meaning. For example, a LexiconConcept can represent a sense listed in a sense inventory, e.g. a synset in WordNet [18]. The relationship between a word and its meaning is expressed using the relationship HAS_CONCEPT between instances of Word and instances of LexiconConcept. Multiple relationships can be defined over pairs of LexiconConcept instances using the reflexive relationship SEM_RELATION. Similarly, reflexive relationships over Word instances can be described by the LEX_RELATION relationship.

The document structure from which words are extracted can be represented at different levels of granularity: Sentence, Text, Document, and Corpus. In particular, each excerpt can be represented as Text or as Sentence, which is a subclass of Text. A Text may belong to (BELONG_TO) a Document and a Document can be part of (BELONG_TO) a Corpus. The occurrences of a word in a particular Text are traced by the relationship HAS_OCCURRENCE between Word and Text. In the case of sense-annotated corpora, such as SemCor, it is possible to specify the occurrences of senses using the relationship HAS_EXAMPLE between LexiconConcept and Text.

Currently, the LKG takes into account two types of metadata: author and language. The relationship HAS_AUTHOR between nodes of type Text and nodes of type Person determines the author of a Text. The relationship HAS_LANGUAGE from nodes of type Text, Document, Corpus, and Word to nodes of type Language specifies the respective language.

Time is modelled using two classes of nodes, TimeInterval and TimePoint, both subclasses of TemporalSpecification. The TimeInterval type is used when the date is not precisely stated, while the TimePoint type is used when the date is fixed.
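The lexical part of the schema described so far can be illustrated with a small in-memory sketch. This is not how the LKG is stored (the paper uses a graph database); Python classes stand in for node types, subclassing stands in for IS_A, and object references stand in for the HAS_LEMMA and HAS_CONCEPT relationships. The function `senses_of`, the sense identifiers and the `ADJ` tag are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LexiconConcept:
    id: str        # identifier of the sense in the sense inventory
    resource: str  # the inventory the sense comes from


@dataclass
class Word:
    value: str
    concepts: list = field(default_factory=list)  # stands in for HAS_CONCEPT


@dataclass
class Lemma(Word):              # Lemma IS_A Word
    posTag: str = ""
    mwe: bool = False           # True for multi-word expressions


@dataclass
class InflectedWord(Word):      # InflectedWord IS_A Word
    lemma: Optional[Lemma] = None  # stands in for HAS_LEMMA


def senses_of(word):
    """Follow HAS_LEMMA when present, then HAS_CONCEPT, and return sense ids."""
    if isinstance(word, InflectedWord) and word.lemma is not None:
        word = word.lemma
    return [concept.id for concept in word.concepts]


# Hypothetical sense identifiers for the lemma "beatus".
happy = LexiconConcept("beatus#happy", "Lewis and Short")
blessed = LexiconConcept("beatus#blessed", "Lewis and Short")

beatus = Lemma("beatus", concepts=[happy, blessed], posTag="ADJ")
beati = InflectedWord("beati", lemma=beatus)

print(senses_of(beati))  # senses reached through the lemma
```

Attaching the senses to the lemma and reaching them from inflected forms through HAS_LEMMA is one possible design; the schema also allows HAS_CONCEPT on any Word.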
The start and end points of a TimeInterval node can be specified using the relationships startTime and endTime, respectively. In the current version of the LKG, time specification is supported for Person and Text. More specifically, the dates of birth and death of authors are specified using the relationships BORN and DIED between Person and TemporalSpecification. The publishing date of a text is specified by the relationship PUBLISHED_IN between Text nodes and TemporalSpecification nodes.

Table 1
LKG classes with their respective superclasses and attributes.

| Class | Superclass | Attributes |
|---|---|---|
| Word | | value:String |
| Lemma | Word | value:String, posTag:String, mwe:Boolean |
| InflectedWord | Word | value:String |
| Stem | | value:String |
| LexiconConcept | Concept | id:String, resource:String |
| Text | | value:String |
| Sentence | Text | |
| Document | | title:String |
| Corpus | | name:String |
| TemporalSpecification | | name:String, description:String |
| TimePoint | TemporalSpecification | Year:Integer, Month:Integer, day:Integer |
| TimeInterval | TemporalSpecification | |
| Person | | name:String, lastname:String |
| Language | | iso639-1:String, iso639-2:String, enName:String |
| Category | | id:String |

3. Latin data

The data we loaded into the graph consists of a portion of the LatinISE corpus [19] annotated at the level of dictionary senses. LatinISE is a Latin corpus covering the period from the fifth century BCE to the twenty-first century and contains 10 million word tokens, semi-automatically lemmatised and part-of-speech tagged. The metadata fields in LatinISE indicate text identifier, author, title, dates, century, genre, URL of source, and optionally book title/number and character names (for plays). The annotated dataset was produced as part of the SemEval shared task on Unsupervised Lexical Semantic Change Detection [20].
40 Latin lemmas (“target words”) were selected, of which 20 are known to have changed their meaning with the advent of Christianity (for example, beatus, which shifted its meaning from ‘fortunate’ to ‘blessed’) and 20 are known not to have changed their meaning between the BCE era and the CE era. For each of the 40 lemmas, 60 sentences were randomly extracted from LatinISE, 30 from texts dated to the BCE era and 30 from texts dated to the CE era. Each sentence was annotated by at least one expert annotator, according to the DURel framework [21]. The annotators were asked to judge the semantic relatedness of an instance of usage of a target word with respect to the list of its dictionary definitions using a four-point scale (Unrelated, Distantly Related, Closely Related, and Identical). The definitions were taken from the Latin portion of the Logeion online dictionary (https://logeion.uchicago.edu/), which contains Lewis and Short’s Latin-English Lexicon (1879) [22], Lewis’ Elementary Latin Dictionary (1890) [23], and Du Fresne Du Cange et al. [24]. See McGillivray et al. [25] for further details about the dataset and its annotation framework.

Table 2
LKG relationships with their respective subject, object and attributes.
| Relationship | Subject | Object | Attributes |
|---|---|---|---|
| IS_A | Sentence | Text | id:Integer |
| IS_A | Lemma ∪ InflectedWord | Word | id:Integer |
| BELONG_TO | Text | Document | id:Integer |
| BELONG_TO | Document | Corpus | id:Integer |
| BELONG_TO | Text | Category | |
| HAS_OCCURRENCE | Word | Text | begin:Integer, end:Integer |
| {LEX_RELATION} | Word | Word | |
| HAS_LEMMA | Word | Lemma | |
| HAS_CONCEPT | Word | LexiconConcept | grade:Float |
| HAS_EXAMPLE | LexiconConcept | Text | |
| HAS_DEFINITION | LexiconConcept | Text | |
| REFER_TO | LexiconConcept | Concept | |
| {SEM_RELATION} | LexiconConcept | LexiconConcept | |
| PUBLISHED_IN | Text ∪ Document ∪ Corpus | TemporalSpecification | |
| HAS_AUTHOR | Text ∪ Document ∪ Corpus | Person | |
| BORN | Person | TemporalSpecification | |
| DIED | Person | TemporalSpecification | |
| startTime | TimeInterval | TimePoint | |
| endTime | TimeInterval | TimePoint | |
| HAS_LANGUAGE | Text ∪ Document ∪ Corpus ∪ Word | Language | |

    // Case 1: texts published in a TimeInterval are linked to the century
    // with which the interval has the largest overlap (in years).
    MATCH (centuryNode:TimeInterval)-[:startTime]->(startCentury:TimePoint),
          (centuryNode:TimeInterval)-[:endTime]->(endCentury:TimePoint),
          (pubNode:TimeInterval)-[:startTime]->(startPub:TimePoint),
          (pubNode:TimeInterval)-[:endTime]->(endPub:TimePoint),
          (text:Text)-[:PUBLISHED_IN]->(pubNode)
    WHERE centuryNode.description="century"
    WITH text, centuryNode,
         CASE WHEN endPub.Year > endCentury.Year
              THEN endCentury.Year ELSE endPub.Year END as minEnd,
         CASE WHEN startPub.Year > startCentury.Year
              THEN startPub.Year ELSE startCentury.Year END as maxStart
    // Overlap in years, clipped at 0 for disjoint intervals.
    WITH *, CASE WHEN minEnd-maxStart+1 > 0
                 THEN minEnd-maxStart+1 ELSE 0 END as time_overlap
    ORDER BY time_overlap DESC
    // Keep, for each text, the first (largest-overlap) century.
    WITH text, collect({century:centuryNode})[0] AS max
    WITH *, max.century as century
    CREATE (text)-[r:CLUSTER]->(century)
    RETURN text, century
    UNION ALL
    // Case 2: texts published at a TimePoint are linked to the century
    // that contains their year.
    MATCH (centuryNode:TimeInterval)-[:startTime]->(startCentury:TimePoint),
          (centuryNode:TimeInterval)-[:endTime]->(endCentury:TimePoint),
          (text:Text)-[:PUBLISHED_IN]->(point:TimePoint)
    WHERE centuryNode.description="century"
          and point.Year>=startCentury.Year and point.Year<=endCentury.Year
    WITH text, centuryNode as century
    CREATE (text)-[r:CLUSTER]->(century)
    RETURN text, century;

Listing 1: Clustering publishing date by centuries

Figure 1: Graph for the Latin word beatus.

4. Loading the Latin data in the Linguistic Knowledge Graph

For each instance of the target words in the Latin corpus we encode:
• the author as Person,
• the manuscript as Document,
• the year as TimePoint if the date is certain, TimeInterval otherwise,
• the sentence (left context, target word and right context) as Text,
• the definitions of the Lewis and Short Dictionary as LexiconConcept,
• the word lemma as Lemma,
• the inflected forms of the target words as InflectedWord,
• the scores associated with each LexiconConcept as properties of the HAS_EXAMPLE and HAS_OCCURRENCE relationships.

To simplify the visualisation and make it more effective, we created the HAS_EXAMPLE relationship only in cases where the annotation reported a score of 4 (Identical). In addition, to make the distribution of senses across centuries more evident, we associate the date of publication of each text with its reference century. We do this via the query given in Listing 1. If a Text is not associated with a specific TimePoint, it is linked to the century having the greatest overlap with the TimeInterval of the text itself. For texts with a precise date, on the other hand, the query associates the Text with the century containing its year. The centuries are represented as TimeInterval nodes whose description attribute is set to “century”. A new relationship, called CLUSTER, is thus created between nodes of type Text and nodes of type TimeInterval to indicate the century.

A subgraph for the word beatus is shown in Figure 1. The graph shows the nodes representing the texts from which the word beatus is extracted, the centuries and the senses given in the Lewis and Short Dictionary. The relationships among these nodes are CLUSTER and HAS_EXAMPLE. The former connects nodes of type TimeInterval and nodes of type Text (see Listing 1). The latter links LexiconConcepts and Texts.
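The century assignment performed by Listing 1 can also be sketched outside the database. The Python functions below mirror the two cases of the query under some assumptions that the paper does not state: years are plain integers, centuries are inclusive (start, end) pairs such as (101, 200), and the function names and the example years are hypothetical.

```python
def overlap_years(start_a, end_a, start_b, end_b):
    """Number of years shared by two inclusive year ranges (0 if disjoint)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def century_for_interval(start_pub, end_pub, centuries):
    """Case 1 of the query: pick the century with the largest overlap.

    As in the query, a century is still returned when every overlap is 0.
    """
    return max(
        centuries,
        key=lambda century: overlap_years(start_pub, end_pub, century[0], century[1]),
    )


def century_for_year(year, centuries):
    """Case 2 of the query: pick the century that contains the year, if any."""
    for start, end in centuries:
        if start <= year <= end:
            return (start, end)
    return None


# Assumed encoding of the first three centuries CE as inclusive year ranges.
centuries = [(1, 100), (101, 200), (201, 300)]

# A text dated to the (hypothetical) interval 90-130 shares 11 years with the
# first century and 30 with the second, so it is clustered with the second.
print(century_for_interval(90, 130, centuries))
print(century_for_year(150, centuries))
```

Counting shared years inclusively (the `+ 1`) matches the `minEnd-maxStart+1` expression in the query, so an interval that touches a century in a single year still contributes an overlap of one.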
Most occurrences of the word beatus in the reference corpus are dated to the 1st century BCE and the 11th century CE. One can immediately notice a difference in the distribution of the senses: “happy” and “fortunate” are associated with the BCE period (see the cluster of nodes on the left of Figure 1), whereas “blessed” is associated with the CE period (see the cluster of nodes on the right of Figure 1). In fact, only one sentence in the dataset displays the sense “blessed” in the first century BCE. Similarly, only two sentences dated CE contain the word beatus with the meaning of “fortunate”; the latter of the two is dated 1079-1142 CE and is an excerpt from the Sermones of Petrus Abaelardus.

5. Conclusions

In this work, we introduced an application of the LKG to Latin data. It appears to be an interesting and novel approach to the analysis of diachronic corpora. Furthermore, unlike previous approaches, it gives rise to explainable results, since we take advantage of explicit relationships modelled as graphs. The LKG seems to lead to promising results, and it is ready for further investigations into Lexical Semantic Change Detection (LSCD). Future developments include a better visualization of resources, machine-learning-based techniques for automatic LSCD and an interface for querying and analysing the LKG data.

Acknowledgement

This work fulfils the research objectives of the PNRR project FAIR - Future AI Research, spoke 6 - Symbiotic AI, CUP H97G22000210007, as well as the CHANGES - Cultural Heritage Active Innovation for Next-Gen Sustainable Society, CUP H53C22000860006.

References

[1] H.-P. Kriegel, M. Pfeifle, M. Pötke, T. Seidl, The paradigm of relational indexing: A survey, in: BTW 2003–Datenbanksysteme für Business, Technologie und Web, Tagungsband der 10. BTW Konferenz, Gesellschaft für Informatik eV, 2003.
[2] E. Bertino, L.
Martino, Object-oriented database management systems: concepts and issues, Computer 24 (1991) 33–47.
[3] S. Ferilli, Integration strategy and tool between formal ontology and graph database technology, Electronics 10 (2021). URL: https://www.mdpi.com/2079-9292/10/21/2616. doi:10.3390/electronics10212616.
[4] C. Sharma, R. Sinha, A schema-first formalism for labeled property graph databases: Enabling structured data loading and analytics, in: Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, 2019, pp. 71–80.
[5] P. Basile, P. Cassotti, S. Ferilli, B. McGillivray, A New Time-sensitive Model of Linguistic Knowledge for Graph Databases, CEUR Workshop Proceedings, 2022, p. 69.
[6] P. Basile, A. Caputo, G. Semeraro, Temporal random indexing: a tool for analysing word meaning variations in news, in: M. Martinez-Alvarez, U. Kruschwitz, G. Kazai, F. Hopfgartner, D. P. A. Corney, R. Campos, D. Albakour (Eds.), Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), volume 1568 of CEUR Workshop Proceedings, CEUR-WS.org, 2016, pp. 39–41. URL: http://ceur-ws.org/Vol-1568/paper7.pdf.
[7] W. L. Hamilton, J. Leskovec, D. Jurafsky, Diachronic word embeddings reveal statistical laws of semantic change, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, The Association for Computer Linguistics, 2016. URL: https://doi.org/10.18653/v1/p16-1141. doi:10.18653/v1/p16-1141.
[8] V. D. Carlo, F. Bianchi, M. Palmonari, Training temporal word embeddings with a compass, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI, AAAI Press, 2019, pp. 6326–6334. URL: https://doi.org/10.1609/aaai.v33i01.33016326. doi:10.1609/aaai.v33i01.33016326.
[9] S. Minozzi, Latin WordNet, una rete di conoscenza semantica per il latino e alcune ipotesi di utilizzo nel campo dell’Information Retrieval, Strumenti digitali e collaborativi per le Scienze dell’Antichità (2017) 123–134.
[10] H. Pinkster, Sintassi e semantica latina, Rosenberg & Sellier, 1991.
[11] J. Ramminger, Latin and the early modern world: linguistic identity and the polity from Petrarch to the Habsburg novelists, 2016.
[12] J. Herman, Vulgar Latin. Translated by Roger Wright, The Pennsylvania State University, 2000.
[13] D. Butterfield, A companion to the Latin language, 2011.
[14] J. Leonhardt, Latin: Story of a World Language, The Belknap Press of Harvard University Press, 2013.
[15] P. Roelli, Latin as the Language of Science and Learning, De Gruyter, 2021.
[16] D. R. Langslow, Bilingualism in ancient society, 2002.
[17] T. Declerck, P. Buitelaar, T. Wunner, J. McCrae, E. Montiel-Ponsoda, G. Aguado de Cea, Lemon: An ontology-lexicon model for the multilingual semantic web (2010).
[18] G. A. Miller, WORDNET: a lexical database for English, in: Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, USA, February 23-26, 1992, Morgan Kaufmann, 1992. URL: https://aclanthology.org/H92-1116/.
[19] B. McGillivray, A. Kilgarriff, Tools for historical corpus research, and a corpus of Latin, in: P. Bennett, M. Durrell, S. Scheible, R. J. Whitt (Eds.), New Methods in Historical Corpus Linguistics, Narr, Tübingen, 2013, pp. 247–257.
[20] D. Schlechtweg, B. McGillivray, S. Hengchen, H. Dubossarsky, N. Tahmasebi, SemEval-2020 task 1: Unsupervised lexical semantic change detection, in: A. Herbelot, X. Zhu, A. Palmer, N. Schneider, J. May, E. Shutova (Eds.), Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, Barcelona (online), December 12-13, 2020, International Committee for Computational Linguistics, 2020, pp. 1–23. URL: https://doi.org/10.18653/v1/2020.semeval-1.1. doi:10.18653/v1/2020.semeval-1.1.
[21] D. Schlechtweg, S. Schulte im Walde, S. Eckmann, Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, 2018, pp. 169–174. URL: https://www.aclweb.org/anthology/N18-2027/.
[22] C. T. Lewis, C. Short, A Latin Dictionary, Founded on Andrews’ edition of Freund’s Latin dictionary revised, enlarged, and in great part rewritten by Charlton T. Lewis, Ph.D. and Charles Short, Clarendon Press, Oxford, 1879.
[23] C. T. Lewis, An Elementary Latin Dictionary, American Book Company, New York, Cincinnati, and Chicago, 1890.
[24] C. Du Fresne Du Cange, G. A. L. Henschel, P. Carpentier, J. C. Adelung, L. Favre, Glossarium mediæ et infimæ latinitatis, L. Favre, Niort, 1883-1887.
[25] B. McGillivray, D. Kondakova, A. Burman, F. Dell’Oro, H. Bermúdez Sabel, P. Marongiu, M. Márquez Cruz, A new corpus annotation framework for Latin diachronic lexical semantics, Journal of Latin Linguistics 21 (2022) 47–105. doi:10.1515/joll-2022-2007.