DYLEN: Diachronic Dynamics of Lexical Networks

Andreas Baumann, Department of English and American Studies, University of Vienna, Austria, andreas.baumann@univie.ac.at
Julia Neidhardt, Faculty of Informatics, TU Wien, Austria, julia.neidhardt@ec.tuwien.ac.at
Tanja Wissik, Austrian Centre for Digital Humanities, Austrian Academy of Sciences, Austria, tanja.wissik@oeaw.ac.at

Abstract

In this contribution we present a use case for the application of big language data and digital methods, such as natural language processing, machine learning, and network analysis, in digital humanities and linguistics: characterizing and modeling the diachronic dynamics of lexical networks. The proposed analysis is based on two corpora covering 20 years of data and comprising billions of tokens.

2012 ACM Subject Classification: Human-centered computing → Social network analysis; Computing methodologies → Natural language processing; Computing methodologies → Machine learning

Keywords and phrases: language change, language resources, natural language processing, network analysis, big data

Funding: The project Diachronic Dynamics of Lexical Networks (DYLEN) is funded by the ÖAW go!digital Next Generation grant (GDNG 2018-020).

1 Background and Research Aims

Languages are constantly subject to change. On the word level, for example, new items enter the vocabulary (i.e. the lexical system) of a language, others cease to be used by speakers, and some established words change their meaning. Characterizing and modeling these dynamics has a broad field of applications, including linguistics, natural language processing, digital humanities, artificial intelligence, computer science, and cognitive science. In the project Diachronic Dynamics of Lexical Networks we therefore investigate 1) how and why the lexical systems of natural languages change, considering both social factors, such as influential individuals, and cognitive factors [3, 6, 11]; and 2) how language change in the lexical domain can be measured.

Diachronic linguistics (i.e. the analysis of language over time) typically approaches such questions through corpus analysis and the statistical analysis of word-frequency trajectories. Figure 1, for example, shows the frequency trajectories of two lexical innovations.

Figure 1: Frequency trajectories of two competing Austrian German terms, "Hacklerregelung" and "Langzeitversichertenregelung" (long-term insurance regulation). Both terms show a frequency increase during the observation period in ParlAT and AMC. Do they also undergo contextual change?

Recently, however, network-based approaches [1] have become increasingly important in this context [16, 9, 10, 4]. Their advantage for the analysis of lexical dynamics is that they allow us to study the semantic properties of words in addition to word frequency, since the meaning of a word is closely related to its context, i.e. to the other words it frequently co-occurs with. We can thus track lexical innovations (i.e. new words) introduced by influential individuals (e.g. politicians) and systematically analyze contextual, i.e. semantic, changes of these words.
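To make the underlying intuition concrete, the context of a word can be approximated by counting the other words it co-occurs with inside a sentence. The following minimal sketch uses hypothetical toy sentences, not project data:

```python
from collections import Counter

# Toy corpus: tokenized sentences (hypothetical data, not AMC/ParlAT material).
sentences = [
    ["die", "regelung", "gilt", "für", "langzeitversicherte"],
    ["die", "regelung", "wurde", "novelliert"],
    ["langzeitversicherte", "profitieren", "von", "der", "regelung"],
]

def context_counts(sentences, target):
    """Count how often each word co-occurs with `target` within a sentence."""
    counts = Counter()
    for sent in sentences:
        if target in sent:
            counts.update(w for w in sent if w != target)
    return counts

print(context_counts(sentences, "regelung").most_common(3))
```

Words with similar context distributions are then taken to be semantically similar, and shifts in a word's context distribution over time indicate contextual, i.e. semantic, change.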
More specifically, our project focuses on the following research questions:

1. How and why do lexical systems change? What is the role of influential innovators (e.g. politicians) in lexical change? What determines the successful spread of lexical innovations? Can we disentangle social factors from cognitive factors in lexical change?

2. How can lexical change be measured? Does network science give more detailed answers about language change than traditional frequency-based methods? Which computational methods are most suitable for analyzing the evolution of lexical networks through time? How can we enrich the digital-humanities toolbox with the output of the project?

2 Used Data Sets

As data sets we use two diachronically layered big text corpora available for Austrian German: the Austrian Media Corpus (AMC), containing more than 20 years of journalistic prose [15], and the ParlAT corpus, covering the Austrian parliamentary records of the last 20 years [21].

The journalistic prose in the AMC comprises Austrian press agency releases and most Austrian periodicals, including all daily national newspapers and a large number of the major weekly and monthly magazines, 53 different newspapers and magazines in total. The AMC also contains transcripts of Austrian television news programs, news stories, and interviews [15]. In total, the AMC contains 10.5 billion tokens with 40 million word forms and 33 million lemmas.

The ParlAT corpus contains the stenographic records (in German, "Stenographische Protokolle") of the XXth to XXVth legislative periods (1996–2017); these are shorthand records rather than transcripts of audio recordings. The corpus comprises 75 million tokens with over 0.6 million word forms and 0.4 million lemmas [21].

Both corpora are tokenized, part-of-speech tagged, and lemmatized. Crucially, the two corpora capture lexical innovations both directly, in the linguistic output of politicians, and indirectly, in media texts. They thus provide an ideal testing ground for the hypotheses outlined above.
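Because both corpora are diachronically layered, word-frequency trajectories like those in Figure 1 can be derived in a few lines. The sketch below assumes a hypothetical `corpus` iterable of (year, lemma list) pairs; the actual AMC/ParlAT access interfaces differ:

```python
from collections import Counter

def frequency_trajectory(corpus, lemma):
    """Relative frequency of `lemma` per year, in occurrences per million tokens.

    `corpus` yields (year, lemmas) pairs, where `lemmas` is a list of lemmas
    (a hypothetical interface standing in for the real corpus query layer)."""
    hits, totals = Counter(), Counter()
    for year, lemmas in corpus:
        totals[year] += len(lemmas)
        hits[year] += sum(1 for l in lemmas if l == lemma)
    return {y: 1e6 * hits[y] / totals[y] for y in sorted(totals)}

# Toy usage with made-up data:
toy = [(1998, ["regelung", "gilt"]), (1999, ["neue", "regelung", "regelung"])]
print(frequency_trajectory(toy, "regelung"))
```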
3 Approach and Expected Outcome

To address the questions raised in Section 1, we analyze the data sets described above, namely the AMC and the ParlAT corpus. In addition, we will provide an easy-to-use online tool that enables researchers to carry out diachronic analyses of lexical networks themselves. Our approach comprises the following steps, which are schematically depicted in Figure 2:

Figure 2: Workflow in the DYLEN project. Lexical networks are generated from diachronically layered corpus data. Network properties of lexical items, such as semantic neighborhood density, are then investigated across time to derive insights into semantic change.

1. NLP pre-processing and data model development: For both corpora (AMC and ParlAT) a number of pre-processing steps have already been carried out, namely tokenization, part-of-speech tagging, segmentation, lemmatization, and named-entity (NE) recognition. Parts of the existing NE recognition will be enhanced using machine learning and semantic knowledge bases, e.g. Wikidata [19]. Furthermore, we will introduce a comprehensive data model combining both corpora and all metadata. In addition, we will compile a list of relevant Austrian politicians, as we want to analyze their impact on language change.

2. Network construction and description: A systematic procedure will be defined to 1) construct different co-occurrence networks (i.e. networks in which nodes represent identified entities, e.g. politicians, as well as nouns, verbs, or adjectives, and edges represent the co-occurrence of these nodes in a sentence, paragraph, or document) for different time intervals (i.e. all documents within a week, a month, a year, etc.); and 2) extract basic properties (e.g. number of nodes/edges, clustering, centrality) to describe the networks (see the first sketch at the end of this section). Together with frequency of occurrence, these properties can be interpreted cognitively and semantically [9, 10].

3. Network analyses and comparisons: In-depth analyses of the resulting networks will be conducted using network analysis and visualization. As the number of networks is expected to be large, an approach will be developed to systematically compare these networks over time and across the two corpora. To this end, different methods from network analysis, machine learning, and statistical modeling will be tested. This will make it possible to identify relevant parameters (e.g. network properties) that capture diachronic developments.

4. Modeling diachronic developments: Statistical models, including time-series analysis with generalized additive models and time-series clustering techniques, will be employed to analyze the co-evolution of the parameters identified in step 3 across multiple networks.

5. Interactive web application: A web-based interactive tool will be developed that retrieves the constructed networks and allows users to explore, analyze, and visualize them. The technical implementation, which builds on an existing prototype [7], will mainly be based on Python and appropriate libraries [5, 8, 12, 14], on Neo4j [20] to store the networks, and on software for big data analysis, e.g. Apache Spark [23], Hadoop YARN [17, 18], and HDFS [17]. Gephi [2] will be used to visualize the graphs, and R for the statistical analyses [13, 22].

We expect our project, which faces specific challenges such as NE recognition for Austrian German and the analysis of two large-scale diachronic corpora, to contribute to our understanding of the role that influential speakers and other linguistic factors play in lexical change by analyzing large amounts of language data. Since we cover both the linguistic output of influential speakers (ParlAT) and its reflex in media texts (AMC), we can test whether lexical innovations introduced by these individuals behave differently from other lexical innovations. This allows us to disentangle social effects from cognitive effects in the process of lexical spread. For example, by analyzing the evolution of the clustering coefficients of networks around lexical innovations, we can test whether an increase in frequency is accompanied by semantic widening, a correlation that is expected given results from research on language change [3, 6, 11]. We also seek to promote network theory as a suitable tool for analyzing and making sense of diachronic language data in the linguistic research community.
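The following sketch illustrates step 2 of the pipeline: building a sentence-level co-occurrence network with NetworkX [8] and extracting the basic descriptive properties named above. The data and the simplified weighting scheme are hypothetical placeholders:

```python
import networkx as nx
from itertools import combinations

def cooccurrence_network(sentences):
    """Build a sentence-level co-occurrence network: nodes are words, and an
    edge links two words that occur in the same sentence, weighted by the
    number of such sentences. (Simplified: the project restricts nodes to
    identified entities, nouns, verbs, and adjectives.)"""
    G = nx.Graph()
    for sent in sentences:
        for u, v in combinations(sorted(set(sent)), 2):
            if G.has_edge(u, v):
                G.edges[u, v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

# Toy example and the basic properties named in step 2:
G = cooccurrence_network([["a", "b", "c"], ["b", "c", "d"]])
print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 5 edges
print(nx.average_clustering(G))                  # global clustering
print(nx.degree_centrality(G))                   # per-node centrality
```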
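The semantic-widening analysis mentioned above can be sketched in the same vein: track the local clustering coefficient of the network around an innovation across time slices. The per-slice sentence lists and the example word below are hypothetical:

```python
import networkx as nx
from itertools import combinations

def clustering_trajectory(slices, word):
    """Local clustering coefficient of `word` per time slice.

    `slices` maps a period label (e.g. a year) to that period's tokenized
    sentences (a hypothetical interface). A falling coefficient means the
    word's co-occurrence neighbors grow less interconnected, which we read
    as contextual diversification, i.e. semantic widening."""
    trajectory = {}
    for period, sentences in sorted(slices.items()):
        G = nx.Graph()
        for sent in sentences:
            G.add_edges_from(combinations(sorted(set(sent)), 2))
        trajectory[period] = nx.clustering(G, word) if word in G else None
    return trajectory

# Toy usage: two yearly slices for a hypothetical innovation.
slices = {
    2005: [["hacklerregelung", "pension", "reform"]],
    2006: [["hacklerregelung", "pension", "debatte"],
           ["hacklerregelung", "reform", "debatte"]],
}
print(clustering_trajectory(slices, "hacklerregelung"))  # {2005: 1.0, 2006: 0.667}
```

The resulting series feeds directly into step 4, e.g. as the response variable of a generalized additive model fitted in R [22].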
References

[1] Albert-László Barabási. Network Science. Cambridge University Press, 2016.
[2] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks. In Third International AAAI Conference on Weblogs and Social Media, 2009.
[3] Joan Bybee. Language, Usage and Cognition. Cambridge University Press, 2010.
[4] Heng Chen, Xinying Chen, and Haitao Liu. How does language change as a lexical network? An investigation based on written Chinese word co-occurrence networks. PLoS ONE, 13(2):e0192545, 2018.
[5] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5):1–9, 2006.
[6] Nick C. Ellis, Matthew Brook O'Donnell, and Ute Römer. The processing of verb-argument constructions is sensitive to form, function, frequency, contingency and prototypicality. 2014.
[7] Gabriel Grill, Julia Neidhardt, and Hannes Werthner. Network analysis on the Austrian Media Corpus. In VSS 2017 – Vienna Young Scientists Symposium, pages 128–129, 2017.
[8] Aric Hagberg, Pieter Swart, and Daniel S. Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory (LANL), USA, 2008.
[9] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2116–2121, 2016.
[10] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096, 2016.
[11] Martin Hilpert and Florent Perek. Meaning change in a petri dish: Constructions, semantic vector spaces, and motion charts. Linguistics Vanguard, 1(1):339–350, 2015.
[12] Eric Jones, Travis Oliphant, and Pearu Peterson. SciPy: Open source scientific tools for Python, 2014.
[13] Pablo Montero and José A. Vilar. TSclust: An R package for time series clustering. Journal of Statistical Software, 62(1):1–43, 2014.
[14] Travis E. Oliphant. A Guide to NumPy, volume 1. Trelgol Publishing, USA, 2006.
[15] Jutta Ransmayr, Karlheinz Mörth, and Matej Ďurčo. AMC (Austrian Media Corpus) – Korpusbasierte Forschungen zum Österreichischen Deutsch, pages 27–38. Verlag der Österreichischen Akademie der Wissenschaften, 2017.
[16] Eyal Sagi, Stefan Kaufmann, and Brady Clark. Tracing semantic change with latent semantic analysis. Current Methods in Historical Semantics, 73:161–183, 2011.
[17] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
[18] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, page 5. ACM, 2013.
[19] Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative knowledge base. Communications of the ACM, 57(10):78–85, 2014.
[20] Jim Webber. A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, pages 217–218. ACM, 2012.
[21] Tanja Wissik and Hannes Pirker. ParlAT beta corpus of Austrian parliamentary records. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN. European Language Resources Association, 2018.
[22] Simon N. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2017.
[23] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10), 2010.