<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DYLEN: Diachronic Dynamics of Lexical Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Baumann</string-name>
          <email>andreas.baumann@univie.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Neidhardt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanja Wissik</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of English and American Studies, University of Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Informatics</institution>
          ,
          <addr-line>TU Wien</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Austrian Centre for Digital Humanities, Austrian Academy of Sciences</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this contribution we present a use case for applying big language data and digital methods, such as natural language processing, machine learning, and network analysis, in digital humanities and linguistics: characterizing and modeling the diachronic dynamics of lexical networks. The proposed analysis will be based on two corpora containing 20 years of data with billions of tokens.</p>
        <p>2012 ACM Subject Classification: Human-centered computing → Social network analysis; Computing methodologies → Natural language processing; Computing methodologies → Machine learning</p>
        <p>Languages are constantly subject to change. On the word level, for example, new items enter the vocabulary (i.e. the lexical system) of a language, others cease to be used by speakers, and some established words change their meaning. Characterizing and modeling these dynamics has a broad range of applications in linguistics, natural language processing, digital humanities, artificial intelligence, computer science and cognitive science. In the project Diachronic Dynamics of Lexical Networks we therefore want to investigate 1) how and why the lexical systems of natural languages change, considering social factors such as influential individuals as well as cognitive factors [3, 6, 11]; and 2) how language change in the lexical domain can be measured. In diachronic linguistics (i.e. the analysis of language over time), approaches such as corpus analysis and statistical analysis of word-frequency trajectories are typically employed for this purpose. Figure 1, for example, shows frequency trajectories of two lexical innovations. Recently, however, network-based approaches [1] have become increasingly important in this context [16, 9, 10, 4].</p>
        <p>The advantage of network-based approaches for the analysis of lexical dynamics is that they make it possible to study the semantic properties of words in addition to word frequency, since the meaning of a word is closely related to its context, i.e. the other words it frequently co-occurs with. We can thus track lexical innovations (i.e. new words) introduced by influential individuals (politicians) and systematically analyze contextual, i.e. semantic, changes of these words. More specifically, our project focuses on the following research questions:</p>
      </abstract>
      <kwd-group>
        <kwd>language change</kwd>
        <kwd>language resources</kwd>
        <kwd>natural language processing</kwd>
        <kwd>network analysis</kwd>
        <kwd>big data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Funding</title>
      <p>The project Diachronic Dynamics of Lexical Networks (DYLEN) is funded by the ÖAW
go!digital Next Generation grant (GDNG 2018-020).</p>
    </sec>
    <sec id="sec-2">
      <title>1. How and why do lexical systems change?</title>
      <p>What is the role of influential innovators (e.g. politicians) in lexical change?</p>
      <p>What determines the successful spread of lexical innovations?</p>
      <p>Can we disentangle social factors from cognitive factors in lexical change?</p>
    </sec>
    <sec id="sec-3">
      <title>2. How can lexical change be measured?</title>
      <p>Does network science give more detailed answers about language change than traditional
frequency-based methods?</p>
      <p>Which computational method is most suitable to analyze the evolution of lexical
networks through time?</p>
      <p>How can we enrich the digital-humanities toolbox with the output of the project?</p>
      <sec id="sec-3-1">
        <title>Used Data Sets</title>
        <p>As data sets we use two large, diachronically layered text corpora of Austrian German:
the Austrian Media Corpus (AMC), containing more than 20 years of journalistic prose [15],
and the ParlAT corpus, covering the Austrian parliamentary records of the last 20 years [21].
The journalistic prose in the AMC comprises Austrian press agency releases and most
Austrian periodicals, including all national daily newspapers and a large number of the
major weekly and monthly magazines, 53 different newspapers and magazines in total.</p>
        <p>Moreover, the AMC also contains transcripts of Austrian television
news programs, news stories and interviews [15]. In total, the AMC contains 10,500 million
tokens with 40 million word forms and 33 million lemmas. The ParlAT corpus contains the
stenographic records (in German, “Stenographische Protokolle”) from the XX to the
XXV legislative period (1996–2017). These are shorthand records rather than transcripts of
recordings. The corpus comprises 75 million tokens with over 0.6 million word forms and
0.4 million lemmas [21]. Both corpora are tokenized, part-of-speech tagged and lemmatized.</p>
        <p>Crucially, the two corpora cover lexical innovations both directly, in the linguistic output
of politicians, and indirectly, in media texts. They thus provide an ideal
testing ground for the hypotheses outlined above.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Approach and Expected Outcome</title>
        <p>To address the questions mentioned in Section 1, we analyze the data sets described above,
namely the Austrian Media Corpus (AMC) and the ParlAT corpus. In addition, we will
provide an easy-to-use online tool that enables researchers to carry out diachronic analyses of
lexical networks themselves. Our approach requires the following steps, which are schematically
depicted in Figure 2:</p>
        <list list-type="order">
          <list-item>
            <p>NLP pre-processing and data model development: For both corpora (i.e. AMC
and ParlAT) a number of pre-processing steps have already been carried out, namely
tokenization, part-of-speech tagging, segmentation, lemmatization and named-entity (NE)
recognition. Parts of the existing NE recognition will be enhanced using machine learning
and semantic knowledge bases, e.g. Wikidata [19]. Furthermore, we will introduce a
comprehensive data model combining both corpora and all metadata. In addition, we
will compile a list of relevant Austrian politicians, as we want to analyze their impact on
language change.</p>
          </list-item>
          <list-item>
            <p>Network construction and description: A systematic procedure will be defined
to 1) construct different co-occurrence networks for different time intervals (i.e. all
documents within a week, a month, a year, etc.), where nodes represent identified entities
(e.g. politicians) as well as nouns, verbs or adjectives, and edges represent the
co-occurrence of these nodes in a sentence, paragraph or document; and 2) extract basic
properties (e.g. number of nodes/edges, clustering, centrality) to describe the networks.
Together with frequency of occurrence, these properties can be interpreted cognitively
and semantically [9, 10].</p>
          </list-item>
          <list-item>
            <p>Network analyses and comparisons: In-depth analyses of the resulting networks will
be conducted using network analysis and visualization. As the number of networks is
expected to be quite large, an approach will be developed to systematically compare
these networks over time and across the two corpora. To this end, different methods from
network analysis, machine learning and statistical modeling will be tested. This will
make it possible to identify relevant parameters (e.g. network properties) that capture
diachronic developments.</p>
          </list-item>
          <list-item>
            <p>Modeling diachronic developments: Statistical models, including time-series analysis
with generalized additive models and time-series clustering techniques, will be employed to
analyze the co-evolution of parameters (see step 3) in multiple networks.</p>
          </list-item>
          <list-item>
            <p>Interactive web application: A web-based interactive tool will be developed that
retrieves the constructed networks and makes it possible to explore, analyze and visualize them.</p>
          </list-item>
        </list>
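        <p>The network construction and description step can be illustrated with a minimal sketch. It assumes sentence-level co-occurrence, yearly time slices and unweighted density as the descriptive property. The function names <monospace>build_cooccurrence_networks</monospace> and <monospace>describe</monospace> and the toy records are hypothetical and are not taken from the project's implementation, which may rely on dedicated network libraries instead.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_networks(records):
    """Build one co-occurrence network per time slice.

    records: iterable of (time_slice, lemmas) pairs, one pair per sentence.
    Returns {time_slice: Counter}, where each Counter maps an undirected
    edge (a, b) with a < b to the number of sentences containing both lemmas.
    """
    networks = defaultdict(Counter)
    for time_slice, lemmas in records:
        # Each unordered pair of distinct lemmas counts once per sentence.
        for a, b in combinations(sorted(set(lemmas)), 2):
            networks[time_slice][(a, b)] += 1
    return dict(networks)


def describe(edges):
    """Basic properties of one network: node count, edge count, density.

    Only lemmas that take part in at least one edge are counted as nodes.
    """
    nodes = {word for edge in edges for word in edge}
    n, m = len(nodes), len(edges)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return {"nodes": n, "edges": m, "density": density}


# Invented toy data: (year, lemmatized sentence).
records = [
    (2016, ["regierung", "reform", "steuer"]),
    (2016, ["reform", "steuer"]),
    (2017, ["regierung", "reform", "steuer", "klima"]),
]

networks = build_cooccurrence_networks(records)
summary = {year: describe(edges) for year, edges in sorted(networks.items())}
for year, properties in summary.items():
    print(year, properties)
```
]]></preformat>
        <p>Keeping the edge counts as weights leaves room for frequency-based filtering before the properties are computed. Paragraph-level or document-level co-occurrence would only change what is passed in as one record.</p>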
        <p>The technical implementation, which will build on an existing prototype [7], will mainly be
based on Python and appropriate libraries [5, 8, 12, 14], on Neo4j [20] to store the networks, and
on software for big data analysis, e.g. Apache Spark [23], Hadoop YARN [17, 18] and HDFS [17].
Gephi [2] will be used to visualize the graphs, and R for the statistical analyses [13, 22].</p>
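        <p>As a hedged illustration of how networks from consecutive time slices might be compared systematically, the following sketch computes the Jaccard similarity of their edge sets. The choice of Jaccard similarity is an assumption made for this example only; the project leaves the comparison method open, and the function names and toy edge lists are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def normalise(edges):
    """Treat edges as undirected by sorting the two endpoints of each edge."""
    return {tuple(sorted(edge)) for edge in edges}


def jaccard(edges_a, edges_b):
    """Share of edges common to both networks among all edges in either one."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Invented toy edge lists, one per year.
slices = {
    2015: [("a", "b"), ("b", "c")],
    2016: [("b", "a"), ("b", "c"), ("c", "d")],
    2017: [("c", "d"), ("d", "e")],
}

years = sorted(slices)
similarity = {
    (earlier, later): jaccard(normalise(slices[earlier]), normalise(slices[later]))
    for earlier, later in zip(years, years[1:])
}
for pair, value in similarity.items():
    print(pair, round(value, 3))
```
]]></preformat>
        <p>A low value between two adjacent slices indicates that many co-occurrence links appeared or disappeared, which could flag periods worth closer inspection.</p>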
        <p>Figure 2: Workflow in the DYLEN project. Lexical networks are generated from diachronically
layered corpus data. Network properties of lexical items, such as semantic neighborhood density, are
then investigated across time to derive insights into semantic change.</p>
        <p>We expect our project, which has to face specific challenges such as NE recognition for
Austrian German and the analysis of two large-scale diachronic corpora, to contribute to
the understanding of the role that influential speakers and other linguistic factors play in
lexical change by analyzing large amounts of language data. Since we cover both the linguistic
output of influential speakers (ParlAT) and its reflection in media texts (AMC), we can test
whether lexical innovations introduced by these individuals behave differently from other lexical
innovations. This allows us to disentangle social effects from cognitive effects in the process
of lexical spread. For example, by analyzing the evolution of the clustering coefficients of
networks around lexical innovations, we can test whether an increase in frequency is accompanied by
semantic widening effects, a correlation which is expected given results from research on
language change [3, 6, 11].</p>
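        <p>The clustering-coefficient analysis mentioned above can be sketched as follows, assuming the standard local clustering coefficient on an unweighted, undirected network per year. The word <monospace>neuwort</monospace>, the yearly edge lists and the function names are invented for illustration and do not come from the corpora.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def to_adjacency(edges):
    """Turn a list of undirected edges into a {node: set of neighbours} map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def local_clustering(adjacency, word):
    """Share of pairs of neighbours of `word` that are linked to each other."""
    neighbours = adjacency.get(word, set())
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(
        1 for a, b in combinations(neighbours, 2) if b in adjacency.get(a, set())
    )
    return 2 * linked / (k * (k - 1))


# Invented toy networks around a hypothetical lexical innovation "neuwort".
yearly_edges = {
    2015: [("neuwort", "a"), ("neuwort", "b"), ("a", "b")],
    2016: [("neuwort", "a"), ("neuwort", "b"), ("neuwort", "c"), ("a", "b")],
    2017: [("neuwort", "a"), ("neuwort", "b"), ("neuwort", "c"), ("neuwort", "d")],
}

trajectory = {
    year: local_clustering(to_adjacency(edges), "neuwort")
    for year, edges in sorted(yearly_edges.items())
}
for year, coefficient in trajectory.items():
    print(year, round(coefficient, 3))
```
]]></preformat>
        <p>In this toy trajectory the coefficient falls as the word acquires more neighbours that are not linked to each other. One possible reading of such a pattern is a diversifying context, which would be compatible with semantic widening, but this interpretation would need to be checked against frequency data.</p>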
        <p>We also seek to promote network theory in the linguistic research community as a suitable
tool for analyzing and making sense of diachronic language data.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Albert-László</given-names>
            <surname>Barabási</surname>
          </string-name>
          . Network science. Cambridge University Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Mathieu</given-names>
            <surname>Bastian</surname>
          </string-name>
          , Sebastien Heymann, and
          <string-name>
            <given-names>Mathieu</given-names>
            <surname>Jacomy</surname>
          </string-name>
          .
          <article-title>Gephi: an open source software for exploring and manipulating networks</article-title>
          .
          <source>In Third international AAAI conference on weblogs and social media</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Joan</given-names>
            <surname>Bybee</surname>
          </string-name>
          .
          <article-title>Language, usage and cognition</article-title>
          . Cambridge University Press,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Heng</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xinying Chen, and Haitao Liu.
          <article-title>How does language change as a lexical network? An investigation based on written Chinese word co-occurrence networks</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):e0192545,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
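          <string-name>
            <given-names>Gábor</given-names>
            <surname>Csárdi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tamás</given-names>
            <surname>Nepusz</surname>
          </string-name>
          .
          <article-title>The igraph software package for complex network research</article-title>
          .
          <source>InterJournal, Complex Systems</source>
          ,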
          <volume>1695</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Nick C</given-names>
            <surname>Ellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthew Brook</given-names>
            <surname>O'Donnell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ute</given-names>
            <surname>Römer</surname>
          </string-name>
          .
          <article-title>The processing of verb-argument constructions is sensitive to form, function, frequency, contingency and prototypicality</article-title>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Gabriel</given-names>
            <surname>Grill</surname>
          </string-name>
          , Julia Neidhardt, and
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Werthner</surname>
          </string-name>
          .
          <article-title>Network analysis on the Austrian Media Corpus</article-title>
          .
          <source>In VSS 2017 - Vienna Young Scientists Symposium</source>
          , pages
          <fpage>128</fpage>
          -
          <lpage>129</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Aric</given-names>
            <surname>Hagberg</surname>
          </string-name>
          , Pieter Swart, and Daniel A Schult.
          <article-title>Exploring network structure, dynamics, and function using NetworkX</article-title>
          .
          <source>Technical report</source>
          , Los Alamos National Lab. (LANL), USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>William L</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <article-title>Cultural shift or linguistic drift? Comparing two computational measures of semantic change</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , volume
          <volume>2016</volume>
          , pages
          <fpage>2116</fpage>
          -
          <lpage>2121</lpage>
          . NIH Public Access,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>William L</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <article-title>Diachronic word embeddings reveal statistical laws of semantic change</article-title>
          .
          <source>arXiv preprint arXiv:1605.09096</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Hilpert</surname>
          </string-name>
          and
          <string-name>
            <given-names>Florent</given-names>
            <surname>Perek</surname>
          </string-name>
          .
          <article-title>Meaning change in a petri dish: constructions, semantic vector spaces, and motion charts</article-title>
          .
          <source>Linguistics Vanguard</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>350</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Eric</given-names>
            <surname>Jones</surname>
          </string-name>
          , Travis Oliphant, and Pearu Peterson.
          <article-title>SciPy: Open source scientific tools for Python</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Montero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>José A</given-names>
            <surname>Vilar</surname>
          </string-name>
          , et al.
          <article-title>TSclust: An R package for time series clustering</article-title>
          .
          <source>Journal of Statistical Software</source>
          ,
          <volume>62</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Travis E</given-names>
            <surname>Oliphant</surname>
          </string-name>
          .
          <article-title>A guide to NumPy</article-title>
          , volume
          <volume>1</volume>
          .
          Trelgol Publishing USA
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Jutta</given-names>
            <surname>Ransmayr</surname>
          </string-name>
          , Karlheinz Mörth, and
          <string-name>
            <given-names>Matej</given-names>
            <surname>Ďurčo</surname>
          </string-name>
          .
          <article-title>AMC (Austrian Media Corpus) - Korpusbasierte Forschungen zum Österreichischen Deutsch</article-title>
          , pages
          <fpage>27</fpage>
          -
          <lpage>38</lpage>
          . Verlag der Österreichischen Akademie der Wissenschaften,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Eyal</given-names>
            <surname>Sagi</surname>
          </string-name>
          , Stefan Kaufmann, and
          <string-name>
            <given-names>Brady</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>Tracing semantic change with latent semantic analysis</article-title>
          .
          <source>Current methods in historical semantics</source>
          ,
          <volume>73</volume>
          :
          <fpage>161</fpage>
          -
          <lpage>183</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Shvachko</surname>
          </string-name>
          , Hairong Kuang, Sanjay Radia, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Chansler</surname>
          </string-name>
          .
          <article-title>The Hadoop distributed file system</article-title>
          .
          <source>In 2010 IEEE 26th symposium on mass storage systems and technologies (MSST)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Vinod</given-names>
            <surname>Kumar</surname>
          </string-name>
          <string-name>
            <surname>Vavilapalli</surname>
          </string-name>
          , Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah,
          <string-name>
            <given-names>Siddharth</given-names>
            <surname>Seth</surname>
          </string-name>
          , et al.
          <article-title>Apache Hadoop YARN: Yet another resource negotiator</article-title>
          .
          <source>In Proceedings of the 4th annual Symposium on Cloud Computing</source>
          , page 5. ACM
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Denny</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          .
          <article-title>Wikidata: a free collaborative knowledge base</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>J</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <article-title>A programmatic introduction to Neo4j</article-title>
          .
          <source>In Proceedings of the 3rd annual conference on systems, programming, and applications: Software for humanity</source>
          , pages
          <fpage>217</fpage>
          -
          <lpage>218</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Tanja</given-names>
            <surname>Wissik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Pirker</surname>
          </string-name>
          .
          <article-title>ParlAT beta corpus of Austrian parliamentary records</article-title>
          . In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors,
          <source>Proceedings of the LREC2018 Workshop ParlaCLARIN. European Language Resources Association</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Simon N</given-names>
            <surname>Wood</surname>
          </string-name>
          .
          <article-title>Generalized additive models: an introduction with R</article-title>
          . Chapman and Hall/CRC,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
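          <string-name>
            <given-names>Matei</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Stoica</surname>
          </string-name>
          .
          <article-title>Spark: Cluster computing with working sets</article-title>
          .
          <source>HotCloud</source>
          ,
          <volume>10</volume>
          (
          <issue>10-10</issue>
          ):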
          <fpage>95</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>