<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Love-Hate Relationship for Big Data and Linguistics: Present Issues and Future Possibilities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>giorgiomaria.dinunzio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cecilia Poletto</string-name>
          <email>cecilia.poletto@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Linguistic Databases, Geolinguistics, Linguistic Maps.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Via Gradenigo 6/a, 35131 Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Linguistic and Literary Studies, University of Padua</institution>
          ,
          <addr-line>Piazzetta Folena 1, 35137 Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>22</volume>
      <issue>2016</issue>
      <abstract>
        <p>In this paper, we present an overview of some issues related to the use of Big Data in Linguistics that have been debated in workshops and conferences over the last two years. We also consider some requirements that “big” linguistic databases should meet in order to tackle some of these issues; finally, we discuss a set of possible interactive visualization approaches for large datasets that may have an impact on this research field.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The availability of large text collections for the research
field of Linguistics has increased significantly in recent
years. Researchers have turned their attention to large
linguistic databases and corpora of all kinds, to
experimental results, and to many other types of data sources. This
development went hand in hand with an expansion of the
scope of the theories, and with collaborations with, e.g.,
historical linguists, dialectologists, sociolinguists, and
psycholinguists.<fn><label>1</label><p>http://www.meertens.knaw.nl/baddata/?page id=5</p></fn> These linguistic databases open new opportunities
for linguistic research, but they may be problematic in terms
of representativeness and accuracy of the data. Although
the term “big data” is well defined in several empirical
domains, in comparative linguistics the discussion on the size
and properties that data should have in order to be
considered “big” has started only very recently. At present, it
is the object of ongoing discussion and runs in parallel with
the construction of comparative linguistic databases, i.e. of
essential tools for large empirical investigations. Even inside
linguistics, the various domains are not homogeneous, since
phonology started discussing big data and
gathering data on the phonological inventory of many languages
much earlier than other domains, like syntax.</p>
      <p>
        In the last two years, the debate about the impact of Big
Data on this research field has shown two different points of
view: on the one hand, we have researchers who are aware
of the challenges that the creation and use of big data for
[...]<fn><label>2</label><p>https://www.helsinki.fi/en/researchgroups/varieng/d2e-from-data-to-evidence</p></fn><fn><label>3</label><p>http://www.theguardian.com/education/2014/may/07/what-big-data-tells-about-language</p></fn>
The idea is to use these mega-corpora with judgement
because they can be crucial in understanding languages and
language variation. It is important to understand whether
and how these data correlate with data from more carefully
constructed, balanced corpora [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In 2013, an
interesting discussion about the Global Lexicostatistical Database
(GLD)<fn><label>4</label><p>The major goal of this project is to put together and make available, for specialists and the general public alike, the most complete and thoroughly annotated collection of basic wordlists of the world's languages.</p></fn> took place at a conference at the Max-Planck-Institut
on “Historicizing Big Data”.<fn><label>5</label><p>https://www.mpiwg-berlin.mpg.de/en/research/projects/deptiin kaplann reconstruction</p></fn> In particular, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
presented this big data stage as a continuity rather than a
rupture in the research field. She showed that theory is also
built into the database infrastructure of contemporary
linguistics research in the GLD and that, while this collaborative
online database is new, it brings together two century-old,
formerly competing traditions in linguistics. One year later,
in 2014, a panel of the Joint British Academy and
Philological Society<fn><label>6</label><p>http://www.britac.ac.uk/events/2014/Language Linguistics Data Explosion.cfm</p></fn> discussed some fundamental questions about
how the results of traditional scholarship can be integrated
with those derived by digital methods, as well as how we can
measure the impact of such challenges on diverse areas of
language study.
      </p>
      <p>
        On the other hand, other researchers take
a critical position towards the use of Big Data. In 2016,
the main focus of a workshop at the Meertens Institute was
dealing with bad data in linguistic theory.<fn><label>7</label><p>http://www.meertens.knaw.nl/baddata/</p></fn> The
participants debated different classes of problems that big
data may pose for linguistics research, such as incomplete
data, noisy data, one-sided data, and conflicting data. For
example, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed that very large datasets are potentially
very useful for improving our understanding of some
linguistic theoretical analyses, but at the same time there are
important issues to consider that are in part complementary
to the ones already mentioned, i.e. missing data, different
methodologies, different frequencies, and different systems.
Therefore, it becomes crucial to understand to what extent
each of these properties has an impact on any theoretical
conclusions that can be drawn from this data set.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. BIG GEOLINGUISTICS DATABASES</title>
      <p>
        Research in language variation allows linguists to
understand the fundamental principles that underlie language
systems and grammatical changes in time and space.
Geolinguistics is an interdisciplinary field that aims at mapping the
geographical distribution of phenomena which are mainly
due to principled processes of grammatical change [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In
this context, the linguistic atlas has proved to be a vital
tool and product of geolinguistics since the earliest stages
of the field, and it has provided a stage for the
incorporation of modern GIS. In the last two decades, several
large-scale databases of linguistic material of various types have
been developed worldwide [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. One of the basic problems we
have to deal with in geolinguistic databases is related to
the qualitatively and quantitatively different types of data
that have to be classified and retrieved [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A
geolinguistic database with the function of a traditional linguistic
atlas contains a variety of data that go beyond
purely linguistic information (for example, geographic locations,
the type of inquiries adopted to gather the data, the
speakers who have delivered the data, and so on), all of them
being relevant to the geolinguistic analysis.
      </p>
      <p>The interaction between linguistic and geographical
information becomes crucial in situations where a given linguistic
phenomenon is found in the same geographical area as
another, or the two linguistic phenomena are in geographically
disjunct areas, or the area of the first implies the area of
the second. The visualization of the geographic distribution
of these phenomena represents precious information for the
linguist and should be immediately retrievable from the
interface. Another important facet of geolinguistic research
concerns the test subjects used to gather the data in the
field. They might provide input to investigate how language
changes over time in a given geographical space. Since the
data to be combined are of different origin and are to be
classified according to different parameters, careful planning
of the structure of the database as well as of the interactive
visualization is necessary to develop a system which has the
properties of durability and wide usage among researchers
that justify such an expensive enterprise.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 New insights through linguistic maps</title>
      <sec id="sec-3-1">
        <p>The problem concerning the tools to filter incomplete,
noisy or more generally “bad data” clearly depends on the
type of research aims. If the aim is a qualitative and not
a quantitative investigation, we definitely need a new set
of procedures to compare large amounts of data. The degree
of fine-grained distinctions necessary in a qualitative
enterprise is clearly much higher than the one generally required
in quantitative research, and it cannot be provided by the
standard statistical approaches already used in language
acquisition and psycholinguistics.</p>
        <p>
          Big linguistic enterprises, like databases, atlases and all
sorts of corpora, always contain a certain amount of “noise”.
They are by definition always both incomplete and
inaccurate when the linguistic hypothesis we want to test is
already very detailed and precise. On the other hand, data
are also incomplete even when we adopt a qualitative
procedure which investigates a smaller subset of data. For
instance, Buchstaller and Corrigan [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] show that the results
we can obtain also depend on the type of task the test
subjects have been asked to perform, and that the best policy is
not only to control for all possible factors intervening in the
experiment we are performing, but also to combine different
tasks to single out stable linguistic generalizations that are
not prone to be simple task effects. This means that the
amount of data we are considering is irrelevant with respect
to the problem of how complete and reliable our data set
can be. Since data are always inevitably incomplete, what
we have to do is develop new strategies to compensate for
the inaccuracy and incompleteness of the data. In this
respect, it can be useful to consider some strategies that can
help us to find interesting theoretical clues even in data
that provide by definition a coarse-grained picture of the
linguistic reality. If it is true that big data is never precise
enough for a very detailed hypothesis, we can still try to
exploit the peculiarity of a blurred but very big image to single
out the general outlines of the linguistic panorama, which
would otherwise remain undiscovered. In this way, big
data mining can nicely complement our introspective type of
empirical evidence in the spirit of Buchstaller and Corrigan,
who suggest the combination of different test strategies. The
reason why big data are always too “noisy” is that we do not
treat them in the right way, i.e. the questions we ask are not
adapted to the type of evidence we have. The unavoidable
conclusion is that new strategies to represent and exploit
the data we have at our disposal have to be developed. The
general gist of the solution to the problem we will present is
the following: up to now we have only used big corpora to
look at the presence versus absence of a given phenomenon
X in a given language L and related it to other phenomena.
        </p>
        <p>An innovative way to think about big data and tailor our
questions to the linguistic evidence provided by big data is
to consider the type of geographical variation itself as a clue
indicating different natural classes of linguistic phenomena.
It is possible to single out at least three distributional
patterns and determine to which type of phenomenon each type
of variation is related.</p>
        <p>2.1.1 “Classic method”</p>
        <p>The first, “classic” method to be taken into account and
further developed with new technical representation tools
is the one adopted by geolinguists since the beginning of
the discipline, i.e. comparing the geographical
distribution of different phenomena inside a genetically
homogeneous linguistic area and considering the theoretical
import of different distributional patterns. In particular, there
are three clearly identifiable distributional patterns that can
provide us with new insights into the linguistic system when
we consider the distribution of two phenomena. The two
phenomena can a) completely overlap, b) be in
complementary distribution or c) one can be included in the other on a
linguistic map representing both of them. When they
completely overlap, this can be interpreted as the two
phenomena depending on a single abstract property. When
they are in complementary distribution, this means that
they occupy the same space inside the linguistic system, i.e.
they satisfy the same requirement. Complementary
distribution thus means that two phenomena are alternative
checkers of the same linguistic property, so that they exclude each
other. When one is included in the other, this can be taken
as an indication that the wider phenomenon depends
on a setting that the smaller one also shares, while the smaller one has
additional requirements. In other terms, this could mean
that the phenomenon that is more widely represented is a
necessary but not sufficient condition for the occurrence of
the second. Hence, the pure geographical distribution of
co-varying phenomena can provide us with interesting clues to
interpret the data.</p>
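        <p>The three patterns above can be read as relations between sets of survey points. The following is a minimal Python sketch under that assumption, not the procedure used in any existing atlas: the function name, the optional area argument and the place names in the usage note are hypothetical, and each phenomenon is assumed to be recorded simply as the set of locations where it is attested. Since real atlas data are noisy, a practical version would probably need a tolerance threshold rather than exact set comparison.</p>
        <preformat>
```python
def classify_distribution(points_a, points_b, area=None):
    """Classify the joint distribution of two phenomena on a linguistic map.

    points_a, points_b: iterables of survey-point identifiers where each
    phenomenon is attested. area (optional, hypothetical parameter): all
    survey points of the investigated region.
    """
    a, b = set(points_a), set(points_b)
    if not a or not b:
        return "undefined"  # one of the phenomena is unattested
    if a == b:
        return "complete overlap"
    if a.isdisjoint(b):
        # Complementary in the strict sense only if the two areas
        # together cover the whole investigated region.
        if area is not None and a.union(b) == set(area):
            return "complementary"
        return "disjoint"
    if a.issubset(b):
        return "A included in B"
    if b.issubset(a):
        return "B included in A"
    return "partial overlap"
```
        </preformat>
        <p>For instance, classify_distribution({"Padova", "Verona"}, {"Padova"}) returns "B included in A", which under the reading given above would suggest that phenomenon A is a necessary but not sufficient condition for phenomenon B.</p>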
        <p>Although it was never really formalized, this type of
methodology has been used by traditional dialectologists and is still
used today in formal frameworks. It can only be adopted
when we are comparing two phenomena and trying to
establish whether they are intrinsically related or not. Still, the
distribution we find could be due to chance alone, but the
probability that we are dealing only with chance decreases
progressively as our sample of languages grows.
This means that big data are a valuable source for linguistic
investigation, but they had better be really big.</p>
        <p>2.1.2 “Leopard spots”</p>
        <p>Another type of geographical distribution can
provide us with interesting observations that we would not be
able to make on the basis of a detailed qualitative
investigation, or even by analyzing big data with other devices
that are not as immediately visually interpretable as maps
are. This type of distribution is called “leopard spots” by traditional
dialectologists because the phenomenon under
study occurs with an apparently random
distribution which nevertheless covers the whole area taken into
account. This type of distribution is generally found when we
deal with a phenomenon that is only possible when a specific
complex constellation of factors is instantiated in the same
language. Since the phenomenon depends on several factors
that do not depend on each other in any sense (either
implication or exclusion), we find it only where all the factors
cluster together, and this can happen at various points of the
area. The study of this type of variation can lead us to find
out exactly what the complex prerequisites are that lead to
the occurrence of the phenomenon under study. This
special type of distribution has been discovered in all areas that
include closely genetically related languages, while it is not
found on linguistic maps treating languages which are very
different, for instance in linguistic typology. Since it is
typical of microvariation and not macrovariation, it constitutes
a very powerful tool to pin down phenomena that depend
on complex clusters of often unrelated properties.</p>
        <p>2.1.3 “Genetically” related languages</p>
        <p>A third type of variational data that can be exploited
once it is set on a map has to do with a very simple
observation that has never been used up to now.
It is possible to extract syntactic
and semantic observations from lexical data simply by
looking at the type and number of possible lexical forms for the
same element, starting from a simple but rather strong
hypothesis: the index of lexical variation of a functional word
(auxiliaries, prepositions, pronouns, etc.) within a
genetically related set of languages co-varies with the semantic and
syntactic complexity of the item itself. Therefore, a simple
count of the possible forms across the area considered gives
us information about the semantic/syntactic complexity of
the item we are investigating. This evidently only works
within a domain where all languages are genetically related,
i.e. where the original etymological set of possible elements
is constrained by the fact that all languages considered come
from the same source language (for instance, all Romance
languages share their major lexical endowment, which comes
from Latin). This means that a rather simple count of the
possible lexical etymological sources used in a set of
genetically related languages gives us very precise indications of
the featural primitive components the functional element is
made up of. This is a tool that has never been tried out up
to now precisely because no one has ever thought of using
the massive amount of data we now have at our disposal in
this way.</p>
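        <p>The count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (variety, item, etymon), the function name and the toy data are all hypothetical placeholders, and no claim is made about how the index should be normalized for areas with different numbers of survey points.</p>
        <preformat>
```python
from collections import defaultdict


def lexical_variation_index(records):
    """Count distinct etymological sources per functional item.

    records: iterable of (variety, item, etymon) triples; this record
    layout is an assumption made for the example.
    """
    sources_by_item = defaultdict(set)
    for _variety, item, etymon in records:
        sources_by_item[item].add(etymon)
    return {item: len(sources) for item, sources in sources_by_item.items()}


# Invented placeholder data: three varieties, two functional items.
toy_records = [
    ("variety_1", "auxiliary", "etymon_A"),
    ("variety_2", "auxiliary", "etymon_B"),
    ("variety_3", "auxiliary", "etymon_A"),
    ("variety_1", "preposition", "etymon_C"),
    ("variety_2", "preposition", "etymon_C"),
]
print(lexical_variation_index(toy_records))
```
        </preformat>
        <p>On the toy data the auxiliary has two distinct sources and the preposition one, so under the hypothesis stated above the auxiliary would be expected to be the semantically and syntactically more complex item.</p>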
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. DISCUSSION</title>
      <p>
        This new way to think about big data and linguistics
requires some thought at different levels: the choice of the
type of visual interaction; an efficient data structure to store,
organize and retrieve linguistic data; an evaluation of the
implemented system. The exploration of new visualization and
interaction systems with a geographic map was presented
in the ASIt (Syntactic Atlas of Italy) project [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">3, 1, 2</xref>
        ]. In
Figure 1 we show an interactive interface used to search a
linguistic database with sentence tags or POS tags. The
results on this map can be explored and studied by a linguist
in the way described in Section 2.1.1. One of our
proposals in this paper is to extend the ‘classic’ view of linguistic
data described in Section 2.1.1 to more complex interactions
with the map, like overlapping two or more maps with
different levels of transparency in order, for example, to highlight the
‘leopard spots’ described in Section 2.1.2.
      </p>
      <p>
        The second point is the study of the ‘right’ model and data
structure for large linguistic datasets. On the one hand, we
have the problem of designing systems that should give
access to digital objects that may be stored in different
institutions, i.e. archives and museums; therefore, the
interoperability among the Digital Library Systems which manage the
digital resources of these institutions is a key concern [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For
this purpose, the working group on open data in linguistics
has recently promoted the idea and definition of open data
in Linguistics and, in particular, the use of Linked Open
Data (LOD) to implement it. The LOD paradigm refers to a
set of best practices for publishing data on the Web<fn><label>8</label><p>http://www.w3.org/DesignIssues/LinkedData.html</p></fn> and it
is based on a standardized data model. In the ASIt project,
we have proposed a LOD approach for increasing the level
of interoperability of geolinguistic applications and the reuse
of the data. In particular, we defined an extensible
ontology for geolinguistic resources based on the common ground
defined by current European linguistic projects and we
applied this ontology on top of a real linguistic dataset [
        <xref ref-type="bibr" rid="ref11 ref7">11, 7</xref>
        ].
Nevertheless, a study of the efficiency of the LOD approach
on very large datasets is still to be completed.
      </p>
      <sec id="sec-4-1">
        <p>
          Last but not least, since the system we are trying to
develop requires interaction, interoperability and efficiency, we
need a validation of the system. In the CULTURA project,
for example, the IPSA system was evaluated by both expert
researchers and students in order to collect all the different
points of view of the users of a digital library, in its
transition from an isolated archive to an archive fully immersed
in a new adaptive environment [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSIONS</title>
      <p>The traditional geolinguistic tool of linguistic maps can
provide new and important indications to theoretically
interpret linguistic phenomena. We described the
fundamental requirements that are needed to adapt classical linguistic
maps in order to carry out more sophisticated analyses.</p>
      <p>This is only possible when two conditions are met: the
set of languages investigated is genetically related and we
are really dealing with big data. These methods can also
compensate for the inaccuracy of the data, since the data do
not need to be very detailed for us to get an idea of
the type of distributional pattern we are dealing with.
Evidently, this type of methodology is not intended as a
substitute for more traditional methods, but complements other
research methodologies in an integrated view of research.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We are grateful to Maristella Agosti for the fruitful
discussions and her useful comments on an earlier version of this
paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Beninca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dussin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pescarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabanus</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomaselli</surname>
          </string-name>
          .
          <article-title>ASIt: A grammatical survey of Italian dialects and Cimbrian: Fieldwork, data management, and linguistic analysis</article-title>
          .
          In
          <source>Digital Libraries and Archives - 7th Italian Research Conference, IRCDL 2011, Pisa, Italy, January 20-21, 2011. Revised Papers</source>
          , pages
          <fpage>100</fpage>
          –
          <lpage>103</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dussin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabanus</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomaselli</surname>
          </string-name>
          .
          <article-title>A curated database for linguistic research: The test case of Cimbrian varieties</article-title>
          .
          In
          <source>Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012</source>
          , pages
          <fpage>2230</fpage>
          –
          <lpage>2236</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Beninca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miotto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pescarini</surname>
          </string-name>
          .
          <article-title>A digital library effort to support the building of grammatical resources for Italian dialects</article-title>
          .
          In
          <source>Digital Libraries - 6th Italian Research Conference, IRCDL 2010, Padua, Italy, January 28-29, 2010. Revised Selected Papers</source>
          , pages
          <fpage>89</fpage>
          –
          <lpage>100</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Poletto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rinke</surname>
          </string-name>
          .
          <article-title>Designing A Long Lasting Linguistic Project: The Case Study of ASIt</article-title>
          .
          In
          <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC-2016), Portoroz, Slovenia, May 23-28, 2016</source>
          . In press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <article-title>Digital library interoperability at high level of abstraction</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>55</volume>
          :
          <fpage>129</fpage>
          –
          <lpage>146</lpage>
          , 2
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Manfioletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Orio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ponchia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <article-title>The Evaluation Approach of IPSA@CULTURA</article-title>
          .
          In
          <source>Bridging Between Cultural Heritage Institutions: 9th Italian Research Conference, IRCDL 2013, Rome, Italy, January 31–February 1, 2013, Revised Selected Papers</source>
          , pages
          <fpage>147</fpage>
          –
          <lpage>152</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <article-title>A curated and evolving linguistic linked dataset</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          ):
          <fpage>265</fpage>
          –
          <lpage>270</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Buchstaller</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Corrigan</surname>
          </string-name>
          .
          <article-title>Making intuitions work: Testing instruments for measuring dialect syntax</article-title>
          . In W. Maguire and A. McMahon, editors,
          <source>Analysing Variation in English</source>
          , pages
          <fpage>30</fpage>
          –
          <lpage>48</lpage>
          . Cambridge University Press,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>van Craenenbroeck</surname>
          </string-name>
          .
          <article-title>Handle your verb clusters with care. Dealing with bad data in linguistic theory</article-title>
          . Amsterdam, the Netherlands, March
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Davies</surname>
          </string-name>
          .
          <article-title>Why size alone is not enough: the importance of historical, genre-based, and dialectal variation in language</article-title>
          .
          <source>D2E Conference, "From data to evidence in English language research: Big data, rich data, uncharted data"</source>
          , Helsinki, October
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <article-title>A linked open data approach for geolinguistics applications</article-title>
          .
          <source>IJMSO</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>29</fpage>
          –
          <lpage>41</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoch</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Hayes</surname>
          </string-name>
          .
          <article-title>Geolinguistics: The Incorporation of Geographic Information Systems and Science</article-title>
          .
          <source>The Geographical Bulletin</source>
          ,
          <volume>51</volume>
          (
          <issue>1</issue>
          ):
          <fpage>23</fpage>
          –
          <lpage>36</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          .
          <article-title>The Global Lexicostatistical Database: Integrating Traditions in Long-Range Historical Linguistics</article-title>
          .
          <source>"Historicizing Big Data" Conference</source>
          , Max-Planck-Institut, October 31–November 2,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>