=Paper=
{{Paper
|id=Vol-2006/paper065
|storemode=property
|title=A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing
|pdfUrl=https://ceur-ws.org/Vol-2006/paper065.pdf
|volume=Vol-2006
|authors=Rachele Sprugnoli,Sara Tonelli,Giovanni Moretti,Stefano Menini
|dblpUrl=https://dblp.org/rec/conf/clic-it/SprugnoliTMM17
}}
==A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing==
Rachele Sprugnoli (1, 2), Sara Tonelli (1), Giovanni Moretti (1), Stefano Menini (1, 2)

(1) Fondazione Bruno Kessler, Trento
(2) Università di Trento

{sprugnoli,satonelli,moretti,menini}@fbk.eu
Abstract

English. Code-mixing is the alternation between two or more languages in the same text. This phenomenon is very relevant in the travel domain, since it can provide new insight into the way foreign cultures are perceived and described to readers. In this paper, we analyse English-Italian code-mixing in historical English travel writings about Italy. We retrain and compare two existing systems for the automatic detection of code-mixing, and analyse the semantic categories most often connected to Italian. In addition, we release the domain corpus used in our experiments and the output of the extraction.

Italiano. Il code-mixing è l’alternanza di lingue diverse nello stesso testo. Questo fenomeno è particolarmente importante nel dominio dei viaggi, poiché aiuta a comprendere meglio il modo in cui vengono percepite e descritte culture diverse da quella dell’autore. In questo lavoro, analizziamo il code-mixing tra inglese ed italiano nei testi di viaggio scritti in inglese e aventi come soggetto l’Italia. A questo scopo confrontiamo due sistemi esistenti per il riconoscimento automatico del code-mixing dopo averli ri-addestrati e analizziamo le categorie semantiche connesse alle parole/espressioni italiane. Inoltre, rilasciamo il corpus e il risultato dell’estrazione.

1 Introduction

Code-mixing is the alternation between two or more languages that can occur between sentences (inter-sentential), within the same utterance (intra-sentential), or even inside a single token (mixing of morphemes). This phenomenon has been widely studied from the linguistic, psycholinguistic, and sociolinguistic points of view (Gardner-Chloros, 1995; Grosjean, 1995; Ho, 2007), but there is no consensus on the terminology to be adopted. In this paper, code-mixing is used as an umbrella term to indicate a manifestation of language contact subsuming other expressions such as code-switching, languaging, borrowing, and language crossing (Muysken, 2000).

Code-mixing characterizes the communication of post-colonial, migrant and multilingual communities (Papalexakis et al., 2014; Frey et al., 2016) and it emerges in different types of documents, for example parliamentary debates, interviews and social media posts (Carpuat, 2014; Das and Gambäck, 2015; Piergallini et al., 2016). Travel writings (e.g. guidebooks, travelogues, diaries, blogs, travel articles in magazines) are also affected by this phenomenon, which has been studied mainly by manually inspecting small corpora of contemporary tourism discourse (Dann, 1996). Even if code-mixing occurs in less than 1% of the cases (Cappelli, 2013), it has several important functions in the travel domain: it gives a “linguistic sense of place” (Cortese and Hymes, 2001), it adds authenticity to a narration, it provides translations of culture-specific words, and it is a means of defining social identity (“us” tourists versus “they” locals) (Jaworski et al., 2003).

In this work, we investigate the phenomenon of code-mixing in travel writings but, unlike previous work, we shift the focus of analysis from contemporary to historical data and from manual to automatic information extraction. As for the first point, we present a corpus of more than 3.5 million words of English travel writings published between the end of the XIX century and the beginning of the XX century, which we have retrieved from freely available sources and release in a cleaned format. As for automatic information extraction, we retrain two state-of-the-art tools to identify English-Italian code-mixing and evaluate them on a sample of our dataset. We then run the best system on the whole dataset and perform a semi-automatic refinement of the automatic annotation. The corpus, the training and test data and the outcome of the extraction are available online [1].

[1] https://dh.fbk.eu/technologies/code-mixing

2 Related Work

Automatic language identification of monolingual documents has a long tradition in Natural Language Processing (Hughes et al., 2006; Lui and Baldwin, 2012). More recently, a new research topic has emerged: the detection of language at word level in code-mixed texts. Dedicated workshops and evaluation exercises have been organized on this task, dealing with different pairs of languages and with social media data (Choudhury et al., 2014; Solorio et al., 2014; Molina et al., 2016). The most common approach among the proposed systems is based on Conditional Random Fields (CRFs), but there are also implementations based on Logistic Regression and deep learning algorithms.

To the best of our knowledge, there is no previous work on the automatic identification of code-mixing in travel writing. Cappelli (2013) and Gandin (2014) have studied the phenomenon, but they have mainly used standard corpus linguistics tools, i.e. WordSmith (Scott, 2008), to analyse language contact in English guidebooks, travel blogs written by expatriates and travel articles from 2002-2012.

3 Corpus Description

Unlike the works cited in the previous Section, we focus on historical texts. To this end, we collect from Project Gutenberg [2] a corpus of travel writings about Italy written by native English-speaking authors and published between the unification of the country and the beginning of the 1930s. We choose this period because in the second half of the XIX century the tradition of the Grand Tour declined and leisure-oriented travel emerged. This radical transformation was enabled by technological, economic and sociological factors, such as the development of steam-powered ships and of the railway network, the growth of the Anglo-American economy and a greater emancipation of women, with more female travelers (Schriber, 1995). Moreover, after unification, new routes to Southern Italy and the islands were opened, so that travelers’ attention was no longer limited to the classic destinations in Northern and Central Italy, such as Venice, Florence and Rome (Ouditt and Polezzi, 2012).

The corpus consists of 57 texts [3], divided into travel narratives (reports, diaries, collections of letters) and guidebooks, for a total of 3,630,781 tokens. We distinguish between these two types of text following a standard classification of documents in the travel domain. However, the distinction was not as clear-cut in the period we take into account as it is now, since reports on personal travel experiences were often mixed with practical recommendations and long disquisitions on art and history. Therefore, we adopt as a rule of thumb the distinction suggested by Santulli (2007): travel narratives are those told in the first person, while guidebooks are written in impersonal form.

The authors of the selected texts are of different nationalities (UK, US, Ireland, Australia) and are both male and female. Some books dwell on specific cities or regions, others cover different parts of Italy or even several countries: in the latter case we extracted only the chapters related to Italy. Although we made an effort to build a diverse, well-balanced corpus in terms of content and of authors’ gender and nationality, this was only partially possible because of the limited availability of online travel books whose text is freely available and free of OCR errors. The distribution of tokens according to the year of publication and type of text is shown in Fig. 1. Details about the authors are given in a spreadsheet provided together with the corpus.

[2] https://www.gutenberg.org/
[3] Thirty of these texts are also available in TEI-XML format on the website https://sites.google.com/view/travelwritingsonitaly.

4 Code-Mixing Detection

In this Section we describe the experiments on code-mixing, comparing the performance of two available systems in different configurations. We also detail the post-processing step introduced to refine the output of the best performing system.
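As a rough orientation for the experiments in this Section, word-level code-mixing detection amounts to assigning a language label to every token. The sketch below is not one of the two systems compared in this paper: it is a toy scorer that only borrows the idea of character n-grams with n from 1 to 5, which the paper attributes to langid (King and Abney, 2013). The function names, the add-one smoothing, the boundary markers and the two tiny training sentences are all assumptions made for illustration; the labels "e" and "i" follow the annotation scheme used in the paper.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=1, n_max=5):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def train_profile(text):
    """Count character n-grams over a monolingual text sample."""
    counts = Counter()
    for word in text.split():
        counts.update(char_ngrams(word))
    return counts


def log_score(word, profile, vocab_size):
    """Average add-one smoothed log-probability of the word's n-grams."""
    total = sum(profile.values())
    grams = char_ngrams(word)
    return sum(
        math.log((profile[g] + 1) / (total + vocab_size)) for g in grams
    ) / len(grams)


def tag_tokens(tokens, profiles):
    """Label each token with the language whose profile scores it highest."""
    vocab = set()
    for profile in profiles.values():
        vocab.update(profile)
    vocab_size = len(vocab)
    return [
        (tok, max(profiles, key=lambda lang: log_score(tok, profiles[lang], vocab_size)))
        for tok in tokens
    ]


# Toy monolingual samples, invented for this illustration (not corpus data).
profiles = {
    "e": train_profile("the road to the old town was long and the weather was fine"),
    "i": train_profile("la strada per la bella pianura era lunga e il tempo era bello"),
}
tagged = tag_tokens("a little bit of bella pianura".split(), profiles)
print(tagged)
```

With such small samples, short function words can easily receive the wrong label, which is one reason why sequence models and much larger monolingual training data are used in practice.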
Figure 1: Distribution of tokens per year of publication and sub-genre.

4.1 Experimental Setting

In order to automatically extract Italian words, expressions and sentences from the corpus described in Section 3, we train and test two systems whose source code is available on the web. The first one (henceforth, langid) is based on character n-grams (n = 1 to 5) and adopts a weakly supervised approach, i.e. the training data are monolingual texts of a few thousand tokens (King and Abney, 2013). This system includes four classification algorithms: Conditional Random Field (CRF), Hidden Markov Model (HMM) and Maximum Entropy Model with and without generalized expectation criteria (MaxEnt-GE and MaxEnt). langid has been successfully evaluated on documents containing English texts mixed with 30 different minority languages such as Zulu and Chippewa [4].

For our experiments, we retrain langid using a collection of about 300,000 tokens taken from monolingual Italian and English books of different genres, published in the same period as our corpus [5].

The second system (henceforth, CodeSwitching) was developed to detect languages in texts mixing Latin and Middle English (Schulz and Keller, 2016). It implements a CRF classifier with features generated from TreeTagger models and word lists of both languages [6]. Unlike langid, which classifies words as belonging to one language rather than the other, this system performs a fine-grained annotation by distinguishing five classes (see below). Since this system is fully supervised, we create a training set by manually annotating 3,900 tokens from 4 samples extracted from our corpus, a size in line with the training data used in the original paper. The training data were annotated with 5 different classes: Italian tokens (i), English tokens (e), punctuation (p), named entities (NEs) (n), and ambiguous tokens that belong to the dictionaries of both languages (a).

Both langid and CodeSwitching were evaluated on the same test set, i.e. two samples of text (one from a travel narrative and one from a guidebook) of 1,623 tokens. The test set was annotated by assigning to each token a label for English or Italian, as required by langid, and also by marking punctuation, NEs and ambiguous tokens, following the CodeSwitching scheme. Since the performance of CodeSwitching is sensitive to the length of the input file, we split the test set into batches of 40 sentences, replicating the experimental setting presented in Schulz and Keller (2016).

[4] http://www-personal.umich.edu/~benking/resources/langid_release.tar.gz
[5] For Italian: “Le Avventure di Pinocchio” by C. Collodi, “Una donna” by S. Aleramo, “Il Valdarno da Firenze al mare” by G. Carocci, “La vita operosa” by M. Bontempelli, “Dopo il divorzio” by G. Deledda, “Novelle umoristiche” by A. Albertazzi, “Lezioni e Racconti per i bambini” by I. Baccini. For English: “The Adventures of Tom Sawyer” by M. Twain, “Pioneers of the Old Southwest” by C. L. Skinner, “The Happy Prince, and Other Tales” by O. Wilde, “Vanished Arizona” by M. Summerhayes, “The Tale of Peter Rabbit” by B. Potter, “The Strange Case of Dr. Jekyll and Mr. Hyde” by R. L. Stevenson.
[6] https://github.com/sarschu/CodeSwitching

4.2 Evaluation

Table 1 presents the performance of langid on the test set: contrary to the results achieved by King and Abney (2013), HMM (not CRF) proved to be the best approach. This is likely due to the greater sparseness of the code-mixing phenomenon in our dataset with respect to the original corpus, where languages other than English cover 56% of the overall number of tokens.

Table 1: Results of the evaluation of the retrained langid system in terms of precision (P), recall (R), and F-measure (F).

Metric  CRF   HMM   MaxEnt  MaxEnt-GE
P       1     0.89  0.59    0.82
R       0.51  0.92  0.90    0.47
F       0.67  0.90  0.71    0.60

Table 2 reports Precision, Recall and F-measure of the retrained CodeSwitching system. Even if the overall performance is slightly better than the one obtained with HMM in langid, the scores for the detection of Italian tokens (i) are lower (0.82 versus 0.90 in terms of F-measure). Punctuation (p) and ambiguous tokens (a) are generally detected with good performance, while NEs (n) represent the most challenging class. Given that we are mainly interested in recognising English and Italian terms, and that on this task langid performs better, we run this tool on the whole corpus.

Table 2: Results of the evaluation of the retrained CodeSwitching system in terms of precision (P), recall (R), and F-measure (F) for each class and the macro-average of all classes.

Metric  i     e     a     n     p     ALL
P       0.83  0.98  0.98  0.85  0.98  0.92
R       0.80  0.99  0.90  0.85  0.96  0.90
F       0.82  0.99  0.94  0.85  0.97  0.91

4.3 Post-processing

In order to refine the output of langid (see Figure 2), we perform three post-processing steps. First of all, we check whether tokens tagged as Italian are included in Morph-it, an Italian lexicon of inflected forms (Zanchetta and Baroni, 2005): in this way we are able to detect false positives. Then, we run the Polyglot Python module on the corpus to find out whether the processed documents contain languages other than English and Italian [7]. Indeed, 27 books turn out to have a high probability of also including text written in Latin, French, German or Greek. These books are likely to be problematic given that langid recognizes only English and Italian. The information obtained in these two steps is then used to manually check the outcome of the langid extraction and correct it semi-automatically. Furthermore, we employ the USAS Italian semantic tagger (Piao et al., 2015) to obtain a categorization of the terms tagged as Italian. Based on the 21 semantic classes recognised by USAS, we are able to understand in which cases and why writers switched their narration from English to Italian.

Figure 2: Examples of langid output.

[7] http://polyglot.readthedocs.io/en/latest/Installation.html

5 Discussion

The classification performed with the USAS tagger shows that Italian is adopted to express concepts covered by 20 semantic classes, both in guidebooks and in travel narratives. Only one USAS class, the one related to “Science and technology”, is not found in the corpus. Table 3 shows frequency and examples for each detected class. As in contemporary travel writings (Francesconi, 2007), food is well represented: traditional dishes, drinks and products (e.g. polenta, Chianti, mortadella) appear together with fruits and vegetables (e.g. mandarini, finocchio) and also eating establishments (e.g. osteria, trattoria, locanda). The attention to Italian art and architecture manifests itself through the use of many specialized terms (cassettoni, gotico, giallo antico). The semantic areas of emotions and psychological processes are not recorded in previous work on contemporary texts but are frequent, especially in travel reports (e.g. addolorata, trionfo, simpatico). As for NEs, city names reveal an increasing interest in towns in the Central regions (for example, Perugia has a high frequency of occurrence in both genres). Moreover, following the unification of Italy, travellers discovered several locations in the South (e.g. Ragusa, Catanzaro). Among the most mentioned people, there are representatives of past Italian politics (e.g. Lorenzo and Cosimo de Medici), artists (e.g. Giotto, Dante) and religious figures (e.g. Madonna, San Michele).

Table 3: Italian word frequency for each semantic class

GUIDEBOOKS
SEMANTIC CLASS          #       EXAMPLES
names & grammar         29,927  Pisa
architecture            3,070   villa
movement                2,294   automobile
social elements         1,590   trinità
materials & objects     717     fontana
environment             713     campagna
general/abstract terms  580     essere
measurement             340     alto
arts & crafts           231     stucco
time                    225     nuovo
life                    222     agnello
body                    211     cintola
public domain           205     podestà
psyche                  198     volere
food & farming          162     maccaroni
entertainment           141     giuoco
emotion                 137     amore
communication           131     motto
money & commerce        127     soldo
education               22      università

TRAVEL NARRATIVES
SEMANTIC CLASS          #       EXAMPLES
names & grammar         28,694  Donatello
social elements         3,134   popolo
architecture            3,065   palazzo
environment             1,311   lago
movement                1,207   vetturino
materials & objects     965     rosso
general/abstract terms  943     fare
food & farming          665     trattoria
life                    479     fiore
measurement             464     grande
time                    379     primavera
body                    350     braccio
psyche                  330     vedere
entertainment           319     marionetta
money & commerce        269     dazio
communication           268     dire
public domain           260     carabiniere
arts & crafts           206     arte
emotion                 176     evviva
education               135     maestro

In many cases, the use of Italian is not limited to single words or multi-token expressions (e.g. appartamento signorile), but longer utterances are reported. Texts of both genres contain proverbs (e.g. chi tardi arriva mal alloggia) and citations, not only from the canon of Italian literature, such as Leopardi’s poems, but also from the popular tradition, such as Tuscan songs (O rosa O rosa O rosa gentillina). The main difference between travel narratives and guidebooks is the greater presence in the former of dialogues or expressions heard by the author during his/her stay in Italy (voi siete un
cattivo; e voi siete bella).

6 Conclusions and Future Work

In this work, we presented the first automated analysis of code-mixing in historical travel writings. In particular, we focused on English documents about Italy, and we compared guidebooks and travel narratives, analysing the semantic categories most closely related to code-mixing.

In the future, we plan to investigate how code-mixing phenomena relate to content types in travel writings (Sprugnoli et al., 2017). In addition, we plan to implement an algorithm to automatically link code-mixed quotations to their original source text. Finally, we would like to extend our experiments to recognise code-mixing in multiple languages, and compare the semantic domains specific to each language.

References

Gloria Cappelli. 2013. Travelling words: Languaging in English tourism discourse. Travels and translations, pages 353–374.

Marine Carpuat. 2014. Mixed-language and code-switching in the Canadian Hansard. In Proceedings of EMNLP 2014, page 107.

Monojit Choudhury, Gokul Chittaranjan, Parth Gupta, and Amitava Das. 2014. Overview of FIRE 2014 track on transliterated search. In Proceedings of FIRE.

Giuseppina Cortese and Dell Hymes. 2001. Languaging in and across human groups. Perspectives on difference and asymmetry. Textus. English Studies in Italy, 14(2).

Graham MS Dann. 1996. The language of tourism: a sociolinguistic perspective. Cab International.

Amitava Das and Björn Gambäck. 2015. Code-mixing in social media text: the last language identification frontier? Revue TAL, pages 41–64.

Sabrina Francesconi. 2007. Italian borrowings from the semantic fields of food and drink in English tourism texts. The Languages of Tourism: turismo e mediazione, Milano: Unicopli, page 129.

Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W Stemle. 2016. The DiDi Corpus of South Tyrolean CMC Data: A Multilingual Corpus of Facebook Texts. In Proceedings of CLIC-it.

Stefania Gandin. 2014. Investigating loan words and expressions in tourism discourse: A corpus driven analysis on the bbctravel corpus. European Scientific Journal, 10(2).

Penelope Gardner-Chloros. 1995. Code-switching in community, regional and national repertoires: the myth of the discreteness of linguistic systems. One speaker, two languages: Cross-disciplinary perspectives on code-switching, pages 68–89.

François Grosjean. 1995. A psycholinguistic approach to code-switching: The recognition of guest words by bilinguals. One speaker, two languages: Cross-disciplinary perspectives on code-switching, pages 259–275.

Judy Woon Yee Ho. 2007. Code-mixing: Linguistic form and socio-cultural meaning. The International Journal of Language Society and Culture, 21.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In Proc. International Conference on Language Resources and Evaluation, pages 485–488.

Adam Jaworski, Crispin Thurlow, Sarah Lawson, and Virpi Ylänne-McEwen. 2003. The uses and representations of local languages in tourist destinations: A view from British TV holiday programmes. Language Awareness, 12(1):5–29.

Ben King and Steven P Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of HLT-NAACL, pages 1110–1119.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 system demonstrations, pages 25–30. Association for Computational Linguistics.

Giovanni Molina, Nicolas Rey-Villamizar, Thamar Solorio, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, and Mona Diab. 2016. Overview for the second shared task on language identification in code-switched data. In Proceedings of EMNLP 2016, pages 40–49.

Pieter Muysken. 2000. Bilingual speech: A typology of code-mixing, volume 11. Cambridge University Press.

Sharon Ouditt and Loredana Polezzi. 2012. Introduction: Italy as place and space. Studies in Travel Writing, 16(2):97–105.

Evangelos Papalexakis, Dong-Phuong Nguyen, and A Seza Doğruöz. 2014. Predicting code-switching in multilingual communication for immigrant communities. In Proceedings of The First Workshop on Computational Approaches to Code Switching. Association for Computational Linguistics.

Scott Piao, Francesca Bianchi, Carmen Dayrell, Angela D’egidio, and Paul Rayson. 2015. Development of the multilingual semantic annotation system. Association for Computational Linguistics.

Mario Piergallini, Rouzbeh Shirvani, Gauri S Gautam, and Mohamed Chouikha. 2016. Word-level language identification and predicting codeswitching points in Swahili-English language data. In Proceedings of EMNLP 2016.

Francesca Santulli. 2007. Le parole e i luoghi: descrizione e racconto. Antelmi, Donelli/Held, Gudrun/Santulli, Francesca, pages 81–153.

Mary Suzanne Schriber. 1995. Women’s place in travel texts. Prospects, 20:161–179.

Sarah Schulz and Mareike Keller. 2016. Code-switching ubique est – Language identification and part-of-speech tagging for historical mixed text. In Proceedings of LaTeCH Workshop.

Mike Scott. 2008. WordSmith tools version 5. Liverpool: Lexical Analysis Software, 122.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 62–72.

Rachele Sprugnoli, Tommaso Caselli, Sara Tonelli, and Giovanni Moretti. 2017. The Content Types Dataset: a new Resource to Explore semantic and functional Characteristics of Texts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 260–266, Valencia, Spain, April. Association for Computational Linguistics.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).