=Paper= {{Paper |id=Vol-2006/paper065 |storemode=property |title=A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing |pdfUrl=https://ceur-ws.org/Vol-2006/paper065.pdf |volume=Vol-2006 |authors=Rachele Sprugnoli,Sara Tonelli,Giovanni Moretti,Stefano Menini |dblpUrl=https://dblp.org/rec/conf/clic-it/SprugnoliTMM17 }} ==A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing== https://ceur-ws.org/Vol-2006/paper065.pdf
    A little bit of bella pianura: Detecting Code-Mixing in Historical English
                                    Travel Writing

          Rachele Sprugnoli (1,2), Sara Tonelli (1), Giovanni Moretti (1), Stefano Menini (1,2)
                               (1) Fondazione Bruno Kessler, Trento
                               (2) Università di Trento
               {sprugnoli,satonelli,moretti,menini}@fbk.eu


Abstract

English. Code-mixing is the alternation between two or more languages in the same text.
This phenomenon is very relevant in the travel domain, since it can provide new insight
into the way foreign cultures are perceived and described to readers. In this paper, we
analyse English-Italian code-mixing in historical English travel writings about Italy. We
retrain and compare two existing systems for the automatic detection of code-mixing, and
analyse the semantic categories most often connected to Italian. In addition, we release
the domain corpus used in our experiments and the output of the extraction.

Italiano. Il code-mixing è l’alternanza di lingue diverse nello stesso testo. Questo
fenomeno è particolarmente importante nel dominio dei viaggi, poiché aiuta a comprendere
meglio il modo in cui vengono percepite e descritte culture diverse da quella dell’autore.
In questo lavoro, analizziamo il code-mixing tra inglese ed italiano nei testi di viaggio
scritti in inglese e aventi come soggetto l’Italia. A questo scopo confrontiamo due sistemi
esistenti per il riconoscimento automatico del code-mixing dopo averli ri-addestrati e
analizziamo le categorie semantiche connesse alle parole/espressioni italiane. Inoltre,
rilasciamo il corpus e il risultato dell’estrazione.

1    Introduction

Code-mixing is the alternation between two or more languages that can occur between
sentences (inter-sentential), within the same utterance (intra-sentential), or even inside
a single token (mixing of morphemes). This phenomenon has been widely studied from the
linguistic, psycholinguistic, and sociolinguistic points of view (Gardner-Chloros, 1995;
Grosjean, 1995; Ho, 2007), but there is no consensus on the terminology to be adopted. In
this paper, code-mixing is used as an umbrella term to indicate a manifestation of language
contact subsuming other expressions such as code-switching, languaging, borrowing, and
language crossing (Muysken, 2000).

Code-mixing characterizes the communication of post-colonial, migrant and multilingual
communities (Papalexakis et al., 2014; Frey et al., 2016) and it emerges in different types
of documents, for example parliamentary debates, interviews and social media posts
(Carpuat, 2014; Das and Gambäck, 2015; Piergallini et al., 2016). Travel writings (e.g.
guidebooks, travelogues, diaries, blogs, travel articles in magazines) are affected by this
phenomenon as well; it has been studied in particular by analyzing small corpora of
contemporary tourism discourse through manual inspection (Dann, 1996). Even if code-mixing
occurs in less than 1% of the cases (Cappelli, 2013), it has several important functions in
the travel domain: it gives a “linguistic sense of place” (Cortese and Hymes, 2001), it adds
authenticity to a narration, it provides translations of culture-specific words and it is a
means to define social identity (“us” tourists versus “they” locals) (Jaworski et al., 2003).

In this work, we investigate the phenomenon of code-mixing in travel writings but,
differently from previous works, we shift the focus of analysis from contemporary to
historical data and from manual to automatic information extraction. As for the first
point, we present a corpus of more than 3.5 million words of English travel writings
published between the end of the XIX Century and the beginning of the XX Century, which we
have retrieved from freely available sources and release in a cleaned format. As for
automatic information extraction, we retrain two state-of-the-art
tools to identify English-Italian code-mixing and evaluate them on a sample of our dataset.
We then run the best system on the whole dataset and perform a semi-automatic refinement of
the automatic annotation. The corpus, the training and test data and the outcome of the
extraction are available online (https://dh.fbk.eu/technologies/code-mixing).

2    Related Work

Automatic language identification of monolingual documents has a long tradition in Natural
Language Processing (Hughes et al., 2006; Lui and Baldwin, 2012). More recently, a new
research topic has emerged: the detection of language at word level in code-mixed texts.
Dedicated workshops and evaluation exercises have been organized on this task, dealing with
different pairs of languages and with social media data (Choudhury et al., 2014; Solorio et
al., 2014; Molina et al., 2016). The most common approach among the proposed systems is
based on Conditional Random Fields (CRFs), but there are also implementations of Logistic
Regression and deep learning algorithms.

To the best of our knowledge, there is no previous work on the automatic identification of
code-mixing in travel writing. Cappelli (2013) and Gandin (2014) have studied the
phenomenon, but they have mainly used standard corpus linguistics tools, i.e. WordSmith
(Scott, 2008), to analyse language contact in English guidebooks, travel blogs written by
expatriates and travel articles from 2002-2012.

3    Corpus Description

Differently from the works cited in the previous Section, we focus on historical texts. To
this end, we collect from Project Gutenberg (https://www.gutenberg.org/) a corpus of travel
writings about Italy written by native English-speaking authors and published between the
unification of the country and the beginning of the 1930s. We choose this period because in
the second half of the XIX Century the tradition of the Grand Tour declined and
leisure-oriented travels emerged. This radical transformation was enabled by technological,
economic and sociological factors, such as the development of steam-powered ships and of
the railway network, the growth of the Anglo-American economy and a greater emancipation of
women, with more female travelers (Schriber, 1995). Moreover, after unification, new routes
to Southern Italy and the islands were opened, so that travelers’ attention was no longer
limited to the classic destinations in Northern and Central Italy, such as Venice, Florence
and Rome (Ouditt and Polezzi, 2012).

The corpus consists of 57 texts (thirty of which are also available in TEI-XML format at
https://sites.google.com/view/travelwritingsonitaly), divided into travel narratives
(reports, diaries, collections of letters) and guidebooks, for a total of 3,630,781 tokens.
We distinguish between these two types of text, following a standard classification of
documents in the travel domain. However, the distinction was not as clear-cut in the period
we take into account as it is now, since reports on personal travel experiences were often
mixed with practical recommendations and long disquisitions on art and history. Therefore,
we adopt as a rule of thumb the distinction suggested in (Santulli, 2007): travel
narratives are those told in the first person, while guidebooks are written in impersonal
form.

The authors of the selected texts belong to different nationalities (UK, US, Ireland,
Australia) and are both male and female. Some books dwell on specific cities or regions,
others cover different parts of Italy or even several countries: in the latter case we
extracted only the chapters related to Italy. Although we made an effort to have a diverse,
well-balanced corpus in terms of content, author’s gender and nationality, this was only
partially possible because of the limited availability of online travel books whose text is
freely available and cleaned from OCR errors. The distribution of tokens according to the
year of publication and type of text is shown in Fig. 1. Details about authors are given in
a spreadsheet provided together with the corpus.

4    Code-Mixing Detection

In this Section we describe the experiments on code-mixing, comparing the performance of
two available systems in different configurations. We also detail the post-processing step
introduced to refine the output of the best performing system.
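The comparison in this Section is reported in terms of precision (P), recall (R) and F-measure (F), per class and as a macro-average over classes. As a minimal sketch of how such scores can be derived, assuming that gold and predicted labels are available as two aligned lists with one label per token, the computation could look as follows. The function names and the toy labels are illustrative assumptions and are not taken from the evaluated tools or from the released data.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {label: (precision, recall, f_measure)} for two aligned label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F-measure over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy data (hypothetical): e = English, i = Italian, p = punctuation, n = named entity.
gold = ["e", "e", "i", "i", "p", "n"]
pred = ["e", "i", "i", "i", "p", "e"]
scores = per_class_prf(gold, pred)
macro = macro_average(scores)
```

The macro-average weights every class equally, so a small class such as named entities influences the overall score as much as the much larger English class.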
Figure 1: Distribution of tokens per year of publication and sub-genre.

4.1    Experimental Setting

In order to automatically extract Italian words, expressions and sentences from the corpus
described in Section 3, we train and test two systems whose source code is available on the
web. The first one (henceforth, langid) is based on character n-grams (n = 1 to 5) and
adopts a weakly supervised approach, i.e. training data are monolingual texts of a few
thousand tokens (King and Abney, 2013). This system includes four classification
algorithms: Conditional Random Field (CRF), Hidden Markov Model (HMM) and Maximum Entropy
Model with and without generalized expectation criteria (MaxEnt-GE and MaxEnt). langid has
been successfully evaluated on documents containing English texts mixed with 30 different
minority languages such as Zulu and Chippewa
(http://www-personal.umich.edu/~benking/resources/langid_release.tar.gz).

For our experiments, we retrain langid using a collection of about 300,000 tokens taken
from monolingual Italian and English books, of different genres, published in the same
period as our corpus. For Italian, these are “Le Avventure di Pinocchio” by C. Collodi,
“Una donna” by S. Aleramo, “Il Valdarno da Firenze al mare” by G. Carocci, “La vita
operosa” by M. Bontempelli, “Dopo il divorzio” by G. Deledda, “Novelle umoristiche” by A.
Albertazzi and “Lezioni e Racconti per i bambini” by I. Baccini. For English, they are “The
Adventures of Tom Sawyer” by M. Twain, “Pioneers of the Old Southwest” by C. L. Skinner,
“The Happy Prince, and Other Tales” by O. Wilde, “Vanished Arizona” by M. Summerhayes, “The
Tale of Peter Rabbit” by B. Potter and “The Strange Case of Dr. Jekyll and Mr. Hyde” by R.
L. Stevenson.

The second system (henceforth, CodeSwitching) has been developed to detect languages in
texts mixing Latin and Middle English (Schulz and Keller, 2016). It implements a CRF
classifier with features generated from TreeTagger models and word lists of both languages
(https://github.com/sarschu/CodeSwitching). Unlike langid, which classifies words as
belonging to one language or the other, this system performs a fine-grained annotation by
distinguishing five classes. Since this system is fully supervised, we create a training
set by manually annotating 3,900 tokens from 4 samples extracted from our corpus, a size in
line with the training data used in the original paper. The training data were annotated
with 5 different classes: Italian tokens (i), English tokens (e), punctuation (p), named
entities (NEs) (n), and ambiguous tokens that belong to the dictionary of both languages
(a).

Both langid and CodeSwitching were evaluated on the same test set, i.e. two samples of
texts (one from a travel narrative and one from a guidebook) of 1,623 tokens. The test set
was annotated by assigning to each token a label for English or Italian, as required by
langid, and also marking punctuation, NEs and ambiguous tokens, following the CodeSwitching
scheme. Since the performance of CodeSwitching is sensitive to the length of the input
file, we split the test set into batches of 40 sentences, replicating the experimental
setting presented in (Schulz and Keller, 2016).

4.2    Evaluation

Table 1 presents the performance of langid on the test set: contrary to the results
achieved by King and Abney (2013), HMM, not CRF, proved to be the best approach. This is
likely due to the greater sparseness of the code-mixing phenomenon in our dataset with
respect to what was registered in the original corpus, where languages different from
English cover 56% of the overall number of tokens.

         CRF     HMM     MaxEnt   MaxEnt-GE
    P    1.00    0.89    0.59     0.82
    R    0.51    0.92    0.90     0.47
    F    0.67    0.90    0.71     0.60

Table 1: Results of the evaluation on the retrained langid system in terms of precision
(P), recall (R), and F-Measure (F).

Table 2 reports Precision, Recall and F-measure of the retrained CodeSwitching system. Even
if the overall performance is slightly better than the one obtained with HMM in langid, the
scores for the detection of Italian tokens (i) are lower (0.82 versus 0.90 in terms of
F-measure). Punctuation (p) and ambiguous tokens (a) are generally detected with a good
performance, while NEs (n) represent the most challenging class. Given that we are mainly
interested in recognising English and Italian terms, and that on this task langid performs
better, we run this tool on the whole corpus.

          i      e      a      n      p      ALL
    P     0.83   0.98   0.98   0.85   0.98   0.92
    R     0.80   0.99   0.90   0.85   0.96   0.90
    F     0.82   0.99   0.94   0.85   0.97   0.91

Table 2: Results of the evaluation on the retrained CodeSwitching system in terms of
precision (P), recall (R), and F-Measure (F) for each class and the macro-average of all
classes.

4.3    Post-processing

In order to refine the output of langid (see Figure 2), we perform three post-processing
steps. First of all, we check whether tokens tagged as Italian are included in Morph-it, an
Italian lexicon of inflected forms (Zanchetta and Baroni, 2005): in this way we are able to
detect false positives. Then, we run the Polyglot Python module
(http://polyglot.readthedocs.io/en/latest/Installation.html) on the corpus to find out
whether the processed documents contain other languages besides English and Italian.
Indeed, 27 books turn out to have a high probability of also including text written in
Latin, French, German or Greek. These books are likely to be problematic given that langid
recognizes only English and Italian. The information obtained in these two steps is then
used to manually check the outcome of the langid extraction and correct it
semi-automatically. Furthermore, we employ the USAS Italian semantic tagger (Piao et al.,
2015) to obtain a categorization of the terms tagged as Italian. Based on the 21 semantic
classes recognised by USAS, we are able to understand in which cases and why writers
switched their narration from English to Italian.

Figure 2: Examples of langid output.

5    Discussion

The classification performed with the USAS tagger shows that Italian is adopted to express
concepts covered by 20 semantic classes, both in guidebooks and in travel narratives. Only
one USAS class, the one related to “Science and technology”, is not found in the corpus.
Table 3 shows frequency and examples for each detected class. As in contemporary travel
writings (Francesconi, 2007), food is well represented: traditional dishes, drinks and
products (e.g. polenta, Chianti, mortadella) appear together with fruits, vegetables (e.g.
mandarini, finocchio) and also eating establishments (e.g. osteria, trattoria, locanda).
The attention to Italian art and architecture manifests itself through the use of many
specialized terms (cassettoni, gotico, giallo antico). The semantic areas of emotions and
psychological processes are not recorded in previous work on contemporary texts but are
frequent especially in travel reports (e.g. addolorata, trionfo, simpatico). As for NEs,
city names reveal an increasing interest in towns in Central regions (for example, Perugia
has a high frequency of occurrence in both genres). Moreover, following the unification of
Italy, travellers discovered several locations in the South (e.g. Ragusa, Catanzaro). Among
the most mentioned people, there are representatives of past Italian politics (e.g. Lorenzo
and Cosimo de Medici), artists (e.g. Giotto, Dante) and religious figures (e.g. Madonna,
San Michele).

                    GUIDEBOOKS                                      TRAVEL NARRATIVES
    SEMANTIC CLASS            #        EXAMPLE        SEMANTIC CLASS            #        EXAMPLE
    names & grammar           29,927   Pisa           names & grammar           28,694   Donatello
    architecture              3,070    villa          social elements           3,134    popolo
    movement                  2,294    automobile     architecture              3,065    palazzo
    social elements           1,590    trinità        environment               1,311    lago
    materials & objects       717      fontana        movement                  1,207    vetturino
    environment               713      campagna       materials & objects       965      rosso
    general/abstract terms    580      essere         general/abstract terms    943      fare
    measurement               340      alto           food & farming            665      trattoria
    arts & crafts             231      stucco         life                      479      fiore
    time                      225      nuovo          measurement               464      grande
    life                      222      agnello        time                      379      primavera
    body                      211      cintola        body                      350      braccio
    public domain             205      podestà        psyche                    330      vedere
    psyche                    198      volere         entertainment             319      marionetta
    food & farming            162      maccaroni      money & commerce          269      dazio
    entertainment             141      giuoco         communication             268      dire
    emotion                   137      amore          public domain             260      carabiniere
    communication             131      motto          arts & crafts             206      arte
    money & commerce          127      soldo          emotion                   176      evviva
    education                 22       università     education                 135      maestro

Table 3: Italian word frequency for each semantic class

In many cases, the use of Italian is not limited to single words or multi-token expressions
(e.g. appartamento signorile) but longer utterances are reported. Texts of both genres
contain proverbs (e.g. chi tardi arriva mal alloggia) and citations, not only from the
canon of Italian literature, such as Leopardi’s poems, but also from the popular tradition,
such as Tuscan songs (O rosa O rosa O rosa gentillina). The main difference between travel
narratives and guidebooks is the greater presence in the former of dialogues or expressions
heard by the author during his/her stay in Italy (voi siete un
cattivo; e voi siete bella).

6    Conclusions and Future Work

In this work, we presented the first automated analysis of code-mixing in historical travel
writings. In particular, we focus on English documents about Italy, and we compare
guidebooks and travel narratives, analysing the semantic categories most related to
code-mixing.

In the future, we plan to investigate how code-mixing phenomena relate to content types in
travel writings (Sprugnoli et al., 2017). We are also planning to implement an algorithm to
automatically link code-mixing quotations to their original source text. Finally, we would
like to extend our experiments to recognise code-mixing in multiple languages, and compare
the semantic domains specific to each language.

References

Gloria Cappelli. 2013. Travelling words: Languaging in English tourism discourse. Travels
  and translations, pages 353–374.

Marine Carpuat. 2014. Mixed-language and code-switching in the Canadian Hansard. In
  Proceedings of EMNLP 2014, page 107.

Monojit Choudhury, Gokul Chittaranjan, Parth Gupta, and Amitava Das. 2014. Overview of FIRE
  2014 track on transliterated search. In Proceedings of FIRE.

Giuseppina Cortese and Dell Hymes. 2001. Languaging in and across human groups.
  Perspectives on difference and asymmetry. Textus. English Studies in Italy, 14(2).

Graham MS Dann. 1996. The language of tourism: a sociolinguistic perspective. Cab
  International.

Amitava Das and Björn Gambäck. 2015. Code-mixing in social media text: the last language
  identification frontier? Revue TAL, pages 41–64.

Sabrina Francesconi. 2007. Italian borrowings from the semantic fields of food and drink in
  English tourism texts. The Languages of Tourism: turismo e mediazione, Milano: Unicopli,
  page 129.

Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W Stemle. 2016. The DiDi Corpus of South
  Tyrolean CMC Data: A Multilingual Corpus of Facebook Texts. In Proceedings of CLIC-it.

Stefania Gandin. 2014. Investigating loan words and expressions in tourism discourse: A
  corpus driven analysis on the bbctravel corpus. European Scientific Journal, 10(2).

Penelope Gardner-Chloros. 1995. Code-switching in community, regional and national
  repertoires: the myth of the discreteness of linguistic systems. One speaker, two
  languages: Cross-disciplinary perspectives on code-switching, pages 68–89.

François Grosjean. 1995. A psycholinguistic approach to code-switching: The recognition of
  guest words by bilinguals. One speaker, two languages: Cross-disciplinary perspectives on
  code-switching, pages 259–275.

Judy Woon Yee Ho. 2007. Code-mixing: Linguistic form and socio-cultural meaning. The
  International Journal of Language Society and Culture, 21.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006.
  Reconsidering language identification for written language resources. In Proc.
  International Conference on Language Resources and Evaluation, pages 485–488.

Adam Jaworski, Crispin Thurlow, Sarah Lawson, and Virpi Ylänne-McEwen. 2003. The uses and
  representations of local languages in tourist destinations: A view from British TV
  holiday programmes. Language Awareness, 12(1):5–29.

Ben King and Steven P Abney. 2013. Labeling the languages of words in mixed-language
  documents using weakly supervised methods. In Proceedings of HLT-NAACL, pages 1110–1119.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification
  tool. In Proceedings of the ACL 2012 system demonstrations, pages 25–30. Association for
  Computational Linguistics.

Giovanni Molina, Nicolas Rey-Villamizar, Thamar Solorio, Fahad AlGhamdi, Mahmoud Ghoneim,
  Abdelati Hawwari, and Mona Diab. 2016. Overview for the second shared task on language
  identification in code-switched data. In Proceedings of EMNLP 2016, pages 40–49.

Pieter Muysken. 2000. Bilingual speech: A typology of code-mixing, volume 11. Cambridge
  University Press.

Sharon Ouditt and Loredana Polezzi. 2012. Introduction: Italy as place and space. Studies
  in Travel Writing, 16(2):97–105.

Evangelos Papalexakis, Dong-Phuong Nguyen, and A Seza Doğruöz. 2014. Predicting
  code-switching in multilingual communication for immigrant communities. In Proceedings of
  The First Workshop on Computational Approaches to Code Switching. Association for
  Computational Linguistics.

Scott Piao, Francesca Bianchi, Carmen Dayrell, Angela D’Egidio, and Paul Rayson. 2015.
  Development of the multilingual semantic annotation system. Association for Computational
  Linguistics.

Mario Piergallini, Rouzbeh Shirvani, Gauri S Gautam, and Mohamed Chouikha. 2016. Word-level
  language identification and predicting codeswitching points in Swahili-English language
  data. In Proceedings of EMNLP 2016.

Francesca Santulli. 2007. Le parole e i luoghi: descrizione e racconto. Antelmi,
  Donelli/Held, Gudrun/Santulli, Francesca, pages 81–153.

Mary Suzanne Schriber. 1995. Women’s place in travel texts. Prospects, 20:161–179.

Sarah Schulz and Mareike Keller. 2016. Code-switching ubique est – Language identification
  and part-of-speech tagging for historical mixed text. In Proceedings of LaTeCH Workshop.

Mike Scott. 2008. WordSmith tools version 5. Liverpool: Lexical Analysis Software, 122.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud
  Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. 2014.
  Overview for the first shared task on language identification in code-switched data. In
  Proceedings of the First Workshop on Computational Approaches to Code Switching, pages
  62–72.

Rachele Sprugnoli, Tommaso Caselli, Sara Tonelli, and Giovanni Moretti. 2017. The Content
  Types Dataset: a new Resource to Explore semantic and functional Characteristics of
  Texts. In Proceedings of the 15th Conference of the European Chapter of the Association
  for Computational Linguistics: Volume 2, Short Papers, pages 260–266, Valencia, Spain,
  April. Association for Computational Linguistics.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource
  for the Italian language. Corpus Linguistics 2005, 1(1).