=Paper=
{{Paper
|id=Vol-3290/short_paper9153
|storemode=property
|title='Entrez!' she called: Evaluating Language Identification Tools in English Literary Texts
|pdfUrl=https://ceur-ws.org/Vol-3290/short_paper9153.pdf
|volume=Vol-3290
|authors=Erik Ketzan,Nicolas Werner
|dblpUrl=https://dblp.org/rec/conf/chr/KetzanW22
}}
=='Entrez!' she called: Evaluating Language Identification Tools in English Literary Texts==
‘Entrez!’ she called: Evaluating Language Identification Tools in English Literary Texts

Erik Ketzan¹,∗ and Nicolas Werner²

¹ Centre for Digital Humanities, Trinity College Dublin, Dublin, Ireland
² University of Cologne, Cologne, Germany

Abstract

This short paper presents work in progress on the evaluation of current language identification (LI) tools for identifying foreign-language n-grams in English-language literary texts, for instance, “‘Entrez!’ she called”. We first manually annotated French and Spanish words appearing in 12,000-word text samples by F. Scott Fitzgerald and Ernest Hemingway using a TEI tag. We then split the tagged sample texts into four groups of n-grams, from unigram to tetragram, and compared the accuracy of five LI packages in correctly identifying the language of the tagged foreign-language snippets. We report that, of the packages tested, Fasttext proved most accurate for this task overall, but that methodological questions and future work remain.

Keywords: Language Identification, Computational Literary Studies

1. Introduction

This paper presents work in progress on the evaluation of current language identification (LI) tools for identifying foreign-language n-grams in literary texts. The primary goal is to aid literary scholars who may wish to automatically identify, for example, bits of French in the novels of F. Scott Fitzgerald or Spanish in the novels of Ernest Hemingway. Ideally, the outcome of this project would be a single, easy-to-use Python script into which a literary scholar could input any text and receive a table of n-grams with language candidates generated by the LI.

Automatic language identification, the task of determining the natural language that a document or part thereof is written in [7], has been extensively researched since the 1960s, and a number of advanced off-the-shelf LI packages and APIs are now available.
As a richly developed NLP technology, LI systems have been the subject of large and robust evaluation methods and experiments [7], but we present this report on the precision of leading off-the-shelf LI packages for, specifically, the use case of computational literary studies, and even more specifically, texts which feature a primary language and small snippets of other languages.

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium
∗ Corresponding author: ketzane@tcd.ie (E. Ketzan), ORCID 0000-0003-0264-2157
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Table 1: Selection of manually tagged Spanish in Hemingway’s The Sun Also Rises. Emphasis added.
* He was anxious to know the English for ''Corrida de toros''
* “Drunk,” I said. “''Borracho! Muy borracho!''”
* They come to see the last day of the quaint little Spanish ''fiesta''.
* At a thousand ''duros'' apiece
* You’re not an ''aficionado''?

Literary scholars, editors, and creators of scholarly editions may wish to have such an automatic language identification tool at hand to aid scholarly commentary on literary texts, to annotate in TEI and/or machine-translate secondary languages in such texts, or to investigate code-switching (CS), which has been defined as “the ability on the part of bilinguals to alternate effortlessly between their two languages” [3]. While much literature on CS focuses on speech, “interest in written code-switching has developed more slowly” [6]. Studies on CS in literary texts often employ some form of digital tool for analysis, but identifying/tagging the second language (L2) is often performed manually [4, 1, 12, 11, 13, 15]. Anecdotally, we are aware that researchers prefer manual annotation for this step, as LI remains too unreliable.
And our experimental results below support this impression.

The application of recent LI models to literary studies has so far been minimal. Pianzola et al. [14] report that in the classification of the titles of texts (not the body of the texts) of fan fiction and classics in numerous languages, “CLD2 and CLD3 were not able to identify [the correct language of] 24.6% and 29.1% of the titles respectively”. As Pianzola et al. were dealing with text titles, which presumably maintained language consistency throughout the title, while we look at 1- through 4-grams which can mix languages, we expect our results to be worse than the ~70–75% accuracy they report.

2. Corpora and Tagging

As this paper presents work in progress, our text sample is small while we solicit feedback on experiment design on a larger scale. As sample evaluation texts, we chose two well-known literary texts in English. A single annotator manually annotated 49 Spanish n-grams in a 12,000-word sample of Hemingway’s The Sun Also Rises (1926, published under the title Fiesta in London)¹ and 27 French n-grams in a 12,000-word sample of Fitzgerald’s Tender is the Night (1934)² using a single TEI P5 tag, <foreign>, which identifies a word or phrase as belonging to a language other than that of the surrounding text.³ Named entities were excluded from manual tagging, and a sample of manually tagged foreign-language strings in these texts is presented in Tables 1 and 2.

¹ In total, this was 237 unigrams + bigrams + trigrams + tetragrams.
² In total, this was 281 unigrams + bigrams + trigrams + tetragrams.
³ https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-foreign.html

Table 2: Selection of manually tagged French in Fitzgerald’s Tender is the Night. Emphasis added.
* “Entrez!” she called, but there was no answer
* Even his ''carte d’identité'' has been seen.
* The famous Paul, the ''concessionaire'', had not arrived
* At four the ''chasseur'' approached him
* he was joined by a ''gendarme''

One unresolved challenge of manually tagging foreign-language strings is: should foreign-language words widely recognizable to native speakers, as a result of language borrowing and loan words [10, 5, 2], be tagged as foreign language? Some of the many examples of borrowed words or loan words in English include restaurant, table, bonhomie, and kindergarten, and there is the additional complication with historical texts that what is considered a loan word can change over time. While loan words present a methodological challenge for the technical evaluation of LI in literary texts, this challenge is perhaps less relevant for the desired outcome of this project: a Python script which presents a literary scholar with all or many candidates of foreign-language words in a text, which the scholar could then sift according to their own research goals. Future work could also, of course, consider multiple annotators and inter-annotator agreement metrics in this step.

3. Experiment: Evaluation of LI Packages on Hemingway and Fitzgerald Samples

In our first experiment, we evaluated five current LI packages on the two manually tagged samples of Hemingway and Fitzgerald:

* fasttext: includes models for LI, developed by Facebook AI Research [9, 8]. https://fasttext.cc
* langdetect: a port of Nakatani Shuyo’s language-detection library to Python, developed by Michal Mimino Danilak. https://pypi.org/project/langdetect/
* langid: a standalone LI tool, developed by Marco Lui and Timothy Baldwin. https://github.com/saffsd/langid.py
* polyglot: depends on the pycld2 library, which in turn depends on Google’s Compact Language Detector v2 (CLD2) library for detecting languages in plain text. Developed by Rami Al-Rfou. https://polyglot.readthedocs.io/en/latest/Detection.html
* pycld3: Python bindings to Google’s Compact Language Detector v3 (CLD3), a neural network model for language identification with support for over 100 languages, developed by Brad Solomon. https://pypi.org/project/pycld3/

Our manual annotation of foreign-language text served as the gold standard or evaluation criterion, and we wished to see how well the different LI packages performed in matching our manual tags. For literary texts with minor amounts of secondary language, such as the bits of Spanish in The Sun Also Rises, our method of query is to split the texts into n-gram strings and have the LI packages make a prediction for each string. We thus transformed our manually annotated foreign-language words into 1- to 4-grams in which at least 50% of the word tokens are foreign-language. For instance, Hemingway’s line, “We wished him ‘Mucha suerte,’ shook hands, and went out”, generated the bigrams “him mucha”, “mucha suerte”, and “suerte shook”.

Our explored method of automatically detecting Spanish n-grams thus first requires a split of the original text into n-grams. But what level of n-gram would produce the most accurate result? Here, we tested the packages by splitting the texts into four different lists of n-grams, as well as a combined result, measuring whether the correct prediction was made at any level of text split (1-, 2-, 3-, or 4-gram). The texts were split using the nltk.util.ngrams method from the nltk package. For those LI libraries which offer the option to return multiple language predictions with probability values, we used only the prediction with the highest probability for evaluation.
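The n-gram preparation described above can be sketched in a few lines of Python. The helper below mirrors the behavior of nltk.util.ngrams for word tokens; the tuple encoding of manual tags and the 50% threshold are our illustrative assumptions, reproducing the Hemingway example rather than the project's actual data format:

```python
def ngrams(tokens, n):
    """Sliding-window n-grams, equivalent to nltk.util.ngrams for this use."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def foreign_ngrams(tagged_tokens, n, threshold=0.5):
    """Keep n-grams in which at least `threshold` of the tokens are tagged foreign."""
    return [" ".join(tok for tok, _ in gram)
            for gram in ngrams(tagged_tokens, n)
            if sum(flag for _, flag in gram) / n >= threshold]

# Hemingway's line from the paper; 1 marks a manually tagged Spanish token.
tagged = [("we", 0), ("wished", 0), ("him", 0), ("mucha", 1), ("suerte", 1),
          ("shook", 0), ("hands", 0), ("and", 0), ("went", 0), ("out", 0)]
print(foreign_ngrams(tagged, 2))
# → ['him mucha', 'mucha suerte', 'suerte shook']
```

The same call with n=1 through n=4 produces the four n-gram lists that the LI packages are then queried on.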
Evaluating the accuracy of the five LI packages on our sample, the LI package Fasttext performed best overall when considering 1- through 4-grams (Figures 1 and 2). The Langid package outperformed Fasttext on French 4-grams and Spanish 3-grams. An intuitive hypothesis would be that all LI packages would perform better given a longer text sample. Our method of performing the evaluation on n-grams that mix languages, however, complicates this picture: splitting the texts into n-grams that mix lexis from two languages no doubt “confused” the LI packages, so that longer text snippets (as long as 4-grams) did not necessarily produce better LI accuracy. This strikes at the core of the problem in this particular research: while LI presumably performs best with longer texts in a single language, foreign-language words and phrases in English literary texts seem most often to be very short, down to the unigram level. It must be noted that at these small sample sizes, even a few correct or incorrect classifications can alter the results significantly. With those caveats, what emerges from this small experiment is that Fasttext performs best on the task, with around 40–60% accuracy when 1- to 4-grams are all considered together. Is a tool based on this method actually useful for literary scholars who wish to automatically pull out the French in Fitzgerald or the Spanish in Hemingway? At this juncture, it is better than nothing, but certainly short of the goal.

4. Conclusion and Future Work

We suggest that LI holds great potential to assist a variety of computational literary studies. Some ideas for future work are discussed above, but they certainly include larger evaluation samples, more languages, texts from more time periods, perhaps multiple annotators and inter-annotator agreement metrics, and a resolution of methodological questions such as the classification and possible exclusion of loanwords.
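Inter-annotator agreement for the tagging step could be measured with, for example, Cohen's kappa. A minimal standard-library sketch follows; the per-token "foreign"/"native" labels for two annotators are hypothetical, not data from this study:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labelling the same sequence of tokens:
    observed agreement corrected by the agreement expected under chance."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    p_observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    p_expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / (n * n)
    return 1.0 if p_expected == 1 else (p_observed - p_expected) / (1 - p_expected)

# Hypothetical token-level tags from two annotators.
a = ["foreign", "native", "native", "foreign", "native"]
b = ["foreign", "native", "foreign", "foreign", "native"]
print(round(cohens_kappa(a, b), 3))
# → 0.615
```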
Future work must devote significant attention to the different approaches of the packages in order to better understand and improve results. For instance, Fasttext is a (sub)word embedding library, while langid employs a Naive Bayes classification approach. Errors in the classification task should also be analyzed closely.

[Figure 1: Accuracy of five LI packages in detecting French n-grams in a 12,000-word sample of Fitzgerald’s Tender is the Night.]

Another unexplored method is simply to restrict the languages that an LI can guess for each text. Our experiments allowed the LI packages to predict any language (leading to incorrect predictions from German to Filipino), but results would undoubtedly improve if, for instance, we restricted the LI to predict only English or Spanish in the Hemingway text. While this would improve results, it would require manual inspection of each literary text, which would become a greater burden for researchers the larger the target corpus grows.

Approaches other than LI should also be attempted for this task. A dictionary-based method could reveal possible foreign words. Certain part-of-speech taggers tag foreign/unknown words. A predictive language model may be able to identify foreign-word candidates; for instance, given “he was joined by a”, the model could predict the next word, and if the actual word (here “gendarme”) is not in the top 50/100/500 predictions, it may be a foreign word. A document-based language distribution may also provide candidates for foreign words: since a text introduces most of the words that are used throughout near its beginning (a Zipf distribution), a new word appearing late in the text could possibly be foreign.

[Figure 2: Accuracy of five LI packages in detecting Spanish n-grams in a 12,000-word sample of Hemingway’s The Sun Also Rises.]
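The dictionary-based alternative mentioned above amounts to a lexicon lookup: any token absent from an English wordlist becomes a foreign-word candidate. A minimal sketch, in which the tiny vocabulary is purely illustrative (a real implementation would use a large lexicon, e.g. from a spellchecker):

```python
import re

def foreign_candidates(text, english_vocab):
    """Flag tokens absent from an English lexicon as possible foreign words."""
    # Tokenize on Latin letters (including common accented ones), apostrophes, hyphens.
    tokens = re.findall(r"[a-zA-Z\u00c0-\u00ff'-]+", text.lower())
    return [t for t in tokens if t not in english_vocab]

# Illustrative toy lexicon only; coverage determines the false-positive rate.
vocab = {"he", "was", "joined", "by", "a"}
print(foreign_candidates("He was joined by a gendarme", vocab))
# → ['gendarme']
```

As with LI, loan words already absorbed into the lexicon would silently disappear from the candidate list, so the methodological question raised in Section 2 applies here as well.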
It should be noted that there are already many practical solutions for readers and scholars to locate and understand snippets of foreign languages in literary texts. Canonical novels are often published in popular or critical editions with translations in footnotes. Professional or fan-made websites may contain lists of translations of foreign-language text in popular novels. Ebook readers such as Amazon’s Kindle offer machine translation of snippets as a built-in feature. Users may point their mobile phones at foreign-language text and obtain on-the-fly machine translation from applications such as Google Translate. Improved foreign-language identification in literary texts could, however, supplement or improve upon some of these tasks.

A future goal of our research is to potentially unearth macro trends in foreign-language text usage in literary history, for example the presumed decline of Latin or French in English-language literature. Such distant reading or cultural analytics of large literary corpora could perhaps reveal long-term trends in the rise and fall of foreign-language usage over decades and centuries. Indeed, we have already performed such exploratory experiments with large corpora, but the low accuracy of LI packages in our experiments thus far, and the tremendous number of false positives in the results, have not allowed meaningful conclusions. But as LI packages improve at the n-gram level, and our methodological application of them to literary texts improves, such trends may become observable.

5. Acknowledgments

The authors wish to thank the peer reviewers for their extremely helpful comments.

References

[1] M. Albakry and P. H. Hancock. “Code switching in Ahdaf Soueif’s The Map of Love”. In: Language and Literature: International Journal of Stylistics 17.3 (2008), pp. 221–234. doi: 10.1177/0963947008092502.
[2] K. Brown and J. Miller. The Cambridge Dictionary of Linguistics. 1st ed. Cambridge University Press, 2013. doi: 10.1017/cbo9781139049412.
[3] B. E. Bullock and A. J. Toribio. “Themes in the study of code-switching”. In: The Cambridge Handbook of Linguistic Code-switching. Ed. by B. E. Bullock and A. J. Toribio. 1st ed. Cambridge University Press, 2009, pp. 1–18. doi: 10.1017/cbo9780511576331.002.
[4] L. Callahan. Spanish/English Codeswitching in a Written Corpus. Vol. 27. Studies in Bilingualism. Amsterdam: John Benjamins Publishing Company, 2004. doi: 10.1075/sibil.27.
[5] D. Crystal. A Dictionary of Linguistics and Phonetics. 6th ed. The Language Library. Malden, MA; Oxford: Blackwell Publishing, 2008.
[6] P. Gardner-Chloros and D. Weston. “Code-switching and multilingualism in literature”. In: Language and Literature: International Journal of Stylistics 24.3 (2015), pp. 182–193. doi: 10.1177/0963947015585065.
[7] T. Jauhiainen, M. Lui, M. Zampieri, T. Baldwin, and K. Lindén. Automatic Language Identification in Texts: A Survey. 2018. url: http://arxiv.org/abs/1804.08186.
[8] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. 2016. url: http://arxiv.org/abs/1612.03651.
[9] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of Tricks for Efficient Text Classification. 2016. url: http://arxiv.org/abs/1607.01759.
[10] P. H. Matthews. The Concise Oxford Dictionary of Linguistics. 2nd ed. Oxford Paperback Reference. Oxford; New York: Oxford University Press, 2007.
[11] C. Montes-Alcalá. “Code-switching in US Latino literature: The role of biculturalism”. In: Language and Literature: International Journal of Stylistics 24.3 (2015), pp. 264–281. doi: 10.1177/0963947015585224.
[12] A. Mullen. “‘In both our languages’: Greek–Latin code-switching in Roman literature”. In: Language and Literature: International Journal of Stylistics 24.3 (2015), pp. 213–232. doi: 10.1177/0963947015585244.
[13] K. B. Müller. “Code-switching in Italo-Brazilian literature from Rio Grande do Sul and São Paulo: A sociolinguistic analysis of the forms and functions of literary code-switching”. In: Language and Literature: International Journal of Stylistics 24.3 (2015), pp. 249–263. doi: 10.1177/0963947015585228.
[14] F. Pianzola, S. Rebora, and G. Lauer. “Wattpad as a resource for literary studies. Quantitative and qualitative examples of the importance of digital social reading and readers’ comments in the margins”. In: PLOS ONE 15.1 (2020). Ed. by D. Orrego-Carmona, e0226708. doi: 10.1371/journal.pone.0226708.
[15] M. J. Sánchez and E. Pérez-García. “Acculturation through Code-Switching: Linguistic Analysis in Three Short Stories: ‘Invierno’, ‘Nilda’ and ‘The Pura Principle’ (Díaz 2012)”. In: Miscelánea: A Journal of English and American Studies 61 (2020), pp. 59–79. doi: 10.26754/ojs_misc/mj.20205139.