<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Loanword Detection with Maximally Simple Tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Hammond</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Linguistics, U. of Arizona</institution>
          ,
          <addr-line>Tucson, AZ 85721</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper demonstrates a maximally simple approach to loanword detection. The different systems developed here all start with handcrafted features. The first approach uses a logistic regression model and a second approach uses a simple feed-forward neural network. Finally, in a third approach, the output of the neural network is “repaired” in various ways for the submission. The resulting system is not competitive, but exemplifies how extremely simple techniques can produce output that is “in the ballpark”.</p>
      </abstract>
      <kwd-group>
        <kwd>loanwords</kwd>
        <kwd>English</kwd>
        <kwd>Spanish</kwd>
        <kwd>ensemble methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Overview</title>
    </sec>
    <sec id="sec-2">
      <title>2. Previous approaches</title>
      <p>
        A very similar task was run in 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In that task, teams were provided specific training data. At
that time, only two system descriptions were included. One team used conditional random fields and
data augmentation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The other team used pretraining with unlabeled data [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        The task is related to the more general task of language identification [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. There are a few earlier
studies in different languages using a variety of more traditional techniques [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The specific task</title>
      <p>
        The specific task here was to identify unassimilated borrowed words or word sequences from English
in Spanish sentences [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. We were given a sample file, reference.csv, demonstrating the format.
Here is the first line of that file.¹
      </p>
      <p>Somos un país en el que ‘youtubers’ y ‘gamers’ millonarios deciden irse al paraíso fiscal de Andorra porque tributan menos sin preocuparse del bienestar de sus vecinos y de quienes les han hecho ricos, ni acordarse de la Educación, Sanidad, infraestructuras de las que han disfrutado durante años gracias a la solidaridad de todos. ;youtubers;gamers;;;</p>
      <p>Fields are separated with ;. The sentence appears first and borrowings occupy the remaining fields on the line. In this line the borrowings appear in single quotes, but this is not generally true.</p>
      <p>Testing was on the basis of the input.csv file (from the submission website). This file also has a
single sentence on each line, but the remaining fields that would otherwise contain any borrowings are
absent.</p>
      <p>The goal was to extract the borrowings from the sentence and append them as additional fields on
the line.</p>
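      <p>The field layout just described can be sketched in a few lines of python; this is a sketch that assumes the separator ; never occurs inside the sentence itself:</p>

```python
# Sketch of parsing one line in the reference.csv layout: the sentence
# comes first, borrowings fill the remaining ;-separated fields, and
# unused trailing fields are empty. Assumes ';' never occurs inside
# the sentence itself.
def parse_line(line):
    fields = line.rstrip("\n").split(";")
    sentence = fields[0]
    borrowings = [f for f in fields[1:] if f]
    return sentence, borrowings

sentence, borrowings = parse_line("Somos un pais de 'youtubers' y 'gamers'.;youtubers;gamers;;;")
print(borrowings)  # ['youtubers', 'gamers']
```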
    </sec>
    <sec id="sec-4">
      <title>4. The approach</title>
      <p>The approach taken here is maximally simple. I deliberately make use of the simplest possible techniques
with an eye to seeing how far these go. There is a pedagogical purpose to this. I teach an introductory
course in our Human Language Technology program on statistical natural language processing. The
question addressed here is whether techniques covered in that introductory course are sufficient for a
reasonable attempt at loanword identification.</p>
      <p>I do not expect these to be competitive; rather the hope is that: i) they are in the ballpark as far as
success; and ii) that their performance points the way toward more successful strategies.</p>
      <p>The specific strategy employed here is as follows.
1. Identify by hand features that are likely to be helpful in identifying borrowings.
2. Break each sentence up into words and make predictions word by word.
3. Build an initial logistic regression model to identify borrowings.
4. Build a simple neural net, replacing the logistic regression model.
5. Look at the output and add post-processing as needed.
6. Reassemble words into sentences for evaluation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Resources</title>
      <p>My submissions all made use of these resources:
1. COALAS dataset (https://github.com/lirondos/coalas).
2. NLTK wordlists
• nltk.corpus.words (English)
• nltk.corpus.cess_esp (Spanish)
3. NLTK stemmers
• nltk.stem.PorterStemmer (for English)
• nltk.stem.SnowballStemmer (for Spanish)
4. unix/mac English wordlist (/usr/share/dict/words)
5. Another English wordlist (https://github.com/dwyl/english-words)
6. spacy taggers for English and Spanish</p>
      <p>¹Here is a description of the task, https://adobo-task.github.io/borrowing.html, and here is the submission website, https://www.codabench.org/competitions/7284/.</p>
      <p>The COALAS dataset was used to train all models and is organized by word. Here are the first few lines of the training partition.</p>
      <p>Microsoft O
promete O
formación O
digital O
gratis O
a O
25 O
millones O
de O
personas O
este O
año O

El O
gigante O
del O
software B-ENG
Microsoft O
lanzó O
este O
martes O</p>
      <p>Each word appears on its own line and is coded for whether it’s a borrowing or not. Sentences are
separated by blank lines.</p>
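      <p>Reading this format back into sentences can be sketched as follows, assuming the two-column, blank-line-separated layout shown above; the helper name is ours:</p>

```python
# Minimal sketch of loading the COALAS-style training file: one word and
# its tag per line, sentences separated by blank lines. The exact column
# separator (whitespace) is an assumption.
def read_tagged(lines):
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            word, tag = line.split()      # e.g. "software B-ENG"
            current.append((word, tag))
    if current:                           # flush the final sentence
        sentences.append(current)
    return sentences

data = read_tagged(["Microsoft O", "promete O", "", "El O", "software B-ENG"])
print(len(data))  # 2
```

In practice the lines would come from the downloaded training file rather than a list literal.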
    </sec>
    <sec id="sec-6">
      <title>6. Features</title>
      <p>I set this up initially as binary logistic regression where 0 is Spanish and 1 is borrowed. I generated a set of features based entirely on intuition. These include:
• Is the word in all capitals?
The logic here was that capitalization would indicate an acronym.
• Is the first letter capitalized?
This assumes proper names are not borrowings.
• Does the word include non-alphabetic characters?
Strings containing special characters are not borrowings.
• Does the word include common Spanish suffixes?
Check specifically for whether the word ends in high-frequency Spanish endings, e.g. ión, o(s), a(s).
• Does the word appear in any of the English wordlists?
Brute-force check against all the English wordlists, unioned together and implemented as a python set.
• Does the word appear in the Spanish wordlist?
Same check against the Spanish wordlist.
• What is the character bigram probability with respect to English?
I built a character bigram model for both languages; this feature is the score from the English model.
• What is the character bigram probability with respect to Spanish?
Score from the Spanish bigram model.
• Can the word be stemmed for English?
Using the nltk Porter stemmer, is the result “smaller” than the input?
• Can the word be stemmed for Spanish?
Same procedure using the Snowball stemmer for Spanish.
• What is the word’s part-of-speech tag?
I tagged each sentence with the spacy Spanish part-of-speech tagger and generated a rule for each tag. The basic idea here is that borrowings would be most associated with nouns and adjectives, and these rules would pick up on that.</p>
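      <p>A minimal sketch of how several of these features might be computed, assuming the wordlists are available as python sets; the smoothing constant and suffix list are illustrative, and the stemming and tagging features are omitted for brevity:</p>

```python
import math
import re
from collections import Counter

def train_bigrams(words):
    """Character bigram counts, with '#' marking word boundaries."""
    counts, context = Counter(), Counter()
    for w in words:
        chars = f"#{w.lower()}#"
        for a, b in zip(chars, chars[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def bigram_score(word, model, alphabet=60):
    """Length-normalized log probability with add-one smoothing (an assumption)."""
    counts, context = model
    chars = f"#{word.lower()}#"
    logp = sum(
        math.log((counts[(a, b)] + 1) / (context[a] + alphabet))
        for a, b in zip(chars, chars[1:])
    )
    return logp / (len(chars) - 1)

SPANISH_SUFFIXES = ("ión", "o", "os", "a", "as")   # illustrative list

def features(word, eng_words, spa_words, eng_model, spa_model):
    lower = word.lower()
    return [
        float(word.isupper()),                            # all capitals
        float(word[:1].isupper()),                        # initial capital
        float(bool(re.search(r"[^a-záéíóúüñ]", lower))),  # non-alphabetic chars
        float(lower.endswith(SPANISH_SUFFIXES)),          # Spanish suffix
        float(lower in eng_words),                        # English wordlist hit
        float(lower in spa_words),                        # Spanish wordlist hit
        bigram_score(word, eng_model),                    # English bigram score
        bigram_score(word, spa_model),                    # Spanish bigram score
    ]

eng_model = train_bigrams(["software", "green", "card"])  # toy training words
spa_model = train_bigrams(["casa", "niño", "personas"])
vec = features("software", {"software"}, {"casa"}, eng_model, spa_model)
print(vec[4], vec[6] > vec[7])  # 1.0 True
```

With the real wordlists and bigram models trained on full vocabularies, the same function yields one fixed-length vector per word.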
      <p>Both logistic regression and neural net models made use of these same features. With those rules in hand, each item in the COALAS dataset was used for training. Each item was scored with respect to the rules above and the scores were then converted to z-scores. Using the reference.csv file for testing, this got an F1 of around .6 over multiple runs. Results from these models were not submitted.</p>
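      <p>The training step can be sketched with a hand-rolled logistic regression on z-scored features; the tiny arrays, learning rate, and iteration count are illustrative, not the COALAS data or the actual training setup:</p>

```python
import numpy as np

# Sketch of the first model: z-score the feature columns, then fit a
# binary logistic regression (0 = Spanish, 1 = borrowed) by plain
# batch gradient descent.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

mu, sigma = X.mean(axis=0), X.std(axis=0)
Xz = (X - mu) / sigma                      # z-scores, per feature

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Xz @ w + b)))   # predicted probabilities
    grad = p - y                              # gradient of the log loss
    w -= 0.1 * Xz.T @ grad / len(y)
    b -= 0.1 * grad.mean()

test = (np.array([0.95, 0.05]) - mu) / sigma  # normalize with training stats
print(1.0 / (1.0 + np.exp(-(test @ w + b))) > 0.5)  # True
```

Note that the test point is normalized with the training-set mean and standard deviation, not its own.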
    </sec>
    <sec id="sec-7">
      <title>7. Simple neural net</title>
      <p>The logistic regression model did not include interactions between factors, so I implemented a neural
net with pytorch. The logic here is that a fully-connected multi-layer net would capture all possible
interactions between features. The architecture was extremely simple: 3 layers with the same dimensions
as the input. There was a final output layer that produced a single output value. The activation function
for all layers was sigmoid.</p>
      <p>Again I tested with reference.csv, with various numbers of epochs and different batch sizes; this reached F1 values as high as .65 over multiple runs.</p>
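      <p>The architecture described above can be sketched in pytorch as follows; the input dimension and all hyperparameters are assumptions:</p>

```python
import torch
import torch.nn as nn

# Sketch of the architecture described above: three fully-connected
# layers with the same dimension as the input, sigmoid activations
# throughout, and a single output unit giving P(borrowed).
class BorrowingNet(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_features), nn.Sigmoid(),
            nn.Linear(n_features, n_features), nn.Sigmoid(),
            nn.Linear(n_features, n_features), nn.Sigmoid(),
            nn.Linear(n_features, 1), nn.Sigmoid(),  # single output value
        )

    def forward(self, x):
        return self.net(x)

model = BorrowingNet(12)                   # 12 features is an assumption
out = model(torch.zeros(4, 12))            # batch of 4 feature vectors
print(out.shape)  # torch.Size([4, 1])
```

Training would pair this with a binary cross-entropy loss and an optimizer, exactly as in an introductory pytorch setup.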
    </sec>
    <sec id="sec-8">
      <title>8. Rules</title>
      <p>I augmented the neural net with a rule-based system. That is, I applied rules to the output of the
network. Specifically:
• Adjacent borrowings count as a single borrowing.</p>
      <p>Since my system was word-based, I needed to convert the output back to sentences and
concatenate adjacent borrowings.
• All quoted sequences of up to 3 words are borrowings.</p>
      <p>It became clear that quotes were being used to mark borrowings in the input.csv file.
• All capitalized sequences of up to 3 words are borrowings.</p>
      <p>Capitalization was used in a similar fashion.
• If a word appears in any of the English wordlists and does not appear in the Spanish wordlist, then
it’s a borrowing.</p>
      <p>This required some tweaking. Specifically, if a word appeared in both English and Spanish lists
and was adjacent to something already marked as a borrowing, then it was marked as a borrowing
as well.</p>
      <p>This approach reached an F1 of about .75 over multiple runs.</p>
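      <p>Two of these rules can be sketched as follows; the set of quote characters checked is an assumption:</p>

```python
import re

# Sketch of two post-processing rules: merge adjacent flagged words into
# one borrowing, and treat any quoted sequence of up to three words as a
# borrowing.
def merge_adjacent(words, flags):
    spans, current = [], []
    for word, flagged in zip(words, flags):
        if flagged:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

def quoted_spans(sentence, max_words=3):
    quotes = "'\u2018\u2019\""                      # straight and curly quotes
    pattern = f"[{quotes}]([^{quotes}]+)[{quotes}]"
    return [m.group(1) for m in re.finditer(pattern, sentence)
            if len(m.group(1).split()) <= max_words]

print(merge_adjacent(["una", "GREEN", "CARD", "fake"], [0, 1, 1, 0]))
# ['GREEN CARD']
print(quoted_spans("los 'youtubers' y 'gamers' millonarios"))
# ['youtubers', 'gamers']
```

The wordlist rule and its adjacency tweak would be layered on top of these in the same word-by-word pass.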
    </sec>
    <sec id="sec-9">
      <title>9. Discussion</title>
      <p>Let’s look a bit more closely at the performance of the system. We focus on the 15th and final run,
which I submitted. This achieved a precision of 0.67, recall of 0.84, and F1 of 0.75. In exact
numbers, there were 1735 true positives, 844 false positives, and 341 false negatives.</p>
      <p>If we look over the actual errors on a sentence-by-sentence basis, we find 680 sentences out of 1836
had some sort of error. Of these, the majority were cases where there were additional borrowings to be
found and our system did not find them.</p>
      <p>Another pattern was observed as well. There were a number of cases where a borrowed span was
redundantly or separately parsed. Here’s an example (sentence, desired result, actual result):
Falsificó los papeles y la policía acabó deteniéndole en el control de aduana,
cuando intentó cruzar la frontera con una GREEN CARD FAKE.
[’GREEN CARD’, ’FAKE’]
[’GREEN’, ’CARD’, ’GREEN CARD FAKE’, ’FAKE’]</p>
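      <p>One simple repair is to drop any predicted span that is contained in another predicted span; the containment criterion here is an assumption for illustration, not the system’s actual rule:</p>

```python
# Sketch of filtering redundant predictions: drop any span that occurs
# as a word subsequence inside another, longer predicted span. Padding
# with spaces keeps the match aligned to word boundaries.
def drop_contained(spans):
    kept = []
    for s in spans:
        if not any(s != t and f" {s} " in f" {t} " for t in spans):
            kept.append(s)
    return kept

print(drop_contained(["GREEN", "CARD", "GREEN CARD FAKE", "FAKE"]))
# ['GREEN CARD FAKE']
```

On the example above this keeps only the longest span, which still differs from the desired output, so the choice of which items to remove matters.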
      <p>A number of these could be mitigated by checking the output for repeated items, though we would
have to investigate which items to remove when repetitions occur.</p>
    </sec>
    <sec id="sec-13">
      <title>10. Conclusion</title>
      <p>In the approach here, there were three principal components. First, features were generated by hand.
Second, I used a simple neural network to model how the features might interact. Third, I included
additional rules to postprocess the output of the network.</p>
      <p>Again, the goal of this system was to exemplify simple techniques taught in my introductory statistical
natural language processing course.</p>
      <p>There are a number of ways we might improve on this system. First, creating the features by hand
limits how well the shape and context of each word can inform whether it’s a borrowing. A better
approach would be to feed in the full spelling of the word and its surrounding context words and use
that information to generate features.</p>
      <p>A second and related move would be to enrich the neural net architecture; something that includes
the context of each word would surely help. Recurrent or attention-based architectures are the obvious
choices here.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>Thanks to Diane Ohala for useful discussion. All errors are my own.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-12">
      <title>A. Online Resources</title>
      <p>• https://adobo-task.github.io/borrowing.html
• https://github.com/hammondm/adobo2025/tree/main
• https://faculty.sbs.arizona.edu/hammond/ling439539-sp25/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Álvarez-Mellado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Arroyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lignos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Zamorano</surname>
          </string-name>
          , Overview of ADoBo 2021:
          <article-title>Automatic detection of unassimilated borrowings in the Spanish press</article-title>
          ,
          <source>arXiv preprint arXiv:2110.15682</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiang</surname>
          </string-name>
          , BERT4EVER at ADoBo 2021:
          <article-title>Detection of borrowings in the Spanish language using pseudo-label technology</article-title>
          ,
          <source>in: IberLEF@ SEPLN</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>940</fpage>
          -
          <lpage>946</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>De la Rosa</surname>
          </string-name>
          ,
          <article-title>The futility of STILTs for the classification of lexical borrowings in Spanish</article-title>
          ,
          <source>arXiv preprint arXiv:2109.08607</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Févry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks</article-title>
          ,
          <source>arXiv preprint arXiv:1811.01088</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Beesley</surname>
          </string-name>
          ,
          <article-title>Language identifier: A computer program for automatic natural-language identification of on-line text</article-title>
          ,
          <source>in: Proceedings of the 29th annual conference of the American Translators Association</source>
          , volume
          <volume>47</volume>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          ,
          <article-title>N-gram-based text categorization</article-title>
          ,
          <source>in: Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dunning</surname>
          </string-name>
          ,
          <article-title>Statistical identification of language</article-title>
          ,
          <source>Technical Report</source>
          , New Mexico State University,
          <year>1994</year>
          .
          <source>CRL MCCS-94-273.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Alex</surname>
          </string-name>
          ,
          <article-title>Automatic detection of English inclusions in mixed-lingual data with an application to parsing</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Edinburgh,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Furiassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofland</surname>
          </string-name>
          ,
          <article-title>The retrieval of false anglicisms in newspaper texts</article-title>
          ,
          <source>in: Corpus linguistics 25 years on, Brill</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>347</fpage>
          -
          <lpage>363</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <article-title>Semi-automatic approaches to Anglicism detection in Norwegian corpus data</article-title>
          ,
          <source>in: The anglicization of European lexis</source>
          , John Benjamins Publishing Company,
          <year>2012</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Álvarez-Mellado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Porta-Zamorano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lignos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , Overview of ADoBo at IberLEF 2025:
          <article-title>Automatic Detection of Anglicisms in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>