    BERT4EVER at ADoBo 2021: Detection of Borrowings
        in the Spanish Language Using Pseudo-label
                        Technology

      Shengyi Jiang1,2, Tong Cui1, Yingwen Fu1, Nankai Lin1 and Jieyi Xiang1
 1 School of Information Science and Technology, Guangdong University of Foreign
                                Studies, Guangzhou, China
 2 Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong
                    University of Foreign Studies, Guangzhou, China
                                         neakail@outlook.com



       Abstract. In this paper, we report the solution of team BERT4EVER for the
       automatic detection of borrowings in the Spanish language task at IberLEF 2021,
       which aims to detect lexical borrowings that appear in the Spanish press. We
       adopt a CRF model to tackle the problem. In addition, we introduce pseudo-
       label technology and ensemble learning to improve generalization capability.
       Experimental results demonstrate the effectiveness of the CRF model and of
       pseudo-label technology.

       Keywords: Automatic Detection of Borrowings, CRF, Pseudo-label Technol-
       ogy.


1      Introduction

Lexical borrowing is a word formation that is widely used in many languages. Previous
work on computational detection of lexical borrowings has relied mostly on dictionary
and corpora lookup [1][2][3], with the limitation coming from the original dictionary
or corpora. On the other hand, computational approaches to mixed-language data have
usually framed the task of identifying the language of a word as a sequence labeling
task, where every word in the sequence is attached to a language tag [4][5].
   IberLEF 2021 proposes the task “Automatic Detection of Borrowings in the Spanish
Language” [6]. Our team, BERT4EVER, participates in this task. In this report, we
review our solution: a CRF model aided by pseudo-label technology and ensemble
learning.



IberLEF 2021, September 2021, Málaga, Spain.
            ©️ 2021 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)
2      Related Work

Linguistic borrowing is the process of copying elements and patterns from one lan-
guage into another [7]. This classification is based on two processes: import and
substitution. Import is the incorporation into the recipient language of a foreign form
that may or may not carry a meaning. Substitution refers to the replacement of foreign
phonemes or morphemes by native phonemes or morphemes of the recipient language
so as to localize the foreign form. Both processes can occur in the same borrowing.
Thus, linguistic borrowing involves contact between two languages and has been
extensively studied in the field of contact linguistics [8]. Various typologies have been
proposed to classify loanwords according to different criteria, such as typological
features, the linguistic hierarchy involved, the integration of loanword elements in the
recipient language, etc. [9][10][11].
   Now that English has established itself as the global lingua franca, many languages
are currently undergoing the process of importing new loanwords from English. In the
past decade, English has produced a large number of lexical loanwords in many Euro-
pean languages, especially in the press.
   Previous work on computational detection of lexical borrowings has relied mostly
on dictionary and corpora lookup. Studies on anglicization have begun to use multi-
million-word corpora [12][13][14]. Alex [16] combined lexicon lookup with a search
engine module that used the web as a corpus to detect English inclusions in a German
text corpus, and compared the proposed model with a maximum-entropy Markov
model. Furiassi and Hofland [17] explored corpora lookup and character n-grams to
extract false anglicisms from an Italian newspaper corpus. Andersen [2] used diction-
ary lookup, regular expressions and lexicon-derived frequencies of character n-grams
to detect anglicism candidates in the Norwegian Newspaper Corpus (NNC). In com-
putational approaches to mixed-language data, the task of identifying the language of
a word has usually been framed as a tagging problem in which every word in the
sequence receives a language tag [5].
   The large amount of available data presents methodological challenges to data pro-
cessing for English language research. Corpus-based studies of English borrowings in
Spanish media have traditionally relied on manual evaluation of either previously com-
piled general corpora such as CREA [15], or new tailor-made corpora designed to ana-
lyze specific genres, varieties or phenomena. In Spanish, Serigos [18] extracted angli-
cisms from an Argentinian newspaper corpus by combining dictionary lookup (aided
by TreeTagger and the NLTK lemmatizer) with automatic filtering of capitalized words
and manual inspection. In Serigos [3], a character n-gram module was added to the
dictionary lookup method to estimate the probability of a word being English or Span-
ish. Moreno Fernández and Moreno Sandoval [19] used different pattern-matching fil-
ters and lexicon lookup to extract anglicism candidates from a tweet corpus in US
Spanish.


3      Method

In the automatic detection of borrowings in the Spanish language task, we train five
CRF models on five folds of the data and then use the trained models to predict
unlabeled samples. We merge the pseudo-labeled dataset with the training set to train
a new CRF model.
3.1     CRF




                                      Fig. 1. CRF Model.


   There are two random variables: 𝑋 ranges over the sequences of data to be labeled,
and 𝑌 over the corresponding label sequences. The random variables 𝑋 and 𝑌 are
jointly distributed, but we construct a conditional model 𝑝(𝑌|𝑋) from paired observa-
tion and label sequences in a discriminative framework, without explicitly modeling
the marginal 𝑝(𝑋).
    Let 𝐺 = (𝑉, 𝐸) be a graph such that 𝑌 = (𝑌𝑣)𝑣∈𝑉, so that 𝑌 is indexed by the
vertices of 𝐺. Then (𝑋, 𝑌) is a conditional random field when, conditioned on 𝑋, the
random variables 𝑌𝑣 obey the Markov property with respect to the graph:
                          𝑝(𝑌𝑣 | 𝑋, 𝑌𝑤, 𝑤 ≠ 𝑣) = 𝑝(𝑌𝑣 | 𝑋, 𝑌𝑤, 𝑤 ∼ 𝑣)
    where 𝑤 ∼ 𝑣 means that 𝑤 and 𝑣 are neighbors in 𝐺, and 𝑤 ≠ 𝑣 ranges over all
vertices other than 𝑣. 𝑌𝑣 and 𝑌𝑤 are the random variables corresponding to 𝑣 and 𝑤.
    Table 1 lists the feature set that obtained the best performance in our experiments,
and we report the experimental results based on this feature set. The “collocation” fea-
ture is the co-occurrence of the current word with the next (or previous) word.

                        Table 1. The defined feature sets used in CRF.
 Type          Feature                 Description
 Unigram       𝑤𝑛 (𝑛 = −1,0,1)         The previous 𝑛, current, and next 𝑛 words
 Prefix        𝑝𝑛 (𝑤0 ) , 𝑛 = 2,3,4    The first 𝑛 letters in the current word
 Suffix        𝑠𝑛 (𝑤0 ) , 𝑛 = 2,3,4    The last 𝑛 letters in the current word
 Collocation   𝑤𝑛−1 𝑤𝑛 (𝑛 = 0,1)       The collocation of the current word and the next (or previous) word
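The feature template in Table 1 can be sketched as a per-token feature function in the dictionary style used by sklearn-crfsuite (an assumed implementation choice; the paper does not name its CRF toolkit, and the feature keys here are illustrative):

```python
def token_features(sent, i):
    """Build the Table 1 features for the i-th word of a sentence."""
    w = sent[i]
    feats = {"w0": w.lower()}
    # Unigram context and collocation with the previous word
    if i > 0:
        feats["w-1"] = sent[i - 1].lower()
        feats["w-1|w0"] = sent[i - 1].lower() + "|" + w.lower()
    # Unigram context and collocation with the next word
    if i < len(sent) - 1:
        feats["w+1"] = sent[i + 1].lower()
        feats["w0|w+1"] = w.lower() + "|" + sent[i + 1].lower()
    # Prefixes and suffixes of length 2, 3, and 4
    for n in (2, 3, 4):
        feats[f"p{n}"] = w[:n].lower()
        feats[f"s{n}"] = w[-n:].lower()
    return feats

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]
```

In sklearn-crfsuite, such lists of feature dictionaries can be passed directly as the `X` argument of `CRF.fit`, one list per sentence.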
3.2    Ensemble learning

   We conduct five-fold cross-validation on the training data and train five models,
one per fold. Each model predicts the test data separately. For each token 𝑥, the
predicted output of the 𝑖-th model is
                                     𝑦𝑖 = 𝐶𝑅𝐹𝑖 (𝑧)
   in which 𝑧 is the token 𝑥’s feature representation, 𝐶𝑅𝐹𝑖 is the 𝑖-th CRF model and
𝑦𝑖 is the output of 𝑖-th CRF model. Therefore, the output of the five models is
                                 𝑌 = [𝑦1 , 𝑦2 , 𝑦3 , 𝑦4 , 𝑦5 ]
   We consider the label that appears most in 𝑌 as the label of 𝑥.
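The token-level majority vote over the five models' outputs can be sketched as follows (the function name and label tags are illustrative, not taken from the paper):

```python
from collections import Counter

def majority_vote(per_model_labels):
    """per_model_labels: list of label sequences, one per CRF model.
    Returns one label per token, chosen by majority over the models."""
    voted = []
    for token_labels in zip(*per_model_labels):
        # most_common(1) gives the label with the highest count;
        # ties are broken by first occurrence among the models
        voted.append(Counter(token_labels).most_common(1)[0][0])
    return voted
```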
3.3     Pseudo-label technology

   We use a pseudo-label strategy [20][21] to generate labeled data without manual
annotation, as shown in Fig. 2. We first use the open training set released for the com-
petition to train CRF models, and then use the trained models to predict unlabeled
samples, taking the predictions as sample labels. We then screen all predicted samples
and filter out the sentences without lexical borrowings, keeping only the sentences that
contain them. The unlabeled samples come from GlobalVoices (the Spanish portion of
GlobalVoices)2 and News-Commentary11 (the Spanish portion of NCv11)3. We merge
the filtered sentence set with the training set to train a new CRF model. Although the
quality of the samples obtained through this data augmentation is not high, the new
model generalizes better to some extent because it is trained on more data.




                         Fig. 2. Pseudo-label Technology Flow Chart.
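The flow above can be summarized in a short sketch; `train_crf` and `predict` are hypothetical placeholders for the CRF training and decoding routines, not the authors' code, and `"O"` is assumed to be the tag for tokens that are not borrowings:

```python
def pseudo_label_pipeline(train_sents, train_labels, unlabeled_sents,
                          train_crf, predict):
    # 1. Train an initial model on the labeled competition data
    model = train_crf(train_sents, train_labels)
    # 2. Predict labels for the unlabeled GlobalVoices / NCv11 sentences
    pseudo = [(s, predict(model, s)) for s in unlabeled_sents]
    # 3. Keep only sentences whose predicted labels contain a borrowing
    kept = [(s, y) for s, y in pseudo if any(t != "O" for t in y)]
    # 4. Retrain on the union of gold and filtered pseudo-labeled data
    new_sents = train_sents + [s for s, _ in kept]
    new_labels = train_labels + [y for _, y in kept]
    return train_crf(new_sents, new_labels)
```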


4      Results




2 http://opus.nlpl.eu/GlobalVoices.php
3 http://opus.nlpl.eu/News-Commentary-v11.php


           Table 2. The experiment of exploring different Prefix/Suffix performance.
 Prefix/Suffix             P                       R                           F
 None                      91.21%                  8.11%                       14.90%
 2                        73.06%                  40.73%                      52.29%
 3                        71.27%                  30.76%                      42.95%
 4                        83.15%                  20.68%                      33.08%
 All                      76.33%                  41.13%                      53.42%

We first explore the performance of different prefix/suffix lengths; the results are
shown in Table 2. The first (and last) 2 letters of the current word have the greatest
impact on the task, increasing the F value by 37.39 percentage points. Using all the
prefix/suffix features together works best: the F value based on five-fold cross-valida-
tion reaches 53.42%.

              Table 3. The results of our model based on five-fold cross-validation.
               Model                               P                  R                   F
                CRF                             76.33%             41.13%              53.42%
    CRF + Pseudo-label Technology               67.82%             42.25%              52.06%

                       Table 4. The results of our model on final test set.
                  Model                     Type            P             R               F
                                            ENG          76.48%        25.99%          38.80%
                  CRF                      OTHER         60.00%         6.52%          11.76%
                                            ALL          76.29%        25.29%          37.99%
                                            ENG          75.43%        28.25%          41.10%
   CRF + Pseudo-label Technology           OTHER         60.00%         6.52%          11.76%
                                            ALL          75.27%        27.47%          40.25%

   As shown in Table 3 and Table 4, the recall of the CRF aided by pseudo-label tech-
nology improves, suggesting that pseudo-label technology can improve the generaliza-
tion of the model. On the final test set, the F value of the CRF model reaches 37.99%,
while the F value of the CRF model with pseudo-label technology is 40.25%, showing
that pseudo-label technology has a clear positive impact on the detection of borrowings
in the Spanish language task.


5        Conclusion

In the automatic detection of borrowings in the Spanish language task at IberLEF 2021,
we adopt a CRF model aided by pseudo-label technology and ensemble learning. In
addition, we explore the impact of different features on the task. In the future, we will
try to combine pseudo-label technology with deep learning models in order to achieve
better results on borrowing detection tasks.
Acknowledgements

This work was supported by the Key Field Project for Universities of Guangdong Prov-
ince (No. 2019KZDZX1016), the National Natural Science Foundation of China (No.
61572145) and the National Social Science Foundation of China (No. 17CTQ045). The
authors would like to thank the anonymous reviewers for their valuable comments and
suggestions.


References
 1. Alex, B.: Automatic detection of English inclusions in mixed-lingual data with an applica-
    tion to parsing. University of Edinburgh (2008).
 2. Andersen, G.: Semi-automatic approaches to Anglicism detection in Norwegian corpus data.
    The anglicization of European lexis, 111-130 (2012).
 3. Serigos, J. R. L.: Applying corpus and computational methods to loanword research: new
    approaches to Anglicisms in Spanish. University of Texas at Austin (2017).
 4. Molina, G., AlGhamdi, F., Ghoneim, M., et al.: Overview for the second shared task on
    language identification in code-switched data. In: Proceedings of the Second Workshop on
    Computational Approaches to Code Switching, 40–49 (2019).
 5. Solorio, T., Blair, E., Maharjan, S., et al.: Overview for the first shared task on language
    identification in code-switched data. In: Proceedings of the First Workshop on Computa-
    tional Approaches to Code Switching, pp. 62–72. (2014).
 6. Alvarez Mellado, E., Espinosa Anke, L., Gonzalo Arroyo, J., Lignos, C., and Porta Zamo-
    rano, J.: Overview of ADoBo 2021 shared task: Automatic Detection of Unassimilated Bor-
    rowings in the Spanish Press. Procesamiento del Lenguaje Natural, 67 (2021).
 7. Haugen, E.: The analysis of linguistic borrowing. Language 26(2), 210–231 (1950).
 8. Weinreich, U.: Languages in contact. Findings and Problems (1953).
 9. Haspelmath, M. and Tadmor, U.: Loanwords in the world’s languages: a comparative hand-
    book. Walter de Gruyter (2009).
10. Matras, Y. and Sakel, J.: Grammatical borrowing in cross-linguistic perspective. Walter de
    Gruyter 38 (2007).
11. Thomason, S. G. and Kaufman, T.: Language contact, creolization, and genetic linguistics.
    Univ of California Press (1992).
12. Andersen, G.: Pragmatic borrowing. Journal of Pragmatics 67, 17–33 (2014).
13. Balteiro, I.: A reassessment of traditional lexicographical tools in the light of new corpora:
    sports Anglicisms in Spanish. International Journal of English Studies 11(2), 23–52 (2011).
14. Zenner, E., Speelman, D., and Geeraerts, D.: Cognitive Sociolinguistics meets loanword
    research: Measuring variation in the success of anglicisms in Dutch. Cognitive Linguistics
    23(4), 749–792 (2012).
15. Núñez, N. E. E.: Anglicisms in CREA: a quantitative analysis in Spanish newspapers. In:
    Language design: journal of theoretical and experimental linguistics 18, 215–242 (2016).
16. Alex, B.: Comparing Corpus-based to Web-based Lookup Techniques for Automatic Eng-
    lish Inclusion Detection. In: Proceedings of the Sixth International Conference on Language
    Resources and Evaluation (LREC’08), European Language Resources Association (ELRA)
    (2008).
17. Furiassi, C. and Hofland, K.: The retrieval of false anglicisms in newspaper texts. Corpus
    Linguistics 25 Years On, 347–363 (2007).


18. Serigos, J.: Using distributional semantics in loanword research: A concept-based approach
    to quantifying semantic specificity of anglicisms in Spanish. International Journal of Bilin-
    gualism 21(5), 521–540 (2017).
19. Moreno F. F., Moreno S. A.: Configuración lingüística de anglicismos procedentes de Twit-
    ter en el español estadounidense. Revista signos 51(98), 382-409 (2018).
20. Lee, D.: Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for
    Deep Neural Networks. In: ICML 2013 Workshop: Challenges in Representation Learning.
    pp. 1-6. (2013).
21. Shi, W., Gong, Y., Ding, C., et al.: Transductive semi-supervised deep learning using min-
    max features. In: Proceedings of the European Conference on Computer Vision, pp. 299-
    315. (2018).