Garbled-Word Embeddings for Jumbled Text
Gianluca Sperduti1 , Alejandro Moreo1 and Fabrizio Sebastiani1
1 Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi 1, 56124 Pisa, Italy

IIR 2021 – 11th Italian Information Retrieval Workshop, September 13–15, 2021, Bari, Italy
Email: gianluca.sperduti@isti.cnr.it (G. Sperduti); alejandro.moreo@isti.cnr.it (A. Moreo); fabrizio.sebastiani@isti.cnr.it (F. Sebastiani)
ORCID: 0000-0002-4287-8968 (G. Sperduti); 0000-0002-0377-1025 (A. Moreo); 0000-0003-4221-6427 (F. Sebastiani)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract
“Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod
are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl
mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey
lteter by istlef, but the wrod as a wlohe”. We investigate the extent to which this phenomenon applies to
computers as well. Our hypothesis is that computers are able to learn distributed word representations
that are resilient to character reshuffling, without incurring a significant loss in performance in tasks
that use these representations. If our hypothesis is confirmed, this may form the basis for a new and
more efficient way of encoding character-based representations of text in deep learning, and one that
may prove especially robust to misspellings, or to corruption of text due to OCR. This paper discusses
some fundamental psycho-linguistic aspects that lie at the basis of the phenomenon we investigate, and
reports on a preliminary proof of concept of the above idea.

Keywords
Garbled-Word Embeddings, Garbled Words, Misspellings, Distributional Semantic Models




1. Introduction
Since 2003, the sentence quoted in the abstract of the present paper has been circulating around
the Internet, and has become fairly popular in social networks and forums. Even though there is
no evidence whatsoever of any such research at Cambridge University, a lot of psycho-linguistic
literature [1, 2, 3, 4, 5] has shown that, though at the cost of sacrificing reading speed (a cost that
depends on the specific words that are affected), humans can actually understand garbled (a.k.a.
jumbled) text, which we here define as text containing words affected by character reshuffling,
as long as the first and last letters of each word stay in place. In this paper we put forth
our conjecture that, if humans can make sense of garbled text, it is likely that computerized
distributional semantic models can do so as well. Despite the fact that some previous work has
suggested that this may not be the case [6, 7], we call into question the way surface forms are
customarily represented in this field, and propose a new mechanism that natively disregards
character order.
   One of the goals of our investigation is furthering our understanding of the connections
between the capabilities of humans and those of computers. However, the vectorial representa-
tions that derive from this investigation, called garbled-word embeddings, also have a practical
impact, since they could be applied to building resilience to misspellings into computational
models of text, to automatic misspelling correction, and to handling out-of-vocabulary terms,
and could hopefully lay the foundations of a new, more efficient way of representing
character-based textual inputs for deep neural models. While we plan to address all these
aspects in our future research, we devote this short paper to studying the viability of garbled-word
embeddings.
   This paper is organized as follows. In Section 2 we review and discuss the main works
devoted to the study of the above-mentioned psycho-linguistic phenomena, and the (few)
computational approaches that have derived from them. In order to verify the conjecture we have
described, we subsequently devise a transformation of the word surface form (which we dub BE-sorting –
Section 3) whose output is consistent with all possible “garbled variants” of the same word. We
then learn embeddings for BE-sorted words, by using any of the publicly available techniques
(we here use word2vec [8]), from a corpus of texts in which each word occurrence has been
replaced by its corresponding BE-sorted version. We then compare the performance of these
garbled-word embeddings with that of standard word embeddings, across a series of downstream
tasks on standard benchmarks. The experiments we have carried out are
presented and discussed in Section 4. We sketch future work in Section 5.


2. Related work
How transposed letters and misspellings affect our reading performance has long been studied in
the field of psycho-linguistics [1, 2, 3, 4, 5] and, more recently, also in the area of NLP [6, 7, 9, 10].
   The literature distinguishes two different types of word corruption, i.e., (a) corruption due to
transpositions only (whereby an anagram of the original word is formed – this is what we have
called garbling), and (b) corruption that also involves letter insertion and/or deletion and/or
replacement. In their seminal work, Rayner et al. [1] show that humans are able to read (and
understand) corrupted text with good reading speed (reading speed being inversely proportional
to cognitive effort), but that this speed depends on which type of word corruption is involved,
i.e., humans deal with type (a) (“garbling”) much more easily than with type (b). In this paper
we only consider word garbling. Concerning garbled text, Rayner et al. [1] find that reading
speed decays by 12% when the transposed letters do not involve the first and last letters of the
word, by 26% when the transposed letters are the first and the last one, and by 36% when the
transposition involves the first and second letters of the word.
   Many studies have been devoted to identifying the factors and conditions that impact the
difficulty of reading garbled words. Andrews [3] finds that a garbled word is easier to process
when it is a highly frequent one, since, when reading, words that are common in language
surprise us the least, and thus involve a smaller cognitive effort to make sense of when they
occur in garbled form [4]. Many authors seem to agree that the neighbourhood size 𝑛 of a
garbled word (the neighbourhood of a given word being the set of all valid words that can
be obtained from it by replacing a single letter) is a good proxy for estimating the difficulty of
processing it, with higher values of 𝑛 typically implying higher reading difficulty [3]. English
terms tend to display higher values of 𝑛 on average than terms of many other languages [5], thus
making English one of the most challenging languages for reading garbled words. We use English as our
target language in this study.
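   To make the notion of neighbourhood size concrete, the following toy sketch (our own
illustration, with a hypothetical six-word lexicon; this is not the procedure used in [3] or [5])
computes the neighbourhood of a word by single-letter replacement:

    import string

    def neighbourhood(word, lexicon):
        """Return all valid words obtained from `word` by replacing exactly one letter."""
        neighbours = set()
        for i in range(len(word)):
            for c in string.ascii_lowercase:
                candidate = word[:i] + c + word[i + 1:]
                if candidate != word and candidate in lexicon:
                    neighbours.add(candidate)
        return neighbours

    toy_lexicon = {"cat", "bat", "rat", "cot", "car", "can"}
    print(sorted(neighbourhood("cat", toy_lexicon)))  # ['bat', 'can', 'car', 'cot', 'rat'], i.e., n = 5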
   In the field of NLP, several studies have been carried out in order to assess the resiliency
of embedding-generation techniques and pre-trained language models to misspellings and
garbled words. Heigold et al. [6] test the impact that different types of noise have on the
performance of convolutional and recurrent neural architectures equipped with different input
modes (character-based and byte-pair encoding) in morphological tagging and translation tasks,
concluding that all models are highly sensitive to (even subtle) perturbations in the text, but
that such degradation in performance can be partially countered by reproducing the same
noise conditions in the training data. Yang and Gao [7] show that even stronger models like
BERT [11] cannot elude the catastrophic deterioration in performance due to the presence of
corrupted text. Nguyen and Grieve [9] compare the performance of word embeddings generated
from corrupted words using word2vec’s skip-gram with negative sampling (SGNS – [8]) and
fastText [12, 13], showing that SGNS tends to perform better than fastText in the majority
of intrinsic tasks¹, despite the fact that SGNS is a spelling-agnostic method.² The authors also
note that both methods tend to perform fairly well in tasks characterized by the presence of
emotion-laden, intentionally elongated words (e.g., “gooood”), and they conjecture that this
may be due to the fact that such intentional misspellings tend to occur in specific and controlled
contexts. Other work has focused on creating embeddings that are resilient to misspellings.
Arguably, the most notable such example is Misspelling-Oblivious Word Embeddings (MOEs)
[10]. MOEs are based on the fastText algorithm extended with a redesigned loss function that
implements a distance between correct and misspelled words. MOEs have been tested in both
“intrinsic” and “extrinsic” tasks at various levels of corruption, with notable results in highly
corrupted text. Differently from this approach, we do not attempt to create word embeddings
for the (potentially many) altered surface forms of the words in the vocabulary, but instead
BE-sort each word, thus “collapsing” all possible such variations into a single canonical form
which is then handled as any standard word by any off-the-shelf word-embedding generation
technique.

   ¹ Consistently with this literature, we call intrinsic (resp., extrinsic) those tasks in which words (resp., documents) are the objects of study, and which are directly (resp., indirectly) affected by how we handle words; an example is word analogy (resp., text classification).
   ² In particular, the authors found that fastText performs better in tasks where some characters have been deleted, while SGNS performs better in all other tasks, and especially so in tasks having to do with semantic analogy.


3. BE-sorting
The method we propose is very simple, and comes down to a pre-processing step on the text
input. Given a word 𝑤 = [𝑐1 , 𝑐2 , . . . , 𝑐𝑛 ], in which 𝑐𝑖 denotes the character at position 𝑖, we
BE-sort it, i.e., we map it into an artificial token 𝑅(𝑤) = [𝑐1 , sort([𝑐2 , . . . , 𝑐𝑛−1 ]), 𝑐𝑛 ] in which
the beginning (B) and end (E) characters are left at their original positions while the other ones
(which we here call the “middle” ones) are sorted in alphabetical order. (Any deterministic
sorting method other than alphabetical would do as well, though.) For example, given the input
word “embedding” (i.e., the character sequence [𝑒, 𝑚, 𝑏, 𝑒, 𝑑, 𝑑, 𝑖, 𝑛, 𝑔]), the function 𝑅 will return the BE-sorted
token “ebddeimng”; any garbled variant of “embedding” in which the B and E characters are in
their original position (e.g., “ebdimnedg”, “edmnbdieg”) will result in the same BE-sorted token.
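   As a concrete illustration, the following minimal Python sketch implements the 𝑅 function
described above (the function name and the length-3 guard are our own choices):

    def be_sort(word):
        """BE-sort a word: keep the first (B) and last (E) characters in place,
        and sort the middle characters alphabetically."""
        if len(word) <= 3:  # words of length <= 3 have no reorderable middle
            return word
        return word[0] + "".join(sorted(word[1:-1])) + word[-1]

    assert be_sort("embedding") == "ebddeimng"
    # any garbled variant with B and E in place collapses to the same token:
    assert be_sort("ebdimnedg") == be_sort("edmnbdieg") == "ebddeimng"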
   We apply the 𝑅 function to replace all word occurrences in the training corpus with their
BE-sorted versions, and then use any standard word-embedding generation technique in the
usual way. At test time, we BE-sort all word occurrences in the test corpus and use the
embeddings generated at training time. Note that the sorting function produces a deterministic
and unique version of the text corpus, irrespective of whether the original text was garbled
or not. Should our hypothesis be correct, the word embeddings learned from the BE-sorted
training corpus would behave no worse than a set of embeddings learned from the original
(non-BE-sorted) corpus.
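   The following is a minimal sketch of this pipeline, assuming Gensim 4.x, the be_sort function
above, and a toy stand-in for the tokenized training corpus (the hyper-parameter values are those
reported in footnote 4):

    from gensim.models import Word2Vec

    # toy stand-in for a tokenized training corpus (one token list per sentence)
    corpus = [["we", "learn", "word", "embeddings"],
              ["embeddings", "capture", "word", "meaning"]]

    # replace every word occurrence with its BE-sorted version
    be_sorted_corpus = [[be_sort(w.lower()) for w in sentence] for sentence in corpus]

    # train SGNS embeddings on the BE-sorted tokens
    model = Word2Vec(sentences=be_sorted_corpus, vector_size=300, window=5,
                     min_count=1, max_vocab_size=None, sample=6e-5, alpha=0.03,
                     min_alpha=0.0007, negative=20, sg=1)

    # at test time, queries are BE-sorted before lookup
    vector = model.wv[be_sort("embeddings")]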


4. Experiments
We choose the British National Corpus (BNC)³ as our corpus for training word embeddings.
From the BNC, we generate different sets of embeddings:

    • Garbled(𝑥%), obtained by randomly picking 𝑥% of the word occurrences in the corpus,
      garbling them (see the sketch after this list), and generating embeddings from the resulting
      (garbled or non-garbled) words; we generate five different instances, i.e., for 𝑥 ∈
      {0, 5, 10, 50, 100}; note that the 𝑥 = 0 version corresponds to embeddings generated from
      the original version of the BNC, while in the 𝑥 = 100 version all the words from which
      the embeddings are generated are garbled.
    • BE-sorted, obtained by BE-sorting all word occurrences in the corpus and generating
      embeddings from the resulting tokens;
    • Full-sorted, obtained by sorting all characters in the word alphabetically, for each word
      occurrence in the corpus; this is the same as the BE-sorted version, but for the fact that
      the B and E characters of the word are not necessarily left at their original positions;
    • RandEmbeds, a set of embeddings (one for each word in the original corpus) randomly
      generated and not optimized any further; this set establishes a lower-bound baseline.
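As anticipated in the first item above, the garbling transformation can be sketched as follows
(a hypothetical re-implementation of ours: it reshuffles the middle characters while keeping B
and E in place, and garbles each word occurrence independently with the given probability):

    import random

    def garble(word, rng):
        """Randomly reshuffle the middle characters, keeping B and E in place."""
        if len(word) <= 3:
            return word
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    def garble_corpus(corpus, fraction, seed=0):
        """Garble (approximately) the given fraction of the word occurrences."""
        rng = random.Random(seed)
        return [[garble(w, rng) if rng.random() < fraction else w for w in sentence]
                for sentence in corpus]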

For each variant of our dataset, we lower-case all text and then generate word embeddings
using word2vec’s SGNS [8], using the same hyper-parameter values in all cases.⁴ In order
to compensate for fluctuations in results due to randomness, we carried out 10 runs for each
model using different seeds. Table 1 reports the results of our experiments in using the sets of
embeddings learned from each corpus across a well-established battery of intrinsic-task
benchmarks⁵, comprising 17 different tasks dealing with semantic categorization, word similarity,
word relatedness, and word analogy (see [14] for further details).
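   Note that, to score BE-sorted embeddings on such benchmarks, the query words must
themselves be BE-sorted before lookup. A toy sketch, continuing the code of Section 3 (the
word pairs and ratings below are invented for illustration, not taken from any actual benchmark):

    from scipy.stats import spearmanr

    # toy word-similarity pairs over the toy vocabulary (human ratings are made up)
    pairs = [("word", "meaning", 6.0), ("word", "embeddings", 4.0), ("learn", "capture", 3.0)]

    # queries are BE-sorted so that they match the tokens seen at training time
    predicted = [model.wv.similarity(be_sort(w1), be_sort(w2)) for w1, w2, _ in pairs]
    gold = [r for _, _, r in pairs]
    rho, pval = spearmanr(gold, predicted)
    print(rho)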

   ³ The British National Corpus was originally created by Oxford University, and contains 100 million words’ worth of text from various genres from the later part of the 20th century, including spoken language, fiction, magazines, newspapers, and academic writing. The corpus is available at http://www.natcorp.ox.ac.uk/corpus/index.xml.
   ⁴ We use the Gensim implementation with hyper-parameter values min_count=1, max_vocab_size=None (i.e., no vocabulary limit), window=5, vector_size=300, sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20, sg=1. We leave the other hyper-parameters at their default values.
   ⁵ https://github.com/kudkudak/word-embeddings-benchmarks
Table 1
Performance evaluation of different sets of embeddings on 17 intrinsic-task benchmarks, grouped
according to task (2nd row) and evaluation measure (3rd row).

                | AP    BLESS Battig ESSLLI2c ESSLLI2b ESSLLI1a | MEN   WS353 WS353R WS353S MTurk | SL999 RW    RG65  | Google MSR   SE2012
                |                Categorization                 |           Relatedness           |    Similarity     |      Analogy
                |                    Purity                     |           Correlation           |    Correlation    |      Accuracy
Garbled(0%)     | .618  .835  .376   .662     .765     .847     | .725  .635  .588   .682   .553  | .329  .160  .782  | .262   .015  .148
Garbled(5%)     | .640  .818  .376   .653     .747     .836     | .728  .635  .590   .683   .548  | .324  .144  .788  | .240   .012  .153
Garbled(10%)    | .628  .819  .372   .640     .750     .822     | .726  .640  .596   .680   .536  | .320  .133  .788  | .220   .009  .147
Garbled(50%)    | .593  .804  .344   .600     .710     .795     | .713  .627  .606   .662   .520  | .267  .054  .735  | .087   .002  .142
Garbled(100%)   | .333  .539  .192   .566     .625     .663     | .439  .253  .205   .288   .030  | .140  .134  .377  | .002   .000  .069
BE-sorted       | .622  .833  .374   .644     .745     .841     | .719  .640  .594   .685   .549  | .324  .157  .785  | .241   .015  .150
Full-sorted     | .499  .626  .328   .515     .675     .659     | .211  .249  .295   .250   .221  | .124  .132  .049  | .196   .009  .080
RandEmbeds      | .159  .230  .092   .378     .525     .432     | -.018 .127  .178   .048   -.074 | .010  -.038 .006  | .000   .000  .011




   Interestingly, the results obtained with our method (BE-sorted) are, by and large, very similar
to the ones obtained on the original corpus (Garbled(0%)), and almost always superior to those
obtained with Garbled(5%), thus confirming our initial hypothesis. When character sorting
is not used, performance deteriorates as the fraction of garbled word occurrences
increases. The results also clearly indicate that Full-sorted fares worse than BE-sorted, thus
lending empirical support to the intuition that, for computer-based distributional
semantic models as for humans, the first and the last letters should remain in place in
order to achieve comparable performance. We are currently investigating this aspect in greater
detail.


5. Conclusions
Somewhat surprisingly, humans can effectively read garbled words, provided that their first
and last letters stay in place. While this phenomenon has long attracted the attention of psycho-
linguists, authors of recent computational experiments have argued that the performance of
computational models, unlike that of humans, deteriorates noticeably in the presence of garbled
input. However, we hypothesize that computational models too can be made robust to garbled
text, and argue that the key to doing this is devising word representation mechanisms that
natively disregard character order within words. As a preliminary proof of concept, we have
shown that it is indeed possible to learn word embeddings from text in which we simply sort the
middle letters of each word (thus intentionally discarding any character-order information)
with no substantial difference in performance. We are currently investigating character-based
representations for deep learning that implement this intuition.
References
 [1] K. Rayner, S. White, R. Johnson, S. Liversedge, Raeding wrods with jubmled lettres: There
     is a cost, Psychological Science 17 (2006) 192–193.
 [2] L. X. McCusker, P. B. Gough, R. G. Bias, Word recognition inside out and outside in, Journal
     of Experimental Psychology: Human Perception and Performance 7 (1981) 538.
 [3] S. Andrews, Lexical retrieval and selection processes: Effects of transposed-letter confus-
     ability, Journal of Memory and Language 35 (1996) 775–800.
 [4] A. F. Healy, Detection errors on the word the: Evidence for reading units larger than
     letters, Journal of Experimental Psychology: Human Perception and Performance 2 (1976)
     235.
 [5] V. Marian, J. Bartolotti, S. Chabal, A. Shook, Clearpond: Cross-linguistic easy-access
     resource for phonological and orthographic neighborhood densities, PLOS ONE 7 (2012)
     e43230.
 [6] G. Heigold, S. Varanasi, G. Neumann, J. van Genabith, How robust are character-based
     word embeddings in tagging and MT against wrod scramlbing or randdm nouse?, in:
     Proceedings of the 13th Conference of the Association for Machine Translation in the
     Americas (Volume 1: Research Track), Boston, US, 2018.
 [7] R. Yang, Z. Gao, Can machines read jmulbed senetcnes? (2019). URL: https://runzhe-yang.science/demo/jumbled.pdf.
 [8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations
     of words and phrases and their compositionality, in: Proceedings of the 27th Annual
     Conference on Neural Information Processing Systems (NIPS 2013), Lake Tahoe, US, 2013,
     pp. 3111–3119.
 [9] D. Nguyen, J. Grieve, Do word embeddings capture spelling variation?, in: Proceedings of
     the 28th International Conference on Computational Linguistics, Barcelona, ES, 2020, pp.
     870–881.
[10] A. Piktus, N. B. Edizel, P. Bojanowski, E. Grave, R. Ferreira, F. Silvestri, Misspelling-
     oblivious word embeddings, in: Proceedings of the 2019 Conference of the North American
     Chapter of the Association for Computational Linguistics: Human Language Technologies,
     Volume 1, Minneapolis, US, 2019, pp. 3226–3234.
[11] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of the
     North American Chapter of the Association for Computational Linguistics (NAACL 2019),
     Minneapolis, US, 2019, pp. 4171–4186.
[12] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword
     information, Transactions of the Association for Computational Linguistics 5 (2017)
     135–146. doi:10.1162/tacl_a_00051.
[13] E. Grave, T. Mikolov, A. Joulin, P. Bojanowski, Bag of tricks for efficient text classification,
     in: Proceedings of the 15th Conference of the European Chapter of the Association for
     Computational Linguistics (EACL 2017), Valencia, ES, 2017, pp. 427–431.
[14] A. Lenci, M. Sahlgren, P. Jeuniaux, A. C. Gyllensten, M. Miliani, A comprehensive
     comparative evaluation and analysis of distributional semantic models, arXiv preprint
     arXiv:2105.09825 (2021).