1. Introduction

Bari, Italy " gianluca.sperduti@isti.cnr.it (G. Sperduti); alejandro.moreo@isti.cnr.it (A. Moreo); fabrizio.sebastiani@isti.cnr.it (F. Sebastiani)

Garbled-Word Embeddings for Jumbled Text

Gianluca Sperduti

Alejandro Moreo

Fabrizio Sebastiani

Italy

Garbled-Word Embeddings, Garbled Words, Misspellings, Distributional Semantic Models

2021

000 0 0002

“Aoccdrnig to a reasrech at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny itmopnrat tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe”. We investigate the extent to which this phenomenon applies to computers as well. Our hypothesis is that computers are able to learn distributed word representations that are resilient to character reshufling, without incurring a significant loss in performance in tasks that use these representations. If our hypothesis is confirmed, this may form the basis for a new and more eficient way of encoding character-based representations of text in deep learning, and one that may prove especially robust to misspellings, or to corruption of text due to OCR. This paper discusses some fundamental psycho-linguistic aspects that lie at the basis of the phenomenon we investigate, and reports on a preliminary proof of concept of the above idea.

1. Introduction

Since 2003, the sentence quoted in the abstract of the present paper has been circulating around the Internet, and has become fairly popular in social networks and forums. Even though there is no evidence whatsoever of any such research at Cambridge University, a lot of psycho-linguistic literature [ 1, 2, 3, 4, 5 ] has shown that, though at the cost of sacrificing reading speed (a cost that depends on the specific words that are afected), humans can actually understand garbled (a.k.a. jumbled) text, which we here define as text containing words afected by character reshufling, as long as the first and last letters of each word stay in place. In this paper we put forth our conjecture that, if humans can make sense of garbled text, it is likely that computerized distributional semantic models can do so as well. Despite the fact that some previous work has suggested that this may not be the case [ 6, 7 ], we call into question the way surface forms are customarily represented in this field, and propose a new mechanism that natively disregards character order.

One of the goals of our investigation is furthering our understanding of the connections between the capabilities of humans and those of computers. However, the vectorial representations that derive from this investigation, called garbled-word embeddings, have an applicative impact too, since they could also be applied to building resiliency to misspellings into computerized models of text, to automatic misspelling correction, and to handling out-of-vocabulary terms, and could hopefully set the foundations of a new, more eficient modality of representing character-based textual inputs for deep neural models. While we plan to cope with all these aspects in our future research, we devote this short paper to studying the viability of garbled-word embeddings.

This paper is organized as follows. In Section 2 we review and discuss the main works devoted to the study of the above-mentioned psycho-linguistic phenomena, and the (few) computational approaches that have derived. In order to verify the conjecture we have described, we subsequently devise a transformation of the word surface form (that we dub BE-sorting – Section 3) whose output is consistent with all possible “garbled variants” of the same word. We then learn embeddings for BE-sorted words, by using any of the publicly available techniques (we here use word2vec [ 8 ]), from a corpus of texts in which each word occurrence has been replaced by its corresponding BE-sorted version. We then compare the performance of these garbled-word embeddings, across a series of downstream tasks and standard benchmarks for them, with that of standard word embeddings. The experiments we have carried out are presented and discussed in Section 4. We sketch future work in Section 5.

2. Related work

How transposed letters and misspellings afect our reading performance has long been studied in the field of psycho-linguistics [ 1, 2, 3, 4, 5 ] and, more recently, also in the area of NLP [ 6, 7, 9, 10 ].

The literature distinguishes two diferent types of word corruption, i.e., (a) corruption due to transpositions only (whereby an anagram of the original word is formed – this is what we have called garbling), and (b) corruption that also involves letter insertion and/or deletion and/or replacement. In their seminal work, Rayner et al. [ 1 ] show that humans are able to read (and understand) corrupted text with good reading speed (which is proportional to the inverse of cognitive efort), but that this speed depends on which type of word corruption is involved, i.e., humans deal with type (a) (“garbling”) much more easily than with type (b). In this paper we only consider word garbling. Concerning garbled text, Rayner et al. [ 1 ] find that reading speed decays by 12% when the transposed letters do not involve the first and last letters of the word, by 26% when the transposed letters are the first and the last one, and by 36% when the transposition involves the first and second letters of the word.

Many studies have been devoted to identifying the factors and conditions that impact on the dificulty of reading garbled words. Andrews [ 3 ] finds that a garbled word is easier to process when it is a highly frequent one, since, when reading, words that are common in language surprise us the least, and involve thus a smaller cognitive efort to make sense of when they occur in garbled form [ 4 ]. Many authors seem to agree that the neighbourhood size of a garbled word (the neighbourhood of a given word being the set containing all valid words that can be obtained by replacing a single letter) is a good proxy for estimating the dificulty of processing it, with higher values of typically implying higher reading dificulty [ 3 ]. English terms tend to display higher values of on average than many other languages [ 5 ], thus making it one of the most challenging languages for reading garbled words. We use English as our target language in this study.

In the field of NLP, several studies have been carried out in order to assess the resiliency of embedding-generation techniques and pre-trained language models to misspellings and garbled words. Heigold et al. [ 6 ] test the impact that diferent types of noise have on the performance of convolutional and recurrent neural architectures equipped with diferent input modes (character-based and byte-pair encoding) in morphological tagging and translation tasks, concluding that all models are highly sensitive to (even subtle) perturbations in the text, but that such degradation in performance can be partially countered by reproducing the same noise conditions in the training data. Yang and Gao [ 7 ] show that even stronger models like BERT [ 11 ] cannot elude the catastrophic deterioration in performance due to the presence of corrupted text. Nguyen and Grieve [ 9 ] compare the performance of word embeddings generated from corrupted words using word2vec’s skip-gram with negative sampling (SGNS – [ 8 ]) and fastText [ 12, 13 ], showing that SGNS tends to perform better than fastText in the majority of intrinsic tasks1, despite the fact that SGNS is a spelling-agnostic method.2 The authors also note that both methods tend to perform fairly well in tasks characterized by the presence of emotion-laden, intentionally elongated words (e.g., “gooood”), and they conjecture that this may be due to the fact that such intentional misspellings tend to occur in specific and controlled contexts. Other work has focused on creating embeddings that are resilient to misspellings. Arguably, the most notable such example is Misspelling-Oblivious Word Embeddings (MOEs) [ 10 ]. MOEs are based on the fastText algorithm extended with a redesigned loss function that implements a distance between correct and misspelled words. MOEs have been tested in both “intrinsic” and “extrinsic” tasks at various levels of corruption, with notable results in highly corrupted text. Diferently from this approach, we do not attempt to create word embeddings for the (potentially many) altered surface forms of the words in the vocabulary, but instead BE-sort each word, thus “collapsing” all possible such variations into a single canonical form which is then handled as any standard word by any of-the-shelf word-embedding generation technique.

3. BE-sorting

The method we propose is very easy, and comes down to a simple pre-processing of the text input. Given a word = [1, 2, . . . , ], in which denotes the character at position , we BE-sort it, i.e., we map it into an artificial token () = [1, sort([2, . . . , − 1]), ] in which the beginning (B) and end (E) characters are left at their original position while the other ones (which we here call the “middle” ones) are sorted in alphabetical order. (Any deterministic sorting method other than alphabetical would do as well, though.) For example, given the input word “embedding” (short for [, , , , , , , , ]), the function will return the BE-sorted 1Consistently with this literature, we call intrinsic (resp., extrinsic) those tasks in which words (resp., documents) are the objects of study, and which are directly (resp., indirectly) afected by how we handle words; an example is word analogy (resp., text classification).

2In particular, the authors found out that fastText performs better in tasks where some characters have been deleted, while SGNS performs better in all other tasks, and especially so in tasks having to do with semantic analogy. token “ebddeimng”; any garbled variant of “embedding” in which the B and E characters are in their original position (e.g., “ebdimnedg”, “edmnbdieg”) will result in the same BE-sorted token.

We apply the function to replace all word occurrences in the training corpus with their BE-sorted versions, and then use, in the standard way, any standard function for generating word embeddings. During test, we BE-sort all word occurrences in the test corpus and use the embeddings generated at training time. Note that the sorting function produces a deterministic and unique version of the text corpus, irrespective of whether the original text was garbled or not. Should our hypothesis be correct, the word embeddings learned from the BE-sorted training corpus would behave no worse than a set of embeddings learned from the original (non-BE-sorted) corpus.

4. Experiments

We choose the British National Corpus (BNC)3 as our corpus for training word embeddings. From the BNC, we generate diferent sets of embeddings: • Garbled(%), obtained by randomly picking % of the word occurrences in the corpus, garbling them, and generating embeddings from the resulting (garbled or non-garbled) words; we generate five diferent instances, i.e., for ∈ {5, 10, 50, 100}); note that the = 0 version corresponds to embeddings generated from the original version of the BNC, while for the = 100 version all the words from which the embeddings are generated are garbled. • BE-sorted, obtained by BE-sorting all word occurrences in the corpus and generating embeddings from the resulting tokens; • Full-sorted, obtained by sorting all characters in the word alphabetically, for each word occurrence in the corpus; this is the same as the BE-sorted version, but for the fact that the B and E characters of the word are not necessarily left at their original positions; • RandEmbeds, a set of embeddings (one for each word in the original corpus) randomly generated and not optimized any further; this set establishes a lower-bound baseline. For each variant of our dataset, we lower-case all text and then generate word embeddings using word2vec’s SGNS [ 8 ], using the same hyper-parameter values in all cases. 4 In order to compensate for fluctuations in results due to randomness, we carried out 10 runs for each model using diferent seeds. Table 1 reports the results of our experiments in using the sets of embeddings learned from each corpus across a well-established battery of intrinsic-task benchmarks5, comprising 17 diferent tasks dealing with semantic categorization, word similarity, word relatedness, and word analogy (see [ 14 ] for further details).

3The British National Corpus was originally created by Oxford University, and contains 100-million-words worth of text from various genres from the later part of the 20th century, including spoken, fiction, magazines, newspapers, and academic. The corpus is available at http://www.natcorp.ox.ac.uk/corpus/index.xml .

4We use the Gensim implementation with hyper-parameter values min_count=1, max_vocab_size=None (i.e., without limit), window=5, vector_size=300, sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20, sg=1. We leave the other hyper-parameters at their default values.

5https://github.com/kudkudak/word-embeddings-benchmarks

5. Conclusions

Somehow surprisingly, humans can efectively read garbled words, provided that their first and last letters stay in place. While this phenomenon has long lured the attention of psycholinguists, authors of recent computational experiments have argued that the performance of computational models, unlike that of humans, deteriorates noticeably in the presence of garbled input. However, we hypothesize that computational models can be made robust to garbled text too, and argue that the key to doing this is devising word representation mechanisms that natively disregard character order within words. As a preliminary proof of concept, we have shown that it is indeed possible to learn word embeddings from text in which we simply sort the middle letters of each word (thus intentionally getting rid of any character order information) with no substantial diference in performance. We are currently investigating character-based representations for deep learning that implement this intuition.

[1]

Rayner , S. White ,

Johnson , S. Liversedge, Raeding wrods with jubmled lettres: There is a cost , Psychological Science 17 ( 2006 ) 192 - 3 .

[2]

L. X.

McCusker ,

P. B.

Gough ,

R. G.

Bias , Word recognition inside out and outside in , Journal of Experimental Psychology: Human Perception and Performance 7 ( 1981 ) 538 .

[3]

Andrews , Lexical retrieval and selection processes: Efects of transposed-letter confusability , Journal of Memory and Language 35 ( 1996 ) 775 - 800 .

[4]

A. F.

Healy , Detection errors on the word the: Evidence for reading units larger than letters , Journal of Experimental Psychology: Human Perception and Performance 2 ( 1976 ) 235 .

[5]

Marian ,

Bartolotti ,

Chabal ,

Shook , Clearpond: Cross-linguistic easy-access resource for phonological and orthographic neighborhood densities , PLOS ONE 7 ( 2012 ) e43230 .

[6]

Heigold ,

Varanasi , G. Neumann, J. van Genabith , How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? , in: Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1 : Research Track), Boston, US, 2018 .

[7]

Yang ,

Gao , Can machines read jmulbed senetcnes? ( 2019 ). URL: https://runzhe-yang. science/demo/jumbled.pdf.

[8]

Mikolov , I. Sutskever,

Chen ,

G. S.

Corrado ,

Dean , Distributed representations of words and phrases and their compositionality , in: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS 2013 ), Lake Tahoe, US , 2013 , pp. 3111 - 3119 .

[9]

Nguyen ,

Grieve , Do word embeddings capture spelling variation? , in: Proceedings of the 28th International Conference on Computational Linguistics , Barcelona, ES , 2020 , pp. 870 - 881 .

[10]

Piktus ,

N. B.

Edizel ,

Bojanowski , E. Grave,

Ferreira ,

Silvestri , Misspellingoblivious word embeddings , in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Volume 1 , Minneapolis , US , 2019 , pp. 3226 - 3234 .

[11]

Devlin ,

Chang ,

Lee ,

Toutanova , BERT: Pre-training of deep bidirectional transformers for language understanding , in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019 ), Minneapolis, US , 2019 , pp. 4171 - 4186 .

[12]

Bojanowski ,

Grave ,

Joulin , T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 ( 2017 ) 135 - 146 . doi: 10 .1162/tacl_a_ 00051 .

[13]

Grave ,

Mikolov ,

Joulin ,

Bojanowski , Bag of tricks for eficient text classification , in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017 ), Valencia, ES , 2017 , pp. 427 - 431 .

[14]

Lenci ,

Sahlgren ,

Jeuniaux ,

A. C.

Gyllensten ,

Miliani , A comprehensive comparative evaluation and analysis of distributional semantic models , arXiv preprint arXiv:2105.09825 ( 2021 ).