Emojitalianobot and EmojiWorldBot
New online tools and digital environments for translation into emoji

Johanna Monti, L'Orientale University, Naples, Italy, jmonti@unior.it
Federico Sangati, Independent Researcher, The Netherlands, federico.sangati@gmail.com
Francesca Chiusaroli, University of Macerata, Italy, f.chiusaroli@unimc.it
Martin Benjamin, EPFL, Lausanne, Switzerland, martin@kamusiproject.org
Sina Mansour, EPFL, Lausanne, Switzerland, mansour@ee.sharif.edu

Abstract

Emojitalianobot and EmojiWorldBot are two new online tools and digital environments for translation into emoji on Telegram, the popular instant messaging platform. Emojitalianobot is the first open and free Emoji-Italian and Emoji-English translation bot based on Unicode descriptions. The bot was designed to support the translation of Pinocchio into emoji carried out by the followers of the "Scritture brevi" blog on Twitter and contains a glossary with all the uses of emojis in the translation of the famous Italian novel. EmojiWorldBot, an offspring project of Emojitalianobot, is a multilingual dictionary that uses emoji as a pivot language between dozens of different languages. Currently the emoji-word and word-emoji functions are available for 72 languages imported from the Unicode tables and provide users with an easy search capability to map words in each of these languages to emojis, and vice versa. This paper presents the projects, the background and the main characteristics of these applications.

1 Introduction

Emojitalianobot [1] and EmojiWorldBot [2] are two new bots [3] for translation into and from emoji. Both were designed starting from the hypothesis that a multilingual emoji dictionary and translator can be set up through a process of selection and assessment of conventional semantic values. Translation cases may show how images can convey common and universal meanings, beyond specific peculiarities, so that they can stand as models in the perspective of an interlanguage (Chiusaroli, 2015). The two bots ease the use of emojis and also collect, refine and make available valuable linguistic data by means of crowdsourcing and gamification approaches.

This contribution presents the state of the art concerning the use of crowdsourcing and gamification approaches to linguistics in section 2, Emojitalianobot and the Pinocchio project in section 3, EmojiWorldBot in section 4 and finally conclusions and future work in section 5.

2 Crowdsourcing and gamification

Crowdsourcing, i.e., the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call (Howe, 2006), is becoming a widespread practice on the Internet to develop linguistic resources (dictionaries, glossaries, translation memories, etc.) or services (translation, localisation, fansubbing, etc.) (Monti, 2012, 2014). It allows the large-scale involvement of users who contribute their knowledge, ideas and skills, and in this way play an active role in the achievement of a common goal. Crowdsourcing can be used for the creation, maintenance and sharing of lexical/terminological data such as:

i. lexical resources for online dictionaries, e.g., Wiki platforms such as Wiktionary [4] and Omegawiki [5], and recent forays by more traditional dictionary publishing companies like Collins, Oxford, and Macmillan;
ii. terminological resources for online terminological databases, like TermWiki [6], the terminological counterpart of Wiktionary, or TaaS [7];
iii. lexical and semantic resources for Natural Language Processing (NLP) tasks, such as Word Sense Disambiguation (WSD), Sentiment Analysis, Computer Aided Translation, Machine Translation and so on, using platforms such as Mechanical Turk [8] for distributing parts of large development projects to professional or occasional lexicographers.

To the best of our knowledge, only very few projects so far have been tailored to mobile devices to gather linguistic data in the field: (i) to collect dialect data as in Dialectbot [9], (ii) to document endangered languages as in Aikuma [10] and Ma Iwaidja [11], or (iii) to gather grammaticality judgments (Madnani et al., 2011). The social dimension of these types of activities is sometimes connected to and fed by social communities, where users discuss problems, give suggestions, and exchange ideas (Brabham, 2012; McGonigal, 2011).

Gamification is very often used to build loyalty within social communities and improve their engagement. Games are a very effective tool for active participation since they provide a strong motivational framework which pushes people to act for good. Some effective uses of games are to create new habits or to correct wrong actions. Wang et al. (2013) list Games with a purpose (GWAPs) [12] among the different types of crowdsourcing. Some good examples of games with a purpose in the lexicographic field are Phrase Detectives [13] and JeuxDeMots [14]. The main advantage of GWAPs is their high attractiveness: people love playing games, so it is easier to obtain their contribution in this way than through other forms of crowdsourcing. The difficulty in designing such games is to match attractiveness with usefulness, i.e. to create an attractive game which produces valuable data.

3 Emojitalianobot and the Pinocchio project

Emojitalianobot is the first open and free Emoji-Italian translation bot on Telegram. It was developed to support the translation project of Pinocchio in emoji [15] launched on Twitter in February 2016 by F. Chiusaroli, J. Monti and F. Sangati. The translation of the famous children's novel was carried out by the followers of the Scritture brevi blog [16] (by F. Chiusaroli and F.M. Zanzotto), and the first fifteen chapters have been translated, which correspond to the original novel published by Collodi in 1881. Every day tweets with sentences taken from the novel were posted on Twitter and the followers suggested their translations in emoji; at the end of each day, the official version of the translations was validated and published. [17] Translators used Emojitalianobot, which contains (i) the Emoji-Italian dictionary, (ii) the Emoji-English descriptions based on Unicode and (iii) a glossary with all the uses of emoji in the translation of Pinocchio. The project was associated with the Emojitalia discussion group on Telegram, where users met to discuss problems and solutions, suggest improvements of the bot, debate the translation choices for Pinocchio, and communicate in emoji.

The Pinocchio translation project therefore made it possible to crowdsource different linguistic data connected with the use of emojis as an actual means of communication and not just as simple graphics to express amusement or interest. In this respect the main findings of the project are twofold: the need to resort to compound multi-emoji expressions in order to express concepts which are not represented in the current set, and the need for a related simple grammar to express syntactic relations among emojis, past and future tenses, etc. Unlike previous literary translation projects in emojis, such as the translations of Moby Dick or Alice in Wonderland, this is the first attempt at a collective shared emoji code (vocabulary and grammar) based on a word-for-word translation entirely in emojis.

Emojitalianobot is an ideal test bench to experiment with new approaches like crowdsourcing and gamification in the field of Natural Language Processing (NLP). The Pinocchio project and the games and features available in the bot to learn or guess the meaning of emoji are devised both to make the bot enjoyable to use and to give users the opportunity to develop linguistic descriptions of emoji tailored to their actual perceptions. The most important reward for playing with the bot is the awareness of helping develop a linguistic resource for one's mother tongue, and the pride in contributing to it.

Since its release on Telegram, the project has been an instant success, becoming a viral web phenomenon thanks to the Scritture brevi community and the Pinocchio translation in emojis, so that the bot now has almost 750 users. The Pinocchio translation project in emojis counts 611 tweets and 980 glossary entries which correspond to 2127 words, of which 185 are multi-emojis, i.e. compound emojis, such as the one used for the Italian word peggio (worst).

4 EmojiWorldBot

On the basis of their (both linguistic and technological) experience with Emojitalianobot, the three Italian researchers together with Martin Benjamin and Sina Mansour of the Kamusi Project International [18] and EPFL (Switzerland) designed a new bot on Telegram in April 2016: EmojiWorldBot, a multilingual dictionary that uses emoji as a pivot language between dozens of different languages. Currently the emoji-word and word-emoji functions are available for 70 languages imported from the Unicode tables [19] and provide users with an easy search capability to map words in each of these languages to emojis, and vice versa. Looking at the Unicode descriptions (see Fig. 1), it is apparent that emojis are not annotated in a coherent way across languages: some languages have more descriptions, others, especially underrepresented languages, have fewer, and in most cases languages are not represented at all.

Figure 1: Annotations in Romance languages

Our first goal with EmojiWorldBot is therefore to reach a uniform and comprehensive list of tags across multiple languages with a precise mapping between any language pair, which may serve to bootstrap a massive multilingual dictionary. The bot currently features:

• emoji-to-word and word-to-emoji translation for more than 70 languages
• Eggs, a tagging game for people to contribute to the expansion of these dictionaries or the creation of new ones for any additional language. Users can suggest additional tags for single emojis in any language (for example adding egg to the tag list for an emoji in English).
• inline queries: type EmojiWorldBot and a word, and it will suggest a set of emojis for that word that you can send in any Telegram conversation
• the possibility to add new languages. To date 56 new languages have been added, such as Latin, Esperanto and Sardinian, among others.

The basic idea of the Eggs game is to collect new tags to associate with emojis as shown in Fig. 2.

Figure 2: Eggs game

With fewer than 2000 official emojis, stretching the boundaries of their communicative potential makes them more useful. However, it also makes the dictionary more essential, so that someone who receives 🍆 in a chat in any language might look to see if it signifies something other than an eggplant. In the future, Eggs will experiment with multi-emoji terms (METs), building on the work of the Pinocchio translation project, in an effort to build a larger pictorial vocabulary that is comprehensible across languages (Chiusaroli, 2015).

A new version of the bot is already under development. It will feature Ducks, a second game where users are asked to map tags from a source language (e.g. English) to a target language (e.g. Swahili). In the example of Figure 1, several Romanian users would be shown the sense-specific definition of grin from Wordnet and all of the emojis that have been attached to that definition, and be asked which of the options among față încântată, față and încântare, if any, is a good translation. The game would also be played for face and grinning face. In this way, all of the one-to-one relationships should be discovered, and all instances of a term that does not have a translation equivalent on the other side will be revealed. When it is known that no match from English exists, Ducks presents the definition, the emojis, and the English term, and asks the user to type in the best equivalent in their language. This is the method that will be most efficacious for new languages, bypassing the need to disentangle the many-to-many associations introduced through term clustering in the CLDR annotations.

It should be noted that many terms will be removed from the game cycle through comparisons with Wordnets for available languages. For example, самолет appears in conjunction with English airplane in both the emoji annotations and the Bulgarian Wordnet, and both are linked to the same English Princeton Wordnet (PWN) sense, which gives sufficient confirmation without needing a mass of human players. As of this writing, the project is in the process of importing and aligning Wordnet data for some 50 languages. In future work, terms from Wordnet synsets will be tested against the emojis with which they theoretically share a sense, e.g. asking crowd members whether 🚌 applies to other members of the PWN synset for bus (autobus, coach, jitney, motorbus, etc.), but the mechanism for doing so has not been finalized. In this way EmojiWorldBot employs crowd methods as part of an arsenal intended to overcome the barriers to collecting data for numerous diverse languages. Data validation will be achieved via a consensus model through which answers are accepted as correct if the same result is provided by a threshold number of respondents. The new version of the bot will allow users to:

• add new terms to the current languages (including the names of the countries for national flags)
• compare definitions across languages.

From the computational point of view, this project, like Emojitalianobot, attempts to address the data chasm for natural language processing for most languages by distilling data collection to simple micro-tasks (Benjamin, 2015) using techniques adapted to least-common-denominator technology.

5 Conclusions

We described the Emojitalianobot and the EmojiWorldBot projects. Combining crowdsourcing, gamification and a smartphone app is a powerful strategy to collect, improve and refine valuable linguistic data easily and in a short time, particularly for less-resourced languages (Benjamin and Radetzky, 2014). These may be the first crowdsourcing projects of this type to use bots for linguistic data collection and validation, and they are unique in their attempts at engaging participants for different languages.

Footnotes

[1] https://telegram.me/emojitalianobot/
[2] https://telegram.me/emojiworldbot
[3] Computer programmes that carry out repetitive tasks and in their more sophisticated form can also simulate human behaviours.
[4] https://en.wiktionary.org/
[5] http://www.omegawiki.org/Meta:Main_Page
[6] http://it.termwiki.com/
[7] https://term.tilde.com/
[8] https://www.mturk.com/mturk/welcome
[9] https://telegram.me/dialectbot/
[10] http://www.aikuma.org/aikuma-app.html
[11] https://itunes.apple.com/au/app/ma-iwaidja/id557824618?mt=8
[12] When a player without any special knowledge is put into a gaming environment and has to make decisions to win the game under the pressure of time or any game mechanics' constraints.
[13] https://anawiki.essex.ac.uk/phrasedetectives/
[14] http://www.jeuxdemots.org/jdm-accueil.php
[15] http://www.treccani.it/lingua_italiana/speciali/ludolinguistica/Chiusaroli.html
[16] https://www.scritturebrevi.it/
[17] The translation of Pinocchio in emoji can be followed on Twitter using #emojitaliano.
[18] https://kamusi.org/
[19] http://www.unicode.org/cldr/charts/29/annotations/

References

Martin Benjamin. 2015. Crowdsourcing microdata for cost-effective and reliable lexicography. In Proceedings of AsiaLex 2015 Hong Kong, EPFL-CONF-215062, pages 213–221.

Martin Benjamin and Paula Radetzky. 2014. Multilingual lexicography with a focus on less-resourced languages: Data mining, expert input, crowdsourcing, and gamification. In 9th edition of the Language Resources and Evaluation Conference, EPFL-CONF-200375.

Daren C Brabham. 2012. A model for leveraging online communities. The participatory cultures handbook, 120.

Francesca Chiusaroli. 2015. La scrittura in emoji tra dizionario e traduzione. CLiC it, page 88.

Jeff Howe. 2006. The rise of crowdsourcing. Wired magazine, 14(6):1–4.

Nitin Madnani, Joel Tetreault, Martin Chodorow, and Alla Rozovskaya. 2011. They can help: Using crowdsourcing to improve the evaluation of grammatical error detection systems. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, Volume 2, pages 508–513. Association for Computational Linguistics.

Jane McGonigal. 2011. Reality is broken: Why games make us better and how they can change the world. Penguin.

Johanna Monti. 2012. Translators' knowledge in the cloud: The new translation technologies. In International Symposium on Language and Communication: Research Trends and Challenges (ISLC).

Johanna Monti. 2014. Dictionaries in the cloud: state of the art, trends and challenges. Les Cahiers du dictionnaire, (6):95–110.

Aobo Wang, Cong Duy Vu Hoang, and Min-Yen Kan. 2013. Perspectives on crowdsourcing annotations for natural language processing. Language resources and evaluation, 47(1):9–31.
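
As an illustration of the consensus model mentioned in section 4, where an answer is accepted once the same result is provided by a threshold number of respondents, the following minimal Python sketch shows one way such a check could work. It is not the bots' actual implementation: the function name `consensus_answer`, the default threshold of 3, and the normalisation of answers (trimming whitespace and lower-casing) are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Optional


def consensus_answer(responses: Iterable[str], threshold: int = 3) -> Optional[str]:
    """Return the most frequent answer if at least `threshold` respondents gave it.

    Hypothetical helper: the threshold value and the normalisation step are
    assumptions for illustration, not details taken from the paper.
    """
    # Normalise answers so that "Față " and "față" count as the same response,
    # and ignore empty submissions.
    counts = Counter(r.strip().lower() for r in responses if r.strip())
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    # Accept the answer only when enough respondents agree on it.
    return answer if votes >= threshold else None


if __name__ == "__main__":
    # Three of four (hypothetical) players pick the same Romanian tag.
    print(consensus_answer(["față", "Față ", "față", "încântare"]))
    # No answer reaches the threshold, so nothing is accepted yet.
    print(consensus_answer(["față", "încântare"]))
```

In a sketch like this, an answer that has not yet reached the threshold simply stays in the game cycle until more players have responded.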