Understandability of machine-translated Hindi tweets before and after post-editing: perspectives for a recommender system

Ritesh Shah, Université Grenoble-Alpes, GETALP-LIG, Grenoble, France (ritesh.shah@imag.fr)
Christian Boitet, Université Grenoble-Alpes, GETALP-LIG, Grenoble, France (christian.boitet@imag.fr)

Resumen: In the process of building a recommender system based on Hindi tweets for a project, we want to determine whether raw machine translation (MT) results could be useful. We collected 100K such tweets and experimented with 200 of them as a preliminary step. Less than 50% of the machine-translated tweets were understandable to English speakers, while at least 80% understandability would be necessary for MT to be useful in this context. We then post-edited the MT results and observed that understandability rose to 70%, while post-editing time was 5 times less than human translation time. We thus outline a scenario for producing a specialised MT system to translate (fully automatically) tweets from Hindi into English, such that 70% to 80% of the translated tweets would be understandable. Palabras clave: Hindi tweets, specialised MT systems, post-editing, understandability, recommender system

Abstract: In the process of building a recommender system based on Hindi tweets for a project, we want to determine whether raw Machine Translation (MT) results could be useful. We collected 100K such tweets and experimented on 200 of them as a preliminary step.
Less than 50% of the machine-translated tweets were understandable by English speakers, while at least 80% understandability seems to be required for MT to be included as a useful feature in this context. We then post-edited the MT results and observed that understandability reached 70%, while post-editing time was 5 times less than human translation time. We outline a scenario to produce a specialised MT system that would be able to translate (fully automatically) 70% to 80% of the tweets in Hindi into understandable English. Keywords: Hindi tweets, specialised MT system, understandability, recommender system, post-editing

1 Introduction and objectives

The operational architecture of a Machine Translation (MT) system is determined by precise conditions of the use and development of the system. For instance, the architecture changes depending on the role of MT system users (say, authors or professional translators), on the language pairs involved, or when the availability of resources is a primary constraint. When the task is simply to help people understand an unknown or little known language, the design of the MT system is driven by coverage and automaticity rather than by output quality, as only the gist of a translation is to be conveyed (Boitet et al., 2009).

An interesting case is that of multilingual recommender systems relying on information mined from tweets in regional languages. The user of the system, for instance a tourist, might like to have a look at the top five translated tweets having influenced the recommendation (summarized as usual by 0 to 5 "stars"). A tweet translation system providing an operational quality output could be sufficient in such cases.

In the following section, we elaborate on the data collection and preprocessing. Section 3 explains the experimental setup and procedure. Experimental observations and a scenario for building a good enough specialized MT system follow in the last two sections.
Keeping in mind the above context, we make a preliminary study of the understandability of tweet translations from Hindi to English, before and after post-editing them. For that, we randomly selected 200 tweets from the 100K collected, had them translated by Google Translate (GT), evaluated their understandability as is by English (non-Hindi) speakers, and asked a few Hindi-speaking colleagues to post-edit the MT results (which we call "pre-translations") using the iMAG/SECTra web tool (Huynh, Boitet, and Blanchon, 2008), giving them simple post-editing (PE) guidelines. In particular, they were asked to do minimal editing, not to aim at "normalizing", improving, or inserting missing information, and to write down the total time it took them to post-edit each tweet.

We then asked the same English (non-Hindi) speakers to evaluate again the proportion of understandable tweets. That rate rose from less than 50% before post-editing to more than 70% after post-editing. In the context of a recommender system and of the scenario sketched above, if more than 20% (or perhaps 30%) of the (translated) tweets are ununderstandable, the usage value of the MT system would be null, because users would simply stop looking at the tweets. On the other hand, if only 1 out of 5 tweets is ununderstandable, they would continue to look at them when they are curious about the reason for a particularly good or bad recommendation, so that the usage quality of the MT system might be judged good enough, or useful, or only usable. Our real distinction is whether the MT results would be used, even sparingly, or not at all.

While the value of the minimal rate of understandability certainly depends on each person, we could not yet set up an experiment with many tweet users, as we wished. In fact, the above value of about 70% has been obtained by asking only 2 English-only readers.

2 Dataset: Hindi Tweets

2.1 Technology and Twitter API constraints

There are numerous services presently available for providing customised social content data, including tweets (GNIP, 2015). For our tweet dataset, we make use of the Twitter search API to extract tweets. The search API (as opposed to the Streaming API) allows the developer to obtain a maximum of 1.73M tweets/day through Application-user authentication (AuA) and a maximum of 4.32M tweets/day through Application-only authentication (AoA).

The search API returns a collection of tweets corresponding to the requested query and the specified query filters. As we want to investigate tweet translations from Hindi to English, we use the search API under the AuA mode with a query containing the language filter 'lang:hi'. The query allows us to extract Hindi (translation source language) tweets within the rate limit specified by the API. We used an interactive Python programming environment for data preprocessing and development to collect 100K tweets in Hindi.

2.2 Preprocessing

The preprocessing of our data involved the conversion of the tweet dataset into HTML files, as required by the iMAG¹ framework. We also had to normalize a subset of characters (in particular, emojis) to avoid potential systemic problems on account of data encoding and decoding.

2.2.1 Data format

The extracted tweets are in the JSON² format, which contains the metadata and the textual content of each tweet. We kept only the textual content ('text' field) and the tweet identifier ('id_str' field) of each tweet. We finally converted the messages to a set of HTML files, each containing a table of 100 rows and 3 columns, as shown in Figure 1. A third column, with an 'enum' field, is added programmatically during conversion, for enumeration.

¹ interactive Multilingual Access Gateway
² JavaScript Object Notation
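The conversion described in Section 2.2.1 can be sketched as follows (a minimal illustration: the 'text' and 'id_str' field names are those of the Twitter JSON payload, while the function name, the one-JSON-object-per-line input format, and the bare-bones HTML layout are our own assumptions):

```python
import json
from html import escape

def tweets_to_html_pages(json_lines, rows_per_page=100):
    """Keep only 'id_str' and 'text' from each tweet, and emit HTML
    tables of at most 100 rows and 3 columns (enum, id_str, text)."""
    kept = [(t["id_str"], t["text"])
            for t in (json.loads(line) for line in json_lines)]
    pages = []
    for start in range(0, len(kept), rows_per_page):
        rows = []
        # 'enum' numbers the tweets continuously across pages
        for enum, (id_str, text) in enumerate(
                kept[start:start + rows_per_page], start + 1):
            rows.append("<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                enum, escape(id_str), escape(text)))
        pages.append("<table>\n" + "\n".join(rows) + "\n</table>")
    return pages
```

Each returned page can then be written to its own HTML file for upload to the iMAG.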
Figure 1: Tweets in Hindi (Devanagari script) in an HTML table with the fields enum, id_str and text.

2.2.2 Emoji issues

In order to verify data robustness and systemic consistency for further experiments, we set up an existing iMAG for a few files. During the process, we identified a problem that manifested in the form of emojis and emoticons, which are frequently used in tweet texts. The incorrect handling of the UTF-8 mapping scheme for the Unicode points that code these emojis caused the setup to fail.

Our solution was to normalise, during the preprocessing step, a range of such special occurrences. We identified and converted characters in the following Unicode point ranges (emojiList-1, 2015; emojiList-2, 2015) in such a way that it should be possible to restore them at the end of the translation process. For instance, the character '\U0001F44C' is converted to '%%EMOJI-0001f44c'. An example can be seen in row 4 of Figure 1.

3 Experiment

3.1 About iMAG/SECTra

iMAG/SECTra is a post-editing framework which internally employs GT by default (and any number of available MT servers) and allows the integration of specialised MT systems. The system provides pre-translations to the post-editor and allows post-editing in various modes. It also allows post-editors to grade the quality of post-editions and to record the total time spent post-editing (TPEtotal). In iMAG/SECTra, each segment has a reliability level³ and a quality score between 0 and 20⁴ (Wang and Boitet, 2013). While the reliability level is fixed by the tool, the quality score can be modified by the post-editor (initially, it is the one defined in his profile) or by any reader. The PE of a segment is deemed good enough if its quality score is greater than or equal to 12/20.

³ *: dictionary-based translation; **: MT output; ***: PE by a bilingual contributor; ****: PE by a professional translator; *****: PE by a translator "certified" by the producer of the content.
⁴ 10: pass; 12: good enough; 14: good; 16: very good; 18: exceptional; 20: perfect. 8-9: not satisfied with something in the PE. 6-7: sure to have produced a bad translation. 4-5: the PE corresponds to a text differing from that of the source segment (this happens when a sentence has been erroneously split into 2 segments and the order of words is different in the 2 languages). 2: the source segment was already in the target language.

3.2 Experimental setting

Our experimental procedure has two parts:

1. evaluating the understandability of pre-translations;
2. post-editing pre-translations and estimating the output quality in relation with the post-editing times recorded by the post-editors.

First, we randomly selected two Hindi tweet datasets containing 100 tweets each (twTxtSet1 and twTxtSet2), and then we set up an iMAG/SECTra for post-editing the tweets.

3.2.1 Pre-translation understandability

In order to determine the proportion of understandable pre-translations (that is, tweets translated by GT), 2 participants speaking English and no Hindi were selected. Each participant was asked to give a score of 1 if a (translated) tweet was found to be understandable, and 0 otherwise. The proportion of understandable tweets was recorded as 39% for twTxtSet1 and 45% for twTxtSet2.
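The reversible emoji normalisation described in Section 2.2.2 can be sketched as follows (a minimal illustration: the single code-point range below covers only the "Miscellaneous Symbols and Pictographs" and "Emoticons" blocks, whereas the actual preprocessing drew its ranges from the cited Unicode emoji lists; the function names are ours):

```python
import re

# Illustrative subset of the ranges taken from emojiList-1/emojiList-2.
EMOJI = re.compile('[\U0001F300-\U0001F64F]')
TOKEN = re.compile(r'%%EMOJI-([0-9a-f]{8})')

def normalize(text):
    # '\U0001F44C' -> '%%EMOJI-0001f44c', as in Section 2.2.2
    return EMOJI.sub(lambda m: f'%%EMOJI-{ord(m.group()):08x}', text)

def restore(text):
    # Inverse mapping, applied at the end of the translation process
    return TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Since the placeholder is plain ASCII, it survives the MT round trip, after which restore() recovers the original character.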
3.2.2 Post-editing methodology

Speakers of both Hindi and English were selected to post-edit the pre-translations, writing down their TPEtotal for each tweet. To be in line with the envisaged scenario, where a tweet reader knowing both languages might conceivably (but rarely) correct a translation ("contribute", in Google's words), while no other reader would independently contribute on the same tweet, each tweet was post-edited only once. Monolingual and bilingual participants were then asked to score the post-edited pre-translations for understandability.

Even though post-editing was to be done in as little time as possible, the post-editor was allowed to quickly label a tweet with a single word which would help human understanding and further elicit the context of certain tweets. This label was meant to be added without taking much time. For instance, spam or derogatory tweets could be labeled as "((??spam??))", and code-mixed tweets as "((??mixing??))". The label delimiters were pre-decided, to separate the labels from the original tweet text. No set of labels was prepared beforehand; labels were introduced by the post-editors themselves. The most frequent were {"news", "philosophy", "politics", "sports", "joke", "humour", "sarcasm", "quote"}.

4 Observations

4.1 Post-editing statistics

Table 1 shows the total post-editing times (in minutes) for twTxtSet1 and twTxtSet2. We observe and note additional statistics (given in the term definitions below). More importantly, we obtain the quality measure, which stands at 56.1% for twTxtSet1 and 73.6% for twTxtSet2.

Term definitions used in Tables 1 and 2:
std_page: a standard page, consisting of 250 words
#segments: number of translated segments
#source-words: number of words in the source dataset
#logical-pages: number of PE pages in SECTra
pages-std: #source-words / 250
TPEtotal-mn: TPEtotal in minutes
mn/std_page: TPEtotal-mn / pages-std
Thum_std_page = 60 (assumed human translation time, in minutes, for a standard page)
Thum_estim-mn: Thum_std_page * pages-std

Dataset   | #logical-pages | #segments | #source-words | TPEtotal-mn
TwTxtSet1 | 17             | 331       | 1843          | 162.6
TwTxtSet2 | 18             | 356       | 1780          | 93.7

Table 1: Observed PE statistics

Dataset   | pages-std | mn per std_page | Thum_estim-mn | Quality
TwTxtSet1 | 7.4       | 21.97           | 444           | 56.1%
TwTxtSet2 | 7.1       | 13.19           | 426           | 73.6%

Table 2: Calculated PE statistics

In Table 2, the quality formula used (NII⁵ lecture notes (Boitet et al., 2009)) is as follows (assuming that the human time to produce a translation draft for a standard page is 1 hour):

Q = 1 - (2/100) * (TPEtotal-mn / Thum_estim-mn) * Thum_std_page

Examples:
Q = 40% if TPEtotal = 30 mn/p (8/20)
Q = 60% if TPEtotal = 20 mn/p (12/20)
Q = 90% if TPEtotal = 5 mn/p (18/20)

⁵ National Institute of Informatics, Japan

We now add a few illustrative examples with descriptions to better visualise pertinent stages of our experimental procedure and observations. In the examples, H is a Hindi tweet given as input to the iMAG, T is the machine-translated tweet (the pre-translation), and PE is the post-edited output.

4.1.1 Example 1: An ununderstandable MT output

H: मंजिल मिले ना मिले ये तो मुकदर क बात है !
T: These services are not met, then the floor is Mukdr!
PE: Destination whether met or are not met, then it is luck!

Here the post-edited output received a quality score of 14 (good), with a TPEtotal of 33 seconds.

4.1.2 Example 2: Post-editing environment

Figure 2: Screenshot of the SECTra post-editing mode: source text, post-edit area with reliability level and quality score, post-editor's name (hidden in the image), and post-editing time. Right: trace of the edit distance computation.

Figure 3: Screenshot of the SECTra Export mode, which allows data export in several formats; one can also view the translation memory. This figure shows only the source tweets, the pre-translations and the post-editions.
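The quality measure of Section 4.1 can be checked numerically against Tables 1 and 2 (a small sketch; the variable names are ours):

```python
T_HUM_STD_PAGE = 60.0  # assumed human translation time per standard page, in minutes

def quality(tpe_total_mn, source_words):
    """Q = 1 - (2/100) * (TPEtotal-mn / Thum_estim-mn) * Thum_std_page,
    with Thum_estim-mn = 60 * pages-std and pages-std = #source-words / 250."""
    pages_std = source_words / 250.0
    thum_estim_mn = T_HUM_STD_PAGE * pages_std
    return 1 - 0.02 * (tpe_total_mn / thum_estim_mn) * T_HUM_STD_PAGE

# twTxtSet1: 162.6 mn of PE over 1843 source words
# twTxtSet2:  93.7 mn of PE over 1780 source words
# These reproduce Table 2's 56.1% and 73.6% up to the rounding of pages-std.
```

Note that, since Thum_estim-mn = Thum_std_page * pages-std, the formula reduces to Q = 1 - 0.02 * (mn per std_page), which is how the three worked examples (30, 20 and 5 mn/p) yield 40%, 60% and 90%.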
5 Towards building a hi-en MT system useful for tweeters

From the observations at hand, and knowing that specializing an MT system to a restricted sublanguage can dramatically increase all quality indicators (Chandioux, 1989; Isabelle, 1987), we can outline a scenario to produce a specialized MT system that would be able to translate (fully automatically) 70% to 80% of the Hindi tweets into understandable English.

The idea is to include the MT cross-lingual access facility in the recommender system almost from the start, but not to make it accessible immediately, in order not to discourage tweeters from ever using it. There will be a phase whose length will depend on the number of bilingual Hindi-English speakers contributing to the building of a specialized hi-en tweet-MT system.

As has already been done successfully for French-Chinese (Wang and Boitet, 2013), we will estimate the best size Size_tw of an aligned hi-en learning corpus (a first guess might be Size_tw = 10000 or Size_tw = 15000 for the observed sublanguage of Hindi tweets). Initially, we will populate it using parts of some genuine hi-en corpus, if any is available, and otherwise a part of the CFILT⁶ en-hi corpus. Even if inverted translations are notoriously not translation examples, an inverted parallel corpus is better than nothing. That will be the basis for building version 0.1 of a Moses-based specialized system, say twMT-hi-en-0.1.

The contributors' team will then post-edit what it can, working some time every day. Incremental improvement will be performed a certain number of times⁷ after each new batch of good enough post-editions becomes available, giving twMT-hi-en-0.1 … twMT-hi-en-0.20 if there are 20 incremental improvement steps. Version 1.1 (twMT-hi-en-1.1) will then be produced by full recompilation, and the whole process will be iterated.

The PE interface will systematically propose the results of the current twMT-hi-en-x.y version in the PE area of each segment, but results produced by GT and, if possible, by other systems (Systran, Bing, Indian systems) will also be visible, with a button to reinitialize the PE area with each of them. No development is needed, as this has been a standard feature of the SECTra interface since 2008.

The quality measure will be used to determine when the specialized system is good enough to open the MT cross-lingual access facility to tweeters. There will be a first period during which twMT-hi-en-x.y will remain inferior to GT, that is, will require more PE time.⁸ After a certain version (a.b), the PE time for results of twMT-hi-en-x.y with x.y ≥ a.b will be less than that for GT, but the results will still not be understandable enough. How do we know if and when this will happen?

The experiment described in this paper shows that PE of current MT results makes it possible to almost reach the required understandability level of 70%-80%, with a PE time of 12 mn/p. We hope that MT outputs needing only 5 mn/p of PE to reach 90% understandability will be understandable enough (70%-80%) without PE. The idea is that, if that correlation holds, which we will verify by testing it every time a new version (x.1) is issued, we will open the MT cross-lingual access facility to tweeters once this minimal understandability threshold has been attained through this supervised learning process.

Another worry will then be to ensure non-regressivity. It is expected that some continuous human supervision will remain needed, and that no dedicated contributors' group will be maintainable. Then, some self-organizing community of contributors (post-editors) should emerge, somewhat like what has happened for many open-source software localization projects. Another encouraging perspective is the announcement of a new kind of web service such as SYNAPS (Viséo, 2015), aiming at organizing contributive activities.

⁶ Centre for Indian Language Technology, IITB, India.
⁷ Experiments on French-Chinese have shown that improvement levels out after 10-20 incremental improvement steps. It is then necessary to recompile the full system, and that is also an appropriate time to modify the learning set by including all good enough post-editions, say Npe bisegments, and keeping only Size_tw − Npe segments of the unspecialized parallel corpus.
⁸ The PE time for GT outputs will be estimated without any supplementary human work, because experiments show a very good correlation between our mixed PE distance Δm(mt, pe) and TPEtotal(mt).

References

Boitet, Christian, Hervé Blanchon, Mark Seligman, and Valérie Bellynck. 2009. Evolution of MT with the Web. In Proceedings of the International Conference "Machine Translation 25 Years On", pages 1–13, Cranfield, November.

Chandioux, John. 1989. 10 ans de METEO (MD). In A. Abbou, editor, Proceedings of Traduction Assistée par Ordinateur: Perspectives Technologiques, Industrielles et Économiques Envisageables à l'Horizon 1990: l'Offre, la Demande, les Marchés et les Évolutions en Cours, pages 169–172, Paris. Daicadif.

emojiList-1. 2015. Full emoji list. http://www.unicode.org/emoji/charts/full-emoji-list.html.

emojiList-2. 2015. Other emoji list. http://www.unicode.org/Public/emoji/1.0/emoji-data.txt.

GNIP. 2015. Gnip. https://gnip.com/sources/twitter/.

Huynh, Cong-Phap, Christian Boitet, and Hervé Blanchon. 2008. SECTra_w.1: An online collaborative system for evaluating, post-editing and presenting MT translation corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, pages 2571–2576.

Isabelle, Pierre. 1987. Machine Translation at the TAUM group. In Machine Translation Today: The State of the Art, pages 247–277, Edinburgh. Edinburgh University Press.

Viséo. 2015. Synaps website. http://www.viseo.com/fr/offre/synaps.

Wang, Lingxiao and Christian Boitet. 2013.
Online production of HQ parallel corpora and permanent task-based evaluation of multiple MT systems: both can be obtained through iMAGs with no added cost. In Proceedings of the 2nd Workshop on Post-Editing Technologies and Practice at MT Summit 2013, pages 103–110, Nice, September.