=Paper= {{Paper |id=Vol-1445/tweetmt-7-shah |storemode=property |title=Understandability of Machine-translated Hindi Tweets Before and After Post-editing: Perspectives for a Recommender System |pdfUrl=https://ceur-ws.org/Vol-1445/tweetmt-7-shah.pdf |volume=Vol-1445 |dblpUrl=https://dblp.org/rec/conf/sepln/ShahB15 }} ==Understandability of Machine-translated Hindi Tweets Before and After Post-editing: Perspectives for a Recommender System== https://ceur-ws.org/Vol-1445/tweetmt-7-shah.pdf
    Understandability of machine-translated Hindi tweets before and
      after post-editing: perspectives for a recommender system

Comprensibilidad de tweets en Hindi traducidos por un sistema de traducción
automática antes y después de post-edición: perspectivas para un sistema de
                              recomendación
                   Ritesh Shah                              Christian Boitet
             Université Grenoble-Alpes,                  Université Grenoble-Alpes,
                   GETALP-LIG,                                 GETALP-LIG,
                  Grenoble, France                            Grenoble, France
                ritesh.shah@imag.fr                       christian.boitet@imag.fr

       Resumen: En el proceso de construcción de un sistema de recomendación basado en tweets en hindi para un proyecto, queremos determinar si los resultados brutos de traducción automática (TA) podrían ser útiles. Hemos recogido 100K de tales tweets y hemos experimentado con 200 de ellos como paso preliminar. Menos del 50% de los tweets traducidos por TA resultaban comprensibles para hablantes de inglés, mientras que por lo menos el 80% de comprensibilidad sería necesario para que la TA fuera útil en este contexto. Posteriormente, hemos post-editado los resultados de TA y hemos observado que la comprensibilidad aumentó al 70%, mientras que el tiempo de post-edición fue 5 veces menor que el tiempo de traducción humana. Esbozamos, así, un escenario para producir un sistema de TA especializado para traducir (automáticamente) tweets de hindi a inglés, consiguiéndose que del 70% al 80% de los tweets obtenidos fueran comprensibles.
       Palabras clave: tweets en hindi, sistemas especializados de traducción automática, post-edición, comprensibilidad, sistema de recomendación
       Abstract: In the process of building a recommender system based on Hindi tweets for a project, we want to determine whether raw Machine Translation (MT) results could be useful. We collected 100K such tweets and experimented on 200 of them as a preliminary step. Less than 50% of the machine-translated tweets were understandable by English speakers, while at least 80% understandability seems to be required for MT to be included as a useful feature in this context. We then post-edited the MT results and observed that understandability reached 70%, while post-editing time was 5 times less than human translation time. We outline a scenario to produce a specialised MT system that would be able to translate (fully automatically) 70% to 80% of the tweets in Hindi into understandable English.
       Keywords: Hindi tweets, specialised MT system, understandability, recommender system, post-editing

1    Introduction and objectives

The operational architecture of a Machine Translation (MT) system is determined by precise conditions of use and development of the system. For instance, the architecture changes depending on the role of the MT system users (say, authors or professional translators), the language pairs involved, or when the availability of resources is a primary constraint. When the task is simply to help people understand an unknown or little-known language, the design of the MT system is driven by coverage and automaticity rather than by output quality, as only the gist of a translation is to be conveyed (Boitet et al., 2009).

An interesting case is that of multilingual recommender systems relying on information mined from tweets in regional languages. The user of the system, for instance a tourist, might like to have a look at the top five translated tweets having influenced the recommendation (summarized as usual by 0 to 5 "stars"). A tweet translation system providing operational-quality output could be sufficient in such cases.

Keeping in mind the above context, we make a preliminary study of the understandability of tweet translations from Hindi to English, before and after post-editing them. For that, we randomly selected 200 tweets from the 100K collected, had them translated by Google Translate (GT), had their understandability evaluated as is by English (non-Hindi) speakers, and asked a few Hindi-speaking colleagues to post-edit the MT results (which we call "pre-translations") using the iMAG/SECTra (Huynh, Boitet, and Blanchon, 2008) web tool, giving them simple post-editing (PE) guidelines. In particular, they were asked to do minimal editing, not to aim at "normalizing", improving, or inserting missing information, and to write down the total time it took them to post-edit each tweet.

We then asked the same English (non-Hindi) speakers to evaluate again the proportion of understandable tweets. That rate rose from less than 50% before post-editing to more than 70% after post-editing. In the context of a recommender system and of the scenario sketched above, if more than 20% (or perhaps 30%) of the (translated) tweets are not understandable, the usage value of the MT system would be null, because users would simply stop looking at the tweets. On the other hand, if only 1 out of 5 tweets is not understandable, users would continue to look at them when they are curious about the reason for a particularly good or bad recommendation, so that the usage quality of the MT system might be judged good enough, useful, or at least usable. Our real distinction is whether the MT results would be used, even sparingly, or not at all.

While the value of the minimal rate of understandability certainly depends on each person, we could not yet set up an experiment with many tweet users, as we wished. In fact, the above value of about 70% has been obtained by asking only 2 English-only readers.

In the following section, we elaborate on the data collection and preprocessing. Section 3 explains the experimental setup and procedure. Experimental observations and a scenario for building a good-enough specialized MT system follow in the last two sections.

2    Dataset: Hindi Tweets

2.1    Technology and Twitter API constraints

There are numerous services presently available for providing customised social content data, including tweets (GNIP, 2015). For our tweet dataset, we make use of the Twitter search API to extract tweets. The search API (non-Streaming API) from Twitter allows the developer to obtain a maximum of 1.73M tweets/day through Application-user authentication (AuA) and a maximum of 4.32M tweets/day through Application-only authentication (AoA).

The search API returns a collection of tweets corresponding to the requested query and the specified query filters. As we want to investigate tweet translations from Hindi to English, we make use of the search API under the AuA mode with a query containing the language filter 'lang:hi'. The query allows us to extract Hindi (translation source language) tweets within the rate limit specified by the API. We used an interactive Python programming environment for data preprocessing and development to collect 100K tweets in Hindi.

2.2    Preprocessing

The preprocessing of our data involved format conversion of the tweet dataset into HTML files as required by the iMAG (interactive Multilingual Access Gateway) framework. We also had to normalize a subset of characters (in particular, emojis) to avoid potential systemic problems on account of data encoding and decoding.

2.2.1    Data format

The extracted tweets are in the JSON (JavaScript Object Notation) format, which contains the metadata and the textual content of each tweet. We kept only the textual content ('text' field) and the tweet identifier ('id_str' field) of each tweet. We finally converted the messages to a set of HTML files, each containing a table of 100 rows and 3 columns as shown in Figure 1. A third column with the 'enum' field is added programmatically during conversion, for enumeration.

Figure 1: Tweets in Hindi (Devanagari script) in an HTML table with fields: enum, id_str and text

2.2.2    Emoji issues

In order to verify data robustness and systemic consistency for further experiments, we set up an existing iMAG for a few files. During the process, we identified a problem caused by the emojis and emoticons that are frequently used in tweet texts: the incorrect handling of the UTF-8 mapping for the Unicode code points that encode these emojis caused the setup to fail.

Our solution was to normalise, during the preprocessing step, a range of such special occurrences. We identified and converted characters in the Unicode code point ranges given by (emojiList-1, 2015) and (emojiList-2, 2015), in such a way that it should be possible to restore them at the end of the translation process. For instance, the character '\U0001F44C' is converted to '%%EMOJI-0001f44c'. An example can be seen in row 4 of Figure 1.

3    Experiment

3.1    About iMAG/SECTra

iMAG/SECTra is a post-editing framework which internally employs GT by default (and any number of available MT servers) and allows integration of specialised MT systems. The system provides pre-translations to the post-editor and allows post-editing in various modes. It also allows post-editors to grade the quality of post-editions and to record the total post-editing time (Tpe_total). In iMAG/SECTra, each segment has a reliability level [3] and a quality score between 0 and 20 [4] (Wang and Boitet, 2013). While the reliability level is fixed by the tool, the quality score can be modified by the post-editor (initially, it is the one defined in his profile) or by any reader.

The quality of the PE of a segment is deemed to be good enough if its quality score is higher than or equal to 12/20.

3.2    Experimental setting

Our experimental procedure has two parts:

 1. evaluating the understandability of pre-translations;

 2. post-editing pre-translations and estimating the output quality in relation to the post-editing times recorded by the post-editors.

First, we randomly selected two Hindi tweet datasets containing 100 tweets each (twTxtSet1 and twTxtSet2), and then we set up an iMAG/SECTra for post-editing the tweets.

3.2.1    Pre-translation understandability

In order to determine the proportion of understandable pre-translations (that is, tweets translated by GT), 2 participants speaking English and no Hindi were selected. Each participant was asked to give a score of 1 if a (translated) tweet was found to be understandable and 0 otherwise. The proportion of understandable tweets was recorded as 39% for twTxtSet1 and 45% for twTxtSet2.

[3] * for dictionary-based translation, ** for MT output, *** for PE by a bilingual contributor, **** for PE by a professional translator, and ***** for PE by a translator "certified" by the producer of the content.

[4] 10: pass, 12: good enough, 14: good, 16: very good, 18: exceptional, 20: perfect. 8-9: not satisfied with something in the PE. 6-7: sure to have produced a bad translation! 4-5: the PE corresponds to a text differing from that of the source segment; that happens when a sentence has been erroneously split into 2 segments and the order of words is different in the 2 languages. 2: the source segment was already in the target language.
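The emoji normalisation described in Section 2.2.2 can be sketched as follows. This is a minimal illustration, not the exact code used in our pipeline; in particular, the code point ranges shown are an assumption for the example, whereas the actual ranges come from the cited Unicode emoji lists. The intent is that the ASCII token survives the translation step unchanged, so the round trip is lossless:

```python
import re

# Assumed, partial code point ranges for illustration only; the real
# lists are the cited Unicode emoji data files (emojiList-1, emojiList-2).
EMOJI_RANGES = [(0x2600, 0x27BF), (0x1F300, 0x1F5FF), (0x1F600, 0x1F64F)]

def is_emoji(ch):
    """True if the character falls in one of the (assumed) emoji ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES)

def normalize(text):
    """Replace each emoji by an ASCII token such as '%%EMOJI-0001f44c'."""
    return "".join(f"%%EMOJI-{ord(ch):08x}" if is_emoji(ch) else ch
                   for ch in text)

def restore(text):
    """Restore the original emojis at the end of the translation process."""
    return re.sub(r"%%EMOJI-([0-9a-f]{8})",
                  lambda m: chr(int(m.group(1), 16)), text)

tweet = "\U0001F44C"                              # the example of Section 2.2.2
assert normalize(tweet) == "%%EMOJI-0001f44c"     # token matches the paper's format
assert restore(normalize(tweet)) == tweet         # round trip is lossless
```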
3.2.2    Post-editing methodology

Speakers of both Hindi and English were selected to post-edit the pre-translations, noting down their Tpe_total for each tweet. To be in line with the envisaged scenario, where a tweet reader knowing both languages might conceivably (but rarely) correct a translation ("contribute", in Google's words), while no other reader would independently contribute on the same tweet, each tweet was post-edited only once. Monolingual and bilingual participants were then asked to score the post-edited pre-translations for understandability.

Even though post-editing was to be done in as little time as possible, the post-editor was allowed to quickly label a tweet with a single word which would help human understanding and further elicit the context of certain tweets. This label was meant to be added without taking much time. For instance, spam or derogatory tweets could be labeled as "((??spam??))", and code-mixed tweets could be labeled as "((??mixing??))". The label delimiters were pre-decided, to separate the labels from the original tweet text. No set of labels was prepared beforehand; labels were introduced by the post-editors themselves. The most frequent were {"news", "philosophy", "politics", "sports", "joke", "humour", "sarcasm", "quote"}.

4    Observations

4.1    Post-editing statistics

Table 1 shows the total post-editing times (in minutes) for twTxtSet1 and twTxtSet2, together with additional statistics (explained in the term definitions below). More importantly, we obtain the quality measure, which stands at 56.1% for twTxtSet1 and 73.6% for twTxtSet2.

 Dataset     #logical-pages   #segments   #source-words   Tpe_total-mn
 TwTxtSet1   17               331         1843            162.6
 TwTxtSet2   18               356         1780             93.7

           Table 1: Observed PE statistics

 Dataset     pages-std   mn per std_page   Thum_estim-mn   Quality
 TwTxtSet1   7.4         21.97             444             56.1%
 TwTxtSet2   7.1         13.19             426             73.6%

           Table 2: Calculated PE statistics

Term definitions used in Tables 1 and 2:
 std_page: a standard page, consisting of 250 words
 #segments: number of translated segments
 #source-words: number of words in the source dataset
 #logical-pages: number of PE pages in SECTra
 pages-std: #source-words / 250
 Tpe_total-mn: Tpe_total in minutes
 mn/std_page: Tpe_total-mn / pages-std
 Thum_std_page = 60
 Thum_estim-mn: Thum_std_page * pages-std

In Table 2, the quality formula used (NII (National Institute of Informatics, Japan) lecture notes (Boitet et al., 2009)) is as follows (assuming that the human time to produce a translation draft for a standard page is 1 hour, i.e. Thum_std_page = 60 mn):

    Q = 1 - (2/100) * (Tpe_total-mn / Thum_estim-mn) * Thum_std_page

Examples:
 Q = 40% if Tpe_total = 30 mn/p (8/20)
 Q = 60% if Tpe_total = 20 mn/p (12/20)
 Q = 90% if Tpe_total = 5 mn/p (18/20)

We now add a few illustrative examples with descriptions to better visualise pertinent stages of our experimental procedure and observations.

4.1.1    Example 1: An MT output that is not understandable

H: मंजिल मिले ना मिले ये तो मुकदर की बात है!
T: These services are not met, then the floor is Mukdr!
PE: Destination whether met or are not met, then it is luck!

Here, H is a Hindi tweet given as input to the iMAG, T is the machine-translated Hindi tweet (the pre-translation), and PE is the post-edited output, with score 14 (good) and a Tpe_total of 33 seconds.
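Since Thum_estim-mn = Thum_std_page * pages-std, the quality formula above reduces to Q = 1 - 0.02 * (mn/std_page). A short script (a sketch; the variable names are ours) reproduces the Table 2 quality values from the Table 1 figures:

```python
T_HUM_STD_PAGE = 60  # mn of human translation time per standard page (250 words)

def quality(tpe_total_mn, pages_std):
    """Quality formula from the NII lecture notes, as used in Table 2."""
    thum_estim_mn = T_HUM_STD_PAGE * pages_std
    return 1 - (2 / 100) * (tpe_total_mn / thum_estim_mn) * T_HUM_STD_PAGE

# Values from Tables 1 and 2:
print(round(quality(162.6, 7.4), 3))  # twTxtSet1: 0.561 (56.1%)
print(round(quality(93.7, 7.1), 3))   # twTxtSet2: 0.736 (73.6%)

# The three worked examples (30, 20 and 5 mn per standard page):
print([round(quality(t, 1), 2) for t in (30, 20, 5)])  # [0.4, 0.6, 0.9]
```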
4.1.2    Example 2: Post-editing environment

Figure 2: Screenshot of the SECTra post-editing mode: source text, post-edit area with reliability level and quality score, post-editor's name (hidden in the image) and post-editing time. Right: trace of the edit-distance algorithm computation.

Figure 3: Screenshot of the SECTra export mode. It allows data export in several formats, and one can also view the translation memory. This figure shows only the source tweets, the pre-translations and the post-editions.
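The labelling convention of Section 3.2.2 (single-word context labels wrapped in pre-decided delimiters, e.g. "((??spam??))") makes the labels trivially separable from the tweet text when the exported post-editions are processed. A possible sketch (the function and variable names are ours, not from the project code):

```python
import re

# Pre-decided delimiters from Section 3.2.2: ((??label??))
LABEL_RE = re.compile(r"\(\(\?\?([^?]+)\?\?\)\)")

def split_labels(post_edited):
    """Return (labels, text): all ((??...??)) labels, and the tweet text
    with those labels stripped out."""
    labels = LABEL_RE.findall(post_edited)
    text = LABEL_RE.sub("", post_edited).strip()
    return labels, text

labels, text = split_labels("((??spam??)) Win a free phone now!")
assert labels == ["spam"]
assert text == "Win a free phone now!"
```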
5    Towards building a hi-en MT system useful for tweeters

From the observations at hand, and knowing that specializing an MT system to a restricted sublanguage can dramatically increase all quality indicators (Chandioux, 1989; Isabelle, 1987), we can outline a scenario to produce a specialized MT system that would be able to translate (fully automatically) 70% to 80% of the Hindi tweets into understandable English.

The idea is to include the MT cross-lingual access facility in the recommender system almost from the start, but not to make it accessible immediately, so as not to permanently discourage tweeters from using it. There will first be a phase whose length will depend on the number of bilingual Hindi-English speakers contributing to the building of a specialized hi-en tweet-MT system.

As has already been done successfully for French-Chinese (Wang and Boitet, 2013), we will estimate the best size Size_tw of an aligned hi-en learning corpus (a first guess might be Size_tw = 10000 or Size_tw = 15000 for the observed sublanguage of Hindi tweets).

Initially, we will populate it using parts of some genuine hi-en corpus, if any is available, and, if none is, a part of the CFILT (Centre for Indian Language Technology, IITB, India) en-hi corpus. Even if inverted translations are notoriously not good translation examples, an inverted parallel corpus is better than nothing. That will be the basis for building version 0.1 of a Moses-based specialized system, say twMT-hi-en-0.1.

The contributors' team will then post-edit what it can, working some time every day. Incremental improvement will be performed a certain number of times [7] after each new batch of good-enough post-editions becomes available, giving twMT-hi-en-0.1 ... twMT-hi-en-0.20 if there are 20 incremental improvement steps. Version 1.1 (twMT-hi-en-1.1) will then be produced by full recompilation, and the whole process will be iterated.

The PE interface will systematically propose the results of the current twMT-hi-en-x.y version in the PE area of each segment, but results produced by GT and, if possible, by other systems (Systran, Bing, Indian systems) will also be visible, with a button to reinitialize the PE area with each of them. No development is needed, as this has been a standard feature of the SECTra interface since 2008. The quality measure defined above will be used to determine when the specialized system is good enough to open the MT cross-lingual access facility to tweeters. There will be a first period during which twMT-hi-en-x.y will remain inferior to GT, that is, will require more PE time. [8]

After a certain version (a.b), the PE time for results of twMT-hi-en-x.y with x.y >= a.b will be less than that for GT, but the results will still not be understandable enough. How to know if and when this will happen?

The experiment described in this paper shows that PE of current MT results allows us to get almost to the required understandability level of 70%-80%, with a PE time of 12 mn/p. We hope that MT outputs needing only 5 mn/p of PE to reach 90% understandability will be understandable enough (70%-80%) without PE.

The idea is that, if that correlation holds, which we will verify by testing it every time a new version (x.1) is issued, we will open the MT cross-lingual access facility to tweeters once this minimal understandability threshold has been attained through this supervised learning process.

Another worry will then be to ensure non-regressivity. It is expected that some continuous human supervision will remain needed, and that no dedicated contributors' group will be maintainable. Hence, some self-organizing community of contributors (post-editors) should emerge, somewhat like what has happened for many open-source software localization projects. Another encouraging perspective is the announcement of a new kind of web service such as SYNAPS (Viséo, 2015), aiming at organizing contributive activities.

[7] Experiments on French-Chinese have shown that improvement levels out after 10-20 incremental improvement steps. It is then necessary to recompile the full system, and that is also an appropriate time to modify the learning set by including all good-enough post-editions, say N_pe bisegments, and keeping only Size_tw - N_pe bisegments of the unspecialized parallel corpus.

[8] The PE time for GT outputs will be estimated without any supplementary human work, because experiments show a very good correlation between our mixed PE distance ∆m(mt, pe) and Tpe_total(mt).
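The learning-set update described in footnote 7 (keep all good-enough post-editions, then fill the remainder of the Size_tw budget with unspecialized bisegments) can be sketched as follows. This is our reading of the procedure, not code from the project, and the function name is ours:

```python
def next_learning_set(post_editions, generic_corpus, size_tw=10000):
    """Build the learning set for the next full recompilation (footnote 7):
    include all N_pe good-enough post-edited bisegments, and keep only
    Size_tw - N_pe bisegments of the unspecialized parallel corpus."""
    n_pe = len(post_editions)
    return post_editions + generic_corpus[: max(0, size_tw - n_pe)]

# e.g. 3000 post-editions + 7000 generic bisegments -> 10000 bisegments total
corpus = next_learning_set(["pe"] * 3000, ["gen"] * 20000, size_tw=10000)
assert len(corpus) == 10000
assert corpus.count("pe") == 3000 and corpus.count("gen") == 7000
```

As the pool of good-enough post-editions grows over the iterations, the generic share shrinks automatically, until the learning set consists entirely of in-domain tweet bisegments.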
6    References
Boitet, Christian, Hervé Blanchon, Mark Selig-
  man, and Valérie Bellynck. 2009. Evolution
  of MT with the Web. In ”Proceedings of the
  International Conference ’Machine Transla-
  tion 25 Years On’ ”, number from 2008, pages
  1–13, Cranfield, November.
Chandioux, John. 1989. 10 ans de ME-
  TEO (MD). In A Abbou, editor, ”Proceed-
  ings of Traduction Assistée par Ordinateur:
  Perspectives Technologiques, Industrielles et
  Économiques Envisageables à l’Horizon 1990:
  l’Offre, la Demande, les Marchés et les Évo-
  lutions en Cours”, pages 169–172, Paris.
  Daicadif.
emojiList-1.   2015.    Full emoji list.
  http://www.unicode.org/emoji/charts/
  full-emoji-list.html.
emojiList-2.  2015.    Other emoji list.
  http://www.unicode.org/Public/emoji/
  1.0/emoji-data.txt.
GNIP. 2015. Gnip.           https://gnip.com/
  sources/twitter/.
Huynh, Cong-Phap, Christian Boitet, and Hervé
  Blanchon. 2008. SECTra_w.1: An on-
  line collaborative system for evaluating, post-
  editing and presenting MT translation cor-
  pora. In ”Proceedings of the Sixth Interna-
  tional Conference on Language Resources and
  Evaluation”, pages 2571–2576.
Isabelle, Pierre. 1987. Machine Translation at
   the TAUM group. In ”Proceedings of Ma-
   chine Translation Today: The State of the
   Art”, pages 247–277, Edinburgh. Edinburgh
   University Press.
Viséo. 2015. Synaps website. http://www.
   viseo.com/fr/offre/synaps.
Wang, Lingxiao and Christian Boitet. 2013. On-
  line production of HQ parallel corpora and
  permanent task-based evaluation of multiple
  MT systems: both can be obtained through
  iMAGs with no added cost. In ”Proceedings
  of the 2nd Workshop on Post-Editing Tech-
  nologies and Practice at MT Summit 2013”,
  pages 103–110, Nice, September.