  Towards a Gamified System to Improve Translation for
                    Online Meetings

     Laura Guillot1,2, Quentin Bragard1, Ross Smith3, Dan Bean3 and Anthony Ventresque1
             1
               Lero, School of Computer Science, University College Dublin, Ireland
                                 2
                                   École Centrale de Nantes, France
                       3
                         Microsoft Corporation, Skype Division, Seattle, USA
     laura.guillot@eleves.ec-nantes.fr, quentin.bragard@ucdconnect.ie, rosss@microsoft.com,
                         danbean@microsoft.com, anthony.ventresque@ucd.ie



Abstract

Translation of online meetings (e.g., Skype conversations) is a useful feature that can help users to understand each other. However, translations can sometimes be inaccurate or they can miss the context of the discussion. This is for instance the case in corporate environments, where some words are used with special meanings that can be obscure to other people. This paper presents the prototype of a gamified application that aims at improving translations of and for online meetings. In our system, users play to earn points and rewards, and they try to propose and vote for the most accurate translations in context. Our system uses various techniques to split conversations into semantically coherent segments and to label them with relevant keyphrases. This is how we extract a description of the context of a sentence, and we use this context to: (i) weight users' expertise and their translations (e.g., an AI specialist is more likely than a lay person to give a correct translation for a sentence about deep learning); and (ii) map the various translations of words and phrases to their contexts, so that we can use them during online meetings.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: F. Hopfgartner, G. Kazai, U. Kruschwitz, and M. Meder (eds.): Proceedings of the GamifIR 2016 Workshop, Pisa, Italy, 21-July-2016, published at http://ceur-ws.org

1 Introduction

Machine Translation has been a rich field of research for decades, and many directions and models have been proposed [Koe09]. Commercial systems, such as Microsoft Translator¹, Google Translate² or Systran³, are commonplace nowadays and have proven to be useful to users on the Internet and in real life. However, despite their obvious successes, academia and industry still face hard challenges, and many translations proposed 'in the wild' lack quality. By 'in the wild' we mean the translation of short and, at times, low linguistic quality texts, as we see them for instance on online social network applications. This is especially a challenge for Machine Translation, as speech (e.g., during online meetings) and short written posts and comments (e.g., on online social applications) are usual ways of communicating nowadays. These ways of communicating are not only incomplete and noisy, but also contextual in nature: they are often linked to a particular community (e.g., teenagers, employees of an enterprise, members of a professional field) and evolve quickly. For instance, some phrases are subversive and quickly become popular among a group of peers while the rest of the population does not know their meaning or how to use them. The expressions "deadly" (as in, "it was deadly", which means "it was great") that you will hear in Dublin, or "c'est solide" ("it's fair") used by French teenagers, are hardly found in online resources, and Machine Translation systems are unlikely to handle them correctly.

In this paper we propose a gamified application that aims at collecting translations from and for online meetings. First, our system encourages users of online meeting systems (our system is based on Skype) to submit elements of their discussions and the translations that go with them. These elements are then segmented into contextually homogeneous partitions, and we apply a topic labelling mechanism to detect the relevant keyphrases that describe them. Users can then play with our system and try to find the best translations given the context of the sentences. In the back-end, our system selects the best translations depending on both the crowd's preference and the expertise of the players (using a mapping between context and expertise). Our system can then be used during online meetings, where the context is monitored to find the most accurate translation.

We perform evaluations of the topic detection and usability (i.e., how easy and satisfying the interface is) elements of our system. We show that our system finds the right description of the context 65.8% of the time, and that our users find the application simple and pleasing (usability score of 82%).

Using the crowd (and discriminating workers by their competence) to obtain quality translations is not a novel idea as such (e.g., see the work done by Chris Callison-Burch and his team [ZCB11]). However, using a game and the notion of context (to qualify sentences and players' expertise) to increase the quality of the translations is, as far as we know, new.

The rest of this paper is organised as follows: Section 2 describes the first, important, element of our system: the segmentation into semantically homogeneous contexts and their description using topics and keywords; Section 3 describes our prototype: the architecture of the game, the game design and the motivations to play the game; finally, Section 4 concludes our paper and discusses some of the future directions we plan to follow.

1 https://www.microsoft.com/en-us/translator
2 https://translate.google.com/
3 http://www.systransoft.com/
2 Contextual Translation

One of the main ideas behind our work in this paper is that the translation of online meetings, i.e., of speech and potentially short sentences, needs to be correlated to the context of the discussion. This context, in the noisy, limited and community-oriented environment we described in the previous section, is what allows us to get more accurate translations. Especially as our gamified system records the context associated with a sentence and uses it to: (i) help the translators/players to find the best translation by giving them the context of the discussions; and (ii) eventually translate online meetings more accurately.

2.1 Topic Detection

Topic detection consists in discovering the important keywords in a document or a part of a document. In this paper, we combine different techniques to generate a cloud of topic labels. First, we apply a text segmentation [SCS04] method on the meeting transcript to split it into one-subject sections. Then, we retrieve a distribution of keywords for each section using Latent Dirichlet Allocation [MBYNIJ03] (LDA). Eventually, we put a label [Sch09] on the list of keywords using the Wikipedia category network, finding the category that best describes the keyphrases contained in the topic.

Figure 1: Workflow of our topic detection algorithm

2.1.1 Text Tiling

For the first part of our topic detection algorithm, we use a Text Tiling [BIR] method which, given a document, returns a list of segments where each segment corresponds to a single subtopic. This method is based on the computation of the lexical coherence between two adjacent blocks of text to determine where there is a topic change.

We start by pre-processing the document using stop-word removal and tokenisation. This leaves us with a list of tokens (ti) for the document d: d = {t1, t2, ..., tn}. Then, we define a sequence of blocks of tokens, each of the same size (K):

    bi = {tj, j ∈ [i, i + K]}    (1)

The block bi is the one which begins with the token ti. For empirical reasons, we have chosen K = 20.

For each pair of successive blocks in the document, bi and bi+K+1, our method computes the cohesion score of the associated gap gi between the two blocks using the vocabulary introduction metric:

    Score(gi) = (New(i) + New(i + K + 1)) / (2K)    (2)

where New(i) is the number of new terms introduced in the block bi that were not in the document before bi.

Our solution uses these scores to detect where the similarity between two blocks is minimal, using the following depth score metric:

    DepthScore(gi) = Score(gl) + Score(gr) − 2 · Score(gi)    (3)

This metric compares the novelty (in terms of common tokens) between bi and two other blocks, bl and br, which are the blocks with a smaller score than bi on the left and on the right of bi. This gives an indication of the depth of the gap gi: the higher the score, the more dissimilar are the two blocks before and after the gap.
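For concreteness, the following Python sketch implements equations (1)–(3); it is an illustration under our assumptions (a pre-tokenised, stop-word-free token list), not the code of our actual implementation, and it reads gl and gr literally from the text above, as the nearest lower-scoring gaps on each side.

```python
from typing import List

def new_terms(tokens: List[str], i: int, K: int) -> int:
    """New(i): terms of block b_i = tokens[i:i+K] that do not appear
    anywhere in the document before position i."""
    seen_before = set(tokens[:i])
    return sum(1 for t in set(tokens[i:i + K]) if t not in seen_before)

def gap_score(tokens: List[str], i: int, K: int = 20) -> float:
    """Score(g_i), eq. (2): vocabulary introduced around the gap between
    the two successive blocks b_i and b_{i+K+1}."""
    return (new_terms(tokens, i, K) + new_terms(tokens, i + K + 1, K)) / (2 * K)

def depth_score(scores: List[float], i: int) -> float:
    """DepthScore(g_i), eq. (3), with g_l and g_r the nearest gaps on the
    left/right of g_i whose score is smaller than Score(g_i)."""
    left = next((s for s in reversed(scores[:i]) if s < scores[i]), scores[i])
    right = next((s for s in scores[i + 1:] if s < scores[i]), scores[i])
    return left + right - 2 * scores[i]
```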
The issue now is that our metric gives us a large number of local maxima. Thus, we use a smoothing process to highlight the relevant maxima, averaging the scores over a fixed window:

    DepthScore(gi) = (1 / 2a) · Σ_{j=i−a}^{i+a} DepthScore(gj)    (4)

where a is a parameter that we set at 5 after an empirical evaluation and given the size of our documents.

Every time we find a depth score higher than

    s̄ + (max_i DepthScore(gi) − s̄) / 2    (5)

where s̄ is the average of the depth scores, we split the document after the corresponding block, and we expect the section before the split to be about a different topic than the section after.
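The smoothing and splitting steps can be sketched in the same spirit (again, the names and the edge handling are ours; we read the 1/2a factor literally from equation (4), and the window is clipped at the document boundaries):

```python
from typing import List

def smooth(depth_scores: List[float], a: int = 5) -> List[float]:
    """Eq. (4): average the depth scores over the fixed window [i-a, i+a]."""
    return [sum(depth_scores[max(0, i - a): i + a + 1]) / (2 * a)
            for i in range(len(depth_scores))]

def split_points(smoothed: List[float]) -> List[int]:
    """Eq. (5): indices of the gaps after which the document is split."""
    s_bar = sum(smoothed) / len(smoothed)
    threshold = s_bar + (max(smoothed) - s_bar) / 2
    return [i for i, s in enumerate(smoothed) if s > threshold]
```

Note that the threshold in equation (5) sits halfway between the average and the maximum smoothed depth score, so only markedly deep gaps produce a split.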
Figure 2 gives a visual representation of our Text Tiling algorithm applied to a concatenation of 3 Wikipedia articles. This validation gives us an interesting output, as our algorithm splits the text into 4 segments (see Figure 2) while the text is the concatenation of 3 documents. However, we noticed that one of the documents is actually quite semantically heterogeneous and seems to have 2 different topics.

Figure 2: Gap score results of the analysis of the concatenation of 3 articles from Wikipedia (Signal processing, Dublin and Anarchism). The x-axis gives the gap index and the y-axis gives the depth score of each gap. Yellow straight lines show the final segmentation. Notice that while we had 3 articles from Wikipedia, the system split them into 4 segments. This actually makes sense, as one of the articles clearly had two separate contents (article Anarchism in Wikipedia).

2.1.2 Latent Dirichlet Allocation (LDA)

Once the different contextual sections are identified (we assume that they have only one topic), we apply Latent Dirichlet Allocation [MBYNIJ03] (LDA) to each of them. This algorithm gives a discrete distribution of words with their probabilities for each topic. The difference between our scenario and the standard use of LDA is that we potentially apply it to short segments (not a lengthy corpus) and we configure it to retrieve only one topic for each contextual section.

Furthermore, we have chosen to set at 5 the number of words per topic given by the algorithm. This number is enough to enable a correct topic labelling during the next step.

We have used a Java package for the LDA and DMM topic models called jLDADMM⁴, because it provides alternatives for topic modelling on short documents such as tweets or, in our case, segments of a conversation.

4 http://jldadmm.sourceforge.net/
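Our implementation calls jLDADMM; purely as an illustration, the same one-topic, five-word configuration can be sketched with gensim's LdaModel (a stand-in for jLDADMM, not the package we actually use):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def segment_keywords(segment_tokens: list, n_words: int = 5) -> list:
    """Run a one-topic LDA on a single contextual section and return the
    n_words most probable words, mirroring the configuration above."""
    dictionary = Dictionary([segment_tokens])
    corpus = [dictionary.doc2bow(segment_tokens)]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=1, passes=10)
    return [word for word, _ in lda.show_topic(0, topn=n_words)]
```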
2.1.3 Topic Labelling

At the end of the LDA step, we end up with a list of keyword sets (one set per contextual section). The objective is now to obtain one or two labels for each topic. For example, for the following distribution of words: java - classes - objects - algorithm - instantiate, we would like to obtain something like: Object programming - Computer science.

The main idea of our labelling approach is to use the Wikipedia categories as labels. We believe it is a relevant technique for topic labelling, as Wikipedia categories carry good semantic content [Sch09, NS08]. To map words and categories, we have used the articles of Wikipedia and more specifically their titles. For each article, we have taken the title, removed the stop words and returned a list of relevant words. Then, we have mapped each of these words to the categories present in the article. At the end, we have obtained a matrix which gives the probability of a word being related to a category. To build this matrix, we have parsed approximately 15% of the whole Wikipedia XML dump⁵, removing the disambiguation pages, the articles describing categories and the redirection pages. We have also deleted categories which covered too many semantically unrelated articles, like American films or 1900 deaths, as well as unrepresentative words, i.e., words related to too many categories. At the end, we have indexed around 42,000 categories and 64,000 words. Given our matrix, we can retrieve the categories corresponding to each topic, ranked by their combined score (a sketch of this computation follows the notation list below):

    Score(C) = Σ_{w∈T} W(w ∈ T) · P(w ∈ C)    (6)

5 https://dumps.wikimedia.org/

where:

• C is a category
• T is a topic given by LDA (i.e., a list of words)
• w is a word of T
• W(w ∈ T) is the weight of the word w in the topic T
• P(w ∈ C) is the probability for w to be related to C, found in the Wikipedia matrix
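A minimal sketch of equation (6), assuming the Wikipedia matrix is stored as a sparse nested map from words to category probabilities (a representation choice of ours, not from our actual implementation):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def rank_categories(topic: List[Tuple[str, float]],
                    matrix: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Eq. (6): Score(C) = sum over words w of T of W(w in T) * P(w in C),
    with the topic given as (word, weight) pairs from the LDA step."""
    scores: Dict[str, float] = defaultdict(float)
    for word, weight in topic:
        for category, prob in matrix.get(word, {}).items():
            scores[category] += weight * prob
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```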

2.2 Evaluation

To validate our approach, we have used a benchmark (Wiki20⁶ [Med09]) which consists of 20 computer science articles annotated by 15 teams of graduate students. The objective is to measure the similarity between the keyphrases assigned by the humans and by our system.

To compute the similarity between two topic labels, we have used the DISCO API, which is based on a pre-computed database of word similarities called a word space. In this space, each word is associated with a distribution of semantically similar words. Thus, to compare two words, we compare their associated word vectors, built by a statistical analysis of very large text collections. With this tool, for each label assigned by humans, we search for the most similar category retrieved by our algorithm, and we compute the average of the similarity scores of these selected categories. That gives us a similarity value for each team of assessors and each document of the benchmark, shown in Figure 3, i.e., how close our own labels are to the ones picked by the human assessors. The global average over all the documents is 65.8%. As a quick discussion of these results, we should say that this evaluation is not perfect as: (i) it is performed on a homogeneous corpus (all documents are from the field of Computer Science); (ii) the DISCO similarity API is not comprehensive and lacks technical vocabulary. This study probably needs to be replicated and extended to make sure our labelling system is accurate and relevant.

Figure 3: Similarity between the labels selected by our system and the tags chosen by human assessors (grouped in 15 teams). Notice that on average we have 65% agreement with the assessors (not seen on the figure).

6 https://github.com/zelandiya/keyword-extraction-datasets
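DISCO is accessed through its own (Java) API; the sketch below only illustrates the averaging procedure we describe, with similarity standing in for a DISCO word-space lookup (a hypothetical callable, not the DISCO interface):

```python
from typing import Callable, List

def document_agreement(human_labels: List[str],
                       system_categories: List[str],
                       similarity: Callable[[str, str], float]) -> float:
    """For each human-assigned label, keep its best match among our
    retrieved categories, then average these similarities."""
    best_matches = [max(similarity(label, cat) for cat in system_categories)
                    for label in human_labels]
    return sum(best_matches) / len(best_matches)
```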
3 Description of our Prototype

Our system is based on a gamified collection of user feedback: users submit sentences and translations, and they vote for the most accurate ones. In general, gamification has proven to be effective at collecting user feedback [HKS14, Smi12] – but games need to be well designed to give users incentives to participate [RD00]. Our game is composed of three micro-tasks and provides game mechanisms that we believe make users keen to participate. More specifically, our use case is the following:

• Users of Machine Translation systems for online meetings submit some of the translations they are offered (and earn points for doing so)
• Players improve the translations, vote for those they consider the best, and earn points

Our prototype is a web application that uses the Microsoft Skype API⁷ to interact with the online meeting application. At the moment our prototype stores all the collected data in a database that holds user profiles (points, comments, votes), sentences, translations and contexts.

The current section goes through the different elements of our prototype: the system architecture, the game design, the gameplay and the motivation.

7 https://www.skype.com/en/developer/

3.1 Architecture of the Game

Our game is composed of two parts. The first part is the submission of translations during or just after online meetings, and the second part is the evaluation of the submissions.

Figure 4 shows the interface of the submission part of our prototype: some people are having a discussion on an online meeting application (Skype in our prototype, seen on the right of Figure 4) and their speech is being recorded (using a speech-to-text component – see the left of Figure 4).

Figure 4: Interface for submitting and updating translations

Every sentence is translated, and the users can submit the sentences and/or correct them. The context of the discussion is monitored and segmented if required (see Section 2). The context is eventually used to label the sentences – and to keep track of their context.

After being uploaded, the sentences are submitted to the vote of the community. The initial sentence, the associated translations and a cloud of keywords describing the context are given with every sentence. Users have to choose one of the proposed translations or, if none of these translations suits them, they can add their own. An example is given in Figure 5. This consensus decision making leaves the system with all the options (translations), which can be handy in case of evolution or if a statistical model is applied.

Figure 5: Interface for evaluating translations

Our game has two modes: a classical mode and a challenge mode. In the classical mode, players evaluate 10 translations at a time, whereas in the challenge mode they evaluate translations for as long as they keep finding the 'correct' answer (i.e., the one in agreement with what the community thinks) – the objective being to score the highest mark.
3.2 Game Design

The game is designed to follow the look and feel of the Skype online meeting application. The objective is to have a simple and attractive interface motivating people to participate.

To test the usability of our system, i.e., its effectiveness, efficiency and satisfaction, we have used the System Usability Scale [B+96] (SUS). This method gives a score between 0 and 100 which indicates how pleasant an application is to use; according to one study⁸, the average SUS score is 68. It is based on a questionnaire consisting of the 10 following questions:

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.

Users give a score from 1 (strongly disagree) to 5 (strongly agree) to every question. We have conducted the survey with 7 bilingual students on the evaluation part of the game. We have obtained a usability score of 82%, which is a very good score. The worst score was given to the first question. We assume that the reason for this is that some users did not feel immersed in the gamified universe, as they were just asked to try the game once with a fictive account.

8 http://satoriinteractive.com/system-usability-scale-a-quick-usability-scoring-solution
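For reference, the standard SUS scoring [B+96] that we applied to the responses can be sketched as follows (answers are the ten ratings, in question order):

```python
from typing import List

def sus_score(answers: List[int]) -> float:
    """Standard SUS scoring: odd-numbered items contribute (answer - 1),
    even-numbered items (5 - answer); the total is scaled by 2.5 to 0-100."""
    assert len(answers) == 10 and all(1 <= a <= 5 for a in answers)
    contributions = [(a - 1) if i % 2 == 0 else (5 - a)  # i = 0 is question 1
                     for i, a in enumerate(answers)]
    return 2.5 * sum(contributions)
```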
3.3 Game Mechanisms

To motivate the users, all the tasks in our prototype are linked to a Points-Badges-Leaderboard system. Players earn points for each submission or evaluation, and these points enable the players to go through different progress statuses (e.g., beginner, expert). Our prototype also features a leaderboard that summarises users' points and statuses. Other elements of importance in gamified applications that we have implemented are missions (e.g., "evaluate 30 translations today") and trophies that players win as they progress in the leaderboard and by leading it.

The points, trophies and status are displayed on the user profile so that users can see their progress in the game. An example of a profile is given in Figure 6. All these extrinsic motivators are used to increase participation and to give the user a visual representation of an accomplished task.

Figure 6: User profile: note the various elements: score, progress made, missions and trophies.

In addition to the points, each player has a confidence score which tells how relevant their participation is. It is computed as follows:

    CS(p) = (1 + Nv,a(p)/Nv(p) + Ns,a(p)/Ns(p)) · level(p)    (7)

where:

• p: player
• level(p): player p's level (given by the points p earned)
• Nv(p): number of votes made by the player p
• Nv,a(p): number of votes made by p that are approved (see below)
• Ns(p): number of submissions made by the player p
• Ns,a(p): number of submissions made by p that are approved

The confidence score is used to weight the votes of each player p by their declared expertise and their perceived expertise (from the crowd), through how many of their votes were 'correct'.
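A direct sketch of equation (7); the zero-denominator handling is an assumption of ours, as players with no votes or submissions yet are not discussed above:

```python
def confidence_score(level: int, n_votes: int, n_votes_approved: int,
                     n_subs: int, n_subs_approved: int) -> float:
    """Eq. (7): CS(p) = (1 + Nv,a(p)/Nv(p) + Ns,a(p)/Ns(p)) * level(p)."""
    vote_ratio = n_votes_approved / n_votes if n_votes else 0.0  # assumed default
    sub_ratio = n_subs_approved / n_subs if n_subs else 0.0      # assumed default
    return (1 + vote_ratio + sub_ratio) * level
```

For example, a level-3 player whose votes and submissions were all approved gets CS(p) = (1 + 1 + 1) · 3 = 9.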
The relevance of a translation is computed by:

    Relevance(t_i^s) = (Σ_{p ∈ Vfor(t_i^s)} v(p, t_i^s) − Σ_{p ∈ Vagainst(t_i^s)} v(p, t_i^s)) / Σ_{p ∈ V(s)} v(p, t_i^s)    (8)

with

    v(p, t_i^s) = CS(p) · k(p, s)    (9)

where (the computation is sketched after this list):

• t_i^s: translation i of the sentence s
• Vfor(t_i^s): set of the players who voted for t_i^s
• Vagainst(t_i^s): set of the players who voted for the other translations
• V(s): set of the players who have evaluated s
• k(p, s): a factor equal to 1 if the player is familiar with the context of the sentence and 0.5 otherwise – players have the possibility to add to their profile the topics in which they have some knowledge. These topics are used to determine whether the player is familiar with the context of a sentence or not. This factor enables us to weight the votes with the user's expertise.
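A sketch of equations (8) and (9), reading V(s) as the union of the voters for and against the translation; the data layout (sets of player identifiers and per-player maps) is an assumption of ours:

```python
from typing import Dict, Set

def relevance(votes_for: Set[str], votes_against: Set[str],
              cs: Dict[str, float], familiar: Dict[str, bool]) -> float:
    """Eqs. (8)-(9): expertise-weighted net approval of one translation,
    with v(p) = CS(p) * k(p, s) and k = 1.0 (familiar) or 0.5 (not).
    Assumes the sentence received at least one vote."""
    def v(p: str) -> float:
        return cs[p] * (1.0 if familiar.get(p, False) else 0.5)
    net = sum(v(p) for p in votes_for) - sum(v(p) for p in votes_against)
    return net / sum(v(p) for p in votes_for | votes_against)
```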
The confidence score is updated retroactively each time a sentence is approved. This confidence score system is useful to tell the difference between relevant participation and participation only motivated by earning points. In order to motivate people to increase their confidence scores, we have also added a leaderboard based on them.

3.4 Intrinsic Motivation

In addition to these rewarding systems (a.k.a., extrinsic motivation systems), it is important for us to entice people to play through intrinsic motivation [RD00]. Intrinsic motivation in gamification optimises human motivation to use the system and consequently brings better data quality. The objective is that players play for inherent satisfaction and not simply for additional rewards. In our case, we have chosen to focus on the feedback given to the player. For instance, we enable players to see the votes in favour of their own submissions. Thus, they get feedback on their participation and can improve their skills.

Moreover, we also display some statistics about the game, such as the number of validated sentences or the number of submitted sentences, and we provide a visualisation of the approved corrections per language. These pieces of information show the players that they participate for a real purpose and that they are members of a real community. This may be particularly important in a corporate context: staff members of a company could work together to improve the translations within their company – this is a possible extension of our system, which could be deployed/used internally in an enterprise or group. In that particular case, we can probably also count on organizational citizenship behavior (i.e., the willingness to perform tasks that help the organisation without explicit rewards) [SON83] to increase intrinsic motivation.

4 Conclusion and Future Work

In this paper we have presented our prototype of a gamified translation improvement system for online meetings. Our system collects translations during online meetings and asks the crowd to improve them in context. Players earn points when they submit translations and when they vote for them. The votes of the players, weighted by their expertise in specific contexts, help our online meeting translation system to be more accurate – again, in context.

Using a known topic labelling benchmark, we have validated that our topic detection and labelling component works well – we got 65% agreement with the human assessors. We have also conducted a System Usability Scale survey, and the respondents acknowledged that our system is easy to use and brings satisfaction to players – a score of 82%.

As future work, we would like to: (i) test our topic detection and topic labelling algorithms on more exhaustive benchmarks; (ii) improve the topic labelling algorithms using more structural and semantic information (links between categories, text of the articles, hierarchy of categories); (iii) use our system to compare different machine translation systems or different parameters/versions of machine translation systems – this would be particularly interesting for machine learning-based systems; (iv) evaluate our own system borrowing the ideas we can find in the crowdsourcing domain (for instance, what Chris Callison-Burch and his team do in [ZCB11]).
Acknowledgement

This work was supported by Science Foundation Ireland grant 13/RC/2094 to Lero – the Irish Software Research Centre.

References

[B+96] John Brooke et al. SUS: a quick and dirty usability scale. Usability Evaluation in Industry, 189(194), 1996.

[BIR] Satanjeev Banerjee and Alexander I. Rudnicky. A Text Tiling based approach to topic boundary detection in meetings.

[HKS14] Juho Hamari, Jonna Koivisto, and Harri Sarsa. Does gamification work? – A literature review of empirical studies on gamification. In HICSS, 2014.

[Koe09] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2009.

[MBYNIJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

[Med09] Olena Medelyan. Human-competitive automatic topic indexing. PhD thesis, The University of Waikato, 2009.

[NS08] Vivi Nastase and Michael Strube. Decoding Wikipedia categories for knowledge acquisition. In AAAI, 2008.

[RD00] Richard M. Ryan and Edward L. Deci. Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary Educational Psychology, 25(1), 2000.

[Sch09] Peter Schönhofen. Identifying document topics using the Wikipedia category network. Web Intelligence and Agent Systems, 2009.

[SCS04] Nicola Stokes, Joe Carthy, and Alan F. Smeaton. SeLeCT: a lexical cohesion based news story segmentation system. AI Communications, 17(1), 2004.

[Smi12] Ross Smith. The future of work is play. In International Games Innovation Conference, 2012.

[SON83] C. A. Smith, Dennis W. Organ, and Janet P. Near. Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68(4), 1983.

[ZCB11] Omar F. Zaidan and Chris Callison-Burch. Crowdsourcing translation: Professional quality from non-professionals. In ACL, pages 1220–1229, 2011.