=Paper=
{{Paper
|id=Vol-1642/paper4
|storemode=property
|title=Towards a Gamified System to Improve Translation for Online Meetings
|pdfUrl=https://ceur-ws.org/Vol-1642/paper4.pdf
|volume=Vol-1642
|authors=Laura Guillot,Quentin Bragard,Ross Smith,Dan Bean,Anthony Ventresque
|dblpUrl=https://dblp.org/rec/conf/sigir/GuillotBSBV16
}}
==Towards a Gamified System to Improve Translation for Online Meetings==
Laura Guillot (1,2), Quentin Bragard (1), Ross Smith (3), Dan Bean (3) and Anthony Ventresque (1)

(1) Lero, School of Computer Science, University College Dublin, Ireland
(2) École Centrale de Nantes, France
(3) Microsoft Corporation, Skype Division, Seattle, USA

laura.guillot@eleves.ec-nantes.fr, quentin.bragard@ucdconnect.ie, rosss@microsoft.com, danbean@microsoft.com, anthony.ventresque@ucd.ie
Abstract

Translation of online meetings (e.g., Skype conversations) is a useful feature that can help users understand each other. However, translations can sometimes be inaccurate, or they can miss the context of the discussion. This is for instance the case in corporate environments, where some words are used with special meanings that can be obscure to other people. This paper presents the prototype of a gamified application that aims at improving translations of and for online meetings. In our system, users play to earn points and rewards – and they try to propose and vote for the most accurate translations in context. Our system uses various techniques to split conversations into semantically coherent segments and to label them with relevant keyphrases. This is how we extract a description of the context of a sentence, and we use this context to: (i) weight users' expertise and their translations (e.g., an AI specialist is more likely than a lay person to give a correct translation for a sentence about deep learning); (ii) map the various translations of words and phrases to their contexts, so that we can use them during online meetings.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: F. Hopfgartner, G. Kazai, U. Kruschwitz, and M. Meder (eds.): Proceedings of the GamifIR 2016 Workshop, Pisa, Italy, 21-July-2016, published at http://ceur-ws.org

1 Introduction

Machine Translation has been a rich field of research for decades, and many directions and models have been proposed [Koe09]. Commercial systems such as Microsoft Translator (https://www.microsoft.com/en-us/translator), Google Translate (https://translate.google.com/) or Systran (http://www.systransoft.com/) are commonplace nowadays and have proven to be useful to users on the Internet and in real life. However, despite their obvious successes, academia and industry still face hard challenges, and many translations proposed 'in the wild' lack quality. By 'in the wild' we mean the translation of short and, at times, low linguistic quality texts, as we see them for instance on online social network applications. This is especially a challenge for Machine Translation, as speech (e.g., during online meetings) and short written posts and comments (e.g., on online social applications) are usual ways of communicating nowadays. These ways of communicating are not only incomplete and noisy, but also contextual in nature: they are often linked to a particular community (e.g., teenagers, employees of an enterprise, members of a professional field) and evolve quickly. For instance, some phrases are subversive and quickly become popular among a group of peers while the rest of the population does not know their meaning or how to use them. The expression "deadly" (as in "it was deadly", which means "it was great") that you will hear in Dublin, or "c'est solide" ("it's fair") used by French teenagers, are hardly found in online resources, and Machine Translation systems are unlikely to handle them correctly.

In this paper we propose a gamified application that aims at collecting translations from and for online meetings.
First, our system encourages users of online meeting systems (our system is based on Skype) to submit elements of their discussions and the translations that go with them. These elements are then segmented into contextually homogeneous partitions, and we apply a topic labelling mechanism to detect the relevant keyphrases that describe them. Users can then play with our system and try to find the best translations given the context of the sentences. In the back-end, our system selects the best translations depending on both the crowd's preference and the expertise of the players (using a mapping between context and expertise). Our system can then be used during online meetings, where the context is monitored to find the most accurate translation.

We perform evaluations of the topic detection and usability (i.e., how easy and satisfying the interface is) elements of our system. We show that our system finds the right description of the context 65.8% of the time – and that our users find the application simple and pleasing (usability score of 82%).

Using the crowd (and discriminating workers by their competence) to obtain quality translations is not a novel idea as such (e.g., see the work done by Chris Callison-Burch and his team [ZCB11]). However, using a game and the notion of context (to qualify sentences and players' expertise) to increase the quality of the translations is new as far as we know.

The rest of this paper is organised as follows: Section 2 describes the first, important, element of our system: the segmentation into semantically homogeneous contexts and their description using topics and keywords; Section 3 describes our prototype: the architecture of the game, the game design and the motivations to play the game; finally, Section 4 concludes our paper and discusses some of the future directions we plan to follow.

2 Contextual Translation

One of the main ideas behind our work in this paper is that the translation of online meetings, i.e., speech and potentially short sentences, needs to be correlated to the context of the discussion. This context, in the noisy, limited and community-oriented environment we described in the previous section, is what allows us to get more accurate translations. This is especially true as our gamified system records the context associated with a sentence and uses it to: (i) help the translators/players find the best translation by giving them the context of the discussions; and (ii) eventually translate online meetings more accurately.

2.1 Topic Detection

Topic detection consists in discovering the important keywords in a document or a part of a document. In this paper, we combine different techniques to generate a cloud of topic labels. First, we apply a text segmentation [SCS04] method on the meeting transcript to split it into one-subject sections. Then, we retrieve a distribution of keywords for each section using Latent Dirichlet Allocation [MBYNIJ03] (LDA). Eventually, we put a label [Sch09] on the list of keywords using the Wikipedia category network, finding the category that best describes the keyphrases contained in the topic.

2.1.1 Text Tiling

For the first part of our topic detection algorithm, we use a Text Tiling [BIR] method which, given a document, returns a list of segments where each segment corresponds to a single subtopic. This method is based on the computation of the lexical coherence between two adjacent blocks of text, to determine where there is a topic change.

We start by pre-processing the document using stop-word removal and tokenisation. This leaves us with a list of tokens (t_i) for the document d: d = {t_1, t_2, ..., t_n}. Then, we define a sequence of blocks of tokens, each of the same size (K):

    b_i = { t_j , j ∈ [i, i + K] }    (1)

The block b_i is the one which begins with the token t_i. For empirical reasons, we have chosen K = 20.

For each pair of successive blocks in the document, b_i and b_{i+K+1}, our method computes the cohesion score of the associated gap g_i between the two blocks using the vocabulary introduction metric:

    Score(g_i) = (New(i) + New(i + K + 1)) / (2K)    (2)

where New(i) is the number of new terms introduced in the block b_i that were not in the document before b_i.

Our solution uses these scores to detect where the similarity between two blocks is minimal, using the following depth score metric:

    DepthScore(g_i) = Score(g_l) + Score(g_r) - 2 · Score(g_i)    (3)

This metric compares the novelty (in terms of common tokens) between b_i and two other blocks, b_l and b_r, which are the two blocks with a smaller score than b_i on the left and on the right of b_i. This gives an indication of the depth of the gap g_i: the higher the score, the more dissimilar the two blocks before and after the gap are.
Figure 1: Workflow of our topic detection algorithm
The issue now is that our metric gives us a large number of local maxima. Thus, we use a smoothing process here to highlight the relevant maxima, by averaging them with a fixed window:

    DepthScore(g_i) = (1 / 2a) · Σ_{j = i-a}^{i+a} DepthScore(g_j)    (4)

a is a parameter that we set at 5 after an empirical evaluation and given the size of our documents. Then, every time we find a depth score higher than:

    s̄ + (max_i DepthScore(g_i) - s̄) / 2    (5)

where s̄ is the average of the scores, we split the document after the block and expect the section before to be about a different topic than the section after. Figure 2 gives a visual representation of our Text Tiling algorithm applied to a concatenation of 3 Wikipedia articles. This validation gives us an interesting output, as our algorithm splits the text into 4 segments (see Figure 2) – while the text is the concatenation of 3 documents. However, we noticed that one of the documents is actually quite heterogeneous semantically and seems to have 2 different topics.

Figure 2: Gap score results of the analysis of the concatenation of 3 articles from Wikipedia (Signal processing, Dublin and Anarchism). The x-axis gives the gap index and the y-axis gives the depth score of each gap. Yellow straight lines show the final segmentation. Notice that while we had 3 articles from Wikipedia, the system split them into 4 segments. This actually makes sense, as one of the articles clearly had two separate contents (article Anarchism in Wikipedia).
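To make the whole segmentation step concrete, the following sketch strings Eqs. (1)-(5) together. It is an illustration rather than the prototype's actual code: it assumes the transcript has already been stop-word-filtered and tokenised, and it reads b_l and b_r in Eq. (3) as the nearest local maxima on each side of the gap.

    # Sketch of the Text Tiling segmentation (Eqs. 1-5); illustrative only.
    def split_points(tokens, K=20, a=5):
        """Return the indices i of the blocks b_i after which the text is split."""
        starts = list(range(0, len(tokens) - (2 * K + 1)))  # b_i starts at token i (Eq. 1)
        if not starts:
            return []

        def new(i):  # New(i): tokens of block b_i that never appeared before b_i
            seen = set(tokens[:i])
            return sum(1 for t in tokens[i:i + K] if t not in seen)

        # Eq. 2: vocabulary introduction score of the gap g_i between b_i and b_{i+K+1}
        score = [(new(i) + new(i + K + 1)) / (2.0 * K) for i in starts]

        # Eq. 3: depth of each gap; b_l / b_r taken as the nearest local
        # maximum on each side (our reading of the definition above)
        def side_peak(i, step):
            j = i
            while 0 <= j + step < len(score) and score[j + step] >= score[j]:
                j += step
            return score[j]

        depth = [side_peak(i, -1) + side_peak(i, 1) - 2 * score[i]
                 for i in range(len(score))]

        # Eq. 4: smooth the depth scores with a fixed window of half-width a
        smooth = [sum(depth[max(0, i - a):i + a + 1]) / (2.0 * a)
                  for i in range(len(depth))]

        # Eq. 5: split wherever the smoothed depth exceeds the cutoff
        mean = sum(smooth) / len(smooth)
        cutoff = mean + (max(smooth) - mean) / 2.0
        return [starts[i] for i, d in enumerate(smooth) if d > cutoff]

The function returns the indices of the blocks after which the text is cut; K = 20 and a = 5 follow the values chosen above.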
2.1.2 Latent Dirichlet Allocation (LDA)

Once the different contextual sections are identified (we assume that they have only one topic), we apply Latent Dirichlet Allocation [MBYNIJ03] (LDA) to each of them. This algorithm gives a discrete distribution of words with their probabilities for each topic. The difference between our scenario and the standard use of LDA is that we potentially apply it to short segments (not a lengthy corpus), and we configure it to retrieve only one topic for each contextual section. Furthermore, we have chosen to set the number of words per topic given by the algorithm at 5. This number is enough to enable correct topic labelling during the next step. We have used a Java package for the LDA and DMM topic models called jLDADMM (http://jldadmm.sourceforge.net/) because it provides alternatives for topic modelling on short documents, such as tweets or, in our case, segments of a conversation.
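The prototype performs this step with the Java package jLDADMM. Purely as an illustration of the configuration described above (one topic per contextual section, five words per topic), here is a minimal equivalent sketch using the Python library gensim; the library choice and the toy token lists are our assumptions, not the authors' code.

    from gensim import corpora, models

    def topic_words(section_sentences):
        """section_sentences: the tokenised sentences of one contextual section."""
        dictionary = corpora.Dictionary(section_sentences)
        bow = [dictionary.doc2bow(s) for s in section_sentences]
        # one topic per section, as described above
        lda = models.LdaModel(bow, id2word=dictionary, num_topics=1, passes=10)
        return lda.show_topic(0, topn=5)  # five (word, probability) pairs

    print(topic_words([["java", "classes"], ["objects", "algorithm", "instantiate"]]))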
2.1.3 Topic Labelling

At the end of the LDA step, we end up with a list of keyword sets (one set per contextual section). The objective is now to obtain one or two labels for each topic. For example, for the following distribution of words: java - classes - objects - algorithm - instantiate, we would like to obtain something like: Object programming - Computer science.

The main idea of our labelling approach is to use the Wikipedia categories as labels. We believe it is a relevant technique for topic labelling, as Wikipedia categories carry good semantic content [Sch09, NS08]. To map words and categories, we have used the articles of Wikipedia, and more specifically their titles. For each article, we have taken the title, removed the stop words and returned a list of relevant words. Then, we have mapped each of these words to the categories present in the article. At the end, we have obtained a matrix which gives the probability of a word being related to a category. To build this matrix, we have parsed approximately 15% of the whole Wikipedia XML dump (https://dumps.wikimedia.org/), removing the disambiguation pages, the articles describing categories and the redirection pages. We have also deleted categories which covered too many semantically unrelated articles, like American films or 1900 deaths, as well as unrepresentative words, i.e., words related to too many categories. At the end, we have indexed around 42,000 categories and 64,000 words. Given our matrix, we can retrieve the categories corresponding to each topic, ranked by their combined score:

    Score(C) = Σ_{w ∈ T} W(w ∈ T) · P(w ∈ C)    (6)

where:

• C is a category
• T is a topic given by LDA (i.e., a list of words)
• w is a word of T
• W(w ∈ T) is the weight of the word w in the topic T
• P(w ∈ C) is the probability for w to be related to C, found in the Wikipedia matrix
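A small sketch of Eq. (6) follows. It assumes the Wikipedia matrix is held as a nested dictionary mapping each word to its category probabilities, and the topic as (word, weight) pairs – data-structure choices made for illustration, not prescribed by the system.

    from collections import defaultdict

    def label_topic(topic, P, n_labels=2):
        """topic: list of (word, weight) pairs; P: dict word -> {category: probability}."""
        scores = defaultdict(float)
        for word, weight in topic:
            for category, prob in P.get(word, {}).items():
                scores[category] += weight * prob  # Eq. 6: W(w in T) * P(w in C)
        return sorted(scores, key=scores.get, reverse=True)[:n_labels]

    # Toy example in the spirit of the java/classes/objects distribution above:
    P = {"java": {"Programming languages": 0.6, "Coffee": 0.2},
         "objects": {"Object-oriented programming": 0.5}}
    print(label_topic([("java", 0.4), ("objects", 0.3), ("algorithm", 0.3)], P))
    # -> ['Programming languages', 'Object-oriented programming']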
2.2 Evaluation

To validate our approach, we have used a benchmark, Wiki20 [Med09] (https://github.com/zelandiya/keyword-extraction-datasets), which consists of 20 computer science articles annotated by 15 teams of graduate students. The objective is to measure the similarity between the keyphrases assigned by the humans and the labels assigned by our system.
To compute the similarity between two topic labels, we have used the DISCO API, which is based on a pre-computed database of word similarities called a word space. In this space, each word is associated with a distribution of semantically similar words. Thus, to compare two words, we compare their associated word vectors using a statistical analysis done on very large text collections. With this tool, for each label assigned by humans, we search for the most similar category retrieved by our algorithm, and we compute the average of the similarity scores of these selected categories. That gives us a similarity value for each document of the dataset. Figure 3 shows how close our own labels are to the ones picked by the human assessors, for each team of assessors and each document of the benchmark. The global average over all the documents is 65.8%.
Figure 3: Similarity between the labels selected by our system and the tags chosen by human assessors (grouped in 15 teams). Notice that on average we have 65% agreement with the assessors (not shown on the figure).
As a quick discussion of these results, we should say that this evaluation is not perfect as: (i) it is performed on a homogeneous corpus (all documents are from the field of Computer Science); (ii) the DISCO similarity API is not comprehensive and lacks technical vocabulary. This study probably needs to be replicated and extended to make sure our labelling system is accurate and relevant.

3 Description of our Prototype

Our system is based on a gamified collection of user feedback: users submit sentences and translations, and they vote for the most accurate ones. In general, gamification has proven to be effective at collecting user feedback [HKS14, Smi12] – but games need to be well designed to give users incentives to participate [RD00]. Our game is composed of three micro-tasks and provides game mechanisms that we believe make users keen to participate. More specifically, our use case is the following:

• Users of Machine Translation systems for online meetings submit some of the translations they are offered (they earn points for doing that)

• Players improve the translations, vote for those they consider the best, and earn points

Our prototype is a web application that uses the Microsoft Skype API (https://www.skype.com/en/developer/) to interact with the online meeting application. At the moment, our prototype ingests all the collected data into a database that stores user profiles (points, comments, votes), sentences, translations and contexts.

The current section goes through the different elements of our prototype: the system architecture, the game design, the gameplay and the motivation.

3.1 Architecture of the Game

Our game is composed of two parts. The first part is the submission of translations during or just after online meetings, and the second part is the evaluation of the submissions.

Figure 4 shows the interface of the submission part of our prototype: some people are having a discussion on an online meeting application (Skype in our prototype, seen on the right of Figure 4) and their speech is being recorded (using a speech-to-text component – see the left of Figure 4).

Every sentence is translated, and the users can submit the sentences and/or correct them. The context of the discussion is monitored and segmented if required (see Section 2). The context is eventually used to label the sentences – and to keep track of their context.

After being uploaded, the sentences are submitted to the vote of the community. The initial sentence, the associated translations and a cloud of keywords which describes the context are given with every sentence. Users have to choose one of the proposed translations or, if none of these translations suits them, they can add their own. An example is given in Figure 5. This consensus decision making leaves the system with all the options (translations), which can be handy in case of evolution or if a statistical model is applied.

Our game has two modes: a classical mode and a challenge mode. In the classical mode, players evaluate 10 translations at a time, whereas in the challenge mode they evaluate translations for as long as they find the 'correct' answer (i.e., the one in agreement with what the community thinks) – the objective being to score the highest mark.

3.2 Game Design

The game is designed to follow the look and feel of the Skype online meeting application. The objective is to have a simple and attractive interface motivating people to participate.

To test the usability of our system, i.e., its effectiveness, efficiency and satisfaction, we have used the System Usability Scale [B+96] (SUS). This method gives a score between 0 and 100 which indicates how pleasant an application is to use; the SUS framework sets the average at 68, according to a study (http://satoriinteractive.com/system-usability-scale-a-quick-usability-scoring-solution). It is based on a questionnaire consisting of the following 10 questions:

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.

Users give a score from 1 (strongly disagree) to 5 (strongly agree) for every question. We have conducted the survey with 7 bilingual students on the evaluation part of the game.
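The text reports the resulting 0-100 score without spelling out the arithmetic; for reference, here is the conventional SUS scoring of the 10 answers (a sketch assuming the standard scheme, which the paper does not detail).

    def sus_score(answers):
        """answers: 10 responses, each from 1 (strongly disagree) to 5 (strongly agree)."""
        assert len(answers) == 10
        raw = 0
        for i, a in enumerate(answers):
            # odd-numbered questions (1, 3, 5, ...) are positively worded
            raw += (a - 1) if i % 2 == 0 else (5 - a)
        return raw * 2.5  # scale the 0-40 raw sum to 0-100

    print(sus_score([5, 1, 5, 2, 4, 1, 5, 2, 4, 1]))  # -> 90.0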
Figure 4: Interface for submitting and updating translations
We have obtained a usability score of 82%, which is a very good score. The worst score was given to the first question. We assume that the reason for this is that some users did not feel immersed in the gamified universe, as they were just asked to try the game once with a fictive account.

3.3 Game Mechanisms

To motivate the users, all the tasks in our prototype are linked to a Points-Badges-Leaderboard system. Players earn points for each submission or evaluation, and these points enable the players to go through different progress statuses (e.g., beginner, expert). Our prototype also includes a leaderboard that summarises users' points and statuses. Other elements of importance in gamified applications that we have implemented are missions (e.g., "evaluate 30 translations today") and trophies that players win as they progress in the leaderboard and by leading it.

The points, trophies and status are displayed on the user profile so that users can see their progress in the game. You can see an example of a profile in Figure 6. All these extrinsic motivators are used to increase participation and to give the user a visual representation of an accomplished task.

In addition to the points, each player has a confidence score which tells how relevant their participation is. It is computed as follows:

    CS(p) = (1 + N_{v,a}(p) / N_v(p) + N_{s,a}(p) / N_s(p)) × level(p)    (7)

where:

• p: player
• level(p): player p's level (given by the points p has earned)
• N_v(p): number of votes made by the player p
• N_{v,a}(p): number of votes made by p that are approved (see below)
• N_s(p): number of submissions made by the player p
• N_{s,a}(p): number of submissions made by p that are approved

The confidence score is used to weight the votes of each player p by their declared expertise and their perceived expertise (from the crowd), i.e., through how many of their votes were 'correct'.

The relevance of a translation is computed by:

    Relevance(t_i^s) = ( Σ_{p ∈ V_for(t_i^s)} v(p, t_i^s) - Σ_{p ∈ V_against(t_i^s)} v(p, t_i^s) ) / Σ_{p ∈ V(s)} v(p, t_i^s)    (8)

with

    v(p, t_i^s) = CS(p) · k(p, s)    (9)

where:

• t_i^s: translation i of the sentence s
• V_for(t_i^s): set of the players who voted for t_i^s
• V_against(t_i^s): set of the players who voted for the other translations
• V(s): set of the players who have evaluated s
• k(p, s): a factor equal to 1 if the player is familiar with the context of the sentence, and 0.5 otherwise – players have the possibility to add topics in which they have some knowledge to their profile. These topics are used to determine whether the player is familiar with the context of a sentence or not. This factor makes it possible to weight the votes with the user's expertise.
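The following compact sketch shows how Eqs. (7)-(9) fit together. The player records, the vote bookkeeping and the max(..., 1) guard against players with no activity yet are illustrative assumptions; only the formulas come from the text.

    def confidence(p):
        """Eq. 7: CS(p) = (1 + N_va/N_v + N_sa/N_s) * level(p)."""
        return (1 + p["votes_approved"] / max(p["votes"], 1)
                  + p["subs_approved"] / max(p["subs"], 1)) * p["level"]

    def vote_weight(p, familiar):
        """Eq. 9: v(p, t) = CS(p) * k(p, s), with k = 1 or 0.5."""
        return confidence(p) * (1.0 if familiar else 0.5)

    def relevance(votes_for, votes_against):
        """Eq. 8: each argument is a list of (player, familiar_with_context) pairs."""
        v_for = sum(vote_weight(p, f) for p, f in votes_for)
        v_against = sum(vote_weight(p, f) for p, f in votes_against)
        return (v_for - v_against) / (v_for + v_against)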
Figure 5: Interface for evaluating translations
The confidence score is updated retroactively each time a sentence is approved. This confidence score system is useful to tell the difference between relevant participation and participation only motivated by earning points. In order to motivate people to increase their confidence scores, we have also added a leaderboard based on them.

3.4 Intrinsic Motivation

In addition to these rewarding systems (a.k.a. extrinsic motivation systems), it is important for us to entice people to play through intrinsic motivation [RD00]. Intrinsic motivation in gamification optimises human motivation to use the system and consequently brings better data quality. The objective is that players play for inherent satisfaction and not simply for additional rewards. In our case, we have chosen to focus on the feedback given to the player. For instance, we enable players to see the votes in favour of their own submissions. Thus, they get feedback on their participation and can improve their skills.

Moreover, we also display some statistics about the game, such as the number of validated sentences or the number of submitted sentences, and we provide a visualisation of the approved corrections per language. These pieces of information show the players that they participate for a real purpose and that they are members of a real community. This may be particularly important in a corporate context: staff members of a company could work together to improve the translations within their company – this is a possible extension of our system, which could be deployed/used internally in an enterprise or group. In that particular case, we can probably also count on organizational citizenship behavior (i.e., the willingness to perform tasks that help the organisation without explicit rewards) [SON83] to increase intrinsic motivation.

4 Conclusion and Future Work

In this paper we have presented our prototype of a gamified translation improvement system for online meetings. Our system collects translations during online meetings and asks the crowd to improve them in context. Players earn points when they submit translations and when they vote for them. The votes of the players, weighted by their expertise in specific contexts, help our online meeting translation system be more accurate – again, in context.

Using a known topic labelling benchmark, we have validated that our topic detection and labelling component works well – we got 65% agreement with human assessors. We have also conducted a System Usability Scale survey, and the respondents acknowledged that our system is easy to use and brings satisfaction to the players – a score of 82%.

As future work, we would like to: (i) test our topic detection and topic labelling algorithms on more exhaustive benchmarks; (ii) improve the topic labelling algorithms using more structural and semantic information (links between categories, text of the articles, hierarchy of categories); (iii) use our system to compare different machine translation systems or different parameters/versions of machine translation systems – this would be particularly interesting for machine learning-based systems; and (iv) evaluate our own system borrowing ideas from the crowdsourcing domain (for instance, what Chris Callison-Burch and his team do in [ZCB11]).
Figure 6: User profile. Note the various elements: score, progress made, missions and trophies.
Acknowledgement

This work was supported by Science Foundation Ireland grant 13/RC/2094 to Lero – the Irish Software Research Centre.

References

[B+96] John Brooke et al. SUS – a quick and dirty usability scale. Usability Evaluation in Industry, 189(194), 1996.

[BIR] Satanjeev Banerjee and Alexander I. Rudnicky. A TextTiling based approach to topic boundary detection in meetings.

[HKS14] Juho Hamari, Jonna Koivisto, and Harri Sarsa. Does gamification work? – A literature review of empirical studies on gamification. In HICSS, 2014.

[Koe09] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2009.

[MBYNIJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

[Med09] Olena Medelyan. Human-competitive automatic topic indexing. PhD thesis, The University of Waikato, 2009.

[NS08] Vivi Nastase and Michael Strube. Decoding Wikipedia categories for knowledge acquisition. In AAAI, 2008.

[RD00] Richard M. Ryan and Edward L. Deci. Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary Educational Psychology, 25(1), 2000.

[Sch09] Peter Schönhofen. Identifying document topics using the Wikipedia category network. WI/AS, 2009.

[SCS04] Nicola Stokes, Joe Carthy, and Alan F. Smeaton. SeLeCT: a lexical cohesion based news story segmentation system. AI Communications, 17(1), 2004.

[Smi12] Ross Smith. The future of work is play. In International Games Innovation Conference, 2012.

[SON83] C. A. Smith, Dennis W. Organ, and Janet P. Near. Organizational citizenship behavior: Its nature and antecedents. Journal of Applied Psychology, 68(4), 1983.

[ZCB11] Omar F. Zaidan and Chris Callison-Burch. Crowdsourcing translation: Professional quality from non-professionals. In ACL, pages 1220–1229, 2011.