 Crowdsourcing Language Resources for Dutch using PYBOSSA: Case Studies
              on Blends, Neologisms and Language Variation
                                         Peter Dekker, Tanneke Schoonheim
                                Instituut voor de Nederlandse Taal (Dutch Language Institute)

In this paper, we evaluate PYBOSSA, an open-source crowdsourcing framework, by performing case studies on blends, neologisms and
language variation. We describe the procedural aspects of crowdsourcing, such as working with a crowdsourcing platform and reaching
the desired audience. Furthermore, we analyze the results, and show that crowdsourcing can shed new light on how language is used by

Keywords: crowdsourcing, lexicography, neologisms, language variation

                    1.   Introduction                               Open-task crowdsourcing has been applied to lexicography
Crowdsourcing (or: citizen science) has been shown to be a          for other languages, such as Slovene, where crowdsourcing
quick and cost-efficient way to have a large number of lay          was integrated in the thesaurus and collocation dictionary
people perform tasks that normally have to be performed by          applications (Holdt et al., 2018; Kosem et al., 2018). In
a small number of experts (Holley, 2010; Causer                     addition to this goal of language documentation, we would like
et al., 2018). In this paper, we use the PYBOSSA (PB)               to use crowdsourcing to make language material available
framework1 for crowdsourcing language resources for the             for language learners.
Dutch language. We will describe our experiences with
this framework, to accomplish the goals of language doc-                                   2.    Method
umentation and generation of language learning material.
In addition to sharing our experiences, we will report on           As the basis for our experiments, we hosted an instance
linguistic findings based on the experiments we performed           of PYBOSSA at our institute. We named our crowdsourc-
on blends, neologisms and language variation.                       ing platform Taalradar (‘language radar’): this signifies
For the Dutch language, crowdsourcing has been valuable             both the ‘radar’ (overview) we would like to gain over
in the past. We distinguish two types of approaches. On the         the entire language through crowdsourcing, and the per-
one hand, there are fixed tasks, where more or less one an-         sonal ‘language radar’ or linguistic intuition of contribu-
swer is correct. In fixed-task form, crowdsourcing has been         tors, which we would like to exploit. We ran two crowd-
applied to the transcription of letters from 17th and 18th-         sourcing rounds: in September 2018 and in November-
century Dutch sailors (Van der Wal et al., 2012) and his-           December 2018.
torical Dutch Bible translations (Beelen and Van der Sijs,
2014).                                                              2.1.   Tasks
On the other hand, there are open tasks, referred to in em-         We designed four tasks, which are well-suited to reach our
pirical sciences as elicitation tasks, where different answers      goals: documentation of the Dutch language and develop-
by different users are welcomed, in order to capture varia-         ing material for language learning. Since we would like to
tion. Examples of open tasks are Palabras (Burgos et al.,           get a picture of the speakers of the language, we ask for
2015; Sanders et al., 2016), where lay native Dutch speak-          user details (gender, age and city of residence) in all tasks.
ers were asked to transcribe vowels produced by L2 learn-           The tasks were created as JavaScript/HTML files inside PB.
ers, and Emigrant Dutch2, which tries to capture the lan-
guage use of emigrant Dutch speakers. Of course, mixed              Tasks 1 and 2: Blends analysis and recognition Blends
forms between open and fixed tasks are possible.                    are compound words, formed “by fusing parts of at least
The Dutch Language Institute strives to document the lan-           two other source words of which either one is shortened in
guage as it is used, by compiling language resources (e.g.          the fusion and/or where there is some form of phonemic or
dictionaries) based on corpora from different sources, such         graphemic overlap of the source words” (Gries, 2004). An
as newspapers and websites. Fixed-task crowdsourcing,               example of a blend in both English and Dutch is brunch,
such as transcription and correction, can help in this pro-         which consists of breakfast and lunch. For our experi-
cess. However, we see even greater possibilities for open-          ments, we used blends collected for the Algemeen Ned-
task crowdsourcing, asking speakers how they use and per-           erlands Woordenboek (Dictionary of Contemporary Dutch;
ceive the language, which we will explore in this paper.            ANW) (Tiberius and Niestadt, 2010; Schoonheim and Tem-
                                                                    pelaars, 2010; Tiberius and Schoonheim, 2016).
   1 Homepage: http://pybossa.com.                                  We developed two tasks: analysis and recognition of
     DOI: https://doi.org/10.5281/zenodo.1485460                    blends. In the analysis task, contributors are presented with
   2 http://www.meertens.knaw.nl/vertrokken-nederlands/             a blend, and asked which source words the blend con-
                                                                    sists of. No context of the blend is provided. 10 blends are

presented in total. Figure 1 shows the task as it is presented    of the international society of Dutch linguistics and Drongo
to the contributor.                                               festival, an event for the language sector in The Nether-
                                                                  lands. Finally, we advertised our experiments via social
                                                                  media (Twitter, LinkedIn) and a Dutch linguistics blog.

                                                                                         3.    Results
                                                                  The results section consists of two parts. We will first de-
        Figure 1: Screenshot of the blends analysis task.         scribe our experiences with PYBOSSA as a crowdsourcing
                                                                  platform. Then, we will report on the linguistic findings on
In the recognition task, contributors are presented with a ci-    the language phenomena we performed experiments on.
tation from the ANW dictionary. 10 citations are presented.
Contributors should recognize the blend in the citation. Ev-      3.1.    Experiences of crowdsourcing with
ery citation contains one blend, but we ask for “one or mul-              PYBOSSA
tiple blends” and present users with three input fields to enter  Table 1 shows the number of contributors for each of the
blends. We deliberately designed the task in this somewhat        tasks. It can be observed that only a small number of visi-
deceptive way, to see which other words are candidates for        tors did not finish the whole task. This could be due to the
being perceived as blends.                                        small number of questions we offered per task. The tasks in
Task 3: Neologisms In this task, contributors were asked          the second round (November-December 2018) received fewer
to judge neologisms (new words) in a citation, on two cri-        contributors than those in the first round (September 2018); this
teria: endurance of the concept (“This word will be used          could be due to a less prominent place of the announcement
for a long time.”) and diversity of users and situations (“This   in our newsletter in the second round. In all experiments,
word will be used by different people [e.g. young, old] in        more women than men participated. Also, participants with
different situations [e.g. conversation, newspaper].”). We        ages above 50 were well represented. More participants
selected these two criteria from the FUDGE test, a test           came from the Netherlands than from Flanders.
to rate the sustainability of a neologism, which normally
consists of 5 criteria (Metcalf, 2004). The neologisms and         Task                  # started   # completed   period
their citations were taken from newspaper material, which          Blends analysis       326         305           sept 2018
is used in the lexicographic workflow (see Section 4). From        Blends recognition    223         209           sept 2018
this corpus, sentences which contain a hitherto unknown            Neologisms            118         111           nov-dec 2018
word are extracted: these are possible neologisms, but can         Language variation    114         108           nov-dec 2018
also be words that have been formed ad hoc. Lexicogra-
phers accept or reject a word as neologism. We presented                   Table 1: Number of contributors per task.
15 words in a citation to users: 5 which have been attested
by lexicographers as neologisms, 5 which have been re-            We will now discuss our experiences with the PYBOSSA
jected as neologisms, and 5 unattested words.                     platform. A strength of PB is the freedom it offers when de-
Task 4: Language variation In this task, contributors are         signing a task: the whole interface can be written in HTML
asked what they call a certain concept or how they would ex-      and JavaScript and can be customized. This also makes it
press a certain sentence. The goal is to chart dialectal vari-    easy to share tasks with other researchers4 . The account
ation, but also other kinds of language variation. We used        system and the saving/loading of tasks are handled by PB, so
a list of questions from Taalverhalen.be, a website which         these do not have to be implemented by the task developer.
tries to chart language variation using questionnaires3 . The     Responses of the PB authors on the bug tracker are quick
list contains 16 questions: 9 questions on words for sweets,      and concise. It is clear that PB is mainly designed for fixed-
and 6 questions about the general vocabulary. An example          task crowdsourcing, and does not focus on variation or the de-
of a question is: “What do you call VINEGAR?”. In addition        tails of the contributor. For open-task, linguistic purposes,
to the user details we ask in other tasks (gender, age, city      some points require attention (at the time of writing). Firstly,
of residence), we also ask for province, mother tongue and        there is no built-in support for asking for contributor details.
educational level.                                                We handled this by asking for contributor details via a normal
                                                                  question. However, since all given answers are visible pub-
2.2.     Audience                                                 licly in PB, this also applies to the details, which may not
Our experiments were advertised via our institutional             be ideal from a privacy perspective. Secondly, contributors
newsletter, which reaches 3891 subscribers with an inter-         cannot go back to a previous task and change their answers.
est in language. We assume this was the channel with the          Thirdly, multiple anonymous logins from the same com-
largest reach: in the first round, the newsletter article re-     puter are not allowed, making it harder to use PB at, e.g.,
ceived 519 clicks, and in the second round, 65 clicks. In         a trade fair. A workaround is possible, but is not built into PB
both rounds, we observed an increase in contributions after       by default. Also, anonymous users are identified by IP ad-
the release of the newsletter. Additionally, we attended two      dress: this can cause problems when multiple anonymous
linguistics events, where we offered visitors the possibility     contributors connect via a shared internet connection, such
to engage in our crowdsourcing experiments: the meeting
     3 http://taalverhalen.be, maintained by Miet Ooms.           4 Our tasks can be downloaded from:
                                                                    https://github.com/INL/taalradar.

as in classroom use. Finally, there is no built-in possibility     Word                       Endurant   Diverse     Status
for a contributor to stop answering after a subset of the total
number of questions available, and be shown an end screen.         gendertransformatie        91.2%      70.2%       accepted
All in all, PYBOSSA is a convenient crowdsourcing tool,            insectenafname             83.5%      55.7%       unattested
but has its limitations with regard to open-task crowdsourc-       dreigingsmonitor           79.6%      54.0%       accepted
ing.                                                               belevenisstad              64.0%      55.3%       accepted
                                                                   vluchtelingenpraktijk      62.8%      46.9%       rejected
3.2.   Linguistic findings                                         multimediamerk             62.5%      52.7%       unattested
Blends For the blends analysis task, we compared the               zonnepriesteres            52.2%      17.4%       unattested
contributor answers to the attested analyses from the ANW.         seniorenmodebranche        47.4%      28.9%       unattested
Contributors showed an average accuracy of 42%, with av-           moeilijkheidsparadox       45.2%      21.7%       unattested
erage accuracies per word ranging between 2% and 83%. Table 2      afradertje                 43.5%      38.3%       rejected
shows the given answers for the analysis of the blend pref-        tijdstrends                38.3%      27.8%       rejected
erendum. This shows that there is not always one correct           nachtnanny                 33.3%      18.0%       accepted
analysis of a blend, when a related noun and verb can both         lighttaks                  26.5%      25.7%       accepted
be filled in as source word: while prefereren ‘to prefer’ +        korttheater                20.4%      15.0%       rejected
referendum is the attested analysis, preferentie ‘preference’      dieetopenbaring            8.7%       13.0%       rejected
+ referendum may also be an option. It is even more in-
teresting to see that a number of contributors analyze this       Table 4: Endurance and diversity judgments for the 15
blend entirely differently than the attested analysis: they       words in the neologisms task, ordered by % endurance. To-
analyze the blend as pre ‘before’ + referendum.                   tal number of contributors per word varies between 111 and
                                                                  115. The rightmost column shows whether the word has
                                                                  been manually attested, and if so, whether it has been accepted
                                                                  or rejected as a neologism.
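
The relation between lexicographer status and crowd judgments in Table 4 can be summarized with a short script. The following is a minimal sketch, not the procedure used in this paper: it copies the percentages from Table 4 and computes unweighted mean endurance and diversity scores per status. Averaging the percentages directly is an assumption that seems reasonable only because the number of contributors per word is nearly constant (111 to 115); the function name mean_scores_by_status is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (word, % endurance, % diversity, lexicographer status), copied from Table 4
TABLE_4 = [
    ("gendertransformatie", 91.2, 70.2, "accepted"),
    ("insectenafname", 83.5, 55.7, "unattested"),
    ("dreigingsmonitor", 79.6, 54.0, "accepted"),
    ("belevenisstad", 64.0, 55.3, "accepted"),
    ("vluchtelingenpraktijk", 62.8, 46.9, "rejected"),
    ("multimediamerk", 62.5, 52.7, "unattested"),
    ("zonnepriesteres", 52.2, 17.4, "unattested"),
    ("seniorenmodebranche", 47.4, 28.9, "unattested"),
    ("moeilijkheidsparadox", 45.2, 21.7, "unattested"),
    ("afradertje", 43.5, 38.3, "rejected"),
    ("tijdstrends", 38.3, 27.8, "rejected"),
    ("nachtnanny", 33.3, 18.0, "accepted"),
    ("lighttaks", 26.5, 25.7, "accepted"),
    ("korttheater", 20.4, 15.0, "rejected"),
    ("dieetopenbaring", 8.7, 13.0, "rejected"),
]


def mean_scores_by_status(rows):
    """Return {status: (mean % endurance, mean % diversity)}, rounded to 1 decimal.

    Hypothetical helper: an unweighted mean over words, which ignores the
    small differences in the number of contributors per word.
    """
    grouped = defaultdict(list)
    for _word, endurance, diversity, status in rows:
        grouped[status].append((endurance, diversity))
    return {
        status: (
            round(mean(e for e, _ in scores), 1),
            round(mean(d for _, d in scores), 1),
        )
        for status, scores in grouped.items()
    }


if __name__ == "__main__":
    summary = mean_scores_by_status(TABLE_4)
    for status in sorted(summary):
        endurance, diversity = summary[status]
        print(f"{status:<11} endurance {endurance:5.1f}%  diversity {diversity:5.1f}%")
```

With only five words per status, such means are coarse; they are meant as a quick check of the tendencies, not as a statistical test.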
 Answer                         Frequency
 referendum, prefereren         154
 referendum, preferentie        60                                Language variation In the language variation task, we
 pre, referendum                16                                found that most people used the standard Dutch term for
 do not know                    11                                a concept; only a minority of the given forms were
 preferent, referendum          8                                 dialectal forms. However, it is interesting to investigate the
                                                                  differences between Dutch and Flemish contributors. The
Table 2: 5 most frequent answers given for analysis task for      number of contributors from The Netherlands (around 100
blend preferendum. Correct answer in bold.                        per question) is larger than the number of Flemish contrib-
                                                                  utors (around 15 per question). Table 5 shows the relative
For the blends recognition task, the contributor answers          frequencies of given answers for the concept TAKE A SEAT,
were compared to the ANW entry in which the citation oc-          split per language area. The form ga lekker zitten (18%) is given
curs. Contributors had an average recognition accuracy of         only in the Netherlands, while zet u (31%) is given only in Flanders.
87%, with average accuracies per word ranging between
54% and 97%. The accuracies are high: most blends are recog-       Utterance            Flanders   The Netherlands
nized correctly. Table 3 shows the given answers for the
recognition of one specific blend: twittie. The word twittie       ga zitten            31%        38%
‘twitter fight’ is a blend of twitter and fittie ‘fight’           ga lekker zitten     0%         18%
(slang). Most contributors correctly recognize this as a           neem plaats          6%         7%
blend. However, many people also perceive fittie (which            zet u                31%        0%
also occurs in this citation) and tweet as blends, possibly        pak een stoel        6%         4%
because these words appear new or unknown.
                                                                   Total answers        16         115
 Answer                   Frequency
                                                                  Table 5: Relative frequency of answers given for language
 twittie                  122                                     variation task for concept TAKE A SEAT, per area. Top 5
 twittie, fittie          56                                      results, sorted by overall absolute frequency.
 fittie                   16
 tweet, twittie, fittie   5                                       These differences per area are observed for more questions.
 do not know              4                                       For example, a SWEET ON A STICK is referred to by many
                                                                  Flemish contributors as lekstok, whereas contributors from
Table 3: 5 most frequent answers given for recognition task       the Netherlands mainly use the form lolly. Likewise, WISHING
for blend twittie. Correct answer in bold.                        A GOOD NIGHT is expressed as slaap wel in Flanders,
                                                                  while welterusten is used more in the Netherlands.
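
The comparison of contributor answers with an attested analysis, as described for the blends analysis task, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the scoring procedure used in this paper: answers are assumed to be comma-separated source words, and the comparison is assumed to ignore case and word order. The frequencies are copied from Table 2, which lists only the five most frequent answers, so the resulting share is relative to those listed answers and is not the per-word accuracy reported above. The names source_words and share_matching are hypothetical.

```python
from collections import Counter

# Answer frequencies for the blend "preferendum", copied from Table 2.
# Only the five most frequent answers are listed there; rarer answers are missing.
TABLE_2 = Counter({
    "referendum, prefereren": 154,
    "referendum, preferentie": 60,
    "pre, referendum": 16,
    "do not know": 11,
    "preferent, referendum": 8,
})

# Attested ANW analysis of "preferendum", as given in the text.
ATTESTED = {"prefereren", "referendum"}


def source_words(answer):
    """Normalize a free-text answer into a set of lower-cased source words.

    Assumption: source words are separated by commas and their order is irrelevant.
    """
    return frozenset(part.strip().lower() for part in answer.split(",") if part.strip())


def share_matching(frequencies, attested):
    """Fraction of the listed answers whose source words equal the attested analysis."""
    target = frozenset(word.lower() for word in attested)
    total = sum(frequencies.values())
    hits = sum(n for answer, n in frequencies.items() if source_words(answer) == target)
    return hits / total


if __name__ == "__main__":
    share = share_matching(TABLE_2, ATTESTED)
    print(f"{share:.1%} of the listed answers match the attested analysis")
```

Comparing sets rather than strings means that "prefereren, referendum" and "referendum, prefereren" count as the same analysis; a stricter or more lenient matching (for example, also accepting preferentie) would be a separate design decision.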
Neologisms Table 4 shows the endurance and diversity
judgments for the 15 words in the neologisms task. These                           4.    Future applications
results show that in general, neologisms rejected by lexi-        Integrating crowdsourcing into a lexicographic work-
cographers also receive lower crowd endurance scores. For         flow Our case study on neologisms shows the potential of
diversity, this pattern is not as clear.                          crowdsourcing for lexicography. Crowdsourcing becomes

even more useful, if it becomes fully integrated into the lex-    Depuydt, K. and De Does, J. (2018). The Diachronic
icographic workflow. Currently, at the INT, newspaper ma-           Semantic Lexicon of Dutch as Linked Open Data. In
terial is fed in and sentences with unknown words are auto-         Proceedings of the Eleventh International Conference
matically extracted. Lexicographers then manually decide            on Language Resources and Evaluation (LREC 2018),
on inclusion in the dictionary. In an ideal workflow, the ex-       Paris, France, May. European Language Resources As-
tracted sentences are automatically imported into a crowd-          sociation (ELRA).
sourcing application and shown to the public. Contributor         Gries, S. T. (2004). Shouldn’t it be breakfunch? A quanti-
judgments can help lexicographers in deciding on dictio-            tative analysis of blend structure in English. Linguistics,
nary inclusion. A challenge will be to motivate a crowd to          42(3), January.
contribute over a long period of time. To maintain workflow       Holdt, Š. A., Čibej, J., Dobrovoljc, K., Gantar, P., Gorjanc,
stability, also in case of a temporary drop in crowd partic-        V., Klemenc, B., Kosem, I., Krek, S., Laskowski, C.,
ipation, crowd consultation will be an optional step in the         and Robnik-Šikonja, M. (2018). Thesaurus of Modern
workflow.                                                           Slovene: By the Community for the Community. Pro-
Language learning We have not yet performed crowd-                  ceedings of the XVIII EURALEX International Congress:
sourcing experiments for language learning, but we are              Lexicography in Global Contexts, pages 989–997, July.
looking into future directions which seem promising.              Holley, R. (2010). Crowdsourcing: How and Why Should
Crowdsourcing can be used to cluster word senses, which             Libraries Do It? D-Lib Magazine, 16(3/4), March.
could help people with language or speech disabilities.           Kosem, I., Krek, S., Gantar, P., Holdt, Š. A., Čibej, J., and
Crowdsourcing has been used for word sense disambigua-              Laskowski, C. (2018). Collocations Dictionary of Mod-
tion before (Akkaya et al., 2010; Venhuizen et al., 2013),          ern Slovene. Proceedings of the XVIII EURALEX In-
also specifically targeted at creating language learning ma-        ternational Congress: Lexicography in Global Contexts,
terial (Parent and Eskenazi, 2010). It would be worthwhile          pages 989–997, July.
to apply this methodology to the ANW dictionary or the            Metcalf, A. A. (2004). Predicting new words: the secrets
semantic lexicon DiaMaNT (Depuydt and De Does, 2018).               of their success. Houghton Mifflin Harcourt.
Another idea could be to use crowdsourcing to select suit-        Parent, G. and Eskenazi, M. (2010). Clustering dictionary
able learning sentences for collocations or proverbs from a         definitions using Amazon Mechanical Turk. In Proceed-
corpus.                                                             ings of the NAACL HLT 2010 Workshop on Creating
                                                                    Speech and Language Data with Amazon’s Mechanical
                     5.   Conclusion                                Turk, pages 21–29. Association for Computational Lin-
Our experiments have shown that crowdsourcing proves                guistics.
useful for documenting the Dutch language, and can be             Sanders, E., Burgos, P., Cucchiarini, C., and van Hout,
valuable for developing Dutch language learning material            R. (2016). Palabras: Crowdsourcing Transcriptions of
in the future. We used the PYBOSSA framework for our                L2 Speech. International Conference on Language Re-
crowdsourcing experiments, which is very powerful, but              sources and Evaluation (LREC) 2016. Portorož, Slove-
also has its limitations when using it for linguistic purposes.     nia, page 7.
                                                                  Schoonheim, T. and Tempelaars, R. (2010). Dutch Lexi-
               6.    Acknowledgements                               cography in Progress, The Algemeen Nederlands Woor-
This work was supported by EU COST action CA16105                   denboek (ANW). In Proceedings of the XIV Euralex In-
enetCollect, which is gratefully acknowledged. We thank             ternational Congress, Ljouwert, Fryske Akademy/Afuk,
Miet Ooms for supplying the questions for the language              abstract, page 179.
variation task. We thank our colleagues at the INT for valu-      Tiberius, C. and Niestadt, J. (2010). The ANW: An online
able advice.                                                        Dutch dictionary. Proceedings of the XIV Euralex Inter-
                                                                    national Congress. Ljouwert, Fryske Akademy/Afuk.
          7.   Bibliographical References                         Tiberius, C. and Schoonheim, T. (2016). The Alge-
Akkaya, C., Conrad, A., Wiebe, J., and Mihalcea, R.                 meen Nederlands Woordenboek (ANW) and its Lexico-
  (2010). Amazon mechanical turk for subjectivity word              graphical Process. Der lexikografische Prozess bei In-
  sense disambiguation. In Proceedings of the NAACL                 ternetwörterbüchern. 4. Arbeitsbericht "Internetlexiko-
  HLT 2010 workshop on creating speech and language                 grafie". Mannheim: Institut für Deutsche Sprache.
  data with Amazon’s Mechanical Turk, pages 195–203.                (OPAL X/2014 ).
  Association for Computational Linguistics.                      Van der Wal, M. J., Rutten, G., and Simons, T. (2012).
Beelen, H. and Van der Sijs, N. (2014). Crowdsourcing de            Letters as loot: Confiscated Letters filling major gaps in
  Bijbel. Neerlandia / Nederlands van Nu, (2-2014).                 the History of Dutch. In Marina Dossena et al., editors,
Burgos, P., Sanders, E., Cucchiarini, C., van Hout, R., and         Pragmatics & Beyond New Series, volume 218, pages
  Strik, H. (2015). Auris populi: crowdsourced native               139–162. John Benjamins Publishing Company, Amster-
  transcriptions of Dutch vowels spoken by adult Spanish            dam.
  learners. InterSpeech 2015. Dresden, Germany, page 7.           Venhuizen, N., Evang, K., Basile, V., and Bos, J. (2013).
Causer, T., Grint, K., Sichani, A.-M., and Terras, M.               Gamification for word sense labeling. In Proceedings of
  (2018). ’Making such bargain’: Transcribe Bentham and             the 10th International Conference on Computational Se-
  the quality and cost-effectiveness of crowdsourced tran-          mantics (IWCS 2013).

