<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Annotation of subtitle paraphrases using a new web tool</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Digital Humanities University of Helsinki</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1862</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles from movies and TV shows, which constitute an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.</p>
      </abstract>
      <kwd-group>
        <kwd>annotation</kwd>
        <kwd>paraphrase</kwd>
        <kwd>web tool</kwd>
        <kwd>inter-annotator agreement</kwd>
        <kwd>subtitle</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper introduces an online tool for annotating paraphrases and evaluates
annotations gathered with the tool. Paraphrases are pairs of phrases in the same
language that express approximately the same meaning, such as "Have a seat."
versus "Sit down." The annotated paraphrases are part of Opusparcus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
is a paraphrase corpus for six European languages: German (de), English (en),
Finnish (fi), French (fr), Russian (ru), and Swedish (sv).
      </p>
      <p>
        The paraphrases in Opusparcus consist of movie and TV subtitles from
OpenSubtitles2016 parallel corpora [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which are part of the larger OPUS corpus.<sup>1</sup> We
are interested in movie and TV subtitles because of their conversational nature.
This makes subtitle data ideal for exploring dialogue phenomena and properties
of everyday, colloquial language [
        <xref ref-type="bibr" rid="ref10 ref11 ref17">11,17,10</xref>
        ]. In addition, the data could prove
      </p>
    </sec>
    <sec id="sec-2">
      <title>1 http://opus.nlpl.eu/</title>
      <p>useful in modeling semantic similarity of short texts, with applications such as
extraction of related or paraphrastic content from social media. Our data could
also be valuable in computer-assisted language learning to teach natural
everyday expressions as opposed to the formal language of some well-known data sets,
consisting of news texts, parliamentary speeches, or passages from the Bible.
Additionally, paraphrase data is useful for evaluating machine translation systems,
since it provides multiple correct translations for a single source sentence.</p>
      <p>Opusparcus consists of three types of data sets for each language: training,
development and test sets. These data sets can be used, for instance, in
machine learning. The training sets consist of millions of sentence pairs, in which
sentences are paired with their paraphrases automatically using a probabilistic
ranking function. The training sets are not discussed further in the current paper,
which instead focuses on the manually annotated development and test sets. The
development and test sets contain a few thousand sentence pairs. Each of the pairs
has been checked by human annotators in order to ensure as high quality as possible.
The annotation effort took place using the annotation tool, which is presented
in more detail below.</p>
      <p>The source code of the annotation tool is public.<sup>2</sup> A public version of the
tool is online for anyone to test.<sup>3</sup> The data gathered with the tool along with
the rest of Opusparcus is available for downloading.<sup>4</sup></p>
      <p>The paper is divided into two main parts: First, the setup of the annotation
task is described together with the design of the annotation tool. Then, the
annotations produced in the project are evaluated.</p>
      <sec id="sec-2-1">
        <title>Setup</title>
        <p>At the beginning of the project, we faced many open questions. In the following,
we discuss the options we considered when setting up the annotation task. We
also describe why we created our own annotation tool and how the tool works.</p>
        <sec id="sec-2-1-1">
          <title>Annotation scheme</title>
          <p>
            An essential question when determining the paraphrase status of sentence pairs
is what rating scheme to use. The simplest scheme is to have two categories
only, as is the case with the Microsoft Research Paraphrase Corpus (MSRPC)
[
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]: "Raters were told to use their best judgment in deciding whether 2 sentences,
at a high level, 'mean the same thing'."
          </p>
          <p>
            Another well-known resource, the Paraphrase Database (PPDB) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], contains
automatically extracted paraphrases; however, the construction of PPDB also
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 https://github.com/miau1/simsents-anno</title>
    </sec>
    <sec id="sec-4">
      <title>3 https://vm1217.kaj.pouta.csc.fi</title>
    </sec>
    <sec id="sec-5">
      <title>4 Available through the Language Bank of Finland: http://urn.fi/urn:nbn:fi:lb-2018021221</title>
      <p>involved manual annotation to some extent: "To gauge the quality of our
paraphrases, the authors judged 1900 randomly sampled predicate paraphrases on a
scale of 1 to 5, 5 being the best."</p>
      <p>
        In a later version, PPDB 2.0 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], there is further discussion: "Although we
typically think of paraphrases as equivalent or as bidirectionally entailing, a
substantial fraction of the phrase pairs in PPDB exhibit different entailment
relations. [...] These relations include forward entailment/hyponym, reverse
entailment/hypernym, non-entailing topical relatedness, unrelatedness, and even
exclusion/contradiction."
      </p>
      <p>
        In addition to assessing the degree of paraphrasticity, annotation schemes
can include information about the types of paraphrase relations a phrase pair
contains. Vila et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] propose a complex scheme based on an extensive linguistic
paraphrase typology. It consists of 24 different type tags, and the annotations
also include the scopes for different paraphrase relations, such as lexical,
morphological or syntactic changes. Other complex schemes have also been
developed. Kovatchev et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] extend the typology and annotation scheme of Vila et
al., whereas Barrón-Cedeño et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] present a scheme based on an alternative
typology.
      </p>
      <p>When designing the Opusparcus corpus we wanted to annotate symmetric
relations and find out whether two sentences essentially meant the same thing.
This excluded the different (asymmetric) entailment options from our emerging
annotation scheme. Furthermore, having only two classes (paraphrases versus
non-paraphrases) seemed too limited, because of some challenges we faced with
the data. In our system, the sentence pairs proposed as paraphrases are
produced by translating from one language to another language and then back; for
instance, English: "Have a seat." → French: "Asseyez-vous." → English: "Sit
down." Here "translation" actually means finding subtitles in different languages
that are shown at the same time in a movie or TV show. We have found that
translational paraphrases exhibit (at least) two types of near-paraphrase cases:
        <list list-type="order">
          <list-item>
            <p>Scope mismatch: The two phrases mean almost the same thing, but one of
the phrases is more specific than the other; for instance: "You?" ↔ "How
about you?", "Hi!" ↔ "Hi, Bob!", "What are you doing?" ↔ "What the
hell are you doing?"</p>
          </list-item>
          <list-item>
            <p>Grammatical mismatch: The two phrases do not mean the same thing, but
the difference is small and pertains to grammatical distinctions that are not
made in all languages. Such paraphrase candidates are typically by-products
of translation between languages; for instance: "I am a doctor." ↔ "I am
the doctor.", or French "Il est là." ↔ "Elle est là." The French example
could mean either "He is here." ↔ "She is here." when referring to animate
objects, or just "It is here." when talking about inanimate things. It does
not appear crucial to distinguish between grammatical genders in the latter
case.</p>
          </list-item>
        </list>
      </p>
      <p>Another aspect that caught our attention initially was whether it would
be necessary to distinguish between interesting and uninteresting paraphrases.</p>
      <p>There are fairly trivial transformations that can be applied to produce
paraphrases, such as: "I am sorry." ↔ "I'm sorry.", "Alright." ↔ "All right.", or
change of word order, which is common in some languages; an English example
could be: "I don't know him." ↔ "Him I don't know." If a computer were to
determine whether such phrase pairs were paraphrases, a very simple algorithm
would suffice, and the data would not be too interesting from a machine learning
point of view.</p>
      <p>Taking these considerations into account, an initial six-level scale was planned
for assessing to what extent sentences meant the same thing: 5 – Excellent, 4 –
Too similar, and as such uninteresting, 3 – Scope mismatch, 2 – Grammatical
mismatch, 1 – Farfetched, 0 – Wrong. However, this scheme immediately turned
out to be impractical. The scale does not produce a simple range from good
to bad. For instance, in the case of 5 (excellent) or 4 (too similar), the annotator
first has to decide whether the sentences are paraphrases or not, and in the case of
paraphrases, whether they are interesting or not.</p>
      <p>
        A four-grade scale was adopted instead: 4 – Good example of paraphrases,
3 – Mostly good example of paraphrases, 2 – Mostly bad example of paraphrases,
and 1 – Bad example of paraphrases. Note that the scale has an even number
of entries, so that the annotator needs to take sides, and indicate a preference
towards either good or bad. There is no option for "cannot tell" in the middle,
in contrast to the five-grade scale of PPDB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Nonetheless, a fifth so-called
"trash" category was created, to make it possible for the annotators to discard
invalid data.
      </p>
      <p>The number of too similar sentence pairs has been reduced in a prefiltering
step, where edit distance is used to measure sentence similarity. In this way,
we avoid wasting annotation effort on trivial cases. When it comes to scope
mismatch and grammatical mismatch, the annotators must make decisions according
to their best judgment and the characteristics of the language they are annotating;
these cases need to be annotated as either "mostly good" (3) or "mostly bad" (2)
examples of paraphrases. The instructions shown to the annotators are displayed
in Table 1.</p>
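      <p>The idea of filtering out too similar pairs by edit distance can be sketched as follows. This is a minimal illustration and not the filter actually used for Opusparcus: the text only states that edit distance is used, so the character-level Levenshtein distance, the lowercasing, the function names and the threshold <monospace>min_distance</monospace> are all assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_too_similar(s1: str, s2: str, min_distance: int = 3) -> bool:
    """Hypothetical rule: pairs fewer than min_distance edits apart are trivial."""
    return levenshtein(s1.lower(), s2.lower()) < min_distance


pairs = [("I am sorry.", "I'm sorry."),
         ("Alright.", "All right."),
         ("Have a seat.", "Sit down.")]
# Keep only the pairs that are worth showing to an annotator.
kept = [pair for pair in pairs if not is_too_similar(*pair)]
```
]]></preformat>
      <p>With this illustrative threshold, the two trivial pairs mentioned earlier are dropped and only the "Have a seat." / "Sit down." pair remains.</p>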
      <sec id="sec-5-1">
        <title>Why did we build our own tool?</title>
        <p>Before tackling the annotation task, we evaluated whether to use an existing
annotation tool or build one ourselves. Using an existing tool is potentially less
expensive, and existing services usually offer ways of storing and backing up data
and securely handling user authentication.</p>
        <p>
          We tried using WebAnno [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], which is a web-based annotation tool designed
for linguistic annotation tasks. With WebAnno, one can design one's own
annotation projects, assign users and monitor the projects. WebAnno turned out
to be too slow to use for our purposes: the user has to highlight the part they
want to annotate and then type in the annotation category. Working with
WebAnno is useful for annotating linguistic relations but unnecessarily complicated
for simply choosing one of our five annotation categories.
        </p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr>
                <th>Category</th>
                <th>Description</th>
                <th>Examples</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Good, "Dark green", 4</td>
                <td>The two sentences can be used in the same situation and essentially "mean the same thing".</td>
                <td>It was a last minute thing. ↔ This wasn't planned.<break/>
Honey, look. ↔ Um, honey, listen.<break/>
I have gooseflesh. ↔ The hair's standing up on my arms.</td>
              </tr>
              <tr>
                <td>Mostly good, "Light green", 3</td>
                <td>It is acceptable to think that the two sentences refer to the same thing, although one sentence might be more specific than the other one, or there are differences in style, such as polite form versus familiar form. There may also be differences in gender, number or tense, etc. if these differences are of minor importance for the phrases as a whole, such as masculine or feminine agreement of French adjectives.</td>
                <td>Hang that up. ↔ Hang up the phone.<break/>
Go to your bedroom. ↔ Just go to sleep.<break/>
Next man, move it. ↔ Next, please.<break/>
Calvin, now what? ↔ What are we doing?<break/>
Good job. ↔ Right, good game, good game.<break/>
Tu es fatigué? ↔ Vous êtes fatiguée?<break/>
Den är fånig. ↔ Det är dumt.<break/>
Olet myöhässä. ↔ Te tulitte liian myöhään.</td>
              </tr>
              <tr>
                <td>Mostly bad, "Yellow", 2</td>
                <td>There is some connection between the sentences that explains why they occur together, but one would not really consider them to mean the same thing. There may also be differences in gender, number, tense etc. that are important for the meaning of the phrases as a whole.</td>
                <td>Another one? ↔ Partner again?<break/>
Did you ask him? ↔ Have you asked her?<break/>
Hello, operator? ↔ Yes, operator, I'm trying to get to the police.<break/>
Isn't that right? ↔ Well, hasn't it?<break/>
Get them up there. ↔ Put your hands in the air.<break/>
I thought you might. ↔ Yeah, didn't think so.<break/>
I am on my way. ↔ We are coming.</td>
              </tr>
              <tr>
                <td>Bad, "Red", 1</td>
                <td>There is no obvious connection. The sentences mean different things.</td>
                <td>She's over there. ↔ Take me to him.<break/>
All the cons. ↔ Nice and comfy.</td>
              </tr>
              <tr>
                <td>Trash</td>
                <td>At least one of the sentences is invalid in some of the following ways:<break/>
– The language of the sentence is wrong, such as an English phrase in the French annotation data.<break/>
– There are spelling mistakes or the sentence is syntactically misformed. However, sloppy punctuation or capitalization can be ignored and the sentence can be accepted.</td>
                <td>Estoy buscando a mi hermana. ↔ I'm looking for my sister.<break/>
Now, watch what you're saying. ↔ Watch your mouth.<break/>
Adolfo Where can I find? ↔ Where I can find Adolfo?</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          Amazon Mechanical Turk<sup>5</sup> (AMT) is similar to WebAnno in the sense that
users can design their own annotation task, but the main selling point of AMT
is that the annotations are made using crowdsourcing. AMT utilizes a global
marketplace of workers who are paid for their work effort. According to Snow et
al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], linguistic annotation tasks can be carried out quickly and inexpensively
by non-expert users. However, it is important that the annotators are proficient
in the language they are annotating in order to obtain reliable annotations.
        </p>
        <p>In the end, we decided to implement our own tool, because we needed it to perform
a specific task in a controllable setting.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Design choices</title>
        <p>Before implementing the annotation platform, the design has to be thought out
thoroughly to serve the annotation task. It is important that the annotation
process is simple and convenient. This makes the task pleasant for the
annotators, while simultaneously benefiting those conducting the project by allowing
annotations to be gathered faster.</p>
        <p>Web-based tool. In order to give the annotators easy access to the tool,
we decided to make it accessible with a web browser. In this way the annotators
can evaluate sentence pairs anywhere and anytime they like. This also allows for
easy recruitment of new annotators by creating new user accounts and sharing
the link to the interface.</p>
        <p>The main annotation view is meant to be simple and informative (Figure 1).
The person annotating sees two sentences in a given language and evaluates the
similarity on a scale from 1 to 4 by pressing the corresponding number key or
by clicking the button. In addition to the four similarity category buttons, there
is a button to discard the sentence pair. The discard button has no shortcut key
on the keyboard in order to avoid the category being chosen accidentally. The
criteria for each category are visible below the sentence pair. The annotator can
also see their progress for each language at the top of the page. By clicking their
username at the top of the page, the user can enter their user page. Here the
user can switch between the languages they were assigned to annotate, change
their password and see their 100 most recent annotations and edit them.</p>
        <p>In addition to being able to make annotations, admin users have access to
special features. They can add new users, view annotation statistics per language
or per user and search for and read specific annotations.</p>
        <p>Sharing the task. Each sentence pair has to be annotated by two different
annotators. We do not hand out complete batches of sentence pairs for
annotation, in order to avoid dealing with unfinished batches. Instead, our tool finds
the next sentence pair dynamically. Within a given language, all annotators
annotate sentence pairs from the same sentence pair pool. The algorithm looks for</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 http://mturk.com</title>
      <p>the first pair that has been annotated by another annotator, but lacks a second
annotation. If such a pair is not found, the algorithm finds the first pair that has
no annotations. The users can stop annotating anytime they like without feeling
the pressure of having unfinished work and continue again when it is convenient
for them.</p>
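      <p>The selection rule described above can be sketched in a few lines. This is a hedged illustration only: the actual tool queries its database through Django, whereas the sketch uses plain Python data structures, and the names <monospace>next_pair</monospace>, <monospace>pool</monospace> and <monospace>annotators</monospace> are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Optional, Set


def next_pair(pool: List[int],
              annotators: Dict[int, Set[str]],
              user: str) -> Optional[int]:
    """Return the id of the next sentence pair for `user`, or None if none is left.

    `pool` lists pair ids in their fixed order; `annotators` maps a pair id
    to the set of users who have already annotated that pair.
    """
    # First choice: a pair with exactly one annotation, made by someone else.
    for pair_id in pool:
        seen_by = annotators.get(pair_id, set())
        if len(seen_by) == 1 and user not in seen_by:
            return pair_id
    # Fallback: the first pair that has no annotations at all.
    for pair_id in pool:
        if not annotators.get(pair_id):
            return pair_id
    return None


pool = [101, 102, 103, 104]
annotators = {101: {"alice", "bob"}, 102: {"carol"}, 103: {"alice"}}
```
]]></preformat>
      <p>In this toy pool, "alice" would next be shown pair 102 (already annotated once by "carol"), whereas "carol" would be shown pair 103. Pair 104 is only handed out once no half-annotated pair is available to the user.</p>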
      <sec id="sec-6-1">
        <title>Structure of the tool</title>
        <p>The annotation tool is written in Python and it uses the Django web
framework<sup>6</sup>. The database used is PostgreSQL<sup>7</sup>. The application runs in a cPouta
virtual machine by CSC<sup>8</sup>, a Finnish information and communication
technology provider, but it can be run on any server, for example on Heroku<sup>9</sup>, a cloud
computing service.</p>
        <p>We have chosen to use Django, one of the most popular web frameworks for
Python. Django has a prebuilt admin page, which allows multiple admins to
easily manage users without each of them having access to the backend of the
tool. Django also has a database API, which allows the developer to use Django's
methods instead of raw SQL commands. This makes database interactions more
intuitive and concise. Additionally, Django has built-in methods for handling
security risks, which is important to us, since we are dealing with users with
passwords.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6 https://www.djangoproject.com/</title>
    </sec>
    <sec id="sec-8">
      <title>7 https://www.postgresql.org/</title>
    </sec>
    <sec id="sec-9">
      <title>8 https://www.csc.fi/</title>
    </sec>
    <sec id="sec-10">
      <title>9 https://www.heroku.com/</title>
      <p>There are two versions of our tool: one that requires registration and logging
in, and one that is open for anyone to use. Each annotator for the private tool
was approved by admins. This makes it time-consuming to have a large group
of annotators. The public tool is open for anyone, but there still have to be
two annotations from two different annotators for each sentence pair. The users
are tracked by their IP addresses, which is not by any means a perfect way
of identifying individual users. An open tool is a good way of gathering large
amounts of annotated data, but the tool has to have mechanisms for detecting
and filtering out random and noisy annotations. In the end, we decided to use
annotations only from the private tool.</p>
      <sec id="sec-10-1">
        <title>Evaluation</title>
        <p>Eighteen people participated in the annotation effort. The annotators were
recruited from among researchers and students at the university, as well as family
members and friends. The German data was annotated by native German speakers
and a skilled speaker of German as a second language. The English data was
annotated by non-native but highly skilled English speakers. The Finnish data
was annotated by native Finnish speakers. The French data was annotated by a
native French speaker and skilled non-native French speakers. The Russian data
was annotated by native Russian speakers, and the Swedish data was annotated
by native Swedish speakers. Table 2 shows the total number of paraphrases
annotated as well as the number of annotators who contributed the most for each
language.</p>
        <p>In the following, we evaluate the annotations in terms of inter-annotator
agreement as well as annotation times and session lengths. We want to make
sure that the annotations are of good quality and that fatigue or carelessness was
not a detrimental factor in the process.</p>
        <sec id="sec-10-1-1">
          <title>Inter-annotator agreement</title>
          <p>
            The results of the annotation of the Opusparcus development and test sets have
been published earlier in connection with the release of the corpus [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]; a detailed
breakdown is presented, showing the number of sentence pairs that end up in
different categories.
          </p>
          <p>The current paper extends the analysis by taking a closer look at
inter-annotator agreement. It would also be interesting to study intra-annotator
agreement (intra-rater reliability) to find out how consistently our annotators
performed on data that they had already annotated before. However, we never
displayed the same sentence pairs twice to the same annotator, so we cannot
assess the reliability of individual annotators, only to what extent they agreed
or disagreed with other annotators.</p>
          <p>Distributions over annotation categories. The annotators were shown
sentence pairs and needed to decide between five options. For every sentence pair,
two annotations were obtained, because two annotators made two independent
choices. Figure 2 shows the distributions of all annotation choices made,
separately for each language. The annotation categories are clearly not equally
frequent, and there are differences across languages. The language-specific
differences are explained, at least partly, by the amount of available data from
which to produce sentence pairs for annotation. In a preprocessing step, the
sentence pairs were ranked automatically, with the most "promising" sentences first. The
English data set was the largest one, and 70 % of the annotated pairs turned out
to be "good" or "mostly good" paraphrases. By contrast, the Swedish material
was the smallest one and only about half of the pairs were tagged as paraphrases.</p>
        </sec>
        <sec id="sec-10-1-2">
          <title>Discounting for chance agreement</title>
          <p>
            To assess the level of agreement between annotators, Cohen's kappa score [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] is frequently used in the literature. In
Cohen's own words, kappa (or κ) is "[a] coefficient of interjudge agreement for
nominal scales. [...] It is directly interpretable as the proportion of joint
judgments in which there is agreement, after chance agreement is excluded."
          </p>
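          <p>In symbols, the definition quoted above corresponds to the standard kappa formula. The symbol names are ours and do not appear in the original text: p<sub>o</sub> denotes the observed proportion of agreement and p<sub>e</sub> the proportion of agreement expected by chance.</p>
          <disp-formula>
            <tex-math><![CDATA[
\kappa = \frac{p_o - p_e}{1 - p_e}
]]></tex-math>
          </disp-formula>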
          <p>
            There are two main ways of computing the probability that agreement occurs
by pure chance: either the distribution of proportions over the categories is taken
to be equal for all annotators or the annotators have their own individual
distributions, as originally suggested by Cohen [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Using individual distributions is
complicated in our case, since we assign each sentence pair dynamically to two
annotators in our annotator pool. Hence, we have a large number of batches,
each annotated by a different pair of annotators. However, in practice the two
approaches tend to produce very similar outcomes [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], and consequently we base
our kappa calculations on one common distribution per language (shown in
Figure 2). In fact, we verified that both calculations produce very
similar results by examining the languages where one pair of annotators had
co-annotated more than half of the sentence pairs. When we used
annotator-specific distributions in the calculations, the resulting chance agreement
probabilities differed by at most one percentage point from the probabilities based on
one common distribution.
          </p>
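          <p>A kappa calculation based on one common distribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used for the paper: the function name <monospace>pooled_kappa</monospace>, the label strings and the toy data are hypothetical, and chance agreement is taken to be the sum of squared category proportions in the pooled annotations.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import List, Tuple


def pooled_kappa(pairs: List[Tuple[str, str]]) -> Tuple[float, float, float]:
    """Return (observed agreement, chance agreement, kappa).

    Each element of `pairs` holds the two labels given to one sentence pair.
    """
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    # One common distribution: pool both annotations of every sentence pair.
    counts = Counter(label for pair in pairs for label in pair)
    total = 2 * n
    # Probability that two labels drawn from the pooled distribution coincide.
    chance = sum((count / total) ** 2 for count in counts.values())
    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa


toy = [("good", "good"), ("good", "mostly good"),
       ("bad", "bad"), ("mostly bad", "bad")]
observed, chance, kappa = pooled_kappa(toy)
```
]]></preformat>
          <p>With annotator-specific distributions, the chance term would instead multiply the category proportions of the two individual annotators, which requires knowing who annotated which batch.</p>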
          <p>
            We evaluate inter-annotator agreement in three different ways. In the first
evaluation, we retain all distinctions between the five annotation categories. This
means, for instance, that we consider the annotators to disagree if one annotator
opts for "Good" and the other one for "Mostly good" in a particular case. The
results are shown in Table 3. To verbally assess what the kappa values actually
tell us about inter-annotator agreement, we have adopted a guideline proposed
by Landis and Koch [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], which is commonly used for benchmarking in spite of
being fairly arbitrary, as already stated in the original paper.
          </p>
          <p>Table 3 demonstrates that the level of agreement between the five categories
"Good", "Mostly good", "Mostly bad", "Bad", and "Trash" ranges between fair
and moderate. The average level of agreement is 59.9 % with a kappa value of
0.46. Thus, in general there are differing views among the annotators on how to
judge paraphrase status on this four-level scale (plus trash).</p>
          <p>
            Next, we relax the conditions of agreement and merge the two categories
"Good" and "Mostly good" paraphrases into one single class "Is paraphrase",
and similarly merge the categories "Bad" and "Mostly bad" into one class "Is not
paraphrase". The trash category is maintained as a third class. The results for
this division are shown in Table 4. The average level of agreement is now 83.1 %
with a kappa value of 0.66, which can be characterized as substantial agreement.
Interestingly, very similar values are reported for the Microsoft Research
Paraphrase Corpus (MSRPC) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], where annotators were supposed to decide whether
sentences from the news domain were paraphrases or not. The inter-annotator
agreement for MSRPC was 84 % and kappa was 0.62. Thus, these two tasks are
very similar and so is the observed level of agreement.
          </p>
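          <p>The effect of merging categories on raw agreement can be illustrated with a small sketch. The mapping follows the three classes described above, but the label strings, function names and toy data are assumptions made for the example.</p>
          <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical label strings; the three target classes follow the text.
MERGE = {"good": "is paraphrase", "mostly good": "is paraphrase",
         "mostly bad": "is not paraphrase", "bad": "is not paraphrase",
         "trash": "trash"}


def merge_labels(pairs):
    """Map every annotation of every sentence pair to its merged class."""
    return [(MERGE[a], MERGE[b]) for a, b in pairs]


def observed_agreement(pairs):
    """Proportion of sentence pairs on which both annotators chose the same label."""
    return sum(a == b for a, b in pairs) / len(pairs)


toy = [("good", "mostly good"), ("mostly good", "mostly bad"),
       ("bad", "mostly bad"), ("trash", "trash")]
fine_grained = observed_agreement(toy)           # five categories kept apart
merged = observed_agreement(merge_labels(toy))   # three merged classes
```
]]></preformat>
          <p>In this toy data, only one of four pairs agrees on the five-category scale, whereas three of four agree after merging. The disagreement that remains is the one that crosses the boundary between "mostly good" and "mostly bad".</p>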
          <p>
            Since our paraphrase annotation is based on a four-grade scale ranging from
"good" to "bad", we decided to evaluate agreement in a third way, where
adjacent choices are considered to be in agreement. In this scheme "good" and
"mostly good" match, and so do "mostly good" and "mostly bad" as well as
"mostly bad" and "bad". Table 5 presents the results of this calculation. Not
surprisingly, inter-annotator agreement increases (to 92.5 % on average), but so
does the expected level of agreement by chance (60.7 %). The kappa score is
0.81. It is interesting to note that although the likelihood of agreement by pure
chance increases, inter-annotator agreement increases to such an extent that the
overall kappa score suggests "almost perfect" agreement.
          </p>
          <p>
            Discussion. The authors behind the MSRPC corpus consider their annotation
task to be "ill-defined", but they were surprised at how high inter-rater
agreement was (84 %) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Our setup was similar in the sense that our annotators did
not typically receive any further instructions than the descriptions and examples
shown in the annotation tool (see Table 1). The highest agreement is observed for
English, Finnish and Swedish, languages where the people most involved in the
paraphrase project performed a substantial part of the annotation effort. This
suggests that deeper involvement in the project contributes to more convergent
views on how to categorize the paraphrase data. Why Russian and French have
the lowest degrees of agreement is unclear. These languages seemed to have the
noisiest data, French because of complicated orthography, and Russian possibly
because of OCR errors, which introduce Latin letters into Cyrillic text.
          </p>
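          <p>The third evaluation, in which adjacent grades count as agreement, can be sketched as follows. Several details are assumptions, since the text does not spell them out: "trash" is assumed to match only "trash", chance agreement is computed from one pooled distribution as the probability that two independently drawn labels match under the relaxed criterion, and all names and toy data are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

# Positions on the four-grade scale; "trash" is handled separately.
SCALE = {"bad": 1, "mostly bad": 2, "mostly good": 3, "good": 4}


def matches(a, b):
    """Labels agree if they are identical or adjacent on the scale.

    Assumption: "trash" only matches "trash".
    """
    if a == "trash" or b == "trash":
        return a == b
    return abs(SCALE[a] - SCALE[b]) <= 1


def adjacent_kappa(pairs):
    """Return (observed, chance, kappa) under the relaxed matching rule."""
    n = len(pairs)
    observed = sum(matches(a, b) for a, b in pairs) / n
    counts = Counter(label for pair in pairs for label in pair)
    probs = {label: count / (2 * n) for label, count in counts.items()}
    # Chance that two labels drawn from the pooled distribution match.
    chance = sum(probs[a] * probs[b]
                 for a in probs for b in probs if matches(a, b))
    return observed, chance, (observed - chance) / (1 - chance)


toy = [("good", "mostly good"), ("good", "bad"),
       ("mostly bad", "bad"), ("trash", "good")]
observed, chance, kappa = adjacent_kappa(toy)

# Plugging the reported averages into the kappa formula as a sanity check.
reported = (0.925 - 0.607) / (1 - 0.607)
```
]]></preformat>
          <p>Treating the reported average agreement (92.5 %) and average chance agreement (60.7 %) as a single data point gives a value of roughly 0.81, which is consistent with the kappa score stated above.</p>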
        </sec>
        <sec id="sec-10-1-3">
          <title>Annotation times</title>
          <p>Measuring annotation times reveals information on annotator behavior.
Of particular interest is behavior that would affect the reliability of the
annotation effort, e.g. signs of fatigue or maliciousness. By annotation time, we
mean the time elapsed between two consecutive annotation events for a user.</p>
          <p>Many annotators started the annotation task with slow annotations. In
Figure 3 we see this effect for user2 and user4. The slow start is more clearly visible
for user4: the fastest times before the 200-annotation mark are slower than
after that. Additionally, the times are slightly faster after about 1000
annotations. This indicates that the user first took his time annotating to get familiar
with the task. Once the user figured out the nature of the work, he increased
his annotating speed and maintained it or slightly increased it for the rest of
the task. The same effect is observable for user2 at the beginning of both of the
annotated languages but to a lesser extent. Additionally, the annotation speed
of user2, a native Russian speaker, decreases when he switches from annotating
Russian to French. We did not observe signs of slowing down because of fatigue
for any annotator. Neither did we observe any maliciousness on the users'
part, e.g. very fast consecutive annotations.</p>
          <p>Annotation behavior and strategies are also reflected in the amount of time
people spend annotating in a single session. We define an annotation session to
consist of annotation events where the time between two consecutive events is
less than five minutes. Figure 4 shows the number of sessions of different lengths,
as well as the cumulative proportion of annotation events for all users.</p>
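          <p>The session definition above, together with the notion of annotation time as the gap between consecutive events, can be sketched as follows. The function names and timestamps are hypothetical; only the five-minute limit comes from the text.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime, timedelta
from typing import List

SESSION_GAP = timedelta(minutes=5)  # limit stated in the text


def split_sessions(events: List[datetime],
                   gap: timedelta = SESSION_GAP) -> List[List[datetime]]:
    """Group one user's annotation timestamps into sessions."""
    sessions: List[List[datetime]] = []
    for event in sorted(events):
        # Stay in the current session only if the gap is strictly below the limit.
        if sessions and event - sessions[-1][-1] < gap:
            sessions[-1].append(event)
        else:
            sessions.append([event])
    return sessions


def annotation_times(session: List[datetime]) -> List[timedelta]:
    """Time elapsed between consecutive annotation events within a session."""
    return [later - earlier for earlier, later in zip(session, session[1:])]


start = datetime(2018, 1, 1, 12, 0, 0)
events = [start + timedelta(seconds=s) for s in (0, 20, 50, 400, 420)]
sessions = split_sessions(events)
```
]]></preformat>
          <p>In this toy example the gap of 350 seconds between the third and fourth event exceeds the limit, so the five events fall into two sessions of three and two annotations.</p>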
          <p>Most of the annotation sessions are relatively short, and consequently a large
proportion of the annotations come from short sessions. As we mentioned above,
we cannot assess the reliability of individual annotators using intra-annotator
agreement measures, but a look at the session lengths and annotation results
suggests no difference in quality between the annotators who worked in short sessions
and those who preferred longer sessions. Based on this we assume
that annotator fatigue does not affect the quality of the resulting data set to a
large degree.</p>
        </sec>
      </sec>
      <sec id="sec-10-2">
        <title>Discussion and conclusion</title>
        <p>
          Could the inter-annotator agreement be higher? The creators of MSRPC [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
believe that agreement in their task could be improved through practice and
discussion among the annotators. However, they also observed that attempts to
make the task more concrete resulted in degraded intra-annotator agreement.
        </p>
        <p>
          Others have called for more linguistically informed data sets with more
fine-grained annotation categories [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. There is a trade-off, however, between
annotation speed and complexity of the annotation task. We have favored a fairly
simple, intuitive annotation scheme.
        </p>
        <p>
          The Opusparcus data sets have been used successfully in machine learning
for training and evaluating automatic paraphrase detection [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>In future work, if we wish to recruit a larger pool of annotators through
crowdsourcing, attention needs to be paid to better tracking of the reliability
and consistency of individual annotator performance. Additionally, although the
colloquial style of the data makes it interesting to work with, the task could be
made even more enjoyable, for instance through gamification.</p>
      </sec>
      <sec id="sec-10-3">
        <title>Acknowledgments</title>
        <p>We are grateful to the following people for helping us in the annotation effort:
Thomas de Bluts, Aleksandr Semenov, Olivia Engstrom, Janine Siewert, Carola
Carpentier, Svante Creutz, Yves Scherrer, Anders Ahlback, Sami Itkonen,
Riikka Raatikainen, Kaisla Kajava, Tiina Koho, Oksana Lehtonen, Sharid Loaiciga
Sanchez, and Tatiana Batanina.</p>
        <p>We would also like to thank Hanna Westerlund, Martin Matthiesen, and
Mietta Lennes for making Opusparcus available at the Language Bank of Finland
(http://www.kielipankki.fi).</p>
        <p>The project was supported in part by the Academy of Finland through
Project 314062 in the ICT 2023 call on Computation, Machine Learning and
Artificial Intelligence.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Barrón-Cedeño,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Vila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Martí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.A.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>39</volume>
          ,
          <issue>917</issue>
          –
          <fpage>947</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A coefficient of agreement for nominal scales</article-title>
          .
          <source>Educational and Psychological Measurement</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <volume>37</volume>
          –
          <fpage>46</fpage>
          (
          <year>1960</year>
          ), https://doi.org/10.1177/001316446002000104
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Creutz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Open Subtitles Paraphrase Corpus for Six Languages</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ).
          <article-title>European Language Resources Association (ELRA), Miyazaki</article-title>
          , Japan (May
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brockett</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatically constructing a corpus of sentential paraphrases</article-title>
          .
          <source>In: Proceedings of the Third International Workshop on Paraphrasing (IWP2005) at the Second International Joint Conference of Natural Language Processing (IJCNLP-05)</source>
          .
          <source>Asia Federation of Natural Language Processing (January</source>
          <year>2005</year>
          ), https://www.microsoft.com/en-us/research/publication/ automatically-constructing
          <article-title>-a-corpus-of-sentential-paraphrases/</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Eugenio</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Squibs and discussions: The kappa statistic: A second look</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ) (
          <year>2004</year>
          ), http://www.aclweb.org/anthology/J04-1005
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganitkevitch</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Van Durme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>PPDB: The paraphrase database</article-title>
          .
          <source>In: Proceedings of NAACL-HLT</source>
          . pp.
          <volume>758</volume>
          –
          <fpage>764</fpage>
          . Association for Computational Linguistics, Atlanta, Georgia (
          <year>June 2013</year>
          ), http://cs.jhu.edu/~ccb/publications/ppdb.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kovatchev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martí</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salamó</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ETPC – a paraphrase identification corpus annotated with extended paraphrase typology and negation</article-title>
          . In: LREC (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Landis</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>G.G.</given-names>
          </string-name>
          :
          <article-title>The measurement of observer agreement for categorical data</article-title>
          .
          <source>Biometrics</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <volume>159</volume>
          –
          <fpage>174</fpage>
          (
          <year>1977</year>
          ), http://www.jstor.org/stable/2529310
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lison</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC</source>
          <year>2016</year>
          ). Portorož, Slovenia (May
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lison</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kouylekov</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ).
          <article-title>European Language Resources Association (ELRA), Miyazaki</article-title>
          , Japan (May
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Paetzold</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Collecting and exploring everyday language for predicting psycholinguistic properties of words</article-title>
          .
          <source>In: Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          . pp.
          <volume>1669</volume>
          –
          <fpage>1679</fpage>
          .
          <string-name>
            <surname>Osaka</surname>
          </string-name>
          ,
          <string-name>
            <surname>Japan</surname>
          </string-name>
          (
          <year>December 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pavlick</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rastogi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganitkevitch</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Van Durme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers)</source>
          . pp.
          <volume>425</volume>
          –
          <fpage>430</fpage>
          . Association for Computational Linguistics, Beijing, China (
          <year>July 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rus</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banjade</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lintean</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>On paraphrase identification corpora</article-title>
          .
          <source>In: LREC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Sjoblom, E.,
          <string-name>
            <surname>Creutz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aulamo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Paraphrase detection on noisy subtitles in six languages</article-title>
          .
          <source>In: Proceedings of W-NUT at EMNLP</source>
          . Brussels, Belgium (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Snow</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Cheap and fast - but is it good? evaluating non-expert annotations for natural language tasks</article-title>
          .
          <source>In: Proceedings of EMNLP</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vila</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martí</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Corpus annotation with paraphrase types: new annotation scheme and inter-annotator agreement measures</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>49</volume>
          ,
          <issue>77</issue>
          –
          <fpage>105</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>van der Wees</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bisazza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Measuring the effect of conversational aspects on machine translation quality</article-title>
          .
          <source>In: Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          . pp.
          <volume>2571</volume>
          –
          <fpage>2581</fpage>
          .
          <string-name>
            <surname>Osaka</surname>
          </string-name>
          ,
          <string-name>
            <surname>Japan</surname>
          </string-name>
          (
          <year>December 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Yimam</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Eckart de Castilho, R.,
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Webanno: A flexible, web-based and visually supported system for distributed annotations</article-title>
          .
          <source>In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          . pp.
          <volume>1</volume>
          –
          <issue>6</issue>
          .
          <article-title>Association for Computational Linguistics, Sofia</article-title>
          ,
          <string-name>
            <surname>Bulgaria</surname>
          </string-name>
          (
          <year>August 2013</year>
          ), http://www.aclweb.org/anthology/P13-4001
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>