<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crowdsourcing versus the laboratory: Towards crowd-based linguistic text quality assessment of query-based extractive summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Neslihan Iskender</string-name>
          <email>neslihan.iskender@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Polzehl</string-name>
          <email>tim.polzehl1@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Moller</string-name>
          <email>sebastian.moeller@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Quality and Usability Lab, TU Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Curating text manually in order to improve the quality of automatic natural language processing tools can become very time consuming and expensive. Especially in the case of query-based extractive online forum summarization, curating complex information spread across multiple posts from multiple forum members to create a short meta-summary that answers a given query is a very challenging task. To overcome this challenge, we explore the applicability of microtask crowdsourcing as a fast and cheap alternative for query-based extractive text summarization of online forum discussions. We measure the linguistic quality of crowd-based forum summarizations, which is usually assessed in a traditional laboratory environment with the help of experts, via comparative crowdsourcing and laboratory experiments. To our knowledge, no other study has considered query-based extractive text summarization and summary quality evaluation as an application area of microtask crowdsourcing. By conducting experiments both in crowdsourcing and laboratory environments, and comparing the results of linguistic quality judgments, we found that microtask crowdsourcing shows high applicability for determining the factors overall quality, grammaticality, non-redundancy, referential clarity, focus, and structure &amp; coherence. Further, our comparison of these findings with a preliminary and initial set of expert annotations suggests that crowd assessments can reach results comparable to experts, specifically when determining mean values of factors such as overall quality and structure &amp; coherence. Finally, preliminary analyses reveal a high correlation between the crowd and expert ratings when assessing low-quality summaries.</p>
      </abstract>
      <kwd-group>
        <kwd>digitally curated text</kwd>
        <kwd>microtask crowdsourcing</kwd>
        <kwd>linguistic summary quality evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the widespread usage of the world wide web, crowdsourcing has become
one of the main resources for so-called "micro-tasks" that require human
intelligence, for example to annotate text, to solve tasks that computers cannot
yet solve, or to connect to external knowledge and expertise. In this way, a fast
and relatively inexpensive mechanism is provided so that the cost and time barriers of
qualitative and quantitative laboratory studies and controlled experiments can be
mostly overcome [
        <xref ref-type="bibr" rid="ref12 ref15">15, 12</xref>
        ].
      </p>
      <p>
        Although microtask crowdsourcing has been primarily used for simple,
independent tasks such as image labeling or digitizing print documents [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], few
researchers have begun to investigate crowdsourcing for complex and expert
tasks such as writing, product design, or Natural Language Processing (NLP)
tasks [
        <xref ref-type="bibr" rid="ref19 ref34">19, 34</xref>
        ]. In particular, empirical investigation of different NLP tasks such as
sentiment analysis and assessment of translation quality in crowdsourcing has
shown that aggregated responses of crowd workers can produce gold-standard
data sets with quality approaching those produced by experts [
        <xref ref-type="bibr" rid="ref1 ref28 ref31">31, 1, 28</xref>
        ]. Inspired
by these results, we propose using microtask crowdsourcing as a fast and cheap
way of curating complex information spread across multiple posts from multiple
forum members to create a short meta-summary that answers a given query,
along with the quality evaluation of these summaries. To our knowledge, only
prior work by the authors themselves has considered query-based extractive
forum summarization and the linguistic quality evaluation of such summaries
as an application area of micro-task crowdsourcing, without
finalizing the decision on crowdsourcing's appropriateness for this task [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. To
fill this research gap, we focus on the subjective linguistic quality evaluation of
a curation task comprising the compilation of online discussion forum
summaries, by conducting comparative crowdsourcing and laboratory experiments.
If such a meta-summary, encompassing all forum posts aggregated and
summarized towards a certain query, can reach high-quality results, both human and
automated search as well as summary embedding in different contexts can
become much more valuable means to raise efficiency and accuracy in information
retrieval applications.
      </p>
      <p>In the remainder of this paper, we answer the research question "Can the crowd
successfully create query-based extractive summaries and assess the overall and
linguistic quality of these summaries?" by conducting both laboratory and
crowdsourcing experiments and comparing the results. Then, as preliminary
results, we compare laboratory and crowd assessments with an initial data set
of expert annotations to determine the reliability of non-expert annotations for
summary quality evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Evaluation of Summary Quality</title>
        <p>
          The evaluation of summary quality is crucial for determining the success of any
summarization method, improving the quality of both human and automatic
summarization tools, as well as for their commercialization. Due to the
subjectivity and ambiguity of summary quality evaluation, as well as the high variety
of summarization approaches, the possible measures for summary quality
evaluation can be broadly classified into two categories: extrinsic and intrinsic
evaluation [
          <xref ref-type="bibr" rid="ref17 ref32">17, 32</xref>
          ].
        </p>
        <p>
          In extrinsic evaluation, the evaluation of summary quality is accomplished
on two bases: content responsiveness, which examines the summary's usefulness
with respect to an external information need or goal, and relevance assessment,
which determines whether the source document contains relevant information about the
user's need or query [
          <xref ref-type="bibr" rid="ref26 ref5">26, 5</xref>
          ]. The extrinsic measures are usually assessed
manually with the help of experts or non-expert crowdworkers. In intrinsic evaluation,
the evaluation of the summary is directly based on itself and is often done by
comparison with a reference summary [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Two main approaches to measure
the intrinsic quality are the linguistic quality evaluation (or readability
evaluation) and content evaluation [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. The content evaluation is often performed
automatically and determines how many word sequences of the reference summary
are included in the peer summary [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. In contrast, linguistic quality evaluation
contains the assessment of grammaticality, non-redundancy, referential clarity,
focus, structure and coherence [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          There are widely used automatic quality metrics developed to measure
the quality of a summary, such as ROUGE, which offers a set of statistics (e.g.
ROUGE-2, which uses 2-grams) by executing a series of recall measures based on
n-gram co-occurrence between a peer summary and a list of reference summaries
[
          <xref ref-type="bibr" rid="ref21 ref33">21, 33</xref>
          ]. These scores can only provide content-based similarity based on a gold
standard summary created by an expert. However, linguistic summary quality
features such as grammaticality, non-redundancy, referential clarity, focus, and
structure &amp; coherence, cannot be assessed automatically in most cases [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ].
The existing automatic evaluation methods for linguistic quality evaluation are
rare [
          <xref ref-type="bibr" rid="ref22 ref29 ref9">22, 29, 9</xref>
          ], often do not consider the complexity of the quality dimensions,
and can require language-dependent adaptation for the recognition of very
complex linguistic features. Therefore, linguistic quality features are either assessed
manually by experts or not evaluated at all due to the time and cost involved. Thus,
new manual quality assessment methods are still needed in summary quality
evaluation research.
        </p>
        <p>
          In this paper, we focus on intrinsic evaluation measures, especially on the
linguistic quality evaluation as defined in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], performed under both crowd-working
and laboratory conditions.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Crowdsourcing for Summary Creation and Evaluation</title>
        <p>
          In recent years, researchers have found that even some complex and expert tasks
such as writing, product design, or NLP tasks may be successfully completed by
non-expert crowd workers with appropriate process design and technological
support [
          <xref ref-type="bibr" rid="ref18 ref19 ref2 ref3">19, 3, 2, 18</xref>
          ]. In particular, using non-expert crowd workers for NLP tasks
which are usually conducted by experts has become an active research interest
due to the organizational and financial benefits of microtask crowdsourcing [
          <xref ref-type="bibr" rid="ref18 ref4 ref7">18,
7, 4</xref>
          ].
        </p>
        <p>
          Although crowdsourcing services provide quality control mechanisms, the
quality of crowdsourced corpus generation has been repeatedly questioned
because of crowd workers' inaccuracy and the complexity of text
summarization. Gillick and Liu [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] have shown that non-expert crowdworkers cannot
produce summaries with the same linguistic quality as experts. In addition,
Lloret et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] have conducted a crowdsourcing study on corpus generation
for abstractive image summarization, and their results suggest that non-expert
crowdworkers perform poorly due to the complexity of the summarization task and
the lack of motivation of crowdworkers. However, El-Haj et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] have shown that
Amazon's Mechanical Turk (AMT, http://www.mturk.com) is appropriate for carrying out the
creation of human-generated single-document summaries from Wikipedia
and newspaper articles in Arabic. One reason for the differing results regarding the
appropriateness of crowdsourcing for summary evaluation may be that these
studies have concentrated on different kinds of summarization tasks, such as
abstractive image summarization or extractive generic summarization, which leads
to varying levels of task complexity.
        </p>
        <p>
          Focusing on summarization quality evaluation, the application of
crowdsourcing has not been explored as thoroughly as for other NLP tasks, such as
translation [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. However, subjective quality assessment is needed to determine the
quality of automatic or human-generated summaries [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. Using crowdsourcing
for subjective summary quality evaluation provides a fast and cheap alternative
to traditional subjective testing with experts, but the crowdsourced
annotations must be checked for quality since they are produced by workers with
unknown or varied skills and motivations [
          <xref ref-type="bibr" rid="ref25 ref27">27, 25</xref>
          ].
        </p>
        <p>
          To date, only Gillick and Liu [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] have conducted various crowdsourcing
experiments to investigate the quality of crowdsourced summary quality evaluation
and showed that non-expert crowdworkers cannot evaluate summaries as
well as experts. Also, Iskender et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] have carried out a crowdsourcing study
to evaluate summary quality, showing that experts and crowd workers
correlate only when assessing low-quality summaries. Gao et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], Falke et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
          and Fan et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] have used crowdsourcing as a source of human evaluation
to evaluate their automatic summarization systems, but have not questioned the
robustness of crowdsourcing for this task.
        </p>
        <p>Therefore, more empirical studies should be conducted to find out which
kinds of summarization evaluation tasks are appropriate for crowdsourcing, how
to design crowdsourcing tasks appropriate for crowd workers, and how to
ensure the quality of crowdsourcing for summarization evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <sec id="sec-3-1">
        <title>Data Set</title>
        <p>We used a German data set of crowdsourced summaries created through
query-based extractive summarization of forum queries and posts. These forum queries
and posts originate from the forum Deutsche Telekom hilft, where Telekom
customers ask questions about the company's products and services, and the
questions are answered by other customers or by the company's support agents.</p>
        <sec id="sec-3-1-1">
          <title>1 http://www.mturk.com.</title>
          <p>This summary data set had already been annotated on a 5-point MOS (mean opinion score) scale in a
previous crowdsourcing experiment. After aggregating the three different
judgments per summary with majority voting in that experiment, the quality of
these summaries ranged from 1.667 to 5. Based on these annotations, we
allocated 50 summaries within 10 distinct quality groups ranging from lowest
to highest scores (lowest group [1.667, 2]; highest group (4.667, 5]), each
represented by five summaries, to generate stratified data of widely varying quality.
The average word count of these summaries was 63.32, the shortest one with
24 words, and the longest one with 147 words. The corresponding posts had an
average word count of 555, the shortest posts with 155 words, and the longest
with 1005 words. Accordingly, the average length of the customer queries was 7.78 words,
the shortest one with 4 words, and the longest with 17 words.</p>
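          <p>A minimal sketch of this stratification step is given below (synthetic scores and illustrative column names, not the pipeline actually used in the study):</p>
          <preformat>
# Sketch: allocate summaries into 10 equal-width MOS bins (5 per bin), as
# described above. The DataFrame layout, column names and random seed are
# illustrative assumptions, not the authors' code.
import numpy as np
import pandas as pd

def stratify(df, n_bins=10, per_bin=5, lo=5/3, hi=5.0, seed=0):
    edges = np.linspace(lo, hi, n_bins + 1)        # [1.667, 2.0, ..., 5.0]
    bins = pd.cut(df["mos"], bins=edges, include_lowest=True)
    return (df.assign(quality_bin=bins)
              .groupby("quality_bin", observed=True)
              .sample(n=per_bin, random_state=seed))

# Example with synthetic aggregated scores:
rng = np.random.default_rng(0)
pool = pd.DataFrame({"summary_id": range(500),
                     "mos": rng.uniform(5/3, 5.0, size=500)})
sample = stratify(pool)                            # 50 summaries, 5 per bin
print(sample["quality_bin"].value_counts().sort_index())
          </preformat>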
      </sec>
      <sec id="sec-3-2">
        <title>Crowdsourcing Study</title>
        <p>
          All crowdsourcing tasks were completed using the Crowdee platform
(https://www.crowdee.com). For crowd worker selection, we used two different tasks: a
German language proficiency screener provided by the Crowdee platform and a task-specific
qualification job developed by the author team. We admitted only crowd workers who passed the
German language test with a score of 0.9 and above (scale [0, 1]) to participate
in the qualification job.
        </p>
        <p>In this qualification task, we gave explanations about the process of extractive
summary crafting and asked the crowd workers to rate the overall linguistic and
content quality of four reference summaries (two very good, two very bad) whose
quality had already been annotated by experts on a 5-point MOS scale using the
labels very good, good, moderate, bad, very bad. Expert scores were not shown
to the participants. For each rating matching the experts' rating, crowd workers
earned 4 points. For each MOS-scale point of deviation from the expert rating,
crowd workers earned one point less, so increasing deviations from the experts'
ratings were linearly penalized.</p>
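        <p>A minimal sketch of this scoring rule (assuming one combined rating per reference summary and a floor of zero points; the 0.625 point-ratio pass threshold is taken from the next paragraph):</p>
        <preformat>
# Sketch of the qualification scoring described above: 4 points for an exact
# match with the expert rating, one point less per MOS-scale point of
# deviation. One rating per reference summary and a zero floor are assumed.
def qualification_score(crowd, expert):
    points = [max(0, 4 - abs(c - e)) for c, e in zip(crowd, expert)]
    return sum(points) / (4 * len(expert))   # point ratio in [0, 1]

# A worker rates the four reference summaries (MOS 1..5):
ratio = qualification_score(crowd=[5, 4, 2, 1], expert=[5, 5, 1, 1])
print(ratio, ratio >= 0.625)             # 0.875 True (pass threshold below)
        </preformat>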
        <p>We paid 1.2 Euros for the qualification task, and its average completion
duration was 417 seconds, ca. 7 minutes. The qualification task was online for one
week. Out of 1569 screened crowd workers of the Crowdee platform holding a German
language score &gt;= 0.9, 82 crowd workers participated in the qualification task,
67 of them passed the test with a point ratio &gt;= 0.625, and 46 qualified
crowd workers returned to perform the summary quality assessment
task when it was published two weeks later.</p>
        <p>In the summary quality assessment task, crowd workers were presented with
a brief explanation of how the summaries were created. It was highlighted that the
summaries were constructed by simply copying sentences from forum
posts, which can result in a somewhat unnatural or disjointed composition.
After that, an example of a query, forum posts, and a summary was shown. Next,
crowd workers answered 9 questions regarding the quality of a single summary
in the following order: 1) overall quality, 2) grammaticality, 3) non-redundancy,
4) referential clarity, 5) focus, 6) structure &amp; coherence, 7) summary usefulness,
8) post usefulness and 9) summary informativeness.</p>
        <p>The overall quality was asked first to avoid the influence of the more detailed
aspects on the overall quality judgment. The scoring of each aspect of a
single summary was done on a separate page, which contained a short, informal
definition of the respective aspect (sometimes illustrated with an example), the
summary, and the 5-point MOS scale (very good, good, moderate, bad, very bad).
Additionally, in question 7 we showed the original query; in questions 8 and 9,
the original query and the corresponding forum posts were displayed.</p>
        <p>Each of the 50 summaries was rated by 24 different crowd workers, resulting
in 10,800 labels (50 summaries x 9 questions x 24 repetitions). The estimated
work duration for completing this task was 7 minutes. Accordingly, the payment
was calculated based on the minimum hourly wage floor (9.19 Euros in
Germany), so the crowd workers were paid 1.2 Euros. Overall, 46 crowd workers
(19f, 27m, Mage = 43) completed the individual sets of tasks within 20 days,
spending 249,884 seconds, ca. 69.4 hours in total. With an average of
55.543 answers accepted on the Crowdee platform in total, the crowd workers were
relatively experienced users of the platform.</p>
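        <p>As a rough check of this payment calculation (the rounding up from the computed amount to 1.2 Euros per task is our assumption):</p>
        <preformat>
# Rough check: pay per task derived from the German minimum hourly wage
# floor and the 7-minute estimated duration; rounding up to 1.20 Euros
# per task is assumed.
wage_floor_eur_per_hour = 9.19
estimated_minutes = 7
raw_pay = wage_floor_eur_per_hour * estimated_minutes / 60
print(round(raw_pay, 2))   # 1.07 -> paid out as 1.20 Euros per task
        </preformat>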
      </sec>
      <sec id="sec-3-3">
        <title>Laboratory study</title>
        <p>The summary quality evaluation task design itself was identical to the
crowdsourcing study. Participants also used the Crowdee platform to answer the
questions, however, this time in a controlled laboratory environment. The experiment
duration was set to one hour and the participants were instructed to evaluate as
many summaries as they could. Following common practice for laboratory tests, all
the participants were instructed in a written form before the experiment with
the general task description and all their questions regarding the experiment
rules or general questions were answered immediately. Each of the 50 summaries
was again rated by 24 different participants, resulting in further 10,800 labels
(50 summaries x 9 questions x 24 repetitions).</p>
        <p>Participants were recruited using a local participant pool admitting German
natives only. Before conducting the laboratory experiment, we collected
participant information about age, gender, education and knowledge about the services
and products of telecommunication service Telekom from which the queries and
posts originate. Overall, 71 participants (33f, 38m, Mage = 29) completed the
laboratory study in 51 days, spending 295,033 seconds, ca. 82 hours in total. The
average number of evaluated summaries in an hour was 12 and they were paid
15 Euros per hour. Attained education was distributed over the complete range
with 46% having completed high school, 7% college, 24% a Bachelor's degree
and 23% Master's degree or higher. The question about knowledge on
telecommunication service Telekom resulted in self-assessments of a 10% very bad, 24%
bad, 39% average, 25% good and 1% very good answer distribution.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Results are presented for the scores overall quality (OQ) and the five linguistic
quality scores (grammaticality (GR), non-redundancy (NR), referential clarity (RC),
focus (FO), and structure &amp; coherence (SC)), and we will refer
to these labels by their abbreviations in this section. The evaluation of the
extrinsic factors, summary usefulness (SU), post usefulness (PU) and summary
informativeness (SI), is future work.</p>
      <p>Overall, we analyzed 7,200 labels (50 summaries x 6 questions x 24
repetitions) from the crowdsourcing study, collected with 46 crowd workers, and 7,200
labels from the laboratory study, collected with 71 participants. We used majority
voting as the aggregation method, which led to 300 labels (50 summaries x 6
questions, aggregated over 24 repetitions) from crowdsourcing and 300 labels from
the laboratory study. After data cleaning and basic outlier removal, we analyzed
288 labels, N = 48 per question (48 summaries x 6 questions), from
crowdsourcing and 288 labels, N = 48 per question, from the laboratory
study.</p>
      <sec id="sec-4-1">
        <title>Evaluation of Crowdsourcing Ratings</title>
        <p>Anderson Darling tests for normality were conducted to test the
distribution of crowd ratings for OQ, GR, NR, RC, FO, and SC, indicating that all
items are normally distributed with p &gt; 0.05. Figure 1 shows the boxplots of
each crowd-rated item.</p>
        <p>To determine the relationship between OQ and GR, NR, RC, FO, and SC,
Pearson correlations were computed (cf. Table 1). With each of these linguistic quality
items, OQ obtained a significant, high correlation coefficient rp &gt; .84 with p &lt;
.001, which indicates a very strong linear relationship between the individual
linguistic quality items and OQ, with the correlation between OQ and RC
being the strongest (rp = .97). In addition, the linguistic quality items inter-correlate
with each other significantly with p &lt; .001 and rp &gt; .71, with the correlation
coefficient between GR and NR being the weakest (rp = .711) and the correlation
between RC and FO being the strongest (rp = .971).</p>
        <p>Before conducting a one-way ANOVA test to compare the means of OQ and
the five linguistic quality scores for significant differences, Levene's test to check
the homogeneity of variances was carried out, with the respective assumptions met.
There were statistically significant differences between group means revealed by
the one-way ANOVA (p &lt; .05). A post hoc test applying the Tukey criterion revealed
that the mean of FO (M = 3.937) was significantly higher than the mean of OQ
(M = 3.588, p &lt; 0.05) and the mean of GR (M = 3.565, p &lt; 0.05). No other
significant differences were found.</p>
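        <p>The following sketch illustrates this analysis pipeline (Anderson-Darling normality check, Pearson correlations, Levene's test, one-way ANOVA with a Tukey post hoc) on synthetic data standing in for the aggregated MOS values:</p>
        <preformat>
# Sketch of the statistical analysis described above, on synthetic data.
# Real input would be the 48 aggregated per-summary MOS values for each of
# OQ, GR, NR, RC, FO and SC; the values below are randomly generated.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
items = ["OQ", "GR", "NR", "RC", "FO", "SC"]
ratings = {name: np.clip(rng.normal(3.6, 0.6, 48), 1, 5) for name in items}

# Anderson-Darling normality check per item
for name, x in ratings.items():
    print(name, stats.anderson(x, dist="norm").statistic)

# Pearson correlations between OQ and each linguistic quality item
for name in items[1:]:
    r, p = stats.pearsonr(ratings["OQ"], ratings[name])
    print(f"OQ vs {name}: r={r:.2f}, p={p:.3g}")

# Levene's test, one-way ANOVA and Tukey post hoc across the six items
groups = [ratings[name] for name in items]
print("Levene:", stats.levene(*groups))
print("ANOVA :", stats.f_oneway(*groups))
print(pairwise_tukeyhsd(np.concatenate(groups), np.repeat(items, 48)))
        </preformat>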
      </sec>
      <sec id="sec-3-4">
        <title>Evaluation of Laboratory Ratings</title>
        <p>Anderson Darling tests for normality were conducted to test the distribution
of laboratory ratings for OQ, GR, NR, RC, FO, and SC, indicating that all
items are normally distributed (p &gt; 0.05). Figure 1 shows the boxplots of each
laboratory-rated item.</p>
        <p>To determine the relationship between OQ and the five linguistic quality
scores, Pearson correlations were computed (cf. Table 2). With each of these
linguistic quality items, OQ obtained a significant correlation coefficient rp &gt; .79
with p &lt; .001, indicating a strong linear relationship between the linguistic quality
items and OQ, with the correlation between OQ and SC being the strongest (rp =
.946). In addition, the linguistic quality items inter-correlate with each other
significantly with p &lt; .001 and rp &gt;= .58, where the correlation coefficient between
GR and NR is the weakest (rp = .58) and the correlation between RC and
FO is the strongest (rp = .929).</p>
        <p>Before conducting a one-way ANOVA test to compare OQ and the five
linguistic quality scores with each other, Levene's test to check the homogeneity
of variances was carried out, with the respective assumptions met. There
were statistically significant differences between group means determined by the
one-way ANOVA (p &lt; .001). A post hoc test (Tukey criterion) revealed that the mean
of FO (M = 3.827) was significantly higher than the means of OQ (M =
3.359, p &lt; 0.01), GR (M = 3.354, p &lt; 0.01), and SC (M = 3.406, p &lt;
0.001). Again, no other significant differences were found.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Comparing Crowdsourcing and Laboratory</title>
        <p>When calculating Pearson correlation coefficients between the MOS from crowd
assessments and the MOS from laboratory assessments, the results for OQ (rp
= .935), GR (rp = .90), NR (rp = .833), RC (rp = .881), FO (rp = .869) and SC
(rp = .911), with p &lt; .001 for all correlations, reveal an overall very strong significant
linear relationship between crowd and laboratory assessments. Figure 3 shows
the dependency of crowd and laboratory ratings in scatter plots.</p>
        <p>To compare OQ and the five linguistic quality scores from crowdsourcing with
their respective items from laboratory ratings, independent-samples t-tests
were conducted. Before that, Anderson Darling tests for
normality and Levene's test to check the homogeneity of variances were
carried out, with the respective assumptions met. The t-test results revealed that there
was no significant difference between the OQ, GR, RC, FO and SC ratings with
respect to the corresponding crowd and laboratory ratings. Only for NR
there was a significant difference, revealing that the mean of NR in the laboratory
(M = 3.60) was rated significantly lower than the mean of NR in crowdsourcing
(M = 3.831, p &lt; .05).</p>
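        <p>A minimal sketch of this crowd-versus-laboratory comparison for a single item (synthetic, paired-per-summary values stand in for the 48 aggregated ratings):</p>
        <preformat>
# Sketch: comparing crowd and laboratory ratings of one item (here NR) with
# a Pearson correlation and an independent-samples t-test, as described
# above. Synthetic, paired-per-summary data replaces the study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nr_crowd = np.clip(rng.normal(3.83, 0.55, 48), 1, 5)
nr_lab = np.clip(nr_crowd - 0.23 + rng.normal(0, 0.25, 48), 1, 5)

print("Pearson:", stats.pearsonr(nr_crowd, nr_lab))
print("Levene :", stats.levene(nr_crowd, nr_lab))
print("t-test :", stats.ttest_ind(nr_crowd, nr_lab, equal_var=True))
        </preformat>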
      </sec>
      <sec id="sec-3-6">
        <title>Preliminary Results: Towards Comparing Experts with</title>
      </sec>
      <sec id="sec-3-7">
        <title>Crowdsourcing and Laboratory</title>
        <p>To explore the relationship between expert and non-expert ratings, we created
a reference group of 24 randomly selected summaries (N = 24) from our data
set to test the congruence of the non-expert judgments by comparing them to an
initial data set of expert judgments, for which two experts rated OQ, GR, NR, RC,
FO and SC of the 24 summaries. We again used majority voting as our aggregation
method when analysing the preliminary results of the comparison of expert judgments
with non-expert judgments.</p>
        <p>At first, we divided our data into three groups, all of which contain crowd,
laboratory and expert judgments. The first group "All" includes all the
summaries acting as the reference group (N = 24). Next, we split the data into
subsets of high- and low-rated summaries by the median of the crowd OQ ratings
(Mdn = 3.646) and created the groups "Low" (N = 12) and "High" (N = 12)
for crowdsourcing, laboratory and expert ratings, respectively. Because of the
resulting non-normal distribution in the groups, we calculated Spearman
rank-order correlation coefficients for all three groups between expert ratings
and crowd ratings on the one side, and expert ratings and laboratory ratings on
the other side. Table 3 shows the correlation coefficients for all groups.</p>
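        <p>A sketch of the median split and per-group Spearman correlations (synthetic ratings; the median of 3.646 reported above comes from the actual crowd OQ data):</p>
        <preformat>
# Sketch: split the 24 reference summaries into "Low"/"High" by the median
# of the crowd OQ ratings and compute Spearman correlations per group.
# Synthetic ratings stand in for the expert, crowd and laboratory data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
crowd = rng.uniform(1.5, 5.0, 24)
expert = np.clip(crowd + rng.normal(0, 0.6, 24), 1, 5)
lab = np.clip(crowd + rng.normal(0, 0.5, 24), 1, 5)
df = pd.DataFrame({"crowd": crowd, "expert": expert, "lab": lab})

median = df["crowd"].median()                     # 3.646 in the study
df["group"] = np.where(df["crowd"].le(median), "Low", "High")

for group, sub in [("All", df)] + list(df.groupby("group")):
    rho_c, _ = stats.spearmanr(sub["expert"], sub["crowd"])
    rho_l, _ = stats.spearmanr(sub["expert"], sub["lab"])
    print(f"{group}: expert-crowd rho={rho_c:.2f}, expert-lab rho={rho_l:.2f}")
        </preformat>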
        <p>Comparing the correlations between expert and crowd ratings, as shown in
the left part of Table 3, in the second column for the group "Low", a strong
correlation can be found between the OQ ratings, as well as between the FO ratings.
Also, the overall magnitude of the correlations between experts and crowd increases in the group
"Low" in comparison to the group "All". Comparing the correlations between
expert and laboratory ratings, as shown in the right part of Table 3, in the
fifth column for the group "Low", we observe that the correlation coefficients
for OQ, GR, and SC generally increase compared to the group "All". The
correlation coefficients of NR and FO between expert and laboratory ratings,
however, increase for the high-quality summaries compared to the group "All".</p>
        <p>In order to find out whether the labels assessed by crowd workers, laboratory
participants or experts show significant differences, we compare the individual means of
the crowdsourcing, laboratory and expert assessments with respect to OQ and the
five linguistic quality scores, applying one-way ANOVA or, in case of non-normal
distribution, Kruskal-Wallis tests. Results show significant differences between the
crowdsourcing, laboratory and expert assessments with respect to the means of the OQ
ratings (p &lt; .05), GR ratings (p &lt; .001), NR ratings (p &lt; .001), RC ratings
(p &lt; .001) as well as FO ratings (p &lt; .001). Only the SC ratings
did not show a significant difference. In a final analysis, we compare the absolute
magnitudes and offsets of the ratings applying post hoc tests (Tukey criterion in case
of normal distribution, and Dunn's criterion in case of non-normal distribution).
Results revealed that the experts rated OQ (M = 3.771) significantly higher
than the laboratory participants (M = 3.266, p &lt; .05). Moreover, experts rated
GR (M = 4.25), NR (M = 4.354), RC (M = 4.396) and FO (M = 4.333)
significantly higher than the crowd workers (MGR = 3.642, MNR = 3.818, MRC =
3.709, MFO = 3.90, and p &lt; .05 for all) and the laboratory participants (MGR
= 3.399, MNR = 3.521, MRC = 3.507, MFO = 3.741, and p &lt; .05 for all).</p>
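        <p>A sketch of this omnibus comparison across the three rater groups for a single item (synthetic data; a Dunn post hoc test, e.g. from the scikit-posthocs package, would follow in the Kruskal-Wallis case):</p>
        <preformat>
# Sketch: comparing crowd, laboratory and expert ratings of one quality item,
# using one-way ANOVA when all groups look normal and Kruskal-Wallis
# otherwise. Synthetic ratings stand in for the N = 24 reference summaries.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = {
    "crowd": np.clip(rng.normal(3.64, 0.6, 24), 1, 5),
    "lab": np.clip(rng.normal(3.40, 0.6, 24), 1, 5),
    "expert": np.clip(rng.normal(4.25, 0.5, 24), 1, 5),
}

def looks_normal(x):
    res = stats.anderson(x, dist="norm")
    return res.critical_values[2] > res.statistic    # 5% critical value

if all(looks_normal(x) for x in groups.values()):
    print("ANOVA:", stats.f_oneway(*groups.values()))          # + Tukey post hoc
else:
    print("Kruskal-Wallis:", stats.kruskal(*groups.values()))  # + Dunn post hoc
        </preformat>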
        <p>However, due to the small sample size of these three groups, these results
need to be interpreted with caution and treated as preliminary results.
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>In this paper, we have analyzed the appropriateness of micro-task crowdsourcing
for the complex task of query-based extractive text summary creation by means
of linguistic quality assessment.</p>
      <p>The analysis of crowd ratings (cf. 4.1) and the analysis of laboratory
ratings (cf. 4.2) have shown that there is a significant strong or very strong
inter-correlation between the overall quality ratings and the five linguistic scores in both
environments, suggesting that non-experts associate the overall quality strongly
with the linguistic quality. Additionally, the one-way ANOVA analysis for both
environments has revealed that the means of the individual scores do not differ
from each other significantly, except for the mean of focus. Interestingly, the
significantly higher ratings with respect to focus in both experiments
indicate that query-based summaries can well be assessed as highly focused
on a specific topic, while at the same time showing
lower grammaticality or structure &amp; coherence assessments. Potentially, this can
be connected to the nature of query-based summaries being extracted from individual posts, the
latter being likely to be focused on a given query.</p>
      <p>With the comparison of crowd and laboratory ratings (cf. 4.3), we have shown
that there is a statistically significant very strong correlation between the overall
quality ratings and the five linguistic quality scores, although crowd workers are
not as well instructed as the laboratory participants, who, e.g., receive a
personal introduction and a pre-written instruction sheet and are able to verbally
clarify irritations. As the main finding of this paper, we showed that the degree
of control over noise, mental distraction, and continuous work does not lead to any
difference in the overall quality. Also, the presented results from the
independent-samples t-tests support these findings, except for non-redundancy, which needs
to be analyzed in more detail in future work. These findings highlight
that crowdsourcing can be used instead of laboratory studies to determine the
subjective overall quality and the linguistic quality of text summaries.</p>
      <p>Additionally, as the preliminary results in section 4.4 reveal, crowd workers
may even be preferred over experts in certain cases, such as
identifying the overall quality and focus of low-quality summaries or determining the mean
overall quality and structure &amp; coherence of a summary data set. Similarly,
laboratory participants may be preferred over experts in cases such as assessing
the grammaticality and referential clarity of low-quality summaries. Again,
laboratory studies might be used to determine the mean structure &amp; coherence of
a summarization data set. In particular, using crowdsourcing to eliminate bad-quality
summaries from a data set might be quite beneficial when training an
automatic summarization tool with unannotated, noisy data, or when deciding
on the application of experts or crowds in order to procure cost-efficient,
high-quality text summaries at scale.</p>
      <p>
        Further, since the automatic evaluation of text summaries always requires
gold standard data to calculate metrics such as ROUGE [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], NLP research
might profit from using crowdsourcing to determine the mean overall quality of
an automatic summarization tool. In this way, the performance of different
automatic summarization tools can be compared with each other without
gold standard data, which is costly to create. Especially when assessing
summaries to prepare training data for an end-user directed summarization application,
a naive assessment by non-expert crowd workers may even reflect a more realistic
assessment with respect to non-expert level understanding and comprehension
in the end user group in comparison to expert evaluation.
      </p>
      <p>For all other items, e.g. grammaticality, non-redundancy, referential clarity
and focus, experts rate significantly higher than the crowd workers and
laboratory participants. This observation might be explained by the fact that the
nature of extractive summarization and its inherent text quality losses - compared to
naturally composed text flow - are more familiar to experts than to non-experts;
hence, the quality degradation may be more noticeable and accessible to
experts.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we execute a first step to answer the question "Can the crowd
successfully be applied to create query-based extractive summaries?". Although
crowd workers are not as well instructed as laboratory
participants, crowdsourcing can be used instead of traditional laboratory assessments
to evaluate the overall and linguistic quality of text summaries, as shown above.
This finding highlights that crowdsourcing opens the prospect of large-scale
subjective overall and linguistic quality annotation of text summaries in a fast
and cheap way, especially when a naive end-user's viewpoint is needed to evaluate
an automatic summarization application or any other kind of summarization method.</p>
      <p>However, the preliminary results also suggest that if expert annotations are
needed, crowdsourcing and laboratory assessments can be used instead of
experts only in certain cases, such as identifying summaries with low overall
quality, grammaticality, focus and structure &amp; coherence, and also determining
the mean overall quality and structure &amp; coherence of a summary data set. Thus,
if there is an unannotated summary data set which needs expert annotation,
crowd workers or laboratory participants cannot replace the experts,
since the correlation coefficients for mixed-quality summaries are generally
moderate. Additionally, the correlation coefficients of experts with non-experts vary
in a range from weak to very strong between groups. Currently, we cannot
precisely explain these different correlation magnitudes of the experts and
non-experts. The reasons for this disagreement can be manifold, for example,
a different understanding of the guidelines, varying weighting of the respective
summary parts, or the lack of expertise. Right now, based on the comparative
results of laboratory and crowd ratings, we can only exclude the online working
characteristics of crowdsourcing, such as unmotivated and unconcentrated crowd
workers.</p>
      <p>In future work, the reasons for these different correlation magnitudes will be
investigated by collecting more expert data, as the expert ratings are so far collected
for only half of our data set. Also, qualitative interviews will be conducted with crowd
workers to find out how well the guidelines are understood by them. Furthermore,
this work does not include any special data cleaning or annotation aggregation
method for the 24 different judgments of a single item. Therefore, further analysis
needs to be performed to answer the question of how many repetitions are enough
for crowdsourcing and laboratory assessments so that results comparable to experts
can be obtained. Lastly, the already collected extrinsic quality data will be analyzed
to explore the relationship between overall, intrinsic and extrinsic quality factors.
A deeper analysis of which evaluation measures are more sensitive to varying
annotation quality will also be part of future work, in order to analyze in more
detail the dependencies, requirements and applicability of a general application
of crowd-based summary creation to help both humans and automated tools
curate large online texts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Fast, cheap, and creative: evaluating translation quality using amazon's mechanical turk</article-title>
          .
          <source>In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-</source>
          Volume 1. pp.
          <volume>286</volume>
          {
          <fpage>295</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chatterjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukhopadhyay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Quality enhancement by weighted rank aggregation of crowd opinion</article-title>
          .
          <source>arXiv preprint arXiv:1708.09662</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chatterjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukhopadhyay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A review of judgment analysis algorithms for crowdsourced opinions</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cocos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masino</surname>
            ,
            <given-names>A.J.:</given-names>
          </string-name>
          <article-title>Crowd control: Effectively utilizing unscreened crowd workers for biomedical data annotation</article-title>
          .
          <source>Journal of biomedical informatics 69</source>
          ,
          <volume>86</volume>
          {
          <fpage>92</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Conroy</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang</surname>
          </string-name>
          , H.T.:
          <article-title>Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality</article-title>
          .
          <source>In: Proceedings of the 22nd International Conference on Computational Linguistics-Volume</source>
          <volume>1</volume>
          . pp.
          <volume>145</volume>
          {
          <fpage>152</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dang</surname>
          </string-name>
          , H.T.:
          <article-title>Overview of duc 2005</article-title>
          .
          <article-title>In: Proceedings of the document understanding conference</article-title>
          . vol.
          <year>2005</year>
          , pp.
          <volume>1</volume>
          {
          <issue>12</issue>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>De Kuthy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ziai</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meurers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Focus annotation of task-based data: Establishing the quality of crowd annotation</article-title>
          .
          <source>In: Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL</source>
          <year>2016</year>
          (
          <article-title>LAW-X 2016)</article-title>
          . pp.
          <volume>110</volume>
          {
          <issue>119</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>El-Haj</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruschwitz</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Using mechanical turk to create a corpus of arabic summaries (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ellouze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaoua</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Hadrich</given-names>
            <surname>Belguith</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.</surname>
          </string-name>
          :
          <article-title>Mix multiple features to evaluate the content and the linguistic quality of text summaries</article-title>
          .
          <source>Journal of computing and information technology 25(2)</source>
          ,
          <volume>149</volume>
          {
          <fpage>166</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Falke</surname>
            , T., Meyer,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Concept-map-based multi-document summarization using concept coreference resolution and global importance optimization</article-title>
          .
          <source>In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          . pp.
          <volume>801</volume>
          {
          <issue>811</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grangier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Controllable abstractive summarization</article-title>
          .
          <source>arXiv preprint arXiv:1711.05217</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gadiraju</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>It's Getting Crowded! Improving the Effectiveness of Microtask Crowdsourcing. Gesellschaft für Informatik eV (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gao</surname>
            , Y., Meyer,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>April: Interactively learning to summarise by combining active preference learning and reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1808</source>
          .
          <volume>09658</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Gillick</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Liu,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Non-expert evaluation of summarization systems is risky</article-title>
          .
          <source>In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech</source>
          and
          <article-title>Language Data with Amazon's Mechanical Turk</article-title>
          . pp.
          <volume>148</volume>
          {
          <fpage>151</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Horton</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rand</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeckhauser</surname>
            ,
            <given-names>R.J.:</given-names>
          </string-name>
          <article-title>The online laboratory: Conducting experiments in a real labor market</article-title>
          .
          <source>Experimental economics 14(3)</source>
          ,
          <volume>399</volume>
          {
          <fpage>425</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Iskender</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gabryszak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polzehl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hennig</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <article-title>Moller, S.: A crowdsourcing approach to evaluate the quality of query-based extractive text summaries</article-title>
          .
          <source>In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX)</source>
          . pp.
          <volume>1</volume>
          {
          <issue>3</issue>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galliers</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>Evaluating natural language processing systems: An analysis and review</source>
          , vol.
          <volume>1083</volume>
          . Springer Science &amp; Business
          <string-name>
            <surname>Media</surname>
          </string-name>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Kairam</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heer</surname>
          </string-name>
          , J.:
          <article-title>Parting crowds: Characterizing divergent interpretations in crowdsourced annotation tasks</article-title>
          .
          <source>In: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work &amp; Social Computing</source>
          . pp.
          <volume>1637</volume>
          {
          <fpage>1648</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Kittur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nickerson</surname>
            ,
            <given-names>J.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmerman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lease</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horton</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The future of crowd work</article-title>
          .
          <source>In: Proceedings of the 2013 conference on Computer supported cooperative work</source>
          . pp.
          <volume>1301</volume>
          {
          <fpage>1318</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. <string-name><surname>Kittur</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Smus</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Khamkar</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kraut</surname>, <given-names>R.E.</given-names></string-name>: <article-title>Crowdforge: Crowdsourcing complex work</article-title>. <source>In: Proceedings of the 24th annual ACM symposium on User interface software and technology</source>. pp. <fpage>43</fpage>-<lpage>52</lpage>. <publisher-name>ACM</publisher-name> (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. <string-name><surname>Lin</surname>, <given-names>C.Y.</given-names></string-name>: <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>. <source>Text Summarization Branches Out</source> (<year>2004</year>)
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. <string-name><surname>Lin</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>H.T.</given-names></string-name>, <string-name><surname>Kan</surname>, <given-names>M.Y.</given-names></string-name>: <article-title>Automatically evaluating text coherence using discourse relations</article-title>. <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>. pp. <fpage>997</fpage>-<lpage>1006</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23. <string-name><surname>Lloret</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Plaza</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Aker</surname>, <given-names>A.</given-names></string-name>: <article-title>Analyzing the capabilities of crowdsourcing services for text summarization</article-title>. <source>Language Resources and Evaluation</source> <volume>47</volume>(<issue>2</issue>), <fpage>337</fpage>-<lpage>369</lpage> (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24. <string-name><surname>Lloret</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Plaza</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Aker</surname>, <given-names>A.</given-names></string-name>: <article-title>The challenging task of summary evaluation: an overview</article-title>. <source>Language Resources and Evaluation</source> <volume>52</volume>(<issue>1</issue>), <fpage>101</fpage>-<lpage>148</lpage> (<year>2018</year>)
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. <string-name><surname>Malone</surname>, <given-names>T.W.</given-names></string-name>, <string-name><surname>Laubacher</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Dellarocas</surname>, <given-names>C.</given-names></string-name>: <article-title>The collective intelligence genome</article-title>. <source>MIT Sloan Management Review</source> <volume>51</volume>(<issue>3</issue>), <fpage>21</fpage> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26. <string-name><surname>Mani</surname>, <given-names>I.</given-names></string-name>: <article-title>Summarization evaluation: An overview</article-title> (<year>2001</year>)
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27. <string-name><surname>Minder</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Bernstein</surname>, <given-names>A.</given-names></string-name>: <article-title>CrowdLang: A programming language for the systematic exploration of human computation systems</article-title>. <source>In: International Conference on Social Informatics</source>. pp. <fpage>124</fpage>-<lpage>137</lpage>. <publisher-name>Springer</publisher-name> (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. <string-name><surname>Nowak</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Rüger</surname>, <given-names>S.</given-names></string-name>: <article-title>How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation</article-title>. <source>In: Proceedings of the international conference on Multimedia information retrieval</source>. pp. <fpage>557</fpage>-<lpage>566</lpage>. <publisher-name>ACM</publisher-name> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. <string-name><surname>Pitler</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Louis</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nenkova</surname>, <given-names>A.</given-names></string-name>: <article-title>Automatic evaluation of linguistic quality in multi-document summarization</article-title>. <source>In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics</source>. pp. <fpage>544</fpage>-<lpage>554</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30. <string-name><surname>Shapira</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Gabay</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Gao</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Ronen</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Pasunuru</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Bansal</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Amsterdamer</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Dagan</surname>, <given-names>I.</given-names></string-name>: <article-title>Crowdsourcing lightweight pyramids for manual summary evaluation</article-title>. <source>arXiv preprint arXiv:1904.05929</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31. <string-name><surname>Snow</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>O'Connor</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Jurafsky</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Ng</surname>, <given-names>A.Y.</given-names></string-name>: <article-title>Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks</article-title>. <source>In: Proceedings of the conference on empirical methods in natural language processing</source>. pp. <fpage>254</fpage>-<lpage>263</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2008</year>)
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32. <string-name><surname>Steinberger</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Jezek</surname>, <given-names>K.</given-names></string-name>: <article-title>Evaluation measures for text summarization</article-title>. <source>Computing and Informatics</source> <volume>28</volume>(<issue>2</issue>), <fpage>251</fpage>-<lpage>275</lpage> (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33. <string-name><surname>Torres-Moreno</surname>, <given-names>J.M.</given-names></string-name>, <string-name><surname>Saggion</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Cunha</surname>, <given-names>I.d.</given-names></string-name>, <string-name><surname>SanJuan</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Velazquez-Morales</surname>, <given-names>P.</given-names></string-name>: <article-title>Summary evaluation with and without references</article-title>. <source>Polibits</source> (<issue>42</issue>), <fpage>13</fpage>-<lpage>20</lpage> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34. <string-name><surname>Valentine</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Retelny</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>To</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Rahmati</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Doshi</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Bernstein</surname>, <given-names>M.S.</given-names></string-name>: <article-title>Flash organizations: Crowdsourcing complex work by structuring crowds as organizations</article-title>. <source>In: Proceedings of the 2017 CHI conference on human factors in computing systems</source>. pp. <fpage>3523</fpage>-<lpage>3537</lpage>. <publisher-name>ACM</publisher-name> (<year>2017</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>