<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CrowdTruth Measures for Language Ambiguity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anca Dumitrache</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lora Aroyo</string-name>
          <email>lora.aroyog@vu.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Welty</string-name>
          <email>cawelty@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Research</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM CAS</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>VU University Amsterdam</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. We have been investigating the use of crowdsourcing as an affordable alternative to using experts to clean noisy data, and have found that with the proper analysis, crowds can rival and even out-perform the precision and recall of experts, at a much lower cost. We have further found that the crowd, by virtue of its diversity, can help us find evidence of ambiguous sentences that are difficult to classify, and we have hypothesized that such sentences are likely just as difficult for machines to classify. In this paper we outline CrowdTruth, a previously presented method for scoring ambiguous sentences that suggests that existing modes of truth are inadequate, and we present for the first time a set of weighted metrics for evaluating the performance of experts, the crowd, and a trained classifier in light of ambiguity. We show that our theory of truth and our metrics are a more powerful way to evaluate NLP performance than traditional unweighted metrics like precision and recall, because they allow us to account for the rather obvious fact that some sentences express the target relations more clearly than others.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        NLP often relies on the development of a set of gold standard annotations, or
ground truth, for the purposes of training, testing, and evaluation. Distant
supervision [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is a helpful solution that has brought linked data sets a lot of attention
in NLP; however, the resulting data can be noisy. Human annotators can help to clean
up this noise, but in Clinical NLP annotators are usually expected to have domain
knowledge, making the process of acquiring ground truth
more difficult. The lack of annotated datasets for training and benchmarking is
considered one of the big challenges of Clinical NLP [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Furthermore, the assumption that the gold standard represents a universal
and reliable model for language is flawed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Disagreement between annotators
is usually eliminated through overly prescriptive guidelines, resulting in data
that is neither general nor reflective of language's inherent ambiguity. The process
of acquiring ground truth by working exclusively with domain experts is costly
and does not scale.
      </p>
      <p>Crowdsourcing can be a much faster and cheaper procedure than expert
annotation, and it allows for collecting enough annotations per task in order to
represent the diversity inherent in language. Crowd workers, however, generally
lack medical expertise, which might impact the quality and reliability of their
work in more knowledge-intensive tasks.</p>
      <p>Our approach can overcome the limitations of gathering expert ground truth
by using disagreement analysis on crowd annotations to model the ambiguity
inherent in medical text. We have previously shown that our approach can improve
relation extraction classifier performance over annotated data provided by
experts, can effectively identify low-quality workers, and can identify issues with the
annotation tasks themselves. In this paper we explore the hypothesis that our
sentence-level metrics provide useful information about sentence clarity,
and present initial results on the value of scoring approaches that go beyond
the traditional precision, recall, and accuracy.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Crowdsourcing ground truth has shown promising results in a variety of domains.
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] compared the crowd versus experts for the task of part-of-speech tagging.
The authors also show that models trained based on crowdsourced annotation
can perform just as well as expert-trained models. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] studied crowdsourcing
for relation extraction in the general domain, comparing its efficiency to that
of fully automated information extraction approaches. Their results showed the
crowd was especially suited to identifying subtle formulations of relations that
do not appear frequently enough to be picked up by statistical methods.
      </p>
      <p>
        Other research for crowdsourcing ground truth includes: entity clustering
and disambiguation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Twitter entity extraction [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], multilingual entity
extraction and paraphrasing [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and taxonomy creation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, all of these
approaches rely on the assumption that one black-and-white gold standard must
exist for every task. Disagreement between annotators is discarded by picking one
answer that reflects some consensus, usually through majority vote. The
number of annotators per task is also kept low, between two and five workers,
also in the interest of eliminating disagreement. The novelty in our approach is
to consider language ambiguity, and consequently inter-annotator disagreement,
as an inherent feature of language. The metrics we employ for determining
the quality of crowd answers are specifically tailored to quantify disagreement
between annotators, rather than eliminate it.
      </p>
      <p>
        The role of inter-annotator disagreement when building a gold standard
has previously been discussed by [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. After empirically studying part-of-speech
datasets, the authors found that inter-annotator disagreement is consistent across
domains, even across languages. Furthermore, most disagreement is indicative
of debatable cases in linguistic theory, rather than faulty annotation. We
believe these findings manifest even more strongly for NLP tasks involving
semantic ambiguity, such as relation extraction. In assessing the Ontology Alignment
Evaluation Initiative (OAEI) benchmark, [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] found that disagreement between
annotators (both crowd and expert) is an indicator for inherent uncertainty in
the domain knowledge, and that current benchmarks in ontology alignment and
evaluation are not designed to model this uncertainty.
      </p>
      <p>
        Human annotation is a process of semantic interpretation. It can be described
using the triangle of reference [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], that links together three aspects: sign (input
text), interpreter (worker), and referent (annotation). Ambiguity in one aspect of
the triangle will propagate and affect the others; e.g. an unclear sentence will
cause more disagreement between workers. Therefore, in our work, we use metrics
to harness disagreement for each of the three aspects of the triangle, measuring
the quality of the worker, as well as the ambiguity of the text and the task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>We set up an experiment to train and evaluate a relation extraction model
for a sentence-level relation classifier. The classifier takes, as input, sentences
and two terms from the sentence, and returns a score reflecting the likelihood
that a specific relation, in our case the cause relation between disorders and
symptoms, is expressed in the sentence between the terms. Starting from a set
of 902 sentences that are likely to contain medical relations, we constructed
a workflow for collecting annotations through crowdsourcing. This output was
analyzed with our metrics for capturing disagreement, and then used to train
a model for relation extraction. In parallel, we also constructed a model based
on data from a traditional gold standard using domain experts, which we then
compare to the crowd model.</p>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>
          The dataset used in our experiments contains 902 medical sentences extracted
from PubMed article abstracts. The MetaMap parser [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] was run over the corpus to
identify medical terms from the UMLS vocabulary [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Distant supervision [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
was used to select sentences with pairs of terms that are linked in UMLS by
one of our chosen seed medical relations. The intuition of distant supervision is
that since we know the terms are related, and they are in the same sentence,
it is more likely that the sentence expresses a relation between them. The seed
relations were restricted to a set of eleven UMLS relations important for clinical
decision making [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] (see Tab.1). All of the data that we have used is available
online at: http://data.crowdtruth.org/medical-relex.
        </p>
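        <p>For illustration only, the following is a minimal sketch of the distant supervision selection step, assuming hypothetical helpers extract_umls_terms (standing in for MetaMap output) and umls_relation (a UMLS relation lookup); it is not the pipeline used in the paper, and the seed-relation list is a toy subset.</p>
        <preformat>
from itertools import combinations

# Illustrative subset of seed relations; the paper restricts to eleven UMLS relations (Tab. 1).
SEED_RELATIONS = {"cause", "treat", "prevent"}

def select_candidates(sentences, extract_umls_terms, umls_relation):
    """Return (sentence, term1, term2, relation) tuples for sentences containing
    a pair of UMLS terms linked by one of the seed relations."""
    candidates = []
    for sent in sentences:
        terms = extract_umls_terms(sent)        # UMLS terms found in the sentence
        for t1, t2 in combinations(terms, 2):
            rel = umls_relation(t1, t2)         # seed relation linking the pair, or None
            if rel in SEED_RELATIONS:
                candidates.append((sent, t1, t2, rel))
    return candidates
        </preformat>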
        <p>For collecting annotations from medical experts, we employed medical
students in their third year at American universities who had just taken the United
States Medical Licensing Examination (USMLE) and were waiting for their
results. Each sentence was annotated by exactly one person. The annotation task
consisted of deciding whether or not the UMLS seed relation discovered by
distant supervision is present in the sentence for the two selected terms.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Crowdsourcing setup</title>
        <p>The crowdsourced annotation is performed in a workflow of three tasks (Fig.1).
The sentences were pre-processed to determine whether the terms found with
distant supervision are complete or not. Identifying complete medical terms is
difficult, and the automated method left a number of terms incomplete,
which was a significant source of error for the crowd in subsequent stages, so the
incomplete terms were sent through a crowdsourcing task (FactSpan) in order
to get the full word span of the medical terms. Next, the sentences with the
corrected term spans were sent to a relation extraction task (RelEx), where the
crowd was asked to decide which relation holds between the two extracted terms.
We also added four new relations (e.g. associated with) to account for weaker,
more general links between the terms (see Tab.1). The workers were able to
read the definition of each relation, and could choose any number of relations
per sentence. There were options for the cases when the terms were related,
but not by any of the relations we provided (other), and for no relation between the terms
(none). Finally, the results from RelEx were passed to another crowdsourcing
task (RelDir) to determine the direction of the relation with regard to the two
extracted terms. FactSpan and RelDir were added to the basic RelEx task to
correct the most common sources of errors from the crowd.</p>
        <p>
          All three crowdsourcing tasks were run on the CrowdFlower platform (http://crowdflower.com) with
10-15 workers per sentence, to allow for a distribution of perspectives. Even
with three tasks and 10-15 workers per sentence, compared to a single expert
judgment per sentence, the total cost of the crowd amounted to two-thirds of the sum
paid for the experts. In our case, cost was not the limiting factor for the experts,
but their time and availability.
For each crowdsourcing task in the workflow, the crowd output was processed
with our metrics, a set of general-purpose crowdsourcing metrics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. These
metrics attempt to model the crowdsourcing process based on the triangle of
reference [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], with the vertices being the input sentence, the worker, and the
target relations. Our theory is that ambiguity and disagreement at any of the
vertices (e.g. a sentence with unclear meaning, a poor quality worker, or an
unclear relation) will propagate in the system, in uencing the other components.
For example, a worker who annotates an unclear sentence is more likely to
disagree with the other workers, and this can impact that worker's quality. A low
quality worker is more likely to disagree with the other workers, and this can
impact the apparent quality of the sentence. If one of the target relations is
itself ambiguous, it will be di cult to identify and will generate disagreement that
may have nothing to do with the quality of sentences or workers. Our metrics
account for this by isolating the signals from the workers, sentences, and the
target relations, and more accurately evaluating each. In previous work we have
validated this premise in several empirical studies [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>In this paper we focus specifically on sentence quality, to evaluate our claim
that low quality sentences are difficult to annotate, and likewise difficult for
machines to process. (Example sentences: Sent.1: Renal osteodystrophy is a general complication of chronic renal failure and
end stage renal disease. Sent.2: If TB is a concern, a PPD is performed.)
To measure this effect, we begin with a simple representation
of the crowd output from the RelEx task:
- annotation vector: the annotations of one worker for one sentence. For each
worker i, their solution to a task on a sentence s is the vector W_{s,i}. If the
worker selects a relation, its corresponding component is marked with
'1', and '0' otherwise. For instance, in the case of RelEx, the vector has
fourteen components, one for each relation plus none and other.
- sentence vector: for every sentence s, we sum the annotation vectors of all
workers on the given task: V_s = \sum_i W_{s,i}.</p>
        <p>The sentence vector is a simple representation of the annotations on a
sentence, and leads to the sentence-relation score, which measures, for each relation,
the degree to which a sentence vector diverges from perfect agreement on that
relation. It is simply the cosine similarity between the sentence vector and the
unit vector for the relation: srs(s, r) = cos(V_s, \hat{r}). The higher the value of this
metric, the more clearly the relation is expressed in the sentence. The purpose
of the experiments is to provide evidence that the srs is measuring the clarity,
or inversely the ambiguity, of a sentence with respect to a particular relation,
and that sentences with low scores present difficulty for the crowd, experts, and
machines alike.</p>
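        <p>As a concrete illustration, the sketch below (our own toy code, not the CrowdTruth implementation; the option list is shortened for readability, whereas the actual RelEx task had fourteen options) computes the sentence vector and the sentence-relation score with numpy.</p>
        <preformat>
import numpy as np

# Toy annotation option list for illustration only; the actual RelEx task used
# fourteen options (the target relations plus "none" and "other").
OPTIONS = ["cause", "treat", "symptom", "associated with", "none", "other"]

def annotation_vector(selected):
    """Binary vector W_{s,i}: 1 for every option the worker selected, 0 otherwise."""
    return np.array([1.0 if o in selected else 0.0 for o in OPTIONS])

def sentence_vector(worker_vectors):
    """V_s: the sum of all workers' annotation vectors for sentence s."""
    return np.sum(worker_vectors, axis=0)

def sentence_relation_score(v_s, relation):
    """srs(s, r): cosine between V_s and the unit vector of the given relation."""
    r_hat = annotation_vector({relation})
    return float(np.dot(v_s, r_hat) / (np.linalg.norm(v_s) * np.linalg.norm(r_hat)))

# Example: of four workers, three select "cause" and one selects "associated with".
workers = [annotation_vector({"cause"})] * 3 + [annotation_vector({"associated with"})]
v_s = sentence_vector(workers)
print(round(sentence_relation_score(v_s, "cause"), 3))  # ~0.949: the sentence clearly expresses cause
        </preformat>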
        <p>
          We use a two-step process to eliminate low-quality worker annotations. We
run the sentence metrics and filter out sentences whose quality score is one
standard deviation below the mean, then we run our worker metrics [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on the
remaining sentences and filter out all workers below a trained threshold. The
purpose of the first step is to ensure the worker quality scores are not adversely
impacted by confusing sentences. We then remove all low quality worker annotations
and re-evaluate the sentence metrics on all sentences.
        </p>
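        <p>A minimal sketch of this two-step filtering, assuming each annotation is a dict with "sentence" and "worker" keys and that a worker-metric helper is supplied, could look as follows (illustrative only).</p>
        <preformat>
import numpy as np

def filter_low_quality(sentence_quality, worker_quality_fn, annotations, worker_threshold):
    """Two-step filtering: drop confusing sentences before scoring workers,
    then drop low-quality workers everywhere and re-score the sentences.
    sentence_quality: dict sentence_id -> quality score
    worker_quality_fn: hypothetical helper computing dict worker_id -> quality
    annotations: list of dicts with "sentence" and "worker" keys"""
    scores = np.array(list(sentence_quality.values()))
    cutoff = scores.mean() - scores.std()
    # Step 1: keep only sentences at or above one standard deviation below the mean
    kept_sentences = {s for s, q in sentence_quality.items() if q >= cutoff}
    step1 = [a for a in annotations if a["sentence"] in kept_sentences]
    # Step 2: score workers on the remaining sentences, drop those below the threshold
    worker_scores = worker_quality_fn(step1)
    good_workers = {w for w, q in worker_scores.items() if q >= worker_threshold}
    # Remove every annotation made by a low-quality worker (on all sentences);
    # the sentence metrics are then re-computed on what remains.
    return [a for a in annotations if a["worker"] in good_workers]
        </preformat>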
      </sec>
      <sec id="sec-3-3">
        <title>Training the model</title>
        <p>
          At the highest level, our research goal is to investigate crowdsourcing as a way to
gather human annotated data for training and evaluating cognitive systems. In
these experiments we were specifically gathering annotated data for a
sentence-level relation extraction classifier [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. This classifier is trained per individual
relation, by feeding it both positive and negative examples. It offers support for
both discrete labels and real values for weighting the confidence of the training
data entries, with positive values in (0, 1], and negative values in [-1, 0).
        </p>
        <p>To test our approach, we gathered four annotated data sets and trained
classifier models for the cause relation using five-fold cross-validation over the
902 sentences:
1. baseline: Discrete (positive or negative) labels are given for each sentence by
the distant supervision method; for any relation, a positive example is a
sentence containing two terms related by cause in UMLS. Distant supervision
does not extract negative examples, so in order to generate a negative set
for one relation, we use positive examples for the other (non-overlapping)
relations shown in Tab. 1.
2. expert: Discrete labels based on an expert's judgment as to whether the
baseline label is correct. The experts do not generate judgments for all
combinations of sentences and relations; for each sentence, the annotator decides on
the seed relation extracted with distant supervision. We reuse positive
examples from the other relations to extend the number of negative examples.
3. single: Discrete labels for every sentence are taken from one randomly
selected crowd worker who annotated the sentence. This data simulates the
traditional single annotator setting.
4. crowd: Weighted labels for every sentence are based on the CrowdTruth
sentence-relation score. The classifier expects positive scores for positive
examples, and negative scores for negative, so the sentence-relation scores must
be re-scaled. An important variable in the re-scaling is the threshold for selecting
positive and negative examples. The Results section compares the
performance of the crowd at different threshold values. Given a threshold, the
sentence-relation score is then linearly re-scaled into the [0.85, 1] interval
for the positive label weight, and the [-1, -0.85] interval for negative (see
below; a sketch of this mapping follows the list). An example of how the scores were processed is given in Tab.2.</p>
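        <p>The exact re-scaling function is not spelled out in the text beyond the target intervals, so the following is a sketch of one possible linear mapping of the sentence-relation score to a signed training weight, under that assumption.</p>
        <preformat>
def rescale_srs(srs, threshold):
    """Map a sentence-relation score in [0, 1] to a signed training weight:
    scores above the threshold go linearly into [0.85, 1] (positive examples),
    scores at or below it go linearly into [-1, -0.85] (negative examples)."""
    if srs > threshold:
        frac = (srs - threshold) / (1.0 - threshold)  # position inside the positive band
        return 0.85 + 0.15 * frac
    frac = (threshold - srs) / threshold              # position inside the negative band
    return -0.85 - 0.15 * frac

# With a 0.5 threshold: rescale_srs(0.9, 0.5) -> 0.97, rescale_srs(0.1, 0.5) -> -0.97
        </preformat>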
        <p>
          In order to directly compare the expert to the crowd annotations, it was
necessary to annotate precisely the same sentences using each method, and train
the classifier on each set. The limitation on batch size came from the availability
of our experts: we were only able to use them for 902 sentences. In a batch this
small, we found that the sentence-relation score, which ranged over [0, 1] and
rarely assigned a weight of 1, diluted the positive signal too much in comparison
to the expert scores, which were simply 0 or 1. We experimented, on a different
data set, with rescaling the scores and selected the range that yielded the highest
quality score, specified above.
For a meaningful comparison between the crowd and expert models,
we verified the sentences to provide a ground truth: a discrete positive or
negative label on each sentence used in evaluation (for training, only the scores
from the respective data set were used). While the main purpose of this work
is to move beyond discrete labels for truth, we needed a reference standard to
establish that our approach is at least as good as the accepted practice. To
produce this reference standard, we first selected the positive/negative threshold for
the sentence-relation score in the crowd dataset that yielded the highest agreement
between the crowd and the experts, and then accepted all 755 sentences where
the experts and crowd agreed as true positives. The remaining sentences were
manually evaluated and assigned either a positive, negative, or ambiguous value.
The ambiguous cases were subsequently removed, resulting in 902 sentences. In
this way we created reliable, unbiased test scores, to be used in the evaluation
of the models. In some sense, removing the ambiguous cases penalizes our
approach, which is designed specifically to help deal with them, but again we want
to first establish that our approach is at least as good as accepted practice.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Preliminary experiments</title>
        <p>
          As reported in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and summarized here, we compared each of the four datasets
to our vetted reference standard, to determine the quality of the cause relation
annotations, as shown in Fig.2. As expected, the baseline data was the
lowest quality, followed closely by the single crowd worker. The expert annotations
achieved an F1 score of 0.844. Since the baseline, expert, and single sets are
binary decisions, they appear as horizontal lines. For the crowd annotations,
we plotted the F1 against different sentence-relation score thresholds for
determining positive and negative sentences. Between the thresholds of 0.6 and 0.8,
the crowd out-performs the expert, reaching a maximum F1 score of 0.907 at
a threshold of 0.7. This difference is significant with p = 0.007, measured with
McNemar's test [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
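        <p>For reference, significance between two paired sets of binary decisions can be computed with McNemar's test, e.g. as in the sketch below using statsmodels (our illustration, not the authors' code).</p>
        <preformat>
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(pred_a, pred_b, truth):
    """p-value of McNemar's test comparing two sets of binary decisions
    against the same reference labels, paired by sentence."""
    a_ok = np.asarray(pred_a) == np.asarray(truth)
    b_ok = np.asarray(pred_b) == np.asarray(truth)
    not_a, not_b = np.logical_not(a_ok), np.logical_not(b_ok)
    # 2x2 table of counts: rows = A correct / A wrong, columns = B correct / B wrong
    table = [[int(np.logical_and(a_ok, b_ok).sum()), int(np.logical_and(a_ok, not_b).sum())],
             [int(np.logical_and(not_a, b_ok).sum()), int(np.logical_and(not_a, not_b).sum())]]
    return mcnemar(table, exact=True).pvalue
        </preformat>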
        <p>
          We next wanted to verify that this improvement in annotation quality has a
positive impact on the model that is trained with this data. In a cross-validation
experiment, we trained the model with each of the four datasets for
identifying the cause relation (discussed in more detail in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]). The results of the
evaluation (Fig.3) show the best performance for the crowd model when the
sentence-relation threshold for deciding between negative/positive equals 0.5.
Trained with this data, the classifier model achieves an F1 score of 0.642,
compared to the expert-trained model, which reaches 0.638. McNemar's test shows
statistical significance with p = 0.016. This result demonstrates that the crowd
provides training data that is at least as good as, if not better than, that of the experts.
We believe the discrete notion of truth is obsolete and should be replaced by
something more flexible. For the purposes of semantic interpretation tasks for
which crowdsourcing is appropriate, we propose our annotation-level metrics as
a suitable replacement. In this case, the sentence-relation score gives a
real-valued score that measures the degree to which a particular sentence expresses a
particular relation between two terms. We believe the preliminary experiments
demonstrate the approach is sound. Our primary results evaluate the
sentence-relation score as a measure of the clarity with which a sentence expresses the
relation. To this end, we define the following metrics, where w_s is the weight of
sentence s, given by its sentence-relation score, and tp(s), fp(s), fn(s) indicate whether
sentence s is a true positive, false positive, or false negative:
- weighted precision: where normally p = tp / (tp + fp), the weighted precision is
p' = \sum_s w_s tp(s) / \sum_s w_s (tp(s) + fp(s)).
- weighted recall: where normally r = tp / (tp + fn), the weighted recall is
r' = \sum_s w_s tp(s) / \sum_s w_s (tp(s) + fn(s)).
- weighted F-measure: the harmonic mean of weighted precision and recall:
F1' = 2 p' r' / (p' + r').</p>
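        <p>A straightforward sketch of these weighted metrics (our formulation; per-sentence outcomes and weights are assumed to be given) is shown below.</p>
        <preformat>
def weighted_prf(examples):
    """Weighted precision, recall, and F1.
    examples: list of (w, outcome) pairs, where w is the sentence weight
    (the sentence-relation score) and outcome is one of "tp", "fp", "fn", "tn"."""
    wtp = sum(w for w, o in examples if o == "tp")
    wfp = sum(w for w, o in examples if o == "fp")
    wfn = sum(w for w, o in examples if o == "fn")
    p = wtp / (wtp + wfp) if (wtp + wfp) else 0.0
    r = wtp / (wtp + wfn) if (wtp + wfn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# e.g. weighted_prf([(0.9, "tp"), (0.2, "fp"), (0.8, "tp"), (0.3, "fn")])
        </preformat>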
        <p>If the srs metric is a true measure of clarity, then we would expect low clarity
sentences to be more likely to be wrong, and high clarity
sentences less so, and this should be revealed in an overall increase of the weighted scores
over the unweighted. In Tab. 3, we show a comparison of five data sets. In the first
two columns, the annotation quality of each data set is shown, comparing the F1
to the weighted F1'. The F1' scores are higher in all cases, revealing that human
annotators are indeed having trouble correctly annotating low clarity sentences. The
baseline scores are the least affected by the weighting, which also fits with our
intuition, since the baseline does not use human judgment at all.</p>
        <p>The next six columns in each row show classifier performance when trained
on that dataset. The first pair of columns compares F1 to F1', and for interest
the final four columns show the precision and recall. In all cases the classifier
F1' is greater than F1, indicating that, as with humans, machines have trouble
correctly interpreting sentences with a low srs. The only weighted metric that
does not increase is the baseline recall; again, this is justified, as the baseline does
not actually require any interpretation.</p>
        <p>In Fig. 4 we show how the classifier performs across the possible
thresholds; the weighted scores are consistently higher.</p>
        <p>
          We also analyzed the data to understand the overlap between the crowd
scores and the experts. In Fig.5 we compared the frequency of sentences with
cause annotations at different sentence-relation scores (measured with kernel
density estimation [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]) to the expert annotations of the same sentences. The
result shows high agreement between the crowd and the expert: a low
sentence-relation score is highly correlated with a negative expert decision, and a high
score is highly correlated with a positive expert decision. In Fig.6 we show the
number of sentences on which the crowd agrees with the expert (on both positive
and negative decisions), plotted against different positive/negative thresholds for
the sentence-relation score of cause. The maximum agreement with the expert
set is at the 0.7 threshold, the same as for the annotation quality F1 score (Fig.2),
with 755 sentences in common between crowd and expert. The remaining 147
sentences were manually evaluated to build the test partition.
        </p>
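        <p>The kernel density comparison in Fig.5 can be reproduced in spirit with a sketch like the one below (scipy's gaussian_kde; an illustration, not the original analysis script).</p>
        <preformat>
import numpy as np
from scipy.stats import gaussian_kde

def srs_density_by_expert_label(srs_scores, expert_positive):
    """Kernel density estimates of sentence-relation scores, split by the
    expert's positive/negative decision on the same sentences."""
    srs_scores = np.asarray(srs_scores, dtype=float)
    expert_positive = np.asarray(expert_positive, dtype=bool)
    grid = np.linspace(0.0, 1.0, 101)
    pos_density = gaussian_kde(srs_scores[expert_positive])(grid)
    neg_density = gaussian_kde(srs_scores[~expert_positive])(grid)
    return grid, pos_density, neg_density
        </preformat>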
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>A widespread use of linked data for information extraction is distant supervision,
in which relation tuples from a data source are found in sentences in a text
corpus, and those sentences are treated as training data for relation extraction
systems. Distant supervision is a cheap way to acquire training data, but that
data can be quite noisy, which limits the performance of a system trained with
it. Human annotators can be used to clean the data, but in some domains,
such as medical NLP, it is widely believed that only medical experts can do
this reliably. Current methods for collecting this human annotation attempt to
minimize disagreement between annotators, but end up failing to capture the
ambiguity inherent in language. We believe this is a vestige of an antiquated
notion of truth being a discrete property, and have developed a powerful new
method for representing truth.</p>
      <p>In this paper we have presented results showing that using a larger number
of workers per example (up to 15) can form a more accurate model of truth
at the sentence level and significantly improve the quality of the annotations.
It also benefits systems that use this annotated data, such as machine learning
systems, significantly improving their performance with higher quality data. Our
primary result is to show that our scoring metric for sentence quality in relation
extraction supports our hypothesis that higher quality sentences are easier to
classify for crowd workers, experts, and machines alike, and that our model of truth
allows us to more faithfully capture the ambiguity that is inherent in language
and human interpretation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors would like to thank Chang Wang for support with using the medical
relation extraction classifier, and Anthony Levas for help with collecting the
expert annotations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aronson</surname>
            ,
            <given-names>A.R.:</given-names>
          </string-name>
          <article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
          .
          <source>In: Proceedings of the AMIA Symposium</source>
          . p.
          <fpage>17</fpage>
          . American Medical Informatics Association (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Crowd Truth: harnessing disagreement in crowdsourcing a relation extraction gold standard</article-title>
          .
          <source>Web Science</source>
          <year>2013</year>
          . ACM (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The Three Sides of CrowdTruth</article-title>
          .
          <source>Journal of Human Computation</source>
          <volume>1</volume>
          ,
          <issue>31</issue>
          {
          <fpage>34</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Truth is a lie: Crowd truth and the seven myths of human annotation</article-title>
          .
          <source>AI Magazine</source>
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <volume>15</volume>
          {
          <fpage>24</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The unified medical language system (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic acids research 32(suppl 1)</source>
          ,
          <source>D267{D270</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nadkarni</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirschman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Avolio</surname>
            ,
            <given-names>L.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>18</volume>
          (
          <issue>5</issue>
          ),
          <volume>540</volume>
          {
          <fpage>543</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cheatham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Conference v2.0: An uncertain version of the OAEI Conference benchmark</article-title>
          .
          <source>In: The Semantic Web{ISWC</source>
          <year>2014</year>
          , pp.
          <volume>33</volume>
          {
          <fpage>48</fpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolan</surname>
          </string-name>
          , W.B.:
          <article-title>Building a persistent workforce on mechanical turk for multilingual data collection</article-title>
          .
          <source>In: Proceedings of The 3rd Human Computation Workshop (HCOMP</source>
          <year>2011</year>
          )
          <article-title>(</article-title>
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chilton</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edge</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landay</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Cascade: crowdsourcing taxonomy creation</article-title>
          .
          <source>In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</source>
          . pp.
          <year>1999</year>
          {
          <year>2008</year>
          . CHI '13,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dumitrache</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Achieving expert-level annotation quality with CrowdTruth: the case of medical relation extraction</article-title>
          .
          <source>In: Proceedings of the 2015 International Workshop on Biomedical Data Mining</source>
          , Modeling, and Semantic Integration (
          <fpage>BDM2I</fpage>
          -2015), 14th International Semantic Web Conference (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murnane</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karandikar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Keller, N.,
          <string-name>
            <surname>Martineau</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Annotating named entities in Twitter data with crowdsourcing</article-title>
          .
          <source>In: In Proc. NAACL HLT</source>
          . pp.
          <volume>80</volume>
          {
          <fpage>88</fpage>
          . CSLDAMT '
          <volume>10</volume>
          ,
          <string-name>
            <surname>Association for Computational Linguistics</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plank</surname>
          </string-name>
          , B., Søgaard, A.:
          <article-title>Experiments with crowdsourced re-annotation of a POS tagging data set</article-title>
          .
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <volume>377</volume>
          {
          <fpage>382</fpage>
          . Association for Computational Linguistics, Baltimore, Maryland (
          <year>June 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Knowlton</surname>
            ,
            <given-names>J.Q.</given-names>
          </string-name>
          :
          <article-title>On the definition of "picture"</article-title>
          .
          <source>AV Communication Review</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <volume>157</volume>
          {
          <fpage>183</fpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kondreddi</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Triantafillou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          , G.:
          <article-title>Combining information extraction and human computing for crowdsourced knowledge acquisition</article-title>
          .
          <source>In: 30th International Conference on Data Engineering</source>
          . pp.
          <volume>988</volume>
          {
          <fpage>999</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>Y.r.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hwang</surname>
            ,
            <given-names>S.w.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Hybrid entity clustering using crowds and data</article-title>
          .
          <source>The VLDB Journal</source>
          <volume>22</volume>
          (
          <issue>5</issue>
          ),
          <volume>711</volume>
          {
          <fpage>726</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>McNemar</surname>
            ,
            <given-names>Q.:</given-names>
          </string-name>
          <article-title>Note on the sampling error of the difference between correlated proportions or percentages</article-title>
          .
          <source>Psychometrika</source>
          <volume>12</volume>
          (
          <issue>2</issue>
          ),
          <volume>153</volume>
          {
          <fpage>157</fpage>
          (
          <year>1947</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mintz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bills</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snow</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distant supervision for relation extraction without labeled data</article-title>
          .
          <source>In: Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume</source>
          <volume>2</volume>
          . pp.
          <volume>1003</volume>
          {
          <fpage>1011</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ogden</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richards</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The meaning of meaning</article-title>
          .
          <source>Trubner &amp; Co</source>
          , London (
          <year>1923</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Plank</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          , D., Søgaard, A.:
          <article-title>Linguistically debatable or just plain wrong?</article-title>
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <volume>507</volume>
          {
          <fpage>511</fpage>
          . Association for Computational Linguistics, Baltimore, Maryland (
          <year>June 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Silverman</surname>
            ,
            <given-names>B.W.</given-names>
          </string-name>
          :
          <article-title>Density estimation for statistics and data analysis</article-title>
          ,
          <source>vol. 26</source>
          . CRC press (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , J.:
          <article-title>Medical relation extraction with manifold models</article-title>
          .
          <source>In: 52nd Annual Meeting of the ACL</source>
          , vol.
          <volume>1</volume>
          . pp.
          <volume>828</volume>
          {
          <fpage>838</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>