Achieving Expert-Level Annotation Quality with CrowdTruth: The Case of Medical Relation Extraction

Anca Dumitrache (1,2), Lora Aroyo (1), and Chris Welty (3)
(1) VU University Amsterdam, Netherlands, {anca.dumitrache,lora.aroyo}@vu.nl
(2) IBM CAS, Amsterdam, Netherlands
(3) Google Research, New York, USA, cawelty@gmail.com

Abstract. The lack of annotated datasets for training and benchmarking is one of the main challenges of Clinical Natural Language Processing. In addition, current methods for collecting annotations attempt to minimize disagreement between annotators, and therefore fail to model the ambiguity inherent in language. We propose the CrowdTruth method for collecting medical ground truth through crowdsourcing, based on the observation that disagreement between annotators can be used to capture ambiguity in text. In this work, we report on using this method to build a ground truth for medical relation extraction, and on how it performed in training a classification model. Our results show that, with appropriate processing, the crowd performs just as well as medical experts in terms of the quality and efficacy of annotations. Furthermore, we show that the general practice of employing a small number of annotators for collecting ground truth is faulty, and that more annotators per sentence are needed to get the highest quality annotations.

1 Introduction

Clinical Natural Language Processing (NLP) has become an invaluable tool for navigating and processing medical data [19]. Clinical NLP relies on the development of a set of gold standard annotations, or ground truth, for the purpose of training, testing and evaluation. Ground truth is usually collected by humans reading text and following a set of guidelines to ensure a uniform understanding of the annotation task. In the medical domain, domain knowledge is usually believed to be required from annotators, making the process of acquiring ground truth more difficult. The lack of annotated datasets for training and benchmarking is considered one of the big challenges of Clinical NLP [8]. Furthermore, the process behind acquiring ground truth often presents flaws [5].

It is assumed that the gold standard represents a universal and reliable model for language. However, previous experiments we performed in medical relation extraction [2] identified two issues with this assumption: (1) disagreement between annotators is usually eliminated through overly prescriptive annotation guidelines, thus creating artificial data that is neither general nor reflective of the ambiguity inherent in natural language, and (2) the process of acquiring ground truth by working exclusively with domain experts is costly and non-scalable, both in terms of time and money.

A possible solution to these issues is to use crowdsourcing to collect the ground truth. Not only is this a much faster and cheaper procedure than expert annotation, it also allows for collecting enough annotations per task to represent the diversity inherent in language. Crowd workers, however, generally lack medical expertise, which might impact the quality and reliability of their work in more knowledge-intensive tasks. Previously, we studied medical relation extraction on a relatively small set of 90 sentences [3], comparing the results from the crowd with those of two expert medical annotators. We found that disagreement within the crowd is consistent with expert inter-annotator disagreement.
Furthermore, sentences that registered high disagreement tended to be vague or ambiguous when manually evaluated.

Our approach, called CrowdTruth, can overcome the limitations of gathering expert ground truth by using disagreement analysis on crowd annotations to model the ambiguity inherent in medical text. Furthermore, we claim that, even for complex annotation tasks such as relation extraction, the crowd’s lack of medical expertise is compensated by collecting a large enough set of annotations. We prove this in two ways: by manually judging the quality of the annotations provided by experts and the crowd, and, more importantly, by training a model for medical relation extraction with both CrowdTruth data and ground truth from medical experts, and comparing them in a cross-validation experiment.

In this paper, we make the following contributions: (1) a comparison of the quality and efficacy of annotations for medical relation extraction provided by both the crowd and medical experts, showing that, with appropriate processing, crowd annotations are equivalent to those of experts; (2) an openly available dataset of 900 English sentences for medical relation extraction, centering primarily on the cause relation, annotated both by medical experts and by the crowd with disagreement analysis; (3) an analysis of the optimal crowd settings for medical relation extraction, showing that 10 workers per sentence yields the highest quality annotations.

2 Related Work

There exists some research on using crowdsourcing to collect semantic data for the medical domain. [18] use crowdsourcing to verify relation hierarchies in biomedical ontologies. On 14 relations from the SNOMED CT CORE Problem List Subset, the authors report the crowd’s accuracy at 85% for identifying whether the relations were correct or not. In the field of Biomedical NLP, [7] used crowdsourcing to extract gene-mutation relations in Medical Literature Analysis and Retrieval System Online (MEDLINE) abstracts. Focusing on a very specific gene-mutation domain, the authors report a weighted accuracy of 82% over a corpus of 250 MEDLINE abstracts. Both of these approaches present preliminary results from experiments performed with small datasets.

To our knowledge, the most extensive study of medical crowdsourcing was performed by [23], who describe a method for crowdsourcing a ground truth for medical named entity recognition and entity linking. On a dataset of over 1,000 clinical trials, the authors show no statistically significant difference between the crowd and the expert-generated gold standard for the task of extracting medications and their attributes. We extend these results by applying crowdsourcing to the more complex task of medical relation extraction, which prima facie seems to require more domain expertise than named entity recognition. Furthermore, we test the viability of crowdsourced ground truth for relation extraction.

Crowdsourcing ground truth has shown promising results in a variety of other domains. [13] compared the crowd versus experts for the task of part-of-speech tagging. The authors also show that models trained on crowdsourced annotations can perform just as well as expert-trained models. [15] studied crowdsourcing for relation extraction in the general domain, comparing its efficiency to that of fully automated information extraction approaches.
Their results showed the crowd was especially suited to identifying subtle formulations of relations that do not appear frequently enough to be picked up by statistical methods.

Other research on crowdsourcing ground truth includes: entity clustering and disambiguation [16], Twitter entity extraction [12], multilingual entity extraction and paraphrasing [9], and taxonomy creation [10]. However, all of these approaches rely on the assumption that one black-and-white gold standard must exist for every task. Disagreement between annotators is discarded by picking one answer that reflects some consensus, usually through majority vote. The number of annotators per task is also kept low, between two and five workers, also in the interest of eliminating disagreement. The novelty in our approach is to consider language ambiguity, and consequently inter-annotator disagreement, as an inherent feature of language. The metrics we employ for determining the quality of crowd answers are specifically tailored to quantify disagreement between annotators, rather than eliminate it.

3 Experimental Setup

In order to compare expert and crowdsourced gold standards, we set up an experiment to train and evaluate a sentence-level relation classifier. The classifier takes as input a sentence and two terms from that sentence, and returns a score reflecting the likelihood that a specific relation, in our case the cause relation between symptoms and disorders, is expressed in the sentence between the two terms. Starting from a set of 902 sentences that are likely to contain medical relations, we constructed a workflow for collecting annotations through crowdsourcing. This output was analyzed with the CrowdTruth metrics for capturing disagreement, and then used to train a model for relation extraction. In parallel, we also trained a model using a traditional gold standard acquired from domain experts, which we then compare to the crowd model.

3.1 Data

The dataset used in our experiments contains 902 medical sentences extracted from PubMed article abstracts. The MetaMap parser [1] was run over the corpus to identify medical terms from the UMLS vocabulary [6]. Distant supervision [17] was used to select sentences with pairs of terms that are linked in UMLS by one of our chosen seed medical relations. The intuition behind distant supervision is that, since we know the two terms are related and they appear in the same sentence, that sentence is more likely to express a relation between them than a random sentence would be. The seed relations were restricted to a set of eleven UMLS relations important for clinical decision making [22] (listed in Tab. 1). Given a relation, each sentence in the dataset can be either positive (i.e. the relation is expressed between the two terms in the sentence) or negative (i.e. the relation is not expressed). All of the data that we have used is available online at: http://data.crowdtruth.org/medical-relex.

For collecting annotations from medical experts, we employed medical students in their third year at American universities who had just taken the United States Medical Licensing Examination (USMLE) and were waiting for their results. Each sentence was annotated by exactly one person. The annotation task consisted of deciding whether or not the UMLS seed relation discovered by distant supervision is present in the sentence for the two selected terms.
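To make the distant supervision step concrete, the snippet below sketches candidate-sentence selection, assuming the corpus has already been tagged with UMLS concepts by MetaMap. The concept identifiers, the seed pair list, and all function names are illustrative placeholders, not part of the actual pipeline.

```python
# Illustrative sketch of distant supervision over MetaMap-tagged sentences.
# The concept identifiers and seed pairs are hypothetical; a real pipeline would
# use UMLS CUIs and term pairs extracted from the eleven UMLS seed relations.
from typing import List, NamedTuple

class TaggedSentence(NamedTuple):
    text: str
    concepts: List[str]  # UMLS concepts found in the sentence by MetaMap

# Pairs of concepts linked in UMLS by one of the seed relations.
SEED_PAIRS = {
    ("CONCEPT_chronic_renal_failure", "CONCEPT_renal_osteodystrophy"): "cause",
}

def candidate_sentences(sentences: List[TaggedSentence]):
    """Yield (sentence, term pair, seed relation) triples whenever both terms of
    a UMLS-related pair co-occur in the same sentence."""
    for sentence in sentences:
        for (term1, term2), relation in SEED_PAIRS.items():
            if term1 in sentence.concepts and term2 in sentence.concepts:
                yield sentence, (term1, term2), relation

corpus = [TaggedSentence(
    "Renal osteodystrophy is a general complication of chronic renal failure.",
    ["CONCEPT_chronic_renal_failure", "CONCEPT_renal_osteodystrophy"])]
for sentence, pair, relation in candidate_sentences(corpus):
    print(relation, pair, sentence.text)  # each triple becomes a candidate for annotation
```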
3.2 Crowdsourcing setup

The crowdsourced annotation setup is based on our previous medical relation extraction work [4], adapted into a workflow of three tasks (Fig. 1). First, the sentences were pre-processed using a named entity recognition tool combining the UMLS vocabulary with lexical parsing, to determine whether the terms found with distant supervision are complete or not. The incomplete terms were passed through a crowdsourcing task (FactSpan) in order to get the full word span of the medical terms. Next, the sentences with the corrected term spans were sent to a relation extraction task (RelEx), where the crowd was asked to decide which relation holds between the two extracted terms.

Fig. 1: CrowdTruth workflow for medical relation extraction on CrowdFlower [11].

To simplify the task for the crowd, we combined the UMLS relations from distant supervision, merging relations with similar meanings (e.g. disease has primary anatomic site and has finding site). We also added four new relations (e.g. associated with) to account for weaker, more general links between the terms. The full set of relations presented to the crowd is given in Tab. 1. The workers were also able to read the definition of each relation. The task was multiple choice, and workers were able to select more than one relation at a time. Options were also available for the case when the medical relation was other than the ones we provided (other), and for when there was no relation between the terms (none). Finally, the results from RelEx were passed to another crowdsourcing task (RelDir) to determine the direction of the relation with regard to the two extracted terms.

Table 1: Set of medical relations.

Relation | Corresponding UMLS relation(s) | Definition | Example
treat | may treat | therapeutic use of a drug | penicillin treats infection
prevent | may prevent | preventative use of a drug | vitamin C prevents influenza
diagnosis | may diagnose | diagnostic use of an ingredient, test or a drug | RINNE test is used to diagnose hearing loss
cause | cause of; has causative agent | the underlying reason for a symptom or a disease | fever induces dizziness
location | disease has primary anatomic site; has finding site | body part in which disease or disorder is observed | leukemia is found in the circulatory system
symptom | disease has finding; disease may have finding | deviation from normal function indicating the presence of disease or abnormality | pain is a symptom of a broken arm
manifestation | has manifestation | links disorders to the observations that are closely associated with them | abdominal distention is a manifestation of liver failure
contraindicate | contraindicated drug | a condition for which a drug or treatment should not be used | patients with obesity should avoid using danazol
associated with | – | signs, symptoms or findings that often appear together | patients who smoke often have yellow teeth
side effect | – | a secondary condition or symptom that results from a drug | use of antidepressants causes dryness in the eyes
is a | – | a relation that indicates that one of the terms is a more specific variation of the other | migraine is a kind of headache
part of | – | an anatomical or structural sub-component | the left ventricle is part of the heart

All three crowdsourcing tasks were run on the CrowdFlower platform (http://CrowdFlower.com) with 10-15 workers per sentence, to allow for a distribution of perspectives; the precise settings for each task are given in Tab. 2. Even with three tasks and 10-15 workers per sentence, compared to a single expert judgment per sentence, the total cost of the crowd amounted to 2/3 of the sum paid for the experts. In our case, cost was not the limiting factor for the experts, but their time and availability.

Table 2: CrowdFlower settings for the tasks in the CrowdTruth workflow.

                                          FactSpan   RelEx   RelDir
judgments (i.e. workers per sentence)           10      15       10
pay per sentence annotation (in $)            0.04    0.05     0.01
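As a quick illustration of the relation merging described above, the mapping below restates Tab. 1 as a lookup table. The UMLS relation names are written as plain strings rather than the literal UMLS relationship identifiers, and the variable names are ours, not part of the original setup.

```python
# Crowd-facing relation set from Tab. 1, written as a lookup table (illustrative;
# the spelling of the UMLS relation names is simplified, not the literal UMLS strings).
CROWD_RELATIONS = {
    "treat":          ["may treat"],
    "prevent":        ["may prevent"],
    "diagnosis":      ["may diagnose"],
    "cause":          ["cause of", "has causative agent"],
    "location":       ["disease has primary anatomic site", "has finding site"],
    "symptom":        ["disease has finding", "disease may have finding"],
    "manifestation":  ["has manifestation"],
    "contraindicate": ["contraindicated drug"],
    # the four added relations cover weaker links and have no UMLS seed relation
    "associated with": [],
    "side effect":     [],
    "is a":            [],
    "part of":         [],
}

# The RelEx task adds "other" and "none", for 14 answer options in total.
ANSWER_OPTIONS = list(CROWD_RELATIONS) + ["other", "none"]
```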
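The vector computations above are small enough to sketch directly. The following is a minimal, illustrative implementation of the annotation vector, sentence vector, and sentence-relation score (it is not the CrowdTruth reference implementation), with worker answers invented so as to reproduce Sent. 1 of Tab. 3.

```python
# Minimal sketch of the CrowdTruth vector metrics from Sect. 3.3 (illustrative).
import numpy as np

ANSWER_OPTIONS = ["treat", "prevent", "diagnosis", "cause", "location", "symptom",
                  "manifestation", "contraindicate", "associated with", "side effect",
                  "is a", "part of", "other", "none"]

def annotation_vector(selected):
    """Binary vector W_s,i: 1 for every answer option that worker i selected on sentence s."""
    return np.array([1 if option in selected else 0 for option in ANSWER_OPTIONS])

def sentence_vector(worker_selections):
    """V_s = sum_i W_s,i: add up the annotation vectors of all workers on sentence s."""
    return sum(annotation_vector(selected) for selected in worker_selections)

def sentence_relation_score(v_s, relation):
    """srs(s, r) = cos(V_s, r_hat): cosine similarity between the sentence vector
    and the unit vector of the relation."""
    r_hat = annotation_vector([relation])
    return float(np.dot(v_s, r_hat) / (np.linalg.norm(v_s) * np.linalg.norm(r_hat)))

# Worker answers invented to reproduce Sent. 1 of Tab. 3 (15 workers, 10 chose cause).
workers = ([["cause"]] * 10 + [["diagnosis"], ["location"], ["associated with"]]
           + [["symptom"]] * 2)
v_s = sentence_vector(workers)
print(v_s[ANSWER_OPTIONS.index("cause")])               # 10
print(round(sentence_relation_score(v_s, "cause"), 2))  # ~0.97 (Tab. 3 lists 0.96)
```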
3.4 Training the model

The sentences, together with the relation annotations, were then used to train a manifold model for relation extraction [22]. This model was developed for the medical domain and tested on the relation set that we employ. It is trained per individual relation, by feeding it both positive and negative data. It supports both discrete labels and real values that weight the confidence of the training data entries, with positive values in (0, 1] and negative values in [−1, 0). Using this system, we train several models using five-fold cross validation, in order to compare the performance of the crowd and expert datasets. In total, we use four datasets:

1. baseline: Discrete (positive or negative) labels are given for each sentence by the distant supervision method – for any relation, a positive example is a sentence containing two terms related by cause in UMLS. This dataset constitutes the baseline against which all other datasets are tested. Distant supervision does not extract negative examples, so in order to generate a negative set for one relation, we use positive examples for the other (non-overlapping) relations shown in Tab. 1.

2. expert: Discrete labels based on an expert’s judgment as to whether the baseline label is correct. The experts do not generate judgments for all combinations of sentences and relations – for each sentence, the annotator decides on the seed relation extracted with distant supervision. Similarly to the baseline data, we reuse positive examples from the other relations to extend the number of negative examples.

3. single: Discrete labels for every sentence are taken from one randomly selected crowd worker who annotated the sentence. This data simulates the traditional single-annotator setting.

4. crowd: Weighted labels for every sentence are based on the CrowdTruth sentence-relation score. The classifier expects positive scores for positive examples and negative scores for negative examples, so the sentence-relation scores must be re-scaled. An important variable in the re-scaling is the threshold used to select positive and negative examples; the Results section compares the performance of the crowd at different threshold values. Given a threshold, the sentence-relation score is linearly re-scaled into the [0.85, 1] interval for the positive label weight, and into the [−1, −0.85] interval for the negative label weight. An example of how the scores were processed is given in Tab. 3.

Table 3: Example sentences with scores from the crowd dataset; training scores calculated for a negative/positive sentence-relation threshold of 0.5, with linear rescaling into the [−1, −0.85] interval for negatives and the [0.85, 1] interval for positives.

Sent. 1: Renal osteodystrophy is a general complication of chronic renal failure and end stage renal disease.
Sent. 2: If TB is a concern, a PPD is performed.

                  sentence vector       sentence-relation score   crowd model training score
                  Sent. 1   Sent. 2     Sent. 1   Sent. 2         Sent. 1   Sent. 2
treat                   0         3        0         0.36           -1        -0.89
prevent                 0         1        0         0.12           -1        -0.96
diagnosis               1         7        0.09      0.84           -0.97      0.95
cause                  10         0        0.96      0               0.99     -1
location                1         0        0.09      0              -0.97     -1
symptom                 2         0        0.19      0              -0.94     -1
manifestation           0         0        0         0              -1        -1
contraindicate          0         0        0         0              -1        -1
associated with         1         3        0.09      0.36           -0.97     -0.89
side effect             0         0        0         0              -1        -1
is a                    0         0        0         0              -1        -1
part of                 0         0        0         0              -1        -1
other                   0         1        0         0.12           -1        -0.96
none                    0         0        0         0              -1        -1
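The re-scaling used for the crowd dataset is a simple linear map. The sketch below reproduces the training scores listed in Tab. 3 for a threshold of 0.5; treating a score that falls exactly on the threshold as positive is our assumption, since the boundary case is not specified.

```python
def training_weight(srs, threshold=0.5):
    """Re-scale a sentence-relation score into [0.85, 1] for positive examples and
    [-1, -0.85] for negative examples, relative to the positive/negative threshold.
    Scores exactly at the threshold are treated as positive here (an assumption)."""
    if srs >= threshold:
        # linear map: threshold -> 0.85, maximum score 1.0 -> 1.0
        return 0.85 + 0.15 * (srs - threshold) / (1.0 - threshold)
    # linear map: score 0.0 -> -1.0, threshold -> -0.85
    return -1.0 + 0.15 * srs / threshold

print(round(training_weight(0.96), 2))  # 0.99, the cause weight for Sent. 1 in Tab. 3
print(round(training_weight(0.19), 2))  # -0.94, the symptom weight for Sent. 1
print(round(training_weight(0.84), 2))  # 0.95, the diagnosis weight for Sent. 2
```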
3.5 Evaluation setup

To make a meaningful comparison between the crowd and expert models, the evaluation set needs to be carefully selected. The sentences in the test folds were picked through the cross-validation mechanism, but their scores were taken from our test partition, which we verified to ensure correctness. To build the test partition, we first selected the positive/negative threshold for the sentence-relation score such that the crowd agrees the most with the experts. We assume that, if both the expert and the crowd agree that a sentence is either a positive or a negative example, it can automatically be used as part of the test set. Such a sentence was labeled with the crowd score. In the cases where the crowd and experts disagree, we manually verified and assigned either a positive, negative, or ambiguous value. The ambiguous cases were subsequently removed from the test folds. In this way we created reliable, unbiased test scores to be used in the evaluation of the models.

4 Results

We compared each of the four datasets using the test partition as a gold standard, to determine the quality of the cause relation annotations, as shown in Fig. 2. As expected, the baseline data performed the lowest, followed closely by the single crowd worker. The expert annotations achieved an F1 score of 0.844. Since the baseline, expert, and single sets are binary decisions, they appear as horizontal lines. For the crowd annotations, we plotted the F1 score against different sentence-relation score thresholds for determining positive and negative sentences. Between the thresholds of 0.6 and 0.8, the crowd outperforms the expert, reaching a maximum F1 score of 0.907 at a threshold of 0.7. This difference is significant with p = 0.007, measured with McNemar’s test.

Fig. 2: Annotation quality F1 per neg./pos. threshold for cause.
Fig. 3: Crowd & expert agreement per neg./pos. threshold for cause.
Fig. 4: F1 scores.
Fig. 5: Learning curves (crowd with pos./neg. threshold at 0.5).

In Fig. 3 we show the number of sentences in which the crowd agrees with the expert (on both positive and negative decisions), plotted against different positive/negative thresholds for the sentence-relation score of cause. The maximum agreement with the expert set is at the 0.7 threshold, the same as for the annotation quality F1 score (Fig. 2), with 755 sentences where crowd and expert agree.

We next wanted to verify that this improvement in annotation quality has a positive impact on the model trained with this data. In a cross-validation experiment, we trained the model with each of the four datasets for identifying the cause relation. The results of the evaluation (Fig. 4) show the best performance for the crowd model when the sentence-relation threshold for deciding between negative and positive equals 0.5. Trained with this data, the classifier achieves an F1 score of 0.642, compared to the expert-trained model, which reaches 0.638. McNemar’s test shows statistical significance with p = 0.016. This result demonstrates that the crowd provides training data that is at least as good as, if not better than, that of the experts. In addition, the baseline scores an F1 of 0.575, and the single annotator shows the worst performance, scoring 0.483. The learning curves (Fig. 5) show that, above 400 sentences, the crowd consistently scores over baseline and single in F1 score. After 600 sentences, the crowd also outperforms the experts. The trend of the curve is still upward, indicating that more data is necessary to get the best performance.

Finally, we checked whether the number of workers per task was sufficient to produce a stable sentence-relation score. For the RelEx task, we ensured that each sentence was judged by at least 10 workers after spam removal. The plot of the mean cosine distance between sentence vectors before and after adding the latest worker shows that the sentence vector becomes stable after 10 workers (Fig. 6). Furthermore, the annotation quality F1 score per total number of workers (Fig. 7) is also stable after 10 workers (the drop towards the end is due to sparse data – only 54 sentences had 15 or more total workers).

Fig. 6: Mean cosine distance for sentence vectors before and after adding the latest worker, shown per number of workers.
Fig. 7: Annotation quality F1 for crowd pos./neg. threshold at 0.7, shown per number of workers.

5 Discussion

Our goal was to demonstrate that, like the crowdsourced medical entity recognition work by Zhai et al. [23], the CrowdTruth approach of having multiple annotators with precise quality scores can be harnessed to create gold standard data of a quality that rivals annotated data created by medical experts. Our results show this clearly, in fact with slight improvements, on a sizable dataset (902 sentences) and a problem (relation extraction) that prima facie seems to require more domain expertise. Tab. 4 shows the results in more detail.

The most curious aspect of the results is that the positive/negative sentence-relation score threshold that gives the best quality annotations (Fig. 2) is different from the best threshold for training the model (Fig. 4). It is the lower threshold (equal to 0.5) that gives better classification. This is most likely due to the higher recall of the lower threshold, which exposes the classifier to more positive examples. The F-score is the harmonic mean of precision and recall, and, as this experiment shows, it does not necessarily represent the best trade-off between them for the classifier. In [11], we experimented with a weighted F-score, using the CrowdTruth metrics to account for ambiguity in the sentences. Using this new metric, we found an improved performance for both crowd and expert.

Table 4: Model evaluation results for each dataset.

                                     precision   recall   F1 score   accuracy   max. F1 score
crowd (0.5 sent.-rel. threshold)         0.565    0.743      0.642      0.784           0.659
crowd (0.7 sent.-rel. threshold)         0.619    0.61       0.613      0.8             0.654
expert                                   0.672    0.604      0.638      0.818           0.679
baseline                                 0.436    0.844      0.575      0.674           0.622
single                                   0.495    0.473      0.483      0.737           0.54

It is also notable that the baseline outperforms the single annotator. This could be an indicator that the crowd can only achieve quality when accounting for the choices of multiple annotators. In addition, the recall score for the baseline is notably high. This could be a consequence of how the model performs its training – one of the features it learns is the UMLS type of the terms. For cause, term types are often enough to accurately qualify the relation.

The learning curves (Fig. 5) show we still have not reached the ideal amount of training data, especially for the CrowdTruth approach, in which the weights of sentences have less of a cumulative effect than in datasets with binary labels. In other words, when accounting for ambiguity in training, more data points are needed to reach maximum performance. A bottleneck in this analysis is the availability of expert annotations – we did not have the resources to collect a larger expert dataset, and this indeed is the main reason to consider crowdsourcing. It is also worth noting that, while the crowd annotations consistently outperform the distant-supervision baseline, we do not yet have a fair comparison between a distant supervision approach and the CrowdTruth approach. The real value of distant supervision is that large amounts of data can be gathered rather easily and cheaply, since humans are not involved. We are working on experiments to explore the trade-off between scale, quality, and cost, based on the assumption that systems trained with either kind of data will eventually reach a performance maximum.
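The stability check behind Fig. 6 can be sketched as follows. This is our illustrative reading of the metric (the mean cosine distance between the sentence vector built from the first k−1 judgments and the one built from the first k), not the authors' published code.

```python
# Sketch of the worker-count stability check (illustrative).
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_distance_at_k(per_sentence_vectors, k):
    """per_sentence_vectors: for every sentence, the list of worker annotation
    vectors in the order the judgments arrived. Returns the mean cosine distance
    between the sentence vector built from the first k-1 judgments and the one
    built from the first k, over all sentences with at least k judgments."""
    if k < 2:
        raise ValueError("need at least two judgments to compare before/after")
    distances = []
    for vectors in per_sentence_vectors:
        if len(vectors) < k:
            continue
        before = np.sum(vectors[:k - 1], axis=0)
        after = np.sum(vectors[:k], axis=0)
        if np.linalg.norm(before) == 0 or np.linalg.norm(after) == 0:
            continue  # skip degenerate all-zero vectors
        distances.append(cosine_distance(before, after))
    return float(np.mean(distances)) if distances else None
```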
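For reference, the pairwise significance values quoted in Sect. 4 come from McNemar's test over paired classifier decisions on the shared test set. The sketch below shows the standard exact (binomial) form of the test; the paper does not state which variant was used, so this choice is an assumption.

```python
# Exact McNemar's test over paired correctness decisions (illustrative).
from scipy.stats import binom

def mcnemar_exact(correct_a, correct_b):
    """correct_a, correct_b: parallel booleans saying whether models A and B
    classified each test sentence correctly. Returns the two-sided exact p-value
    computed from the discordant pairs."""
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if not a_ok and b_ok)
    n = b + c
    if n == 0:
        return 1.0
    # exact two-sided p-value under Binomial(n, 0.5), capped at 1
    return min(1.0, 2.0 * binom.cdf(min(b, c), n, 0.5))
```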
Finally, in Figs. 6 and 7 we observe that we need at least 10 workers to get a stable crowd score. This result goes against the general practice for building a ground truth, where there are usually 2 to 5 annotators per task. Based on our results, we believe that this general practice is wrong, and that, outside of a few clear cases, the input of more annotators is necessary to capture ambiguity. Even with this added requirement, we found that crowdsourcing is still cheaper than medical experts – the cost of the experts was 50% higher.

6 Conclusion

The lack of ground truth for training and benchmarking is one of the main challenges of Clinical NLP. In addition, current methods for collecting annotations attempt to minimize disagreement between annotators, but end up failing to model the ambiguity inherent in language. We propose the CrowdTruth method for crowdsourcing ground truth while also capturing and interpreting disagreement. We used CrowdTruth to build a gold standard of 902 sentences for medical relation extraction, which was employed in training a classification model. We have shown that, with appropriate processing, the crowd performs just as well as medical experts in terms of the quality and efficacy of annotations, while being cheaper and more readily available. Our results indicate that at least 10 workers per sentence are needed to get the highest quality annotations, in contrast to the general practice of employing a small number of annotators for collecting ground truth. We plan to continue our experiments by scaling out the crowdsourcing approach, which has the potential to perform even better.

Acknowledgments

The authors would like to thank Chang Wang for support with using the medical relation extraction classifier, and Anthony Levas for help with collecting the expert annotations.

References

1. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17. AMIA (2001)
2. Aroyo, L., Welty, C.: Crowd Truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. Web Science 2013. ACM (2013)
3. Aroyo, L., Welty, C.: Measuring crowd truth for medical relation extraction. In: AAAI 2013 Fall Symposium on Semantics for Big Data (2013)
4. Aroyo, L., Welty, C.: The Three Sides of CrowdTruth. Journal of Human Computation 1, 31–34 (2014)
5. Aroyo, L., Welty, C.: Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine 36(1), 15–24 (2015)
6. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267–D270 (2004)
7. Burger, J.D., Doughty, E., Bayer, S., Tresner-Kirsch, D., Wellner, B., Aberdeen, J., Lee, K., Kann, M.G., Hirschman, L.: Validating candidate gene-mutation relations in MEDLINE abstracts via crowdsourcing. In: Data Integration in the Life Sciences, pp. 83–91. Springer (2012)
8. Chapman, W.W., Nadkarni, P.M., Hirschman, L., D’Avolio, L.W., Savova, G.K., Uzuner, O.: Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the AMIA 18(5), 540–543 (2011)
9. Chen, D.L., Dolan, W.B.: Building a persistent workforce on Mechanical Turk for multilingual data collection. In: Proceedings of the 3rd HCOMP (2011)
10. Chilton, L.B., Little, G., Edge, D., Weld, D.S., Landay, J.A.: Cascade: crowdsourcing taxonomy creation. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1999–2008. CHI ’13, ACM, New York, NY, USA (2013)
11. Dumitrache, A., Aroyo, L., Welty, C.: CrowdTruth measures for language ambiguity: The case of medical relation extraction. In: Proceedings of the 2015 International Workshop on Linked Data for Information Extraction (LD4IE-2015), 14th International Semantic Web Conference (2015)
12. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proc. NAACL HLT, pp. 80–88. CSLDAMT ’10, ACL (2010)
13. Hovy, D., Plank, B., Søgaard, A.: Experiments with crowdsourced re-annotation of a POS tagging data set. In: Proceedings of the 52nd Annual Meeting of the ACL (Volume 2: Short Papers), pp. 377–382. ACL, Baltimore, Maryland (June 2014)
14. Inel, O., Khamkham, K., Cristea, T., Dumitrache, A., Rutjes, A., van der Ploeg, J., Romaszko, L., Aroyo, L., Sips, R.J.: CrowdTruth: Machine-human computation framework for harnessing disagreement in gathering annotated data. In: The Semantic Web – ISWC 2014, pp. 486–504. Springer (2014)
15. Kondreddi, S.K., Triantafillou, P., Weikum, G.: Combining information extraction and human computing for crowdsourced knowledge acquisition. In: 30th International Conference on Data Engineering, pp. 988–999. IEEE (2014)
16. Lee, J., Cho, H., Park, J.W., Cha, Y.r., Hwang, S.w., Nie, Z., Wen, J.R.: Hybrid entity clustering using crowds and data. The VLDB Journal 22(5), 711–726 (2013)
17. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 1003–1011. ACL (2009)
18. Mortensen, J.M., Musen, M.A., Noy, N.F.: Crowdsourcing the verification of relationships in biomedical ontologies. In: AMIA Annual Symposium Proceedings, vol. 2013, p. 1020. AMIA (2013)
19. Nadkarni, P.M., Ohno-Machado, L., Chapman, W.W.: Natural language processing: an introduction. Journal of the AMIA 18(5), 544–551 (2011)
20. Ogden, C.K., Richards, I.: The Meaning of Meaning. Trubner & Co, London (1923)
21. Soberón, G., Aroyo, L., Welty, C., Inel, O., Lin, H., Overmeen, M.: Measuring crowd truth: Disagreement metrics combined with worker behavior filters. In: 1st International Workshop on Crowdsourcing the Semantic Web, 12th International Semantic Web Conference (2013)
22. Wang, C., Fan, J.: Medical relation extraction with manifold models. In: 52nd Annual Meeting of the ACL, vol. 1, pp. 828–838. ACL (2014)
23. Zhai, H., Lingren, T., Deleger, L., Li, Q., Kaiser, M., Stoutenborough, L., Solti, I.: Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing. JMIR 15(4) (2013)