<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">CrowdTruth Measures for Language Ambiguity: The Case of Medical Relation Extraction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anca</forename><surname>Dumitrache</surname></persName>
							<email>anca.dumitrache@vu.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">VU University Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">IBM CAS</orgName>
								<address>
									<settlement>Amsterdam</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lora</forename><surname>Aroyo</surname></persName>
							<email>lora.aroyo@vu.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">VU University Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chris</forename><surname>Welty</surname></persName>
							<email>cawelty@gmail.com</email>
							<affiliation key="aff2">
								<orgName type="institution">Google Research</orgName>
								<address>
									<settlement>New York</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">CrowdTruth Measures for Language Ambiguity: The Case of Medical Relation Extraction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">43F4A6DD69CDE4DF95C873824625AAD7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. We have been investigating the use of crowdsourcing as an affordable alternative to using experts to clean noisy data, and have found that with the proper analysis, crowds can rival and even out-perform the precision and recall of experts, at a much lower cost. We have further found that the crowd, by virtue of its diversity, can help us find evidence of ambiguous sentences that are difficult to classify, and we have hypothesized that such sentences are likely just as difficult for machines to classify. In this paper we outline CrowdTruth, a previously presented method for scoring ambiguous sentences, which suggests that existing models of truth are inadequate, and we present for the first time a set of weighted metrics for evaluating the performance of experts, the crowd, and a trained classifier in light of ambiguity. We show that our theory of truth and our metrics are a more powerful way to evaluate NLP performance over traditional unweighted metrics like precision and recall, because they allow us to account for the rather obvious fact that some sentences express the target relations more clearly than others.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>NLP often relies on the development of a set of gold standard annotations, or ground truth, for the purpose of training, testing and evaluation. Distant supervision <ref type="bibr" target="#b16">[17]</ref> is a popular solution that has brought linked data sets considerable attention in NLP; however, the resulting data can be noisy. Human annotators can help to clean up this noise, but in Clinical NLP annotators are usually believed to require domain knowledge, making the process of acquiring ground truth more difficult. The lack of annotated datasets for training and benchmarking is considered one of the big challenges of Clinical NLP <ref type="bibr" target="#b5">[6]</ref>.</p><p>Furthermore, the assumption that the gold standard represents a universal and reliable model for language is flawed <ref type="bibr" target="#b3">[4]</ref>. Disagreement between annotators is usually eliminated through overly prescriptive guidelines, resulting in data that is neither general nor reflective of language's inherent ambiguity. The process of acquiring ground truth by working exclusively with domain experts is costly and non-scalable.</p><p>Crowdsourcing can be a much faster and cheaper procedure than expert annotation, and it allows for collecting enough annotations per task to represent the diversity inherent in language. Crowd workers, however, generally lack medical expertise, which might impact the quality and reliability of their work in more knowledge-intensive tasks.</p><p>Our approach can overcome the limitations of gathering expert ground truth by using disagreement analysis on crowd annotations to model the ambiguity inherent in medical text. We have previously shown that our approach can improve relation extraction classifier performance over annotated data provided by experts, can effectively identify low-quality workers, and can identify issues with the annotation tasks themselves. 
In this paper we explore the hypothesis that our sentence-level metrics provide useful information about sentence clarity, and present initial results on the value of scoring approaches other than the traditional precision, recall, and accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Crowdsourcing ground truth has shown promising results in a variety of domains. <ref type="bibr" target="#b11">[12]</ref> compared the crowd versus experts for the task of part-of-speech tagging. The authors also show that models trained on crowdsourced annotation can perform just as well as expert-trained models. <ref type="bibr" target="#b13">[14]</ref> studied crowdsourcing for relation extraction in the general domain, comparing its efficiency to that of fully automated information extraction approaches. Their results showed the crowd was especially suited to identifying subtle formulations of relations that do not appear frequently enough to be picked up by statistical methods.</p><p>Other research for crowdsourcing ground truth includes: entity clustering and disambiguation <ref type="bibr" target="#b14">[15]</ref>, Twitter entity extraction <ref type="bibr" target="#b10">[11]</ref>, multilingual entity extraction and paraphrasing <ref type="bibr" target="#b7">[8]</ref>, and taxonomy creation <ref type="bibr" target="#b8">[9]</ref>. However, all of these approaches rely on the assumption that one black-and-white gold standard must exist for every task. Disagreement between annotators is discarded by picking one answer that reflects some consensus, usually through majority vote. The number of annotators per task is also kept low, between two and five workers, again in the interest of eliminating disagreement. The novelty of our approach is to consider language ambiguity, and consequently inter-annotator disagreement, as an inherent feature of language. The metrics we employ for determining the quality of crowd answers are specifically tailored to quantify disagreement between annotators, rather than eliminate it.</p><p>The role of inter-annotator disagreement when building a gold standard has previously been discussed by <ref type="bibr" target="#b18">[19]</ref>. 
After empirically studying part-of-speech datasets, the authors found that inter-annotator disagreement is consistent across domains, even across languages. Furthermore, most disagreement is indicative of debatable cases in linguistic theory, rather than faulty annotation. We believe these findings manifest even more strongly for NLP tasks involving semantic ambiguity, such as relation extraction. In assessing the Ontology Alignment Evaluation Initiative (OAEI) benchmark, <ref type="bibr" target="#b6">[7]</ref> found that disagreement between annotators (both crowd and expert) is an indicator of inherent uncertainty in the domain knowledge, and that current benchmarks in ontology alignment and evaluation are not designed to model this uncertainty.</p><p>Human annotation is a process of semantic interpretation. It can be described using the triangle of reference <ref type="bibr" target="#b12">[13]</ref>, which links together three aspects: sign (input text), interpreter (worker), referent (annotation). Ambiguity in one aspect of the triangle will propagate and affect the others - e.g. an unclear sentence will cause more disagreement between workers. Therefore, in our work, we use metrics to harness disagreement for each of the three aspects of the triangle, measuring the quality of the worker, as well as the ambiguity of the text and the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head><p>We set up an experiment to train and evaluate a relation extraction model for a sentence-level relation classifier. The classifier takes, as input, sentences and two terms from the sentence, and returns a score reflecting the likelihood that a specific relation, in our case the cause relation between disorders and symptoms, is expressed in the sentence between the terms. Starting from a set of 902 sentences that are likely to contain medical relations, we constructed a workflow for collecting annotations through crowdsourcing. This output was analyzed with our metrics for capturing disagreement, and then used to train a model for relation extraction. In parallel, we also constructed a model based on data from a traditional gold standard using domain experts, that we then compare to the crowd model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>The dataset used in our experiments contains 902 medical sentences extracted from PubMed article abstracts. The MetaMap parser <ref type="bibr" target="#b0">[1]</ref> was run over the corpus to identify medical terms from the UMLS vocabulary <ref type="bibr" target="#b4">[5]</ref>. Distant supervision <ref type="bibr" target="#b16">[17]</ref> was used to select sentences with pairs of terms that are linked in UMLS by one of our chosen seed medical relations. The intuition behind distant supervision is that since we know the terms are related, and they appear in the same sentence, it is more likely that the sentence expresses a relation between them. The seed relations were restricted to a set of eleven UMLS relations important for clinical decision making <ref type="bibr" target="#b20">[21]</ref> (see Tab.1). All of the data that we have used is available online at: http://data.crowdtruth.org/medical-relex. For collecting annotations from medical experts, we employed medical students in their third year at American universities who had just taken the United States Medical Licensing Examination (USMLE) and were waiting for their results. Each sentence was annotated by exactly one person. The annotation task consisted of deciding whether or not the UMLS seed relation discovered by distant supervision is present in the sentence for the two selected terms.</p></div>
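The distant-supervision selection step described above can be sketched in a few lines. The knowledge-base entries and sentences below are hypothetical stand-ins for UMLS relation tuples and PubMed sentences, not data from the paper, and the simple substring matching is a placeholder for the MetaMap term identification.

```python
# Sketch of distant supervision: a sentence becomes a candidate training
# example when it mentions two terms that a knowledge base (a toy stand-in
# for UMLS here) links by a seed relation. All entries are hypothetical.
KB = {
    ("fever", "dizziness"): "cause",
    ("penicillin", "infection"): "treat",
}

def distant_supervision(sentences, kb):
    """Pair each sentence with every KB relation whose two terms it mentions."""
    candidates = []
    for sent in sentences:
        lowered = sent.lower()
        for (t1, t2), relation in kb.items():
            if t1 in lowered and t2 in lowered:
                candidates.append((sent, t1, t2, relation))
    return candidates

sentences = [
    "Fever often induces dizziness in elderly patients.",
    "Penicillin remains widely used to treat bacterial infection.",
    "The patient reported dizziness after standing up.",  # one term only: skipped
]

examples = distant_supervision(sentences, KB)
for sent, t1, t2, rel in examples:
    print(rel, "->", sent)
```

Note that, as the paper observes, the selected sentences are only *likely* to express the relation; this is exactly the noise the crowd annotation step is meant to measure.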
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Crowdsourcing setup</head><p>The crowdsourced annotation is performed in a workflow of three tasks (Fig. <ref type="figure" target="#fig_0">1</ref>). The sentences were pre-processed to determine whether the terms found with distant supervision are complete or not. Identifying complete medical terms is difficult, and the automated method left a number of terms incomplete, which was a significant source of error for the crowd in subsequent stages; the incomplete terms were therefore sent through a crowdsourcing task (FactSpan) to recover the full word span of the medical terms. Next, the sentences with the corrected term spans were sent to a relation extraction task (RelEx), where the crowd was asked to decide which relation holds between the two extracted terms. We also added four new relations (e.g. associated with), to account for weaker, more general links between the terms (see Tab.1). The workers were able to read the definition of each relation, and could choose any number of relations per sentence. There were options for the cases where the terms were related, but not by any of the relations we provided (other), and for no relation between the terms (none). Finally, the results from RelEx were passed to another crowdsourcing task (RelDir) to determine the direction of the relation with regard to the two extracted terms. The two additional tasks (FactSpan and RelDir) were added around the basic RelEx task to correct the most common sources of errors from the crowd.</p><p>All three crowdsourcing tasks were run on the CrowdFlower platform<ref type="foot" target="#foot_0">4</ref> with 10-15 workers per sentence, to allow for a distribution of perspectives. Even with three tasks and 10-15 workers per sentence, compared to a single expert judgment per sentence, the total cost of the crowd amounted to 2/3 of the sum paid for the experts. In our case, cost was not the limiting factor for the experts, but their time and availability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Metrics</head><p>For each crowdsourcing task in the workflow, the crowd output was processed with our metrics - a set of general-purpose crowdsourcing metrics <ref type="bibr" target="#b2">[3]</ref>. These metrics attempt to model the crowdsourcing process based on the triangle of reference <ref type="bibr" target="#b17">[18]</ref>, with the vertices being the input sentence, the worker, and the target relations. Our theory is that ambiguity and disagreement at any of the vertices (e.g. a sentence with unclear meaning, a poor quality worker, or an unclear relation) will propagate in the system, influencing the other components. For example, a worker who annotates an unclear sentence is more likely to disagree with the other workers, and this can impact that worker's quality. A low quality worker is more likely to disagree with the other workers, and this can impact the apparent quality of the sentence. If one of the target relations is itself ambiguous, it will be difficult to identify and will generate disagreement that may have nothing to do with the quality of sentences or workers. Our metrics account for this by isolating the signals from the workers, sentences, and the target relations, and more accurately evaluating each. In previous work we have validated this premise in several empirical studies <ref type="bibr" target="#b2">[3]</ref>.</p><p>In this paper we focus specifically on sentence quality, to evaluate our claim that low quality sentences are difficult to annotate, and likewise difficult for machines to process. To measure this effect, we begin with a simple representation of the crowd output from the RelEx task (an example of the computed scores is given in Tab.2):</p><p>annotation vector: the annotations of one worker for one sentence. For each worker i, their solution to a task on a sentence s is the vector W_s,i. If the worker selects a relation, its corresponding component is marked with '1', and '0' otherwise. For instance, in the case of RelEx, the vector has fourteen components, one for each relation, plus none and other.</p><p>sentence vector: For every sentence s, we sum the annotation vectors of all workers on the given task:</p><formula xml:id="formula_1">V_s = Σ_i W_s,i</formula><p>The sentence vector is a simple representation of the annotations on a sentence, and leads to the sentence-relation score, which measures, for each relation, the degree to which a sentence vector diverges from perfect agreement on that relation. It is simply the cosine similarity between the sentence vector and the unit vector for the relation: srs(s, r) = cos(V_s, r). The higher the value of this metric, the more clearly the relation is expressed in the sentence. The purpose of the experiments is to provide evidence that the srs is measuring the clarity, or inversely the ambiguity, of a sentence with respect to a particular relation, and that sentences with low scores present difficulty for the crowd, experts, and machines alike.</p><p>We use a two-step process to eliminate low-quality worker annotations. We run the sentence metrics and filter out sentences whose quality score is one standard deviation below the mean, then we run our worker metrics <ref type="bibr" target="#b1">[2]</ref> on the remaining sentences and filter out all workers below a trained threshold. The purpose of the first step is to ensure the worker quality scores are not adversely impacted by confusing sentences. 
We remove all low quality worker annotations and re-evaluate the sentence metrics on all sentences.</p></div>
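The sentence vector and sentence-relation score above can be sketched directly: the srs is the cosine between the summed vote vector and the unit vector of a relation, which reduces to that relation's component divided by the vector norm. The naming and ordering of the fourteen answer options in RELATIONS is a hypothetical illustration, and the worker votes are invented.

```python
import math

# Sketch of the CrowdTruth sentence vector and sentence-relation score (srs).
# RELATIONS is a hypothetical ordering of the 14 answer options
# (12 relations plus "other" and "none").
RELATIONS = ["treat", "prevent", "diagnosis", "cause", "location", "symptom",
             "manifestation", "contraindicate", "associated_with", "side_effect",
             "is_a", "part_of", "other", "none"]

def sentence_vector(annotation_vectors):
    """Sum the per-worker binary annotation vectors for one sentence."""
    return [sum(col) for col in zip(*annotation_vectors)]

def srs(vs, relation):
    """Cosine similarity between the sentence vector and the unit vector
    for `relation`, i.e. vs[r] / ||vs||."""
    norm = math.sqrt(sum(x * x for x in vs))
    return vs[RELATIONS.index(relation)] / norm if norm else 0.0

# Ten hypothetical workers: eight picked "cause", one "symptom", one "none".
workers = [[0] * 14 for _ in range(10)]
for w in workers[:8]:
    w[RELATIONS.index("cause")] = 1
workers[8][RELATIONS.index("symptom")] = 1
workers[9][RELATIONS.index("none")] = 1

vs = sentence_vector(workers)
print(round(srs(vs, "cause"), 2))    # high agreement -> 0.98
print(round(srs(vs, "symptom"), 2))  # lone vote -> 0.12
```

A sentence where the votes are split across many relations would score low on all of them, which is exactly the disagreement signal the metrics are designed to preserve.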
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Training the model</head><p>At the highest level our research goal is to investigate crowdsourcing as a way to gather human annotated data for training and evaluating cognitive systems. In these experiments we were specifically gathering annotated data for a sentence-level relation extraction classifier <ref type="bibr" target="#b20">[21]</ref>. This classifier is trained per individual relation, by feeding it both positive and negative examples. It offers support for both discrete labels and real values for weighting the confidence of the training data entries, with positive values in (0, 1], and negative values in [−1, 0).</p><p>To test our approach, we gathered four annotated data sets and trained classifier models for the cause relation using five-fold cross-validation over the 902 sentences. In order to directly compare the expert to the crowd annotations, it was necessary to annotate precisely the same sentences using each method, and train the classifier on each set. The limitation on batch size came from the availability of our experts; we were only able to use them for 902 sentences. In a batch this small, we found that the sentence-relation score, which ranged over [0, 1] and rarely assigned a weight of 1, diluted the positive signal too much in comparison to the expert scores, which were simply 0 or 1. We experimented, on a different data set, with rescaling the scores and selected the range that yielded the highest quality score, specified above.</p></div>
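The re-scaling of sentence-relation scores into weighted training labels might look like the following. The paper specifies only the threshold and the target intervals [0.85, 1] for positives and [−1, −0.85] for negatives, so the exact linear mapping used here is an assumption.

```python
def training_score(srs_score, tau=0.5):
    """Re-scale a sentence-relation score in [0, 1] into the classifier's
    weighted-label ranges: positives (srs >= tau) into [0.85, 1],
    negatives into [-1, -0.85]. The linear interpolation within each
    interval is an assumption; the paper states only the target ranges."""
    if srs_score >= tau:
        # [tau, 1] maps onto [0.85, 1]
        frac = (srs_score - tau) / (1 - tau)
        return 0.85 + 0.15 * frac
    # [0, tau) maps onto [-1, -0.85]: the lower the srs, the more negative
    frac = srs_score / tau
    return -1 + 0.15 * frac

print(training_score(1.0))   # clearest positive -> 1.0
print(training_score(0.5))   # just at the threshold -> 0.85
print(training_score(0.0))   # clearest negative -> -1.0
```

Compressing the weights into narrow bands near ±1 keeps the crowd's positive signal from being diluted relative to the experts' discrete 0/1 labels, as discussed above.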
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Evaluation setup</head><p>To enable a meaningful comparison between the crowd and expert models, we vetted the sentences to provide a ground truth - a discrete positive or negative label on each sentence used in evaluation (for training, only the scores from the respective data set were used). While the main purpose of this work is to move beyond discrete labels for truth, we needed a reference standard to establish that our approach is at least as good as the accepted practice. To produce this reference standard, we first selected the positive/negative threshold for the sentence-relation score in the crowd dataset that yielded the highest agreement between the crowd and the experts, and then accepted the labels of all 755 sentences on which the experts and crowd agreed. The remaining sentences were manually evaluated and assigned either a positive, negative, or ambiguous value. The ambiguous cases were subsequently removed, resulting in 902 sentences. In this way we created reliable, unbiased test scores, to be used in the evaluation of the models. In some sense, removing the ambiguous cases penalizes our approach, which is designed specifically to help deal with them, but again we want to first establish that our approach is at least as good as accepted practice.</p></div>
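The threshold-selection step above can be sketched as a simple maximization of crowd-expert agreement over candidate thresholds; the scores and expert labels below are invented for illustration.

```python
def best_threshold(crowd_scores, expert_labels, thresholds):
    """Pick the srs threshold whose induced positive/negative labels
    agree with the expert labels on the most sentences."""
    def agreement(tau):
        return sum((s >= tau) == e for s, e in zip(crowd_scores, expert_labels))
    return max(thresholds, key=agreement)

# Hypothetical srs values and expert decisions (True = relation present).
scores = [0.95, 0.80, 0.72, 0.40, 0.10, 0.65, 0.30]
expert = [True, True, True, False, False, False, False]
taus = [i / 10 for i in range(1, 10)]
print(best_threshold(scores, expert, taus))  # -> 0.7
```

With the invented data the agreement happens to peak at 0.7, the same threshold the paper reports for the real crowd/expert comparison, but that coincidence carries no significance here.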
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Preliminary experiments</head><p>As reported in <ref type="bibr" target="#b9">[10]</ref> and summarized here, we compared each of the four datasets to our vetted reference standard, to determine the quality of the cause relation annotations, as shown in Fig. <ref type="figure" target="#fig_2">2</ref>. As expected, the baseline data was the lowest quality, followed closely by the single crowd worker. The expert annotations achieved an F1 score of 0.844. Since the baseline, expert, and single sets are binary decisions, they appear as horizontal lines. For the crowd annotations, we plotted the F1 against different sentence-relation score thresholds for determining positive and negative sentences. Between the thresholds of 0.6 and 0.8, the crowd outperforms the expert, reaching a maximum F1 score of 0.907 at a threshold of 0.7. This difference is significant with p = 0.007, measured with McNemar's test <ref type="bibr" target="#b15">[16]</ref>.</p><p>We next wanted to verify that this improvement in annotation quality has a positive impact on the model that is trained with this data. In a cross-validation experiment, we trained the model with each of the four datasets for identifying the cause relation (discussed in more detail in <ref type="bibr" target="#b9">[10]</ref>). The results of the evaluation (Fig. <ref type="figure" target="#fig_3">3</ref>) show the best performance for the crowd model when the sentence-relation threshold for deciding between negative/positive equals 0.5. Trained with this data, the classifier model achieves an F1 score of 0.642, compared to the expert-trained model, which reaches 0.638. McNemar's test shows statistical significance with p = 0.016. This result demonstrates that the crowd provides training data that is at least as good as, if not better than, that of experts.</p></div>
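The significance figures above come from McNemar's test, which compares two classifiers on the pairs of examples where exactly one of them is correct. A minimal exact two-sided version, computed from discordant-pair counts with only the standard library, might look like this; the counts used in the example are hypothetical, not the paper's.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test from the discordant-pair counts:
    b = examples only method A got right, c = examples only method B got
    right. Under H0 each discordant pair is a fair coin flip, so the
    smaller count follows Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # Two-sided p-value: twice the tail probability of the smaller count.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Hypothetical discordant counts: crowd-only correct on 25 sentences,
# expert-only correct on 10.
print(round(mcnemar_exact(25, 10), 3))
```

The concordant counts (sentences both methods label identically) do not enter the statistic, which is why the test suits paired comparisons on a shared evaluation set like the 902 sentences here.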
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Results and Discussion</head><p>We believe the discrete notion of truth is obsolete and should be replaced by something more flexible. For the purposes of semantic interpretation tasks for which crowdsourcing is appropriate, we propose our annotation-level metrics as a suitable replacement. In this case, the sentence-relation score gives a real-valued score that measures the degree to which a particular sentence expresses a particular relation between two terms. We believe the preliminary experiments demonstrate the approach is sound. Our primary results evaluate the sentence-relation score as a measure of the clarity with which a sentence expresses the relation. To this end, we define the following metrics:</p><p>sentence weight: For a given positive/negative threshold τ, if srs(s) ≥ τ for sentence s then the sentence weight w_s = srs(s), otherwise w_s = 1 − srs(s).</p><p>weighted precision: We collect true and false positives and negatives in the standard way based on the vetted reference standard, such that tp(s) = 1 iff s is a true positive, and 0 otherwise, and similarly for fp, tn, fn. Where normally p = tp/(tp + fp), weighted precision p' = Σ_s w_s tp(s) / Σ_s w_s (tp(s) + fp(s)).</p><p>weighted recall: Where normally r = tp/(tp + fn), weighted recall r' = Σ_s w_s tp(s) / Σ_s w_s (tp(s) + fn(s)).</p><p>weighted f-measure: the harmonic mean of weighted precision and recall:</p><formula xml:id="formula_2">f1' = 2p'r'/(p' + r')</formula><p>If the srs metric is a true measure of clarity, then we would expect low clarity sentences to be more likely to be labeled wrongly, and high clarity sentences less likely, and this should be revealed in an overall increase of the weighted scores over the unweighted. In Tab. 3, we show a comparison of five data sets. In the first two columns, the annotation quality of each data set is shown, comparing the F1 to the weighted F1'. 
The F1' scores are higher in all cases, revealing that human annotators do indeed have trouble correctly annotating low-clarity sentences. The baseline scores are the least affected by the weighting, which also fits with our intuition, since the baseline does not use human judgment at all. The next six columns in each row show classifier performance when trained with that dataset. The first pair of columns compares the F1 to F1', and for reference the final four columns show the precision and recall. In all cases the classifier F1' is greater than F1, indicating that, as with humans, machines have trouble correctly interpreting sentences with a low srs. The only weighted metric that does not increase is the baseline recall; again this is justified, as the baseline does not actually require any interpretation.</p><p>In Fig. <ref type="figure">4</ref> we show how the classifier performs across the possible thresholds; the weighted scores are consistently higher.</p><p>We also analyzed the data to understand the overlap between the crowd scores and the experts. In Fig. <ref type="figure">5</ref> we compared the frequency of sentences with cause annotations at different sentence-relation scores (measured with kernel density estimation <ref type="bibr" target="#b19">[20]</ref>) to the expert annotations of the same sentences. The result shows high agreement between the crowd and the expert - a low sentence-relation score is highly correlated with a negative expert decision, and a high score is highly correlated with a positive expert decision. In Fig. <ref type="figure" target="#fig_5">6</ref> we show the number of sentences in which the crowd agrees with the expert (on both positive and negative decisions), plotted against different positive/negative thresholds for the sentence-relation score of cause. The maximum agreement with the expert set is at the 0.7 threshold, the same as for the annotation quality F1 score (Fig. 
<ref type="figure" target="#fig_2">2</ref>), with 755 sentences in common between crowd and expert. The remaining 147 sentences were manually evaluated to build the test partition.</p></div>
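The weighted precision, recall, and F1 defined above can be sketched as follows. The evaluation triples are invented to illustrate the mechanism: errors that fall on ambiguous (low-weight) sentences are discounted, so the weighted scores rise above the unweighted ones.

```python
def weighted_prf(results, tau=0.5):
    """Weighted precision/recall/F1 from (srs, predicted, actual) triples.
    Each sentence's weight is its clarity: w = srs if srs >= tau,
    else 1 - srs, so clear sentences count more than ambiguous ones."""
    wtp = wfp = wfn = 0.0
    for srs, predicted, actual in results:
        w = srs if srs >= tau else 1 - srs
        if predicted and actual:
            wtp += w
        elif predicted and not actual:
            wfp += w
        elif actual:
            wfn += w
    p = wtp / (wtp + wfp) if wtp + wfp else 0.0
    r = wtp / (wtp + wfn) if wtp + wfn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical evaluation: the mistakes land on ambiguous sentences
# (srs near the 0.5 threshold), so they receive small weights.
results = [
    (0.95, True, True),
    (0.90, True, True),
    (0.55, True, False),   # false positive on an unclear sentence
    (0.10, False, False),
    (0.45, False, True),   # false negative on an unclear sentence
]
p, r, f1 = weighted_prf(results)
print(round(p, 2), round(r, 2), round(f1, 2))  # -> 0.77 0.77 0.77
```

For comparison, the unweighted scores on the same triples are p = r = f1 = 2/3; the weighting lifts them because both errors sit on low-clarity sentences.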
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. Current methods for collecting this human annotation attempt to minimize disagreement between annotators, but end up failing to capture the ambiguity inherent in language. We believe this is a vestige of an antiquated notion of truth as a discrete property, and have developed a powerful new method for representing truth.</p><p>In this paper we have presented results showing that using a larger number of workers per example (up to 15) can form a more accurate model of truth at the sentence level, and significantly improve the quality of the annotations. It also benefits systems that use this annotated data, such as machine learning systems, significantly improving their performance with higher quality data. Our primary result is to show that our scoring metric for sentence quality in relation extraction supports our hypothesis that higher quality sentences are easier to classify - for crowd workers, experts, and machines alike - and that our model of truth allows us to more faithfully capture the ambiguity that is inherent in language and human interpretation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: CrowdTruth Workflow for Medical Relation Extraction on CrowdFlower [10].</figDesc><graphic coords="5,134.77,245.48,345.82,107.59" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>1 .</head><label>1</label><figDesc>baseline: Discrete (positive or negative) labels are given for each sentence by the distant supervision method -for any relation, a positive example is a sentence containing two terms related by cause in UMLS. Distant supervision does not extract negative examples, so in order to generate a negative set for one relation, we use positive examples for the other (non-overlapping) relations shown in Tab. 1. 2. expert: Discrete labels based on an expert's judgment as to whether the baseline label is correct. The experts do not generate judgments for all combinations of sentences and relations -for each sentence, the annotator decides on the seed relation extracted with distant supervision. We reuse positive examples from the other relations to extend the number of negative examples. 3. single: Discrete labels for every sentence are taken from one randomly selected crowd worker who annotated the sentence. This data simulates the traditional single annotator setting. 4. crowd: Weighted labels for every sentence are based on the CrowdTruth sentence-relation score. The classifier expects positive scores for positive examples, and negative scores for negative, so the sentence-relation scores must be re-scaled. An important variable in the re-scaling is a threshold to select positive and negative examples. The Results section compares the performance of the crowd at different threshold values. Given a threshold, the sentence-relation score is then linearly re-scaled into the [0.85, 1] interval for the positive label weight, and the [−1, −0.85] interval for negative (see below). An example of how the scores were processed is given in Tab.2.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Annotation quality F1 score per negative/positive threshold for cause.</figDesc><graphic coords="8,141.31,529.93,155.62,101.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Classifier F1 scores when trained with each dataset.</figDesc><graphic coords="8,318.43,529.93,155.62,101.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 :Fig. 5 :</head><label>45</label><figDesc>Fig. 4: Comparison of weighted to non-weighted F1 scores for the crowd-trained classifier at different thresholds.</figDesc><graphic coords="10,134.77,218.62,345.83,119.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 6 :</head><label>6</label><figDesc>Fig. 6: Crowd &amp; expert agreement per neg./pos. threshold for cause.</figDesc><graphic coords="10,318.43,394.28,155.62,95.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Set of medical relations.</figDesc><table><row><cell>Relation</cell><cell>Corresponding UMLS relation(s)</cell><cell>Definition</cell><cell>Example</cell></row><row><cell>treat</cell><cell>may treat</cell><cell>therapeutic use of a drug</cell><cell>penicillin treats infection</cell></row><row><cell>prevent</cell><cell>may prevent</cell><cell>preventative use of a drug</cell><cell>vitamin C prevents influenza</cell></row><row><cell>diagnosis</cell><cell>may diagnose</cell><cell>diagnostic use of an ingredient, test or a drug</cell><cell>RINNE test is used to diagnose hearing loss</cell></row><row><cell>cause</cell><cell>cause of; has causative agent</cell><cell>the underlying reason for a symptom or a disease</cell><cell>fever induces dizziness</cell></row><row><cell>location</cell><cell>disease has primary anatomic site; has finding site</cell><cell>body part in which disease or disorder is observed</cell><cell>leukemia is found in the circulatory system</cell></row><row><cell>symptom</cell><cell>disease has finding; disease may have finding</cell><cell>deviation from normal function indicating the presence of disease or abnormality</cell><cell>pain is a symptom of a broken arm</cell></row><row><cell>manifestation</cell><cell>has manifestation</cell><cell>links disorders to the observations that are closely associated with them</cell><cell>abdominal distention is a manifestation of liver failure</cell></row><row><cell>contraindicate</cell><cell>contraindicated drug</cell><cell>a condition for which a drug or treatment should not be used</cell><cell>patients with obesity should avoid using danazol</cell></row><row><cell>associated with</cell><cell></cell><cell>signs, symptoms or findings that often appear together</cell><cell>patients who smoke often have yellow teeth</cell></row><row><cell>side effect</cell><cell></cell><cell>a secondary condition or symptom that results from a drug</cell><cell>use of antidepressants causes dryness in the eyes</cell></row><row><cell>is a</cell><cell></cell><cell>a relation that indicates that one of the terms is a more specific variation of the other</cell><cell>migraine is a kind of headache</cell></row><row><cell>part of</cell><cell></cell><cell>an anatomical or structural sub-component</cell><cell>the left ventricle is part of the heart</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>Sent.1: Renal osteodystrophy is a general complication of chronic renal failure and end stage renal disease. Sent.2: If TB is a concern, a PPD is performed.</figDesc><table><row><cell>treat</cell><cell>prevent</cell><cell>diagnosis</cell><cell>cause</cell><cell>location</cell><cell>symptom</cell><cell>manifestation</cell><cell>contraindicate</cell><cell>associated with</cell><cell>side effect</cell><cell>is a</cell><cell>part of</cell><cell>other</cell><cell>none</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Example sentence with scores from the crowd dataset; training score calculated for negative/positive sentence-relation threshold equal to 0.5, and linear rescaling in the [−1, −0.85] interval for negative, [0.85, 1] for positive.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Model evaluation results for each dataset.</figDesc><table><row><cell></cell><cell cols="2">Annotation Quality</cell><cell cols="6">Classifier Performance</cell></row><row><cell></cell><cell>F1</cell><cell>F1'</cell><cell>F1</cell><cell>F1'</cell><cell>P</cell><cell>P'</cell><cell>R</cell><cell>R'</cell></row><row><cell>crowd@.5</cell><cell>0.838</cell><cell>0.933</cell><cell>0.642</cell><cell>0.687</cell><cell>0.565</cell><cell>0.632</cell><cell>0.743</cell><cell>0.754</cell></row><row><cell>crowd@.7</cell><cell>0.907</cell><cell>0.963</cell><cell>0.613</cell><cell>0.646</cell><cell>0.620</cell><cell>0.678</cell><cell>0.611</cell><cell>0.622</cell></row><row><cell>baseline</cell><cell>0.656</cell><cell>0.689</cell><cell>0.575</cell><cell>0.606</cell><cell>0.436</cell><cell>0.474</cell><cell>0.845</cell><cell>0.842</cell></row><row><cell>single</cell><cell>0.664</cell><cell>0.734</cell><cell>0.483</cell><cell>0.507</cell><cell>0.496</cell><cell>0.545</cell><cell>0.473</cell><cell>0.478</cell></row><row><cell>expert</cell><cell>0.844</cell><cell>0.861</cell><cell>0.638</cell><cell>0.658</cell><cell>0.672</cell><cell>0.711</cell><cell>0.605</cell><cell>0.616</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">http://crowdflower.com</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors would like to thank Chang Wang for support with using the medical relation extraction classifier, and Anthony Levas for help with collecting the expert annotations.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Aronson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AMIA Symposium</title>
				<meeting>the AMIA Symposium</meeting>
		<imprint>
			<publisher>American Medical Informatics Association</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page">17</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Crowd Truth: harnessing disagreement in crowdsourcing a relation extraction gold standard</title>
		<author>
			<persName><forename type="first">L</forename><surname>Aroyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Welty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Science</title>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>ACM</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The Three Sides of CrowdTruth</title>
		<author>
			<persName><forename type="first">L</forename><surname>Aroyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Welty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Human Computation</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="31" to="34" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Truth is a lie: Crowd truth and the seven myths of human annotation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Aroyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Welty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI Magazine</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="15" to="24" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The unified medical language system (UMLS): integrating biomedical terminology</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bodenreider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucleic acids research</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="D267" to="D270" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
	<note>suppl</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overcoming barriers to nlp for clinical text: the role of shared tasks and the need for additional creative solutions</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Chapman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Nadkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hirschman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">W</forename><surname>D'Avolio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Savova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Uzuner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="540" to="543" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Cheatham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hitzler</surname></persName>
		</author>
		<title level="m">Conference v2.0: An uncertain version of the OAEI Conference benchmark</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="33" to="48" />
		</imprint>
	</monogr>
	<note>The Semantic Web-ISWC 2014</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Building a persistent workforce on mechanical turk for multilingual data collection</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Dolan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 3rd Human Computation Workshop</title>
				<meeting>The 3rd Human Computation Workshop<address><addrLine>HCOMP</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Cascade: crowdsourcing taxonomy creation</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename><surname>Chilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Little</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Edge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Landay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</title>
				<meeting>the SIGCHI Conference on Human Factors in Computing Systems<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1999" to="2008" />
		</imprint>
	</monogr>
	<note>CHI &apos;13</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Achieving expert-level annotation quality with CrowdTruth: the case of medical relation extraction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dumitrache</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Aroyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Welty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 International Workshop on Biomedical Data Mining, Modeling, and Semantic Integration (BDM2I-2015), 14th International Semantic Web Conference</title>
				<meeting>the 2015 International Workshop on Biomedical Data Mining, Modeling, and Semantic Integration (BDM2I-2015), 14th International Semantic Web Conference</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Annotating named entities in Twitter data with crowdsourcing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Finin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Murnane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karandikar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Keller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Martineau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dredze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CSLDAMT &apos;10, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="80" to="88" />
		</imprint>
	</monogr>
	<note>Proc. NAACL HLT.</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Experiments with crowdsourced re-annotation of a POS tagging data set</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Søgaard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Short Papers</title>
		<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics<address><addrLine>Baltimore, Maryland</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2014-06">June 2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="377" to="382" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">On the definition of &quot;picture&quot;</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Q</forename><surname>Knowlton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AV Communication Review</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="157" to="183" />
			<date type="published" when="1966">1966</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Combining information extraction and human computing for crowdsourced knowledge acquisition</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Kondreddi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Triantafillou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th International Conference on Data Engineering</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="988" to="999" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Hybrid entity clustering using crowds and data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">R</forename><surname>Cha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Hwang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Wen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="711" to="726" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Note on the sampling error of the difference between correlated proportions or percentages</title>
		<author>
			<persName><forename type="first">Q</forename><surname>McNemar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychometrika</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="153" to="157" />
			<date type="published" when="1947">1947</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Distant supervision for relation extraction without labeled data</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mintz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bills</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Snow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="1003" to="1011" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">The meaning of meaning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Ogden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Richards</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1923">1923</date>
			<publisher>Trubner &amp; Co</publisher>
			<pubPlace>London</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Linguistically debatable or just plain wrong?</title>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Søgaard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Short Papers</title>
		<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics<address><addrLine>Baltimore, Maryland</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2014-06">June 2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="507" to="511" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Density estimation for statistics and data analysis</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Silverman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1986">1986</date>
			<publisher>CRC press</publisher>
			<biblScope unit="volume">26</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Medical relation extraction with manifold models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Annual Meeting of the ACL</title>
				<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="828" to="838" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
