<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Basic CRF approach to DIANN 2018 shared task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pol</forename><forename type="middle">Alvarez</forename><surname>Vecino</surname></persName>
						</author>
						<author role="corresp">
							<persName><forename type="first">Lluís</forename><surname>Padró</surname></persName>
							<email>padro@cs.upc.edu</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">TALP Research Center</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Universitat Politècnica de Catalunya</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Basic CRF approach to DIANN 2018 shared task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5C4A6DC0F7B09123089B8BE721FE07C3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Medical Named Entity Recognition</term>
					<term>CRF</term>
					<term>Disabilities</term>
					<term>Negation detection</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the participation of the UPC 2 system in the DIANN (Disability annotation on documents from the biomedical domain) shared task, framed in the IBEREVAL 2018 evaluation workshop<ref type="foot" target="#foot_0">1</ref>. The system tackles the detection of disabilities using a CRF to perform IOB Named Entity Recognition (NER). For the detection of negated disabilities, the out-of-the-box NegEx rule-based system is used.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This paper presents a simple approach to the disability detection shared task DIANN, proposed in the framework of IBEREVAL 2018 <ref type="bibr" target="#b1">[2]</ref>.</p><p>The task consists of identifying disabilities in biomedical research articles. The documents are abstracts or short descriptions, typically a few hundred words long, and use standard grammar and orthographic conventions. The goal is to detect where a disability is described or attributed to a patient. Thus, disability mentions that are negated or discarded in the text should be marked as "negated". Participation in Spanish is required, while English is optional.</p><p>We approach the task in two sequential stages: disability recognition and negation detection. The former is addressed with a classical NER approach: a CRF <ref type="bibr" target="#b2">[3]</ref> performing IOB annotation. The latter is solved using an out-of-the-box rule-based system, NegEx <ref type="bibr" target="#b0">[1]</ref>, which has been adapted for Spanish.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Disability recognition approach</head><p>The training data consists of short files with XML tags indicating disabilities and negation expressions. Our approach was to transform the input into a list of words and add to them the part of speech (PoS) and IOB information. The resulting elements were tuples of (word, POS, IOB-tag). Only the disabilities were considered when building the IOB information. The negation expressions were not used because they were predicted by another module which does not require IOB annotations.</p><p>The PoS tagging was done using NLTK <ref type="bibr" target="#b3">[4]</ref>. The IOB-tagging was performed manually using the entities inside the &lt;dis&gt; ... &lt;/dis&gt; XML tags.</p></div>
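<div xmlns="http://www.tei-c.org/ns/1.0"><p>The transformation described above can be illustrated with a minimal sketch. It is not the code used in the system: the tokenisation rule, the function name to_iob and the label names B-DIS, I-DIS and O are assumptions made for illustration, and any inline tag other than &lt;dis&gt; is simply skipped.</p>
<p><code lang="python"><![CDATA[
```python
import re

# Assumption: disabilities are wrapped in <dis>...</dis>; any other inline tag
# (for example negation markup) is skipped, since only disabilities get IOB tags.
PIECE = re.compile(r"</?\w+>|\w+|[^\w\s]")
TAG = re.compile(r"</?\w+>")


def to_iob(annotated_text):
    """Return (word, iob_tag) pairs for one annotated piece of text."""
    pairs = []
    inside = False  # True while we are between <dis> and </dis>
    first = False   # True until the first token of the current entity is seen
    for piece in PIECE.findall(annotated_text):
        if piece == "<dis>":
            inside, first = True, True
        elif piece == "</dis>":
            inside = False
        elif TAG.fullmatch(piece):
            continue  # some other tag: ignored in this sketch
        elif inside:
            pairs.append((piece, "B-DIS" if first else "I-DIS"))
            first = False
        else:
            pairs.append((piece, "O"))
    return pairs
```
]]></code></p>
<p>The PoS column of the tuples would be added afterwards by running a tagger over the extracted words.</p></div>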
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Basic Model</head><p>The basic model is a Conditional Random Field (CRF) applied to the IOB-tagged dataset. The implementation was built using NLTK's basic CRFTagger, a module used for POS tagging that relies on CRFSuite<ref type="foot" target="#foot_1">2</ref>.</p><p>The model uses a predefined entities list extracted from the training set, which contains one entity per line. It also uses an acronyms list, which is built by filtering all the single-word entities with all letters in uppercase.</p></div>
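<div xmlns="http://www.tei-c.org/ns/1.0"><p>A possible way to derive the two lists from the training mentions is sketched below. The function name build_lexicons is hypothetical, and whether the original lists were de-duplicated or case-normalised is not stated, so both choices here are assumptions.</p>
<p><code lang="python"><![CDATA[
```python
def build_lexicons(entity_mentions):
    """Build the entities list and the acronyms list from training mentions."""
    # One entry per distinct mention (de-duplication is an assumption).
    entities = sorted({m.strip() for m in entity_mentions if m.strip()})
    # Acronyms: single-word entities whose letters are all uppercase.
    acronyms = [e for e in entities if len(e.split()) == 1 and e.isupper()]
    return entities, acronyms
```
]]></code></p>
<p>For example, the mentions "hearing loss", "ADHD" and "dementia" would yield an entities list with three entries and an acronyms list containing only "ADHD".</p></div>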
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Features</head><p>The features used are grouped by similarity in order to ease the evaluation of their utility. The following list describes them:</p><p>1. word, pos, lemma, all-caps, strange-cap, contains-dash, contains-dot: the current word, its POS, its lemma, whether all the word's letters are uppercase, whether the word contains uppercase letters while the first is lowercase, and whether it contains a dash or a dot.</p><p>2. inside-entities, is-acronym: booleans indicating whether the word is found in the predefined entities list and in the acronyms list.</p><p>3. position-X, total-position-X: position-X is true if the word is found at position X inside an entity of the predefined entities list; total-position-X is the number of entities in the list in which the word appears.</p><p>4. prev-X-word, prev-X-pos, prev-X-lemma: the word appearing X positions to the left of the current word, its POS, and its lemma.</p><p>5. next-X-word, next-X-pos, next-X-inside-ente: the word appearing X positions to the right of the current word, its POS, and whether it is inside the entities list.</p><p>6. next1-word, next1-pos: concatenation of the current and next words, and of their parts of speech.</p><p>7. prev1-word, prev1-pos, prev1-lemma: concatenation of the previous and current words, of their parts of speech, and of their lemmas.</p><p>8. next2-word, next2-pos: concatenation of the two next words, and of their parts of speech.</p><p>9. prev2-word, prev2-pos: concatenation of the two previous words, and of their parts of speech.</p><p>The number of words used in the next-X and prev-X features is a tunable parameter. In the final execution, three preceding words and two following words were considered.</p></div>
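<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the feature groups more concrete, the following sketch computes a subset of them (groups 1, 2, 4, 5, 6 and 7, without lemmas) for one token. It is an illustration under assumptions rather than the system's feature function: the name word_features, the dictionary output, and the reading of inside-entities as "the word occurs as a token of some listed entity" are all choices made here.</p>
<p><code lang="python"><![CDATA[
```python
def word_features(sent, i, entities, acronyms, n_prev=3, n_next=2):
    """Features for token i of sent, a list of (word, pos) pairs."""
    word, pos = sent[i]
    feats = {
        # Group 1 (lemma omitted: it would require a lemmatiser).
        "word": word,
        "pos": pos,
        "all-caps": word.isupper(),
        # Lowercase first letter followed by some uppercase letter, e.g. "mRNA".
        "strange-cap": word[:1].islower() and any(c.isupper() for c in word[1:]),
        "contains-dash": "-" in word,
        "contains-dot": "." in word,
        # Group 2 (assumed reading: word is a token of some listed entity).
        "inside-entities": any(word in e.split() for e in entities),
        "is-acronym": word in acronyms,
    }
    # Groups 4 and 5: neighbouring words and their POS within the window.
    for k in range(1, n_prev + 1):
        if i - k >= 0:
            feats[f"prev-{k}-word"], feats[f"prev-{k}-pos"] = sent[i - k]
    for k in range(1, n_next + 1):
        if i + k < len(sent):
            feats[f"next-{k}-word"], feats[f"next-{k}-pos"] = sent[i + k]
    # Groups 6 and 7: current word concatenated with its direct neighbour.
    if i + 1 < len(sent):
        feats["next1-word"] = word + " " + sent[i + 1][0]
        feats["next1-pos"] = pos + " " + sent[i + 1][1]
    if i > 0:
        feats["prev1-word"] = sent[i - 1][0] + " " + word
        feats["prev1-pos"] = sent[i - 1][1] + " " + pos
    return feats
```
]]></code></p>
<p>The default window sizes mirror the final configuration reported above (three preceding words and two following words). To plug such features into NLTK's CRFTagger, they would have to be adapted to the interface of its feature_func argument, which receives the token list and an index and returns a list of feature strings.</p></div>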
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Training</head><p>The final features were chosen using 10-fold cross-validation. For each fold, the entities and acronyms lists are built using only the nine training chunks, and the model is trained to predict the remaining validation chunk. After averaging the F1 scores over all folds, the features corresponding to the best average (for the two languages) were used to train a model on the whole training dataset.</p><p>For each group of features described in the previous section, the whole group was deactivated to check whether it affected the precision.</p><p>Initially, groups 4-9 contained all the features of groups 1-3 applied to their elements (i.e. the features applied to the prev-X-word, or concatenating them in the case of next1-feature). Experiments were performed deactivating one group at a time and checking the impact on performance. This allowed us to remove non-useful feature groups, leaving only those groups with an actual contribution to the task. Once the useful feature groups were chosen, a more fine-grained inspection was carried out to remove useless features inside each group, resulting in the final feature groups reported above.</p></div>
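<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cross-validation procedure can be sketched as follows. The helper names k_fold_splits and cross_validate are hypothetical, as is the train_and_score callback, which stands for "build the lists from the training chunks, train the CRF, and return the F1 score on the validation chunk". Splitting into contiguous chunks is also an assumption.</p>
<p><code lang="python"><![CDATA[
```python
def k_fold_splits(items, k=10):
    """Yield (train, validation) lists using k contiguous chunks."""
    n = len(items)
    bounds = [j * n // k for j in range(k + 1)]
    for j in range(k):
        lo, hi = bounds[j], bounds[j + 1]
        # The validation chunk is held out; the other chunks form the training set.
        yield items[:lo] + items[hi:], items[lo:hi]


def cross_validate(docs, train_and_score, k=10):
    """Average the score returned by train_and_score over the k folds.

    train_and_score(train, validation) must derive everything it needs,
    including the entities and acronyms lists, from `train` only.
    """
    scores = [train_and_score(tr, va) for tr, va in k_fold_splits(docs, k)]
    return sum(scores) / len(scores)
```
]]></code></p>
<p>Deactivating a feature group then amounts to calling cross_validate once per configuration and comparing the averaged scores.</p></div>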
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Negation detection approach</head><p>The negations were predicted using an out-of-the-box NegEx implementation<ref type="foot" target="#foot_2">3</ref>. After tagging the entities, each sentence and the entity it contains are passed to NegEx, which marks whether the entity is negated and which set of words negates it (if no entity is present, the sentence is not fed to NegEx). We observed that almost all the correct negations were close to the entity, so negation expressions that were more than three words away from the entity were discarded.</p></div>
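<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distance filter can be sketched as below. How the distance was measured is not specified, so this sketch assumes token spans of the form (start, end) with an exclusive end, and counts the tokens lying between the negation expression and the entity. The function names and the max_gap parameter are hypothetical, and the call to NegEx itself is not shown.</p>
<p><code lang="python"><![CDATA[
```python
def trigger_distance(trigger_span, entity_span):
    """Number of tokens between two (start, end) token spans, end exclusive."""
    (ts, te), (es, ee) = trigger_span, entity_span
    if te <= es:   # negation expression precedes the entity
        return es - te
    if ee <= ts:   # negation expression follows the entity
        return ts - ee
    return 0       # the spans overlap


def keep_negation(trigger_span, entity_span, max_gap=3):
    """Keep a negation only if it is at most max_gap tokens from the entity."""
    return trigger_distance(trigger_span, entity_span) <= max_gap
```
]]></code></p>
<p>In "The patient has no signs of hearing loss", the expression "no" (token 3) is separated from the entity "hearing loss" (tokens 6 and 7) by two tokens, so under these assumptions the negation would be kept.</p></div>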
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments and Results</head><p>The experiments consisted of predicting the whole training dataset using ten-fold cross-validation. The best model was then used to annotate the test set. Table <ref type="table" target="#tab_0">1</ref> shows the results of some experiments varying the feature set used. Using all the features gives the best results. All results are computed using the evaluation tool provided by the DIANN organizers<ref type="foot" target="#foot_3">4</ref>.</p><p>Table <ref type="table" target="#tab_1">2</ref> reports the final scores obtained on the official test set. The fields evaluated here are: disability, which covers all disability annotations, whether or not they are included in a negation; negated disability, which considers all the negation-related annotations (disability, negation trigger, and scope); and non-negated disability + negated disability, which jointly evaluates the annotation of disabilities and negation (a negated disability is considered correct if both the negation and the disability are correct). For all categories, both partial and exact results are provided.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>We have presented a simple CRF approach to disability detection in medical texts. The system produces average results, ranking in the middle of the table for most metrics.</p><p>We consider that the presented approach has room for improvement, since the features used are a basic set that could be extended with more advanced semantic information such as word embeddings.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Results of cross-validation experiments deactivating one feature group at a time</figDesc><table>
<row><cell/><cell cols="3">Spanish</cell><cell cols="3">English</cell></row>
<row><cell/><cell>Precision</cell><cell>Recall</cell><cell>F1 score</cell><cell>Precision</cell><cell>Recall</cell><cell>F1 score</cell></row>
<row><cell cols="7">Group disabled: 2</cell></row>
<row><cell>Negation</cell><cell>0.50</cell><cell>0.55</cell><cell>0.52</cell><cell>0.46</cell><cell>0.35</cell><cell>0.40</cell></row>
<row><cell>Disability</cell><cell>0.72</cell><cell>0.63</cell><cell>0.68</cell><cell>0.72</cell><cell>0.58</cell><cell>0.64</cell></row>
<row><cell cols="7">Group disabled: 3</cell></row>
<row><cell>Negation</cell><cell>0.50</cell><cell>0.50</cell><cell>0.50</cell><cell>0.48</cell><cell>0.35</cell><cell>0.41</cell></row>
<row><cell>Disability</cell><cell>0.73</cell><cell>0.51</cell><cell>0.60</cell><cell>0.75</cell><cell>0.56</cell><cell>0.64</cell></row>
<row><cell cols="7">Group disabled: 4</cell></row>
<row><cell>Negation</cell><cell>0.51</cell><cell>0.58</cell><cell>0.54</cell><cell>0.48</cell><cell>0.40</cell><cell>0.44</cell></row>
<row><cell>Disability</cell><cell>0.74</cell><cell>0.59</cell><cell>0.65</cell><cell>0.74</cell><cell>0.65</cell><cell>0.70</cell></row>
<row><cell cols="7">Group disabled: 5</cell></row>
<row><cell>Negation</cell><cell>0.51</cell><cell>0.55</cell><cell>0.53</cell><cell>0.48</cell><cell>0.38</cell><cell>0.42</cell></row>
<row><cell>Disability</cell><cell>0.71</cell><cell>0.59</cell><cell>0.64</cell><cell>0.75</cell><cell>0.65</cell><cell>0.69</cell></row>
<row><cell cols="7">Group disabled: 6</cell></row>
<row><cell>Negation</cell><cell>0.49</cell><cell>0.53</cell><cell>0.51</cell><cell>0.48</cell><cell>0.38</cell><cell>0.42</cell></row>
<row><cell>Disability</cell><cell>0.72</cell><cell>0.59</cell><cell>0.65</cell><cell>0.73</cell><cell>0.63</cell><cell>0.68</cell></row>
<row><cell cols="7">Group disabled: 7</cell></row>
<row><cell>Negation</cell><cell>0.50</cell><cell>0.55</cell><cell>0.52</cell><cell>0.47</cell><cell>0.35</cell><cell>0.40</cell></row>
<row><cell>Disability</cell><cell>0.72</cell><cell>0.60</cell><cell>0.65</cell><cell>0.73</cell><cell>0.64</cell><cell>0.68</cell></row>
<row><cell cols="7">Group disabled: 8</cell></row>
<row><cell>Negation</cell><cell>0.49</cell><cell>0.53</cell><cell>0.51</cell><cell>0.47</cell><cell>0.35</cell><cell>0.40</cell></row>
<row><cell>Disability</cell><cell>0.72</cell><cell>0.69</cell><cell>0.65</cell><cell>0.75</cell><cell>0.65</cell><cell>0.70</cell></row>
<row><cell cols="7">Group disabled: 9</cell></row>
<row><cell>Negation</cell><cell>0.47</cell><cell>0.50</cell><cell>0.48</cell><cell>0.48</cell><cell>0.38</cell><cell>0.42</cell></row>
<row><cell>Disability</cell><cell>0.71</cell><cell>0.59</cell><cell>0.64</cell><cell>0.74</cell><cell>0.65</cell><cell>0.69</cell></row>
<row><cell cols="7">Group disabled: None</cell></row>
<row><cell>Negation</cell><cell>0.52</cell><cell>0.55</cell><cell>0.53</cell><cell>0.47</cell><cell>0.41</cell><cell>0.43</cell></row>
<row><cell>Disability</cell><cell>0.74</cell><cell>0.62</cell><cell>0.68</cell><cell>0.75</cell><cell>0.67</cell><cell>0.71</cell></row>
</table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Final testing results with the full-featured model.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://nlp.uned.es/diann</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://pypi.python.org/pypi/python-crfsuite Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://github.com/mongoose54/negex</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://github.com/diannibereval2018/evaluation</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This research has been partially funded by Spanish Government through Graph-Med project TIN2016-77820-C3-3-R.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">A simple algorithm for identifying negated findings and diseases in discharge summaries</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Chapman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Bridewell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hanbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">F</forename><surname>Cooper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">G</forename><surname>Buchanan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="301" to="310" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of the DIANN task: Disability annotation at IberEval</title>
		<author>
			<persName><forename type="first">H</forename><surname>Fabregat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Martinez-Romo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Araujo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages</title>
				<meeting>the Workshop on Evaluation of Human Language Technologies for Iberian Languages<address><addrLine>IberEval</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C N</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML &apos;01 Proceedings of the Eighteenth International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2001-06">June 2001</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">NLTK: The Natural Language Toolkit</title>
		<author>
			<persName><forename type="first">E</forename><surname>Loper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bird</surname></persName>
		</author>
		<idno type="DOI">10.3115/1118108.1118117</idno>
		<ptr target="https://doi.org/10.3115/1118108.1118117" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics</title>
				<meeting>the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="63" to="70" />
		</imprint>
	</monogr>
	<note>ETMTNLP &apos;02</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
