Detecting Uncertainty in Biomedical Literature: A Simple Disambiguation Approach Using Sparse Random Indexing

Erik Velldal
Department of Informatics, University of Oslo, Norway
erikve@ifi.uio.no

Abstract

This paper presents a novel approach to the problem of hedge detection, which involves the identification of so-called hedge cues for labeling sentences as certain or uncertain. This is the classification problem for Task 1 of the CoNLL-2010 Shared Task, which focuses on hedging in biomedical literature. We here propose to view hedge detection as a simple disambiguation problem, restricted to words that have previously been observed as hedge cues. Applying an SVM classifier, the approach achieves the best published results so far for sentence-level uncertainty prediction on the Shared Task test data. We also show that the technique of random indexing can be successfully applied for compressing the dimensionality of the original feature space by several orders of magnitude, while at the same time yielding better classifier performance.

1 Introduction

The problem of hedge detection refers to the task of identifying uncertainty or speculation in text. Being the topic of several recent shared tasks and dedicated workshops,¹ this is a problem that is receiving increasing interest within the fields of NLP and biomedical text mining. In terms of practical motivation, hedge detection is particularly useful in relation to information extraction tasks, where the ability to distinguish between factual and uncertain information can be of vital importance.

The topic of the Shared Task at the 2010 Conference for Natural Language Learning (CoNLL) is hedge detection in the domain of biomedical research literature (Farkas et al., 2010). The task is defined for two levels of analysis: while Task 1 is described as learning to detect sentences containing uncertainty, the object of Task 2 is learning to resolve the in-sentence scope of hedge cues. The focus of the present paper is only on Task 1.

A hedge cue is here taken to mean the words or phrases that signal the attitude of uncertainty or speculation.² Examples 1-4 in Figure 1, taken from the BioScope corpus (Vincze et al., 2008), illustrate how cue words are annotated in the Shared Task training data. Moreover, the training data also annotates an entire sentence as uncertain if it contains a hedge cue, and it is the prediction of this sentence labeling that is required for Task 1.

The approach presented in this paper extends that of Velldal et al. (2010), where a maximum entropy (MaxEnt) classifier is applied to automatically detect cue words, subsequently labeling sentences as uncertain if they are found to contain a cue. Furthermore, in the system of Velldal et al. (2010), the resolution of the in-sentence scopes of identified cues, as required for Task 2, is determined by a set of manually crafted rules operating on dependency representations. Readers interested in more details on the set of rules used for solving Task 2 are referred to Øvrelid et al. (2010b). The focus of the present paper, however, is to present a new and simplified approach to the classification problem relevant for solving Task 1, and also partially Task 2, viz. the identification of hedge cues.

¹ Hedge detection played a central part in the shared tasks of both BioNLP 2009 and CoNLL 2010, as well as in the NeSp-NLP 2010 workshop (Negation and Speculation in NLP).
² As noted by Farkas et al. (2010), most hedge cues typically fall in the following categories: auxiliaries (may, might, could, etc.), verbs of hedging or verbs with speculative content (suggest, suspect, indicate, suppose, seem, appear, etc.), adjectives or adverbs (probable, likely, possible, unsure, etc.), or conjunctions (either...or, etc.).
(1) {ROI ⟨appear⟩ to serve as messengers mediating {directly ⟨or⟩ indirectly} the release of the inhibitory subunit I kappa B from NF-kappa B}.
(2) {The specific role of the chromodomain is ⟨unknown⟩} but chromodomain swapping experiments in Drosophila {⟨suggest⟩ that they {⟨might⟩ be protein interaction modules}} [18].
(3) These data {⟨indicate that⟩ IL-10 and IL-4 inhibit cytokine production by different mechanisms}.
(4) Whereas a background set of promoter regions is easy to identify, it is {⟨not clear⟩ how to define a reasonable genomic sample of enhancers}.

Figure 1: Examples of hedged sentences in the BioScope corpus. Hedge cues are here shown using angle brackets, with braces corresponding to their annotated scopes.

2 Overview

In Section 5 we first develop a Support Vector Machine (SVM) token classifier for identifying cue words. For a given sentence, the classifier considers each word in turn, labeling it as a cue or a non-cue. We will refer to this mode of cue classification as performing word-by-word classification (WbW). Later, in Section 6, we go on to show how better results can be obtained by instead approaching the task as a disambiguation problem, restricting our attention to only those tokens whose base forms have previously been observed as hedge cues. Reformulating the problem in this way simplifies the classification task tremendously, reducing the number of examples that need to be considered, and thereby also trimming down the relevant feature space to a much more manageable size. At the same time, the resulting classifier achieves the best published results so far on the Shared Task data (to the best of our knowledge).

Additionally, in Section 7 we show how the very large input feature space can be further compressed using random indexing. This is essentially a dimension reduction technique based on sparse random projections, which we here apply for feature extraction. We show that training the classifier on the reduced feature space yields better performance than when using the original input space.

The evaluation measures and feature templates are detailed in Sections 5.2 and 5.3, respectively. Note that, while preliminary results for all models are presented for the development data throughout the paper, the performance of all models is ultimately compared on the official Shared Task held-out data in Section 8. We start, however, by providing a brief overview of related work in Section 3, and then describe the relevant data sets and preprocessing steps in Section 4.

3 Related Work

The top-ranked system for Task 1 in the official CoNLL 2010 Shared Task evaluation, described by Tang et al. (2010), approaches cue identification as a sequence labeling problem. Similarly to Morante and Daelemans (2009), Tang et al. (2010) set out to label tokens according to a BIO scheme, i.e. indicating whether they are at the Beginning, Inside, or Outside of a hedge cue. Tang et al. (2010) train both a Conditional Random Field (CRF) sequence classifier and an SVM-based Hidden Markov Model (HMM), finally combining the predictions of both models in a second CRF.
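As a concrete illustration of the BIO encoding just described (this is our own minimal sketch, not code from Tang et al. (2010) or Morante and Daelemans (2009); the tag names and token indices are illustrative), the multi-word cue indicate that from Example 3 would be tagged as follows:

```python
# Minimal illustration of BIO-tagging a hedge cue (tag names are illustrative).
tokens = ["These", "data", "indicate", "that", "IL-10", "and", "IL-4", "inhibit",
          "cytokine", "production", "by", "different", "mechanisms", "."]
cue_spans = [(2, 4)]  # the multi-word cue "indicate that" covers token positions 2-3

tags = ["O"] * len(tokens)
for start, end in cue_spans:
    tags[start] = "B-CUE"                 # Beginning of a cue
    for i in range(start + 1, end):
        tags[i] = "I-CUE"                 # Inside the same cue
print(list(zip(tokens, tags))[:5])
# [('These', 'O'), ('data', 'O'), ('indicate', 'B-CUE'), ('that', 'I-CUE'), ('IL-10', 'O')]
```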
In terms of the overall approach, i.e. viewing the problem as a sequence labeling task, Tang et al. (2010) are actually representative of the majority of the ST participants for Task 1 (Farkas et al., 2010), including the top three performers on the official held-out data. As noted by Farkas et al. (2010), the remaining systems approached the task either as a WbW token classification problem or directly as a sentence classification problem. Examples of the former are the systems of Velldal et al. (2010) and Vlachos and Craven (2010), sharing the 4th rank position (out of 24 submitted systems) for Task 1.

In both the sequence labeling and token classification approaches, a sentence is labeled as uncertain if it contains a word labeled as a cue. In contrast, the sentence classification approaches instead try to label sentences directly, typically using Bag-of-Words (BoW) features. In terms of the official Task 1 evaluation, the sentence classifiers tended to achieve a somewhat lower relative rank.

4 Data Sets and Preprocessing

The training data for the CoNLL 2010 Shared Task is taken from the BioScope corpus (Vincze et al., 2008) and consists of 14,541 sentences (or other root-level utterances) from biomedical abstracts and articles. Some basic descriptive statistics for the data sets are provided in Table 1. We see that roughly 18% of the sentences are annotated as uncertain. The BioScope corpus also provides annotation for hedge cues as well as their scope. Out of a total of 378,213 tokens, 3,838 are annotated as being part of a hedge cue. As can be seen, the total number of cues is somewhat lower (3,327), due to the fact that some tokens are part of the same cue, so-called multi-word cues (448 in total), such as indicate that in Example 3.

Data Set             Sentences   Hedged Sentences   Cues    Multi-Word Cues   Tokens    Cue Tokens
Training Abstracts   11,871      2,101              2,659   364               309,634   3,056
Training Articles    2,670       519                668     84                68,579    782
Training Total       14,541      2,620              3,327   448               378,213   3,838
Held-Out             5,003       790                1,033   87                138,276   1,148

Table 1: The Shared Task data sets. The top three rows list the properties of the training data, separately detailing its two components, biomedical abstracts and full articles. The bottom row summarizes the official held-out test data (articles only). Token counts are based on the tokenizer described in Section 4.1.

For evaluation purposes, the task organizers provided newly annotated biomedical articles, comprising 5,003 additional utterances, of which 790 are annotated as hedged (see the overview in Table 1). The data contains a total of 1,033 cues, of which 87 are multi-word cues spanning multiple tokens, comprising 1,148 cue tokens altogether.

4.1 Tokenization

The GENIA tagger (Tsuruoka et al., 2005) plays an important role in our preprocessing set-up, as it is specifically tuned for biomedical text. Nevertheless, its rules for tokenization appear to not always be optimally adapted for the BioScope corpus. (For example, GENIA unconditionally introduces token boundaries for some punctuation marks that can also occur token-internally.) Our preprocessing pipeline therefore deploys a home-grown, cascaded finite-state tokenizer (adapted from the open-source English Resource Grammar; Flickinger (2000)), which aims to implement the tokenization decisions made in the Penn Treebank (Marcus et al., 1993), much like GENIA in principle, but properly treating certain corner cases found in the BioScope data.

4.2 PoS Tagging and Lemmatization

For part-of-speech (PoS) tagging and lemmatization, we combine GENIA and TnT (Brants, 2000), which operates on pre-tokenized inputs but in its default model is trained on financial news from the Penn Treebank. Our general goal here is to take advantage of the higher PoS accuracy provided by GENIA in the biomedical domain, while using our improved tokenization.

For the vast majority of tokens, we use GENIA PoS tags and base forms (i.e. lemmas). However, GENIA does not make a PoS distinction between proper and common nouns, as in the Penn Treebank, and hence we give precedence to TnT outputs for tokens tagged as nominal by both taggers.

5 Hedge Cue Classification

This section develops a binary cue classifier similar to that of Velldal et al. (2010), but using the framework of large-margin SVM classification (Vapnik, 1995) instead of MaxEnt. For a given sentence, the word-by-word classifier (referred to as C_WbW) considers each token in turn, labeling it as a cue or non-cue. Any sentence found to contain a cue is subsequently labeled as uncertain.
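To make the decision rule explicit, the following is a minimal sketch (ours, with a stand-in for the trained SVM) of how token-level cue predictions are turned into the sentence-level labels required for Task 1:

```python
# Sketch of the word-by-word (WbW) scheme: a token classifier marks each token as
# cue or non-cue, and a sentence is labeled uncertain if any token is a cue.
# `token_is_cue` stands in for the trained SVM classifier and is assumed here.
def label_sentence(tokens, token_is_cue):
    cue_flags = [token_is_cue(tok) for tok in tokens]
    return ("uncertain" if any(cue_flags) else "certain"), cue_flags

# Toy usage with a stand-in predictor that only knows two cue words:
toy_is_cue = lambda tok: tok.lower() in {"might", "suggest"}
print(label_sentence(["These", "data", "might", "be", "modules"], toy_is_cue))
# ('uncertain', [False, False, True, False, False])
```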
5.1 Defining the Training Instances

As annotated in the training data, it is possible for a hedge cue to span multiple tokens, e.g. as in whether or not. The majority of the multi-word cues in the training data are very infrequent, however, most occurring only once, and the classifier itself is not sensitive to the notion of multi-word cues. A given word token is considered a cue as long as it falls within the span of a cue annotation.

As presented to the learner, a given token w_i is represented as a feature vector f(w_i) = f_i ∈ ℜ^d. Each dimension f_ij represents a feature function which can encode arbitrary properties of w_i. Section 5.3 describes the particular features we are using. Each training example can be thought of as a pair of a feature vector and a label, ⟨f_i, y_i⟩. If w_i is a cue we have y_i = +1, while for non-cues the label is y_i = −1. For estimating the actual SVM classifier used for predicting the labels of unseen examples, we use the SVMlight toolkit (Joachims, 1999).

5.2 Evaluation Measures

We will be reporting precision, recall and F1 for two different levels of evaluation: the sentence level and the token level. While the token-level scores indicate how well the classifiers succeed in identifying individual cue words, the sentence-level scores are what actually correspond to Task 1, i.e. correctly identifying sentences as being certain or uncertain.
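The two levels of evaluation can be made concrete with the following sketch (our own helper functions, not the official Shared Task scorer), where gold and predicted labels are Boolean cue indicators per token:

```python
# Sketch of token-level vs. sentence-level precision/recall/F1.
def precision_recall_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def evaluate(gold_sentences, pred_sentences):
    # Token level: every token is an instance (cue vs. non-cue).
    gold_tokens = [t for sent in gold_sentences for t in sent]
    pred_tokens = [t for sent in pred_sentences for t in sent]
    # Sentence level: a sentence counts as uncertain if it contains any cue token.
    gold_sents = [any(sent) for sent in gold_sentences]
    pred_sents = [any(sent) for sent in pred_sentences]
    return {"token": precision_recall_f1(gold_tokens, pred_tokens),
            "sentence": precision_recall_f1(gold_sents, pred_sents)}
```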
5.3 Feature Templates

In the Shared Task system description paper of Velldal et al. (2010), results are reported for MaxEnt cue classifiers using a wide variety of feature types of both surface-oriented and syntactic nature. For the latter, Velldal et al. (2010) define a range of syntactic and dependency-based features extracted from parses produced by the MaltParser (Nivre et al., 2006; Øvrelid et al., 2010a) and the XLE (Crouch et al., 2008), recording information about dependency relations, subcategorization frames, etc. However, it turned out that the simpler lexical and surface-oriented features were sufficient for the identification of hedge cues.

Drawing on this observation, the classifiers trained in this paper are only based on simple sequence-oriented n-gram features collected for PoS tags, lemmas and surface forms. For all these types of features we record neighbors up to 3 positions left/right of the focus word. For increased generality, all these n-gram features also include non-lexicalized variants, i.e. variants excluding the focus word itself.

5.4 Preliminary Results

Instantiating all the feature templates described above for the BioScope training data, using the maximal span for all n-grams (n=4, i.e. including up to 3 neighbors), we end up with a total of more than 6,500,000 unique feature types. However, after testing different feature configurations, it turns out that the best performing model only uses a small subset of this feature pool. The configuration we will be using throughout this paper includes: n-grams over base forms within ±3 positions of the focus word; n-grams over surface forms up to +2 positions only; and the PoS of the focus word. This results in a set of roughly 2,630,000 feature types. In addition to reporting classifier performance for this feature configuration, we also provide results for a baseline model using only unigram features over surface forms. The behavior of this classifier is similar to what we would expect from simply compiling a list of cue words from the training data, based on the majority usage of each word as cue or non-cue.

As shown in Table 2, after averaging results from 10-fold cross-validation on the training data, the baseline cue classifier (shown as C_WbW^Uni) achieves a sentence-level F1 of 88.69 and a token-level F1 of 79.59. In comparison, the classifier using all the available n-gram features (C_WbW) achieves F-scores of 91.19 and 87.80 on the sentence level and token level, respectively. The improvement in performance compared to the baseline is most pronounced on the token level, but the differences in scores for both levels are found to be statistically significant at p < 0.005 using a two-tailed sign-test.
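The feature configuration described above can be sketched as follows. This is our own simplified reading of the templates; the exact template instantiation in the system may differ, and the non-lexicalized variants are omitted here.

```python
# Rough sketch of the n-gram feature templates from Sections 5.3-5.4: lemma n-grams
# within +/-3 positions of the focus word (maximal span n=4), surface-form n-grams
# up to +2 positions to the right, and the PoS of the focus word.
def cue_features(sentence, i):
    """sentence: list of (form, lemma, pos) triples; i: index of the focus token."""
    pad = ("<pad>", "<pad>", "<pad>")
    padded = [pad] * 3 + list(sentence) + [pad] * 3
    j = i + 3                                  # focus position in the padded list
    feats = {"pos=" + padded[j][2]}
    for left in range(-3, 1):                  # lemma n-grams around the focus word
        for right in range(0, 4):
            if right - left + 1 > 4:           # cap the span at n=4 tokens
                continue
            gram = "_".join(padded[j + k][1] for k in range(left, right + 1))
            feats.add("lemma[%d:%d]=%s" % (left, right, gram))
    for right in range(1, 3):                  # surface-form n-grams up to +2
        gram = "_".join(padded[j + k][0] for k in range(0, right + 1))
        feats.add("form[0:%d]=%s" % (right, gram))
    return feats

sent = [("data", "data", "NNS"), ("suggest", "suggest", "VBP"), ("that", "that", "IN")]
print(sorted(cue_features(sent, 1))[:3])       # a few of the generated feature strings
```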
6 Reformulating the Classification Problem

An error analysis of our initial WbW classifier revealed that it is not able to generalize to new hedge cues beyond those that have already been observed during training. Even after adding the non-lexicalized variants of all feature types (i.e. making features more general by not including the focus word itself), the classifier still fails to identify any unseen hedge cues whose base form did not occur as a cue in the training material. On the other hand, only very few of the test cues are actually unseen (≈ 1.5%), meaning that the set of cue words might reasonably be treated as a near-closed class (at least for the biomedical data considered in this study). As a consequence of these observations, we here reformulate the problem as follows. Instead of approaching the task as a classification problem defined for all words, we only consider words that have a base form observed as a hedge cue in the training material. In effect, any word whose base form has never been observed as a cue in the training data is automatically considered to be a non-cue when testing. Part of the rationale here is that, while it seems reasonable to assume that any word occurring as a cue can also occur as a non-cue, the converse is less likely.

While the training data contains a total of approximately 17,600 unique base forms (given the preprocessing outlined in Section 4), only 143 of these ever occur as hedge cues. By restricting the classifier to only this subset of words, we manage to simplify the classification problem tremendously, but without any loss in performance.

Note that, although we view the task as a disambiguation problem, it is not feasible to train separate classifiers for each individual base form. The frequency distribution of the cue words in the training material is very skewed, with most cues being very rare, many occurring as a cue only once (≈ 40%). (Note that most of these words also have many additional occurrences in the training data as non-cues, however.) For the majority of the cue words, then, it seems we cannot hope to gather enough reliable information to train individual classifiers. Instead, we want to be able to draw on information from the more frequently occurring cues also when classifying or disambiguating the less frequent ones. Consequently, we still train a single global classifier as for the original WbW set-up. However, as the disambiguation classifier only needs to consider a small subset of the words considered by the full WbW classifier, the number of instantiated feature types is, of course, greatly reduced.

For the full WbW classification, the number of training examples is 378,213. Using the feature configuration described in Section 5.4, this generates a total of roughly 2,630,000 feature types. For the disambiguation model, using the same feature configuration, the number of instantiated feature types is reduced to just below 670,000, as generated for 94,155 training examples.

Running the new disambiguation classifier by 10-fold cross-validation on the training data, we find that it has substantially better recall than the original WbW classifier. The results are shown in the row C_Disamb in Table 2. Across all levels of evaluation, the C_Disamb model achieves a boost in F1 compared to C_WbW. However, when applying a two-tailed sign-test, considering differences in classifier decisions on both the sentence level and the token level, only the latter differences are found to be significant (at p < 0.005).

                 Sentence Level            Token Level
Model            Prec    Rec     F1        Prec    Rec     F1
C_WbW^Uni        91.01   86.53   88.69     90.60   71.03   79.59
C_WbW            94.31   88.30   91.19     94.67   81.89   87.80
C_Disamb         93.64   89.68   91.60     94.01   83.55   88.45
C_Disamb^RI      93.78   88.45   91.03     94.05   81.97   87.58

Table 2: 10-fold cross-validation on the biomedical abstracts and articles in the training data.
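The reformulation in this section can be summarized by the following sketch (ours, under simplifying assumptions; the lexicon construction and the stand-in classify function are illustrative): at test time, only tokens whose base form occurred as a cue in the training data are passed to the global classifier, and all other tokens default to non-cue.

```python
# Sketch of the disambiguation set-up: build the (near-closed) set of cue base
# forms from the training data, and restrict classification to those tokens.
def collect_cue_lemmas(training_sentences):
    """training_sentences: lists of (lemma, is_cue) pairs from the annotated data."""
    return {lemma for sent in training_sentences for lemma, is_cue in sent if is_cue}

def predict_cue_flags(sentence_lemmas, cue_lemmas, classify):
    """classify(i) stands in for the single global SVM applied to token position i."""
    return [(lemma in cue_lemmas) and classify(i)
            for i, lemma in enumerate(sentence_lemmas)]

train = [[("these", False), ("data", False), ("suggest", True)],
         [("results", False), ("may", True), ("vary", False)]]
cue_lemmas = collect_cue_lemmas(train)          # {'suggest', 'may'}
flags = predict_cue_flags(["it", "may", "work"], cue_lemmas, classify=lambda i: True)
print(flags)                                    # [False, True, False]
```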
7 Sparse Random Indexing

As mentioned in Section 5.1, each training example is represented by a d-dimensional feature vector f_i ∈ ℜ^d. Given n examples and d features, the feature vectors can be thought of as rows in a matrix F ∈ ℜ^(n×d). One potential problem with using a vector-based numerical encoding of local context features such as those described in Section 5.3 is that the dimensionality of the feature space grows very rapidly with the number of training examples. Using local features, e.g. context windows recording properties such as direction and distance, the number of unique features grows much faster than when using, say, BoW features. In order to make the vector encoding scalable, we would like to somehow be able to put a bound on the number of dimensions.

As mentioned above, even after simplifying the classification problem, our input feature space is still rather huge, totaling roughly 670,000 feature types. Given that the number of training examples is only around n ≈ 95,000, we have d ≫ n, and whenever we want to add more feature templates or more training data, this imbalance will only become more pronounced. It is also likely that many of the n-gram features in our model will not be relevant for the classification of new data points. The combination of many irrelevant features and few training examples relative to the number of features makes the learner prone to overfitting.

In previous attempts to reduce the feature space, we have applied several feature selection schemes, such as filtering on the correlation coefficient between a feature and a class label, or using simple frequency cutoffs. Although such methods are effective in reducing the number of features, they typically do so at the expense of classifier performance. Due to both data sparseness and the likelihood of many features being only locally relevant, it is difficult to reliably assess the relevance of the input features, and we risk filtering out many relevant features as well. Using simple filtering methods, we did not manage to considerably reduce the number of features without also significantly reducing the performance of the classifier. Although better results can be expected by using so-called wrapper methods (Guyon and Elisseeff, 2003) instead, this is not computationally feasible for large feature sets.

As an alternative to such feature selection methods, we here report on experiments with a technique known as random indexing (RI). This allows us to drastically compress the feature space without explicitly throwing out any features.

The technique of random indexing was initially introduced by Kanerva et al. (2000) for modeling the semantic similarity of words by their distribution in text.³ RI actually forms part of a larger family of dimension reduction techniques based on random projections. Such methods typically work by multiplying the feature matrix F ∈ ℜ^(n×d) by a random matrix R ∈ ℜ^(d×k), where k ≪ d, thereby reducing the number of dimensions from d to k:

    G = F R ∈ ℜ^(n×k), with k ≪ d    (5)

Given that k is sufficiently high, the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) tells us that the pairwise distances (and thereby separability) in F can be preserved with high probability within the lower-dimensional space G (Li et al., 2006). While the only condition on the entries of R is that they are i.i.d. with zero mean, they are typically also specified to have unit variance (Li et al., 2006).

³ Readers are referred to Sahlgren (2005) for a good introduction to random indexing.
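A direct implementation of Equation (5) might look as follows. This is a sketch assuming NumPy; the sparse {−1, 0, +1} recipe for R and the non-zero ratio are illustrative choices (variance scaling is omitted), and the initialization actually used in our experiments is discussed in Section 7.1.

```python
# Sketch of the matrix view of random projection, G = F R (Equation 5), with a
# sparse random matrix R whose entries are mostly 0 and otherwise +1 or -1.
import numpy as np

def random_projection(F, k, nonzero_ratio=0.01, seed=0):
    """Project an n-by-d feature matrix F down to n-by-k dimensions."""
    rng = np.random.default_rng(seed)
    d = F.shape[1]
    R = np.zeros((d, k))
    mask = rng.random((d, k)) < nonzero_ratio      # positions of non-zero entries
    R[mask] = rng.choice([-1.0, 1.0], size=int(mask.sum()))
    return F @ R                                   # G has shape (n, k)

# Toy usage: 5 examples with 1,000 sparse binary features projected to 50 dimensions.
F = (np.random.default_rng(1).random((5, 1000)) < 0.01).astype(float)
print(random_projection(F, k=50).shape)            # (5, 50)
```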
One advantage of the particular random indexing approach is that the full n × d feature matrix F does not need to be explicitly computed. The method constructs the representation of the data in G by incrementally accumulating so-called index vectors assigned to each of the d features (Sahlgren and Karlgren, 2005). The process can be described by the following two simple steps:

- When a new feature is instantiated, it is assigned a randomly generated vector of a fixed dimensionality k, consisting of a small number of −1s and +1s (the remaining elements being 0). This is then the so-called index vector of the feature. (The index vector of the ith feature corresponds to the ith row of R.)

- The vector representing a given training example (the jth row of G represents the jth example) is then constructed by simply summing the random index vectors of its features.

Note that, although we want to have k ≪ d, we still operate in a relatively high-dimensional space (with k being on the order of thousands). As noted by Sahlgren (2005), high-dimensional vectors having random directions are very likely to be close to orthogonal, and the approximation to F will generally be better the higher we set k.

Finally, it is worth noting that RI has traditionally been applied on the type level, with the purpose of accumulating context vectors that represent the distributional profiles of words in a semantic space model (Sahlgren, 2005). Here, on the other hand, we apply it on the instance level and as a general means of compressing the feature space of a learning problem.

7.1 Tuning the Random Indexing

Regarding the ratio of non-zero elements, the literature on random projections contains a wide range of suggestions as to how the entries of the random matrix R should be initialized. In the context of random indexing, Sahlgren and Karlgren (2005) set approximately 1% of the entries in each index vector to +1 or −1. It is worth bearing in mind, however, that the computational complexity of dot-product operations (as used extensively by the SVM learner) depends not only on the number of dimensions but also on the number of non-zero elements. We therefore want to take care to avoid ending up with a reduced space that is much more dense. Nevertheless, the appeal of using a random projection technique is in our case more related to its potential as a feature extraction step, and less to its potential for speeding up computations and reducing memory load, as the original feature vectors are already very sparse.

After experimenting with different parametrizations, it seems that the classifier performance on our data sets is fairly stable with respect to varying the ratio of non-zeros. Moreover, we find that the non-zero entries can be very sparsely distributed, e.g. ≈ 0.05-0.2%, without much loss in classifier performance. Figure 2a shows the effect of varying the ratio of non-zero elements while keeping the dimensionality fixed (at k=5,000), always assigning an equal number of +1s and −1s (giving zero mean and unit variance). For each parametrization we perform a batch of 5 experiments using different random initializations of the index vectors. The scores shown in Figure 2a are the average and maximum within each batch. As can be seen, with index vectors of 5,000 elements, using 8 non-zero entries (corresponding to a ratio of 0.16%) here seems to strike a reasonable balance between index density and performance.

Figure 2: While varying parameters of the random indexing, the plots show averaged and maximum sentence-level F1 from 5 different runs for each setting (using different random initializations of the index vectors), testing on 1/10th of the training data. In (a) we vary the number of non-zero elements in the index vectors, while keeping the dimensionality fixed at k=5,000. In (b) we apply the disambiguation classifier using random indexing while varying the dimensionality k of the index vectors. The number of non-zeros varies from 2 (for k=1,250) to 32 (for k=20,000). For reference, the last column shows the result for using the original non-projected feature space.
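The incremental formulation described above can be sketched as follows (our code, assuming NumPy), here with the parametrization highlighted in Figure 2a: k=5,000 dimensions and 8 non-zero entries per index vector, split evenly between +1 and −1. Summing index vectors in this way corresponds to computing a row of G = F R without materializing F or R explicitly.

```python
# Sketch of incremental random indexing: each new feature gets a fixed sparse
# index vector, and an example's vector is the sum of its features' index vectors.
import numpy as np

K, NONZEROS = 5000, 8          # dimensionality and non-zeros per index vector

def make_index_vector(rng, k=K, nonzeros=NONZEROS):
    vec = np.zeros(k)
    positions = rng.choice(k, size=nonzeros, replace=False)
    vec[positions[:nonzeros // 2]] = 1.0            # half of the non-zeros are +1
    vec[positions[nonzeros // 2:]] = -1.0           # and half are -1
    return vec

def example_vector(feature_names, index_vectors, rng):
    out = np.zeros(K)
    for name in feature_names:
        if name not in index_vectors:               # assign an index vector on first use
            index_vectors[name] = make_index_vector(rng)
        out += index_vectors[name]
    return out

rng = np.random.default_rng(0)
index_vectors = {}
x = example_vector(["pos=VBP", "lemma[0:0]=suggest", "form[0:1]=suggest_that"],
                   index_vectors, rng)
print((x != 0).sum())           # at most 24 non-zero dimensions for 3 features
```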
As expected, we do, however, see a clear deterioration of classifier accuracy if the dimensionality of the index vectors is set very low. Figure 2b shows the effect of varying the dimensionality k of the index vectors, while fixing the ratio of non-zero entries per vector to 0.16%. Again we perform batches of 5 experiments for each value of k, reporting the average and maximum within each batch. For our cue classification data, the positive effect of increasing k seems to flatten out at around k=5,000. When considering the standard deviation of scores within each batch, however, the variability of the results seems to steadily decrease as k increases. For example, while we find σ=1.34 for the set of runs using k=1,250, we find σ=0.29 for k=20,000.

When looking at the maximum scores shown in Figure 2b, one of the runs using k=5,000 turns out to have the peak performance, achieving a (sentence-level) F1 of 90.38. Not only does it score higher than any of the other RI runs with k>5,000, it also outperforms the original C_Disamb model, which achieves an F1 of 89.36 for the same single "fold" (the models in Figure 2b are tested using 1/10th of the training material).

In our experience, although the random projection provided by the RI vectors only represents an approximation to the original input space, it still appears to preserve a lot more information than feature selection based on filtering methods.

7.2 Preliminary Results

The bottom row of Table 2 (C_Disamb^RI) shows the results of applying an SVM classifier by full 10-fold cross-validation over the training set, using the same random index assignments that yielded the maximum F1 in Figure 2b for k=5,000 (with eight randomly set non-zeros in each index vector). We see that the performance of C_Disamb^RI is actually slightly lower than for C_Disamb. The differences are not detected as significant, though (applying the sign-test in the same manner as described above). Moreover, it should also be pointed out that we have not yet tried tuning the random indexing by multiple runs of full 10-fold cross-validation on the training data, which would be expected to improve these results. Given the fact that the effective feature space for the classifier is reduced from 670,000 to just 5,000 dimensions, we find it notable that the C_Disamb^RI model achieves comparable results with only preliminary tuning.

Another important observation is that the complexity of the resulting SVM in terms of the number of support vectors (SVs) is considerably reduced for the RI model: while the number of SVs for C_Disamb averages just below 8% of the training examples, this is reduced to just above 4% for C_Disamb^RI (using the SVMlight default settings). In addition to halving the number of SVs, as well as reducing the feature space by two orders of magnitude, the upper bound on the VC-dimension (as estimated by SVMlight) is also reduced by 12%. It is also worth noting that the run-time differences for estimating the SVM on the original input space and the reduced (but slightly denser) feature space are negligible (≈ 5 CPU-seconds more for the RI model when re-training on the full training set).
8 Held-Out Testing

Table 3 presents the final results for the various classifiers developed in this paper, testing them on the biomedical articles of the CoNLL 2010 Shared Task held-out test set (see Table 1). In addition to the evaluation results for our own classifiers, Table 3 also includes the official test results for the system described by Tang et al. (2010). The sequence classifier developed by Tang et al. (2010), combining a CRF classifier and a large-margin HMM model, obtained the best results in the official ST evaluation for Task 1 (i.e. sentence-level uncertainty detection).

                     Sentence Level            Token Level
Model                Prec    Rec     F1        Prec    Rec     F1
C_WbW^Uni            77.54   81.27   79.36     75.89   66.90   71.11
C_WbW                89.02   84.18   86.53     87.58   74.30   80.40
C_Disamb             87.37   85.82   86.59     85.92   76.57   80.98
C_Disamb^RI          88.83   84.56   86.64     86.65   74.65   80.21
Tang et al. (2010)   85.03   87.72   86.36     -       -       -

Table 3: Results on the Shared Task test data.

As seen from Table 3, all of our SVM classifiers (C_WbW, C_Disamb, and C_Disamb^RI) achieve a higher sentence-level F1 than the system of Tang et al. (2010) (though it is unknown whether the differences are statistically significant). We also note that our reformulation of the cue classification task as a disambiguation problem leads to better performance also on the held-out data, with C_Disamb performing slightly better than C_WbW across both evaluation levels. Interestingly, the best performer of them all proves to be the random indexing model (C_Disamb^RI), even though this model was not the top performer on the training data.

One possible explanation for the strong held-out performance of C_Disamb^RI is that the reduced complexity of this classifier (see Section 7.2) has made it less prone to overfitting, leading to better generalization performance on new data. Applying the sign-test as described above to the classifier decisions of C_Disamb^RI, we find statistically significant differences with respect to C_WbW but not with respect to C_Disamb. Nonetheless, the encouraging results of the C_Disamb^RI model on the held-out data mean that further tuning of the RI configuration on the training data will be a priority for future experiments.

It is also worth noting that many of the systems participating in the ST challenge used fairly complex and resource-heavy feature types, being sensitive to document structure, grammatical relations, etc. (Farkas et al., 2010). The fact that comparable or better results can be obtained using a relatively simple approach as demonstrated in this paper, with low cost in terms of both computation and external resources, might lower the bar for employing a hedge detection component in an actual IE system.

Finally, we also observe that our simple unigram baseline classifier proves to be surprisingly competitive. In fact, comparing its Task 1 F1 to those of the official ST evaluation, it actually outranks 7 of the 24 submitted systems.

9 Conclusions

This paper has presented the incremental development of uncertainty classifiers for detecting hedging in biomedical text, the topic of the CoNLL 2010 Shared Task. Using simple n-gram features over words, lemmas and PoS tags, we first develop a (linear) SVM cue classifier that outperforms the top-ranked system for Task 1 in the official Shared Task evaluation (i.e. sentence-level uncertainty detection). We then show how the original classification task can be greatly simplified by viewing it as a disambiguation task restricted to only those words that have previously been observed as hedge cues. Operating in a smaller (though still fairly large) feature space, this second classifier achieves even better results. Finally, we apply the method of random indexing, further reducing the dimensionality of the feature space by two orders of magnitude. This final classifier, combining an SVM-based disambiguation model with random indexing, is our best performer, achieving a sentence-level F1 of 86.64 on the CoNLL 2010 Shared Task held-out data.
Acknowledgments

The experiments reported in this paper represent an extension of previous joint work with Stephan Oepen and Lilja Øvrelid (Velldal et al., 2010; Øvrelid et al., 2010b). The author also wishes to thank the anonymous reviewers, as well as colleagues at the University of Oslo, for their valuable comments.

References

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224-231, Seattle, WA.

Dick Crouch, Mary Dalrymple, Ron Kaplan, Tracy King, John Maxwell, and Paula Newman. 2008. XLE documentation. Palo Alto Research Center.

Richárd Farkas, Veronika Vincze, György Móra, János Csirik, and György Szarvas. 2010. The CoNLL 2010 Shared Task: Learning to detect hedges and their scope in natural language text. In Proceedings of the 14th Conference on Natural Language Learning, Uppsala, Sweden.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.

Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research (special issue on variable and feature selection), 3:1157-1182, March.

Thorsten Joachims. 1999. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press.

William Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206.

P. Kanerva, J. Kristoferson, and A. Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Philadelphia, Pennsylvania. Erlbaum, Mahwah, New Jersey.

Ping Li, Trevor Hastie, and Kenneth Church. 2006. Very sparse random projections. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), Philadelphia.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

Roser Morante and Walter Daelemans. 2009. Learning the scope of hedge cues in biomedical texts. In Proceedings of the BioNLP 2009 Workshop, pages 28-36, Boulder, Colorado.
In Proceedings Contemporary Mathematics, 26:189 – 206. of the BioNLP 2008 Workshop, Columbus, USA. P. Kanerva, J. Kristoferson, and A Holst. 2000. Ran- Andreas Vlachos and Mark Craven. 2010. Detecting dom indexing of text samples for latent semantic speculative language using syntactic dependencies analysis. In Proceedings of the 22nd Annual Con- and logistic regression. In Proceedings of the 14th ference of the Cognitive Science Society, page 1036, Conference on Natural Language Learning, Upp- Pennsylvania. Mahwah, New Jersey: Erlbaum. sala, Sweden. Ping Li, Trevor Hastie, and Kenneth Church. 2006. Lilja Øvrelid, Jonas Kuhn, and Kathrin Spreyer. 2010a. Very sparse random projections. In Proceedings of Cross-framework parser stacking for data-driven de- the Twelfth ACM SIGKDD International Conference pendency parsing. TAL 2010 special issue on Ma- on Knowledge Discovery and Data Mining (KDD- chine Learning for NLP, 50(3). 06), Philadelphia. Lilja Øvrelid, Erik Velldal, and Stephan Oepen. 2010b. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Syntactic scope resolution in uncertainty analysis. Marcinkiewicz. 1993. Building a large annotated In Proceedings of the 23rd International Conference corpus of English. The Penn Treebank. Computa- on Computational Linguistics, Beijing, China. tional Linguistics, 19:313 – 330. Roser Morante and Walter Daelemans. 2009. Learn- ing the scope of hedge cues in biomedical texts. In Proceedings of the BioNLP 2009 Workshop, pages 28 – 36, Boulder, Colorado. 80