Detecting Uncertainty in Biomedical Literature: A Simple Disambiguation Approach Using Sparse Random Indexing

Erik Velldal
Department of Informatics, University of Oslo, Norway
erikve@ifi.uio.no

Abstract

This paper presents a novel approach to the problem of hedge detection, which involves the identification of so-called hedge cues for labeling sentences as certain or uncertain. This is the classification problem for Task 1 of the CoNLL-2010 Shared Task, which focuses on hedging in biomedical literature. We here propose to view hedge detection as a simple disambiguation problem, restricted to words that have previously been observed as hedge cues. Applying an SVM classifier, the approach achieves the best published results so far for sentence-level uncertainty prediction on the Shared Task test data. We also show that the technique of random indexing can be successfully applied for compressing the dimensionality of the original feature space by several orders of magnitude, while at the same time yielding better classifier performance.

1 Introduction

The problem of hedge detection refers to the task of identifying uncertainty or speculation in text. Being the topic of several recent shared tasks and dedicated workshops,¹ this is a problem that is receiving increasing interest within the fields of NLP and biomedical text mining. In terms of practical motivation, hedge detection is particularly useful in relation to information extraction tasks, where the ability to distinguish between factual and uncertain information can be of vital importance.

The topic of the Shared Task at the 2010 Conference for Natural Language Learning (CoNLL) is hedge detection in the domain of biomedical research literature (Farkas et al., 2010). The task is defined for two levels of analysis: while Task 1 is described as learning to detect sentences containing uncertainty, the object of Task 2 is learning to resolve the in-sentence scope of hedge cues. The focus of the present paper is only on Task 1.

A hedge cue is here taken to mean the words or phrases that signal the attitude of uncertainty or speculation.² Examples 1-4 in Figure 1, taken from the BioScope corpus (Vincze et al., 2008), illustrate how cue words are annotated in the Shared Task training data. Moreover, the training data also annotates an entire sentence as uncertain if it contains a hedge cue, and it is the prediction of this sentence labeling that is required for Task 1.

The approach presented in this paper extends that of Velldal et al. (2010), where a maximum entropy (MaxEnt) classifier is applied to automatically detect cue words, subsequently labeling sentences as uncertain if they are found to contain a cue. Furthermore, in the system of Velldal et al. (2010), the resolution of the in-sentence scopes of identified cues, as required for Task 2, is determined by a set of manually crafted rules operating on dependency representations. Readers interested in more details on the set of rules used for solving Task 2 are referred to Øvrelid et al. (2010b). The focus of the present paper, however, is to present a new and simplified approach to the classification problem relevant for solving Task 1, and also partially Task 2, viz. the identification of hedge cues.

¹ Hedge detection played a central part in the shared tasks of both BioNLP 2009 and CoNLL 2010, as well as in the NeSp-NLP 2010 workshop (Negation and Speculation in NLP).
² As noted by Farkas et al. (2010), most hedge cues typically fall in the following categories: auxiliaries (may, might, could, etc.), verbs of hedging or verbs with speculative content (suggest, suspect, indicate, suppose, seem, appear, etc.), adjectives or adverbs (probable, likely, possible, unsure, etc.), or conjunctions (either...or, etc.).
(1) {ROI ⟨appear⟩ to serve as messengers mediating {directly ⟨or⟩ indirectly} the release of the inhibitory subunit I kappa B from NF-kappa B}.
(2) {The specific role of the chromodomain is ⟨unknown⟩} but chromodomain swapping experiments in Drosophila {⟨suggest⟩ that they {⟨might⟩ be protein interaction modules}} [18].
(3) These data {⟨indicate that⟩ IL-10 and IL-4 inhibit cytokine production by different mechanisms}.
(4) Whereas a background set of promoter regions is easy to identify, it is {⟨not clear⟩ how to define a reasonable genomic sample of enhancers}.

Figure 1: Examples of hedged sentences in the BioScope corpus. Hedge cues are here shown using angle brackets, with braces corresponding to their annotated scopes.

2 Overview

In Section 5 we first develop a Support Vector Machine (SVM) token classifier for identifying cue words. For a given sentence, the classifier considers each word in turn, labeling it as a cue or a non-cue. We will refer to this mode of cue classification as performing word-by-word classification (WbW). Later, in Section 6, we go on to show how better results can be obtained by instead approaching the task as a disambiguation problem, restricting our attention to only those tokens whose base forms have previously been observed as hedge cues. Reformulating the problem in this way simplifies the classification task tremendously, reducing the number of examples that need to be considered, and thereby also trimming down the relevant feature space to a much more manageable size. At the same time, the resulting classifier achieves the best published results so far on the Shared Task data (to the best of our knowledge).

Additionally, in Section 7 we show how the very large input feature space can be further compressed using random indexing. This is essentially a dimension reduction technique based on sparse random projections, which we here apply for feature extraction. We show that training the classifier on the reduced feature space yields better performance than when using the original input space.

The evaluation measures and feature templates are detailed in Sections 5.2 and 5.3, respectively. Note that, while preliminary results for all models are presented for the development data throughout the paper, the performance of all models is ultimately compared on the official Shared Task held-out data in Section 8. We start, however, by providing a brief overview of related work in Section 3, and then describe the relevant data sets and preprocessing steps in Section 4.

3 Related Work

The top-ranked system for Task 1 in the official CoNLL 2010 Shared Task evaluation, described by Tang et al. (2010), approaches cue identification as a sequence labeling problem. Similarly to Morante and Daelemans (2009), Tang et al. (2010) set out to label tokens according to a BIO scheme, i.e. indicating whether they are at the Beginning, Inside, or Outside of a hedge cue. Tang et al. (2010) train both a Conditional Random Field (CRF) sequence classifier and an SVM-based Hidden Markov Model (HMM), finally combining the predictions of both models in a second CRF.
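As a concrete illustration of the BIO encoding just described (this is our own minimal sketch, not code from Tang et al. (2010) or Morante and Daelemans (2009); the tag names and token indices are illustrative), the multi-word cue indicate that from Example 3 would be tagged as follows:

```python
# Minimal illustration of BIO-tagging a hedge cue (tag names are illustrative).
tokens = ["These", "data", "indicate", "that", "IL-10", "and", "IL-4", "inhibit",
          "cytokine", "production", "by", "different", "mechanisms", "."]
cue_spans = [(2, 4)]  # the multi-word cue "indicate that" covers token positions 2-3

tags = ["O"] * len(tokens)
for start, end in cue_spans:
    tags[start] = "B-CUE"                 # Beginning of a cue
    for i in range(start + 1, end):
        tags[i] = "I-CUE"                 # Inside the same cue
print(list(zip(tokens, tags))[:5])
# [('These', 'O'), ('data', 'O'), ('indicate', 'B-CUE'), ('that', 'I-CUE'), ('IL-10', 'O')]
```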
In terms of the overall approach, i.e. viewing the problem as a sequence labeling task, Tang et al. (2010) are actually representative of the majority of the ST participants for Task 1 (Farkas et al., 2010), including the top three performers on the official held-out data. As noted by Farkas et al. (2010), the remaining systems approached the task either as a WbW token classification problem or directly as a sentence classification problem. Examples of the former are the systems of Velldal et al. (2010) and Vlachos and Craven (2010), sharing the 4th rank position (out of 24 submitted systems) for Task 1.

In both the sequence labeling and token classification approaches, a sentence is labeled as uncertain if it contains a word labeled as a cue. In contrast, the sentence classification approaches instead try to label sentences directly, typically using Bag-of-Words (BoW) features. In terms of the official Task 1 evaluation, the sentence classifiers tended to achieve a somewhat lower relative rank.

4 Data Sets and Preprocessing

The training data for the CoNLL 2010 Shared Task is taken from the BioScope corpus (Vincze et al., 2008) and consists of 14,541 sentences (or other root-level utterances) from biomedical abstracts and articles. Some basic descriptive statistics for the data sets are provided in Table 1. We see that roughly 18% of the sentences are annotated as uncertain. The BioScope corpus also provides annotation for hedge cues as well as their scope. Out of a total of 378,213 tokens, 3,838 are annotated as being part of a hedge cue. As can be seen, the total number of cues is somewhat lower (3,327), due to the fact that some tokens are part of the same cue, so-called multi-word cues (448 in total), such as indicate that in Example 3.

Data Set             Sentences   Hedged Sentences   Cues    Multi-Word Cues   Tokens    Cue Tokens
Training Abstracts   11,871      2,101              2,659   364               309,634   3,056
Training Articles    2,670       519                668     84                68,579    782
Training Total       14,541      2,620              3,327   448               378,213   3,838
Held-Out             5,003       790                1,033   87                138,276   1,148

Table 1: The Shared Task data sets. The top three rows list the properties of the training data, separately detailing its two components, biomedical abstracts and full articles. The bottom row summarizes the official held-out test data (articles only). Token counts are based on the tokenizer described in Section 4.1.

For evaluation purposes, the task organizers provided newly annotated biomedical articles, comprising 5,003 additional utterances, of which 790 are annotated as hedged (see the overview in Table 1). The data contains a total of 1,033 cues, of which 87 are multi-word cues spanning multiple tokens, comprising 1,148 cue tokens altogether.

4.1 Tokenization

The GENIA tagger (Tsuruoka et al., 2005) plays an important role in our preprocessing set-up, as it is specifically tuned for biomedical text. Nevertheless, its rules for tokenization appear to not always be optimally adapted for the BioScope corpus. (For example, GENIA unconditionally introduces token boundaries for some punctuation marks that can also occur token-internally.) Our preprocessing pipeline therefore deploys a home-grown, cascaded finite-state tokenizer (adapted from the open-source English Resource Grammar; Flickinger (2000)), which aims to implement the tokenization decisions made in the Penn Treebank (Marcus et al., 1993), much like GENIA in principle, but properly treating certain corner cases found in the BioScope data.

4.2 PoS Tagging and Lemmatization

For part-of-speech (PoS) tagging and lemmatization, we combine GENIA and TnT (Brants, 2000), which operates on pre-tokenized inputs but in its default model is trained on financial news from the Penn Treebank. Our general goal here is to take advantage of the higher PoS accuracy provided by GENIA in the biomedical domain, while using our improved tokenization.

For the vast majority of tokens, we use GENIA PoS tags and base forms (i.e. lemmas). However, GENIA does not make a PoS distinction between proper and common nouns, as in the Penn Treebank, and hence we give precedence to TnT outputs for tokens tagged as nominal by both taggers.

5 Hedge Cue Classification

This section develops a binary cue classifier similar to that of Velldal et al. (2010), but using the framework of large-margin SVM classification (Vapnik, 1995) instead of MaxEnt. For a given sentence, the word-by-word classifier (referred to as C_WbW) considers each token in turn, labeling it as a cue or non-cue. Any sentence found to contain a cue is subsequently labeled as uncertain.
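To make the decision rule explicit, the following is a minimal sketch (ours, with a stand-in for the trained SVM) of how token-level cue predictions are turned into the sentence-level labels required for Task 1:

```python
# Sketch of the word-by-word (WbW) scheme: a token classifier marks each token as
# cue or non-cue, and a sentence is labeled uncertain if any token is a cue.
# `token_is_cue` stands in for the trained SVM classifier and is assumed here.
def label_sentence(tokens, token_is_cue):
    cue_flags = [token_is_cue(tok) for tok in tokens]
    return ("uncertain" if any(cue_flags) else "certain"), cue_flags

# Toy usage with a stand-in predictor that only knows two cue words:
toy_is_cue = lambda tok: tok.lower() in {"might", "suggest"}
print(label_sentence(["These", "data", "might", "be", "modules"], toy_is_cue))
# ('uncertain', [False, False, True, False, False])
```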
5.1 Defining the Training Instances

As annotated in the training data, it is possible for a hedge cue to span multiple tokens, e.g. as in whether or not. The majority of the multi-word cues in the training data are very infrequent, however, most occurring only once, and the classifier itself is not sensitive to the notion of multi-word cues. A given word token is considered a cue as long as it falls within the span of a cue annotation.

As presented to the learner, a given token w_i is represented as a feature vector f(w_i) = f_i ∈ ℜ^d. Each dimension f_ij represents a feature function which can encode arbitrary properties of w_i. Section 5.3 describes the particular features we are using. Each training example can be thought of as a pair of a feature vector and a label, ⟨f_i, y_i⟩. If w_i is a cue we have y_i = +1, while for non-cues the label is y_i = −1. For estimating the actual SVM classifier used for predicting the labels of unseen examples, we use the SVMlight toolkit (Joachims, 1999).

5.2 Evaluation Measures

We will be reporting precision, recall and F1 for two different levels of evaluation: the sentence level and the token level. While the token-level scores indicate how well the classifiers succeed in identifying individual cue words, the sentence-level scores are what actually correspond to Task 1, i.e. correctly identifying sentences as being certain or uncertain.
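The two levels of evaluation can be made concrete with the following sketch (our own helper functions, not the official Shared Task scorer), where gold and predicted labels are Boolean cue indicators per token:

```python
# Sketch of token-level vs. sentence-level precision/recall/F1.
def precision_recall_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def evaluate(gold_sentences, pred_sentences):
    # Token level: every token is an instance (cue vs. non-cue).
    gold_tokens = [t for sent in gold_sentences for t in sent]
    pred_tokens = [t for sent in pred_sentences for t in sent]
    # Sentence level: a sentence counts as uncertain if it contains any cue token.
    gold_sents = [any(sent) for sent in gold_sentences]
    pred_sents = [any(sent) for sent in pred_sentences]
    return {"token": precision_recall_f1(gold_tokens, pred_tokens),
            "sentence": precision_recall_f1(gold_sents, pred_sents)}
```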
5.3 Feature Templates

In the Shared Task system description paper of Velldal et al. (2010), results are reported for MaxEnt cue classifiers using a wide variety of feature types of both surface-oriented and syntactic nature. For the latter, Velldal et al. (2010) define a range of syntactic and dependency-based features extracted from parses produced by the MaltParser (Nivre et al., 2006; Øvrelid et al., 2010a) and the XLE (Crouch et al., 2008), recording information about dependency relations, subcategorization frames, etc. However, it turned out that the simpler lexical and surface-oriented features were sufficient for the identification of hedge cues.

Drawing on this observation, the classifiers trained in this paper are only based on simple sequence-oriented n-gram features collected for PoS tags, lemmas and surface forms. For all these types of features we record neighbors up to 3 positions left/right of the focus word. For increased generality, all these n-gram features also include non-lexicalized variants, i.e. variants excluding the focus word itself.

5.4 Preliminary Results

Instantiating all the feature templates described above for the BioScope training data, using the maximal span for all n-grams (n=4, i.e. including up to 3 neighbors), we end up with a total of more than 6,500,000 unique feature types. However, after testing different feature configurations, it turns out that the best performing model only uses a small subset of this feature pool. The configuration we will be using throughout this paper includes: n-grams over base forms within ±3 positions of the focus word; n-grams over surface forms up to +2 positions only; and the PoS of the focus word. This results in a set of roughly 2,630,000 feature types. In addition to reporting classifier performance for this feature configuration, we also provide results for a baseline model using only unigram features over surface forms. The behavior of this classifier is similar to what we would expect from simply compiling a list of cue words from the training data, based on the majority usage of each word as cue or non-cue.

As shown in Table 2, after averaging results from 10-fold cross-validation on the training data, the baseline cue classifier (shown as C_WbW^Uni) achieves a sentence-level F1 of 88.69 and a token-level F1 of 79.59. In comparison, the classifier using all the available n-gram features (C_WbW) achieves F-scores of 91.19 and 87.80 on the sentence level and token level, respectively. The improvement in performance compared to the baseline is most pronounced on the token level, but the differences in scores for both levels are found to be statistically significant at p < 0.005 using a two-tailed sign-test.
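The feature configuration described above can be sketched as follows. This is our own simplified reading of the templates; the exact template instantiation in the system may differ, and the non-lexicalized variants are omitted here.

```python
# Rough sketch of the n-gram feature templates from Sections 5.3-5.4: lemma n-grams
# within +/-3 positions of the focus word (maximal span n=4), surface-form n-grams
# up to +2 positions to the right, and the PoS of the focus word.
def cue_features(sentence, i):
    """sentence: list of (form, lemma, pos) triples; i: index of the focus token."""
    pad = ("<pad>", "<pad>", "<pad>")
    padded = [pad] * 3 + list(sentence) + [pad] * 3
    j = i + 3                                  # focus position in the padded list
    feats = {"pos=" + padded[j][2]}
    for left in range(-3, 1):                  # lemma n-grams around the focus word
        for right in range(0, 4):
            if right - left + 1 > 4:           # cap the span at n=4 tokens
                continue
            gram = "_".join(padded[j + k][1] for k in range(left, right + 1))
            feats.add("lemma[%d:%d]=%s" % (left, right, gram))
    for right in range(1, 3):                  # surface-form n-grams up to +2
        gram = "_".join(padded[j + k][0] for k in range(0, right + 1))
        feats.add("form[0:%d]=%s" % (right, gram))
    return feats

sent = [("data", "data", "NNS"), ("suggest", "suggest", "VBP"), ("that", "that", "IN")]
print(sorted(cue_features(sent, 1))[:3])       # a few of the generated feature strings
```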
6 Reformulating the Classification Problem

An error analysis of our initial WbW classifier revealed that it is not able to generalize to new hedge cues beyond those that have already been observed during training. Even after adding the non-lexicalized variants of all feature types (i.e. making features more general by not including the focus word itself), the classifier still fails to identify any unseen hedge cues whose base form did not occur as a cue in the training material. On the other hand, only very few of the test cues are actually unseen (≈ 1.5%), meaning that the set of cue words might reasonably be treated as a near-closed class (at least for the biomedical data considered in this study). As a consequence of these observations, we here reformulate the problem as follows. Instead of approaching the task as a classification problem defined for all words, we only consider words that have a base form observed as a hedge cue in the training material. In effect, any word whose base form has never been observed as a cue in the training data is automatically considered to be a non-cue when testing. Part of the rationale here is that, while it seems reasonable to assume that any word occurring as a cue can also occur as a non-cue, the converse is less likely.

While the training data contains a total of approximately 17,600 unique base forms (given the preprocessing outlined in Section 4), only 143 of these ever occur as hedge cues. By restricting the classifier to only this subset of words, we manage to simplify the classification problem tremendously, but without any loss in performance.

Note that, although we view the task as a disambiguation problem, it is not feasible to train separate classifiers for each individual base form. The frequency distribution of the cue words in the training material is very skewed, with most cues being very rare, many occurring as a cue only once (≈ 40%). (Note that most of these words also have many additional occurrences in the training data as non-cues, however.) For the majority of the cue words, then, it seems we cannot hope to gather enough reliable information to train individual classifiers. Instead, we want to be able to draw on information from the more frequently occurring cues also when classifying or disambiguating the less frequent ones. Consequently, we still train a single global classifier as for the original WbW set-up. However, as the disambiguation classifier only needs to consider a small subset of the words considered by the full WbW classifier, the number of instantiated feature types is, of course, greatly reduced.

For the full WbW classification, the number of training examples is 378,213. Using the feature configuration described in Section 5.4, this generates a total of roughly 2,630,000 feature types. For the disambiguation model, using the same feature configuration, the number of instantiated feature types is reduced to just below 670,000, as generated for 94,155 training examples.

Running the new disambiguation classifier by 10-fold cross-validation on the training data, we find that it has substantially better recall than the original WbW classifier. The results are shown in the row C_Disamb in Table 2. Across all levels of evaluation, the C_Disamb model achieves a boost in F1 compared to C_WbW. However, when applying a two-tailed sign-test, considering differences in classifier decisions on both the sentence level and the token level, only the latter differences are found to be significant (at p < 0.005).

                 Sentence Level            Token Level
Model            Prec    Rec     F1        Prec    Rec     F1
C_WbW^Uni        91.01   86.53   88.69     90.60   71.03   79.59
C_WbW            94.31   88.30   91.19     94.67   81.89   87.80
C_Disamb         93.64   89.68   91.60     94.01   83.55   88.45
C_Disamb^RI      93.78   88.45   91.03     94.05   81.97   87.58

Table 2: 10-fold cross-validation on the biomedical abstracts and articles in the training data.
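The reformulation in this section can be summarized by the following sketch (ours, under simplifying assumptions; the lexicon construction and the stand-in classify function are illustrative): at test time, only tokens whose base form occurred as a cue in the training data are passed to the global classifier, and all other tokens default to non-cue.

```python
# Sketch of the disambiguation set-up: build the (near-closed) set of cue base
# forms from the training data, and restrict classification to those tokens.
def collect_cue_lemmas(training_sentences):
    """training_sentences: lists of (lemma, is_cue) pairs from the annotated data."""
    return {lemma for sent in training_sentences for lemma, is_cue in sent if is_cue}

def predict_cue_flags(sentence_lemmas, cue_lemmas, classify):
    """classify(i) stands in for the single global SVM applied to token position i."""
    return [(lemma in cue_lemmas) and classify(i)
            for i, lemma in enumerate(sentence_lemmas)]

train = [[("these", False), ("data", False), ("suggest", True)],
         [("results", False), ("may", True), ("vary", False)]]
cue_lemmas = collect_cue_lemmas(train)          # {'suggest', 'may'}
flags = predict_cue_flags(["it", "may", "work"], cue_lemmas, classify=lambda i: True)
print(flags)                                    # [False, True, False]
```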
7 Sparse Random Indexing

As mentioned in Section 5.1, each training example is represented by a d-dimensional feature vector f_i ∈ ℜ^d. Given n examples and d features, the feature vectors can be thought of as rows in a matrix F ∈ ℜ^(n×d). One potential problem with using a vector-based numerical encoding of local context features such as those described in Section 5.3 is that the dimensionality of the feature space grows very rapidly with the number of training examples. Using local features, e.g. context windows recording properties such as direction and distance, the number of unique features grows much faster than when using, say, BoW features. In order to make the vector encoding scalable, we would like to somehow be able to put a bound on the number of dimensions.

As mentioned above, even after simplifying the classification problem, our input feature space is still rather huge, totaling roughly 670,000 feature types. Given that the number of training examples is only around n ≈ 95,000, we have d ≫ n, and whenever we want to add more feature templates or more training data, this imbalance will only become more pronounced. It is also likely that many of the n-gram features in our model will not be relevant for the classification of new data points. The combination of many irrelevant features and few training examples relative to the number of features makes the learner prone to overfitting.

In previous attempts to reduce the feature space, we have applied several feature selection schemes, such as filtering on the correlation coefficient between a feature and a class label, or using simple frequency cutoffs. Although such methods are effective in reducing the number of features, they typically do so at the expense of classifier performance. Due to both data sparseness and the likelihood of many features being only locally relevant, it is difficult to reliably assess the relevance of the input features, and we risk filtering out many relevant features as well. Using simple filtering methods, we did not manage to considerably reduce the number of features without also significantly reducing the performance of the classifier. Although better results can be expected by using so-called wrapper methods (Guyon and Elisseeff, 2003) instead, this is not computationally feasible for large feature sets.

As an alternative to such feature selection methods, we here report on experiments with a technique known as random indexing (RI). This allows us to drastically compress the feature space without explicitly throwing out any features.

The technique of random indexing was initially introduced by Kanerva et al. (2000) for modeling the semantic similarity of words by their distribution in text.³ RI actually forms part of a larger family of dimension reduction techniques based on random projections. Such methods typically work by multiplying the feature matrix F ∈ ℜ^(n×d) by a random matrix R ∈ ℜ^(d×k), where k ≪ d, thereby reducing the number of dimensions from d to k:

    G = F R ∈ ℜ^(n×k), with k ≪ d    (5)

Given that k is sufficiently high, the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) tells us that the pairwise distances (and thereby separability) in F can be preserved with high probability within the lower-dimensional space G (Li et al., 2006). While the only condition on the entries of R is that they are i.i.d. with zero mean, they are typically also specified to have unit variance (Li et al., 2006).

³ Readers are referred to Sahlgren (2005) for a good introduction to random indexing.
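A direct implementation of Equation (5) might look as follows. This is a sketch assuming NumPy; the sparse {−1, 0, +1} recipe for R and the non-zero ratio are illustrative choices (variance scaling is omitted), and the initialization actually used in our experiments is discussed in Section 7.1.

```python
# Sketch of the matrix view of random projection, G = F R (Equation 5), with a
# sparse random matrix R whose entries are mostly 0 and otherwise +1 or -1.
import numpy as np

def random_projection(F, k, nonzero_ratio=0.01, seed=0):
    """Project an n-by-d feature matrix F down to n-by-k dimensions."""
    rng = np.random.default_rng(seed)
    d = F.shape[1]
    R = np.zeros((d, k))
    mask = rng.random((d, k)) < nonzero_ratio      # positions of non-zero entries
    R[mask] = rng.choice([-1.0, 1.0], size=int(mask.sum()))
    return F @ R                                   # G has shape (n, k)

# Toy usage: 5 examples with 1,000 sparse binary features projected to 50 dimensions.
F = (np.random.default_rng(1).random((5, 1000)) < 0.01).astype(float)
print(random_projection(F, k=50).shape)            # (5, 50)
```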
One advantage of the particular random indexing approach is that the full n × d feature matrix F does not need to be explicitly computed. The method constructs the representation of the data in G by incrementally accumulating so-called index vectors assigned to each of the d features (Sahlgren and Karlgren, 2005). The process can be described by the following two simple steps:

- When a new feature is instantiated, it is assigned a randomly generated vector of a fixed dimensionality k, consisting of a small number of −1s and +1s (the remaining elements being 0). This is then the so-called index vector of the feature. (The index vector of the ith feature corresponds to the ith row of R.)

- The vector representing a given training example (the jth row of G represents the jth example) is then constructed by simply summing the random index vectors of its features.

Note that, although we want to have k ≪ d, we still operate in a relatively high-dimensional space (with k being on the order of thousands). As noted by Sahlgren (2005), high-dimensional vectors having random directions are very likely to be close to orthogonal, and the approximation to F will generally be better the higher we set k.

Finally, it is worth noting that RI has traditionally been applied on the type level, with the purpose of accumulating context vectors that represent the distributional profiles of words in a semantic space model (Sahlgren, 2005). Here, on the other hand, we apply it on the instance level and as a general means of compressing the feature space of a learning problem.

7.1 Tuning the Random Indexing

Regarding the ratio of non-zero elements, the literature on random projections contains a wide range of suggestions as to how the entries of the random matrix R should be initialized. In the context of random indexing, Sahlgren and Karlgren (2005) set approximately 1% of the entries in each index vector to +1 or −1. It is worth bearing in mind, however, that the computational complexity of dot-product operations (as used extensively by the SVM learner) depends not only on the number of dimensions but also on the number of non-zero elements. We therefore want to take care to avoid ending up with a reduced space that is much more dense. Nevertheless, the appeal of using a random projection technique is in our case more related to its potential as a feature extraction step, and less to its potential for speeding up computations and reducing memory load, as the original feature vectors are already very sparse.

After experimenting with different parametrizations, it seems that the classifier performance on our data sets is fairly stable with respect to varying the ratio of non-zeros. Moreover, we find that the non-zero entries can be very sparsely distributed, e.g. ≈ 0.05-0.2%, without much loss in classifier performance. Figure 2a shows the effect of varying the ratio of non-zero elements while keeping the dimensionality fixed (at k=5,000), always assigning an equal number of +1s and −1s (giving zero mean and unit variance). For each parametrization we perform a batch of 5 experiments using different random initializations of the index vectors. The scores shown in Figure 2a are the average and maximum within each batch. As can be seen, with index vectors of 5,000 elements, using 8 non-zero entries (corresponding to a ratio of 0.16%) here seems to strike a reasonable balance between index density and performance.

Figure 2: While varying parameters of the random indexing, the plots show averaged and maximum sentence-level F1 from 5 different runs for each setting (using different random initializations of the index vectors), testing on 1/10th of the training data. In (a) we vary the number of non-zero elements in the index vectors, while keeping the dimensionality fixed at k=5,000. In (b) we apply the disambiguation classifier using random indexing while varying the dimensionality k of the index vectors. The number of non-zeros varies from 2 (for k=1,250) to 32 (for k=20,000). For reference, the last column shows the result for using the original non-projected feature space.
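The incremental formulation described above can be sketched as follows (our code, assuming NumPy), here with the parametrization highlighted in Figure 2a: k=5,000 dimensions and 8 non-zero entries per index vector, split evenly between +1 and −1. Summing index vectors in this way corresponds to computing a row of G = F R without materializing F or R explicitly.

```python
# Sketch of incremental random indexing: each new feature gets a fixed sparse
# index vector, and an example's vector is the sum of its features' index vectors.
import numpy as np

K, NONZEROS = 5000, 8          # dimensionality and non-zeros per index vector

def make_index_vector(rng, k=K, nonzeros=NONZEROS):
    vec = np.zeros(k)
    positions = rng.choice(k, size=nonzeros, replace=False)
    vec[positions[:nonzeros // 2]] = 1.0            # half of the non-zeros are +1
    vec[positions[nonzeros // 2:]] = -1.0           # and half are -1
    return vec

def example_vector(feature_names, index_vectors, rng):
    out = np.zeros(K)
    for name in feature_names:
        if name not in index_vectors:               # assign an index vector on first use
            index_vectors[name] = make_index_vector(rng)
        out += index_vectors[name]
    return out

rng = np.random.default_rng(0)
index_vectors = {}
x = example_vector(["pos=VBP", "lemma[0:0]=suggest", "form[0:1]=suggest_that"],
                   index_vectors, rng)
print((x != 0).sum())           # at most 24 non-zero dimensions for 3 features
```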
As expected, we do, however, see a clear deterioration of classifier accuracy if the dimensionality of the index vectors is set very low. Figure 2b shows the effect of varying the dimensionality k of the index vectors, while fixing the ratio of non-zero entries per vector to 0.16%. Again we perform batches of 5 experiments for each value of k, reporting the average and maximum within each batch. For our cue classification data, the positive effect of increasing k seems to flatten out at around k=5,000. When considering the standard deviation of scores within each batch, however, the variability of the results seems to steadily decrease as k increases. For example, while we find σ=1.34 for the set of runs using k=1,250, we find σ=0.29 for k=20,000.

When looking at the maximum scores shown in Figure 2b, one of the runs using k=5,000 turns out to have the peak performance, achieving a (sentence-level) F1 of 90.38. Not only does it score higher than any of the other RI runs with k>5,000, it also outperforms the original C_Disamb model, which achieves an F1 of 89.36 for the same single "fold" (the models in Figure 2b are tested using 1/10th of the training material).

In our experience, although the random projection provided by the RI vectors only represents an approximation to the original input space, it still appears to preserve a lot more information than feature selection based on filtering methods.

7.2 Preliminary Results

The bottom row of Table 2 (C_Disamb^RI) shows the results of applying an SVM classifier by full 10-fold cross-validation over the training set, using the same random index assignments that yielded the maximum F1 in Figure 2b for k=5,000 (with eight randomly set non-zeros in each index vector). We see that the performance of C_Disamb^RI is actually slightly lower than for C_Disamb. The differences are not detected as significant, though (applying the sign-test in the same manner as described above). Moreover, it should also be pointed out that we have not yet tried tuning the random indexing by multiple runs of full 10-fold cross-validation on the training data, which would be expected to improve these results. Given the fact that the effective feature space for the classifier is reduced from 670,000 to just 5,000 dimensions, we find it notable that the C_Disamb^RI model achieves comparable results with only preliminary tuning.

Another important observation is that the complexity of the resulting SVM in terms of the number of support vectors (SVs) is considerably reduced for the RI model: while the number of SVs for C_Disamb averages just below 8% of the training examples, this is reduced to just above 4% for C_Disamb^RI (using the SVMlight default settings). In addition to halving the number of SVs, as well as reducing the feature space by two orders of magnitude, the upper bound on the VC-dimension (as estimated by SVMlight) is also reduced by 12%. It is also worth noting that the run-time differences for estimating the SVM on the original input space and the reduced (but slightly denser) feature space are negligible (≈ 5 CPU-seconds more for the RI model when re-training on the full training set).
8 Held-Out Testing

Table 3 presents the final results for the various classifiers developed in this paper, testing them on the biomedical articles of the CoNLL 2010 Shared Task held-out test set (see Table 1). In addition to the evaluation results for our own classifiers, Table 3 also includes the official test results for the system described by Tang et al. (2010). The sequence classifier developed by Tang et al. (2010), combining a CRF classifier and a large-margin HMM model, obtained the best results in the official ST evaluation for Task 1 (i.e. sentence-level uncertainty detection).

                     Sentence Level            Token Level
Model                Prec    Rec     F1        Prec    Rec     F1
C_WbW^Uni            77.54   81.27   79.36     75.89   66.90   71.11
C_WbW                89.02   84.18   86.53     87.58   74.30   80.40
C_Disamb             87.37   85.82   86.59     85.92   76.57   80.98
C_Disamb^RI          88.83   84.56   86.64     86.65   74.65   80.21
Tang et al. (2010)   85.03   87.72   86.36     -       -       -

Table 3: Results on the Shared Task test data.

As seen from Table 3, all of our SVM classifiers (C_WbW, C_Disamb, and C_Disamb^RI) achieve a higher sentence-level F1 than the system of Tang et al. (2010) (though it is unknown whether the differences are statistically significant). We also note that our reformulation of the cue classification task as a disambiguation problem leads to better performance also on the held-out data, with C_Disamb performing slightly better than C_WbW across both evaluation levels. Interestingly, the best performer of them all proves to be the random indexing model (C_Disamb^RI), even though this model was not the top performer on the training data.

One possible explanation for the strong held-out performance of C_Disamb^RI is that the reduced complexity of this classifier (see Section 7.2) has made it less prone to overfitting, leading to better generalization performance on new data. Applying the sign-test as described above to the classifier decisions of C_Disamb^RI, we find statistically significant differences with respect to C_WbW but not with respect to C_Disamb. Nonetheless, the encouraging results of the C_Disamb^RI model on the held-out data mean that further tuning of the RI configuration on the training data will be a priority for future experiments.

It is also worth noting that many of the systems participating in the ST challenge used fairly complex and resource-heavy feature types, being sensitive to document structure, grammatical relations, etc. (Farkas et al., 2010). The fact that comparable or better results can be obtained using a relatively simple approach as demonstrated in this paper, with low cost in terms of both computation and external resources, might lower the bar for employing a hedge detection component in an actual IE system.

Finally, we also observe that our simple unigram baseline classifier proves to be surprisingly competitive. In fact, comparing its Task 1 F1 to those of the official ST evaluation, it actually outranks 7 of the 24 submitted systems.

9 Conclusions

This paper has presented the incremental development of uncertainty classifiers for detecting hedging in biomedical text, the topic of the CoNLL 2010 Shared Task. Using simple n-gram features over words, lemmas and PoS tags, we first develop a (linear) SVM cue classifier that outperforms the top-ranked system for Task 1 in the official Shared Task evaluation (i.e. sentence-level uncertainty detection). We then show how the original classification task can be greatly simplified by viewing it as a disambiguation task restricted to only those words that have previously been observed as hedge cues. Operating in a smaller (though still fairly large) feature space, this second classifier achieves even better results. Finally, we apply the method of random indexing, further reducing the dimensionality of the feature space by two orders of magnitude. This final classifier, combining an SVM-based disambiguation model with random indexing, is our best performer, achieving a sentence-level F1 of 86.64 on the CoNLL 2010 Shared Task held-out data.
Acknowledgments

The experiments reported in this paper represent an extension of previous joint work with Stephan Oepen and Lilja Øvrelid (Velldal et al., 2010; Øvrelid et al., 2010b). The author also wishes to thank the anonymous reviewers, as well as colleagues at the University of Oslo, for their valuable comments.

References

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224-231, Seattle, WA.

Dick Crouch, Mary Dalrymple, Ron Kaplan, Tracy King, John Maxwell, and Paula Newman. 2008. XLE documentation. Palo Alto Research Center.

Richárd Farkas, Veronika Vincze, György Móra, János Csirik, and György Szarvas. 2010. The CoNLL 2010 Shared Task: Learning to detect hedges and their scope in natural language text. In Proceedings of the 14th Conference on Natural Language Learning, Uppsala, Sweden.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.

Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research (special issue on variable and feature selection), 3:1157-1182, March.

Thorsten Joachims. 1999. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press.

William Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206.

P. Kanerva, J. Kristoferson, and A. Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Philadelphia, Pennsylvania. Erlbaum, Mahwah, New Jersey.

Ping Li, Trevor Hastie, and Kenneth Church. 2006. Very sparse random projections. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), Philadelphia.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

Roser Morante and Walter Daelemans. 2009. Learning the scope of hedge cues in biomedical texts. In Proceedings of the BioNLP 2009 Workshop, pages 28-36, Boulder, Colorado.
In Proceedings Contemporary Mathematics, 26:189 – 206. of the BioNLP 2008 Workshop, Columbus, USA. P. Kanerva, J. Kristoferson, and A Holst. 2000. Ran- Andreas Vlachos and Mark Craven. 2010. Detecting dom indexing of text samples for latent semantic speculative language using syntactic dependencies analysis. In Proceedings of the 22nd Annual Con- and logistic regression. In Proceedings of the 14th ference of the Cognitive Science Society, page 1036, Conference on Natural Language Learning, Upp- Pennsylvania. Mahwah, New Jersey: Erlbaum. sala, Sweden. Ping Li, Trevor Hastie, and Kenneth Church. 2006. Lilja Øvrelid, Jonas Kuhn, and Kathrin Spreyer. 2010a. Very sparse random projections. In Proceedings of Cross-framework parser stacking for data-driven de- the Twelfth ACM SIGKDD International Conference pendency parsing. TAL 2010 special issue on Ma- on Knowledge Discovery and Data Mining (KDD- chine Learning for NLP, 50(3). 06), Philadelphia. Lilja Øvrelid, Erik Velldal, and Stephan Oepen. 2010b. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Syntactic scope resolution in uncertainty analysis. Marcinkiewicz. 1993. Building a large annotated In Proceedings of the 23rd International Conference corpus of English. The Penn Treebank. Computa- on Computational Linguistics, Beijing, China. tional Linguistics, 19:313 – 330. Roser Morante and Walter Daelemans. 2009. Learn- ing the scope of hedge cues in biomedical texts. In Proceedings of the BioNLP 2009 Workshop, pages 28 – 36, Boulder, Colorado. 80