<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Uncertainty in Biomedical Literature: A Simple Disambiguation Approach Using Sparse Random Indexing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Erik Velldal</string-name>
          <email>erikve@ifi.uio.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <fpage>72</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>This paper presents a novel approach to the problem of hedge detection, which involves the identification of so-called hedge cues for labeling sentences as certain or uncertain. This is the classification problem for Task 1 of the CoNLL-2010 Shared Task, which focuses on hedging in biomedical literature. We here propose to view hedge detection as a simple disambiguation problem, restricted to words that have previously been observed as hedge cues. Applying an SVM classifier, the approach achieves the best published results so far for sentence-level uncertainty prediction on the Shared Task test data. We also show that the technique of random indexing can be successfully applied for compressing the dimensionality of the original feature space by several orders of magnitude, while at the same time yielding better classifier performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The problem of hedge detection refers to the task
of identifying uncertainty or speculation in text.
Being the topic of several recent shared tasks and
dedicated workshops,1 this is a problem that is
receiving increased interest within the fields of NLP
and biomedical text mining. In terms of practical
motivation, hedge detection is particularly useful
in relation to information extraction tasks, where
the ability to distinguish between factual and
uncertain information can be of vital importance.</p>
      <p>
        The topic of the Shared Task at the 2010
Conference for Natural Language Learning (CoNLL), is
hedge detection for the domain of biomedical
research literature
        <xref ref-type="bibr" rid="ref3">(Farkas et al., 2010)</xref>
        . The task is
defined for two levels of analysis: while Task 1 is
described as learning to detect sentences
containing uncertainty, the object of Task 2 is learning to
resolve the in-sentence scope of hedge cues. The
focus of the present paper is only on Task 1.
1 Hedge detection played a central part of the shared tasks
of both BioNLP 2009 and CoNLL 2010, as well as the
NeSpNLP 2010 workshop (Negation and Speculation in NLP).
      </p>
      <p>
        A hedge cue is here taken to mean the words
or phrases that signal the attitude of uncertainty
or speculation.2 Examples 1-4 in Figure 1, taken
from the BioScope corpus
        <xref ref-type="bibr" rid="ref19">(Vincze et al., 2008)</xref>
        ,
illustrate how cue words are annotated in the Shared
Task training data. Moreover, the training data
also annotates an entire sentence as uncertain if
it contains a hedge cue, and it is the prediction of
this sentence labeling that is required for Task 1.
      </p>
      <p>The approach presented in this paper extends
that of Velldal et al. (2010), where a maximum
entropy (MaxEnt) classifier is applied to
automatically detect cue words, subsequently labeling
sentences as uncertain if they are found to contain a
cue. Furthermore, in the system of Velldal et al.
(2010), the resolution of the in-sentence scopes of
identified cues, as required for Task 2, is
determined by a set of manually crafted rules operating
on dependency representations. Readers
interested in more details on this set of rules used for
solving Task 2 are referred to Øvrelid et
al. (2010b). The focus of the present paper,
however, is to present a new and simplified approach
to the classification problem relevant for solving
Task 1, and also partially Task 2, viz. the
identification of hedge cues.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview</title>
      <p>In Section 5 we first develop a Support Vector
Machine (SVM) token classifier for identifying cue
words. For a given sentence, the classifier
considers each word in turn, labeling it as a cue or a
non-cue. We will refer to this mode of cue
classification as performing word-by-word classification
(WbW). Later, in Section 6, we go on to show how
better results can be obtained by instead
approaching the task as a disambiguation problem,
restricting our attention to only those tokens whose base
forms have previously been observed as hedge
cues. Reformulating the problem in this way
simplifies the classification task tremendously,
reducing the number of examples that need to be
considered, and thereby also trimming down the relevant
feature space to a much more manageable size. At
the same time, the resulting classifier achieves the
best published results so far on the Shared Task
data (to the best of our knowledge).</p>
      <p>2 As noted by Farkas et al. (2010), most hedge cues
typically fall in the following categories: auxiliaries (may, might,
could, etc.), verbs of hedging or verbs with speculative
content (suggest, suspect, indicate, suppose, seem, appear, etc.),
adjectives or adverbs (probable, likely, possible, unsure, etc.),
or conjunctions (either. . . or, etc.).</p>
      <p>Figure 1: Examples of hedge-cue annotation in the Shared Task
training data, with cues marked in angle brackets and scopes in curly braces:
(1) {ROI ⟨appear⟩ to serve as messengers mediating {directly ⟨or⟩ indirectly} the release of the inhibitory subunit I kappa
B from NF-kappa B}.
(2) {The specific role of the chromodomain is ⟨unknown⟩} but chromodomain swapping experiments in Drosophila
{⟨suggest⟩ that they {⟨might⟩ be protein interaction modules}} [18].
(3) These data {⟨indicate that⟩ IL-10 and IL-4 inhibit cytokine production by different mechanisms}.
(4) Whereas a background set of promoter regions is easy to identify, it is {⟨not clear⟩ how to define a reasonable genomic
sample of enhancers}.</p>
      <p>Additionally, in Section 7 we show how the
very large input feature space can be further
compressed using random indexing. This is essentially
a dimension reduction technique based on sparse
random projections, which we here apply for
feature extraction. We show that training the classifier
on the reduced feature space yields better
performance than when using the original input space.</p>
      <p>The evaluation measures and feature templates are
detailed in Sections 5.2 and 5.3, respectively. Note
that, while preliminary results for all models are
presented for the development data throughout the
paper, the performance of all models is ultimately
compared on the official Shared Task held-out data
in Section 8. We start, however, by providing a
brief overview of related work in Section 3, and
then describe the relevant data sets and
preprocessing steps in Section 4.
</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        The top-ranked system for Task 1 in the official
CoNLL 2010 Shared Task evaluation, described
in
        <xref ref-type="bibr" rid="ref15">(Tang et al., 2010)</xref>
        , approaches cue
identification as a sequence labeling problem. Similarly
to Morante and Daelemans (2009), Tang et al.
(2010) set out to label tokens according to a
BIO scheme, i.e. indicating whether they are at the
Beginning, Inside, or Outside of a hedge cue. Tang et
al. (2010) train both a Conditional Random Field
(CRF) sequence classifier and an SVM-based
Hidden Markov Model (HMM), finally combining the
predictions of both models in a second CRF.
      </p>
      <p>
        In terms of the overall approach, i.e. viewing
the problem as a sequence labeling task, Tang et
al. (2010) are actually representative of the
majority of the ST participants for Task 1
        <xref ref-type="bibr" rid="ref3">(Farkas et al.,
2010)</xref>
        , including the top three performers on the
official held-out data. As noted by Farkas et al.
(2010), the remaining systems approached the task
either as a WbW token classification problem, or
directly as a sentence classification problem.
Examples of the former are the systems of Velldal et
al. (2010) and Vlachos and Craven (2010), sharing
the 4th rank position (out of 24 submitted systems)
for Task 1.
      </p>
      <p>In both the sequence labeling and token
classification approaches, a sentence is labeled as
uncertain if it contains a word labeled as a cue. In
contrast, the sentence classification approaches
instead try to label sentences directly, typically
using Bag-of-Words (BoW) features. In terms of the
official Task 1 evaluation, the sentence classifiers
tended to achieve a somewhat lower relative rank.</p>
    </sec>
    <sec id="sec-4">
      <title>Data Sets and Preprocessing</title>
      <p>
        The training data for the CoNLL 2010 Shared Task
is taken from the BioScope corpus
        <xref ref-type="bibr" rid="ref19">(Vincze et al.,
2008)</xref>
        and consists of 14,541 sentences (or other
root-level utterances) from biomedical abstracts
and articles. Some basic descriptive statistics for
the data sets are provided in Table 1. We see that
roughly 18% of the sentences are annotated as
uncertain. The BioScope corpus also provides
annotation for hedge cues as well as their scope.
      </p>
      <p>Out of
a total of 378,213 tokens, 3,838 are annotated as
being part of a hedge cue. As can be seen, the
total number of cues is somewhat lower (3,327), due
to the fact that some tokens are part of the same
cue, so-called multi-word cues (448 in total), such
as indicate that in Example 3.</p>
      <p>For evaluation purposes, the task organizers
provided newly annotated biomedical articles,
comprising 5,003 additional utterances, of which
790 are annotated as hedged (see overview in
Table 1). The data contains a total of 1,033 cues, of
which 87 are multi-word cues spanning multiple
tokens, comprising 1,148 cue tokens altogether.
</p>
      <sec id="sec-4-1">
        <title>Tokenization</title>
        <p>
          The GENIA tagger
          <xref ref-type="bibr" rid="ref16">(Tsuruoka et al., 2005)</xref>
          takes
an important role in our preprocessing set-up, as
it is specifically tuned for biomedical text.
Nevertheless, its rules for tokenization appear to not
always be optimally adapted for the BioScope
corpus. (For example, GENIA unconditionally
introduces token boundaries for some punctuation
marks that can also occur token-internally.) Our
preprocessing pipeline therefore deploys a
homegrown, cascaded finite-state tokenizer (adapted
from the open-source English Resource Grammar;
Flickinger (2000)), which aims to implement the
tokenization decisions made in the Penn
Treebank
          <xref ref-type="bibr" rid="ref10">(Marcus et al., 1993)</xref>
          —much like GENIA,
in principle—but properly treating certain corner
cases found in the BioScope data.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>PoS Tagging and Lemmatization</title>
        <p>
          For part-of-speech (PoS) tagging and
lemmatization, we combine GENIA and TnT
          <xref ref-type="bibr" rid="ref1">(Brants, 2000)</xref>
          ,
which operates on pre-tokenized input but whose
default model is trained on financial news from the
Penn Treebank. Our general goal here is to take
advantage of the higher PoS accuracy provided by
        </p>
        <p>GENIA in the biomedical domain, while using our
improved tokenization.</p>
        <p>For the vast majority of tokens, we use GENIA
PoS tags and base forms (i.e. lemmas). However,
GENIA does not make a PoS distinction between
proper and common nouns, as in the Penn
Treebank, and hence we give precedence to TnT
outputs for tokens tagged as nominal by both taggers.
</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Hedge Cue Classification</title>
      <p>
        This section develops a binary cue classifier
similar to that of Velldal et al. (2010), but using
the framework of large-margin SVM classification
        <xref ref-type="bibr" rid="ref17">(Vapnik, 1995)</xref>
        instead of MaxEnt. For a given
sentence, the word-by-word classifier (referred to
as CWbW) considers each token in turn, labeling it
as a cue or non-cue. Any sentence found to
contain a cue is subsequently labeled as uncertain.
      </p>
      <sec id="sec-5-1">
        <title>Defining the Training Instances</title>
        <p>As annotated in the training data, it is possible for
a hedge cue to span multiple tokens, e.g. as in
whether or not. The majority of the multi-word
cues in the training data are very infrequent,
however, most occurring only once, and the classifier
itself is not sensitive to the notion of multi-word
cues. A given word token is considered a cue as
long as it falls within the span of a cue annotation.</p>
        <p>
          As presented to the learner, a given token wi is
represented as a feature vector f (wi) = f~i ∈ ℜd.
Each dimension fij represents a feature function
which can encode arbitrary properties of wi.
Section 5.3 describes the particular features we are
using. Each training example can be thought of as a
pair of a feature vector and a label, ⟨f~i, yi⟩. If wi
is a cue we have yi=+1, while for non-cues the
label is −1. For estimating the actual SVM
classifier for predicting the labels on unseen examples
we use the SVMlight toolkit
          <xref ref-type="bibr" rid="ref6">(Joachims, 1999)</xref>
          .
We will be reporting precision, recall and F1 for
two different levels of evaluation: the
sentence-level and the token-level. While the
token-level scores indicate how well the classifiers
succeed in identifying individual cue words, the
sentence-level scores are what actually correspond
to Task 1, i.e. correctly identifying sentences as
being certain or uncertain.
        </p>
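        <p>To make this set-up concrete, the following is a small illustrative sketch (not the authors' actual code) of encoding tokens as feature vectors and training a linear SVM; scikit-learn's LinearSVC stands in for the SVMlight toolkit, and the toy features and examples are invented for the illustration.</p>

```python
# Illustrative sketch only: toy tokens encoded as feature vectors
# f(w_i), with labels +1 (cue) / -1 (non-cue). scikit-learn's
# LinearSVC stands in for the SVMlight toolkit used in the paper.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

examples = [
    ({"lemma": "suggest", "pos": "VBZ"}, 1),
    ({"lemma": "indicate", "pos": "VBP"}, 1),
    ({"lemma": "inhibit", "pos": "VBP"}, -1),
    ({"lemma": "protein", "pos": "NN"}, -1),
]
feats = [f for f, y in examples]
labels = [y for f, y in examples]

vec = DictVectorizer()            # maps feature dicts to sparse vectors
X = vec.fit_transform(feats)      # one row per training token

clf = LinearSVC().fit(X, labels)
pred = clf.predict(vec.transform([{"lemma": "suggest", "pos": "VBZ"}]))
# a sentence is labeled uncertain if any of its tokens is a cue
uncertain = any(p == 1 for p in pred)
```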
      </sec>
      <sec id="sec-5-2">
        <title>Feature Templates</title>
        <p>
          In the Shared Task system description paper of
Velldal et al. (2010), results are reported for
MaxEnt cue classifiers using a wide variety of feature
types of both surface-oriented and syntactic
nature. For the latter, Velldal et al. (2010) define
a range of syntactic and dependency-based
features extracted from parses produced by the
MaltParser
          <xref ref-type="bibr" rid="ref12">(Nivre et al., 2006; Øvrelid et al., 2010a)</xref>
          and the XLE
          <xref ref-type="bibr" rid="ref2">(Crouch et al., 2008)</xref>
          , recording
information about dependency relations,
subcategorization frames, etc. However, it turned out that the
simpler lexical and surface-oriented features were
sufficient for the identification of hedge cues.
        </p>
        <p>Drawing on the observation above, the
classifiers trained in this paper are only based on
simple sequence-oriented n-gram features collected
for PoS-tags, lemmas and surface forms. For all
these types of features we record neighbors for up
to 3 positions left/right of the focus word. For
increased generality, all these n-gram features also
include non-lexicalized variants, i.e. excluding the
focus word itself.
</p>
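        <p>As an illustration of these templates, the following sketch extracts lexicalized and non-lexicalized n-gram features around a focus token; the helper and feature names are our own invention, not the system's actual feature encoding.</p>

```python
# Illustrative n-gram feature templates: for a focus token, collect
# n-grams of a given attribute up to 3 positions left/right, plus
# non-lexicalized variants that exclude the focus word itself.
def token_features(tokens, i, attr, max_dist=3):
    """tokens: list of dicts with 'form', 'lemma', 'pos' keys."""
    feats = {}
    n = len(tokens)
    for dist in range(1, max_dist + 1):
        lo, hi = max(0, i - dist), min(n, i + dist + 1)
        # lexicalized n-gram: neighbors plus the focus word
        feats["%s_ngram_%d" % (attr, dist)] = "_".join(t[attr] for t in tokens[lo:hi])
        # non-lexicalized variant: the same window minus the focus word
        ctx = [t[attr] for t in tokens[lo:i]] + [t[attr] for t in tokens[i + 1:hi]]
        feats["%s_ctx_%d" % (attr, dist)] = "_".join(ctx)
    feats["pos_focus"] = tokens[i]["pos"]
    return feats

toks = [{"form": w, "lemma": w.lower(), "pos": p}
        for w, p in [("These", "DT"), ("data", "NNS"), ("indicate", "VBP"),
                     ("that", "IN"), ("IL-10", "NN")]]
f = token_features(toks, 2, "lemma")   # features for the focus word "indicate"
```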
      </sec>
      <sec id="sec-5-3">
        <title>Preliminary Results</title>
        <p>Instantiating all feature templates described above
for the BioScope training data, using the maximal
span for all n-grams (n=4, i.e. including up to 3
neighbors), we end up with a total of more than
6,500,000 unique feature types. However, after
testing different feature configurations, it turns out
that the best performing model only uses a small
subset of this feature pool. The configuration we
will be using throughout this paper includes:
n-grams over base forms ±3 positions of the focus
word; n-grams over surface forms up to +2
positions only; and PoS of the focus word. This
results in a set of roughly 2,630,000 feature types.
In addition to reporting classifier performance for
this feature configuration, we also provide results
for a baseline model using only unigram features
over surface forms. The behavior of this
classifier is similar to what we would expect from
simply compiling a list of cue words from the training
data, based on the majority usage of each word as
cue or non-cue.</p>
        <p>As shown in Table 2, after averaging results
from 10-fold cross-validation on the training data,
the baseline cue classifier (shown as CUniWbW)
achieves a sentence-level F1 of 88.69 and a
token-level F1 of 79.59. In comparison, the classifier
using all the available n-gram features (CWbW)
achieves F-scores of 91.19 and 87.80 on the
sentence-level and token-level, respectively. We
see that the improvement in performance
compared to the baseline is most pronounced on the
token-level, but the differences in scores for both
levels are found to be statistically significant at p
&lt; 0.005 using a two-tailed sign-test.</p>
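        <p>For reference, a two-tailed sign-test over paired classifier decisions can be sketched as follows (a generic textbook implementation; we cannot confirm it matches the authors' exact procedure): only the items where exactly one classifier is correct are considered, and the split is tested against a fair coin.</p>

```python
# Generic two-tailed sign-test over paired classifier decisions:
# count items where exactly one classifier agrees with gold, and
# compute the probability of a split at least this extreme under
# a fair coin.
from math import comb

def sign_test_p(gold, pred_a, pred_b):
    a_only = sum(a == g != b for g, a, b in zip(gold, pred_a, pred_b))
    b_only = sum(b == g != a for g, a, b in zip(gold, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0          # the classifiers never disagree
    k = min(a_only, b_only)
    # double the one-sided binomial tail probability, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

gold   = [1, 1, 1, 1, -1, -1, -1, -1, 1, 1]
pred_a = list(gold)                            # always correct
pred_b = [1, -1, -1, 1, 1, 1, -1, -1, -1, 1]   # wrong on five items
p = sign_test_p(gold, pred_a, pred_b)          # 2 * (1/32) = 0.0625
```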
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Reformulating the Classification</title>
    </sec>
    <sec id="sec-7">
      <title>Problem</title>
      <p>An error analysis of our initial WbW classifier
revealed that it is not able to generalize to new
hedge cues beyond those that have already been
observed during training. Even after adding the
non-lexicalized variants of all feature types (i.e.
making features more general by not including the
focus word itself), the classifier still fails to
identify any unseen hedge cues whose base form did
not occur as a cue in the training material. On the
other hand, only very few of the test cues are
actually unseen (≈1.5%), meaning that the set of cue
words might reasonably be treated as a near-closed
class (at least for the biomedical data considered
in this study). As a consequence of these
observations, we here reformulate the problem as follows.
Instead of approaching the task as a classification
problem defined for all words, we only consider
words that have a base form observed as a hedge
cue in the training material. In effect, any word
whose base form has never been observed as a cue
in the training data is automatically considered to
be a non-cue when testing. Part of the rationale
here is that, while it seems reasonable to assume
that any word occurring as a cue can also occur as
a non-cue, the converse is less likely.</p>
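      <p>The reformulation amounts to a simple filter in front of the classifier, as in the following illustrative sketch (the classify() stand-in is hypothetical):</p>

```python
# Sketch of the reformulation described above: only tokens whose base
# form was observed as a hedge cue in training are passed to the
# classifier; everything else defaults to non-cue.
def collect_cue_lemmas(training_tokens):
    """training_tokens: (lemma, is_cue) pairs from the training data."""
    return {lemma for lemma, is_cue in training_tokens if is_cue}

def label_tokens(test_lemmas, cue_lemmas, classify):
    labels = []
    for lemma in test_lemmas:
        if lemma in cue_lemmas:
            labels.append(classify(lemma))   # disambiguate known cue words
        else:
            labels.append(-1)                # unseen base form: non-cue
    return labels

cue_lemmas = collect_cue_lemmas(
    [("suggest", True), ("might", True), ("protein", False)])
labels = label_tokens(["might", "protein", "binding"], cue_lemmas,
                      classify=lambda lemma: 1)
```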
      <p>While the training data contains a total of
approximately 17,600 unique base forms (given the
preprocessing outlined in Section 4), only 143 of
these ever occur as hedge cues. By restricting the
classifier to only this subset of words, we manage
to simplify the classification problem
tremendously, but without any loss in performance.</p>
      <p>Note that, although we view the task as a
disambiguation problem, it is not feasible to train
separate classifiers for each individual base form.
The frequency distribution of the cue words in the
training material is very skewed with most cues
being very rare—many occurring as a cue only
once (≈ 40%). (Note that most of these words
also have many additional occurrences in the
training data as non-cues, however.) For the majority
of the cue words, then, it seems we cannot hope
to gather enough reliable information to train
individual classifiers. Instead, we want to be able to
draw on information from the more frequently
occurring cues also when classifying or
disambiguating the less frequent ones. Consequently, we still
train a single global classifier as for the original
WbW set-up. However, as the disambiguation
classifier still only needs to consider a small
subset of the number of words considered by the full
WbW classifier, the number of instantiated feature
types is, of course, greatly reduced.</p>
      <p>For the full WbW classification, the number of
training examples is 378,213. Using the feature
configuration described in Section 5.4, this
generates a total of roughly 2,630,000 feature types. For
the disambiguation model, using the same feature
configuration, the number of instantiated feature
types is reduced to just below 670,000, as
generated for 94,155 training examples.</p>
      <p>Running the new disambiguation classifier by
10-fold cross validation on the training data, we
find that it has substantially better recall than the
original WbW classifier. The results are shown in
the row CDisamb in Table 2. Across all levels of
evaluation the CDisamb model achieves a boost in
F1 compared to CWbW. However, when applying
a two-tailed sign-test, considering differences in
classifier decisions on both the sentence-level and
token-level, only the latter differences are found to
be statistically significant.</p>
    </sec>
    <sec id="sec-8">
      <title>Sparse Random Indexing</title>
      <p>As mentioned in Section 5.1, each training
example is represented by a d-dimensional feature
vector f~i ∈ ℜd. Given n examples and d
features, the feature vectors can be thought of as rows
in a matrix F ∈ ℜn×d. One potential problem
with using a vector-based numerical encoding of
local context features such as those described in
Section 5.3, is that the dimensionality of the
feature space grows very rapidly with the number of
training examples. Using local features, e.g.
context windows recording properties such as
direction and distance, the number of unique features
grows much faster than when using, say, BoW
features. In order to make the vector encoding
scalable, we would like to somehow be able to put a
bound on the number of dimensions.</p>
      <p>As mentioned above, even after simplifying the
classification problem, our input feature space is
still rather huge, totaling roughly 670,000 feature
types. Given that the number of training examples
is only around n ≈ 95,000 we have that d ≫ n, and
whenever we want to add more feature templates
or add more training data, this imbalance will only
become more pronounced. It is also likely that
many of the n-gram features in our model will
not be relevant for the classification of new data
points. The combination of many irrelevant
features, and few training examples compared to the
number of features, makes the learner prone to
overfitting.</p>
      <p>
        In previous attempts to reduce the feature space,
we have applied several feature selection schemes,
such as filtering on the correlation coefficient
between a feature and a class label, or using simple
frequency cutoffs. Although such methods are
effective in reducing the number of features, they
typically do so at the expense of classifier
performance. Due to both data sparseness and the
likelihood of many features being only locally relevant,
it is difficult to reliably assess the relevance of the
input features, and we risk filtering out many
relevant features as well. Using simple filtering
methods, we did not manage to considerably reduce the
number of features without also significantly
reducing the performance of the classifier. Although
better results can be expected by using so-called
wrapper methods
        <xref ref-type="bibr" rid="ref5">(Guyon and Elisseeff, 2003)</xref>
        instead, this is not computationally feasible for large
feature sets.
      </p>
      <p>As an alternative to such feature selection
methods, we here report on experiments with a
technique known as random indexing (RI). This allows
us to drastically compress the feature space
without explicitly throwing out any features.</p>
      <p>The technique of random indexing was initially
introduced by Kanerva et al. (2000) for modeling
the semantic similarity of words by their
distribution in text.3 In fact, RI forms part of a larger
family of dimension reduction techniques based
on random projections. Such methods typically
work by multiplying the feature matrix F ∈ ℜn×d
by a random matrix R ∈ ℜd×k, where k ≪ d,
thereby reducing the number of dimensions from
d to k:</p>
      <p>
G = FR ∈ ℜn×k, with k ≪ d    (5)
Given that k is sufficiently high, the
JohnsonLindenstrauss lemma
        <xref ref-type="bibr" rid="ref7">(Johnson and Lindenstrauss,
1984)</xref>
        tells us that the pairwise distances (and
thereby separability) in F can be preserved with
high probability within the lower-dimensional
space G
        <xref ref-type="bibr" rid="ref9">(Li et al., 2006)</xref>
        . While the only
condition on the entries of R is that they are i.i.d. with
zero mean, they are typically also specified to have
unit variance
        <xref ref-type="bibr" rid="ref9">(Li et al., 2006)</xref>
        .
      </p>
      <p>
        One advantage of the particular random
indexing approach is that the full n × d feature
matrix F does not need to be explicitly computed.
The method constructs the representation of the
data in G by incrementally accumulating so-called
index vectors assigned to each of the d features
        <xref ref-type="bibr" rid="ref13 ref14">(Sahlgren and Karlgren, 2005)</xref>
        . The process can
be described by the following two simple steps:
- When a new feature is instantiated, it is
assigned a randomly generated vector of a fixed
dimensionality k, consisting of a small
number of −1s and +1s (the remaining elements
being 0). This is then the so-called index
vector of the feature. (The index of the ith
feature corresponds to the ith column of R.)
- The vector representing a given training
example (the jth row of G represents the jth
example) is then constructed by simply
summing the random index vectors of its features.
Note that, although we want to have k ≪ d, we
still operate in relatively high-dimensional space
(with k being on the order of thousands). As
noted by Sahlgren (2005), high-dimensional
vectors having random directions are very likely to be
close to orthogonal, and the approximation to F
will generally be better the higher we set k.
3 Readers are referred to Sahlgren (2005) for a good
introduction to random indexing.
      </p>
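      <p>The two steps above can be sketched as follows (an illustrative NumPy implementation under the scheme just described, with k=5,000 and eight non-zero entries per index vector; this is not the authors' code):</p>

```python
# Sketch of the two-step random indexing scheme: each feature gets a
# fixed k-dimensional index vector with a few random +1/-1 entries
# (a column of R), and an example's representation (a row of G) is
# the sum of the index vectors of its active features.
import numpy as np

rng = np.random.default_rng(0)
K = 5000          # reduced dimensionality
NONZERO = 8       # non-zero entries per index vector (a ratio of 0.16%)
index_vectors = {}

def index_vector(feature):
    """Lazily assign a sparse random index vector to a new feature."""
    if feature not in index_vectors:
        v = np.zeros(K)
        pos = rng.choice(K, size=NONZERO, replace=False)
        # equal numbers of +1s and -1s give zero mean
        v[pos[: NONZERO // 2]] = 1.0
        v[pos[NONZERO // 2 :]] = -1.0
        index_vectors[feature] = v
    return index_vectors[feature]

def encode(features):
    """Row of G for one example: the sum over its features' index vectors."""
    return sum(index_vector(f) for f in features)

g = encode(["lemma=suggest", "pos=VBZ", "ctx=data_that"])
```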
      <p>
        Finally, it is worth noting that RI has
traditionally been applied on the type level, with the
purpose of accumulating context vectors that
represent the distributional profiles of words in a
semantic space model
        <xref ref-type="bibr" rid="ref13 ref14">(Sahlgren, 2005)</xref>
        . Here, on
the other hand, we apply it on the instance level
and as a general means of compressing the feature
space of a learning problem.
      </p>
      <sec id="sec-8-1">
        <title>Tuning the Random Indexing</title>
        <p>Regarding the ratio of non-zero elements, the
literature on random projections contains a wide
range of suggestions as to how the entries of the
random matrix R should be initialized. In the
context of random indexing, Sahlgren and
Karlgren (2005) set approximately 1% of the entries
in each index to +1 or −1. It is worth bearing in
mind, however, that the computational complexity
of dot-product operations (as used extensively by
the SVM learner) depends not only on the number
of dimensions itself, but on the number of
non-zero elements. We therefore want to take care
to avoid ending up with a reduced space that is
much more dense. Nevertheless, the appeal of
using a random projection technique is in our case
more related to its potential as a feature extraction
step, and less to its potential for speeding up
computations and reducing memory load, as the
original feature vectors are already very sparse. After
experimenting with different parametrizations, it
seems that the classifier performance on our data
sets are fairly stable with respect to varying the
ratio of non-zeros. Moreover, we find that the
non-zero entries can be very sparsely distributed, e.g.
≈ 0.05–0.2%, without much loss in classifier
performance. Figure 2a shows the effect of varying
the ratio of non-zero elements while keeping the
dimensionality fixed (at k=5,000), always
assigning an equal number of +1s and −1s (giving zero
mean and unit variance). For each parametrization
we perform a batch of 5 experiments using
different random initializations of the index vectors.
The scores shown in Figure 2a are the average and
maximum within each batch. As can be seen, with
index vectors of 5,000 elements, using 8 non-zero
entries (corresponding to a ratio of 0.16%) here
seems to strike a reasonable balance between
index density and performance.</p>
        <p>[Figure 2: sentence-level F1 when varying (a) the ratio of non-zero
entries and (b) the dimensionality k (5,000; 10,000; 20,000; and the
full 670K).]</p>
        <p>As expected, we do, however, see a clear
deterioration of classifier accuracy if the
dimensionality of the index vectors is set very low. Figure 2b
shows the effect of varying the dimensionality k
of the index vectors, while fixing the ratio of
non-zero entries per vector to 0.16%. Again we
perform batches of 5 experiments for each value of k,
reporting the average and maximum within each
batch. For our cue classification data, the
positive effect of increasing k seems to flatten out at
around k=5,000. When considering the standard
deviation of scores within each batch, however, the
variability of the results seems to steadily decrease
as k increases. For example, while we find σ=1.34
for the set of runs using k=1,250, we find σ=0.29
for k=20,000.</p>
        <p>When looking at the maximum scores shown in
Figure 2b, one of the runs using k=5,000 turns
out to have the peak performance, achieving a
(sentence-level) F1 of 90.38. Not only does it
score higher than any of the other RI-runs with
k&gt;5,000, it also outperforms the original CDisamb
model, which achieves an F1 of 89.36 for the same
single “fold” (the models in Figure 2b are tested
using 1/10th of the training material).</p>
        <p>In our experience, although the random
projection provided by the RI vectors only represents an
approximation to the original input space, it still
appears to preserve a lot more information than
feature selection based on filtering methods.
</p>
      </sec>
      <sec id="sec-8-2">
        <title>Preliminary Results</title>
        <p>The bottom row of Table 2 (CRIDisamb) shows the
results of applying an SVM classifier by full
10-fold cross-validation over the training set using the
same random index assignments that yielded the
maximum F1 in Figure 2b for k=5,000 (with eight
randomly set non-zeros in each index). We see that
the performance of CRIDisamb is actually slightly
lower than for CDisamb. The differences are not
detected as being significant, though (applying the
sign-test in the same manner as described above).
Moreover, it should also be pointed out that we
have not yet tried tuning the random indexing by
multiple runs of full 10-fold cross-validation on
the training data, which would be expected to
improve these results. Given the fact that the
effective feature space for the classifier is reduced from
670,000 to just 5,000 dimensions, we find it
notable that the CRIDisamb model achieves comparable
results, with only preliminary tuning.</p>
<p>Another important observation is that the
complexity of the resulting SVM, in terms of the
number of support vectors (SVs), is considerably
reduced for the RI-model: while the number of SVs
for Cdisamb averages just below 8% of the
training examples, this is reduced to just above 4% for
CRIdisamb (using the SVMlight default settings). In
addition to halving the number of SVs, as well as
reducing the feature space by two orders of
magnitude, the upper bound on the VC-dimension (as
estimated by SVMlight) is also reduced by 12%. It
is also worth noting that the run-time differences
for estimating the SVM on the original input space
and on the reduced (but slightly denser) feature space
are negligible (≈ 5 CPU-seconds more for the
RI-model when re-training on the full training set).</p>
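<p>The role of support vectors as a complexity measure can be illustrated with a self-contained sketch (this is not our SVMlight setup; the data, hyperparameters, and the subgradient-descent trainer are synthetic stand-ins): a linear SVM is fit on the hinge loss, and we count the examples that end up on or inside the margin, since these are the examples the decision function actually depends on.</p>

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=500, lr=0.1):
    """Subgradient descent on  lam/2 * ||w||^2 + mean(hinge loss)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(epochs):
        margins = y * (X @ w)
        active = margins < 1.0          # examples violating the margin
        grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        w -= lr / np.sqrt(t + 1) * grad  # diminishing step size
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1.0, -1.0)

w = train_linear_svm(X, y)
sv = (y * (X @ w)) <= 1.0  # "support vectors": on or inside the margin
print(f"{int(sv.sum())} of {len(X)} examples act as support vectors")
```

<p>The fewer examples the decision function depends on, the simpler the learned model, which is the sense in which the RI-model above is less complex.</p>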
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Held-Out Testing</title>
      <p>Table 3 presents the final results for the various
classifiers developed in this paper, testing them on
the biomedical articles of the CoNLL 2010 Shared
Task held-out test set (see Table 1). In addition to
the evaluation results for our own classifiers,
Table 3 also includes the official test results for
the system described by Tang et al. (2010). Their
sequence classifier, combining a CRF classifier
and a large-margin HMM model, obtained the best
results in the official ST evaluation for Task 1
(i.e., sentence-level uncertainty detection).</p>
<p>As seen from Table 3, all of our SVM
classifiers, CWbW, Cdisamb, and CRIdisamb, achieve a
higher sentence-level F1 than the system of Tang
et al. (2010) (though it is unknown whether the
differences are statistically significant). We also
note that our reformulation of the cue
classification task as a disambiguation problem also leads
to better performance on the held-out data, with
Cdisamb performing slightly better than CWbW
across both evaluation levels. Interestingly, the
best performer of them all proves to be the
random indexing model (CRIdisamb), even though this
model was not the top performer on the training
data. One possible explanation for the strong
held-out performance of CRIdisamb is that the reduced
complexity of this classifier (see Section 7.2) has
made it less prone to overfitting, leading to better
generalization performance on new data.
Applying the sign-test as described above to the
classifier decisions of CRIdisamb, we find statistically
significant differences with respect to CWbW but not
with respect to Cdisamb. Nonetheless, the
encouraging results of the CRIdisamb model on the held-out
data mean that further tuning of the RI
configuration on the training data will be a priority for future
experiments.</p>
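<p>For reference, the paired sign-test used for these comparisons can be sketched as follows: discard the sentences on which the two classifiers agree, and compute an exact two-sided binomial p-value over the remaining disagreements. The disagreement counts below are made up for illustration:</p>

```python
import math

def sign_test(wins_a, wins_b):
    """Exact two-sided sign test: p-value for the null hypothesis that
    a and b are equally likely to win a disagreement (ties removed)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test(30, 14), 3))  # well below 0.05: significant difference
print(round(sign_test(22, 20), 3))  # large p-value: no evidence of a difference
```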
      <p>
        It is also worth noting that many of the
systems participating in the ST challenge used fairly
complex and resource-heavy feature types, being
sensitive to document structure, grammatical
relations, etc.
        <xref ref-type="bibr" rid="ref3">(Farkas et al., 2010)</xref>
        . The fact that
comparable or better results can be obtained using the
relatively simple approach demonstrated in this
paper—with low cost in terms of both
computation and external resources—might lower the bar
for employing a hedge detection component in an
actual IE system.
      </p>
      <p>Finally, we also observe that our simple
unigram baseline classifier proves to be surprisingly
competitive. In fact, comparing its Task 1 F1 to
those of the official ST evaluation, it actually
outranks 7 of the 24 submitted systems.
</p>
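<p>Concretely, the sentence-level scores reported throughout are the standard precision, recall and F1 for the uncertain class, which can be sketched as follows (the gold and predicted labels are a toy example):</p>

```python
def sentence_scores(gold, pred):
    """Precision, recall and F1 for the positive ('uncertain') class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 1, 0, 0, 1, 0]   # 1 = uncertain, 0 = certain
pred = [1, 0, 0, 1, 1, 0]
p, r, f = sentence_scores(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.67 R=0.67 F1=0.67
```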
    </sec>
    <sec id="sec-10">
      <title>Conclusions</title>
      <p>This paper has presented the incremental
development of uncertainty classifiers for detecting
hedging in biomedical text—the topic of the CoNLL
2010 Shared Task. Using simple n-gram features
over words, lemmas and PoS-tags, we first develop
a (linear) SVM cue classifier that outperforms the
top-ranked system for Task 1 in the official Shared
Task evaluation (i.e. sentence-level uncertainty
detection). We then show how the original
classification task can be greatly simplified by
viewing it as a disambiguation task restricted to only
those words that have previously been observed as
hedge cues. Operating in a smaller (though still
fairly large) feature space, this second classifier
achieves even better results. Finally, we apply the
method of random indexing, further reducing the
dimensionality of the feature space by two orders
of magnitude. This final classifier—combining
an SVM-based disambiguation model with
random indexing—is our best performer, achieving
a sentence-level F1 of 86.64 on the CoNLL 2010
Shared Task held-out data.</p>
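<p>The disambiguation view summarized above can be sketched in a few lines: only tokens previously observed as hedge cues are ever passed to the classifier, all other tokens are deterministic non-cues, and a sentence is labeled uncertain iff at least one token is classified as a cue. The cue lexicon and the classifier stub here are purely illustrative:</p>

```python
# Hypothetical lexicon of words observed as hedge cues in training data:
KNOWN_CUES = {"may", "might", "suggest", "possible", "likely", "indicate"}

def classify_token(token, sentence):
    """Stand-in for the SVM disambiguator over n-gram features;
    for illustration it simply predicts 'cue' for every candidate."""
    return True

def sentence_is_uncertain(tokens):
    # Only known-cue words are candidates; everything else is skipped.
    return any(t in KNOWN_CUES and classify_token(t, tokens) for t in tokens)

print(sentence_is_uncertain("these results suggest a role for p53".split()))  # True
print(sentence_is_uncertain("the protein binds to the receptor".split()))     # False
```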
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
<p>The experiments reported in this paper represent an extension
of previous joint work with Stephan Oepen and Lilja Øvrelid.
The author also wishes to thank the anonymous reviewers, as well as
colleagues at the University of Oslo, for their valuable comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Brants</surname>
          </string-name>
          .
          <year>2000</year>
          .
<article-title>TnT: A statistical Part-of-Speech tagger</article-title>
          .
          <source>In Proceedings of the Sixth Conference on Applied Natural Language Processing</source>
          , pages
          <fpage>224</fpage>
          -
          <lpage>231</lpage>
          , Seattle, WA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dick</given-names>
            <surname>Crouch</surname>
          </string-name>
          , Mary Dalrymple, Ron Kaplan,
          <string-name>
            <given-names>Tracy</given-names>
            <surname>King</surname>
          </string-name>
          , John Maxwell, and
          <string-name>
            <given-names>Paula</given-names>
            <surname>Newman</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>XLE documentation</article-title>
          . Palo Alto Research Center.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
<string-name>
            <given-names>Richárd</given-names>
            <surname>Farkas</surname>
          </string-name>
          , Veronika Vincze, György Móra, János Csirik, and
          <string-name>
            <given-names>György</given-names>
            <surname>Szarvas</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The CoNLL 2010 Shared Task: Learning to detect hedges and their scope in natural language text</article-title>
          .
          <source>In Proceedings of the 14th Conference on Natural Language Learning</source>
          , Uppsala, Sweden.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Dan</given-names>
            <surname>Flickinger</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>On building a more efficient grammar by exploiting types</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>15</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Guyon</surname>
          </string-name>
          and
          <string-name>
            <given-names>André</given-names>
            <surname>Elisseeff</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>An introduction to variable and feature selection</article-title>
          .
          <source>Journal of Machine Learning Research; Special issue on variable and feature selection</source>
          ,
          <volume>3</volume>
          :
          <fpage>1157</fpage>
          -
          <lpage>1182</lpage>
          ,
          <year>March</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Making large-scale SVM learning practical</article-title>
          . In B. Schölkopf, C. Burges, and A. Smola, editors,
          <source>Advances in Kernel Methods: Support Vector Learning</source>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>William</given-names>
            <surname>Johnson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joram</given-names>
            <surname>Lindenstrauss</surname>
          </string-name>
          .
          <year>1984</year>
          .
          <article-title>Extensions of Lipschitz mappings into a Hilbert space</article-title>
          .
          <source>Contemporary Mathematics</source>
          ,
          <volume>26</volume>
          :
          <fpage>189</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kristoferson</surname>
          </string-name>
          , and
          <string-name>
<given-names>A.</given-names>
            <surname>Holst</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Random indexing of text samples for latent semantic analysis</article-title>
          .
          <source>In Proceedings of the 22nd Annual Conference of the Cognitive Science Society</source>
          , page
          <fpage>1036</fpage>
          , Pennsylvania. Mahwah, New Jersey: Erlbaum.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Ping</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kenneth</given-names>
            <surname>Church</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Very sparse random projections</article-title>
          .
          <source>In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD06)</source>
          , Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Mitchell P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          , Beatrice Santorini, and Mary Ann Marcinkiewicz.
          <year>1993</year>
          .
<article-title>Building a large annotated corpus of English: The Penn Treebank</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          :
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Roser</given-names>
            <surname>Morante</surname>
          </string-name>
          and
          <string-name>
            <given-names>Walter</given-names>
            <surname>Daelemans</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning the scope of hedge cues in biomedical texts</article-title>
          .
          <source>In Proceedings of the BioNLP 2009 Workshop</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>36</lpage>
          , Boulder, Colorado.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Joakim</given-names>
            <surname>Nivre</surname>
          </string-name>
          , Johan Hall, and
          <string-name>
            <given-names>Jens</given-names>
            <surname>Nilsson</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>MaltParser: A data-driven parser-generator for dependency parsing</article-title>
          .
          <source>In Proceedings of the Fifth International Conference on Language Resources and Evaluation</source>
          , pages
          <fpage>2216</fpage>
          -
          <lpage>2219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jussi</given-names>
            <surname>Karlgren</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Automatic bilingual lexicon acquisition using random indexing of parallel corpora</article-title>
          .
          <source>Journal of Natural Language Engineering</source>
          , Special Issue on Parallel Texts,
          <volume>11</volume>
          (
          <issue>3</issue>
          ), September.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Magnus</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>An introduction to random indexing</article-title>
          .
          <source>In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE)</source>
          , Copenhagen, Denmark.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Buzhou</given-names>
            <surname>Tang</surname>
          </string-name>
          , Xiaolong Wang,
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Wang</surname>
          </string-name>
          , Bo Yuan, and
          <string-name>
            <given-names>Shixi</given-names>
            <surname>Fan</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A cascade method for detecting hedges and their scope in natural language text</article-title>
          .
          <source>In Proceedings of the 14th Conference on Natural Language Learning</source>
          , Uppsala, Sweden.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Yoshimasa</given-names>
            <surname>Tsuruoka</surname>
          </string-name>
          , Yuka Tateishi,
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , Tomoko Ohta, John McNaught,
          <string-name>
            <given-names>Sophia</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2005</year>
          .
<article-title>Developing a robust Part-of-Speech tagger for biomedical text</article-title>
          .
          <source>In Advances in Informatics</source>
          , pages
          <fpage>382</fpage>
          -
          <lpage>392</lpage>
          . Springer, Berlin.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>The Nature of Statistical Learning Theory</article-title>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Erik</given-names>
            <surname>Velldal</surname>
          </string-name>
          , Lilja Øvrelid, and
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Oepen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Resolving speculation: MaxEnt cue classification and dependency-based scope rules</article-title>
          .
          <source>In Proceedings of the 14th Conference on Natural Language Learning</source>
          , Uppsala, Sweden.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Veronika</given-names>
            <surname>Vincze</surname>
          </string-name>
          , György Szarvas, Richárd Farkas, György Móra, and
          <string-name>
            <given-names>János</given-names>
            <surname>Csirik</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The BioScope corpus: Annotation for negation, uncertainty and their scope in biomedical texts</article-title>
          .
          <source>In Proceedings of the BioNLP 2008 Workshop</source>
          , Columbus, USA.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Vlachos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Craven</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Detecting speculative language using syntactic dependencies and logistic regression</article-title>
          .
          <source>In Proceedings of the 14th Conference on Natural Language Learning</source>
          , Uppsala, Sweden.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Lilja</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          , Jonas Kuhn, and
          <string-name>
            <given-names>Kathrin</given-names>
            <surname>Spreyer</surname>
          </string-name>
          . 2010a.
          <article-title>Cross-framework parser stacking for data-driven dependency parsing</article-title>
          .
<source>TAL 2010 special issue on Machine Learning for NLP</source>
          ,
          <volume>50</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Lilja</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          , Erik Velldal, and
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Oepen</surname>
          </string-name>
          . 2010b.
          <article-title>Syntactic scope resolution in uncertainty analysis</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Computational Linguistics</source>
          , Beijing, China.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>