Neural Architectures for Biological Inter-Sentence Relation
Extraction
Enrique Noriega-Atala, Peter M. Lovett, Clayton T. Morrison and Mihai Surdeanu
The University of Arizona, Tucson, Arizona, USA
Abstract
We introduce a family of deep-learning architectures for inter-sentence relation extraction, i.e., relations where the participants are not necessarily in the same sentence. We apply these architectures to an important use case in the biomedical domain: assigning biological context to biochemical events. In this work, biological context is defined as the type of biological system within which the biochemical event is observed. The neural architectures encode and aggregate multiple occurrences of the same candidate context mention to determine whether it is the correct context for a particular event mention. We propose two broad types of architectures: the first aggregates multiple instances that correspond to the same candidate context with respect to an event mention before emitting a classification; the second independently classifies each instance and uses the results to vote for the final class, akin to an ensemble approach. Our experiments show that the proposed neural classifiers are competitive and that some achieve better performance than previous state-of-the-art traditional machine learning methods, without the need for feature engineering. Our analysis shows that the neural methods particularly improve precision compared to traditional machine learning classifiers, and also demonstrates how the difficulty of inter-sentence relation extraction increases as the distance between the event and context mentions increases.
Keywords
Inter-sentence relation extraction, biological context, natural language processing, neural networks
1. Introduction

Extracting biochemical interactions that describe mechanistic information from scientific literature is a task that has been well studied by the NLP community [1, 2, 3]. Automated event detection systems such as [4, 5, 6, 7, 8, 9, 10, 11] are able to detect and extract biochemical events with high throughput and good recall. The information extracted with such tools enables scientists and researchers to analyze, study and discover mechanistic pathways and their characteristics by aggregating the interactions and biological processes described in the scientific literature.

However, when dealing with such mechanistic processes it is important to identify the biological context in which they hold. Here, biological context means the type of biological system, described at different levels of granularity, such as species, organ, tissue, cellular component, and/or cell line, within which the extracted biochemical interactions are observed. Knowing the biological context is important to correctly interpret the mechanistic pathways described by the literature. For example, some tumors associated with oncogenic Ras in humans are different from those in mice, suggesting that the Ras pathway differs in both species [12]. Ignoring the biological context information, specifically the species in the prior example, can mislead the reader to draw incorrect conclusions.

Biological context is often not explicitly stated in the same clause that contains the biochemical event mention. Instead, the context is often established explicitly somewhere else in the text, such as a previous sentence or paragraph. In other words, there is a long-distance relation between the event mention and its context. In these cases, the context is implicitly propagated through the discourse that leads up to that particular biochemical event mention, as illustrated in figure 1. Table 1 and figure 2 contain summary statistics about the sentence distances for the relations in the corpus used in this work. These statistics indicate that, while most of the inter-sentence relations are close to the event mention they are associated with, there is a long tail of biological context mentions that are further than five sentences away from the corresponding event mentions.

Quantity                      Count
# of inter-sent. relations    1936
Mean sent. distance           22
Median sent. distance         5
Max sent. distance            225

Table 1
Statistics about the inter-sentence distances of biological context annotations.
Transfection of the R-Ras siRNA effectively reduced the expression of endogenous R-Ras protein in PC12 cells. These results demonstrate that activation of endogenous R-Ras protein is essential for the ECM mediated cell migration and that regulation of R-Ras activity plays a key role in ECM mediated cell migration. Sema4D and Plexin-B1-Rnd1 inhibits PI3-K activity through its R-Ras GAP activity.

Figure 1: Example of an inter-sentence relation annotated by a domain expert. The biological context, highlighted in blue, is established two sentences prior to the event mention, highlighted in pink.
Figure 2: Distribution of inter-sentence distances of biological context annotations. [Histogram of the number of annotations versus the number of sentences apart; plot omitted.]

We frame the problem of associating event mentions with their biological context as an inter-sentence relation extraction task and propose a family of deep-learning architectures to identify context. The approach inspects an event mention, a candidate context mention, and the text between them to determine whether the candidate context mention is context of the event mention. Our work makes the following contributions:

• Proposes a family of neural architectures that leverages large pre-trained language models for multi-sentence relation extraction.
• Extends a corpus of cancer-related open access papers with biochemical event extractions annotated with biological context. Unlike the original corpus, this extended data set includes the full text of each article, tokenized and aligned to its annotations.
• Analyzes multiple methods to aggregate different pieces of evidence that correspond to the same input event and context, and assesses the overall performance and reliability of the networks under these different aggregation schemes.

2. Related Work

The problem of relation extraction (RE) has received extensive attention [13, 14], including within the biomedical domain [15, 16], with recent promising results incorporating distant supervision [17]. However, most of the work focuses on identifying relations among entities within the same sentence. In the biological context association problem, the entities are potentially located in different sentences, making the context association task an instance of an inter-sentence relation extraction problem.

Previous work in inter-sentence relation extraction includes [18], which combined within-sentence syntactic features with an introduced dependency link between the root nodes of parse trees from different sentences that contain a given pair of entities. [19] proposes an inter-sentence relation extraction model that builds a labeled-edge graph convolutional neural network over a document-level graph. There have also been efforts to create language resources to foster the development of inter-sentence relation extraction methods. [20] propose an open-domain data set generated from Wikipedia and Wikidata. [21] propose an inter-sentence relation extraction data set constructed using distant supervision. Modeling inter-sentence relation extraction with transformer architectures requires processing potentially long sequences. Long input sequences are problematic because computing the self-attention matrix has quadratic runtime and space complexity relative to the sequence length. This observation has motivated research efforts to generate efficient approximations of self-attention. [22] proposes a sparse, drop-in replacement for the self-attention mechanism with linear complexity that relies on sliding windows and selects domain-dependent global attention tokens from the input sequence. [23] proposes a lower-rank approximation of the self-attention matrix to linearize the complexity. [24] omits the pair-wise dependencies between the input tokens and then factorizes the attention matrix to reduce its rank. Other approaches [25] rely on kernel functions to compute approximations with linear time and space complexity. [26] takes this approach further by using relative position encodings instead of absolute ones.

Prior work has specifically studied the contextualization of information extraction in the biomedical domain. [27] associates anatomical contextual containers with event mentions that appear in the same sentence via a set of rules that considers lexical patterns in the case of ambiguity and falls back to token distance if no pattern is matched. [28] elaborates on the same idea by incorporating dependency trees into the rules instead of lexical patterns, as well as introducing a method to detect negations and speculative statements.

[29] previously studied the task of context association for the biomedical domain and framed it as an inter-sentence relation extraction problem.
(a) Phospholipase C delta-4 overexpression upregulates <EVT> ErbB1/2 expression </EVT> , Erk signaling pathway , and proliferation in <CON> MCF-7 </CON> cells .
(b) Phospholipase C delta-4 overexpression upregulates <EVT> ErbB1/2 expression </EVT> , Erk signaling pathway , …have linked the upregulation of [EVENT] with rapid proliferation in certain [CONTEXT] … <CON> MCF-7 </CON> cells .
(c) … <CON> macrophages </CON> , and [CONTEXT] , where it is a trimeric complex consisting of one alpha-chain … [SEP] …FcRbeta also acts as a chaperone that increases <EVT> FcepsilonRI expression </EVT>

Figure 3: Example input text spans. (a) Single-sentence segment with markers; (b) multi-sentence segment with markers and masked secondary event and context mentions; and (c) truncated long multi-sentence segment.
That work presents a set of linguistic and lexical features that describe the neighborhood of the participant entities and proposes an aggregation mechanism that results in improved context association.

Previous work relied upon feature engineering to encode the participants and their potential interactions. State-of-the-art NLP research leverages large language models to exploit transfer learning. Models such as [30], and similar transformer-based architectures [31], better capture the semantics of text based on its surrounding context through unsupervised pre-training over extremely large corpora. Specialized models, such as [32, 33, 34], refine language models by continuing pre-training with in-domain corpora.

To the best of our knowledge, the work presented here is the first to propose and analyze deep-learning aggregation and ensemble architectures for many-to-one, long-distance relation extraction.

3. Neural Architectures for Context Association

We propose a family of neural architectures designed to determine whether a candidate context class is relevant to a given biochemical event mention. A biochemical event mention (event mention for short) describes the interaction between proteins, genes, and other gene products through biochemical reactions such as regulation, inhibition, phosphorylation, etc. In particular, we focus on the 12 interactions detected by REACH [35]. A biological container context mention (context mention for short) represents an instance of any of the following biological container types: species (e.g., human, mice), organ (e.g., liver, lung), tissue type (e.g., endothelium, muscle tissue), cell type (e.g., macrophages, neurons), or cell line (e.g., HeLa, MCF-7).

In this work, we use an existing information extraction system [36] to detect and extract event mentions and candidate context mentions. Candidate context mentions are grounded to ontology concepts with unique identifiers to accommodate different spellings and synonyms that refer to the same biological container type. The specific ontology depends on the type of entity: UniProt (https://www.uniprot.org/) for proteins, PubChem (https://pubchem.ncbi.nlm.nih.gov/) for chemical entities, etc.

Importantly, a biological container type is likely mentioned multiple times in a document. Approximately half of the context container types in the context-event relation corpus are detected two or more times, as illustrated in figure 5. Every candidate context mention that refers to the same container type is paired with the relevant event mention to generate a text segment for each pair. Each segment is represented as the concatenation of the sentences that include the event mention, one mention of the candidate context container type, and all the sentences in between. These text segments are used as input to the network to make predictions. If an article contains $n_i$ context mentions of container type $i$, then for each event mention the network will take up to $n_i$ input text segments to determine whether type $i$ is a context of the event. The task of the network is to learn whether context type $i$ is a context of the specific event mention by looking at a subset of the $n_i$ inputs. An article with $j$ context types and $m$ event mentions will see a total of $j \times m$ classification problems and a total of $\sum_{i}^{j} n_i \times m$ input text segments. Figure 4 shows a block diagram of the family of architectures.

Each input segment is preprocessed as follows. The boundaries of the relevant event and candidate context mentions are marked with the special tokens <EVT> ... </EVT> for the event mention and <CON> ... </CON> for the context mention. Other event or context mentions present in the segment are masked with special [EVENT] or [CONTEXT] tokens, respectively, to avoid confusing the classifier with event mentions that are not the focus of the current prediction. Figure 3 shows example text spans where the event and context mentions are surrounded by their boundary tokens. Next, each preprocessed text segment is tokenized using the tokenizer specific to the pre-trained transformer used as the encoder. If a tokenized sequence exceeds the maximum length allowed by the transformer, it is truncated before the encoding step by keeping the prefix of the sequence up to half the maximum length, the suffix up to half the maximum length minus one token, and inserting a special <SEP> token between them. Any truncated input segment is guaranteed to retain both mentions and their local lexical context. Figure 3 shows an example of a segment truncated using this procedure. After tokenization, the segments are encoded using BioMed RoBERTa-base [37] (we used the publicly available checkpoint for both the BPE and BioMed RoBERTa models from https://huggingface.co/allenai/biomed_roberta_base), which is based on [32]. The output hidden states of the <EVT> and <CON> tokens are averaged to create a classification embedding.
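To make the segment construction concrete, here is a minimal sketch of the marking, masking and center-truncation steps described above, assuming token lists and (start, end) token spans as inputs. The function names and data layout are our own illustrative assumptions, not the released implementation.

```python
def build_segment(tokens, event_span, context_span, other_events, other_contexts):
    """Wrap the focus event/context mentions with boundary markers and mask all
    other event/context mentions. Spans are (start, end) token offsets."""
    edits = ([(tuple(event_span), ["<EVT>"], ["</EVT>"]),
              (tuple(context_span), ["<CON>"], ["</CON>"])]
             + [(tuple(s), ["[EVENT]"], None) for s in other_events]
             + [(tuple(s), ["[CONTEXT]"], None) for s in other_contexts])
    out = list(tokens)
    # Apply edits right-to-left so earlier offsets stay valid while the list changes.
    for (start, end), left, right in sorted(edits, key=lambda e: e[0][0], reverse=True):
        if right is None:                      # mask a secondary mention
            out[start:end] = left
        else:                                  # wrap a focus mention with markers
            out[start:end] = left + out[start:end] + right
    return out


def truncate(tokens, max_len):
    """Center truncation: keep a prefix and a suffix joined by a <SEP> marker
    (in a real pipeline, the separator would be the tokenizer's own token)."""
    if len(tokens) <= max_len:
        return tokens
    head = tokens[: max_len // 2]
    tail = tokens[-(max_len // 2 - 1):]
    return head + ["<SEP>"] + tail
```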
Figure 4: Context association neural architecture. The left-most box represents the input text segments after pre-processing. The blocks inside the encoded segments box represent BioMed RoBERTa's hidden states for the input segments. The classification embeddings box contains averages of the hidden states corresponding to the <EVT> and <CON> tokens of each input segment. Depending on the choice of architecture, classification embeddings either flow through (a) the aggregation block, which combines them to then generate the final classification; or (b) the voting block, where each embedding is classified, then the final result is generated through a voting function. [Block diagram omitted; its stages are: input segments, encoded segments, classification embeddings, aggregation function or individual predictions with voting, and final prediction.]
Figure 5: Distribution of the number of context class detections per article ($n_i$). [Histogram of frequency versus the number of instances; plot omitted.]

Each classification task emits a single binary prediction, but has up to $n_i$ classification embeddings to account for the multiple (potential) context mentions that originate from the previously discussed process. To generate a single prediction, the network must combine the information carried forward by the classification embeddings. We propose two general approaches to combine the classification embeddings and generate the final prediction, which combine the information before classification and after classification, respectively:

• Aggregation: Classification embeddings are combined together using an aggregation function. The aggregated embedding is then passed through a multi-layer perceptron (MLP) to emit a binary classification.

• Voting: Each classification embedding is passed individually through the MLP, which emits a local decision based only on the individual input text segment. The individual decisions are combined using a voting function to emit the final classification.

Intuitively, aggregation functions consider multiple information points to make an informed decision based on the "bigger picture" presented by the article. Voting functions, on the other hand, make isolated decisions based solely on information local to each input text segment, then use those individual predictions to vote for the final classification, akin to an ensemble approach.

There are multiple ways to implement aggregation and voting functions. We propose four implementations of each kind, each following an intuitive principle.
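The following sketch contrasts the two strategies on the stack of classification embeddings produced for one (event mention, context type) pair. It is illustrative only: the MLP shape, the plain-average aggregation and the majority vote are stand-ins for the concrete variants described below.

```python
import torch
import torch.nn as nn

class ContextMLP(nn.Module):
    """MLP mapping one classification embedding to two logits (not context / is context)."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def predict_by_aggregation(embeddings: torch.Tensor, mlp: ContextMLP) -> bool:
    """Combine the embeddings first (here: a simple average), then classify once."""
    aggregated = embeddings.mean(dim=0)
    return bool(mlp(aggregated).argmax().item() == 1)

def predict_by_voting(embeddings: torch.Tensor, mlp: ContextMLP) -> bool:
    """Classify each segment independently, then combine the local decisions (here: majority)."""
    votes = mlp(embeddings).argmax(dim=-1)      # one vote per input segment
    return bool(2 * int(votes.sum()) >= votes.numel())
```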
Figure 8: Precision/recall/F1 scores of the relation classifier as the number of context mentions considered for each individual relation classification is varied. Left panel (Figure 6): majority vote; right panel (Figure 7): average aggregation. [Curves of F1, precision and recall versus the number of mentions (3 to 10); plots omitted.]
Documents Event mentions Context mentions Annotations
Validation 6 685 713 1,192
Cross validation 20 1,169 1,926 1,543
Total 26 1,854 2,639 2,735
Cross-validation split
Training 17 975.83 (58.32) 1,654.83 (52.83) 1,288.33 (95.89)
Testing 3 193.16 (58.32) 271.16 (52.83) 254.66 (95.89)
Table 2
Statistics of the context association dataset. The upper part shows statistics from the overall dataset, both in total and split by
the two partitions: (a) validation set, and (b) partition used for the formal cross-validation experiments. The lower part shows
the average and standard deviations used for train/test for the different folds in cross-validation.
Aggregation Functions

Nearest Context Mention: Following the intuition that textual proximity should be a strong indicator of association, this approach selects the context mention of the relevant context type that is closest to the event mention. The closest context mention can appear either before or after the event mention. In this setting, all other context mentions are ignored. The approach results in only one, unaltered classification embedding. It is equivalent to the case where only one mention of the relevant context type appears in a document ($n_i = 1$).

Average Context Embedding: Conversely, all mentions of the candidate context type can bear a degree of responsibility in determining whether it is context of the event mention. Without making a statement about the importance of each context mention, we consider the text segments of the $k$ nearest context mentions of the relevant context type, to either side. The upper bound is enforced for efficiency and is left as a hyperparameter. If there are fewer than $k$ context mentions, all the text segments are considered. The segments are encoded, then the resulting classification embeddings are averaged.

Inverse Distance Aggregation: It can be argued that the influence of each context mention on the final decision decreases as it moves farther from the event mention. In this aggregation approach, instead of averaging the $k$ nearest classification embeddings, they are combined as a weighted sum, where each classification embedding's weight is defined as $w_i = d_i^{-1} / \sum_j^k d_j^{-1}$, the normalized inverse sentence distance between the event mention and the context mention. The resulting aggregated embedding still carries information from the nearest $k$ context mentions, but their contributions diminish in inverse proportion to their distance from the event mention.

Parameterized Aggregation: Instead of relying upon a heuristic to calculate the weights that determine the contribution of each classification embedding, we let the network learn the interactions between them using an attention mechanism. The parameterized aggregation approach concatenates the $k$ nearest classification embeddings and uses an MLP to reduce the concatenated embeddings to a new vector with the same number of components as an individual classification embedding. The MLP acts as a map that combines the original $k$ classification embeddings, with parameters learned during training. If the number of input text segments is less than $k$, the concatenated classification embeddings are padded with zeros before being mapped to the new vector space.
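As a concrete illustration of the three non-trivial aggregation functions, the sketch below shows how the average, inverse-distance and parameterized variants could combine up to $k$ classification embeddings. The tensor shapes are assumptions, sentence distances are assumed to be at least one, and a single linear layer stands in for the reducing MLP.

```python
import torch
import torch.nn as nn

def average_aggregation(embeddings: torch.Tensor) -> torch.Tensor:
    # Plain mean over the (up to) k nearest classification embeddings.
    return embeddings.mean(dim=0)

def inverse_distance_aggregation(embeddings: torch.Tensor, distances: torch.Tensor) -> torch.Tensor:
    # Weighted sum with w_i = d_i^{-1} / sum_j d_j^{-1} (sentence distances assumed >= 1).
    weights = 1.0 / distances
    weights = weights / weights.sum()
    return (weights.unsqueeze(-1) * embeddings).sum(dim=0)

class ParameterizedAggregation(nn.Module):
    """Concatenate the k nearest embeddings (zero-padded if fewer) and learn the combination."""
    def __init__(self, hidden_dim: int = 768, k: int = 3):
        super().__init__()
        self.k = k
        self.reduce = nn.Linear(k * hidden_dim, hidden_dim)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        n, dim = embeddings.shape
        if n < self.k:  # pad with zero vectors up to k segments
            pad = torch.zeros(self.k - n, dim, dtype=embeddings.dtype, device=embeddings.device)
            embeddings = torch.cat([embeddings, pad], dim=0)
        return self.reduce(embeddings[: self.k].reshape(-1))
```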
Voting Functions

One Hit: This voting approach requires the minimum amount of evidence to trigger a positive classification. The context type is classified as is context of the event mention if at least one classification embedding is classified as positive. Intuitively, this voting function favors recall.

Majority Vote: Conversely, it can be argued that there should be consensus in the vote. The majority vote function triggers a positive classification if at least half of the classification embeddings are classified as positive. In contrast to one hit, this voting function favors precision.

Post-Inverse Distance Vote: Analogous to the inverse distance aggregation approach, this approach weights the vote of each classification embedding by the normalized inverse sentence distance, $w_i = d_i^{-1} / \sum_j^k d_j^{-1}$. The final classification is emitted in favor of the class with the highest accumulated weight. As opposed to the inverse distance aggregation approach, the combination happens after passing the embeddings through the MLP.

Confidence Vote: We can weight each vote proportionally to the confidence of the classifier. In this approach, the vote of each individual classification is weighted by the classifier's confidence. The weights are given by the normalized logits of the vote of each classification embedding: $w_i = l_i / \sum_j^k l_j$.
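Analogously, here is a sketch of the remaining voting functions applied to the per-segment logits emitted by the MLP (the majority vote was sketched earlier). The two-column logit layout and the use of softmax probabilities as a stand-in for the normalized logits of the confidence vote are our own assumptions.

```python
import torch

def one_hit(logits: torch.Tensor) -> bool:
    # Positive if at least one segment is classified as positive (favors recall).
    return bool((logits.argmax(dim=-1) == 1).any())

def post_inverse_distance_vote(logits: torch.Tensor, distances: torch.Tensor) -> bool:
    # Weight each segment's vote by the normalized inverse sentence distance and
    # emit the class with the largest accumulated weight.
    weights = 1.0 / distances
    weights = weights / weights.sum()
    votes = logits.argmax(dim=-1)
    return bool(weights[votes == 1].sum() > weights[votes == 0].sum())

def confidence_vote(logits: torch.Tensor) -> bool:
    # Weight each vote by the (normalized) confidence of its own prediction.
    votes = logits.argmax(dim=-1)
    confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
    weights = confidence / confidence.sum()
    return bool(weights[votes == 1].sum() > weights[votes == 0].sum())
```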
4. Full-Text Context-Event Relation Corpus

We used a corpus of biochemical events annotated with biological context to test the neural architectures for context assignment. Our version of the corpus is an extension of the corpus published by [29].

The corpus consists of automated extractions from 26 open-access articles from the PubMed Central repository, all related to the domain of cancer biology. The first type of extraction is event mentions. An event mention is a relation between one or more entities participating in a biochemical reaction or its regulation. These mentions can be phosphorylation, ubiquitination, expression, etc. The second type of extraction is candidate context mentions. These consist of named entity extractions of different biological container types: species, tissue types and cell lines.
Each extracted event was annotated by up to three biologists who assigned the event's relevant biological context from a pool of candidate context extractions available in the paper. Context annotations are not exclusive, meaning that every event mention can be annotated with one or more context classes. The result is a set of annotated events, where each event can have zero or more biological context associations, and there is at least one explicit mention for each biological context in the same article. The specifics of the automated event extraction procedure, annotation tool, annotation protocols and inter-annotator agreements are thoroughly detailed in [29]. Table 2 contains summary statistics of the data set's documents.

The original corpus release lacked the full text of the articles. Our proposed methodology requires the raw text to be used as input to the neural architectures. Our contribution here is an extension of this corpus, in which we identified, processed and tokenized the full text of the articles using the same information extraction tool [35] used by the authors of the original corpus, in such a way that the tokens align correctly with the previously published annotations and extractions. The full-text context-event relation corpus, along with the code for the experiments presented in this document, is publicly available for reproducibility and further research (https://clulab.github.io/neuralbiocontext/).

5. Experiments and Results

In this section, we evaluate all proposed variants of the context association architecture and discuss the results.

5.1. Automatic Negative Examples

The context-event relation corpus only contains positive context annotations of event mentions. We automatically generate negative examples for the event mentions in each document by enumerating the cartesian product of all event and context mentions and subtracting the annotated pairs. One consequence of generating negative examples with this exhaustive strategy is that most of the event/context pairs are negative examples: 60,367 (95.68%) negative pairs and 2,703 (4.32%) positive pairs. This results in a severe class imbalance, which makes the classification task harder.
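The following sketch illustrates this exhaustive pairing strategy; the identifiers and tuple layout are made up for the example and do not reflect the released corpus format.

```python
from itertools import product

def generate_examples(event_ids, context_type_ids, annotated_pairs):
    """Label every (event mention, context type) pair in a document.

    `annotated_pairs` holds the expert-annotated positive pairs; every other
    combination from the cartesian product becomes a negative example.
    """
    return [(event, ctx, (event, ctx) in annotated_pairs)
            for event, ctx in product(event_ids, context_type_ids)]

# Tiny usage example with hypothetical identifiers: one positive, three negatives.
examples = generate_examples(["evt1", "evt2"],
                             ["species:human", "cellline:MCF-7"],
                             {("evt1", "cellline:MCF-7")})
```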
5.2. Results and Discussion

We use a cross-validation evaluation framework similar to the evaluation methodology used by [29]. Each fold contains all of the event-context pairs that belong to three different articles. Additionally, we held out six papers as a development set. During cross validation, one fold is used for testing and training is performed on the remaining $k - 1$ folds plus the data from the development set. This way, we take advantage of more training data and avoid leaking information from development into testing.
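A minimal sketch of this document-level split, under the assumption that articles are grouped into folds of three and the development articles are always appended to the training side, is shown below; the names are illustrative.

```python
def document_folds(article_ids, fold_size=3):
    """Group article identifiers into folds; each fold is tested as a unit."""
    return [article_ids[i:i + fold_size] for i in range(0, len(article_ids), fold_size)]

def cross_validation_splits(cv_articles, dev_articles):
    """Yield (train_articles, test_articles): train on the remaining folds plus the dev set."""
    folds = document_folds(list(cv_articles))
    for test_fold in folds:
        train = [a for fold in folds if fold is not test_fold for a in fold]
        yield train + list(dev_articles), list(test_fold)
```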
To better understand the impact of considering multiple context mentions at the time of aggregation or voting, we tuned this hyperparameter on the development set. Figure 8 shows the effect of increasing the number of context mentions used for relation classification. The number of context mentions considered ranged from three to ten. Both architectures reach a peak F1 score between 3 and 5 context mentions. Performance then decays quickly, almost asymptotically, as the number of considered context mentions increases. This observation suggests that increasing the number of input text segments derived from context mentions that are farther from the event introduces too much noise into the decision process.

After the above tuning, we ran cross-validation experiments for all aggregation and voting methods. Based on the tuning results, we used the closest five mentions of each context class for the average aggregation architecture, and the closest three for all of the other architectures. Table 3 summarizes the cross-validation performance scores for all the architecture variants. The precision, recall, and F1 scores reported are computed only for the positive class (i.e., is context of) to avoid artificially inflating the scores with the dominating negative class.

Method               Precision  Recall  F1
Majority (3 votes)   0.580*     0.498   0.536*
Parameterized agg.   0.537*     0.494   0.514*
One-hit              0.409      0.668*  0.507
Post inv. distance   0.571*     0.446   0.501
Nearest mention      0.541*     0.464   0.499
Average (5 segs)     0.527      0.469   0.497
Inverse distance     0.544*     0.454   0.495
Confidence vote      0.394      0.443   0.417
Baselines
Random forest        0.439      0.541   0.485
Logistic regression  0.361      0.699   0.476
Heuristic            0.421      0.548   0.476
Decision tree        0.311      0.389   0.345

Table 3
Cross-validation results for the is context of class. * denotes a statistically significant improvement w.r.t. the random forest classifier.

The top performing architecture is the majority vote. It achieves an F1 score slightly above 0.53. The majority vote architecture trades off recall for precision. The reason is that the architecture needs to see at least half of the individual input segments classified as positive in order to make that prediction. As a result, a positive classification from this architecture comes with relatively high confidence. As expected, the one-hit architecture achieves the opposite: it trades precision for recall. One-hit only needs to see one individual positive classification in order to emit a positive final classification. As a result, one-hit attains the highest recall among the neural architectures but is more prone to false positives.

We include several baseline algorithms to compare against the performance of the neural architectures. The first baseline is a "heuristic" method that associates all the context types within a constant number of sentences to an event mention. We also include our implementation of three classifiers using the feature engineering method of [29]. The top three performing neural architectures have a statistically significantly higher F1 score than the random forest classifier, which is the strongest baseline algorithm.

Note that the methods proposed by [29] that are included in the table aggregate multiple feature vectors from the different context mentions into a new feature vector composed of multiple statistics computed over the original feature space. Examples of these feature aggregations include the minimum, maximum and average values of the distribution of sentence distances, the frequency of the context type, and the proportion of times the context mention is part of a noun phrase. Their aggregation approach is analogous to the one presented here (although here we operate in embedding space), which is why the comparison between these two approaches is fair.

Table 4 lists the classification scores of the top performing method, stratifying the data by the sentence distance to the closest context mention of the relevant class. Performance, along with the frequency of such instances, quickly degrades as the distance between the event and context mention increases.

Distance  Precision  Recall  F1     Support
0         0.796      0.818   0.807  573
1         0.490      0.450   0.469  262
2         0.398      0.336   0.364  146
3         0.531      0.402   0.457  107
4         0.569      0.393   0.465  84
5+        0.214      0.131   0.163  351

Table 4
Cross-validation scores for the positive class of the Majority (3 votes) architecture, stratified by sentence distance to the closest context mention of the same class.

6. Conclusions

We propose a family of neural architectures to detect the biological context of biochemical events. We approach the problem as an inter-sentence relation extraction task that uses multiple pieces of document-level evidence to classify whether a specific context label is the correct context type of an event extraction.

We provide an analysis of different methods to combine evidence to generate a final decision. The approaches work either before classification, by aggregating embeddings in order to emit a decision, or after classification, by creating ensembles that vote over multiple individual decisions.

Using an expert-annotated corpus that associates biochemical events with relevant biological context, our results show that, in spite of the severe class imbalance, several of the neural architectures are competitive and achieve higher classification performance than a deterministic heuristic and other machine learning approaches.

The neural architectures particularly favor precision, which makes them more appealing for applications where higher precision is desirable.

Inter-sentence relation extraction continues to be a challenge. An ablation study of the degree of aggregation of evidence shows how considering mentions that are farther from the event degrades performance. An error analysis by sentence distance shows how the difficulty of inter-sentence relation extraction correlates with the distance between the participants. The results of these analyses suggest that understanding how to filter out noisy event-context mention pairs and how to better weight the contribution of long-spanning mention pairs are important directions for future research.
References

[1] P. R. Cohen, Darpa's big mechanism program 12 (2015) 045008. URL: https://doi.org/10.1088/1478-3975/12/4/045008. doi:10.1088/1478-3975/12/4/045008.
[2] D. Zhou, D. Zhong, Y. He, Biomedical relation extraction: From binary to complex, Computational and Mathematical Methods in Medicine 2014 (2014) 1–18. doi:10.1155/2014/298473.
[3] L. Hirschman, A. Yeh, C. Blaschke, A. Valencia, Overview of biocreative: critical assessment of information extraction for biology, BMC Bioinformatics 6 (2005) S1. URL: https://doi.org/10.1186/1471-2105-6-S1-S1. doi:10.1186/1471-2105-6-S1-S1.
[4] M. A. Valenzuela-Escárcega, Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, C. T. Morrison, Large-scale automated reading with reach discovers new cancer driving mechanisms, in: Proceedings of the BioCreative VI Workshop (BioCreative6), 2017.
[5] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, C. D. Manning, Model combination for event extraction in bionlp 2011, in: Proceedings of BioNLP Shared Task 2011 Workshop, 2011, pp. 51–55.
[6] H. Kilicoglu, S. Bergler, Adapting a general semantic interpretation approach to biological event extraction, in: Proceedings of BioNLP Shared Task 2011 Workshop, 2011, pp. 173–182.
[7] C. Quirk, P. Choudhury, M. Gamon, L. Vanderwende, Msr-nlp entry in bionlp shared task 2011, in: Proceedings of BioNLP Shared Task 2011 Workshop, 2011, pp. 155–163.
[8] J. Björne, T. Salakoski, Generalizing biomedical event extraction, in: Proceedings of BioNLP Shared Task 2011 Workshop, 2011, pp. 183–191.
[9] J. Björne, T. Salakoski, Biomedical event extraction using convolutional neural networks and dependency parsing, in: BioNLP, 2018.
[10] H.-L. Trieu, T. T. Tran, K. N. A. Duong, A. Nguyen, M. Miwa, S. Ananiadou, DeepEventMine: end-to-end neural nested event extraction from biomedical texts, Bioinformatics 36 (2020) 4910–4917. URL: https://doi.org/10.1093/bioinformatics/btaa540. doi:10.1093/bioinformatics/btaa540.
[11] S. Rao, D. Marcu, K. Knight, H. Daumé, Biomedical event extraction using abstract meaning representation, in: BioNLP 2017, 2017, pp. 126–135.
[12] N. M. Hamad, J. H. Elconin, A. E. Karnoub, W. Bai, J. N. Rich, R. T. Abraham, C. J. Der, C. M. Counter, Distinct requirements for ras oncogenesis in human versus mouse cells, Genes & Development 16 (2002) 2045–2057.
[13] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007, pp. 2670–2676.
[14] N. Bach, S. Badaskar, A review of relation extraction, Literature review for Language and Statistics II (2007).
[15] C. Quan, M. Wang, F. Ren, An unsupervised text mining method for relation extraction from biomedical literature, PLOS One (2014).
[16] K. Fundel, R. Küffner, R. Zimmer, RelEx – Relation extraction using dependency parse trees, Bioinformatics 23 (2007) 365–371.
[17] H. Poon, K. Toutanova, C. Quirk, Distant supervision for cancer pathway extraction from text, in: Pacific Symposium for Biocomputing, 2015.
[18] K. Swampillai, M. Stevenson, Extracting relations within and across sentences, in: Proceedings of Recent Advances in Natural Language Processing, 2011.
[19] S. K. Sahu, F. Christopoulou, M. Miwa, S. Ananiadou, Inter-sentence relation extraction with document-level graph convolutional neural network, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4309–4316. URL: https://aclanthology.org/P19-1423. doi:10.18653/v1/P19-1423.
[20] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, M. Sun, DocRED: A large-scale document-level relation extraction dataset, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 764–777. URL: https://aclanthology.org/P19-1074. doi:10.18653/v1/P19-1074.
[21] A. Mandya, D. Bollegala, F. Coenen, K. Atkinson, A dataset for inter-sentence relation extraction using distant supervision, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1246.
[22] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).
[23] S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma, Linformer: Self-attention with linear complexity, arXiv preprint arXiv:2006.04768 (2020).
[24] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, C. Zheng, Synthesizer: Rethinking self-attention in transformer models, 2021. arXiv:2005.00743.
[25] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, A. Weller, Rethinking attention with performers, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=Ua6zuk0WRH.
[26] P. Chen, Permuteformer: Efficient relative position encoding for long sequences, in: EMNLP, 2021.
[27] M. Gerner, G. Nenadic, C. M. Bergman, An exploration of mining gene expression mentions and their anatomical locations from biomedical text, in: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, 2010, pp. 72–80.
[28] F. Sarafraz, Finding conflicting statements in the biomedical literature, Ph.D. thesis, University of Manchester, 2012.
[29] E. Noriega-Atala, P. D. Hein, S. S. Thumsi, Z. Wong, X. Wang, S. M. Hendryx, C. T. Morrison, Extracting inter-sentence relations for associating biological context with events in biomedical texts, IEEE/ACM Transactions on Computational Biology and Bioinformatics 17 (2020) 1895–1906. doi:10.1109/TCBB.2019.2904231.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, ArXiv abs/1907.11692 (2019).
[33] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[34] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 72–78. URL: https://aclanthology.org/W19-1909. doi:10.18653/v1/W19-1909.
[35] M. A. Valenzuela-Escárcega, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega, M. Surdeanu, C. T. Morrison, Reach, https://github.com/clulab/reach, 2018.
[36] M. A. Valenzuela-Escarcega, O. Babur, G. Hahn-Powel, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, C. T. Morrison, Large-scale automated machine reading discovers new cancer driving mechanisms, Database: The Journal of Biological Databases and Curation (2018). URL: http://clulab.cs.arizona.edu/papers/escarcega2018.pdf. doi:10.1093/database/bay098.
[37] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, in: Proceedings of ACL, 2020.