Neural Architectures for Biological Inter-Sentence Relation Extraction

Enrique Noriega-Atala, Peter M. Lovett, Clayton T. Morrison and Mihai Surdeanu
The University of Arizona, Tucson, Arizona, USA

Abstract
We introduce a family of deep-learning architectures for inter-sentence relation extraction, i.e., relations where the participants are not necessarily in the same sentence. We apply these architectures to an important use case in the biomedical domain: assigning biological context to biochemical events. In this work, biological context is defined as the type of biological system within which the biochemical event is observed. The neural architectures encode and aggregate multiple occurrences of the same candidate context mention to determine whether it is the correct context for a particular event mention. We propose two broad types of architectures: the first type aggregates multiple instances that correspond to the same candidate context with respect to an event mention before emitting a classification; the second type independently classifies each instance and uses the results to vote for the final class, akin to an ensemble approach. Our experiments show that the proposed neural classifiers are competitive and that some achieve better performance than previous state-of-the-art traditional machine learning methods, without the need for feature engineering. Our analysis shows that the neural methods particularly improve precision compared to traditional machine learning classifiers, and also demonstrates how the difficulty of inter-sentence relation extraction increases as the distance between the event and context mentions increases.

Keywords
Inter-sentence relation extraction, biological context, natural language processing, neural networks

1. Introduction

Extracting biochemical interactions that describe mechanistic information from scientific literature is a task that has been well studied by the NLP community [1, 2, 3]. Automated event detection systems such as [4, 5, 6, 7, 8, 9, 10, 11] are able to detect and extract biochemical events with high throughput and good recall. The information extracted with such tools enables scientists and researchers to analyze, study and discover mechanistic pathways and their characteristics by aggregating the interactions and biological processes described in the scientific literature.

Table 1
Statistics about the inter-sentence distances of biological context annotations.

  Quantity                     Count
  # of inter-sent. relations   1936
  Mean sent. distance          22
  Median sent. distance       5
  Max sent. distance          225

However, when dealing with such mechanistic processes it is important to identify the biological context in which they hold. Here, biological context means the type of biological system, described at different levels of granularity, such as species, organ, tissue, cellular component, and/or cell line, within which the extracted biochemical interactions are observed. Knowing the biological context is important to correctly interpret the mechanistic pathways described by the literature. For example, some tumors associated with oncogenic Ras in humans are different from those in mice, suggesting that the Ras pathway differs in both species [12]. Ignoring the biological context information, specifically the species in the prior example, can mislead the reader to draw incorrect conclusions.

Biological context is often not explicitly stated in the same clause that contains the biochemical event mention.
Instead, the context is often established explicitly somewhere else in the text, such as a previous sentence or paragraph. In other words, there is a long-distance relation between the event mention and its context. In these cases, the context is implicitly propagated through the discourse that leads up to that particular biochemical event mention, as illustrated in Figure 1. Table 1 and Figure 2 contain summary statistics about the sentence distances for the relations in the corpus used in this work. These statistics indicate that, while most of the inter-sentence relations are close to the event mention they are associated with, there is a long tail of biological context mentions that are further than five sentences away from the corresponding event mentions.

Figure 1: Example of an inter-sentence relation annotated by a domain expert. The biological context, highlighted in blue, is established two sentences prior to the event mention, highlighted in pink.
  "Transfection of the R-Ras siRNA effectively reduced the expression of endogenous R-Ras protein in PC12 cells. These results demonstrate that activation of endogenous R-Ras protein is essential for the ECM mediated cell migration and that regulation of R-Ras activity plays a key role in ECM mediated cell migration. Sema4D and Plexin-B1-Rnd1 inhibits PI3-K activity through its R-Ras GAP activity."

Figure 2: Distribution of inter-sentence distances of biological context annotations (# of annotations vs. # of sentences apart).

We frame the problem of associating event mentions with their biological context as an inter-sentence relation extraction task and propose a family of deep-learning architectures to identify context. The approach inspects an event mention, a candidate context mention, and the text between them to determine whether the candidate context mention is context of the event mention. Our work makes the following contributions:

• Proposes a family of neural architectures that leverages large pre-trained language models for multi-sentence relation extraction.
• Extends a corpus of cancer-related open-access papers with biochemical event extractions annotated with biological context. Unlike the original corpus, this extended data set includes the full text of each article, tokenized and aligned to its annotations.
• Analyzes multiple methods to aggregate different pieces of evidence that correspond to the same input event and context, and assesses the overall performance and reliability of the networks under these different aggregation schemes.

2. Related Work

The problem of relation extraction (RE) has received extensive attention [13, 14], including within the biomedical domain [15, 16], with recent promising results incorporating distant supervision [17]. However, most of the work focuses on identifying relations among entities within the same sentence. In the biological context association problem, the entities are potentially located in different sentences, making the context association task an instance of an inter-sentence relation extraction problem.

Previous work in inter-sentence relation extraction includes [18], which combined within-sentence syntactic features with an introduced dependency link between the root nodes of parse trees from different sentences that contain a given pair of entities. [19] proposes an inter-sentence relation extraction model that builds a labeled-edge graph convolutional neural network on a document-level graph. There have also been efforts to create language resources to foster the development of inter-sentence relation extraction methods. [20] propose an open-domain data set generated from Wikipedia and Wikidata. [21] propose an inter-sentence relation extraction data set constructed using distant supervision.

Modeling inter-sentence relation extraction with transformer architectures requires processing potentially long sequences. Long input sequences are problematic because computing the self-attention matrix has quadratic runtime and space complexity relative to the sequence length. This observation has motivated research efforts to generate efficient approximations of self-attention. [22] proposes a sparse, drop-in replacement for the self-attention mechanism with linear complexity that relies on sliding windows and selects domain-dependent global attention tokens from the input sequence. [23] proposes a lower-rank approximation of the self-attention matrix to linearize the complexity. [24] omits the pair-wise dependencies between the input tokens and then factorizes the attention matrix to reduce its rank. Other approaches [25] rely on kernel functions to compute approximations with linear time and space complexity. [26] takes this approach further by using relative position encodings instead of absolute ones.

Prior work has specifically studied the contextualization of information extraction in the biomedical domain. [27] associates anatomical contextual containers with event mentions that appear in the same sentence via a set of rules that considers lexical patterns in the case of ambiguity and falls back to token distance if no pattern is matched. [28] elaborates on the same idea by incorporating dependency trees into the rules instead of lexical patterns, as well as introducing a method to detect negations and speculative statements.

[29] previously studied the task of context association for the biomedical domain and framed it as a problem of inter-sentence relation extraction. This work presents a set of linguistic and lexical features that describe the neighborhood of the participant entities and proposes an aggregation mechanism that results in improved context association.
Previous work relied upon feature engineering to encode the participants and their potential interactions. State-of-the-art NLP research leverages large language models to exploit transfer learning. Models such as [30], and similar transformer-based architectures [31], better capture the semantics of text based on its surrounding context through unsupervised pre-training over extremely large corpora. Specialized models, such as [32, 33, 34], refine language models by continuing pre-training with in-domain corpora.

To the best of our knowledge, the work presented here is the first to propose and analyze deep-learning aggregation and ensemble architectures for many-to-one, long-distance relation extraction.

3. Neural Architectures for Context Association

We propose a family of neural architectures designed to determine whether a candidate context class is relevant to a given biochemical event mention. A biochemical event mention (event mention for short) describes the interaction between proteins, genes, and other gene products through biochemical reactions such as regulation, inhibition, phosphorylation, etc. In particular, we focus on the 12 interactions detected by REACH [35]. A biological container context mention (context mention for short) represents an instance of any of the following biological container types: species (e.g., human, mice), organ (e.g., liver, lung), tissue type (e.g., endothelium, muscle tissue), cell type (e.g., macrophages, neurons), or cell line (e.g., HeLa, MCF-7).

In this work, we use an existing information extraction system [36] to detect and extract event mentions and candidate context mentions. Candidate context mentions are grounded to ontology concepts with unique identifiers to accommodate different spellings and synonyms that refer to the same biological container type. The specific ontology depends on the type of entity: UniProt (https://www.uniprot.org/) for proteins, PubChem (https://pubchem.ncbi.nlm.nih.gov/) for chemical entities, etc.

Importantly, a biological context container type is likely mentioned multiple times in a document. Approximately half of the context container types in the context-event relation corpus are detected two or more times, as illustrated in Figure 5. Every candidate context mention that refers to the same container type is paired with the relevant event mention to generate a text segment for each pair. Each segment is represented as the concatenation of the sentences that include the event mention, one mention of the candidate context container type, and all the sentences in between. These text segments are used as input to the network to make predictions. If an article contains n_i context mentions of container type i, then for each event mention the network will take up to n_i input text segments to determine whether type i is a context of the event. The task of the network is to learn whether context type i is a context of the specific event mention by looking at a subset of the n_i inputs. An article with j context types and m event mentions yields a total of j × m classification problems and a total of (∑_i n_i) × m input text segments, summing over the j context types. Figure 4 shows a block diagram of the family of architectures.

Figure 4: Context association neural architecture. The left-most box represents the input text segments after pre-processing. The blocks inside the encoded segments box represent BioMed RoBERTa's hidden states for the input segments. The classification embeddings box contains averages of the hidden states corresponding to the <EVT> and <CON> tokens of each input segment. Depending on the choice of architecture, classification embeddings either flow through (a) the aggregation block, which combines them to then generate the final classification; or (b) the voting block, where each embedding is classified, then the final result is generated through a voting function.

Figure 5: Distribution of the number of context class detections per article (n_i).

Each input segment is preprocessed as follows. The boundaries of the relevant event and candidate context mentions are marked with special tokens: <EVT>...</EVT> for the event mention and <CON>...</CON> for the context mention. Other event or context mentions present in the segment are masked with special [EVENT] or [CONTEXT] tokens, respectively, to avoid confusing the classifier with mentions that are not the focus of the current prediction. Figure 3 shows example text spans where the event and context mentions are surrounded by their boundary tokens. Next, each preprocessed text segment is tokenized using the tokenizer specific to the pre-trained transformer used as the encoder. If a tokenized sequence exceeds the maximum length allowed by the transformer, it is truncated before the encoding step by selecting the prefix of the sequence up to half the maximum length and the suffix up to half the maximum length minus one token, and inserting a special <SEP> token between them. Any truncated input segment is guaranteed to retain both mentions and their local lexical context.

Figure 3: Example input text spans. (a) Single-sentence segment with markers; (b) multi-sentence segment with markers and masked secondary event and context mentions; and (c) truncated long multi-sentence segment.
  (a) Phospholipase C delta-4 overexpression upregulates <EVT> ErbB1/2 expression </EVT> , Erk signaling pathway , and proliferation in <CON> MCF-7 </CON> cells .
  (b) Phospholipase C delta-4 overexpression upregulates <EVT> ErbB1/2 expression </EVT> , Erk signaling pathway , …have linked the upregulation of [EVENT] with rapid proliferation in certain [CONTEXT] … <CON> MCF-7 </CON> cells .
  (c) … <CON> macrophages </CON> , and [CONTEXT] , where it is a trimeric complex consisting of one alpha-chain … [SEP] …FcRbeta also acts as a chaperone that increases <EVT> FcepsilonRI expression </EVT>
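As a concrete illustration of the truncation rule described above, the following minimal sketch (not taken from the paper's released code) operates on an already tokenized segment; max_len and sep_id are placeholder names for the encoder's length limit and the id of the special <SEP> token.

```python
def truncate_segment(token_ids, max_len, sep_id):
    """Truncate an over-long tokenized segment as described above: keep a
    prefix of up to half the maximum length and a suffix of up to half the
    maximum length minus one token, joined by a <SEP> token."""
    if len(token_ids) <= max_len:
        return token_ids
    half = max_len // 2
    prefix = token_ids[:half]          # keeps the earlier mention and its local context
    suffix = token_ids[-(half - 1):]   # keeps the later mention and its local context
    return prefix + [sep_id] + suffix  # length is at most half + 1 + (half - 1) = max_len
```

Because the segment is built as the span from one marked mention to the other, the two mentions sit near its two ends, which is why a segment truncated this way retains both mentions and their neighboring tokens.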
Figure bined using a voting function to emit the final 3 shows an example of a segment truncated using this classification. procedure. After tokenization, the segments are encoded using BioMed RoBERTa-base [37] 3 , based on [32]. Intuitively, aggregation functions consider multiple The output hidden states of the < E V T > and < C O N > tokens information points to make an informed decision based are averaged to create a classification embedding. on the “bigger picture” presented by the article. Voting Each classification task emits a single binary predic- functions, on the other hand, make isolated decisions tion, but has up to 𝑛𝑖 classification embeddings to account solely based on information local to each input text seg- for the multiple (potential) context mentions that origi- ment, then use those individual predictions to vote for the final classification, akin to an ensemble approach. 3 We used the available public checkpoint for both the BPE There are multiple ways to implement aggregation and and BioMed RoBERTa models from https://huggingface.co/allenai/ voting functions. We propose four implementations of biomed_roberta_base each kind, each following a intuitive principle. [b]0.49 0.6 0.5 0.4 Score value 0.3 0.2 F1 0.1 Precision Recall 3 4 5 6 7 8 9 10 # of mentions Figure 6: Majority vote [b]0.49 0.65 F1 0.60 Precision 0.55 Recall 0.50 Score value 0.45 0.40 0.35 0.30 0.25 3 4 5 6 7 8 9 10 # of mentions Figure 7: Average aggregation Figure 8: Precision/recall/F1 scores of the relation classifier as the number of context mention considered for each individual relation classification is varied. Documents Event mentions Context mentions Annotations Validation 6 685 713 1,192 Cross validation 20 1,169 1,926 1,543 Total 26 1,854 2,639 2,735 Cross-validation split Training 17 975.83 (58.32) 1,654.83 (52.83) 1,288.33 (95.89) Testing 3 193.16 (58.32) 271.16 (52.83) 254.66 (95.89) Table 2 Statistics of the context association dataset. The upper part shows statistics from the overall dataset, both in total and split by the two partitions: (a) validation set, and (b) partition used for the formal cross-validation experiments. The lower part shows the average and standard deviations used for train/test for the different folds in cross-validation. Aggregation Functions gregation approach concatenates 𝑘 nearest classification embeddings and uses a MLP to reduce the concatenated Nearest Context Mention: Following the intuition that embeddings to a new vector with the same number of textual proximity should be a strong indicator of associ- components as an individual classification embedding. ation, this approach selects the context mention of the The MLP works as map that combines the original 𝑘 relevant context type that is closest to the event mention. classification embeddings whose parameters are learned The closest context mention can appear either before during training. If the number of input text segments or after event mention. In this setting, all other context is < 𝑘, the concatenated classification embeddings are mentions are ignored. The approach results in only one, padded with zeros before being mapped to the new vector unaltered classification embedding. It is equivalent to space. the case where only one mention of the relevant context type appears in a document (𝑛𝑖 = 1). 
Aggregation Functions

Nearest Context Mention: Following the intuition that textual proximity should be a strong indicator of association, this approach selects the context mention of the relevant context type that is closest to the event mention. The closest context mention can appear either before or after the event mention. In this setting, all other context mentions are ignored. The approach results in only one, unaltered classification embedding. It is equivalent to the case where only one mention of the relevant context type appears in a document (n_i = 1).

Average Context Embedding: Conversely, all mentions of the candidate context type can bear a degree of responsibility for determining whether it is context of the event mention. Without making a statement about the importance of each context mention, we consider the text segments of the k nearest context mentions of the relevant context type, to either side. The upper bound k is enforced for efficiency and is left as a hyper-parameter. If there are fewer than k context mentions, all the text segments are considered. The segments are encoded, then the resulting classification embeddings are averaged.

Inverse Distance aggregation: It can be argued that the influence of each context mention on the final decision decreases as it gets farther from the event mention. We propose this aggregation approach, where instead of averaging the k nearest classification embeddings, they are combined as a weighted sum, where each classification embedding's weight is defined as w_i = d_i^-1 / ∑_j d_j^-1, the normalized inverse sentence distance between the event mention and the context mention, with the sum taken over the k nearest mentions. The resulting aggregated embedding still carries information from the nearest k context mentions, but their contributions diminish inversely proportionally to their distance from the event mention.

Parameterized aggregation: Instead of relying upon a heuristic approach to calculate the weights that determine the contributions of each classification embedding, we let the network learn the interactions between them using an attention mechanism. The parameterized aggregation approach concatenates the k nearest classification embeddings and uses an MLP to reduce the concatenation to a new vector with the same number of components as an individual classification embedding. The MLP works as a map that combines the original k classification embeddings and whose parameters are learned during training. If the number of input text segments is smaller than k, the concatenated classification embeddings are padded with zeros before being mapped to the new vector space.

Voting Functions

One hit: This voting approach requires the minimum amount of evidence to trigger a positive classification. The context type is classified as is context of the event mention if at least one classification embedding is classified as positive. Intuitively, this voting function favors recall.

Majority vote: Conversely, it can be argued that there should be consensus in the vote. The majority vote function triggers a positive classification if at least half of the classification embeddings are classified as positive. In contrast to one hit, this voting function favors precision.

Post-inverse distance vote: Analogous to the inverse distance aggregation approach, this approach weights the vote of each classification embedding by the normalized inverse sentence distance: w_i = d_i^-1 / ∑_j d_j^-1. The final classification is emitted in favor of the class with the highest total weight. As opposed to the inverse distance aggregation approach, the combination happens after passing the embeddings through the MLP.

Confidence vote: We can weight each vote proportionally to the confidence of the classifier. In this approach, the vote of each individual classification is weighted by the classifier's confidence. The weights are given by the normalized logits of the vote of each classification embedding: w_i = l_i / ∑_j l_j.
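The sketch below illustrates two of the heuristics above under stated assumptions: each of the k segments comes with its sentence distance d_i or a binary per-segment prediction, and the guard for mentions located in the same sentence (distance 0) is an assumption not spelled out in the text.

```python
def inverse_distance_weights(distances):
    """w_i = d_i^-1 / sum_j d_j^-1: normalized inverse sentence distances,
    as used by inverse distance aggregation and the post-inverse distance vote."""
    inverse = [1.0 / max(d, 1) for d in distances]  # assumption: treat distance 0 as 1
    total = sum(inverse)
    return [value / total for value in inverse]

def one_hit(predictions):
    """Positive if at least one per-segment prediction is positive (favors recall)."""
    return any(predictions)

def majority_vote(predictions):
    """Positive if at least half of the per-segment predictions are positive
    (favors precision)."""
    return sum(predictions) >= len(predictions) / 2
```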
4. Full-Text Context-Event Relation Corpus

We used a corpus of biochemical events annotated with biological context to test the neural architectures for context assignment. Our version of the corpus is an extension of the corpus published by [29].

The corpus consists of automated extractions from 26 open-access articles in the PubMed Central repository, all related to the domain of cancer biology. The first type of extractions are event mentions. An event mention is a relation between one or more entities participating in a biochemical reaction or its regulation. These mentions can be phosphorylation, ubiquitination, expression, etc. The second type of extractions are candidate context mentions. These consist of named entity extractions of different biological container types: species, tissue types and cell lines.

Each extracted event was annotated by up to three biologists, who assigned the event's relevant biological context from a pool of candidate context extractions available in the paper. Context annotations are not exclusive, meaning that every event mention can be annotated with one or more context classes. The result is a set of annotated events, where each event can have zero or more biological context associations, and there is at least one explicit mention of each biological context in the same article. The specifics of the automated event extraction procedure, annotation tool, annotation protocols and inter-annotator agreements are thoroughly detailed in [29]. Table 2 contains summary statistics of the data set's documents.

Table 2
Statistics of the context association dataset. The upper part shows statistics for the overall dataset, both in total and split by the two partitions: (a) the validation set, and (b) the partition used for the formal cross-validation experiments. The lower part shows the averages (and standard deviations) of the train/test splits across the different cross-validation folds.

                     Documents   Event mentions    Context mentions    Annotations
  Validation         6           685               713                 1,192
  Cross validation   20          1,169             1,926               1,543
  Total              26          1,854             2,639               2,735

  Cross-validation split
  Training           17          975.83 (58.32)    1,654.83 (52.83)    1,288.33 (95.89)
  Testing            3           193.16 (58.32)    271.16 (52.83)      254.66 (95.89)

The original corpus release lacked the full text of the articles, while our proposed methodology requires the raw text as input to the neural architectures. Our contribution here is an extension of this corpus in which we identified, processed and tokenized the full text of the articles using the same information extraction tool [35] used by the authors of the original corpus, in such a way that the tokens align correctly with the previously published annotations and extractions. The full-text context-event relation corpus, along with the code for the experiments presented in this document, is publicly available for reproducibility and further research at https://clulab.github.io/neuralbiocontext/.

5. Experiments and Results

In this section, we evaluate all proposed variants of the context association architecture and discuss the results.

5.1. Automatic Negative Examples

The context-event relation corpus only contains positive context annotations of event mentions. We automatically generate negative examples for the event mentions in each document by enumerating the cartesian product of all event and context mentions and then subtracting the annotated pairs. One consequence of generating negative examples using this exhaustive strategy is that most of the event/context pairs become negative examples, with 60,367 (95.68%) negative pairs and 2,703 (4.32%) positive pairs. This results in a severe class imbalance, which makes the classification task harder.
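A small sketch of this exhaustive pairing is shown below, assuming per-document lists of event mention identifiers and grounded context types; the names are illustrative rather than the corpus' actual field names.

```python
from itertools import product

def build_examples(event_mentions, context_types, annotated_pairs):
    """Enumerate every (event mention, context type) pair in a document;
    pairs annotated by the experts are positive, all remaining pairs negative."""
    positives = set(annotated_pairs)
    return [
        (event, context, (event, context) in positives)
        for event, context in product(event_mentions, context_types)
    ]
```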
5.2. Results and Discussion

We use a cross-validation evaluation framework similar to the evaluation methodology used by [29]. Each fold contains all of the event-context pairs that belong to three different articles. Additionally, we held out six papers as a development set. During cross-validation, one fold is used for testing and training is performed on the remaining k − 1 folds plus the data from the development set. This way, we take advantage of more training data and avoid leaking information from development into testing.

To better understand the impact of considering multiple context mentions at the time of aggregation or voting, we tuned this hyper-parameter on the development set. Figure 8 shows the effect of increasing the number of context mentions used for relation classification. The number of context mentions considered ranged from three to ten. Both architectures reach a peak F1 score between three and five context mentions. Performance then decays quickly, almost asymptotically, as the number of considered context mentions increases. This observation suggests that increasing the number of input text segments derived from context mentions that are farther from the event introduces too much noise into the decision process.

Figure 6: Majority vote
Figure 7: Average aggregation
Figure 8: Precision/recall/F1 scores of the relation classifier as the number of context mentions considered for each individual relation classification is varied.

After the above tuning, we ran cross-validation experiments for all aggregation and voting methods. Based on the tuning results, we used the closest five mentions of each context class for the average aggregation architecture, and the closest three for all of the other architectures. Table 3 summarizes the cross-validation performance scores for all the architecture variants. The precision, recall, and F1 scores reported are computed only for the positive class (i.e., is context of) to avoid artificially inflating the scores with the dominant negative class.

Table 3
Cross-validation results for the is context of class. * denotes a statistically significant improvement w.r.t. the random forest classifier.

  Method                Precision   Recall   F1
  Majority (3 votes)    0.580*      0.498    0.536*
  Parameterized agg.    0.537*      0.494    0.514*
  One-hit               0.409       0.668*   0.507
  Post inv. distance    0.571*      0.446    0.501
  Nearest mention       0.541*      0.464    0.499
  Average (5 segs)      0.527       0.469    0.497
  Inverse distance      0.544*      0.454    0.495
  Confidence vote       0.394       0.443    0.417

  Baselines
  Random forest         0.439       0.541    0.485
  Logistic regression   0.361       0.699    0.476
  Heuristic             0.421       0.548    0.476
  Decision tree         0.311       0.389    0.345

The top-performing architecture is the majority vote, which achieves an F1 score slightly above 0.53. The majority vote architecture trades off recall for precision: the architecture needs to see at least half of the individual input segments classified as positive in order to make a positive prediction, so a positive classification from this architecture comes with relatively high confidence. As expected, the one-hit architecture achieves the opposite trade-off, sacrificing precision for recall. One-hit only needs to see one individual positive classification in order to emit a positive final classification. As a result, one-hit attains the highest recall among the neural architectures but is more prone to false positives.
We include several baseline algorithms against which to compare the performance of the neural architectures. The first baseline is a "heuristic" method that associates with an event mention all the context types within a constant number of sentences. We also include our implementations of three classifiers using the feature engineering method of [29]. The top three performing neural architectures have a statistically significantly higher F1 score than the random forest classifier, which is the strongest baseline algorithm.

Note that the methods proposed by [29] that are included in the table aggregate multiple feature vectors from the different context mentions into a new feature vector composed of multiple statistics computed over the original feature space. Examples of these feature aggregations include the minimum, maximum and average values of the distribution of sentence distances, the frequency of the context type, and the proportion of times the context mention is part of a noun phrase. Their aggregation approach is analogous to the one presented here (although here we operate in embedding space), which is why the comparison between these two approaches is fair.

Table 4 lists the classification scores of the top-performing method, stratifying the data by the sentence distance to the closest context mention of the relevant class. Performance, along with the frequency of such instances, quickly degrades as the distance between event and context mention increases.

Table 4
Cross-validation scores for the positive class of the Majority (3 votes) architecture, stratified by sentence distance to the closest context mention of the same class.

  Distance   Precision   Recall   F1      Support
  0          0.796       0.818    0.807   573
  1          0.490       0.450    0.469   262
  2          0.398       0.336    0.364   146
  3          0.531       0.402    0.457   107
  4          0.569       0.393    0.465   84
  5+         0.214       0.131    0.163   351

6. Conclusions

We propose a family of neural architectures to detect the biological context of biochemical events. We approach the problem as an inter-sentence relation extraction task that uses multiple pieces of document-level evidence to classify whether a specific context label is the correct context type of an event extraction.

We provide an analysis of different methods to combine evidence into a final decision. The approaches work either before classification, by aggregating embeddings in order to emit a decision, or after classification, by creating ensembles that vote over multiple individual decisions.

Using an expert-annotated corpus that associates biochemical events with relevant biological context, our results show that, in spite of the severe class imbalance, several of the neural architectures are competitive and achieve higher classification performance than a deterministic heuristic and other machine learning approaches.

The neural architectures particularly favor precision, which makes them more appealing for applications where higher precision is desirable.

Inter-sentence relation extraction continues to be a challenge. An ablation study of the degree of aggregation of evidence shows how considering mentions that are farther from the event degrades performance. An error analysis by sentence distance shows how the difficulty of inter-sentence relation extraction correlates with the distance between the participants. The results of these analyses suggest that understanding how to filter out noisy event-context mention pairs, and how to better weight the contribution of long-spanning mention pairs, are important directions for future research.

References

[1] P. R. Cohen, DARPA's Big Mechanism program, Physical Biology 12 (2015) 045008. URL: https://doi.org/10.1088/1478-3975/12/4/045008. doi:10.1088/1478-3975/12/4/045008.
[2] D. Zhou, D. Zhong, Y. He, Biomedical relation extraction: From binary to complex, Computational and Mathematical Methods in Medicine 2014 (2014) 1–18. doi:10.1155/2014/298473.
[3] L. Hirschman, A. Yeh, C. Blaschke, A. Valencia, Overview of BioCreAtIvE: critical assessment of information extraction for biology, BMC Bioinformatics 6 (2005) S1. URL: https://doi.org/10.1186/1471-2105-6-S1-S1. doi:10.1186/1471-2105-6-S1-S1.
[4] M. A. Valenzuela-Escárcega, Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, C. T. Morrison, Large-scale automated reading with Reach discovers new cancer driving mechanisms, in: Proceedings of the BioCreative VI Workshop (BioCreative6), 2017.
[5] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, C. D. Manning, Model combination for event extraction in BioNLP 2011, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 51–55.
[6] H. Kilicoglu, S. Bergler, Adapting a general semantic interpretation approach to biological event extraction, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 173–182.
[7] C. Quirk, P. Choudhury, M. Gamon, L. Vanderwende, MSR-NLP entry in BioNLP Shared Task 2011, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 155–163.
[8] J. Björne, T. Salakoski, Generalizing biomedical event extraction, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 183–191.
[9] J. Björne, T. Salakoski, Biomedical event extraction using convolutional neural networks and dependency parsing, in: BioNLP, 2018.
[10] H.-L. Trieu, T. T. Tran, K. N. A. Duong, A. Nguyen, M. Miwa, S. Ananiadou, DeepEventMine: end-to-end neural nested event extraction from biomedical texts, Bioinformatics 36 (2020) 4910–4917. URL: https://doi.org/10.1093/bioinformatics/btaa540. doi:10.1093/bioinformatics/btaa540.
[11] S. Rao, D. Marcu, K. Knight, H. Daumé, Biomedical event extraction using abstract meaning representation, in: BioNLP 2017, 2017, pp. 126–135.
[12] N. M. Hamad, J. H. Elconin, A. E. Karnoub, W. Bai, J. N. Rich, R. T. Abraham, C. J. Der, C. M. Counter, Distinct requirements for Ras oncogenesis in human versus mouse cells, Genes & Development 16 (2002) 2045–2057.
[13] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007, pp. 2670–2676.
[14] N. Bach, S. Badaskar, A review of relation extraction, Literature review for Language and Statistics II (2007).
[15] C. Quan, M. Wang, F. Ren, An unsupervised text mining method for relation extraction from biomedical literature, PLOS One (2014).
[16] K. Fundel, R. Küffner, R. Zimmer, RelEx – Relation extraction using dependency parse trees, Bioinformatics 23 (2007) 365–371.
[17] H. Poon, K. Toutanova, C. Quirk, Distant supervision for cancer pathway extraction from text, in: Pacific Symposium on Biocomputing, 2015.
[18] K. Swampillai, M. Stevenson, Extracting relations within and across sentences, in: Proceedings of Recent Advances in Natural Language Processing, 2011.
[19] S. K. Sahu, F. Christopoulou, M. Miwa, S. Ananiadou, Inter-sentence relation extraction with document-level graph convolutional neural network, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4309–4316. URL: https://aclanthology.org/P19-1423. doi:10.18653/v1/P19-1423.
[20] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, M. Sun, DocRED: A large-scale document-level relation extraction dataset, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 764–777. URL: https://aclanthology.org/P19-1074. doi:10.18653/v1/P19-1074.
[21] A. Mandya, D. Bollegala, F. Coenen, K. Atkinson, A dataset for inter-sentence relation extraction using distant supervision, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1246.
[22] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).
[23] S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma, Linformer: Self-attention with linear complexity, arXiv preprint arXiv:2006.04768 (2020).
[24] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, C. Zheng, Synthesizer: Rethinking self-attention in transformer models, 2021. arXiv:2005.00743.
[25] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, A. Weller, Rethinking attention with performers, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=Ua6zuk0WRH.
[26] P. Chen, PermuteFormer: Efficient relative position encoding for long sequences, in: EMNLP, 2021.
[27] M. Gerner, G. Nenadic, C. M. Bergman, An exploration of mining gene expression mentions and their anatomical locations from biomedical text, in: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, 2010, pp. 72–80.
[28] F. Sarafraz, Finding conflicting statements in the biomedical literature, Ph.D. thesis, University of Manchester, 2012.
[29] E. Noriega-Atala, P. D. Hein, S. S. Thumsi, Z. Wong, X. Wang, S. M. Hendryx, C. T. Morrison, Extracting inter-sentence relations for associating biological context with events in biomedical texts, IEEE/ACM Transactions on Computational Biology and Bioinformatics 17 (2020) 1895–1906. doi:10.1109/TCBB.2019.2904231.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv:1907.11692 (2019).
[33] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[34] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 72–78. URL: https://aclanthology.org/W19-1909. doi:10.18653/v1/W19-1909.
[35] M. A. Valenzuela-Escárcega, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega, M. Surdeanu, C. T. Morrison, Reach, https://github.com/clulab/reach, 2018.
[36] M. A. Valenzuela-Escárcega, Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, C. T. Morrison, Large-scale automated machine reading discovers new cancer driving mechanisms, Database: The Journal of Biological Databases and Curation (2018). URL: http://clulab.cs.arizona.edu/papers/escarcega2018.pdf. doi:10.1093/database/bay098.
[37] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, in: Proceedings of ACL, 2020.