Neural Architectures for Biological Inter-Sentence Relation Extraction

Enrique Noriega-Atala, Peter M. Lovett, Clayton T. Morrison and Mihai Surdeanu
The University of Arizona, Tucson, Arizona, USA

Abstract
We introduce a family of deep-learning architectures for inter-sentence relation extraction, i.e., relations where the participants are not necessarily in the same sentence. We apply these architectures to an important use case in the biomedical domain: assigning biological context to biochemical events. In this work, biological context is defined as the type of biological system within which the biochemical event is observed. The neural architectures encode and aggregate multiple occurrences of the same candidate context mention to determine whether it is the correct context for a particular event mention. We propose two broad types of architectures: the first type aggregates multiple instances that correspond to the same candidate context with respect to an event mention before emitting a classification; the second type independently classifies each instance and uses the results to vote for the final class, akin to an ensemble approach. Our experiments show that the proposed neural classifiers are competitive and that some achieve better performance than previous state-of-the-art traditional machine learning methods, without the need for feature engineering. Our analysis shows that the neural methods particularly improve precision compared to traditional machine learning classifiers, and also demonstrates how the difficulty of inter-sentence relation extraction increases as the distance between the event and context mentions increases.

Keywords
Inter-sentence relation extraction, biological context, natural language processing, neural networks

1. Introduction

Extracting biochemical interactions that describe mechanistic information from scientific literature is a task that has been well studied by the NLP community [1, 2, 3]. Automated event detection systems such as [4, 5, 6, 7, 8, 9, 10, 11] are able to detect and extract biochemical events with high throughput and good recall. The information extracted with such tools enables scientists and researchers to analyze, study and discover mechanistic pathways and their characteristics by aggregating the interactions and biological processes described in the scientific literature.

Table 1
Statistics about the inter-sentence distances of biological context annotations.

  Quantity                     Count
  # of inter-sent. relations   1936
  Mean sent. distance          22
  Median sent. distance       5
  Max sent. distance          225

However, when dealing with such mechanistic processes it is important to identify the biological context in which they hold. Here, biological context means the type of biological system, described at different levels of granularity, such as species, organ, tissue, cellular component, and/or cell line, within which the extracted biochemical interactions are observed. Knowing the biological context is important to correctly interpret the mechanistic pathways described by the literature. For example, some tumors associated with oncogenic Ras in humans are different from those in mice, suggesting that the Ras pathway differs in both species [12]. Ignoring the biological context information, specifically the species in the prior example, can mislead the reader to draw incorrect conclusions.

Biological context is often not explicitly stated in the same clause that contains the biochemical event mention.
Instead, the context is often established explicitly somewhere else in the text, such as a previous sentence or paragraph. In other words, there is a long-distance relation between the event mention and its context. In these cases, the context is implicitly propagated through the discourse that leads up to that particular biochemical event mention, as illustrated in Figure 1. Table 1 and Figure 2 contain summary statistics about the sentence distances for the relations in the corpus used in this work. These statistics indicate that, while most of the inter-sentence relations are close to the event mention they are associated with, there is a long tail of biological context mentions that are further than five sentences away from the corresponding event mentions.

Figure 1: Example of an inter-sentence relation annotated by a domain expert. The biological context, highlighted in blue, is established two sentences prior to the event mention, highlighted in pink.
  "Transfection of the R-Ras siRNA effectively reduced the expression of endogenous R-Ras protein in PC12 cells. These results demonstrate that activation of endogenous R-Ras protein is essential for the ECM mediated cell migration and that regulation of R-Ras activity plays a key role in ECM mediated cell migration. Sema4D and Plexin-B1-Rnd1 inhibits PI3-K activity through its R-Ras GAP activity."

Figure 2: Distribution of inter-sentence distances of biological context annotations (# of annotations vs. # of sentences apart).

We frame the problem of associating event mentions with their biological context as an inter-sentence relation extraction task and propose a family of deep-learning architectures to identify context. The approach inspects an event mention, a candidate context mention, and the text between them to determine whether the candidate context mention is context of the event mention. Our work makes the following contributions:

• Proposes a family of neural architectures that leverages large pre-trained language models for multi-sentence relation extraction.
• Extends a corpus of cancer-related open-access papers with biochemical event extractions annotated with biological context. Unlike the original corpus, this extended data set includes the full text of each article, tokenized and aligned to its annotations.
• Analyzes multiple methods to aggregate different pieces of evidence that correspond to the same input event and context, and assesses the overall performance and reliability of the networks under these different aggregation schemes.

2. Related Work

The problem of relation extraction (RE) has received extensive attention [13, 14], including within the biomedical domain [15, 16], with recent promising results incorporating distant supervision [17]. However, most of the work focuses on identifying relations among entities within the same sentence. In the biological context association problem, the entities are potentially located in different sentences, making the context association task an instance of an inter-sentence relation extraction problem.

Previous work in inter-sentence relation extraction includes [18], which combined within-sentence syntactic features with an introduced dependency link between the root nodes of parse trees from different sentences that contain a given pair of entities. [19] proposes an inter-sentence relation extraction model that builds a labeled-edge graph convolutional neural network on a document-level graph. There have also been efforts to create language resources to foster the development of inter-sentence relation extraction methods. [20] propose an open-domain data set generated from Wikipedia and Wikidata. [21] propose an inter-sentence relation extraction data set constructed using distant supervision.

Modeling inter-sentence relation extraction with transformer architectures requires processing potentially long sequences. Long input sequences are problematic because computing the self-attention matrix has quadratic runtime and space complexity relative to the sequence length. This observation has motivated research efforts to generate efficient approximations of self-attention. [22] proposes a sparse, drop-in replacement for the self-attention mechanism with linear complexity that relies on sliding windows and selects domain-dependent global attention tokens from the input sequence. [23] proposes a lower-rank approximation of the self-attention matrix to linearize the complexity. [24] omits the pair-wise dependencies between the input tokens and then factorizes the attention matrix to reduce its rank. Other approaches [25] rely on kernel functions to compute approximations with linear time and space complexity. [26] takes this approach further by using relative position encodings instead of absolute ones.

Prior work has specifically studied the contextualization of information extraction in the biomedical domain. [27] associates anatomical contextual containers with event mentions that appear in the same sentence via a set of rules that considers lexical patterns in the case of ambiguity and falls back to token distance if no pattern is matched. [28] elaborates on the same idea by incorporating dependency trees into the rules instead of lexical patterns, as well as introducing a method to detect negations and speculative statements.

[29] previously studied the task of context association for the biomedical domain and framed it as a problem of inter-sentence relation extraction. This work presents a set of linguistic and lexical features that describe the neighborhood of the participant entities and proposes an aggregation mechanism that results in improved context association.
Previous work relied upon feature engineering to encode the participants and their potential interactions. State-of-the-art NLP research leverages large language models to exploit transfer learning. Models such as [30], and similar transformer-based architectures [31], better capture the semantics of text based on its surrounding context through unsupervised pre-training over extremely large corpora. Specialized models, such as [32, 33, 34], refine language models by continuing pre-training with in-domain corpora.

To the best of our knowledge, the work presented here is the first to propose and analyze deep-learning aggregation and ensemble architectures for many-to-one, long-distance relation extraction.

3. Neural Architectures for Context Association

We propose a family of neural architectures designed to determine whether a candidate context class is relevant to a given biochemical event mention. A biochemical event mention (event mention for short) describes the interaction between proteins, genes, and other gene products through biochemical reactions such as regulation, inhibition, phosphorylation, etc. In particular, we focus on the 12 interactions detected by REACH [35]. A biological container context mention (context mention for short) represents an instance of any of the following biological container types: species (e.g., human, mice), organ (e.g., liver, lung), tissue type (e.g., endothelium, muscle tissue), cell type (e.g., macrophages, neurons), or cell line (e.g., HeLa, MCF-7).

In this work, we use an existing information extraction system [36] to detect and extract event mentions and candidate context mentions. Candidate context mentions are grounded to ontology concepts with unique identifiers to accommodate different spellings and synonyms that refer to the same biological container type. The specific ontology depends on the type of entity: UniProt (https://www.uniprot.org/) for proteins, PubChem (https://pubchem.ncbi.nlm.nih.gov/) for chemical entities, etc.

Importantly, a biological context container type is likely mentioned multiple times in a document. Approximately half of the context container types in the context-event relation corpus are detected two or more times, as illustrated in Figure 5. Every candidate context mention that refers to the same container type is paired with the relevant event mention to generate a text segment for each pair. Each segment is represented as the concatenation of the sentences that include the event mention, one mention of the candidate context container type, and all the sentences in between. These text segments are used as input to the network to make predictions. If an article contains n_i context mentions of container type i, then for each event mention the network will take up to n_i input text segments to determine whether type i is a context of the event. The task of the network is to learn whether context type i is a context of the specific event mention by looking at a subset of the n_i inputs. An article with j context types and m event mentions yields a total of j × m classification problems and a total of (∑_i n_i) × m input text segments, summing over the j context types. Figure 4 shows a block diagram of the family of architectures.

Figure 4: Context association neural architecture. The left-most box represents the input text segments after pre-processing. The blocks inside the encoded segments box represent BioMed RoBERTa's hidden states for the input segments. The classification embeddings box contains averages of the hidden states corresponding to the <EVT> and <CON> tokens of each input segment. Depending on the choice of architecture, classification embeddings either flow through (a) the aggregation block, which combines them to then generate the final classification; or (b) the voting block, where each embedding is classified, then the final result is generated through a voting function.

Figure 5: Distribution of the number of context class detections per article (n_i).

Each input segment is preprocessed as follows. The boundaries of the relevant event and candidate context mentions are marked with special tokens: <EVT>...</EVT> for the event mention and <CON>...</CON> for the context mention. Other event or context mentions present in the segment are masked with special [EVENT] or [CONTEXT] tokens, respectively, to avoid confusing the classifier with mentions that are not the focus of the current prediction. Figure 3 shows example text spans where the event and context mentions are surrounded by their boundary tokens. Next, each preprocessed text segment is tokenized using the tokenizer specific to the pre-trained transformer used as the encoder. If a tokenized sequence exceeds the maximum length allowed by the transformer, it is truncated before the encoding step by selecting the prefix of the sequence up to half the maximum length and the suffix up to half the maximum length minus one token, and inserting a special <SEP> token between them. Any truncated input segment is guaranteed to retain both mentions and their local lexical context.

Figure 3: Example input text spans. (a) Single-sentence segment with markers; (b) multi-sentence segment with markers and masked secondary event and context mentions; and (c) truncated long multi-sentence segment.
  (a) Phospholipase C delta-4 overexpression upregulates <EVT> ErbB1/2 expression </EVT> , Erk signaling pathway , and proliferation in <CON> MCF-7 </CON> cells .
  (b) Phospholipase C delta-4 overexpression upregulates <EVT> ErbB1/2 expression </EVT> , Erk signaling pathway , …have linked the upregulation of [EVENT] with rapid proliferation in certain [CONTEXT] … <CON> MCF-7 </CON> cells .
  (c) … <CON> macrophages </CON> , and [CONTEXT] , where it is a trimeric complex consisting of one alpha-chain … [SEP] …FcRbeta also acts as a chaperone that increases <EVT> FcepsilonRI expression </EVT>
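As a concrete illustration of the truncation rule described above, the following minimal sketch (not taken from the paper's released code) operates on an already tokenized segment; max_len and sep_id are placeholder names for the encoder's length limit and the id of the special <SEP> token.

```python
def truncate_segment(token_ids, max_len, sep_id):
    """Truncate an over-long tokenized segment as described above: keep a
    prefix of up to half the maximum length and a suffix of up to half the
    maximum length minus one token, joined by a <SEP> token."""
    if len(token_ids) <= max_len:
        return token_ids
    half = max_len // 2
    prefix = token_ids[:half]          # keeps the earlier mention and its local context
    suffix = token_ids[-(half - 1):]   # keeps the later mention and its local context
    return prefix + [sep_id] + suffix  # length is at most half + 1 + (half - 1) = max_len
```

Because the segment is built as the span from one marked mention to the other, the two mentions sit near its two ends, which is why a segment truncated this way retains both mentions and their neighboring tokens.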
Figure bined using a voting function to emit the final 3 shows an example of a segment truncated using this classification. procedure. After tokenization, the segments are encoded using BioMed RoBERTa-base [37] 3 , based on [32]. Intuitively, aggregation functions consider multiple The output hidden states of the < E V T > and < C O N > tokens information points to make an informed decision based are averaged to create a classification embedding. on the “bigger picture” presented by the article. Voting Each classification task emits a single binary predic- functions, on the other hand, make isolated decisions tion, but has up to 𝑛𝑖 classification embeddings to account solely based on information local to each input text seg- for the multiple (potential) context mentions that origi- ment, then use those individual predictions to vote for the final classification, akin to an ensemble approach. 3 We used the available public checkpoint for both the BPE There are multiple ways to implement aggregation and and BioMed RoBERTa models from https://huggingface.co/allenai/ voting functions. We propose four implementations of biomed_roberta_base each kind, each following a intuitive principle. [b]0.49 0.6 0.5 0.4 Score value 0.3 0.2 F1 0.1 Precision Recall 3 4 5 6 7 8 9 10 # of mentions Figure 6: Majority vote [b]0.49 0.65 F1 0.60 Precision 0.55 Recall 0.50 Score value 0.45 0.40 0.35 0.30 0.25 3 4 5 6 7 8 9 10 # of mentions Figure 7: Average aggregation Figure 8: Precision/recall/F1 scores of the relation classifier as the number of context mention considered for each individual relation classification is varied. Documents Event mentions Context mentions Annotations Validation 6 685 713 1,192 Cross validation 20 1,169 1,926 1,543 Total 26 1,854 2,639 2,735 Cross-validation split Training 17 975.83 (58.32) 1,654.83 (52.83) 1,288.33 (95.89) Testing 3 193.16 (58.32) 271.16 (52.83) 254.66 (95.89) Table 2 Statistics of the context association dataset. The upper part shows statistics from the overall dataset, both in total and split by the two partitions: (a) validation set, and (b) partition used for the formal cross-validation experiments. The lower part shows the average and standard deviations used for train/test for the different folds in cross-validation. Aggregation Functions gregation approach concatenates 𝑘 nearest classification embeddings and uses a MLP to reduce the concatenated Nearest Context Mention: Following the intuition that embeddings to a new vector with the same number of textual proximity should be a strong indicator of associ- components as an individual classification embedding. ation, this approach selects the context mention of the The MLP works as map that combines the original 𝑘 relevant context type that is closest to the event mention. classification embeddings whose parameters are learned The closest context mention can appear either before during training. If the number of input text segments or after event mention. In this setting, all other context is < 𝑘, the concatenated classification embeddings are mentions are ignored. The approach results in only one, padded with zeros before being mapped to the new vector unaltered classification embedding. It is equivalent to space. the case where only one mention of the relevant context type appears in a document (𝑛𝑖 = 1). 
Aggregation Functions

Nearest Context Mention: Following the intuition that textual proximity should be a strong indicator of association, this approach selects the context mention of the relevant context type that is closest to the event mention. The closest context mention can appear either before or after the event mention. In this setting, all other context mentions are ignored. The approach results in only one, unaltered classification embedding. It is equivalent to the case where only one mention of the relevant context type appears in a document (n_i = 1).

Average Context Embedding: Conversely, all mentions of the candidate context type can bear a degree of responsibility for determining whether it is context of the event mention. Without making a statement about the importance of each context mention, we consider the text segments of the k nearest context mentions of the relevant context type, to either side. The upper bound k is enforced for efficiency and is left as a hyper-parameter. If there are fewer than k context mentions, all the text segments are considered. The segments are encoded, then the resulting classification embeddings are averaged.

Inverse Distance aggregation: It can be argued that the influence of each context mention on the final decision decreases as it gets farther from the event mention. We propose this aggregation approach, where instead of averaging the k nearest classification embeddings, they are combined as a weighted sum, where each classification embedding's weight is defined as w_i = d_i^-1 / ∑_j d_j^-1, the normalized inverse sentence distance between the event mention and the context mention, with the sum taken over the k nearest mentions. The resulting aggregated embedding still carries information from the nearest k context mentions, but their contributions diminish inversely proportionally to their distance from the event mention.

Parameterized aggregation: Instead of relying upon a heuristic approach to calculate the weights that determine the contributions of each classification embedding, we let the network learn the interactions between them using an attention mechanism. The parameterized aggregation approach concatenates the k nearest classification embeddings and uses an MLP to reduce the concatenation to a new vector with the same number of components as an individual classification embedding. The MLP works as a map that combines the original k classification embeddings and whose parameters are learned during training. If the number of input text segments is smaller than k, the concatenated classification embeddings are padded with zeros before being mapped to the new vector space.

Voting Functions

One hit: This voting approach requires the minimum amount of evidence to trigger a positive classification. The context type is classified as is context of the event mention if at least one classification embedding is classified as positive. Intuitively, this voting function favors recall.

Majority vote: Conversely, it can be argued that there should be consensus in the vote. The majority vote function triggers a positive classification if at least half of the classification embeddings are classified as positive. In contrast to one hit, this voting function favors precision.

Post-inverse distance vote: Analogous to the inverse distance aggregation approach, this approach weights the vote of each classification embedding by the normalized inverse sentence distance: w_i = d_i^-1 / ∑_j d_j^-1. The final classification is emitted in favor of the class with the highest total weight. As opposed to the inverse distance aggregation approach, the combination happens after passing the embeddings through the MLP.

Confidence vote: We can weight each vote proportionally to the confidence of the classifier. In this approach, the vote of each individual classification is weighted by the classifier's confidence. The weights are given by the normalized logits of the vote of each classification embedding: w_i = l_i / ∑_j l_j.
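The sketch below illustrates two of the heuristics above under stated assumptions: each of the k segments comes with its sentence distance d_i or a binary per-segment prediction, and the guard for mentions located in the same sentence (distance 0) is an assumption not spelled out in the text.

```python
def inverse_distance_weights(distances):
    """w_i = d_i^-1 / sum_j d_j^-1: normalized inverse sentence distances,
    as used by inverse distance aggregation and the post-inverse distance vote."""
    inverse = [1.0 / max(d, 1) for d in distances]  # assumption: treat distance 0 as 1
    total = sum(inverse)
    return [value / total for value in inverse]

def one_hit(predictions):
    """Positive if at least one per-segment prediction is positive (favors recall)."""
    return any(predictions)

def majority_vote(predictions):
    """Positive if at least half of the per-segment predictions are positive
    (favors precision)."""
    return sum(predictions) >= len(predictions) / 2
```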
4. Full-Text Context-Event Relation Corpus

We used a corpus of biochemical events annotated with biological context to test the neural architectures for context assignment. Our version of the corpus is an extension of the corpus published by [29].

The corpus consists of automated extractions from 26 open-access articles in the PubMed Central repository, all related to the domain of cancer biology. The first type of extractions are event mentions. An event mention is a relation between one or more entities participating in a biochemical reaction or its regulation. These mentions can be phosphorylation, ubiquitination, expression, etc. The second type of extractions are candidate context mentions. These consist of named entity extractions of different biological container types: species, tissue types and cell lines.

Each extracted event was annotated by up to three biologists, who assigned the event's relevant biological context from a pool of candidate context extractions available in the paper. Context annotations are not exclusive, meaning that every event mention can be annotated with one or more context classes. The result is a set of annotated events, where each event can have zero or more biological context associations, and there is at least one explicit mention of each biological context in the same article. The specifics of the automated event extraction procedure, annotation tool, annotation protocols and inter-annotator agreements are thoroughly detailed in [29]. Table 2 contains summary statistics of the data set's documents.

Table 2
Statistics of the context association dataset. The upper part shows statistics for the overall dataset, both in total and split by the two partitions: (a) the validation set, and (b) the partition used for the formal cross-validation experiments. The lower part shows the averages (and standard deviations) of the train/test splits across the different cross-validation folds.

                     Documents   Event mentions    Context mentions    Annotations
  Validation         6           685               713                 1,192
  Cross validation   20          1,169             1,926               1,543
  Total              26          1,854             2,639               2,735

  Cross-validation split
  Training           17          975.83 (58.32)    1,654.83 (52.83)    1,288.33 (95.89)
  Testing            3           193.16 (58.32)    271.16 (52.83)      254.66 (95.89)

The original corpus release lacked the full text of the articles, while our proposed methodology requires the raw text as input to the neural architectures. Our contribution here is an extension of this corpus in which we identified, processed and tokenized the full text of the articles using the same information extraction tool [35] used by the authors of the original corpus, in such a way that the tokens align correctly with the previously published annotations and extractions. The full-text context-event relation corpus, along with the code for the experiments presented in this document, is publicly available for reproducibility and further research at https://clulab.github.io/neuralbiocontext/.

5. Experiments and Results

In this section, we evaluate all proposed variants of the context association architecture and discuss the results.

5.1. Automatic Negative Examples

The context-event relation corpus only contains positive context annotations of event mentions. We automatically generate negative examples for the event mentions in each document by enumerating the cartesian product of all event and context mentions and then subtracting the annotated pairs. One consequence of generating negative examples using this exhaustive strategy is that most of the event/context pairs become negative examples, with 60,367 (95.68%) negative pairs and 2,703 (4.32%) positive pairs. This results in a severe class imbalance, which makes the classification task harder.
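A small sketch of this exhaustive pairing is shown below, assuming per-document lists of event mention identifiers and grounded context types; the names are illustrative rather than the corpus' actual field names.

```python
from itertools import product

def build_examples(event_mentions, context_types, annotated_pairs):
    """Enumerate every (event mention, context type) pair in a document;
    pairs annotated by the experts are positive, all remaining pairs negative."""
    positives = set(annotated_pairs)
    return [
        (event, context, (event, context) in positives)
        for event, context in product(event_mentions, context_types)
    ]
```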
5.2. Results and Discussion

We use a cross-validation evaluation framework similar to the evaluation methodology used by [29]. Each fold contains all of the event-context pairs that belong to three different articles. Additionally, we held out six papers as a development set. During cross-validation, one fold is used for testing and training is performed on the remaining k − 1 folds plus the data from the development set. This way, we take advantage of more training data and avoid leaking information from development into testing.

To better understand the impact of considering multiple context mentions at the time of aggregation or voting, we tuned this hyper-parameter on the development set. Figure 8 shows the effect of increasing the number of context mentions used for relation classification. The number of context mentions considered ranged from three to ten. Both architectures reach a peak F1 score between three and five context mentions. Performance then decays quickly, almost asymptotically, as the number of considered context mentions increases. This observation suggests that increasing the number of input text segments derived from context mentions that are farther from the event introduces too much noise into the decision process.

Figure 6: Majority vote
Figure 7: Average aggregation
Figure 8: Precision/recall/F1 scores of the relation classifier as the number of context mentions considered for each individual relation classification is varied.

After the above tuning, we ran cross-validation experiments for all aggregation and voting methods. Based on the tuning results, we used the closest five mentions of each context class for the average aggregation architecture, and the closest three for all of the other architectures. Table 3 summarizes the cross-validation performance scores for all the architecture variants. The precision, recall, and F1 scores reported are computed only for the positive class (i.e., is context of) to avoid artificially inflating the scores with the dominant negative class.

Table 3
Cross-validation results for the is context of class. * denotes a statistically significant improvement w.r.t. the random forest classifier.

  Method                Precision   Recall   F1
  Majority (3 votes)    0.580*      0.498    0.536*
  Parameterized agg.    0.537*      0.494    0.514*
  One-hit               0.409       0.668*   0.507
  Post inv. distance    0.571*      0.446    0.501
  Nearest mention       0.541*      0.464    0.499
  Average (5 segs)      0.527       0.469    0.497
  Inverse distance      0.544*      0.454    0.495
  Confidence vote       0.394       0.443    0.417

  Baselines
  Random forest         0.439       0.541    0.485
  Logistic regression   0.361       0.699    0.476
  Heuristic             0.421       0.548    0.476
  Decision tree         0.311       0.389    0.345

The top-performing architecture is the majority vote, which achieves an F1 score slightly above 0.53. The majority vote architecture trades off recall for precision: the architecture needs to see at least half of the individual input segments classified as positive in order to make a positive prediction, so a positive classification from this architecture comes with relatively high confidence. As expected, the one-hit architecture achieves the opposite trade-off, sacrificing precision for recall. One-hit only needs to see one individual positive classification in order to emit a positive final classification. As a result, one-hit attains the highest recall among the neural architectures but is more prone to false positives.
We include several baseline algorithms against which to compare the performance of the neural architectures. The first baseline is a "heuristic" method that associates with an event mention all the context types within a constant number of sentences. We also include our implementations of three classifiers using the feature engineering method of [29]. The top three performing neural architectures have a statistically significantly higher F1 score than the random forest classifier, which is the strongest baseline algorithm.

Note that the methods proposed by [29] that are included in the table aggregate multiple feature vectors from the different context mentions into a new feature vector composed of multiple statistics computed over the original feature space. Examples of these feature aggregations include the minimum, maximum and average values of the distribution of sentence distances, the frequency of the context type, and the proportion of times the context mention is part of a noun phrase. Their aggregation approach is analogous to the one presented here (although here we operate in embedding space), which is why the comparison between these two approaches is fair.

Table 4 lists the classification scores of the top-performing method, stratifying the data by the sentence distance to the closest context mention of the relevant class. Performance, along with the frequency of such instances, quickly degrades as the distance between event and context mention increases.

Table 4
Cross-validation scores for the positive class of the Majority (3 votes) architecture, stratified by sentence distance to the closest context mention of the same class.

  Distance   Precision   Recall   F1      Support
  0          0.796       0.818    0.807   573
  1          0.490       0.450    0.469   262
  2          0.398       0.336    0.364   146
  3          0.531       0.402    0.457   107
  4          0.569       0.393    0.465   84
  5+         0.214       0.131    0.163   351

6. Conclusions

We propose a family of neural architectures to detect the biological context of biochemical events. We approach the problem as an inter-sentence relation extraction task that uses multiple pieces of document-level evidence to classify whether a specific context label is the correct context type of an event extraction.

We provide an analysis of different methods to combine evidence into a final decision. The approaches work either before classification, by aggregating embeddings in order to emit a decision, or after classification, by creating ensembles that vote over multiple individual decisions.

Using an expert-annotated corpus that associates biochemical events with relevant biological context, our results show that, in spite of the severe class imbalance, several of the neural architectures are competitive and achieve higher classification performance than a deterministic heuristic and other machine learning approaches.

The neural architectures particularly favor precision, which makes them more appealing for applications where higher precision is desirable.

Inter-sentence relation extraction continues to be a challenge. An ablation study of the degree of aggregation of evidence shows how considering mentions that are farther from the event degrades performance. An error analysis by sentence distance shows how the difficulty of inter-sentence relation extraction correlates with the distance between the participants. The results of these analyses suggest that understanding how to filter out noisy event-context mention pairs, and how to better weight the contribution of long-spanning mention pairs, are important directions for future research.

References

[1] P. R. Cohen, DARPA's Big Mechanism program, Physical Biology 12 (2015) 045008. URL: https://doi.org/10.1088/1478-3975/12/4/045008. doi:10.1088/1478-3975/12/4/045008.
[2] D. Zhou, D. Zhong, Y. He, Biomedical relation extraction: From binary to complex, Computational and Mathematical Methods in Medicine 2014 (2014) 1–18. doi:10.1155/2014/298473.
[3] L. Hirschman, A. Yeh, C. Blaschke, A. Valencia, Overview of BioCreAtIvE: critical assessment of information extraction for biology, BMC Bioinformatics 6 (2005) S1. URL: https://doi.org/10.1186/1471-2105-6-S1-S1. doi:10.1186/1471-2105-6-S1-S1.
[4] M. A. Valenzuela-Escárcega, Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, C. T. Morrison, Large-scale automated reading with Reach discovers new cancer driving mechanisms, in: Proceedings of the BioCreative VI Workshop (BioCreative6), 2017.
[5] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, C. D. Manning, Model combination for event extraction in BioNLP 2011, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 51–55.
[6] H. Kilicoglu, S. Bergler, Adapting a general semantic interpretation approach to biological event extraction, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 173–182.
[7] C. Quirk, P. Choudhury, M. Gamon, L. Vanderwende, MSR-NLP entry in BioNLP Shared Task 2011, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 155–163.
[8] J. Björne, T. Salakoski, Generalizing biomedical event extraction, in: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011, pp. 183–191.
[9] J. Björne, T. Salakoski, Biomedical event extraction using convolutional neural networks and dependency parsing, in: BioNLP, 2018.
[10] H.-L. Trieu, T. T. Tran, K. N. A. Duong, A. Nguyen, M. Miwa, S. Ananiadou, DeepEventMine: end-to-end neural nested event extraction from biomedical texts, Bioinformatics 36 (2020) 4910–4917. URL: https://doi.org/10.1093/bioinformatics/btaa540. doi:10.1093/bioinformatics/btaa540.
[11] S. Rao, D. Marcu, K. Knight, H. Daumé, Biomedical event extraction using abstract meaning representation, in: BioNLP 2017, 2017, pp. 126–135.
[12] N. M. Hamad, J. H. Elconin, A. E. Karnoub, W. Bai, J. N. Rich, R. T. Abraham, C. J. Der, C. M. Counter, Distinct requirements for Ras oncogenesis in human versus mouse cells, Genes & Development 16 (2002) 2045–2057.
[13] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007, pp. 2670–2676.
[14] N. Bach, S. Badaskar, A review of relation extraction, Literature review for Language and Statistics II (2007).
[15] C. Quan, M. Wang, F. Ren, An unsupervised text mining method for relation extraction from biomedical literature, PLOS One (2014).
[16] K. Fundel, R. Küffner, R. Zimmer, RelEx – Relation extraction using dependency parse trees, Bioinformatics 23 (2007) 365–371.
[17] H. Poon, K. Toutanova, C. Quirk, Distant supervision for cancer pathway extraction from text, in: Pacific Symposium on Biocomputing, 2015.
[18] K. Swampillai, M. Stevenson, Extracting relations within and across sentences, in: Proceedings of Recent Advances in Natural Language Processing, 2011.
[19] S. K. Sahu, F. Christopoulou, M. Miwa, S. Ananiadou, Inter-sentence relation extraction with document-level graph convolutional neural network, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4309–4316. URL: https://aclanthology.org/P19-1423. doi:10.18653/v1/P19-1423.
[20] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, M. Sun, DocRED: A large-scale document-level relation extraction dataset, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 764–777. URL: https://aclanthology.org/P19-1074. doi:10.18653/v1/P19-1074.
[21] A. Mandya, D. Bollegala, F. Coenen, K. Atkinson, A dataset for inter-sentence relation extraction using distant supervision, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1246.
[22] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).
[23] S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma, Linformer: Self-attention with linear complexity, arXiv preprint arXiv:2006.04768 (2020).
[24] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, C. Zheng, Synthesizer: Rethinking self-attention in transformer models, 2021. arXiv:2005.00743.
[25] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, A. Weller, Rethinking attention with performers, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=Ua6zuk0WRH.
[26] P. Chen, PermuteFormer: Efficient relative position encoding for long sequences, in: EMNLP, 2021.
[27] M. Gerner, G. Nenadic, C. M. Bergman, An exploration of mining gene expression mentions and their anatomical locations from biomedical text, in: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, 2010, pp. 72–80.
[28] F. Sarafraz, Finding conflicting statements in the biomedical literature, Ph.D. thesis, University of Manchester, 2012.
[29] E. Noriega-Atala, P. D. Hein, S. S. Thumsi, Z. Wong, X. Wang, S. M. Hendryx, C. T. Morrison, Extracting inter-sentence relations for associating biological context with events in biomedical texts, IEEE/ACM Transactions on Computational Biology and Bioinformatics 17 (2020) 1895–1906. doi:10.1109/TCBB.2019.2904231.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv:1907.11692 (2019).
[33] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[34] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 72–78. URL: https://aclanthology.org/W19-1909. doi:10.18653/v1/W19-1909.
[35] M. A. Valenzuela-Escárcega, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega, M. Surdeanu, C. T. Morrison, Reach, https://github.com/clulab/reach, 2018.
[36] M. A. Valenzuela-Escárcega, Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, C. T. Morrison, Large-scale automated machine reading discovers new cancer driving mechanisms, Database: The Journal of Biological Databases and Curation (2018). URL: http://clulab.cs.arizona.edu/papers/escarcega2018.pdf. doi:10.1093/database/bay098.
[37] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, in: Proceedings of ACL, 2020.