Extended Overview of ChEMU 2021: Reaction Reference Resolution and Anaphora Resolution in Chemical Patents

Yuan Li1, Biaoyan Fang1, Jiayuan He1,2, Hiyori Yoshikawa1,3, Saber A. Akhondi4, Christian Druckenbrodt5, Camilo Thorne5, Zubair Afzal4, Zenan Zhai1, Timothy Baldwin1 and Karin Verspoor1,2
1 The University of Melbourne, Australia
2 RMIT University, Australia
3 Fujitsu Limited, Japan
4 Elsevier BV, Netherlands
5 Elsevier Information Systems GmbH, Germany

Abstract
In this paper, we provide an overview of the Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2021, part of the Conference and Labs of the Evaluation Forum 2021 (CLEF 2021). The ChEMU evaluation lab focuses on information extraction over chemical reactions from patent texts. As the second instance of our ChEMU lab series, we build upon the ChEMU corpus developed for ChEMU 2020, extending it for two distinct tasks related to reference resolution in chemical patents. Task 1 — Chemical Reaction Reference Resolution — focuses on paragraph-level references and aims to identify the chemical reactions or general conditions specified in one reaction description that are referred to by another. Task 2 — Anaphora Resolution — focuses on expression-level references and aims to identify the reference relationships between expressions in chemical reaction descriptions. Herein, we describe the resources created for these tasks and the evaluation methodology adopted. We also provide a brief summary of the results obtained in this lab, finding that one submission achieves substantially better results than our baseline models.

Keywords
Reaction reference resolution, Anaphora resolution, Chemical patents, Text mining, Information Extraction

1. Introduction
The discovery of new chemical compounds is perceived as a key driver of the chemical industry and many other industrial sectors, and information relevant for this discovery is found in chemical synthesis descriptions in natural language texts. In particular, patents serve as a critical source of information about new chemical compounds. Compared with journal publications, patents provide more timely and comprehensive information about new chemical compounds [1, 2, 3], since they are usually the first venues where new chemical compounds are disclosed.

Figure 1: Illustration of the task hierarchy.

Despite the significant commercial and research value of the information in patents, manual extraction of such information is costly, considering the large volume of patents available [4, 5]. Thus, developing automatic natural language processing (NLP) systems for chemical patents, which convert text corpora into structured knowledge about chemical compounds, has become a focus of recent research [6, 7]. The ChEMU campaign focuses on information extraction tasks over chemical reactions in patents.1 ChEMU 2020 [6, 8, 9] provided two information extraction tasks, named entity recognition (NER) and event extraction, and attracted 37 teams around the world to participate.
In the ChEMU 2021 lab, we provide two new information extraction tasks, chemical reaction reference resolution and anaphora resolution, focusing on reference resolution in chemical patents. Compared with previous shared tasks dealing with anaphora resolution, e.g., the CRAFT-CR task [10], our proposed tasks extend the scope of reference resolution by considering reference relationships at both the paragraph level and the expression level (see Fig. 1). Specifically, our first task aims at the identification of reference relationships between reaction descriptions. Our second task aims at the identification of reference relationships between chemical expressions, including both coreference and bridging. Moreover, we focus on chemical patents, while the CRAFT-CR task focused on journal articles.

Unfortunately, we did not receive any submissions to Task 1, chemical reaction reference resolution. The complexity of this task, combined with the relatively short time available for participants to develop their systems, may have made it difficult for teams to take part. We plan to re-run the task in 2022, giving more teams the opportunity to participate once the data and task definitions have been available for a longer period. As a result, the remainder of this paper focuses on the second task, anaphora resolution.

The rest of the paper is structured as follows. We first discuss related work and shared tasks in Section 2 and introduce the corpus we created for use in the lab in Section 3. We then give an overview of the task in Section 4 and detail the evaluation framework of ChEMU in Section 5, including the evaluation methods and baseline models. We present the evaluation results in Section 6 and finally conclude the paper in Section 7.

1 Our main website is http://chemu.eng.unimelb.edu.au

2. Related Shared Tasks
Several shared tasks have addressed reference resolution in scientific literature. BioNLP 2011 hosted a subtask on protein coreference [11]. CRAFT 2019 hosted a subtask on coreference resolution (CRAFT-CR) in biomedical articles [10]. However, these shared tasks differ from ours in several respects.

First, previous shared tasks considered different domains of scientific literature. For example, the dataset used in BioNLP 2011 is derived from the GENIA corpus [12], which primarily focuses on the biological domain, viz. genes/proteins and their regulation. The dataset used in the CRAFT-CR shared task is based on biomedical journal articles in PubMed [13, 14]. Our ChEMU shared task, in contrast, focuses on the domain of chemical patents. This difference is critical for this shared task: information extraction methodologies for general scientific literature or the biomedical domain will not be effective for chemical patents [15]. It is widely acknowledged that patents are written quite differently from general scientific literature, resulting in substantially different linguistic properties. For example, patent authors may trade some clarity in wording for more protection of their intellectual property.

Secondly, our reference resolution tasks include both paragraph-level and entity-level reference phenomena. Our first task aims at identification of reference relationships between reaction descriptions, i.e. at the paragraph level. This task is challenging because a reaction description may refer to an extremely remote reaction and thus requires the processing of very long documents.
Our second task aims at anaphora resolution, similar to previous entity-level coreference tasks. However, a key difference is that we extend the scope of this task by including both coreference and bridging phenomena. That is, we aim at finding not only expressions referring to the same entity, but also expressions that are semantically related or associated.

3. The ChEMU Chemical Reaction Corpus
In this section, we explain how the dataset was created for the anaphora resolution task. The complete annotation guidelines are made available at [16].

3.1. Corpus Selection
We build on the ChEMU corpus [17] developed for the ChEMU 2020 shared task [18]. The ChEMU corpus contains patents from the European Patent Office and the United States Patent and Trademark Office, available in English in a digital format. It is based on the Reaxys® database,2 containing reaction entries for patent documents manually created by experts in chemistry. It consists of ‘snippets’ extracted from chemical patents, where each snippet corresponds to a reaction description. It is common that several snippets are extracted from the same chemical patent.

2 Reaxys® Copyright ©2021 Elsevier Life Sciences IP Limited except certain content provided by third parties. Reaxys is a trademark of Elsevier Life Sciences IP Limited, used under license. https://www.reaxys.com

3.2. Mention Type
We aim to capture anaphora in chemical patents, with a focus on identifying chemical compounds during the reaction process. Consistent with other anaphora corpora [19, 13, 20], only mentions that are involved in referring relationships (as defined in Section 3.3) and related to chemical compounds are annotated. The mention types that are considered for anaphora annotation are listed below. It should be noted that verbs (e.g. mix, purify, distil) and descriptions that refer to events (e.g. the same process, step 5) are not annotated in this corpus.

3.2.1. Chemical Names
Chemical names are a critical component of chemical patents. We capture as atomic mentions the formal names of chemical compounds, e.g. N-[4-(benzoxazol-2-yl)-methoxyphenyl]-S-methyl-N’-phenyl-isothiourea or 2-Chloro-4-hydroxy-phenylboronic acid. Chemical names often include nested chemical components, but for the purposes of our corpus, we consider chemical names to be atomic and do not separately annotate internal mentions. Hence 4-(benzoxazol-2-yl)-methoxyphenyl and acid in the examples above will not be annotated as mentions, as they are part of larger chemical names.

3.2.2. Identifiers
In chemical patents, identifiers or labels may also be used to represent chemical compounds, in the form of uniquely-identifying sequences of numbers and letters such as 5i. These can be abbreviations of longer expressions incorporating that identifier that occur earlier in the text, such as chemical compound 5i, or may refer back to an exact chemical name with that identifier. Thus, the identifier is annotated as an atomic mention as well.

3.2.3. Phrases and Noun Types
Apart from chemical names and identifiers, chemical compounds are commonly presented as noun phrases (NPs). An NP consists of a noun or pronoun, and premodifiers; NPs are the most common type of compound expression in chemical patents. Here we detail NPs that are related to compounds:
1. Pronouns: In chemical patents, pronouns (e.g. they or it) usually refer to a previously mentioned chemical compound.
2. Definite and indefinite NPs: Commonly used to refer to chemical compounds, e.g. the solvent, the title compound, the mixture, a white solid, and a crude product.

Furthermore, there are a few types of NPs that need specific handling in chemical patents:
1. Quantified NPs: Chemical compounds are usually described with a quantity. NPs with quantities are considered as atomic mentions if the quantities are provided, e.g. 398.4 mg of the compound 1.
2. NPs with prepositions: Chemical NPs connected with prepositions (e.g. in, with, of) can be considered as a single mention. For example, the phrase 2,4-dichloro-6-(6-trifluoromethylpyridin-2-yl)-1,3,5-triazine (5.0 g, 16.9 mmol) in tetrahydrofuran (100 mL) is a single mention, as it describes a solvent that contains 2,4-dichloro-6-(6-trifluoromethylpyridin-2-yl)-1,3,5-triazine (5.0 g, 16.9 mmol) and tetrahydrofuran (100 mL).

NPs describing chemical equipment containing a compound may also be relevant to anaphora resolution. This generally occurs when the equipment that contains the compound undergoes a process that also affects the compound. Thus, equipment expressions such as the flask and the autoclave can also be mentions if they are used to implicitly refer to a contained compound.

Unlike many annotation schemes, our annotation allows discontinuous mentions. For example, the underlined spans of the fragment 114 mg of 4-((4aS,7aS)-6-benzyloctahydro-1-pyrrolo[3,4-b]pyridine-1-yl)-7H-pyrrolo[2,3-d]pyrimidine was obtained with a yield of about 99.1% are treated as a single discontinuous mention. This introduces further complexity into the task and helps to capture more comprehensive anaphora phenomena.

3.2.4. Relationship to ChEMU 2020 entities
Since this dataset is built on the ChEMU 2020 corpus [17], annotation of related chemical compounds is available by leveraging existing entity annotations introduced for the ChEMU 2020 named entity recognition (NER) task. However, there are some differences in the definitions of entities for the two tasks. In the original ChEMU 2020 corpus, entity annotations identify chemical compounds (i.e. REACTION_PRODUCT, STARTING_MATERIAL, REAGENT_CATALYST, SOLVENT, and OTHER_COMPOUND), reaction conditions (i.e. TIME, TEMPERATURE), quantity information (i.e. YIELD_PERCENT, YIELD_OTHER), and example labels (i.e. EXAMPLE_LABEL). There is overlap with our definition of mention for the labels relating to chemical compounds. However, in our annotation, chemical names are annotated along with additional quantity information, as we consider this information to be an integral part of the chemical compound description. Furthermore, the original entity annotations do not include generic expressions that corefer with chemical compounds, such as the mixture, the organic layer, or the filtrate, and neither do they include equipment descriptions.

3.3. Relation Types
Anaphora resolution subsumes both coreference and bridging. In the context of chemical patents, we define four sub-types of bridging, incorporating generic and chemical knowledge. A referring mention which cannot be interpreted on its own, or an indirect mention, is called an anaphor, and the mention which it refers back to is called the antecedent. In relation annotation, we preserve the direction of the anaphoric relation, from the anaphor to the antecedent. Following similar assumptions in recent work, we restrict annotations to cases where the antecedent appears earlier in the text than the anaphor.

3.3.1. Coreference
Coreference is defined as expressions/mentions that refer to the same entity [21, 22].
In chemistry, identifying whether two mentions refer to the same entity requires consideration of various chemical properties (e.g. temperature or pH). As such, for two mentions to be coreferent, they must share the same chemical properties. We consider two different cases of coreference:
1. Single Antecedents: the anaphor refers to a single antecedent.
2. Multiple Antecedents: the anaphor refers to multiple antecedents, e.g. start materials refers to all the chemical compounds or materials that are used at the beginning.

It is possible for there to be ambiguity as to which mention of a given antecedent an anaphor refers to (where the mentions are identical); in these cases the closest mention is selected.

Figure 2: Annotated snippet of anaphora resolution in chemical patents. The figure is taken from [23]. Different colors of links represent different anaphora relation types.

3.3.2. Bridging
As stated above, when we consider anaphora relations, we take the chemical properties of the mentions into consideration. Coreference is insufficient to cover all instances of anaphora in chemical patents, and bridging occurs frequently. We define four bridging types:

TRANSFORMED: Links between chemical compounds that are initially based on the same components, but which have undergone a change in condition, such as pH or temperature. Such cases must be one-to-one relations (not one-to-many). As shown in Figure 2, the mixture in line 2 and the first-mentioned mixture in line 3 have the TRANSFORMED relation, as they have the same chemical components but different chemical properties.

REACTION-ASSOCIATED: The relationship between a chemical compound and its immediate source compounds via a mixing process, where the source compounds retain their original chemical structure. This relation is one-to-many from the anaphor to the source compounds (antecedents). For example, the mixture in line 2 has REACTION-ASSOCIATED links to three mentions on line 1 that are combined to form it: (1) the solution of Compound (4) (0.815 g, 1.30 mmol) in THF (4.9 ml); (2) acetic acid (9.8 ml); and (3) water (4.9 ml).

WORK-UP: Chemical compounds are used to isolate or purify an associated output product, in a one-to-many relation from the anaphor to the compounds (antecedents) that are used for the work-up process. As demonstrated in Figure 2, The combined organic layer in line 5 comes from the extraction of The mixture and ethyl acetate in line 4, and they are hence annotated as WORK-UP.

CONTAINED: A chemical compound is contained inside equipment. This is a one-to-many relation from the anaphor (equipment) to the compounds (antecedents) that it contains. An example of this is a flask and the solution of Compound (4) (0.815 g, 1.30 mmol) in THF (4.9 ml) on line 1, where the compound is contained in the flask.

3.4. Annotation Process
For the corpus annotation, we use the BRAT text annotation tool.3 In total, 1500 snippets have been annotated by two chemical experts, a PhD candidate and a final-year bachelor student in Chemistry. A draft of the annotation guidelines was created and refined with chemical experts, then four rounds of annotation training were completed prior to beginning official annotation. In each round, the two annotators individually annotated the same 10 snippets (different across each round of annotation), and their annotations were compared and combined by an adjudicator; annotation guidelines were then refined based on discussion. After several rounds of training, we achieved a high inter-annotator agreement of Krippendorff’s α = 0.92 [24] at the mention annotation level,4 and α = 0.84 for relations. Finally, the development and test sets were double annotated by the two expert annotators, with any disagreements merged by the adjudicator.

3 https://brat.nlplab.org/
4 With the lowest agreement being α = 0.89 for coreference mentions.
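Annotations produced with BRAT are stored in its standoff format, with tab-separated entity (T) lines and relation (R) lines. The sketch below shows how a discontinuous mention and an anaphoric link could look under those conventions, assuming the released data follows BRAT's usual format; the identifiers, offsets, label strings, and example text are invented for illustration and are not taken from the corpus.

```python
# A minimal sketch of BRAT-style standoff annotations, assuming ChEMU-Ref
# follows BRAT's usual conventions. Offsets, IDs, and label strings below
# are invented for illustration only.
EXAMPLE_ANN = (
    "T1\tMENTION 0 9;52 64\t398.4 mg was obtained\n"   # discontinuous mention
    "T2\tMENTION 13 31\tthe title compound\n"
    "R1\tCOREFERENCE Arg1:T2 Arg2:T1\n"                # anaphor -> antecedent
)

def parse_standoff(ann_text):
    """Parse entity (T) and relation (R) lines from a BRAT .ann string."""
    mentions, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if fields[0].startswith("T"):
            tid, type_and_spans, surface = fields
            mtype, spans = type_and_spans.split(" ", 1)
            # Discontinuous mentions separate their span fragments with ";".
            fragments = [tuple(map(int, frag.split())) for frag in spans.split(";")]
            mentions[tid] = (mtype, fragments, surface)
        elif fields[0].startswith("R"):
            rid, args = fields
            rtype, arg1, arg2 = args.split(" ")
            relations.append((rid, rtype, arg1.split(":")[1], arg2.split(":")[1]))
    return mentions, relations

mentions, relations = parse_standoff(EXAMPLE_ANN)
print(mentions["T1"])   # ('MENTION', [(0, 9), (52, 64)], '398.4 mg was obtained')
print(relations[0])     # ('R1', 'COREFERENCE', 'T2', 'T1')
```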
3.5. Data Partitions
We randomly partitioned the whole dataset into three splits for training, development, and test purposes, with a ratio of 0.6/0.15/0.25. The training and development sets were released to participants for model development. Note that participants were allowed to combine the training and development sets and to use their own partitions to build models. The test set was withheld for use in the formal evaluation. The statistics of the three splits, including their number of snippets, total number of sentences, and average number of tokens per sentence, are summarized in Table 1. To ensure that the snippets included in the training, development, and test splits have similar distributions, we compare the distribution of relation types (five types of relations in total). Based on the numbers in Table 1, we confirm that the label distributions in the three splits are similar, with very little variation (≤ 2%) across the three splits observed for each relation type.

Table 1
Corpus annotation statistics.

                         Training   Development    Test
Snippets                      763           164     274
Sentences                    6392          1535    2585
Tokens/Sentence              15.8          15.2    15.8
Mentions                    19626          4515    7810
Discontinuous mentions        876           235     399
Coreference                  3568           870    1491
Bridging                    10377          2419    4135
Transformed                   493           107     166
Reaction-associated          3308           764    1245
Work-up                      6230          1479    2576
Contained                     346            69     148

4. Task Definition
This task requires the resolution of general anaphoric dependencies between expressions in chemical patents. Five types of anaphoric relationships are defined:
1. Coreference: two expressions/mentions that refer to the same entity.
2. Transformed: two chemical compound entities that are initially based on the same chemical components and have undergone possible changes through various conditions (e.g., pH and temperature).
3. Reaction-associated: the relationship between a chemical compound and its immediate sources via a mixing process. The immediate sources do not need to be reagents, but they need to end up in the corresponding product. The source compounds retain their original chemical structure.
4. Work-up: the relationship between chemical compounds that were used for isolation or purification purposes, and their corresponding output products.
5. Contained: the association holding between chemical compounds and the related equipment in which they are placed. The direction of the relation is from the related equipment to the previous chemical compound.
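As an illustration of how this relation inventory and the one-to-many convention can be represented, a small sketch follows; the class, field, and identifier names are our own and are not part of the official data format.

```python
# Illustrative representation of the five anaphoric relation types; the names
# and identifiers here are our own sketch, not the official ChEMU-Ref format.
from dataclasses import dataclass
from enum import Enum

class RelationType(Enum):
    COREFERENCE = "Coreference"
    TRANSFORMED = "Transformed"
    REACTION_ASSOCIATED = "Reaction-associated"
    WORK_UP = "Work-up"
    CONTAINED = "Contained"

@dataclass(frozen=True)
class AnaphoricLink:
    """One directed link from an anaphor to a single antecedent."""
    rel_type: RelationType
    anaphor: str      # identifier of the anaphor mention, e.g. a BRAT T-id
    antecedent: str   # identifier of the antecedent mention

# One-to-many relations (e.g. REACTION-ASSOCIATED) are represented as one link
# per antecedent, all sharing the same anaphor, as in the Section 3.3 example
# where "the mixture" links back to each of its three source compounds.
links = [
    AnaphoricLink(RelationType.REACTION_ASSOCIATED, "T_mixture_1", "T_solution"),
    AnaphoricLink(RelationType.REACTION_ASSOCIATED, "T_mixture_1", "T_acetic_acid"),
    AnaphoricLink(RelationType.REACTION_ASSOCIATED, "T_mixture_1", "T_water"),
]
```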
Taking the text snippet in Figure 3 as an example, several anaphoric relationships can be extracted from it. [The mixture]4 and [the mixture]3 refer to the same “mixture” and thus form a coreference relationship. The two expressions [The mixture]1 and [the mixture]2 are initially based on the same chemical components, but the properties of [the mixture]2 change after the “stir” and “cool” actions. Thus, the two expressions should be linked as “Transformed”. The expression [The mixture]1 comes from mixing the chemical compounds prior to it, e.g., [water (4.9 ml)]. Thus, the two expressions are linked as “Reaction-associated”. The expression [The combined organic layer] comes from the extraction of [ethyl acetate]. Thus, they are linked as “Work-up”. Finally, the expression [the solution] is contained by the entity [a flask], and the two are linked as “Contained”.

[Acetic acid (9.8 ml)] and [water (4.9 ml)] were added to [the solution] in [a flask]. [The mixture]1 was stirred for 3 hrs at 50°C and then cooled to 0°C. 2N-sodium hydroxide aqueous solution was added to [the mixture]2 until the pH of [the mixture]3 became 9. [The mixture]4 was extracted with [ethyl acetate] for 3 times. [The combined organic layer] was washed with water and saturated aqueous sodium chloride.

ID    Relation type          Anaphor                         Antecedent
AR1   Coreference            [The mixture]4                  [the mixture]3
AR2   Transformed            [the mixture]2                  [The mixture]1
AR3   Reaction_associated    [The mixture]1                  [water (4.9 ml)]
AR4   Work-up                [The combined organic layer]    [ethyl acetate]
AR5   Contained              [a flask]                       [the solution]

Figure 3: Text snippet containing a chemical reaction, with its anaphoric relationships. The expressions that are involved are highlighted in bold. In cases where several expressions have an identical text form, subscripts are added according to their order of appearance.

5. Evaluation Framework
5.1. Evaluation Methods
We use BRATEval5 to evaluate all the runs that we receive. Three metrics are used to evaluate the performance of all submissions: Precision, Recall, and F1 score. We use two different matching criteria, exact matching and relaxed matching (approximate matching), as in some practical applications it is also useful to know whether a model can identify the approximate region of mentions.

5 https://bitbucket.org/nicta_biomed/brateval/src/master/

Formally, let E = (ET, A, B) denote an entity, where ET is the type of E, and A and B are the beginning position (inclusive) and end position (exclusive) of the text span of E. Two entities E1 and E2 are exactly matched (E1 = E2) if ET1 = ET2, A1 = A2, and B1 = B2, while two entities E1 and E2 are approximately matched (E1 ≈ E2) if ET1 = ET2, A2 < B1, and A1 < B2, i.e. the two spans [A1, B1) and [A2, B2) overlap. Furthermore, let R = (RT, E^ana, E^ant) be a relation, where RT is the type of R, E^ana the anaphor of R, and E^ant the antecedent of R. Then R1 and R2 are exactly matched (R1 = R2) if RT1 = RT2, E1^ana = E2^ana, and E1^ant = E2^ant, while R1 and R2 are approximately matched (R1 ≈ R2) if RT1 = RT2, E1^ana ≈ E2^ana, and E1^ant ≈ E2^ant. In summary, we require a strict type match in both the exact and relaxed settings, but are lenient about span matching in the relaxed setting.

5.1.1. Exact Matching
With the above definitions, the metrics for exact matching can be easily calculated. The true positives (TP) are the exactly matched pairs found between gold relations and predicted relations. The false positives (FP) are the predicted relations that do not have a match, i.e. FP = #pred − TP, where #pred is the number of predicted relations. Similarly, the false negatives (FN) are the gold relations that are not matched by any predicted relation, i.e. FN = #gold − TP, where #gold is the number of gold relations. Finally, Precision P = TP/(TP + FP), Recall R = TP/(TP + FN), and F1 = 2/(1/P + 1/R).
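The matching criteria above translate directly into code. The sketch below mirrors those definitions; it is our own illustration rather than BRATEval's implementation, and the type and function names are ours.

```python
# A direct transcription of the matching criteria defined above; this is an
# illustrative sketch, not BRATEval's actual implementation.
from typing import NamedTuple

class Entity(NamedTuple):
    etype: str   # ET
    start: int   # A, inclusive
    end: int     # B, exclusive

class Relation(NamedTuple):
    rtype: str
    anaphor: Entity
    antecedent: Entity

def entities_match(e1: Entity, e2: Entity, exact: bool = True) -> bool:
    if e1.etype != e2.etype:          # the type must match in both settings
        return False
    if exact:
        return e1.start == e2.start and e1.end == e2.end
    # approximate match: the half-open spans [A1, B1) and [A2, B2) overlap
    return e2.start < e1.end and e1.start < e2.end

def relations_match(r1: Relation, r2: Relation, exact: bool = True) -> bool:
    return (r1.rtype == r2.rtype
            and entities_match(r1.anaphor, r2.anaphor, exact)
            and entities_match(r1.antecedent, r2.antecedent, exact))

def precision_recall_f1(tp: int, n_pred: int, n_gold: int):
    """P, R and F1 from a true-positive count, as in Section 5.1.1."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```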
5.1.2. Relaxed Matching
Unlike exact matching, relaxed matching is not uniquely defined: the metrics in this setting can be calculated in more than one way, so we need to define them precisely. Consider the example shown in Figure 4a, where the nodes P0, ..., P4 are predicted relations, the nodes G0, ..., G4 are gold relations, and an edge between a P node and a G node means that the two are approximately matched. At first glance, one may think that FN = FP = 0, because every gold relation has at least one match and so does every predicted relation. However, it is impossible to find 5 true positive pairs in this graph without using one node more than once. Therefore, if FN = FP = 0, then FN + TP ≠ #gold = 5 and FP + TP ≠ #pred = 5, which is inconsistent with the formulas in the exact setting. So, instead of defining FN as the number of gold relations that do not have a match, we simply define FN = #gold − TP, and similarly FP = #pred − TP. The remaining problem is how to calculate TP. Finding true positive pairs can be cast as a bipartite matching problem: Figure 4b shows a matching with TP = 3 that is not optimal, while Figure 4c shows one possible maximum bipartite matching with TP = 4 (another optimal matching replaces the edge P0−G0 with P0−G1). In summary, we define TP as the size of a maximum bipartite matching of the graph constructed from all approximately matched pairs; then FN = #gold − TP and FP = #pred − TP, and finally Precision P = TP/(TP + FP), Recall R = TP/(TP + FN), and F1 = 2/(1/P + 1/R). This has been implemented in the latest BRATEval.

Figure 4: An example matching graph and two bipartite matchings for it: (a) all pairs matched in the relaxed setting; (b) a non-optimal bipartite matching (TP = 3); (c) a maximum bipartite matching (TP = 4).
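A compact way to realise this definition is to build the 0/1 "approximately matched" matrix and solve a maximum-weight assignment over it, as in the sketch below. This is our own illustration assuming SciPy is available, not BRATEval's code; it reuses the relations_match predicate sketched in Section 5.1.

```python
# Relaxed-matching TP as a maximum bipartite matching, computed here via a
# maximum-weight assignment on a 0/1 match matrix (illustrative sketch only;
# requires SciPy >= 1.4 for the maximize flag).
import numpy as np
from scipy.optimize import linear_sum_assignment

def relaxed_tp(pred_relations, gold_relations, approx_match) -> int:
    """Size of a maximum bipartite matching between predicted and gold
    relations, with an edge wherever approx_match(p, g) is True."""
    if not pred_relations or not gold_relations:
        return 0
    weights = np.array([[1 if approx_match(p, g) else 0 for g in gold_relations]
                        for p in pred_relations])
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return int(weights[rows, cols].sum())

# Usage with the predicate from the earlier sketch:
#   tp = relaxed_tp(pred, gold, lambda p, g: relations_match(p, g, exact=False))
#   fp = len(pred) - tp
#   fn = len(gold) - tp
```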
5.2. Coreference Linkings
We consider two types of coreference linking, i.e. (1) surface coreference linking and (2) atomic coreference linking, due to the existence of transitive coreference relationships. By transitive coreference relationships we mean multi-hop coreference, such as a link from an expression T1 to T3 via an intermediate expression T2, viz. “T1→T2→T3”. Surface coreference linking restricts attention to one-hop relationships, viz. to “T1→T2” and “T2→T3”, whereas atomic coreference linking tackles coreference between an anaphoric expression and its first antecedent, i.e. intermediate antecedents are collapsed. Thus, for the above example, the two links “T1→T3” and “T2→T3” will be used. Note that we only consider transitive linking for coreference relationships.

Observe that {T1→T2, T2→T3} implies {T1→T3, T2→T3}, but the reverse is not true. This leads to the problem of how to score a prediction {T1→T3, T2→T3} when the gold relations are {T1→T2, T2→T3}: both T1→T3 and T2→T3 are true, but some information is missing. Our solution is to first expand both the prediction set and the gold set, so that all valid relations that can be inferred are generated and added to each set, and then to evaluate the two sets normally. In the above example, the gold set will be expanded to {T1→T2, T2→T3, T1→T3}, and the result is then TP = 2, FN = 1. Likewise, when evaluating {T1→T4, T2→T4, T3→T4} against {T1→T2, T2→T3, T3→T4}, the gold set will be expanded into 6 relations, while the prediction set will not be expanded, as no new relation can be inferred; the evaluation result will therefore be TP = 3, FN = 3. One may worry that if there is a chain of length n, its expanded set will be of size O(n²), and when n is large this local evaluation result will have too much influence on the overall result. However, we find in practice that coreference chains are relatively short, with 3 or 4 being the most typical lengths, so this is unlikely to be a big issue.

5.3. Baselines
Our baseline model adopts an end-to-end architecture for coreference resolution [25, 26], as depicted in Figure 5. Following the methods presented in [23], we use GloVe embeddings and a character-level CNN as input to a BiLSTM to obtain contextualized word representations. Then all possible spans are enumerated and fed to a mention classifier which detects whether the input is a mention. Based on the same mention representations, pairs of mentions are fed to a coreference classifier and a bridging classifier, where the coreference classifier performs binary classification and the bridging classifier classifies pairs into the 4 bridging relation types and a special class for no relation. Training is done jointly, with all losses added together.

Figure 5: The architecture of our baseline model. The figure is taken from [23].

We released the code for training our baseline models to help the participants get started on the shared task.6 Two variants of the baseline model are evaluated on the test set, one using ELMO embeddings as input to the BiLSTM component, and the other using pretrained ChELMO, the embeddings of [27] pre-trained on chemical patents, in the hope of benefiting from domain-specific pretraining.

Table 2
Overall performance for all runs on the test set. Here P, R, and F are short for Precision, Recall, and F1 score. For each metric, the best result is highlighted in bold.

                       Exact-Match                Relaxed-Match
                    P       R       F          P       R       F
CMU              0.8177  0.7542  0.7847     0.909   0.8384  0.8723
Baseline-ChELMO  0.8566  0.6882  0.7633     0.9024  0.725   0.8041
Baseline-ELMO    0.8435  0.6676  0.7453     0.8875  0.7025  0.7842
HUKB             0.7132  0.6696  0.6907     0.7702  0.7231  0.7459

6. Results and Discussions
A total of 19 teams registered on our submission website for the shared task. Among them, we ultimately received 2 submissions on the test set. One team is from Carnegie Mellon University, USA (CMU), and the other is from Hokkaido University, Japan (HUKB). More details about their systems are provided in Section 7. In this section, we report their results along with the performance of our two baseline systems.

We report the overall performance of all runs in Table 2. The rankings of the different systems are fully consistent across all metrics. The CMU team achieves an F1 score of 0.7847 in exact matching, outperforming our two baselines, which obtain 0.7633 and 0.7453, followed by the HUKB team, which obtains 0.6907. The lead of the CMU team is even larger in relaxed matching, with an F1 score of 0.8723, about 7 points higher than our baselines. This shows the potential of the CMU model and indicates that its performance in exact matching may be further boosted if the boundary errors of the model could be corrected in a post-processing step. Our baselines have higher precision in the exact setting, and their precision in the relaxed setting is also very close to the best, which indicates that our models are more conservative and could possibly be enhanced by making more aggressive predictions to improve recall. The use of domain-pretrained embeddings (ChELMO vs. ELMO) does, as expected, benefit performance.

Table 3 provides more details about the performance of all models for each relation type. The CMU team outperforms the others on the TRANSFORMED relation by a large margin, while our baselines perform best on the CONTAINED relation type.
For the other three relation types, the CMU model achieves the best F1 score and recall, while our models achieve the highest precision, which is similar to our observation on the overall results. Given that the models perform very differently, it would be very interesting to carry out more analysis once the details of all the models are disclosed, and hopefully every team can borrow ideas from the others and further improve performance.

6 Code available at https://github.com/biaoyanf/ChEMU-Ref

Table 3
Performance per relation type for all runs on the test set. Here P, R, and F are short for Precision, Recall, and F1 score. For each metric, the best result is highlighted in bold.

                                           Exact-Match                Relaxed-Match
                                        P       R       F          P       R       F
COREFERENCE          CMU             0.7568  0.5822  0.6581     0.8945  0.6881  0.7779
                     Baseline-ChELMO 0.8476  0.4661  0.6015     0.9244  0.5084  0.656
                     Baseline-ELMO   0.8497  0.4474  0.5861     0.9185  0.4836  0.6336
                     HUKB            0.6956  0.5319  0.6028     0.7868  0.6016  0.6819
CONTAINED            CMU             0.7727  0.6892  0.7286     0.8561  0.7635  0.8071
                     Baseline-ChELMO 0.9211  0.7095  0.8015     0.9386  0.723   0.8168
                     Baseline-ELMO   0.9175  0.6014  0.7265     0.9794  0.6419  0.7755
                     HUKB            0.7214  0.6824  0.7014     0.7929  0.75    0.7708
REACTION_ASSOCIATED  CMU             0.8037  0.7631  0.7829     0.9019  0.8562  0.8785
                     Baseline-ChELMO 0.8381  0.7357  0.7836     0.8673  0.7614  0.8109
                     Baseline-ELMO   0.8145  0.7229  0.766      0.8498  0.7542  0.7991
                     HUKB            0.668   0.6803  0.6741     0.7224  0.7357  0.729
TRANSFORMED          CMU             0.9423  0.8855  0.913      0.9423  0.8855  0.913
                     Baseline-ChELMO 0.7935  0.8795  0.8343     0.7935  0.8795  0.8343
                     Baseline-ELMO   0.7877  0.8494  0.8174     0.7877  0.8494  0.8174
                     HUKB            0.6611  0.7169  0.6879     0.6611  0.7169  0.6879
WORK_UP              CMU             0.846   0.8447  0.8454     0.9195  0.9181  0.9188
                     Baseline-ChELMO 0.8705  0.7803  0.8229     0.9181  0.823   0.868
                     Baseline-ELMO   0.8566  0.7605  0.8057     0.899   0.7981  0.8456
                     HUKB            0.7467  0.7403  0.7435     0.7929  0.7861  0.7895

7. Overview of Participants’ Approaches
We received paper submissions from both participating teams, i.e. the HUKB team and the CMU team. We first describe their approaches, then summarize the aspects in which they are similar and in which they differ.

7.1. HUKB
The HUKB team used a two-step approach for the anaphora resolution task, where mentions, including both antecedents and anaphors, are first detected, and then mentions are classified into different types and the relations between them are determined. For step one, they found that although existing parsers such as ChemicalTagger can generate useful features, these are not sufficient for mention detection. They therefore trained a BioBERT model to find candidate mentions. This is done by treating mention detection as an NER task, where the BIOHD format is used to convert gold spans into a sequence of labels. The BIOHD format is an extension of the well-known BIO format that supports the discontinuous spans which exist in this anaphora resolution task. In the second step, different types of relations are determined based on different rules. COREFERENCE is first detected by 5 regular expression rules. The remaining relations are detected based on the features generated by ChemicalTagger. When no more relations can be found, a post-processing step is carried out to handle the transitivity property of COREFERENCE relations, i.e. enumerating all valid COREFERENCE relations based on the transitivity property.
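Both the evaluation protocol of Section 5.2 and HUKB's post-processing rely on expanding coreference links along chains. The sketch below shows one way such a transitive expansion could be implemented; it is our own illustration, not code from BRATEval or either participating team.

```python
# A sketch of transitive expansion of coreference links (anaphor -> antecedent),
# in the spirit of Section 5.2 and of HUKB's post-processing step.
from collections import defaultdict

def expand_coreference(links):
    """Given directed coreference links as (anaphor, antecedent) pairs,
    add every link that can be inferred by following chains, e.g.
    {T1->T2, T2->T3} also yields T1->T3."""
    antecedents = defaultdict(set)
    for ana, ant in links:
        antecedents[ana].add(ant)

    def reachable(mention, seen):
        out = set()
        # .get avoids inserting new keys into the defaultdict while iterating
        for ant in antecedents.get(mention, ()):
            if ant not in seen:
                seen.add(ant)
                out.add(ant)
                out |= reachable(ant, seen)
        return out

    expanded = set()
    for ana in list(antecedents):
        for ant in reachable(ana, set()):
            expanded.add((ana, ant))
    return expanded

# Example from Section 5.2: the gold set {T1->T2, T2->T3} expands to
# {T1->T2, T1->T3, T2->T3}.
print(sorted(expand_coreference({("T1", "T2"), ("T2", "T3")})))
```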
7.2. CMU
The CMU team proposed a pipelined system for anaphora resolution, where mentions are first extracted and then the relations between them are determined. The first step is done by a BERT-CRF model which is trained using BIO tagging. In the second step, for each pair of mentions, the sequence of sentences that contains the pair is fed to a BERT model to obtain an encoded representation of every token; the representation of a mention is then simply the mean of the tokens in it, and finally the representations of the two mentions are concatenated and classified into 6 classes using a linear layer (5 relations plus a class for no relation). To correct boundary errors in mention detection, rule-based post-processing is applied between steps 1 and 2. Furthermore, an ensemble of 5 models is used in both steps to improve performance.

7.3. Summary
Both teams adopted a two-step approach in which mentions are first detected and the relations between them are then determined. They also both relied on BERT-like models to extract contextualized representations for mention detection. While the CMU team used a BERT-like model for relation extraction, the HUKB team chose a rule-based method. In addition, the CMU team used an ensemble of 5 models with majority voting in both mention detection and relation extraction, as well as a post-processing step between them to correct potential boundary errors in mention detection; both techniques contribute to their superior overall performance.

8. Conclusions
This paper presents a general overview of the activities and outcomes of the ChEMU 2021 evaluation lab. As the second instance of our ChEMU lab series, ChEMU 2021 targets two new tasks focusing on reference resolution in chemical patents. Our first task aims at identification of reference relationships between chemical reaction descriptions, and our second task aims at identification of reference relationships between expressions in chemical reactions. The evaluation results reflect different approaches to tackling the shared task, with one submission clearly outperforming our baseline methods. We look forward to fruitful discussion and a deeper understanding of the methodological details of these submissions at the workshop.

Acknowledgements
Funding for the ChEMU project is provided by an Australian Research Council Linkage Project, project number LP160101469, and Elsevier. We acknowledge the support of our ChEMU-Ref annotators, Dr. Sacha Novakovic and Colleen Hui Shiuan Yeow at the University of Melbourne, and the annotation teams supporting the reaction reference task annotation.

References
[1] S. A. Akhondi, H. Rey, M. Schwörer, M. Maier, J. Toomey, H. Nau, G. Ilchmann, M. Sheehan, M. Irmer, C. Bobach, et al., Automatic identification of relevant chemical compounds from patents, Database 2019 (2019). [2] M. Bregonje, Patents: A unique source for scientific technical information in chemistry related industry?, World Patent Information 27 (2005) 309–315. [3] S. Senger, L. Bartek, G. Papadatos, A. Gaulton, Managing expectations: Assessment of chemistry databases generated by automated extraction of chemical structures from patents, Journal of Cheminformatics 7 (2015) 1–12. [4] M. Hu, D. Cinciruk, J. M. Walsh, Improving automated patent claim parsing: Dataset, system, and experiments, arXiv preprint arXiv:1605.01744 (2016). [5] S. Muresan, P. Petrov, C. Southan, M. J. Kjellberg, T. Kogej, C. Tyrchan, P. Varkonyi, P. H. Xie, Making every SAR point count: The development of Chemistry Connect for the large-scale integration of structure and bioactivity data, Drug Discovery Today 16 (2011) 1019–1030. [6] J. He, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, Z. Afzal, Z. Zhai, B. Fang, H. Yoshikawa, A. Albahem, L.
Cavedon, T. Cohn, T. Baldwin, K. Verspoor, Chemu 2020: Natural language processing methods are effective for information extraction from chemical patents, Frontiers Res. Metrics Anal. 6 (2021) 654438. URL: https://doi.org/ 10.3389/frma.2021.654438. doi:10.3389/frma.2021.654438. [7] M. Krallinger, F. Leitner, O. Rabal, M. Vazquez, J. Oyarzabal, A. Valencia, CHEMDNER: The drugs and chemical names extraction challenge, Journal of Cheminformatics 7 (2015) S1. [8] J. He, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, Z. Afzal, Z. Zhai, B. Fang, H. Yoshikawa, A. Albahem, L. Cavedon, T. Cohn, T. Baldwin, K. Verspoor, Overview of ChEMU 2020: Named entity recognition and event extraction of chemical reactions from patents, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), volume 12260, Lecture Notes in Computer Science, 2020. [9] D. Q. Nguyen, Z. Zhai, H. Yoshikawa, B. Fang, C. Druckenbrodt, C. Thorne, R. Hoessel, S. A. Akhondi, T. Cohn, T. Baldwin, et al., ChEMU: Named entity recognition and event extraction of chemical reactions from patents, in: European Conference on Information Retrieval, Springer, 2020, pp. 572–579. [10] W. A. Baumgartner Jr, M. Bada, S. Pyysalo, M. R. Ciosici, N. Hailu, H. Pielke-Lombardo, M. Regan, L. Hunter, CRAFT shared tasks 2019 overview—integrated structure, semantics, and coreference, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019, pp. 174–184. [11] N. Nguyen, J.-D. Kim, J. Tsujii, Overview of BioNLP 2011 protein coreference shared task, in: Proceedings of BioNLP Shared Task 2011 Workshop, 2011, pp. 74–82. [12] T. Ohta, Y. Tateisi, J.-D. Kim, H. Mima, J. Tsujii, The GENIA corpus: An annotated research abstract corpus in molecular biology domain, in: Proceedings of the Second International Conference on Human Language Technology Research, 2002, pp. 82–86. [13] K. B. Cohen, A. Lanfranchi, M. J. Choi, M. Bada, W. A. B. Jr., N. Panteleyeva, K. Ver- spoor, M. Palmer, L. E. Hunter, Coreference annotation and resolution in the colorado richly annotated full text (CRAFT) corpus of biomedical journal articles, BMC Bioinform. 18 (2017) 372:1–372:14. URL: https://doi.org/10.1186/s12859-017-1775-9. doi:10.1186/ s12859-017-1775-9. [14] M. Bada, M. Eckert, D. Evans, K. Garcia, K. Shipley, D. Sitnikov, J. Baumgartner, W. A., K. B. Cohen, K. Verspoor, J. A. Blake, L. E. Hunter, Concept annotation in the CRAFT corpus, BMC Bioinformatics 13 (2012) 161. URL: https://www.ncbi.nlm.nih.gov/pubmed/22776079. doi:10.1186/1471-2105-13-161. [15] M. Lupu, K. Mayer, N. Kando, A. J. Trippe, Current challenges in patent information retrieval, volume 37, Springer, 2017. [16] B. Fang, C. Druckenbrodt, C. Yeow Hui Shiuan, S. Novakovic, R. Hössel, S. A. Akhondi, J. He, M. Mistica, T. Baldwin, K. Verspoor, Chemu-ref dataset for modeling anaphora resolution in the chemical domain, 2021. doi:10.17632/r28xxr6p92. [17] K. Verspoor, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, J. He, Z. Zhai, ChEMU dataset for information extraction from chemical patents, 2020. doi:10. 17632/wy6745bjfj. [18] J. He, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, Z. Afzal, Z. Zhai, B. Fang, H. Yoshikawa, A. Albahem, L. Cavedon, T. Cohn, T. Baldwin, K. Verspoor, Overview of chemu 2020: Named entity recognition and event extraction of chemical reactions from patents, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. 
Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings, volume 12260 of Lecture Notes in Computer Science, Springer, 2020, pp. 237–254. URL: https: //doi.org/10.1007/978-3-030-58219-7_18. doi:10.1007/978-3-030-58219-7\_18. [19] S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, Y. Zhang, Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes, in: S. Pradhan, A. Moschitti, N. Xue (Eds.), Joint Conference on Empirical Methods in Natural Language Processing and Com- putational Natural Language Learning - Proceedings of the Shared Task: Modeling Multi- lingual Unrestricted Coreference in OntoNotes, EMNLP-CoNLL 2012, July 13, 2012, Jeju Island, Korea, ACL, 2012, pp. 1–40. URL: https://www.aclweb.org/anthology/W12-4501/. [20] A. Ghaddar, P. Langlais, Wikicoref: An english coreference-annotated corpus of wikipedia articles, in: N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016, European Language Resources Association (ELRA), 2016. URL: http://www.lrec-conf.org/proceedings/lrec2016/summaries/192.html. [21] V. Ng, Machine learning for entity coreference resolution: A retrospective look at two decades of research, in: S. P. Singh, S. Markovitch (Eds.), Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, AAAI Press, 2017, pp. 4877–4884. URL: http://aaai.org/ocs/index.php/AAAI/AAAI17/ paper/view/14995. [22] K. Clark, C. D. Manning, Entity-centric coreference resolution with model stacking, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, The Association for Computer Linguistics, 2015, pp. 1405–1415. URL: https://doi.org/10.3115/v1/p15-1136. doi:10.3115/v1/p15-1136. [23] B. Fang, C. Druckenbrodt, S. A. Akhondi, J. He, T. Baldwin, K. Verspoor, ChEMU-Ref: A corpus for modeling anaphora resolution in the chemical domain, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2021. [24] K. Krippendorff, Measuring the reliability of qualitative text analysis data, Quality and quantity 38 (2004) 787–800. [25] K. Lee, L. He, M. Lewis, L. Zettlemoyer, End-to-end neural coreference resolution, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, Association for Computational Linguistics, 2017, pp. 188–197. URL: https: //doi.org/10.18653/v1/d17-1018. doi:10.18653/v1/d17-1018. [26] K. Lee, L. He, L. Zettlemoyer, Higher-order coreference resolution with coarse-to-fine inference, in: M. A. Walker, H. Ji, A. 
Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 687–692. URL: https://doi.org/10.18653/v1/n18-2108. doi:10.18653/v1/n18-2108. [27] Z. Zhai, D. Q. Nguyen, S. Akhondi, C. Thorne, C. Druckenbrodt, T. Cohn, M. Gregory, K. Verspoor, Improving chemical named entity recognition in patents with contextualized word embeddings, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 328–338. URL: https: //www.aclweb.org/anthology/W19-5035. doi:10.18653/v1/W19-5035.