ChemProp: A Dataset with Annotations for Instructional Language in Chemical Patents

Sopan Khosla¹,*, Carolyn Rose²
¹ AWS AI Labs
² Carnegie Mellon University

* Work done when the author was a student at CMU.
sopankh@amazon.com (S. Khosla); cprose@cs.cmu.edu (C. Rose); https://sopankhosla.github.io/ (S. Khosla)

The Third AAAI Workshop on Scientific Document Understanding, February 14, 2023, Washington, DC. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract
In this paper, we propose a new set of annotations for the ChEMU Chemical Reaction Corpus. Our annotations (ChemProp) non-trivially combine the signals from the ChEMU 2020 and 2021 schemas to extract the instructional structure of chemical patents, with details about the inputs, outputs, and reaction attributes of each event in a reaction snippet. We propose a semi-automatic algorithm to create ChemProp and benchmark state-of-the-art models proposed for ChEMU 2020 and ChEMU 2021 on it. We hope that ChemProp can play an important part in modeling the instructional language present in chemical patents.

Keywords
Chemical Patents, Information Extraction, Program Synthesis, Coreference Resolution, Relation Extraction

1. Introduction

Chemical research relies heavily on knowledge of chemical processes and synthesis, which are often described in chemical patents or research literature, with patents also serving as a critical source of information about new compounds [1]. Despite the significant value of the information present in these documents, its extraction and organization still rely heavily on costly manual processes [2]. The high influx of such documents in chemistry has introduced the need for automatic systems that can extract the structured knowledge present in these texts [3, 4].

The CLEF ChEMU shared-task series released the ChEMU Chemical Reaction Corpus, which contains reaction snippets extracted from chemical patents. For ChEMU 2020 [3], the authors annotate relationships between reaction events (steps) and the named entities involved in each step. ChEMU 2021 [4], on the other hand, focuses specifically on extracting chemical relations between pairs of entity mentions. The framework introduces five domain-specific relations (including bridging and coreference) that link different noun phrases present in the discourse. Finally, ChEMU 2022 [5] reused the expression-level tasks from 2020 and 2021 and also introduced other document-level information extraction tasks.

None of these shared tasks, however, fully captures the entire instructional structure (e.g., a chronological sequence of inputs, reaction steps, conditions, and outputs) of the underlying chemical patent. For example, even though the ChEMU 2020 schema tries to relate reaction events with associated compounds or conditions, it operates only on named entities, and therefore does not cover important lexical items (noun phrases) that describe relevant reaction conditions and participants using co-referring generic expressions, for example, the mixture, the organic layer, or the filtrate.

In this work, we propose an algorithm that augments CLEF ChEMU 2020 annotations with ChEMU 2021 annotations to create a more complete annotation framework for converting natural-language chemical patents into structured recipes. Instructional language is a useful structure that comprises step-by-step instructions that need to be performed to complete a task. However, most of the prior art in the instructional-language paradigm focuses on cooking recipes. We propose a new dataset, ChemProp¹, that merges the ChEMU 2020 and 2021 annotations to create labels for the instructional language present in chemical patents. For each reaction snippet, we annotate the constituting events (reaction/work-up steps), their relative chronological order, and the entities associated with each of these events.

¹ https://github.com/sopankhosla/chemprop
More specifically, for each reaction step in a snippet, we annotate the trigger event verb and the noun phrases (entities) that depict the (i) INPUT, (ii) OUTPUT, and (iii) reaction attributes (RXN_ATTR) of that reaction step. We use the raw reaction snippets from the ChEMU Chemical Reaction Corpus as our data and annotate them by (i) automatically combining the annotations of the CLEF ChEMU 2020 and 2021 shared tasks, and (ii) manually incorporating events/entities that are missed by the two annotation schemes.

Furthermore, we show the significance of these augmentations by evaluating the best-performing models from the ChEMU 2020 and 2021 shared tasks on ChemProp. Our experiments show that models trained on the ChemProp training data achieve only 0.69 Micro-F1 on the test data, highlighting the room for improvement. We also show that ChemProp contains novel entities and relationships that are not present in the ChEMU shared tasks, making it beneficial as a standalone benchmark for instructional-language modeling from chemical patents.

2. Prior Art

In this section, we briefly describe the ChEMU Chemical Reaction Corpus and the two state-of-the-art annotation schemes proposed during the ChEMU shared tasks '20 & '21.

CLEF ChEMU 2020 Annotation Schema (CC20). He et al. [3] annotated a corpus of 1,500 patent snippets sampled from 170 patents from the European Patent Office and the United States Patent and Trademark Office. Their annotation schema aims at the extraction of chemical reactions (i.e., REACTION_STEP, WORKUP) from patent snippets. It identifies trigger words that describe reaction steps and relates them to the named entities linked to each step (i.e., chemical compounds, time, temperature, and yields; Figure 1). Despite being a comprehensive annotation schema, CC20 suffers from two major drawbacks:

1. CC20 does not annotate reaction steps that do not relate to any named entity in the discourse snippet. E.g., as shown in Figure 1, CC20 does not annotate the event concentrated (in line 7).
2. Furthermore, CC20 does not capture relationships between reaction steps (events) and noun-phrase mentions that denote combinations/mixtures (e.g., the reaction mixture) or coreferent expressions (e.g., the product).

Figure 1: File 0050 with CC20 annotations.

CLEF ChEMU 2021 Annotation Schema (CC21). The next year, He et al. [4] proposed an additional layer of annotation on the patent corpus, which focuses on the identification of anaphoric references. The new corpus contains annotations for both COREFERENCE and bridging relations (Figure 2). The authors define four domain-specific sub-types of bridging: TRANSFORMED, REACTION_ASSOCIATED, WORK_UP, and CONTAINED. As a standalone schema, CC21 suffers from the following issues:

1. CC21 does not contain explicit information about reaction steps. Therefore, it is less useful, in isolation, for information extraction from chemical patents.
2. Furthermore, CC21 differs from CC20 in its definition of mentions, which makes the combination of the two annotations non-trivial. In the next section, we describe the algorithm that handles these ambiguities.
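The two annotation layers can be pictured with a short sketch. The Python dataclasses below are our own illustration of the information each schema carries; the class and field names are assumptions for exposition, not the official ChEMU release format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Mention:
    """A text span in a reaction snippet. CC20 restricts mentions to
    named entities; CC21 also covers generic noun phrases such as
    'the filtrate' or 'the reaction mixture'."""
    start: int  # character offset in the snippet
    end: int
    text: str

@dataclass
class CC20Event:
    """A CC20 annotation: a trigger word for a reaction/work-up step,
    linked to the named entities involved in that step."""
    trigger: Mention                                   # e.g. 'stirred'
    step_type: str                                     # 'REACTION_STEP' | 'WORKUP'
    arguments: List[Tuple[str, Mention]] = field(default_factory=list)
    # e.g. [('TEMPERATURE', <r.t.>), ('TIME', <2 h>)]

@dataclass
class CC21Relation:
    """A CC21 annotation: an anaphoric link between two noun phrases."""
    rel_type: str  # 'COREFERENCE', 'TRANSFORMED', 'REACTION_ASSOCIATED',
                   # 'WORK_UP', or 'CONTAINED'
    arg1: Mention  # the anaphor, e.g. 'The reaction mixture'
    arg2: Mention  # its antecedent
```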
3. ChemProp: Annotation

CC20 contains relationships between events and named entities, whereas CC21 connects noun phrases (including named entities) based on their anaphoric relationships. Together, CC20 and CC21 provide somewhat complementary information about each reaction snippet in the ChEMU corpus. In this section, we describe the steps we take to merge these somewhat heterogeneous annotation schemas to create our new dataset, ChemProp.

Figure 2: File 0050 with CC21 annotations.

Figure 3: File 0050 with ChemProp annotations.

3.1. Automatic Merging of ChEMU 2020 and 2021

First, we present our algorithm, which automatically merges the signals from CC20 and CC21 and extracts the lexical spans in the discourse that most closely represent the inputs, outputs, and reaction attributes of each reaction step.

3.1.1. Pre-processing

As a pre-processing step, we set up data structures that help with the conversion algorithm (a code sketch follows this list). We create:

1. A many-to-one map from CC20 named entities to CC21 named-entity annotations, to handle the small annotation differences between the two schemas. The map is many-to-one because CC21 annotates entire solutions, whereas CC20 keeps the individual compounds/elements, e.g., [a solution of ethanol_CC20 and water_CC20]_CC21. We also store the inverse one-to-many map from CC21 to CC20. So, for the given example, we store
   ethanol_CC20 ↔ [a solution of ethanol and water]_CC21
   water_CC20 ↔ [a solution of ethanol and water]_CC21.

2. The mapping from the ARG1 of CC21 reaction-oriented relations (i.e., REACTION_ASSOCIATED (R_ASSOC), WORK_UP, and TRANSFORMED (TRANS)) to their corresponding ARG2 mentions. For example, for the relation TRANSFORMED between The reaction mixture and The reaction in Figure 2 (lines 4, 5), we store
   The reaction mixture_CC21 --Rxn:TRANS--> The reaction_CC21.

3. The mapping from CC20 events (ARG1) to their corresponding argument mentions (ARG2). For example, for the event stirred in Figure 1 (line 4), we store
   stirred_CC20 <--Event:TEMP--> r.t._CC20
   stirred_CC20 <--Event:TIME--> 2 h_CC20.

4. Finally, a dictionary to store the COREFERENCE relationships between different CC21 mentions. We create a separate map for coreference because these relations denote that both mentions in the pair are equivalent, i.e., point to the same underlying entity in the discourse:
   ARG1_CC21 <--Coref--> ARG2_CC21.

The above mappings store the relationships annotated in CC20 and CC21 that are relevant to creating ChemProp.
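The sketch below shows how the four maps could be materialized from the illustrative Mention/CC20Event/CC21Relation records defined earlier. The span-containment test used to align the two mention vocabularies (map 1) is our simplification of the alignment between the schemas.

```python
from collections import defaultdict

REACTION_ORIENTED = {"REACTION_ASSOCIATED", "WORK_UP", "TRANSFORMED"}

def build_maps(cc20_events, cc21_relations):
    """Builds the four pre-processing maps (illustrative sketch)."""
    ne20_to_ne21 = {}                     # (1) CC20 named entity -> CC21 span
    ne21_to_ne20s = defaultdict(list)     # (1) inverse, one-to-many
    arg1_to_arg2s = defaultdict(list)     # (2) CC21 ARG1 -> [(rel, ARG2)]
    event_to_args = defaultdict(list)     # (3) CC20 trigger -> [(role, entity)]
    coref = defaultdict(set)              # (4) COREFERENCE partners

    for ev in cc20_events:                                   # map (3)
        event_to_args[ev.trigger].extend(ev.arguments)

    cc21_mentions = {m for r in cc21_relations for m in (r.arg1, r.arg2)}
    for rel in cc21_relations:
        if rel.rel_type in REACTION_ORIENTED:                # map (2)
            arg1_to_arg2s[rel.arg1].append((rel.rel_type, rel.arg2))
        elif rel.rel_type == "COREFERENCE":                  # map (4)
            coref[rel.arg1].add(rel.arg2)
            coref[rel.arg2].add(rel.arg1)

    # map (1): a CC20 entity aligns with the CC21 span that contains it,
    # e.g. ethanol -> 'a solution of ethanol and water'
    for ev in cc20_events:
        for _, ne in ev.arguments:
            for m in cc21_mentions:
                if m.start <= ne.start and ne.end <= m.end:
                    ne20_to_ne21[ne] = m
                    ne21_to_ne20s[m].append(ne)

    return ne20_to_ne21, ne21_to_ne20s, arg1_to_arg2s, event_to_args, coref
```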
3.1.2. Algorithm

We allow three relationships between reaction events and entities in our new ChemProp benchmark: INPUT, OUTPUT, and RXN_ATTR. CC20 already annotates RXN_ATTRs (reaction attributes) and named-entity INPUTs/OUTPUTs. Therefore, to complete the schema, we devise an algorithm that finds a mapping from each non-named-entity noun phrase in CC21 to a reaction event in CC20 (ARG_CC21 ↔ Rxn_CC20).

For each CC21 entity (ARG1_CC21) that is related to other CC21 entities (ARG2_CC21) occurring before it in the discourse via one of the three reaction-oriented CC21 relations (Rxn_CC21), i.e., REACTION_ASSOCIATED, WORK_UP, and TRANSFORMED, two possibilities need to be considered. For each ARG2^i_CC21 in ARG2_CC21:

1. If ARG2^i_CC21 is also present in the CC20 annotation, we use this ARG2^i_CC21 as a pivot to combine the two annotations. We extract the CC20 relation (Rxn^i_CC20) it is linked with.

2. If ARG2^i_CC21 is not present in CC20, it means that ARG2^i_CC21 is a non-named-entity noun phrase and therefore needs to be resolved further. To ground such cases, we rely on the fact that the starting compounds in each reaction snippet are named entities. Therefore, we can assume that, in order to reach the current ARG1_CC21, all other ARG2_CC21s that themselves appeared as an ARG1_CC21 (noun phrases) have been resolved in earlier iterations of this algorithm. Hence, the latest Rxn^i_CC20 in the patent snippet between the CC20 reaction event associated with ARG2^i_CC21 (i.e., ARG2^i_CC21 → Rxn_CC20) and ARG1_CC21 is returned.

From this list of relations, we consider the Rxn^j_CC20 that is closest to ARG1_CC21, but occurs before it, to be the lexical event trigger that outputs ARG1_CC21:

   Cand_Rxn_CC20 = [Rxn^1_CC20, Rxn^2_CC20, ..., Rxn^j_CC20, ...]
   Rxn^j_CC20 = closest_before(ARG1_CC21, Cand_Rxn_CC20)
   ARG1_CC21 <--Output:Rxn--> Rxn^j_CC20.

Consider the case where ARG1_CC21 is The combined organic phases (Figure 2; line 6). It is related to four mentions (ARG2_CC21s), {The reaction mixture, a pad of celite, water, ethyl acetate}, by the relation WORK_UP (Rxn_CC21). Three of these mentions, {a pad of celite, water, ethyl acetate}, are also present in the mapping created in pre-processing step 1, whereas {The reaction mixture} is not. As discussed earlier, for the ARG2^i_CC21s present in CC20, we first create a list of the Rxn_CC20s (Cand_Rxn_CC20) they correspond to: {filtered, diluted, extracted} (Figure 1). For The reaction mixture, the corresponding Rxn_CC20, based on its resolution in a previous iteration, is The reaction mixture ↔ stirred. The latest CC20 reaction event between stirred and the current ARG1_CC21, namely extracted, is then inserted into Cand_Rxn_CC20. Combining the four relations, we get {extracted, filtered, diluted, extracted}. Of these, extracted occurs closest to the current ARG1 while occurring before it. Therefore, The combined organic phases is considered to be the OUTPUT of extracted (The combined organic phases ↔ extracted; Figure 3).
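The merging loop itself can be sketched as follows. This is our reading of the procedure, not the authors' released code: arg2s_of, cc20_trigger_for, and triggers_between are hypothetical helpers standing in for lookups over the pre-processing maps, and processing ARG1 mentions left-to-right realizes the resolved-in-earlier-iterations assumption described above.

```python
def closest_before(arg1, triggers):
    """The latest CC20 event trigger that still precedes arg1 in the text."""
    before = [t for t in triggers if t.end <= arg1.start]
    return max(before, key=lambda t: t.end) if before else None

def resolve_outputs(arg1s_in_order, arg2s_of, cc20_trigger_for, triggers_between):
    """Illustrative sketch of the merging loop.

    arg1s_in_order:   CC21 ARG1 mentions, sorted by position in the snippet
    arg2s_of(a):      ARG2 mentions linked to a by a reaction-oriented relation
    cc20_trigger_for: maps a mention to its CC20 event trigger, or None
    triggers_between: CC20 triggers located between two mentions, in text order
    """
    output_of = {}                        # CC21 mention -> CC20 event trigger
    for arg1 in arg1s_in_order:
        candidates = []
        for arg2 in arg2s_of(arg1):
            trigger = cc20_trigger_for(arg2)
            if trigger is not None:       # case 1: named-entity pivot into CC20
                candidates.append(trigger)
            elif arg2 in output_of:       # case 2: noun phrase resolved earlier
                later = triggers_between(output_of[arg2], arg1)
                candidates.append(later[-1] if later else output_of[arg2])
        trigger = closest_before(arg1, candidates)
        if trigger is not None:           # ARG1 <--Output:Rxn--> trigger
            output_of[arg1] = trigger
    return output_of
```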
Exceptional Cases. Although most events can be fully annotated in the ChemProp format using the above steps, certain exceptions arise from the mismatch between the motivations of the CC20 and CC21 schemas.

1. (E1) In a small number of cases, we find that the reaction event (Rxn^j_CC20) that occurs right before an ARG1_CC21 might not be the event that outputs it.
   Example. Consider the phrase "the reaction mixture is filtered in Celite, and ethanol is added to the filtrate". In this case, the filtrate refers to a state before the addition of ethanol and is the output of the event filtered. We use regular expressions to find such template patterns and resolve them automatically. We tackle the more complex occurrences in the manual quality-assurance phase (described in the next section).

2. (E2) In some instances, we find that ARG2^i_CC21 and ARG1_CC21 are related to each other by a Rxn_CC21, but no corresponding Rxn_CC20 is present. In such cases, we introduce a pseudo relation between them and leave its annotation to the manual step.
   Example. Consider the phrase "the reaction mixture is filtered, and the filtrate is heated for 20 min". For this phrase, filtered would not be annotated in CC20 (it has no related named entity); however, CC21 would annotate the pair (the filtrate, the reaction mixture) as TRANSFORMED. In this case, we automatically introduce a pseudo relation whose input is the reaction mixture and whose output is the filtrate.

The patent snippets are an ordered sequence of event steps that transform a starting product into an end product. Therefore, one can, with sufficient confidence, also consider the outputs of a particular event to be the best lexical representation of the inputs of the immediately following event. This allows us to annotate both the INPUTs and the OUTPUTs of the CC20 reaction steps using a single algorithm (a sketch of this chaining follows at the end of this subsection). Finally, we add the RXN_ATTR (reaction-attribute) annotations present in CC20 on top to get our final dataset, ChemProp.

We note that although we only consider three types of relations, where each relation is between an EVENT mention and an ENTITY mention (Figure 3), the fine-grained classification of these two types of mentions provided in CC20 can easily be ported over to ChemProp to make the new annotation schema more informative.
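Given the OUTPUT mapping, the INPUT side then falls out by chaining consecutive steps. A minimal sketch, assuming the event triggers are kept in chronological (textual) order:

```python
def chain_inputs(triggers_in_order, output_of_trigger):
    """Derive each step's INPUT from the previous step's OUTPUT (illustrative).

    triggers_in_order:  event triggers in chronological (textual) order
    output_of_trigger:  trigger -> mention produced by that step
    Returns: trigger -> mention consumed by that step."""
    inputs = {}
    for prev, curr in zip(triggers_in_order, triggers_in_order[1:]):
        if prev in output_of_trigger:
            # INPUT(step_k) := OUTPUT(step_{k-1})
            inputs[curr] = output_of_trigger[prev]
    return inputs
```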
3.2. Manual Quality Assurance

Next, we manually go through the development and test data to fix the exceptions described in the previous section. As discussed earlier, the CC20 annotation does not cover reaction steps that do not relate to any named entity in the snippet. However, in our case such events are equally relevant and need to be extracted to get a complete picture of the reaction snippet. In order to annotate such events, we manually go through all of the development and test files and fix the annotations, thus arriving at a gold dev set and a gold test set. However, we do not perform this quality-assurance step on the training data due to its large size, and therefore only obtain a silver training set. We find, however, that cases requiring manual inspection occur rather infrequently and therefore do not deteriorate the quality of our training data much ("sterling silver"). Figure 3 shows an example patent snippet from the development set.

4. ChemProp: Baseline

In order to set up a baseline for ChemProp, we train the pipeline-based system from Dutt et al. [6] (referred to as CC21_best going forward) on the ChemProp training set. We refer the reader to that paper for more details about the system. We show the performance of the two setups described in Dutt et al. [6]: (i) relation classification on gold entities and (ii) end-to-end classification.

We also evaluate the CC20 ground truth [3] and the best-performing model at the ChEMU 2020 shared task [7] on ChemProp. As discussed earlier, the motivation of ChemProp is very similar to that of CLEF ChEMU 2020. However, in the absence of the support from CC21, the CC20 annotation does not capture all the relationships that make up the instructional language present in patent text. Hence, comparing CC20 against ChemProp allows us to quantify the additional information present in ChemProp.

4.1. Results

We use the BRAT evaluation script distributed by the CLEF ChEMU 2021 shared-task organizers to evaluate the different setups. We find that the CC20 ground-truth test data achieves an F1 score of 0.74 (Table 1). While its precision is near perfect, CC20 suffers from low recall on INPUT and OUTPUT relations, as the annotation excludes some reaction events and does not annotate generic noun phrases. Furthermore, CC20_best [7], a model designed for the CC20 shared task, gets an F1 score of 0.62, 12% below the CC20 ground truth. These low numbers suggest that ChemProp provides considerably more information about the chemical patent snippets, information that is systematically missed by systems trained on CC20.

We observe that CC21_best (gold entities) achieves an F1 score of 0.86 on the ChemProp test set. This suggests that the model is able to somewhat reliably figure out which named entities/noun phrases are related to which reaction event in the snippet. CC21_best (end-to-end), in addition to relation classification, also extracts mentions from raw patent snippets, and therefore expectedly performs much worse than CC21_best (gold entities), with an overall F1 score of 0.69.

  System                     | Micro F1
  ---------------------------|---------
  CC20 (ground-truth)        | 0.74
  CC20_best                  | 0.62
  Trained on ChemProp:       |
  CC21_best (gold-entities)  | 0.86
  CC21_best (end-to-end)     | 0.69

Table 1: Performance on the ChemProp test set.

5. Conclusion

In this work, we propose a new corpus, ChemProp, that non-trivially combines properties of the ChEMU 2020 and 2021 annotation schemas to extract the instructional structure of chemical patents. We provide a semi-automatic algorithm to create ChemProp. Evaluating state-of-the-art models on our new dataset suggests that there is still room for improvement in extracting relevant instructional triggers from patent text. We believe that ChemProp can act as an important benchmark for instructional-language modeling.

References

[1] S. Senger, L. Bartek, G. Papadatos, A. Gaulton, Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Journal of Cheminformatics 7 (2015) 1–12.
[2] S. Muresan, P. Petrov, C. Southan, M. J. Kjellberg, T. Kogej, C. Tyrchan, P. Varkonyi, P. H. Xie, Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data, Drug Discovery Today 16 (2011) 1019–1030.
[3] J. He, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, Z. Afzal, Z. Zhai, B. Fang, H. Yoshikawa, et al., Overview of ChEMU 2020: named entity recognition and event extraction of chemical reactions from patents, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020, pp. 237–254.
[4] J. He, B. Fang, H. Yoshikawa, Y. Li, S. A. Akhondi, C. Druckenbrodt, C. Thorne, Z. Afzal, Z. Zhai, L. Cavedon, et al., ChEMU 2021: reaction reference resolution and anaphora resolution in chemical patents, in: ECIR (2), 2021.
[5] Y. Li, B. Fang, J. He, H. Yoshikawa, S. A. Akhondi, C. Druckenbrodt, C. Thorne, Z. Afzal, Z. Zhai, T. Baldwin, et al., Overview of ChEMU 2022 evaluation campaign: information extraction in chemical patents, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022, pp. 521–540.
[6] R. Dutt, S. Khosla, C. P. Rosé, A pipelined approach to anaphora resolution in chemical patents, in: CLEF (Working Notes), 2021, pp. 710–719.
[7] J. Wang, Y. Ren, Z. Zhang, Y. Zhang, MELAXTECH: a report for CLEF 2020–ChEMU task of chemical reaction extraction from patent, 2020.