ChemProp: A Dataset with Annotations for Instructional Language in Chemical Patents

Sopan Khosla¹,*, Carolyn Rose²
¹ AWS AI Labs
² Carnegie Mellon University

* Work done when the author was a student at CMU.
sopankh@amazon.com (S. Khosla); cprose@cs.cmu.edu (C. Rose); https://sopankhosla.github.io/ (S. Khosla)

The Third AAAI Workshop on Scientific Document Understanding, February 14, 2023, Washington, DC. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract
In this paper, we propose a new set of annotations for the ChEMU Chemical Reaction Corpus. Our annotations (ChemProp) non-trivially combine the signals from the ChEMU 2020 and 2021 schemas to extract the instructional structure of chemical patents, with details about the inputs, outputs, and reaction attributes of each event in a reaction snippet. We propose a semi-automatic algorithm to create ChemProp and benchmark state-of-the-art models proposed for ChEMU 2020 and ChEMU 2021 on it. We hope that ChemProp can play an important part in modeling the instructional language present in chemical patents.

Keywords
Chemical Patents, Information Extraction, Program Synthesis, Coreference Resolution, Relation Extraction

1. Introduction

Chemical research relies heavily on knowledge of chemical processes and synthesis, which are often described in chemical patents or research literature, with patents also serving as a critical source of information about new compounds [1]. Despite the significant value of the information present in these documents, its extraction and organization still rely heavily on costly manual processes [2]. The high influx of such documents in chemistry has introduced the need for automatic systems that can extract the structured knowledge present in these texts [3, 4].

The CLEF ChEMU shared-task series released the ChEMU Chemical Reaction Corpus, which contains reaction snippets extracted from chemical patents. For ChEMU 2020 [3], the authors annotate relationships between reaction events (steps) and the named entities involved in each step. ChEMU 2021 [4], on the other hand, focuses specifically on extracting chemical relations between pairs of entity mentions. The framework introduces five domain-specific relations (including bridging and coreference) that link different noun phrases present in the discourse. Finally, ChEMU 2022 [5] reused the expression-level tasks from 2020 and 2021 and also introduced other document-level information extraction tasks.

None of these shared tasks, however, fully captures the entire instructional structure (e.g., a chronological sequence of inputs, reaction steps, conditions, and outputs) of the underlying chemical patent. For example, even though the ChEMU 2020 schema tries to relate reaction events with associated compounds or conditions, it operates only on named entities, and therefore does not cover important lexical items (noun phrases) that describe relevant reaction conditions and participants using co-referring generic expressions, for example, the mixture, the organic layer, or the filtrate.

In this work, we propose an algorithm that augments CLEF ChEMU 2020 annotations with ChEMU 2021 annotations to create a more complete annotation framework for converting natural-language chemical patents into structured recipes. Instructional language is a useful structure that comprises step-by-step instructions that need to be performed to complete a task. However, most of the prior art in the instructional-language paradigm focuses on cooking recipes. We propose a new dataset, ChemProp¹, that merges the ChEMU 2020 and 2021 annotations to create labels for the instructional language present in chemical patents. For each reaction snippet, we annotate the constituting events (reaction/work-up steps), their relative chronological order, and the entities associated with each of these events.

¹ https://github.com/sopankhosla/chemprop
More specifically, for each reaction step in a snippet, we annotate the trigger event verb and the noun phrases (entities) that depict the (i) INPUT, (ii) OUTPUT, and (iii) reaction attributes (RXN_ATTR) of that reaction step. We use the raw reaction snippets from the ChEMU Chemical Reaction Corpus as our data and annotate them by (i) automatically combining the annotations of the CLEF ChEMU 2020 and 2021 shared tasks, and (ii) manually incorporating events/entities that are missed by the two annotation schemes.

Furthermore, we show the significance of these augmentations by evaluating the best-performing models from the ChEMU 2020 and 2021 shared tasks on ChemProp. Our experiments show that models trained on the ChemProp training data achieve only 0.69 Micro-F1 on the test data, highlighting the room for improvement. We also show that ChemProp contains novel entities and relationships that are not present in the ChEMU shared tasks, making it beneficial as a standalone benchmark for instructional-language modeling from chemical patents.

2. Prior Art

In this section, we briefly describe the ChEMU Chemical Reaction Corpus and the two state-of-the-art annotation schemes proposed during the ChEMU shared tasks '20 & '21.

CLEF ChEMU 2020 Annotation Schema (CC20). He et al. [3] annotated a corpus of 1,500 patent snippets sampled from 170 patents from the European Patent Office and the United States Patent and Trademark Office. Their annotation schema aims at the extraction of chemical reactions (i.e., REACTION_STEP, WORKUP) from patent snippets. It identifies trigger words that describe reaction steps and relates them to the named entities linked to each step (i.e., chemical compounds, time, temperature, and yields; Figure 1). Despite being a comprehensive annotation schema, CC20 suffers from two major drawbacks:

1. CC20 does not annotate reaction steps that do not relate to any named entity in the discourse snippet. E.g., as shown in Figure 1, CC20 does not annotate the event concentrated (in line 7).
2. Furthermore, CC20 does not capture relationships between reaction steps (events) and noun-phrase mentions that denote combinations/mixtures (e.g., the reaction mixture) or coreferent expressions (e.g., the product).

Figure 1: File 0050 with CC20 annotations.

CLEF ChEMU 2021 Annotation Schema (CC21). The next year, He et al. [4] proposed an additional layer of annotation on the patent corpus, which focuses on the identification of anaphoric references. The new corpus contains annotations for both COREFERENCE and bridging relations (Figure 2). The authors define four domain-specific sub-types of bridging: TRANSFORMED, REACTION_ASSOCIATED, WORK_UP, and CONTAINED. As a standalone schema, CC21 suffers from the following issues:

1. CC21 does not contain explicit information about reaction steps. Therefore, it is less useful, in isolation, for information extraction from chemical patents.
2. Furthermore, CC21 differs from CC20 in its definition of mentions, which makes the combination of the two annotations non-trivial. In the next section, we describe the algorithm that handles these ambiguities.
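The two annotation layers can be pictured with a short sketch. The Python dataclasses below are our own illustration of the information each schema carries; the class and field names are assumptions for exposition, not the official ChEMU release format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Mention:
    """A text span in a reaction snippet. CC20 restricts mentions to
    named entities; CC21 also covers generic noun phrases such as
    'the filtrate' or 'the reaction mixture'."""
    start: int  # character offset in the snippet
    end: int
    text: str

@dataclass
class CC20Event:
    """A CC20 annotation: a trigger word for a reaction/work-up step,
    linked to the named entities involved in that step."""
    trigger: Mention                                   # e.g. 'stirred'
    step_type: str                                     # 'REACTION_STEP' | 'WORKUP'
    arguments: List[Tuple[str, Mention]] = field(default_factory=list)
    # e.g. [('TEMPERATURE', <r.t.>), ('TIME', <2 h>)]

@dataclass
class CC21Relation:
    """A CC21 annotation: an anaphoric link between two noun phrases."""
    rel_type: str  # 'COREFERENCE', 'TRANSFORMED', 'REACTION_ASSOCIATED',
                   # 'WORK_UP', or 'CONTAINED'
    arg1: Mention  # the anaphor, e.g. 'The reaction mixture'
    arg2: Mention  # its antecedent
```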
3. ChemProp: Annotation

CC20 contains relationships between events and named entities, whereas CC21 connects noun phrases (including named entities) based on their anaphoric relationships. Together, CC20 and CC21 provide somewhat complementary information about each reaction snippet in the ChEMU corpus. In this section, we describe the steps we take to merge these somewhat heterogeneous annotation schemas to create our new dataset, ChemProp.

Figure 2: File 0050 with CC21 annotations.

Figure 3: File 0050 with ChemProp annotations.

3.1. Automatic Merging of ChEMU 2020 and 2021

First, we present our algorithm, which automatically merges the signals from CC20 and CC21 and extracts the lexical spans in the discourse that most closely represent the inputs, outputs, and reaction attributes of each reaction step.

3.1.1. Pre-processing

As a pre-processing step, we set up data structures that help with the conversion algorithm (a code sketch follows this list). We create:

1. A many-to-one map from CC20 named entities to CC21 named-entity annotations, to handle the small annotation differences between the two schemas. The map is many-to-one because CC21 annotates entire solutions, whereas CC20 keeps the individual compounds/elements, e.g., [a solution of ethanol_CC20 and water_CC20]_CC21. We also store the inverse one-to-many map from CC21 to CC20. So, for the given example, we store
   ethanol_CC20 ↔ [a solution of ethanol and water]_CC21
   water_CC20 ↔ [a solution of ethanol and water]_CC21.

2. The mapping from the ARG1 of CC21 reaction-oriented relations (i.e., REACTION_ASSOCIATED (R_ASSOC), WORK_UP, and TRANSFORMED (TRANS)) to their corresponding ARG2 mentions. For example, for the relation TRANSFORMED between The reaction mixture and The reaction in Figure 2 (lines 4, 5), we store
   The reaction mixture_CC21 --Rxn:TRANS--> The reaction_CC21.

3. The mapping from CC20 events (ARG1) to their corresponding argument mentions (ARG2). For example, for the event stirred in Figure 1 (line 4), we store
   stirred_CC20 <--Event:TEMP--> r.t._CC20
   stirred_CC20 <--Event:TIME--> 2 h_CC20.

4. Finally, a dictionary to store the COREFERENCE relationships between different CC21 mentions. We create a separate map for coreference because these relations denote that both mentions in the pair are equivalent, i.e., point to the same underlying entity in the discourse:
   ARG1_CC21 <--Coref--> ARG2_CC21.

The above mappings store the relationships annotated in CC20 and CC21 that are relevant to creating ChemProp.
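The sketch below shows how the four maps could be materialized from the illustrative Mention/CC20Event/CC21Relation records defined earlier. The span-containment test used to align the two mention vocabularies (map 1) is our simplification of the alignment between the schemas.

```python
from collections import defaultdict

REACTION_ORIENTED = {"REACTION_ASSOCIATED", "WORK_UP", "TRANSFORMED"}

def build_maps(cc20_events, cc21_relations):
    """Builds the four pre-processing maps (illustrative sketch)."""
    ne20_to_ne21 = {}                     # (1) CC20 named entity -> CC21 span
    ne21_to_ne20s = defaultdict(list)     # (1) inverse, one-to-many
    arg1_to_arg2s = defaultdict(list)     # (2) CC21 ARG1 -> [(rel, ARG2)]
    event_to_args = defaultdict(list)     # (3) CC20 trigger -> [(role, entity)]
    coref = defaultdict(set)              # (4) COREFERENCE partners

    for ev in cc20_events:                                   # map (3)
        event_to_args[ev.trigger].extend(ev.arguments)

    cc21_mentions = {m for r in cc21_relations for m in (r.arg1, r.arg2)}
    for rel in cc21_relations:
        if rel.rel_type in REACTION_ORIENTED:                # map (2)
            arg1_to_arg2s[rel.arg1].append((rel.rel_type, rel.arg2))
        elif rel.rel_type == "COREFERENCE":                  # map (4)
            coref[rel.arg1].add(rel.arg2)
            coref[rel.arg2].add(rel.arg1)

    # map (1): a CC20 entity aligns with the CC21 span that contains it,
    # e.g. ethanol -> 'a solution of ethanol and water'
    for ev in cc20_events:
        for _, ne in ev.arguments:
            for m in cc21_mentions:
                if m.start <= ne.start and ne.end <= m.end:
                    ne20_to_ne21[ne] = m
                    ne21_to_ne20s[m].append(ne)

    return ne20_to_ne21, ne21_to_ne20s, arg1_to_arg2s, event_to_args, coref
```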
3.1.2. Algorithm

We allow three relationships between reaction events and entities in our new ChemProp benchmark: INPUT, OUTPUT, and RXN_ATTR. CC20 already annotates RXN_ATTRs (reaction attributes) and named-entity INPUTs/OUTPUTs. Therefore, to complete the schema, we devise an algorithm that finds a mapping from each non-named-entity noun phrase in CC21 to a reaction event in CC20 (ARG_CC21 ↔ Rxn_CC20).

For each CC21 entity (ARG1_CC21) that is related to other CC21 entities (ARG2_CC21) occurring before it in the discourse via one of the three reaction-oriented CC21 relations (Rxn_CC21), i.e., REACTION_ASSOCIATED, WORK_UP, and TRANSFORMED, two possibilities need to be considered. For each ARG2^i_CC21 in ARG2_CC21:

1. If ARG2^i_CC21 is also present in the CC20 annotation, we use this ARG2^i_CC21 as a pivot to combine the two annotations. We extract the CC20 relation (Rxn^i_CC20) it is linked with.

2. If ARG2^i_CC21 is not present in CC20, it means that ARG2^i_CC21 is a non-named-entity noun phrase and therefore needs to be resolved further. To ground such cases, we rely on the fact that the starting compounds in each reaction snippet are named entities. Therefore, we can assume that, in order to reach the current ARG1_CC21, all other ARG2_CC21s that themselves appeared as an ARG1_CC21 (noun phrases) have been resolved in earlier iterations of this algorithm. Hence, the latest Rxn^i_CC20 in the patent snippet between the CC20 reaction event associated with ARG2^i_CC21 (i.e., ARG2^i_CC21 → Rxn_CC20) and ARG1_CC21 is returned.

From this list of relations, we consider the Rxn^j_CC20 that is closest to ARG1_CC21, but occurs before it, to be the lexical event trigger that outputs ARG1_CC21:

   Cand_Rxn_CC20 = [Rxn^1_CC20, Rxn^2_CC20, ..., Rxn^j_CC20, ...]
   Rxn^j_CC20 = closest_before(ARG1_CC21, Cand_Rxn_CC20)
   ARG1_CC21 <--Output:Rxn--> Rxn^j_CC20.

Consider the case where ARG1_CC21 is The combined organic phases (Figure 2; line 6). It is related to four mentions (ARG2_CC21s), {The reaction mixture, a pad of celite, water, ethyl acetate}, by the relation WORK_UP (Rxn_CC21). Three of these mentions, {a pad of celite, water, ethyl acetate}, are also present in the mapping created in pre-processing step 1, whereas {The reaction mixture} is not. As discussed earlier, for the ARG2^i_CC21s present in CC20, we first create a list of the Rxn_CC20s (Cand_Rxn_CC20) they correspond to: {filtered, diluted, extracted} (Figure 1). For The reaction mixture, the corresponding Rxn_CC20, based on its resolution in a previous iteration, is The reaction mixture ↔ stirred. The latest CC20 reaction event between stirred and the current ARG1_CC21, namely extracted, is then inserted into Cand_Rxn_CC20. Combining the four relations, we get {extracted, filtered, diluted, extracted}. Of these, extracted occurs closest to the current ARG1 while occurring before it. Therefore, The combined organic phases is considered to be the OUTPUT of extracted (The combined organic phases ↔ extracted; Figure 3).
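The merging loop itself can be sketched as follows. This is our reading of the procedure, not the authors' released code: arg2s_of, cc20_trigger_for, and triggers_between are hypothetical helpers standing in for lookups over the pre-processing maps, and processing ARG1 mentions left-to-right realizes the resolved-in-earlier-iterations assumption described above.

```python
def closest_before(arg1, triggers):
    """The latest CC20 event trigger that still precedes arg1 in the text."""
    before = [t for t in triggers if t.end <= arg1.start]
    return max(before, key=lambda t: t.end) if before else None

def resolve_outputs(arg1s_in_order, arg2s_of, cc20_trigger_for, triggers_between):
    """Illustrative sketch of the merging loop.

    arg1s_in_order:   CC21 ARG1 mentions, sorted by position in the snippet
    arg2s_of(a):      ARG2 mentions linked to a by a reaction-oriented relation
    cc20_trigger_for: maps a mention to its CC20 event trigger, or None
    triggers_between: CC20 triggers located between two mentions, in text order
    """
    output_of = {}                        # CC21 mention -> CC20 event trigger
    for arg1 in arg1s_in_order:
        candidates = []
        for arg2 in arg2s_of(arg1):
            trigger = cc20_trigger_for(arg2)
            if trigger is not None:       # case 1: named-entity pivot into CC20
                candidates.append(trigger)
            elif arg2 in output_of:       # case 2: noun phrase resolved earlier
                later = triggers_between(output_of[arg2], arg1)
                candidates.append(later[-1] if later else output_of[arg2])
        trigger = closest_before(arg1, candidates)
        if trigger is not None:           # ARG1 <--Output:Rxn--> trigger
            output_of[arg1] = trigger
    return output_of
```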
Exceptional Cases. Although most events can be fully annotated in the ChemProp format using the above steps, certain exceptions arise from the mismatch between the motivations of the CC20 and CC21 schemas.

1. (E1) In a small number of cases, we find that the reaction event (Rxn^j_CC20) that occurs right before an ARG1_CC21 might not be the event that outputs it.
   Example. Consider the phrase "the reaction mixture is filtered in Celite, and ethanol is added to the filtrate". In this case, the filtrate refers to a state before the addition of ethanol and is the output of the event filtered. We use regular expressions to find such template patterns and resolve them automatically. We tackle the more complex occurrences in the manual quality-assurance phase (described in the next section).

2. (E2) In some instances, we find that ARG2^i_CC21 and ARG1_CC21 are related to each other by a Rxn_CC21, but no corresponding Rxn_CC20 is present. In such cases, we introduce a pseudo relation between them and leave its annotation to the manual step.
   Example. Consider the phrase "the reaction mixture is filtered, and the filtrate is heated for 20 min". For this phrase, filtered would not be annotated in CC20 (it has no related named entity); however, CC21 would annotate the pair (the filtrate, the reaction mixture) as TRANSFORMED. In this case, we automatically introduce a pseudo relation whose input is the reaction mixture and whose output is the filtrate.

The patent snippets are an ordered sequence of event steps that transform a starting product into an end product. Therefore, one can, with sufficient confidence, also consider the outputs of a particular event to be the best lexical representation of the inputs of the immediately following event. This allows us to annotate both the INPUTs and the OUTPUTs of the CC20 reaction steps using a single algorithm (a sketch of this chaining follows at the end of this subsection). Finally, we add the RXN_ATTR (reaction-attribute) annotations present in CC20 on top to get our final dataset, ChemProp.

We note that although we only consider three types of relations, where each relation is between an EVENT mention and an ENTITY mention (Figure 3), the fine-grained classification of these two types of mentions provided in CC20 can easily be ported over to ChemProp to make the new annotation schema more informative.
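Given the OUTPUT mapping, the INPUT side then falls out by chaining consecutive steps. A minimal sketch, assuming the event triggers are kept in chronological (textual) order:

```python
def chain_inputs(triggers_in_order, output_of_trigger):
    """Derive each step's INPUT from the previous step's OUTPUT (illustrative).

    triggers_in_order:  event triggers in chronological (textual) order
    output_of_trigger:  trigger -> mention produced by that step
    Returns: trigger -> mention consumed by that step."""
    inputs = {}
    for prev, curr in zip(triggers_in_order, triggers_in_order[1:]):
        if prev in output_of_trigger:
            # INPUT(step_k) := OUTPUT(step_{k-1})
            inputs[curr] = output_of_trigger[prev]
    return inputs
```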
3.2. Manual Quality Assurance

Next, we manually go through the development and test data to fix the exceptions described in the previous section. As discussed earlier, the CC20 annotation does not cover reaction steps that do not relate to any named entity in the snippet. However, in our case such events are equally relevant and need to be extracted to get a complete picture of the reaction snippet. In order to annotate such events, we manually go through all of the development and test files and fix the annotations, thus arriving at a gold dev set and a gold test set. However, we do not perform this quality-assurance step on the training data due to its large size, and therefore only obtain a silver training set. We find, however, that cases requiring manual inspection occur rather infrequently and therefore do not deteriorate the quality of our training data much ("sterling silver"). Figure 3 shows an example patent snippet from the development set.

4. ChemProp: Baseline

In order to set up a baseline for ChemProp, we train the pipeline-based system from Dutt et al. [6] (referred to as CC21_best going forward) on the ChemProp training set. We refer the reader to that paper for more details about the system. We show the performance of the two setups described in Dutt et al. [6]: (i) relation classification on gold entities and (ii) end-to-end classification.

We also evaluate the CC20 ground truth [3] and the best-performing model at the ChEMU 2020 shared task [7] on ChemProp. As discussed earlier, the motivation of ChemProp is very similar to that of CLEF ChEMU 2020. However, in the absence of the support from CC21, the CC20 annotation does not capture all the relationships that make up the instructional language present in patent text. Hence, comparing CC20 against ChemProp allows us to quantify the additional information present in ChemProp.

4.1. Results

We use the BRAT evaluation script distributed by the CLEF ChEMU 2021 shared-task organizers to evaluate the different setups. We find that the CC20 ground-truth test data achieves an F1 score of 0.74 (Table 1). While its precision is near perfect, CC20 suffers from low recall on INPUT and OUTPUT relations, as the annotation excludes some reaction events and does not annotate generic noun phrases. Furthermore, CC20_best [7], a model designed for the CC20 shared task, gets an F1 score of 0.62, 12% below the CC20 ground truth. These low numbers suggest that ChemProp provides considerably more information about the chemical patent snippets, information that is systematically missed by systems trained on CC20.

We observe that CC21_best (gold entities) achieves an F1 score of 0.86 on the ChemProp test set. This suggests that the model is able to somewhat reliably figure out which named entities/noun phrases are related to which reaction event in the snippet. CC21_best (end-to-end), in addition to relation classification, also extracts mentions from raw patent snippets, and therefore expectedly performs much worse than CC21_best (gold entities), with an overall F1 score of 0.69.

  System                     | Micro F1
  ---------------------------|---------
  CC20 (ground-truth)        | 0.74
  CC20_best                  | 0.62
  Trained on ChemProp:       |
  CC21_best (gold-entities)  | 0.86
  CC21_best (end-to-end)     | 0.69

Table 1: Performance on the ChemProp test set.

5. Conclusion

In this work, we propose a new corpus, ChemProp, that non-trivially combines properties of the ChEMU 2020 and 2021 annotation schemas to extract the instructional structure of chemical patents. We provide a semi-automatic algorithm to create ChemProp. Evaluating state-of-the-art models on our new dataset suggests that there is still room for improvement in extracting relevant instructional triggers from patent text. We believe that ChemProp can act as an important benchmark for instructional-language modeling.

References

[1] S. Senger, L. Bartek, G. Papadatos, A. Gaulton, Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Journal of Cheminformatics 7 (2015) 1–12.
[2] S. Muresan, P. Petrov, C. Southan, M. J. Kjellberg, T. Kogej, C. Tyrchan, P. Varkonyi, P. H. Xie, Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data, Drug Discovery Today 16 (2011) 1019–1030.
[3] J. He, D. Q. Nguyen, S. A. Akhondi, C. Druckenbrodt, C. Thorne, R. Hoessel, Z. Afzal, Z. Zhai, B. Fang, H. Yoshikawa, et al., Overview of ChEMU 2020: named entity recognition and event extraction of chemical reactions from patents, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020, pp. 237–254.
[4] J. He, B. Fang, H. Yoshikawa, Y. Li, S. A. Akhondi, C. Druckenbrodt, C. Thorne, Z. Afzal, Z. Zhai, L. Cavedon, et al., ChEMU 2021: reaction reference resolution and anaphora resolution in chemical patents, in: ECIR (2), 2021.
[5] Y. Li, B. Fang, J. He, H. Yoshikawa, S. A. Akhondi, C. Druckenbrodt, C. Thorne, Z. Afzal, Z. Zhai, T. Baldwin, et al., Overview of ChEMU 2022 evaluation campaign: information extraction in chemical patents, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022, pp. 521–540.
[6] R. Dutt, S. Khosla, C. P. Rosé, A pipelined approach to anaphora resolution in chemical patents, in: CLEF (Working Notes), 2021, pp. 710–719.
[7] J. Wang, Y. Ren, Z. Zhang, Y. Zhang, MELAXTECH: a report for CLEF 2020–ChEMU task of chemical reaction extraction from patent, 2020.