=Paper=
{{Paper
|id=Vol-2976/short-1
|storemode=property
|title=Open Information Extraction in Digital Libraries: Current Challenges and Open Research Questions
|pdfUrl=https://ceur-ws.org/Vol-2976/short-1.pdf
|volume=Vol-2976
|authors=Hermann Kroll,Judy Al-Chaar,Wolf-Tilo Balke
|dblpUrl=https://dblp.org/rec/conf/jcdl/KrollAB21
}}
==Open Information Extraction in Digital Libraries: Current Challenges and Open Research Questions==
Hermann Kroll, Judy Al-Chaar and Wolf-Tilo Balke
Institute for Information Systems, TU Braunschweig, Mühlenpfordtstr. 23, 38106 Braunschweig, Germany
Abstract
A central challenge for digital libraries is to provide effective access paths to ever-growing collections of mostly textual, i.e., unstructured information. The traditional, yet expensive way to manage, categorize, and annotate such collections is extensive manual metadata curation to semantically enrich library items. The ability to convert textual information automatically into a structured representation would be extremely beneficial, allowing for novel access paths as well as supporting semantically meaningful discovery. This paper investigates opportunities and challenges that the latest techniques for open information extraction offer for digital libraries. Open information extraction promises to work out-of-the-box and does not require domain-specific training data. To evaluate how well such tools perform, we conduct a qualitative evaluation in two domains: general news and biomedicine. Our research shows current benefits, but also reveals serious challenges for practical applications. In particular, three research questions still have to be addressed to reliably use open information extraction in digital library projects.

Keywords
Digital Libraries, Open Information Extraction, Performance Measurement, Metadata Quality
DISCO@JCDL2021, September 27–30, 2021, Online
kroll@ifis.cs.tu-bs.de (H. Kroll); j.al-chaar@tu-bs.de (J. Al-Chaar); balke@ifis.cs.tu-bs.de (W. Balke)
ORCID: 0000-0001-9887-9276 (H. Kroll); 0000-0002-5443-1215 (W. Balke)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

Digital libraries want to offer structured access to information and knowledge over constantly growing collections. And indeed, there is a growing amount of structured databases, knowledge graphs, or linked open data sources available for retrieval in some domains. Moreover, offering such structured information is also vital for several downstream applications, such as supporting complex graph queries in DBpedia [1], or enabling literature-based discovery methods to infer new knowledge [2, 3, 4, 5]. Yet, the majority of knowledge in digital library collections today is still hidden in textual form, and effective methods to harvest structured knowledge from books, journal articles, conference proceedings, etc. are rare. What are the main reasons?

It usually boils down to the costs vs. quality trade-off: Today's intelligent learning techniques enable domain experts to design reliable entity linking and relation extraction for harvesting pre-designed relations between entities from texts, see e.g., [6]. However, these systems to a large degree rely on supervised learning and thus need large-scale training data that cannot be readily transferred across domains. That means experts have to provide tens of thousands of examples to train an extraction system for a single relation. In brief, although supervised methods for entity recognition/linking and relation extraction have been shown to be up to the job with reasonable quality, their practical application comes at a high cost, requiring huge amounts of training data [7]. Hence, even when limiting the scope to specialized domains only, automatically structuring textual collections is still rarely performed in library practice.

In contrast to designing extraction systems for each domain, methods for unsupervised information extraction (OpenIE) promise to change the game. OpenIE aims to extract knowledge from texts without knowing the entity and relation domains a-priori [6]. Thus, OpenIE can be understood as an unsupervised method that could be efficiently applied across different domains. Yet, although OpenIE tools claim to be ready-to-use and suggest a high extraction precision, they are still rarely used in digital library projects. Is it because they are not quite as "ready-to-use" as is commonly expected? In previous work, we have proposed a toolbox that utilizes OpenIE tools to harvest knowledge from texts [8]. The toolbox contains novel algorithms to clean OpenIE outputs, and we performed a quantitative evaluation on biomedical benchmarks. In contrast to our previous works [8, 9], here we analyze the performance of OpenIE on a qualitative level, i.e., what are the main challenges in OpenIE for digital libraries? We do this by performing an evaluation in two common yet very different domains to allow for some generalizability: news articles from the New York Times and scientific articles from PubMed. The contribution of this position paper is a discussion of the future challenges and open research questions of OpenIE in digital libraries. In particular, we formulate three open research questions, which need to be answered before OpenIE can be readily used throughout collections.
2. Investigating the Practical Performance of OpenIE

OpenIE was built as a versatile set of tools for extracting information from unstructured texts. The word open in OpenIE refers to the fact that OpenIE systems do not require pre-defined domains, relations, and named entities to extract new information. However, a system that perfectly transforms unstructured into structured information is nowhere to be found, and research is still ongoing [10, 11, 6]. Older OpenIE tools are based on simple machine learning and rule-based methods. In general, a rule-based system applies hand-crafted rules to store, sort, and manipulate data; one example is the Stanford CoreNLP tool [10]. These systems rely on hand-crafted syntactic or semantic linguistic rules, such as POS taggers and parsers, whose errors usually propagate and compound at each stage. Modern systems build on neural architectures to increase extraction quality [11], i.e., a neural system's task can be seen as a classification problem or a sequence tagging problem. The main idea of a neural OpenIE system is to learn arguments and relation tuples bootstrapped from a state-of-the-art OpenIE system. The most recent and best-performing neural OpenIE system is OpenIE6, published in 2020 [11]. In the following, we analyze both OpenIE tools, namely Stanford CoreNLP and OpenIE6.

2.1. Evaluation Corpus

We have randomly selected articles from two different domains for our qualitative evaluation to allow for some generalizability of our findings. In particular, we investigate ten articles from the New York Times and 17 biomedical articles from PubMed. Topic-wise, the news articles are political, environmental, space & cosmos, and opinion articles. Various sentences were chosen from these articles based either on their structure or context. Regarding sentence structure, we feature five categories: simple, compound, and complex sentences are chosen to go from easy-to-understand sentences to more and more difficult ones; in addition, nested sentences and sentences that contain any type of negation, such as not, are selected. We selected 20 sentences for each category in both corpora.

2.2. Extraction Quality

The evaluation assessed whether the extraction includes all essential and only reasonable information that should be extracted. We employed three referees to rate the information extracted by both OpenIE tools. For each extraction, they decided whether the sentence's original information is retained completely, only partially, or is erroneously extracted: Full means that the statement carries the main message of the sentence. Partial means that some essential information is missed in the extraction. Not means an erroneous extraction that does not yield correct or useful information. In a sentence with negation, extractions should always include the negation to retain the original information. We take the majority vote of the raters for reporting. Table 1 includes all sentence categories per tool and the number of sentences selected from each corpus. In addition, each category is manually evaluated by finding the percentage of extractions that fully, partially, or not at all yield correct and reasonable tuples. From the 200 sentences, five representative sentences were selected for this paper to explain our five categories and to give the reader an intuition about the extraction results.

Simple. A simple sentence is a sentence that includes only one independent clause, i.e., subject, verb, and optionally an object. An example of this category is the following sentence: The naysayers raised fair points. CoreNLP yields the following two extractions: (naysayers; raised; fair points) and (naysayers; raised; points). CoreNLP tends to extract multiple, sometimes redundant, tuples (fair points or points as the objects). For our evaluation, we have always selected the largest CoreNLP tuples, e.g., here we have selected the tuple that contains fair points. CoreNLP's extracted tuple contains all the essential information that should be extracted. As for OpenIE6, it extracted the following tuple: (The naysayers; raised; fair points). Regarding the evaluation of CoreNLP on simple sentences in the NYT corpus, 62% of its extractions consist of a complete statement, and 19% of the extractions were partially complete; the remaining 19% of the extractions were erroneous. On the other hand, OpenIE6 showed very good results (100%) when run on simple sentences.

Compound. In contrast to simple sentences, a compound sentence consists of two independent clauses that are joined using a comma, semicolon, or any conjunction. For example: India has about 10 million coronavirus cases now, and schools have been offering online instruction since March [12]. The expected result for this sentence is basically two extractions (one for each independent clause). In this case, however, CoreNLP's only extraction is (India; has now; about 10 million coronavirus cases.); the second clause of this sentence is not extracted at all. In contrast, OpenIE6 yields the following extractions: (India; has; about 10 million coronavirus cases now,) and (schools; have been offering; online instruction since March.). Both extractions cover the two independent clauses in the sentence. When the systems were run on the complete set of compound sentences, 81% (NYT) and 76% (PubMed) of OpenIE6's extractions were complete, and 19% (NYT) and 14% (PubMed) were at least partially informative. In CoreNLP, however, only 24% (NYT) and 15% (PubMed) of the extractions were fully extracted.
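Tuples like the ones above can be reproduced programmatically. Below is a minimal sketch, assuming a local Stanford CoreNLP installation (pointed to by CORENLP_HOME) and the stanza Python bindings; the exact protobuf field names may differ across CoreNLP versions, and this is an illustration rather than the pipeline used for our evaluation.

```python
# Minimal sketch: obtaining OpenIE triples from Stanford CoreNLP via stanza.
# Assumption: a local CoreNLP distribution is installed and CORENLP_HOME is set.
from stanza.server import CoreNLPClient

text = "The naysayers raised fair points."

# The openie annotator requires the listed upstream annotators.
with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"],
    be_quiet=True,
) as client:
    annotation = client.annotate(text)
    for sentence in annotation.sentence:
        # Each triple carries subject, relation, and object spans plus a confidence.
        for triple in sentence.openieTriple:
            print(f"({triple.subject}; {triple.relation}; {triple.object})")

# Expected output for this sentence (cf. the Simple category above), e.g.:
# (naysayers; raised; fair points)
# (naysayers; raised; points)
```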
Table 1
Evaluation results for OpenIE extraction quality. Three experts rate the extraction quality for CoreNLP and OpenIE6 on a scale between full (all information is kept), partial (relevant parts are missing) and not (information is wrongly or not extracted). The report below is based on a majority vote over the individual ratings.

Corpus   | Sent. Category | #Sent. | CoreNLP Full/Partial/Not | OpenIE6 Full/Partial/Not
NY Times | Simple         | 20     | 62% / 19% / 19%          | 100% / 0% / 0%
NY Times | Compound       | 20     | 24% / 41% / 35%          | 81% / 19% / 0%
NY Times | Complex        | 20     | 15% / 53% / 32%          | 78% / 18% / 4%
NY Times | Nested         | 20     | 4% / 54% / 42%           | 80% / 18% / 2%
NY Times | Negation       | 20     | 5% / 5% / 90%            | 73% / 10% / 17%
PubMed   | Simple         | 20     | 52% / 38% / 10%          | 100% / 0% / 0%
PubMed   | Compound       | 20     | 15% / 44% / 41%          | 76% / 14% / 10%
PubMed   | Complex        | 20     | 38% / 48% / 14%          | 56% / 13% / 31%
PubMed   | Nested         | 20     | 22% / 63% / 15%          | 89% / 11% / 0%
PubMed   | Negation       | 20     | 5% / 33% / 62%           | 81% / 15% / 4%
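The percentages in Table 1 are based on a majority vote over the three individual ratings per extraction. A minimal sketch of such an aggregation is shown below; the label names and the conservative tie-breaking rule are chosen for illustration only and are not necessarily the exact procedure used for the table.

```python
# Minimal sketch: aggregating three ratings per extraction by majority vote.
# The labels ("full", "partial", "not") and the tie-breaking rule are illustrative.
from collections import Counter

def majority_vote(ratings: list[str]) -> str:
    """Return the most frequent label; without a strict majority, be conservative."""
    counts = Counter(ratings)
    label, freq = counts.most_common(1)[0]
    if freq > len(ratings) // 2:
        return label
    return "not"  # e.g., three different labels from three raters

extractions = {
    "(naysayers; raised; fair points)": ["full", "full", "partial"],
    "(India; has now; about 10 million coronavirus cases.)": ["partial", "partial", "not"],
}
for extraction, ratings in extractions.items():
    print(extraction, "->", majority_vote(ratings))
```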
Complex. A complex sentence consists of one independent clause and at least one dependent clause. A complex sentence might look like: Relentless advertising campaigns are telling Indian parents that coding is critical because making children code will develop their cognitive skills [12]. A good extraction, in this case, would either be a single extraction covering the entire sentence or multiple extractions, one for each dependent and independent clause. CoreNLP's most informative extractions for this sentence are: (Relentless advertising campaigns; are telling; Indian Parents), (coding; is; critical) and (making children code; will develop; their cognitive skills). Nevertheless, the extraction (Relentless advertising campaigns; are telling; Indian Parents) seems unclear and incomplete. As for OpenIE6, we have the following tuples: (Relentless advertising campaigns; are telling; Indian parents that coding is critical because making children code will develop their cognitive skills), (coding; is; critical because making children code will develop their cognitive skills) and finally (making children code; will develop; their cognitive skills). None of these extractions miss important information, but they are quite long. Across the complex sentences, most of CoreNLP's extractions miss important parts of the sentence: for the news corpus, only 15% of the extractions were fully extracted. On the other hand, 78% of OpenIE6's extractions were complete.

Nested. Next, a sentence was selected to test whether the provided tools are able to handle nested extractions, too. Consider the following sentence: As a result, many marine species are impeccably adapted to detect and communicate with sound [13]. As we can see here, this sentence consists of only one subject and one relation; however, the rest of the sentence can be divided into two arguments. Here, the nested information that species adapt to detect and adapt to communicate should ideally be retained. CoreNLP's only extraction was (many marine species; are impeccably adapted; to detect with sound), whereas OpenIE6 extracted the following tuples: (many marine species; are impeccably adapted; to communicate with sound) and (many marine species; are impeccably adapted; to detect with sound). Thus, CoreNLP misses the second phrase, whereas OpenIE6 keeps the complete information. In most cases, CoreNLP extracts only the first component from a nested sentence, ignoring the conjunction and everything that comes after it. Therefore, only 4% (NYT) and 22% (PubMed) of CoreNLP's extractions were complete; in OpenIE6, however, 80% (NYT) and 89% (PubMed) of the extractions were complete.

Negation. Last but not least, the final kind of sentences selected were sentences containing any type of negation, such as not, no, none or neither. This category was selected to analyze how each tool reacts to negations in a sentence. Here, a sentence with the negation not was selected, e.g., Recent studies show that man was not always the hunter [14]. CoreNLP's extraction was (Recent studies; show; man), whereas OpenIE6's extractions were (Recent studies; show; that man was not always the hunter) and (man; was not; always the hunter). So, CoreNLP ignores the negation part completely, whereas OpenIE6 keeps the negation correctly. The tools were also tested on further sentences containing negations such as not. In CoreNLP, some of these sentences did not yield any extractions at all; if a sentence had extractions, then the negated part was either entirely ignored and not extracted, or extracted but without the negation. On the contrary, most of OpenIE6's extractions included the negation. Still, for negation, 17% (NYT) and 4% (PubMed) of OpenIE6's extractions were erroneously extracted. Nevertheless, compared to OpenIE6, CoreNLP showed a much higher percentage of erroneous extractions: 90% (NYT) and 62% (PubMed) of CoreNLP's yielded extractions were incomplete or wrong.
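As stated above, an extraction from a negated sentence should retain the negation. A deliberately simple sketch of an automatic check for lost negations is shown below; the cue list and token matching are coarse simplifications for illustration, not part of our evaluation tooling.

```python
# Minimal sketch: flag extractions that drop a negation present in the sentence.
# The cue list and the whitespace tokenization are deliberately simple.
NEGATION_CUES = {"not", "no", "none", "neither", "never"}

def contains_negation(text: str) -> bool:
    tokens = text.lower().replace(";", " ").split()
    return any(tok in NEGATION_CUES or tok.endswith("n't") for tok in tokens)

def negation_preserved(sentence: str, extraction: tuple) -> bool:
    """True if the sentence has no negation, or at least one tuple part keeps it."""
    if not contains_negation(sentence):
        return True
    return any(contains_negation(part) for part in extraction)

sentence = "Recent studies show that man was not always the hunter."
print(negation_preserved(sentence, ("Recent studies", "show", "man")))        # False: negation lost
print(negation_preserved(sentence, ("man", "was not", "always the hunter")))  # True: negation kept
```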
Table 2
Evaluation results for the extracted OpenIE arguments (subjects and objects). Three experts rate whether the argument represents a single concept of interest or a complex concept, where a complex concept consists of multiple concepts.

Corpus   | Argument Type | CoreNLP Single/Complex | OpenIE6 Single/Complex
NY Times | Subject       | 98% / 2%               | 89% / 11%
NY Times | Object        | 80% / 20%              | 32% / 68%
PubMed   | Subject       | 99% / 1%               | 76% / 24%
PubMed   | Object        | 75% / 25%              | 47% / 53%
2.3. Argument Complexity

Having a closer look at the extractions, it seems that CoreNLP tends to extract smaller arguments (subjects or objects) than OpenIE6. For example, CoreNLP yields the triple (making children code; will develop; their cognitive skills) whereas OpenIE6 extracts (coding; is; critical because making children code will develop their cognitive skills). The latter extraction may be hard to handle in a downstream application because the object contains a whole sentence fragment (obviously not structured) and should ideally be broken into smaller pieces. To understand how often arguments are complex, we asked our three experts to rate all extracted arguments again. They assessed whether an argument represents a single concept of interest or a complex concept. For example, a single concept might be a city, a person, an article, a drug, etc. A complex concept consists of multiple smaller concepts, e.g., a person doing something, a location plus date information, an action plus date information, etc. The results are reported in Tab. 2. For example, 98% of CoreNLP's extracted subjects on NYT are actually single concepts, while 89% of OpenIE6's subjects are single concepts. OpenIE6 extracts complex objects more often than CoreNLP: 68% vs. 20% (NYT) and 53% vs. 25% (PubMed).

3. Discussion

We analyzed CoreNLP and OpenIE6 on five sentence categories in two domains: The New York Times and PubMed. In addition, we put our results into perspective with the main findings of our previous work [8].

Extraction Accuracy. First, OpenIE6 outperforms CoreNLP for every sentence category. This finding is not surprising because Kolluru et al. have proposed OpenIE6 as the best-performing OpenIE system in 2020 [11]. They have evaluated OpenIE6 against ten different OpenIE tools on four established benchmarks. Their findings show that OpenIE6 achieves an F1-measure between 46.4% (CaRB 1-1) and 65.6% (OIE16-C). However, our previous evaluation reveals that CoreNLP is much faster, i.e., CoreNLP requires around 8.5 minutes to process 52k sentences, whereas OpenIE6 requires a modern GPU (Nvidia GTX 1080TI) and around one hour to process the same sentences [8]. Kolluru et al. have reported that OpenIE6 can process up to 31.7 sentences per second on a Tesla V100 GPU [11]. For comparison, an older system called RnnOIE can process up to 149.2 sentences per second but comes with a lower F1-measure between 39.5% (CaRB 1-1) and 56.0% (OIE16-C).

Open Research Question 1. What is the best trade-off between extraction runtime and accuracy?

Extraction Arguments. Our qualitative evaluation has revealed that OpenIE tools may extract complex arguments, i.e., arguments that involve multiple concepts. Handling complex arguments can be challenging when using OpenIE in a digital library project, e.g., complex arguments will not represent a precise entity for a knowledge graph. Thus, post-processing is necessary to filter arguments by some domain-specific rules or pre-known vocabularies. One example might be entity-based filters like in [8]: the core idea was to keep only domain-specific concepts in arguments that are found in pre-known entity vocabularies. In addition, complex concepts could also be handled by hand-crafted rules, e.g., storing a date found in an argument as additional information about the actual extraction.

Open Research Question 2. How should extracted arguments be handled? And may post-processing help to handle, filter or repair complex arguments?
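To illustrate the kind of entity-based post-processing mentioned above, the following toy sketch reduces an argument to a concept found in a pre-known vocabulary and drops the extraction otherwise. The vocabulary and the substring matching are invented for illustration and do not reproduce the toolbox from [8].

```python
# Minimal sketch: entity-based filtering of OpenIE arguments against a pre-known
# vocabulary (toy vocabulary and matching strategy are illustrative assumptions).
from typing import Optional

VOCABULARY = {  # domain-specific concepts, e.g., from a curated entity list
    "metformin", "diabetes mellitus", "simvastatin", "india", "new york",
}

def filter_argument(argument: str) -> Optional[str]:
    """Keep an argument only if it mentions a known concept; return that concept."""
    arg = argument.lower()
    matches = [concept for concept in VOCABULARY if concept in arg]
    if not matches:
        return None  # complex or unknown argument: drop the extraction
    # Prefer the longest (most specific) matching concept.
    return max(matches, key=len)

subject, relation, obj = ("metformin treatment in elderly patients",
                          "is associated with",
                          "diabetes mellitus type 2")
print(filter_argument(subject), "|", relation, "|", filter_argument(obj))
# -> metformin | is associated with | diabetes mellitus
```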
Not Canonicalized Outputs. OpenIE's extractions are not canonicalized, i.e., different subjects might refer to the same real-world concept (New York, NY, NYC, etc.). The same holds for relations: multiple verb phrases might represent the same relation, e.g., is born on, has birthdate. Vashishth et al. propose a tool called CESI to canonicalize Open Knowledge Bases (collections of OpenIE extractions) [15]. Their goal is to identify and resolve synonymous subjects, relations, and objects that refer to the same real-world concept. They utilize side information like the Paraphrase Database and entity linking information to embed the open knowledge base into a high-dimensional embedding space. Then, agglomerative clustering is used to find synonymous subjects, relations, and objects. But canonicalizing complex arguments might be especially challenging, i.e., how can a whole sentence fragment be canonicalized correctly? In addition, clustering results might be hard to interpret, e.g., which relation is hidden behind a set of verb phrases? We have thus proposed to integrate domain experts into the canonicalization process, i.e., domain experts build a reliable relation vocabulary to canonicalize verb phrases [8].

Open Research Question 3. How can OpenIE extractions be canonicalized to reliably resolve synonymous noun and verb phrases?
Conclusion. OpenIE offers a way to bring more structure into otherwise unstructured document collections. Our evaluation shows that in simple settings, modern OpenIE tools like OpenIE6 can indeed already extract information with good quality. However, as sentences become more complex, the resulting extractions usually lack important information or do not retain precise semantics.

Still, we believe that OpenIE tools are extremely valuable because their advantage of not requiring domain-specific training examples is necessary for scalability over large digital libraries and especially for more heterogeneous collections. Moreover, combined with methods for filtering unnecessary information or detecting important domain-specific concepts, their overall quality in a concrete application may be drastically increased. To this end, we have formulated three demanding research questions for future research. More research will be necessary to bridge the gap between unstructured and structured information while bypassing the need for supervision as much as possible.

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, Springer, 2007, pp. 722–735.
[2] R. Zhang, M. J. Cairelli, M. Fiszman, G. Rosemblat, H. Kilicoglu, T. C. Rindflesch, S. V. Pakhomov, G. B. Melton, Using semantic predications to uncover drug–drug interactions in clinical data, Journal of Biomedical Informatics 49 (2014) 134–147.
[3] D. R. Swanson, Complementary structures in disjoint science literatures, in: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '91, Association for Computing Machinery, 1991, pp. 280–289. doi:10.1145/122860.122889.
[4] C. Wise, V. N. Ioannidis, M. R. Calvo, X. Song, G. Price, N. Kulkarni, R. Brand, P. Bhatia, G. Karypis, COVID-19 knowledge graph: Accelerating information retrieval and discovery for scientific literature, 2020. arXiv:2007.12731.
[5] D. Hristovski, A. Kastrin, D. Dinevski, T. C. Rindflesch, Constructing a graph database for semantic literature-based discovery, Studies in Health Technology and Informatics 216 (2015) 1094.
[6] G. Weikum, X. L. Dong, S. Razniewski, F. Suchanek, Machine knowledge: Creation and curation of comprehensive knowledge bases, Foundations and Trends® in Databases 10 (2021) 108–490. doi:10.1561/1900000064.
[7] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, T. C. Rindflesch, SemMedDB: a PubMed-scale repository of biomedical semantic predications, Bioinformatics 28 (2012) 3158–3160.
[8] H. Kroll, J. Pirklbauer, W.-T. Balke, A toolbox for the nearly-unsupervised construction of digital library knowledge graphs, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL '21), Association for Computing Machinery, 2021.
[9] H. Kroll, D. Nagel, M. Kunz, W.-T. Balke, Demonstrating narrative bindings: Linking discourses to knowledge repositories, in: Fourth Workshop on Narrative Extraction From Texts, Text2Story@ECIR2021, volume 2860 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 57–63. URL: http://ceur-ws.org/Vol-2860/paper7.pdf.
[10] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.
[11] K. Kolluru, V. Adlakha, S. Aggarwal, Mausam, S. Chakrabarti, OpenIE6: Iterative grid labeling and coordination analysis for open information extraction, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 2020, pp. 3748–3761.
[12] N. Misra, Do children really need to learn to code?, https://www.nytimes.com/2021/01/02/opinion/teaching-coding-schools-india.html, NY Times (April 2021).
[13] S. Imbler, In the oceans, the volume is rising as never before, https://www.nytimes.com/2021/02/04/science/ocean-marine-noise-pollution.html, NY Times (April 2021).
[14] A. Newitz, What new science techniques tell us about ancient women warriors, https://www.nytimes.com/2021/01/01/opinion/women-hunter-leader.html, NY Times (April 2021).
[15] S. Vashishth, P. Jain, P. Talukdar, CESI: Canonicalizing open knowledge bases using embeddings and side information, in: Proceedings of the 2018 World Wide Web Conference, WWW '18, 2018, pp. 1317–1327. doi:10.1145/3178876.3186030.