=Paper=
{{Paper
|id=Vol-2976/short-1
|storemode=property
|title=Open Information Extraction in Digital Libraries: Current Challenges and Open Research Questions
|pdfUrl=https://ceur-ws.org/Vol-2976/short-1.pdf
|volume=Vol-2976
|authors=Hermann Kroll,Judy Al-Chaar,Wolf-Tilo Balke
|dblpUrl=https://dblp.org/rec/conf/jcdl/KrollAB21
}}
==Open Information Extraction in Digital Libraries: Current Challenges and Open Research Questions==
Hermann Kroll, Judy Al-Chaar and Wolf-Tilo Balke
Institute for Information Systems, TU Braunschweig, Mühlenpfordtstr. 23, 38106 Braunschweig, Germany
Abstract
A central challenge for digital libraries is to provide effective access paths to ever-growing collections of mostly textual, i.e., unstructured information. The traditional, yet expensive way to manage, categorize, and annotate such collections is extensive manual metadata curation to semantically enrich library items. The ability to convert textual information automatically into a structured representation would be extremely beneficial, allowing for novel access paths as well as supporting semantically meaningful discovery. This paper investigates opportunities and challenges that the latest techniques for open information extraction offer for digital libraries. Open information extraction promises to work out-of-the-box and does not require domain-specific training data. To evaluate how well such tools perform, we conduct a qualitative evaluation in two domains: general news and biomedicine. Our research shows current benefits, but also reveals serious challenges for practical applications. In particular, three research questions still have to be addressed to reliably use open information extraction in digital library projects.

Keywords
Digital Libraries, Open Information Extraction, Performance Measurement, Metadata Quality
DISCO@JCDL2021, September 27–30, 2021, Online
kroll@ifis.cs.tu-bs.de (H. Kroll); j.al-chaar@tu-bs.de (J. Al-Chaar); balke@ifis.cs.tu-bs.de (W. Balke)
ORCID: 0000-0001-9887-9276 (H. Kroll); 0000-0002-5443-1215 (W. Balke)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

Digital libraries want to offer structured access to information and knowledge over constantly growing collections. And indeed, there is a growing amount of structured databases, knowledge graphs, or linked open data sources available for retrieval in some domains. Moreover, offering such structured information is also vital for several downstream applications, such as supporting complex graph queries in DBpedia [1], or enabling literature-based discovery methods to infer new knowledge [2, 3, 4, 5]. Yet, the majority of knowledge in digital library collections today is still hidden in textual form, and effective methods to harvest structured knowledge from books, journal articles, conference proceedings, etc. are rare. What are the main reasons?

It usually boils down to the costs vs. quality trade-off: Today's intelligent learning techniques enable domain experts to design reliable entity linking and relation extraction for harvesting pre-designed relations between entities from texts, see e.g., [6]. However, these systems to a large degree rely on supervised learning and thus need large-scale training data that cannot be readily transferred across domains. That means experts have to provide tens of thousands of examples to train an extraction system for a single relation. In brief, although supervised methods for entity recognition/linking and relation extraction have been shown to be up to the job with reasonable quality, their practical application comes at a high cost, requiring huge amounts of training data [7]. Hence, even when limiting the scope to specialized domains only, automatically structuring textual collections is still rarely performed in library practice.

In contrast to designing extraction systems for each domain, methods for unsupervised information extraction (OpenIE) promise to change the game. OpenIE aims to extract knowledge from texts without knowing the entity and relation domains a-priori [6]. Thus, OpenIE can be understood as an unsupervised method that could be efficiently applied across different domains. Yet, although OpenIE tools claim to be ready-to-use and suggest a high extraction precision, they are still rarely used in digital library projects. Is it because they are not quite as "ready-to-use" as is commonly expected? In previous work, we have proposed a toolbox that utilizes OpenIE tools to harvest knowledge from texts [8]. The toolbox contains novel algorithms to clean OpenIE outputs, and we performed a quantitative evaluation on biomedical benchmarks. In contrast to our previous works [8, 9], here we analyze the performance of OpenIE on a qualitative level, i.e., what are the main challenges in OpenIE for digital libraries? We do this by performing an evaluation in two common yet very different domains to allow for some generalizability: news articles from the New York Times and scientific articles from PubMed. The contribution of this position paper is a discussion of the future challenges and open research questions of OpenIE in digital libraries. In particular, we formulate three open research questions, which need to be answered before OpenIE can be readily used throughout collections.
2. Investigating the Practical Performance of OpenIE

OpenIE was built as a versatile set of tools for extracting information from unstructured texts. The word open in OpenIE refers to the fact that OpenIE systems do not require pre-defined domains, relations, and named entities to extract new information. However, a system that perfectly transforms unstructured into structured information is nowhere to be found, and research is still ongoing [10, 11, 6]. Older OpenIE tools are based on simple machine learning and rule-based methods. In general, a rule-based system applies hand-crafted rules to store, sort, and manipulate data; one example is the Stanford CoreNLP tool [10]. These systems rely on hand-crafted syntactic or semantic linguistic rules, such as POS taggers and parsers, whose errors usually propagate and compound at each stage. Modern systems build on neural architectures to increase extraction quality [11], i.e., a neural system's task can be seen as a classification problem or a sequence tagging problem. The main idea of a neural OpenIE system is to learn arguments and relation tuples bootstrapped from a state-of-the-art OpenIE system. The most recent and best-performing neural OpenIE system is OpenIE6, published in 2020 [11]. In the following, we analyze both OpenIE tools, namely Stanford CoreNLP and OpenIE6.

2.1. Evaluation Corpus

We have randomly selected articles from two different domains for our qualitative evaluation to allow for some generalizability of our findings. In particular, we investigate ten articles from the New York Times and 17 biomedical articles from PubMed. Topic-wise, the news articles are political, environmental, space & cosmos, and opinion articles. Various sentences were chosen from these articles based either on their structure or context. Regarding sentence structure, we feature five categories: simple, compound, and complex sentences are chosen to go from easy-to-understand sentences to more and more difficult ones; in addition, nested sentences and sentences that contain any type of negation, such as not, are selected. We selected 20 sentences for each category in both corpora.

2.2. Extraction Quality

The evaluation assessed whether the extraction includes all essential and only reasonable information that should be extracted. We employed three referees to rate the information extracted by both OpenIE tools. For each extraction, they decided whether the sentence's original information is retained completely, only partially, or is erroneously extracted: Full means that the statement carries the main message of the sentence. Partial means that some essential information is missed in the extraction. Not means an erroneous extraction that does not yield correct or useful information. In a sentence with negation, extractions should always include the negation to retain the original information. We take the majority vote of the raters for reporting. Table 1 includes all sentence categories per tool and the number of sentences selected from each corpus. In addition, each category is manually evaluated by finding the percentage of extractions that fully, partially, or not at all yield correct and reasonable tuples. From the 200 sentences, five representative sentences were selected for this paper to explain our five categories and to give the reader an intuition about the extraction results.

Simple. A simple sentence is a sentence that includes only one independent clause, i.e., subject, verb, and optionally an object. An example of this category is the following sentence: The naysayers raised fair points. CoreNLP yields the following two extractions: (naysayers; raised; fair points) and (naysayers; raised; points). CoreNLP tends to extract multiple, sometimes redundant, tuples (fair points or points as the objects). For our evaluation, we have always selected the largest CoreNLP tuples, e.g., here we have selected the tuple that contains fair points. CoreNLP's extracted tuple contains all the essential information that should be extracted. As for OpenIE6, it extracted the following tuple: (The naysayers; raised; fair points). Regarding the evaluation of CoreNLP on simple sentences in the NYT corpus, 62% of its extractions consist of a complete statement, and 19% of the extractions were partially complete; the remaining 19% of the extractions were erroneous. On the other hand, OpenIE6 showed very good results (100%) when run on simple sentences.

Compound. In contrast to simple sentences, a compound sentence consists of two independent clauses that are joined using a comma, semicolon, or any conjunction. For example: India has about 10 million coronavirus cases now, and schools have been offering online instruction since March [12]. The expected result for this sentence is basically two extractions (one for each independent clause). In this case, however, CoreNLP's only extraction is (India; has now; about 10 million coronavirus cases.); the second clause of this sentence is not extracted at all. In contrast, OpenIE6 yields the following extractions: (India; has; about 10 million coronavirus cases now,) and (schools; have been offering; online instruction since March.). Both extractions cover the two independent clauses in the sentence. When the systems were run on the complete set of compound sentences, 81% (NYT) and 76% (PubMed) of OpenIE6's extractions were complete, and 19% (NYT) and 14% (PubMed) were at least partially informative. In CoreNLP, however, only 24% (NYT) and 15% (PubMed) of the extractions were fully extracted.
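Tuples like the ones above can be reproduced programmatically. Below is a minimal sketch, assuming a local Stanford CoreNLP installation (pointed to by CORENLP_HOME) and the stanza Python bindings; the exact protobuf field names may differ across CoreNLP versions, and this is an illustration rather than the pipeline used for our evaluation.

```python
# Minimal sketch: obtaining OpenIE triples from Stanford CoreNLP via stanza.
# Assumption: a local CoreNLP distribution is installed and CORENLP_HOME is set.
from stanza.server import CoreNLPClient

text = "The naysayers raised fair points."

# The openie annotator requires the listed upstream annotators.
with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"],
    be_quiet=True,
) as client:
    annotation = client.annotate(text)
    for sentence in annotation.sentence:
        # Each triple carries subject, relation, and object spans plus a confidence.
        for triple in sentence.openieTriple:
            print(f"({triple.subject}; {triple.relation}; {triple.object})")

# Expected output for this sentence (cf. the Simple category above), e.g.:
# (naysayers; raised; fair points)
# (naysayers; raised; points)
```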
Table 1
Evaluation results for OpenIE extraction quality. Three experts rate the extraction quality for CoreNLP and OpenIE6 on a scale between full (all information is kept), partial (relevant parts are missing) and not (information is wrongly or not extracted). The report below is based on a majority vote over the individual ratings.

Corpus   | Sent. Category | #Sent. | CoreNLP Full/Partial/Not | OpenIE6 Full/Partial/Not
NY Times | Simple         | 20     | 62% / 19% / 19%          | 100% / 0% / 0%
NY Times | Compound       | 20     | 24% / 41% / 35%          | 81% / 19% / 0%
NY Times | Complex        | 20     | 15% / 53% / 32%          | 78% / 18% / 4%
NY Times | Nested         | 20     | 4% / 54% / 42%           | 80% / 18% / 2%
NY Times | Negation       | 20     | 5% / 5% / 90%            | 73% / 10% / 17%
PubMed   | Simple         | 20     | 52% / 38% / 10%          | 100% / 0% / 0%
PubMed   | Compound       | 20     | 15% / 44% / 41%          | 76% / 14% / 10%
PubMed   | Complex        | 20     | 38% / 48% / 14%          | 56% / 13% / 31%
PubMed   | Nested         | 20     | 22% / 63% / 15%          | 89% / 11% / 0%
PubMed   | Negation       | 20     | 5% / 33% / 62%           | 81% / 15% / 4%
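The percentages in Table 1 are based on a majority vote over the three individual ratings per extraction. A minimal sketch of such an aggregation is shown below; the label names and the conservative tie-breaking rule are chosen for illustration only and are not necessarily the exact procedure used for the table.

```python
# Minimal sketch: aggregating three ratings per extraction by majority vote.
# The labels ("full", "partial", "not") and the tie-breaking rule are illustrative.
from collections import Counter

def majority_vote(ratings: list[str]) -> str:
    """Return the most frequent label; without a strict majority, be conservative."""
    counts = Counter(ratings)
    label, freq = counts.most_common(1)[0]
    if freq > len(ratings) // 2:
        return label
    return "not"  # e.g., three different labels from three raters

extractions = {
    "(naysayers; raised; fair points)": ["full", "full", "partial"],
    "(India; has now; about 10 million coronavirus cases.)": ["partial", "partial", "not"],
}
for extraction, ratings in extractions.items():
    print(extraction, "->", majority_vote(ratings))
```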
Complex. A complex sentence consists of one independent clause and at least one dependent clause. A complex sentence might look like: Relentless advertising campaigns are telling Indian parents that coding is critical because making children code will develop their cognitive skills [12]. A good extraction, in this case, would either be a single extraction covering the entire sentence or multiple extractions, one for each dependent and independent clause. CoreNLP's most informative extractions for this sentence are: (Relentless advertising campaigns; are telling; Indian Parents), (coding; is; critical) and (making children code; will develop; their cognitive skills). Nevertheless, the extraction (Relentless advertising campaigns; are telling; Indian Parents) seems unclear and incomplete. As for OpenIE6, we have the following tuples: (Relentless advertising campaigns; are telling; Indian parents that coding is critical because making children code will develop their cognitive skills), (coding; is; critical because making children code will develop their cognitive skills) and finally (making children code; will develop; their cognitive skills). None of these extractions miss important information, but they are quite long. Across the complex sentences, most of CoreNLP's extractions miss important parts of the sentence: for the news corpus, only 15% of the extractions were fully extracted. On the other hand, 78% of OpenIE6's extractions were complete.

Nested. Next, a sentence was selected to test whether the provided tools are able to handle nested extractions, too. Consider the following sentence: As a result, many marine species are impeccably adapted to detect and communicate with sound [13]. As we can see here, this sentence consists of only one subject and one relation; however, the rest of the sentence can be divided into two arguments. Here, the nested information that species adapt to detect and adapt to communicate should ideally be retained. CoreNLP's only extraction was (many marine species; are impeccably adapted; to detect with sound), whereas OpenIE6 extracted the following tuples: (many marine species; are impeccably adapted; to communicate with sound) and (many marine species; are impeccably adapted; to detect with sound). Thus, CoreNLP misses the second phrase, whereas OpenIE6 keeps the complete information. In most cases, CoreNLP extracts only the first component from a nested sentence, ignoring the conjunction and everything that comes after it. Therefore, only 4% (NYT) and 22% (PubMed) of CoreNLP's extractions were complete; in OpenIE6, however, 80% (NYT) and 89% (PubMed) of the extractions were complete.

Negation. Last but not least, the final kind of sentences selected were sentences containing any type of negation, such as not, no, none or neither. This category was selected to analyze how each tool reacts to negations in a sentence. Here, a sentence with the negation not was selected, e.g., Recent studies show that man was not always the hunter [14]. CoreNLP's extraction was (Recent studies; show; man), whereas OpenIE6's extractions were (Recent studies; show; that man was not always the hunter) and (man; was not; always the hunter). So, CoreNLP ignores the negation part completely, whereas OpenIE6 keeps the negation correctly. The tools were also tested on further sentences containing negations such as not. In CoreNLP, some of these sentences did not yield any extractions at all; if a sentence had extractions, then the negated part was either entirely ignored and not extracted, or extracted but without the negation. On the contrary, most of OpenIE6's extractions included the negation. Still, for negation, 17% (NYT) and 4% (PubMed) of OpenIE6's extractions were erroneously extracted. Nevertheless, compared to OpenIE6, CoreNLP showed a much higher percentage of erroneous extractions: 90% (NYT) and 62% (PubMed) of CoreNLP's yielded extractions were incomplete or wrong.
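As stated above, an extraction from a negated sentence should retain the negation. A deliberately simple sketch of an automatic check for lost negations is shown below; the cue list and token matching are coarse simplifications for illustration, not part of our evaluation tooling.

```python
# Minimal sketch: flag extractions that drop a negation present in the sentence.
# The cue list and the whitespace tokenization are deliberately simple.
NEGATION_CUES = {"not", "no", "none", "neither", "never"}

def contains_negation(text: str) -> bool:
    tokens = text.lower().replace(";", " ").split()
    return any(tok in NEGATION_CUES or tok.endswith("n't") for tok in tokens)

def negation_preserved(sentence: str, extraction: tuple) -> bool:
    """True if the sentence has no negation, or at least one tuple part keeps it."""
    if not contains_negation(sentence):
        return True
    return any(contains_negation(part) for part in extraction)

sentence = "Recent studies show that man was not always the hunter."
print(negation_preserved(sentence, ("Recent studies", "show", "man")))        # False: negation lost
print(negation_preserved(sentence, ("man", "was not", "always the hunter")))  # True: negation kept
```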
Table 2
Evaluation results for the extracted OpenIE arguments (subjects and objects). Three experts rate whether the argument represents a single concept of interest or a complex concept, where a complex concept consists of multiple concepts.

Corpus   | Argument Type | CoreNLP Single/Complex | OpenIE6 Single/Complex
NY Times | Subject       | 98% / 2%               | 89% / 11%
NY Times | Object        | 80% / 20%              | 32% / 68%
PubMed   | Subject       | 99% / 1%               | 76% / 24%
PubMed   | Object        | 75% / 25%              | 47% / 53%
2.3. Argument Complexity

Having a closer look at the extractions, it seems that CoreNLP tends to extract smaller arguments (subjects or objects) than OpenIE6. For example, CoreNLP yields the triple (making children code; will develop; their cognitive skills) whereas OpenIE6 extracts (coding; is; critical because making children code will develop their cognitive skills). The latter extraction may be hard to handle in a downstream application because the object contains a whole sentence fragment (obviously not structured) and should ideally be broken into smaller pieces. To understand how often arguments are complex, we asked our three experts to rate all extracted arguments again. They assessed whether an argument represents a single concept of interest or a complex concept. For example, a single concept might be a city, a person, an article, a drug, etc. A complex concept consists of multiple smaller concepts, e.g., a person doing something, a location plus date information, an action plus date information, etc. The results are reported in Tab. 2. For example, 98% of CoreNLP's extracted subjects on NYT are actually single concepts, while 89% of OpenIE6's subjects are single concepts. OpenIE6 extracts complex objects more often than CoreNLP: 68% vs. 20% (NYT) and 53% vs. 25% (PubMed).

3. Discussion

We analyzed CoreNLP and OpenIE6 on five sentence categories in two domains: The New York Times and PubMed. In addition, we put our results into perspective with the main findings of our previous work [8].

Extraction Accuracy. First, OpenIE6 outperforms CoreNLP for every sentence category. This finding is not surprising because Kolluru et al. have proposed OpenIE6 as the best-performing OpenIE system in 2020 [11]. They have evaluated OpenIE6 against ten different OpenIE tools on four established benchmarks. Their findings show that OpenIE6 achieves an F1-measure between 46.4% (CaRB 1-1) and 65.6% (OIE16-C). However, our previous evaluation reveals that CoreNLP is much faster, i.e., CoreNLP requires around 8.5 minutes to process 52k sentences, whereas OpenIE6 requires a modern GPU (Nvidia GTX 1080TI) and around one hour to process the same sentences [8]. Kolluru et al. have reported that OpenIE6 can process up to 31.7 sentences per second on a Tesla V100 GPU [11]. For comparison, an older system called RnnOIE can process up to 149.2 sentences per second but comes with a lower F1-measure between 39.5% (CaRB 1-1) and 56.0% (OIE16-C).

Open Research Question 1. What is the best trade-off between extraction runtime and accuracy?

Extraction Arguments. Our qualitative evaluation has revealed that OpenIE tools may extract complex arguments, i.e., arguments that involve multiple concepts. Handling complex arguments can be challenging when using OpenIE in a digital library project, e.g., complex arguments will not represent a precise entity for a knowledge graph. Thus, post-processing is necessary to filter arguments by some domain-specific rules or pre-known vocabularies. One example might be entity-based filters like in [8]: the core idea was to keep only domain-specific concepts in arguments that are found in pre-known entity vocabularies. In addition, complex concepts could also be handled by hand-crafted rules, e.g., storing a date found in an argument as additional information about the actual extraction.

Open Research Question 2. How should extracted arguments be handled? And may post-processing help to handle, filter or repair complex arguments?
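To illustrate the kind of entity-based post-processing mentioned above, the following toy sketch reduces an argument to a concept found in a pre-known vocabulary and drops the extraction otherwise. The vocabulary and the substring matching are invented for illustration and do not reproduce the toolbox from [8].

```python
# Minimal sketch: entity-based filtering of OpenIE arguments against a pre-known
# vocabulary (toy vocabulary and matching strategy are illustrative assumptions).
from typing import Optional

VOCABULARY = {  # domain-specific concepts, e.g., from a curated entity list
    "metformin", "diabetes mellitus", "simvastatin", "india", "new york",
}

def filter_argument(argument: str) -> Optional[str]:
    """Keep an argument only if it mentions a known concept; return that concept."""
    arg = argument.lower()
    matches = [concept for concept in VOCABULARY if concept in arg]
    if not matches:
        return None  # complex or unknown argument: drop the extraction
    # Prefer the longest (most specific) matching concept.
    return max(matches, key=len)

subject, relation, obj = ("metformin treatment in elderly patients",
                          "is associated with",
                          "diabetes mellitus type 2")
print(filter_argument(subject), "|", relation, "|", filter_argument(obj))
# -> metformin | is associated with | diabetes mellitus
```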
Not Canonicalized Outputs. OpenIE's extractions are not canonicalized, i.e., different subjects might refer to the same real-world concept (New York, NY, NYC, etc.). The same holds for relations: multiple verb phrases might represent the same relation, e.g., is born on, has birthdate. Vashishth et al. propose a tool called CESI to canonicalize Open Knowledge Bases (collections of OpenIE extractions) [15]. Their goal is to identify and resolve synonymous subjects, relations, and objects that refer to the same real-world concept. They utilize side information like the Paraphrase Database and entity linking information to embed the open knowledge base into a high-dimensional embedding space. Then, agglomerative clustering is used to find synonymous subjects, relations, and objects. But canonicalizing complex arguments might be especially challenging, i.e., how can a whole sentence fragment be canonicalized correctly? In addition, clustering results might be hard to interpret, e.g., which relation is hidden behind a set of verb phrases? We have thus proposed to integrate domain experts into the canonicalization process, i.e., domain experts build a reliable relation vocabulary to canonicalize verb phrases [8].

Open Research Question 3. How can OpenIE extractions be canonicalized to reliably resolve synonymous noun and verb phrases?
Conclusion. OpenIE offers a way to bring more structure into otherwise unstructured document collections. Our evaluation shows that in simple settings, modern OpenIE tools like OpenIE6 can indeed already extract information with good quality. However, as sentences become more complex, the resulting extractions usually lack important information or do not retain precise semantics.

Still, we believe that OpenIE tools are extremely valuable because their advantage of not requiring domain-specific training examples is necessary for scalability over large digital libraries and especially for more heterogeneous collections. Moreover, combined with methods for filtering unnecessary information or detecting important domain-specific concepts, their overall quality in a concrete application may be drastically increased. To this end, we have formulated three demanding research questions for future research. More research will be necessary to bridge the gap between unstructured and structured information while bypassing the need for supervision as much as possible.

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, Springer, 2007, pp. 722–735.
[2] R. Zhang, M. J. Cairelli, M. Fiszman, G. Rosemblat, H. Kilicoglu, T. C. Rindflesch, S. V. Pakhomov, G. B. Melton, Using semantic predications to uncover drug–drug interactions in clinical data, Journal of Biomedical Informatics 49 (2014) 134–147.
[3] D. R. Swanson, Complementary structures in disjoint science literatures, in: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '91, Association for Computing Machinery, 1991, pp. 280–289. doi:10.1145/122860.122889.
[4] C. Wise, V. N. Ioannidis, M. R. Calvo, X. Song, G. Price, N. Kulkarni, R. Brand, P. Bhatia, G. Karypis, COVID-19 knowledge graph: Accelerating information retrieval and discovery for scientific literature, 2020. arXiv:2007.12731.
[5] D. Hristovski, A. Kastrin, D. Dinevski, T. C. Rindflesch, Constructing a graph database for semantic literature-based discovery, Studies in Health Technology and Informatics 216 (2015) 1094.
[6] G. Weikum, X. L. Dong, S. Razniewski, F. Suchanek, Machine knowledge: Creation and curation of comprehensive knowledge bases, Foundations and Trends® in Databases 10 (2021) 108–490. doi:10.1561/1900000064.
[7] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, T. C. Rindflesch, SemMedDB: a PubMed-scale repository of biomedical semantic predications, Bioinformatics 28 (2012) 3158–3160.
[8] H. Kroll, J. Pirklbauer, W.-T. Balke, A toolbox for the nearly-unsupervised construction of digital library knowledge graphs, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL '21), Association for Computing Machinery, 2021.
[9] H. Kroll, D. Nagel, M. Kunz, W.-T. Balke, Demonstrating narrative bindings: Linking discourses to knowledge repositories, in: Fourth Workshop on Narrative Extraction From Texts, Text2Story@ECIR2021, volume 2860 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 57–63. URL: http://ceur-ws.org/Vol-2860/paper7.pdf.
[10] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.
[11] K. Kolluru, V. Adlakha, S. Aggarwal, Mausam, S. Chakrabarti, OpenIE6: Iterative grid labeling and coordination analysis for open information extraction, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 2020, pp. 3748–3761.
[12] N. Misra, Do children really need to learn to code?, https://www.nytimes.com/2021/01/02/opinion/teaching-coding-schools-india.html, NY Times (April 2021).
[13] S. Imbler, In the oceans, the volume is rising as never before, https://www.nytimes.com/2021/02/04/science/ocean-marine-noise-pollution.html, NY Times (April 2021).
[14] A. Newitz, What new science techniques tell us about ancient women warriors, https://www.nytimes.com/2021/01/01/opinion/women-hunter-leader.html, NY Times (April 2021).
[15] S. Vashishth, P. Jain, P. Talukdar, CESI: Canonicalizing open knowledge bases using embeddings and side information, in: Proceedings of the 2018 World Wide Web Conference, WWW '18, 2018, pp. 1317–1327. doi:10.1145/3178876.3186030.