<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of [13] S. Imbler</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/122860.122889</article-id>
      <title-group>
        <article-title>Open Information Extraction in Digital Libraries: Current Challenges and Open Research Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hermann Kroll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Judy Al-Chaar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolf-Tilo Balke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Information Systems, TU Braunschweig</institution>
          ,
          <addr-line>Mühlenpfordtstr. 23, 38106, Braunschweig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2860</volume>
      <fpage>27</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>A central challenge for digital libraries is to provide effective access paths to ever-growing collections of mostly textual, i.e., unstructured information. The traditional, yet expensive way to manage, categorize, and annotate such collections is extensive manual metadata curation to semantically enrich library items. The ability to convert textual information automatically into a structured representation would be extremely beneficial, allowing for novel access paths as well as supporting semantically meaningful discovery. This paper investigates opportunities and challenges that the latest techniques for open information extraction offer for digital libraries. Open information extraction promises to work out-of-the-box and does not require domain-specific training data. To assess how well such tools perform, we conduct a qualitative evaluation in two domains: general news and biomedicine. Our research shows current benefits, but also reveals serious challenges for practical applications. In particular, three research questions still have to be addressed to reliably use open information extraction in digital library projects.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital Libraries</kwd>
        <kwd>Open Information Extraction</kwd>
        <kwd>Performance Measurement</kwd>
        <kwd>Metadata Quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Digital libraries want to offer structured access to information and knowledge over constantly growing collections. And indeed, there is a growing amount of structured databases, knowledge graphs, or linked open data sources available for retrieval in some domains. Moreover, offering such structured information is also vital for several downstream applications, such as supporting complex graph queries in DBpedia [1], or enabling literature-based discovery methods to infer new knowledge [2, 3, 4, 5]. Yet, the majority of knowledge in digital library collections today is still hidden in textual form, and effective methods to harvest structured knowledge from books, journal articles, conference proceedings, etc. are rare. What are the main reasons?</p>
      <p>It usually boils down to the cost vs. quality trade-off: Today’s intelligent learning techniques enable domain experts to design reliable entity linking and relation extraction for harvesting pre-designed relations between entities from texts, see e.g., [6]. However, these systems to a large degree rely on supervised learning and thus need large-scale training data that cannot be readily transferred across domains. That means experts have to give tens of thousands of examples to train an extraction system for a single relation. In brief, although supervised methods for entity recognition/linking and relation extraction have been shown to be up to the job with reasonable quality, their practical application comes at a high cost, requiring huge amounts of training data [7]. Hence, even when restricted to specialized domains, automatically structuring textual collections is still rarely performed in library practice.</p>
      <p>In contrast to designing extraction systems for each domain, methods for unsupervised information extraction (OpenIE) promise to change the game. OpenIE aims to extract knowledge from texts without knowing the entity and relation domains a priori [6]. Thus, OpenIE can be understood as an unsupervised method that could be efficiently applied across different domains. Yet, although OpenIE tools claim to be ready-to-use and suggest a high extraction precision, they are still rarely used in digital library projects. Is it because they are not quite as "ready-to-use" as is commonly expected? In previous work, we have proposed a toolbox that utilizes OpenIE tools to harvest knowledge from texts [8]. The toolbox contains novel algorithms to clean OpenIE outputs, and we performed a quantitative evaluation on biomedical benchmarks. In contrast to our previous works [8, 9], here we analyze the performance of OpenIE on a qualitative level, i.e., what are the main challenges in OpenIE for digital libraries? We do this by performing an evaluation in two common yet very different domains to allow for some generalizability: news articles from the New York Times and scientific articles from PubMed. The contribution of this position paper is a discussion of the future challenges and open research questions of OpenIE in digital libraries. In particular, we formulate three open research questions, which need to be answered before OpenIE can be readily used throughout collections.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Investigating the Practical Performance of OpenIE</title>
      <p>OpenIE was built as a versatile set of tools for extracting information from unstructured texts. The word open in OpenIE refers to the fact that OpenIE systems do not require pre-defined domains, relations, and named entities to extract new information. However, a system that perfectly transforms unstructured into structured information is nowhere to be found, and research is still ongoing [10, 11, 6]. Older OpenIE tools are based on simple machine learning and rule-based methods. In general, a rule-based system is a system that applies rules to store, sort, and manipulate data, e.g., the Stanford CoreNLP tool [10]. These systems use hand-crafted syntactic or semantic linguistic rules together with components such as POS tagging and parsers, which usually cause errors that propagate and compound at each stage. Modern systems build on neural architectures to increase extraction quality [11], i.e., a neural system’s task can be seen as a classification problem or a sequence tagging problem. The main idea of a neural OpenIE system is to learn arguments and relation tuples bootstrapped from a state-of-the-art OpenIE system. The most recent and best-performing neural OpenIE system is OpenIE6 (2020) [11]. We analyze both OpenIE tools in the following, namely Stanford CoreNLP and OpenIE6.</p>
      <p>2.1. Evaluation Corpus</p>
      <p>We have randomly selected articles from two different domains for our qualitative evaluation to allow for some generalizability of our findings. In particular, we investigate ten articles from the New York Times and 17 biomedical articles from PubMed. Topic-wise, the news articles are political, environmental, space &amp; cosmos, and opinion articles. Various sentences were chosen from these articles based either on their structure or context. Regarding the structure of sentences, we feature five types of structures: simple, compound, and complex sentences. The purpose of this approach is to go from easy-to-understand sentences to more and more difficult ones. In addition, nested sentences and sentences that contain any type of negation, such as not, are selected. We selected 20 sentences for each category in both corpora.</p>
      <p>2.2. Extraction Quality</p>
      <p>The evaluation assessed whether the extraction includes all essential and only reasonable information that should be extracted. We employed three referees to rate the information extracted by both OpenIE tools. For each extraction, they decided whether the sentence’s original information is retained completely, only partially, or is erroneously extracted: Full means that the statement carries the main message of the sentence. Partial means that some essential information is missed in the extraction. Not means an erroneous extraction that does not yield correct or useful information.</p>
      <p>In a sentence with negation, extractions should always include the negation to retain the original information. We take the majority vote of the raters for reporting. Table 1 includes all sentence categories per tool and the number of sentences selected from a corpus. In addition, each category is manually evaluated by determining the percentage of extractions that fully, partially, or not at all yield correct and reasonable tuples. From the 200 sentences, five representative sentences were selected for this paper to explain our five categories and to give the reader an intuition about the extraction results.</p>
      <p>Simple. A simple sentence is a sentence that includes only one independent clause, i.e., subject, verb, and optionally an object. An example of this category is the following sentence: The naysayers raised fair points. CoreNLP yields the following two extractions: (naysayers; raised; fair points) and (naysayers; raised; points). CoreNLP tends to extract multiple, sometimes redundant, tuples (fair points or points as the objects). For our evaluation, we have always selected the largest CoreNLP tuples, e.g., we have selected the tuple that contains fair points. CoreNLP’s extracted tuple contains all the essential information that should be extracted. As for OpenIE6, it extracted the following tuple: (The naysayers; raised; fair points). For simple sentences in the NYT corpus, 62% of CoreNLP’s extractions consist of a complete statement, and 19% of the extractions were partially complete. Thus, 19% of the extractions showed an incomplete statement missing important parts of it. On the other hand, OpenIE6 showed very good results (100%) when run on simple sentences.</p>
      <p>Compound. In contrast to simple sentences, a compound sentence consists of two independent clauses that are joined using a comma, semicolon, or any conjunction. For example, India has about 10 million coronavirus cases now, and schools have been offering online instruction since March [12]. Two extractions are expected from this sentence (one for each independent clause). In this case, however, CoreNLP’s only extraction is (India; has now; about 10 million coronavirus cases.); the second clause of this sentence is not extracted at all. In contrast, OpenIE6 yields the following extractions: (India; has; about 10 million coronavirus cases now,) and (schools; have been offering; online instruction since March.). Together, these extractions cover the two independent clauses in the sentence. When the system was run on the complete set of compound sentences, 81% (NYT) and 76% (PubMed) of OpenIE6’s extractions were complete, and 19% (NYT) and 14% (PubMed) were at least partially informative. In CoreNLP, however, only 24% (NYT) and 15% (PubMed) of the extractions were fully extracted.</p>
      <p>Complex. A complex sentence is a sentence that consists of one independent clause and at least one dependent clause. A complex sentence might look like: Relentless advertising campaigns are telling Indian parents that coding is critical because making children code will develop their cognitive skills [12]. A good result, in this case, would be either a single extraction covering the entire sentence or multiple extractions, one for each dependent and independent clause. CoreNLP’s most informative extractions for this sentence are: (Relentless advertising campaigns; are telling; Indian Parents), (coding; is; critical) and (making children code; will develop; their cognitive skills). Nevertheless, the extraction (Relentless advertising campaigns; are telling; Indian Parents) seems unclear and incomplete. As for OpenIE6, we have the following tuples: (Relentless advertising campaigns; are telling; Indian parents that coding is critical because making children code will develop their cognitive skills), (coding; is; critical because making children code will develop their cognitive skills) and finally (making children code; will develop; their cognitive skills). None of these extractions misses important information, but they are quite long. As for the complex sentences, most of CoreNLP’s extractions miss important parts of the sentence. For the news corpus, only 15% of the extractions were fully extracted. On the other hand, 78% of OpenIE6’s extractions were complete.</p>
      <p>Nested. Next, a sentence was selected to test whether the provided tools are able to handle nested extractions, too. Consider the following sentence: As a result, many marine species are impeccably adapted to detect and communicate with sound [13]. As we can see here, this sentence consists of only one subject and one relation; however, the rest of the sentence can be divided into two arguments. Here, the nested information that species adapt to detect and adapt to communicate should ideally be retained. CoreNLP’s only extraction was (many marine species; are impeccably adapted; to detect with sound) and OpenIE6 extracted the following tuples: (many marine species; are impeccably adapted; to communicate with sound) and (many marine species; are impeccably adapted; to detect with sound). Thus, CoreNLP misses the second phrase, whereas OpenIE6 keeps the complete information. In most cases, CoreNLP extracts only the first component from a set of nested sentences, ignoring the conjunction and everything that comes after it. As a result, only 4% (NYT) and 22% (PubMed) of CoreNLP’s extractions were complete, whereas 80% (NYT) and 89% (PubMed) of OpenIE6’s extractions were complete.</p>
      <p>Negation. Finally, we selected sentences containing any type of negation, such as not, no, none or neither. This category was selected to analyze how each tool reacts to negations in a sentence. In this case, a sentence with the negation not was selected, e.g., Recent studies show that man was not always the hunter [14]. CoreNLP’s extraction was (Recent studies; show; man), whereas OpenIE6’s extractions were (Recent studies; show; that man was not always the hunter) and (man; was not; always the hunter). So, CoreNLP ignores the negation part completely, whereas OpenIE6 keeps the negation correctly. The tools were also tested on multiple sentences containing negations such as not. In CoreNLP, some of these sentences did not have any extractions at all. If the sentence had any extractions, then the negative part was either entirely ignored and not extracted or extracted but without the negation. On the contrary, most of the OpenIE6 extractions included the negation. Still, in negation, 17% (NYT) and 4% (PubMed) of OpenIE6’s extractions were erroneously extracted. Nevertheless, compared to OpenIE6, CoreNLP showed a much higher percentage of erroneous extractions: 90% (NYT) and 62% (PubMed) of CoreNLP’s extractions were incomplete or wrong.
      </p>
      <p>2.3. Argument Complexity</p>
      <p>Having a closer look at the extractions, it seems that CoreNLP tends to extract smaller arguments (subjects or objects) than OpenIE6. For example, CoreNLP yields the triple (making children code; will develop; their cognitive skills) whereas OpenIE6 extracts (coding; is; critical because making children code will develop their cognitive skills). The last extraction may be hard to handle in a downstream application because the object contains a whole sentence fragment (obviously not structured). Ideally, it should be broken into smaller pieces. To understand how often arguments are complex, we asked our three experts to rate all extracted arguments again. They assessed whether an argument represents a single concept of interest or a complex concept. For example, a single concept might be a city, a person, an article, a drug, etc. A complex concept consists of multiple smaller concepts, e.g., a person doing something, a location plus date information, an action plus date information, etc. The results are reported in Tab. 2. For example, 98% of CoreNLP’s extracted subjects on NYT are actually single concepts. For OpenIE6, 89% of the extracted subjects are single concepts. OpenIE6 extracts complex objects more often than CoreNLP: 68% vs. 20% (NYT) and 53% vs. 25% (PubMed).</p>
    </sec>
    <sec id="sec-4">
      <title>3. Discussion</title>
      <p>We analyzed CoreNLP and OpenIE6 on five sentence categories in two domains: The New York Times and PubMed. In addition, we put our results into perspective with the main findings of our previous work [8].</p>
      <p>Extraction Accuracy. First, OpenIE6 outperforms CoreNLP for every sentence category. This finding is not surprising because Kolluru et al. have proposed OpenIE6 as the best performing OpenIE system in 2020 [11]. They have evaluated OpenIE6 against ten different OpenIE tools on four established benchmarks. Their findings show that OpenIE6 achieves an F1-measure between 46.4% (CaRB 1-1) and 65.6% (OIE16-C). However, our previous evaluation reveals that CoreNLP is much faster, i.e., CoreNLP requires around 8.5 minutes to process 52k sentences, whereas OpenIE6 requires a modern GPU (Nvidia GTX 1080 Ti) and around one hour to process the same sentences [8]. Kolluru et al. have reported that OpenIE6 can process up to 31.7 sentences per second on a Tesla V100 GPU [11]. For comparison, an older system called RnnOIE can process up to 149.2 sentences per second but comes with a lower F1-measure between 39.5% (CaRB 1-1) and 56.0% (OIE16-C).</p>
      <p>Open Research Question 1. What is the best trade-off between extraction runtime and accuracy?</p>
      <p>Extraction Arguments. Our qualitative evaluation has revealed that OpenIE tools may extract complex arguments, i.e., arguments that involve multiple concepts. Handling complex arguments can be challenging when using OpenIE in a digital library project, e.g., complex arguments will not represent a precise entity for a knowledge graph. Thus, post-processing is necessary to filter arguments by some domain-specific rules or pre-known vocabularies. One example might be entity-based filters like in [8]. The core idea was to keep only domain-specific concepts in arguments that are found in pre-known entity vocabularies. In addition, complex concepts could also be handled by hand-crafted rules, e.g., store a date in an argument as additional information about the actual extraction.</p>
      <sec id="sec-4-1">
        <title>Open Research Question 2</title>
        <p>How should extracted arguments be handled? And might post-processing help to handle, filter, or repair complex arguments?</p>
        <p>Not Canonicalized Outputs. OpenIE’s extractions are not canonicalized, i.e., different subjects might refer to the same real-world concept (New York, NY, NYC, etc.). The same holds for relations: multiple verb phrases might represent the same relation, e.g., is born on, has birthdate. Vashishth et al. propose a tool called CESI to canonicalize Open Knowledge Bases (a collection of OpenIE extractions) [15]. Their goal is to identify and resolve synonymous subjects, relations, and objects that refer to the same real-world concept. They utilize side information like the Paraphrase database and entity linking information to embed the open knowledge base into a high-dimensional embedding space. Then, agglomerative clustering is used to find synonymous subjects, relations, and objects. However, canonicalizing complex arguments might be especially challenging, i.e., how can a whole sentence fragment be canonicalized correctly? In addition, clustering results might be hard to interpret, e.g., which relation is hidden behind a set of verb phrases? We have thus proposed to integrate domain experts in the canonicalizing process, i.e., domain experts build a reliable relation vocabulary to canonicalize verb phrases [8].</p>
      </sec>
      <sec id="sec-4-2">
        <title>Open Research Question 3</title>
        <p>How can OpenIE extractions be canonicalized to reliably resolve synonymous noun and verb phrases?</p>
        <p>Conclusion. OpenIE offers a way to bring more structure into otherwise unstructured document collections. Our evaluation shows that in simple settings, modern OpenIE tools like OpenIE6 can indeed already extract information with good quality. However, as sentences become more complex, the resulting extractions usually lack important information or do not retain precise semantics.</p>
        <p>Still, we believe that OpenIE tools are extremely valuable because their advantage of not requiring domain-specific training examples is necessary for scalability over large digital libraries and especially for more heterogeneous collections. Moreover, combined with methods for filtering unnecessary information or detecting important domain-specific concepts, their overall quality in a concrete application may be drastically increased. To this end, we have formulated three demanding research questions for future research. More research will be necessary to bridge the gap between unstructured and structured information while bypassing the need for supervision as much as possible.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>