
Open Information Extraction in Digital Libraries: Current Challenges and Open Research Questions

Hermann Kroll

Judy Al-Chaar

Wolf-Tilo Balke

Institute for Information Systems, TU Braunschweig, Mühlenpfordtstr. 23, 38106 Braunschweig, Germany

2021

2860 27 30

A central challenge for digital libraries is to provide effective access paths to ever-growing collections of mostly textual, i.e., unstructured information. The traditional, yet expensive way to manage, categorize, and annotate such collections is extensive manual metadata curation to semantically enrich library items. The ability to convert textual information automatically into a structured representation would be extremely beneficial, allowing for novel access paths as well as supporting semantically meaningful discovery. This paper investigates opportunities and challenges that the latest techniques for open information extraction offer for digital libraries. Open information extraction promises to work out-of-the-box and does not require domain-specific training data. To evaluate how well such tools perform, we conduct a qualitative evaluation in two domains: general news and biomedicine. Our research shows current benefits, but also reveals serious challenges for practical applications. In particular, three research questions still have to be addressed to reliably use open information extraction in digital library projects.

Keywords: Digital Libraries, Open Information Extraction, Performance Measurement, Metadata Quality

1. Introduction

Digital libraries want to offer structured access to information and knowledge over constantly growing collections. And indeed, there is a growing amount of structured databases, knowledge graphs, or linked open data sources available for retrieval in some domains. Moreover, offering such structured information is also vital for several downstream applications, such as supporting complex graph queries in DBpedia [1], or enabling literature-based discovery methods to infer new knowledge [2, 3, 4, 5]. Yet, the majority of knowledge in digital library collections today is still hidden in textual form, and effective methods to harvest structured knowledge from books, journal articles, conference proceedings, etc. are rare. What are the main reasons?

It usually boils down to the cost vs. quality trade-off: Today's intelligent learning techniques enable domain experts to design reliable entity linking and relation extraction for harvesting pre-designed relations between entities from texts, see e.g., [6]. However, these systems to a large degree rely on supervised learning and thus need large-scale training data that cannot be readily transferred across domains. That means experts have to give tens of thousands of examples to train an extraction system for a single relation. In brief, although supervised methods for entity recognition/linking and relation extraction have been shown to be up to the job with reasonable quality, their practical application comes at a high cost, requiring huge amounts of training data [7]. Hence, even when limiting it down to specialized domains only, automatically structuring textual collections is still rarely performed in library practice.

In contrast to designing extraction systems for each domain, methods for unsupervised information extraction (OpenIE) promise to change the game. OpenIE aims to extract knowledge from texts without knowing the entity and relation domains a priori [6]. Thus, OpenIE can be understood as an unsupervised method that could be efficiently applied across different domains. Yet, although OpenIE tools claim to be ready-to-use and suggest a high extraction precision, they are still rarely used in digital library projects. Is it because they are not quite as "ready-to-use" as is commonly expected? In previous work, we have proposed a toolbox that utilizes OpenIE tools to harvest knowledge from texts [8]. The toolbox contains novel algorithms to clean OpenIE outputs, and we performed a quantitative evaluation on biomedical benchmarks. In contrast to our previous works [8, 9], here we analyze the performance of OpenIE on a qualitative level, i.e., what are the main challenges in OpenIE for digital libraries? We do this by performing an evaluation in two common yet very different domains to allow for some generalizability: news articles from the New York Times and scientific articles from PubMed. The contribution of this position paper is a discussion of the future challenges and open research questions of OpenIE in digital libraries. In particular, we formulate three open research questions, which need to be answered before OpenIE can be readily used throughout collections.

2. Investigating the Practical Performance of OpenIE

OpenIE was built as a versatile set of tools for extracting information from unstructured texts. The word "open" in OpenIE refers to the fact that OpenIE systems do not require pre-defined domains, relations, and named entities to extract new information. However, a system that perfectly transforms unstructured into structured information is nowhere to be found, and research is still ongoing [10, 11, 6]. Older OpenIE tools are based on simple machine learning and rule-based methods. In general, a rule-based system is a system that applies rules to store, sort, and manipulate data, e.g., the Stanford CoreNLP tool [10]. These systems use hand-crafted syntactic or semantic linguistic rules, e.g., based on part-of-speech (POS) tags and parsers, which usually causes errors to propagate and compound at each stage. Modern systems build on neural architectures to increase extraction quality [11], i.e., a neural system's task can be seen as a classification problem or a sequence tagging problem. The main idea of a neural OpenIE system is to learn arguments and relation tuples bootstrapped from a state-of-the-art OpenIE system. The most recent and best-performing neural OpenIE system is OpenIE6 (2020) [11]. We analyze both OpenIE tools in the following, namely Stanford CoreNLP and OpenIE6.

2.1. Evaluation Corpus

We have randomly selected articles from two different domains for our qualitative evaluation to allow for some generalizability of our findings. In particular, we investigate ten articles from the New York Times and 17 biomedical articles from PubMed. Topic-wise, the news articles are political, environmental, space & cosmos, and opinion articles. Various sentences were chosen from these articles based either on their structure or context. Regarding the structure of sentences, we feature three types of structures: simple, compound, and complex sentences. The purpose of this approach is to go from easy-to-understand sentences to more and more difficult ones. In addition, nested sentences and sentences that contain any type of negation, such as "not", are selected. We selected 20 sentences for each of these five categories in both corpora.

2.2. Extraction Quality

The evaluation assessed whether the extraction includes all essential and only reasonable information that should be extracted. We employed three referees to rate the information extracted by both OpenIE tools. For each extraction, they decided whether the sentence's original information is retained completely, only partially, or is erroneously extracted: Full means that the statement carries the main message of the sentence. Partial means that some essential information is missed in the extraction. Not means an erroneous extraction that does not yield correct or useful information. In a sentence with negation, extractions should always include the negation to retain the original information. We take the majority vote of the raters for reporting. Table 1 includes all sentence categories per tool and the number of sentences selected from a corpus. In addition, each category is manually evaluated by finding the percentage of extractions rated Full, Partial, or Not, i.e., the share that shows correct and reasonable tuples. From the 200 sentences, five representative sentences were selected for this paper to explain our five categories and to give the reader an intuition about the extraction results.

Simple. A simple sentence is a sentence that includes only one independent clause, i.e., subject, verb, and optionally an object. An example of this category is the following sentence: "The naysayers raised fair points." CoreNLP yields the following two extractions: (naysayers; raised; fair points) and (naysayers; raised; points). CoreNLP tends to extract multiple, sometimes redundant, tuples ("fair points" or "points" as the objects). For our evaluation, we have always selected the largest CoreNLP tuples, e.g., we have selected the tuple that contains "fair points". CoreNLP's extracted tuple contains all the essential information that should be extracted. As for OpenIE6, it extracted the following tuple: (The naysayers; raised; fair points). As for the evaluation of CoreNLP on simple sentences in the NYT corpus, 62% of its extractions consist of a complete statement, and 19% of the extractions were partially complete. Thus, 19% of the extractions showed an incomplete statement missing important parts of it. On the other hand, OpenIE6 showed very good results (100%) when run on simple sentences.

Compound. In contrast to simple sentences, a compound sentence consists of two independent clauses that are joined using a comma, semicolon, or any conjunction. For example, "India has about 10 million coronavirus cases now, and schools have been offering online instruction since March" [12]. Two extractions are expected from this sentence (one for each independent clause). In this case, however, CoreNLP's only extraction is (India; has now; about 10 million coronavirus cases.); the second clause of this sentence is not extracted at all. In contrast, OpenIE6 yields the following extractions: (India; has; about 10 million coronavirus cases now,) and (schools; have been offering; online instruction since March.). Together, these extractions cover the two independent clauses in the sentence. When the system was run on the complete set of compound sentences, 81% (NYT) and 76% (PubMed) of OpenIE6's extractions were complete, and 19% (NYT) and 14% (PubMed) were at least partially informative. In CoreNLP, however, only 24% (NYT) and 15% (PubMed) of the extractions were fully extracted.

Complex. A complex sentence is a sentence that consists of one independent clause and at least one dependent clause. A complex sentence might look like: "Relentless advertising campaigns are telling Indian parents that coding is critical because making children code will develop their cognitive skills" [12]. A good extraction, in this case, would be either one extraction that includes the entire sentence or multiple extractions, one for each dependent and independent clause. CoreNLP's most informative extractions for this sentence are: (Relentless advertising campaigns; are telling; Indian Parents), (coding; is; critical) and (making children code; will develop; their cognitive skills). Nevertheless, the extraction (Relentless advertising campaigns; are telling; Indian Parents) seems unclear and incomplete. As for OpenIE6, we have the following tuples: (Relentless advertising campaigns; are telling; Indian parents that coding is critical because making children code will develop their cognitive skills), (coding; is; critical because making children code will develop their cognitive skills) and finally (making children code; will develop; their cognitive skills). None of these extractions misses important information, but they are quite long. As for the complex sentences, most of CoreNLP's extractions miss important parts of the sentence. For the news corpus, only 15% of the extractions were fully extracted. On the other hand, 78% of OpenIE6's extractions were complete.

Nested. Next, a sentence was selected to test whether the provided tools are able to handle nested extractions, too. Consider the following sentence: "As a result, many marine species are impeccably adapted to detect and communicate with sound" [13]. As we can see here, this sentence consists of only one subject and one relation; however, the rest of the sentence can be divided into two arguments. Here, the nested information that species adapt to detect and adapt to communicate should ideally be retained. CoreNLP's only extraction was (many marine species; are impeccably adapted; to detect with sound) and OpenIE6 extracted the following tuples: (many marine species; are impeccably adapted; to communicate with sound) and (many marine species; are impeccably adapted; to detect with sound). Thus, CoreNLP misses the second phrase, whereas OpenIE6 keeps the complete information. CoreNLP extracts only the first component from a set of nested sentences in most cases, ignoring the conjunction and everything that came after it. Accordingly, only 4% (NYT) and 22% (PubMed) of CoreNLP's extractions were complete, whereas 80% (NYT) and 89% (PubMed) of OpenIE6's extractions were complete.

Negation. Finally, the last category comprises sentences containing any type of negation, such as "not", "no", "none" or "neither". This category was selected to analyze how each tool reacts to negations in a sentence. In this case, a sentence with the negation "not" was selected, e.g., "Recent studies show that man was not always the hunter" [14]. CoreNLP's extraction was (Recent studies; show; man), whereas OpenIE6's extractions were (Recent studies; show; that man was not always the hunter) and (man; was not; always the hunter). So, CoreNLP ignores the negation part completely, whereas OpenIE6 keeps the negation correctly. The tools were also tested on multiple sentences containing negations such as "not". In CoreNLP, some of these sentences did not have any extractions at all. If the sentence had any extractions, then the negated part was either entirely ignored and not extracted, or extracted but without the negation. On the contrary, most of the OpenIE6 extractions included the negation. Still, in negation, 17% (NYT) and 4% (PubMed) of OpenIE6's extractions were erroneously extracted. Nevertheless, compared to OpenIE6, CoreNLP showed a much higher percentage of erroneous extractions: 90% (NYT) and 62% (PubMed) of the extractions that CoreNLP yielded were incomplete or wrong.

2.3. Argument Complexity

Having a closer look at the extractions, it seems that CoreNLP tends to extract smaller arguments (subjects or objects) than OpenIE6. For example, CoreNLP yields the triple (making children code; will develop; their cognitive skills) whereas OpenIE6 extracts (coding; is; critical because making children code will develop their cognitive skills). The latter extraction may be hard to handle in a downstream application because the object contains a whole sentence fragment (obviously not structured). Ideally, it should be broken into smaller pieces. To understand how often arguments are complex, we asked our three experts to rate all extracted arguments again. They assessed whether an argument represents a single concept of interest or a complex concept. For example, a single concept might be a city, a person, an article, a drug, etc. A complex concept consists of multiple smaller concepts, e.g., a person doing something, a location plus date information, an action plus date information, etc. The results are reported in Tab. 2. For example, 98% of CoreNLP's extracted subjects on NYT are actually single concepts. For OpenIE6, 89% of the extracted subjects are single concepts. OpenIE6 extracts complex objects more often than CoreNLP: 68% vs. 20% (NYT) and 53% vs. 25% (PubMed).

3. Discussion

We analyzed CoreNLP and OpenIE6 on five sentence categories in two domains: The New York Times and PubMed. In addition, we put it into perspective with the main findings of our previous work [8].

Extraction Accuracy. First, OpenIE6 outperforms CoreNLP for every sentence category. This finding is not surprising because Kolluru et al. have proposed OpenIE6 as the best performing OpenIE system in 2020 [11]. They have evaluated OpenIE6 against ten different OpenIE tools on four established benchmarks. Their findings show that OpenIE6 achieves an F1-measure between 46.4% (CaRB 1-1) and 65.6% (OIE16-C). However, our previous evaluation reveals that CoreNLP is much faster, i.e., CoreNLP requires around 8.5 minutes to process 52k sentences, whereas OpenIE6 requires a modern GPU (Nvidia GTX 1080TI) and around one hour to process the same sentences [8]. Kolluru et al. have reported that OpenIE6 can process up to 31.7 sentences per second on a Tesla V100 GPU [11]. For comparison, an older system called RnnOIE can process up to 149.2 sentences per second but comes with a lower F1-measure between 39.5% (CaRB 1-1) and 56.0% (OIE16-C).

Open Research Question 1. What is the best trade-off between extraction runtime and accuracy?

Extraction Arguments. Our qualitative evaluation has revealed that OpenIE tools may extract complex arguments, i.e., arguments that involve multiple concepts. Handling complex arguments can be challenging when using OpenIE in a digital library project, e.g., complex arguments will not represent a precise entity for a knowledge graph. Thus, post-processing is necessary to filter arguments by some domain-specific rules or pre-known vocabularies. One example might be entity-based filters like in [8]. The core idea was to keep only domain-specific concepts in arguments that are found in pre-known entity vocabularies. In addition, complex concepts could also be handled by hand-crafted rules, e.g., store a date in an argument as additional information about the actual extraction.

Open Research Question 2. How should extracted arguments be handled? And, may post-processing here be helpful to handle, filter or repair complex arguments?

Not Canonicalized Outputs. OpenIE's extractions are not canonicalized, i.e., different subjects might refer to the same real-world concept (New York, NY, NYC, etc.). The same holds for relations: multiple verb phrases might represent the same relation, e.g., "is born on", "has birthdate". Vashishth et al. propose a tool called CESI to canonicalize Open Knowledge Bases (a collection of OpenIE extractions) [15]. Their goal is to identify and resolve synonymous subjects, relations, and objects that refer to the same real-world concept. They utilize side information like the Paraphrase database and entity linking information to embed the open knowledge base into a high-dimensional embedding space. Then, agglomerative clustering is used to find synonymous subjects, relations, and objects. But canonicalizing complex arguments might be especially challenging, i.e., how can a whole sentence fragment be canonicalized correctly? In addition, clustering results might be hard to interpret, e.g., which relation is hidden behind a set of verb phrases? We have thus proposed to integrate domain experts in the canonicalizing process, i.e., domain experts build a reliable relation vocabulary to canonicalize verb phrases [8].
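To make the vocabulary-based canonicalization idea more concrete, the following is a minimal sketch, assuming that domain experts provide a mapping from canonical relations to known verb phrases and, optionally, from canonical entities to known surface forms. The vocabularies, the relation names (`born_on`, `located_in`), and the function names are hypothetical and chosen for illustration only; this is not the implementation of the toolbox in [8] and not CESI's embedding-based clustering.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical expert-built vocabulary: canonical relation -> known verb phrases.
RELATION_VOCAB: Dict[str, List[str]] = {
    "born_on": ["is born on", "was born on", "has birthdate"],
    "located_in": ["is located in", "lies in"],
}

# Hypothetical alias table: canonical entity -> known surface forms.
ENTITY_ALIASES: Dict[str, List[str]] = {
    "New York": ["NY", "NYC"],
}


def build_lookup(vocab: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert canonical -> variants into a lowercase variant -> canonical lookup."""
    lookup: Dict[str, str] = {}
    for canonical, variants in vocab.items():
        lookup[canonical.lower()] = canonical
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


def canonicalize(
    triples: Iterable[Triple],
    relation_vocab: Dict[str, List[str]],
    entity_aliases: Dict[str, List[str]],
) -> List[Triple]:
    """Map verb phrases and entity aliases to canonical forms.

    Triples whose verb phrase is not in the vocabulary are dropped
    (an assumption of this sketch); duplicates are removed in order.
    """
    relation_lookup = build_lookup(relation_vocab)
    entity_lookup = build_lookup(entity_aliases)
    seen = set()
    result: List[Triple] = []
    for subj, rel, obj in triples:
        canonical_rel = relation_lookup.get(rel.strip().lower())
        if canonical_rel is None:
            continue  # unknown verb phrase
        canonical = (
            entity_lookup.get(subj.strip().lower(), subj.strip()),
            canonical_rel,
            entity_lookup.get(obj.strip().lower(), obj.strip()),
        )
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


if __name__ == "__main__":
    # Made-up extractions for illustration only.
    extractions = [
        ("Alice", "is born on", "1990-01-01"),
        ("Alice", "has birthdate", "1990-01-01"),
        ("The museum", "is located in", "NYC"),
    ]
    for triple in canonicalize(extractions, RELATION_VOCAB, ENTITY_ALIASES):
        print(triple)
```

An exact-match lookup like this only resolves phrases that the experts have listed. It does not address complex arguments such as whole sentence fragments, which is precisely the difficulty raised above.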

Open Research Question 3. How can OpenIE extractions be canonicalized to reliably resolve synonymous noun and verb phrases?

Conclusion. OpenIE offers a way to bring more structure into otherwise unstructured document collections. Our evaluation shows that in simple settings, modern OpenIE tools like OpenIE6 can indeed already extract information with good quality. However, as sentences become more complex, the resulting extractions usually lack important information or do not retain precise semantics.

Still, we believe that OpenIE tools are extremely valuable because their advantage of not requiring domain-specific training examples is necessary for scalability over large digital libraries and especially for more heterogeneous collections. Moreover, combined with methods for filtering unnecessary information or detecting important domain-specific concepts, their overall quality in a concrete application may be drastically increased. To this end, we have formulated three demanding research questions for future research. More research will be necessary to bridge the gap between unstructured and structured information while bypassing the need for supervision as much as possible.
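As an illustration of the filtering step mentioned above, the following minimal sketch shows one possible entity-based filter: arguments are reduced to the terms of a pre-known entity vocabulary that they mention, and extractions without a known concept on both sides are dropped. The function names, the whole-word matching strategy, and the example vocabulary are assumptions made for illustration; they are not taken from the toolbox described in [8].

```python
import re
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def find_entities(argument: str, vocabulary: Iterable[str]) -> List[str]:
    """Return the vocabulary terms that occur as whole words in the argument.

    Matching is case-insensitive; results follow the order of `vocabulary`.
    """
    found = []
    for term in vocabulary:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, argument, flags=re.IGNORECASE):
            found.append(term)
    return found


def entity_filter(triples: Iterable[Triple], vocabulary: Iterable[str]) -> List[Triple]:
    """Keep only extractions whose subject and object mention known entities.

    Each argument is replaced by the vocabulary term(s) found in it. If an
    argument mentions several terms, one candidate triple per combination
    is produced (a simplification of this sketch).
    """
    vocab = sorted(set(vocabulary))  # deterministic order
    filtered: List[Triple] = []
    for subj, rel, obj in triples:
        for s in find_entities(subj, vocab):
            for o in find_entities(obj, vocab):
                filtered.append((s, rel, o))
    return filtered


if __name__ == "__main__":
    # Hypothetical vocabulary; the triples are extractions quoted in Section 2.
    vocabulary = {"children", "coding", "cognitive skills"}
    extractions = [
        ("making children code", "will develop", "their cognitive skills"),
        ("Relentless advertising campaigns", "are telling", "Indian Parents"),
    ]
    for triple in entity_filter(extractions, vocabulary):
        print(triple)
```

Note that a complex argument mentioning several vocabulary terms yields several candidate triples, not all of which need to be meaningful. Such cases would still require additional rules or manual review.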