<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Appositions, and Adjectives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seyed Iman Mirrezaei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Martins</string-name>
          <email>bruno.g.martins@ist.ul.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel F. Cruz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADVIS Lab, Department of Computer Science, University of Illinois at Chicago</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Superior Técnico, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Discovering knowledge from textual sources and subsequently expanding the coverage of knowledge bases like DBPedia or Freebase currently requires either extensive manual work or carefully designed information extractors. Information extractors capture triples from textual sentences. Each triple consists of a subject, a predicate/property, and an object. Triples can be mediated via verbs, nouns, adjectives, and appositions. We propose TRIPLEX, an information extractor that complements previous efforts, concentrating on triples related to nouns, adjectives, and appositions. TRIPLEX automatically constructs templates expressing noun-mediated triples from a bootstrapping set. The bootstrapping set is constructed without manual intervention by creating templates that include syntactic, semantic, and lexical constraints. We report on an automatic evaluation method to examine the output of information extractors both with and without the TRIPLEX approach. Our experimental study indicates that TRIPLEX is a promising approach for extracting noun-mediated triples.</p>
      </abstract>
      <kwd-group>
        <kwd>Open information extraction</kwd>
        <kwd>relation extraction</kwd>
        <kwd>noun-mediated relation triples</kwd>
        <kwd>compound nouns</kwd>
        <kwd>appositions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Deriving useful knowledge from unstructured text is a challenging task. Nowadays,
knowledge needs to be extracted almost instantaneously and automatically from
continuous streams of information such as those generated by news agencies or published
by individuals on the social web to enrich properties related to people, places, and
organizations in existing large-scale knowledge bases, such as Freebase [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], DBPedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or
Google’s Knowledge Graph [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Subsequently, these values can be used by search
engines to provide answers for user queries (e.g., the resignation date of a given politician,
or the ownership after a company acquisition).
      </p>
      <p>Copyright © 2015 for this paper by its authors. Copying permitted for private and academic
purposes.</p>
      <p>
        For many natural language processing
(NLP) applications, including question answering, information retrieval, machine
translation, and information extraction, it is important to extract facts from text. For example,
a question answering system may need to find the location of the Microsoft Visitor Center
in the sentence "The video features the Microsoft Visitor Center, located in Redmond".
Open Information Extractors (OIE) aim to extract triples from text, with each triple
consisting of a subject, a predicate/property, and an object. These triples can be
expressed via verbs, nouns, adjectives, and appositions. Most OIE systems described in
the literature, such as TextRunner [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], WOE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], or ReVerb [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], focus on the
extraction of verb-mediated triples. Other OIE systems, such as OLLIE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ClauseIE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
Xavier and Lima’s system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or ReNoun [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], may also, or only, extract
noun-mediated triples from text. OLLIE was the first approach for simultaneously extracting
verb-mediated and noun-mediated triples, although it can only capture noun-mediated
triples that are expressed in verb-mediated formats. For example, OLLIE can extract the
triple &lt;Bill Gates; be co-founder of; Microsoft&gt; from "Microsoft
cofounder Bill Gates spoke at a conference" but cannot extract the triple &lt;Microsoft;
headquarter; Redmond&gt; from "Microsoft is an American corporation
headquartered in Redmond". ClauseIE extracts noun-mediated triples from appositions and
possessives based upon a predefined set of rules. ReNoun uses seeds (i.e., examples
gathered through manually crafted rules) and an ontology to learn patterns for extracting
noun-mediated triples.
      </p>
      <p>
        The OIE system that we have built is named TRIPLEX. It is designed specifically
to extract triples from noun phrases, adjectives, and appositions. Systems like OLLIE,
which can only extract triples corresponding to relations expressed through verb phrases,
can be assisted by TRIPLEX, which extracts triples from grammatical dependency
relations, involving noun phrases and modifiers that correspond to adjectives and
appositions. TRIPLEX recognizes patterns that express noun-mediated triples during its
automatic bootstrapping process. The bootstrapping process uses Wikipedia pages to find
sentences that express triples. Then, it constructs templates from dependency paths in
sentences from the bootstrapping set. The templates express how noun-mediated triples
occur in sentences and allow information to be extracted relating to different levels of
text analysis, from lexical (i.e., word tokens) and shallow syntactic features (i.e.,
part-of-speech tags), to features resulting from a deeper syntactic analysis (i.e., features
derived from dependency parsing). In addition, semantic constraints may be included in
some templates to obtain more precise extractions. Templates are then generalized to
broaden their coverage (i.e., those with similar constraints are merged together). Finally,
the templates can be used to extract triples from previously unseen text. We evaluated
TRIPLEX according to the automated framework suggested by Bronzi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
extending it to assess noun-mediated triples.
      </p>
      <p>The remainder of this paper is organized as follows: In Section 2, we briefly
summarize related work in the area of open-domain information extraction. The TRIPLEX
pipeline is presented in Section 3. Section 4 describes our experiments, ending with a
discussion on the obtained results. Finally, Section 5 concludes the paper, summarizing
the main aspects and presenting possible directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        OIE systems are used to extract triples from text and they can be classified into two
major groups. The first group includes systems that consider verb-mediated triples
(TextRunner [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], WOE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], ReVerb [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and OLLIE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). The second group includes
systems that consider noun-mediated triples (OLLIE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ClauseIE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Xavier and
Lima’s system [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and ReNoun [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]).
      </p>
      <p>In the first group, the earliest proposed OIE system was TextRunner. This system
first detects pairs of noun phrases; then it finds a sequence of words as a potential
relation (i.e., predicate) between each pair of noun phrases. In a similar way, WOE uses
a dependency parser to find the shortest dependency path between two noun phrases.
All of the approaches in the first group assume that the object occurs after the subject.</p>
      <p>
        OLLIE, which is a member of the second group, was the first approach to extract
both noun-mediated and verb-mediated triples. It uses high confidence triples extracted
by ReVerb as a bootstrapping set to learn patterns. These patterns, mostly based on
dependency parse trees, indicate different ways of expressing triples in textual sources.
It is important to note that OLLIE only extracts noun-mediated triples that can be
expressed via verb-mediated formats. Therefore, it only covers a limited group of
noun-mediated triples. In comparison, TRIPLEX only extracts triples from compound nouns,
adjectives, and appositions. ClauseIE uses knowledge about English grammar to detect
clauses based on the dependency parse trees of sentences [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Subsequently, triples are
generated depending on the type of those clauses. ClauseIE has predefined rules to
extract triples from dependency parse trees and it is able to generate both verb-mediated
triples from clauses and noun-mediated triples from possessives and appositions. In
contrast, TRIPLEX automatically learns rules that extract triples during its
bootstrapping process.
      </p>
      <p>
        Xavier and Lima use a boosting approach to expand the training set for information
extractors so as to cover an increased variety of noun-mediated triples [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. They find
verb interpretations for noun- and adjective-based phrases, and transform these into
verb-mediated triples to enrich the training set. Still, these verb interpretations can create long
and ambiguous sentences. This makes filtering unrelated interpretations an essential
step before adding the inferred verb interpretations to the training set of information
extractors. TRIPLEX does not depend on verb patterns to extract noun-mediated triples,
thereby making such filtering unnecessary.
      </p>
      <p>
        Closest to our work is ReNoun, a system that uses an ontology of noun attributes
and a manually crafted set of extraction rules to extract seeds [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The seeds are then
used to learn dependency parse patterns for extracting triples. In contrast, TRIPLEX
uses data from Wikipedia, specifically infobox properties, that is, DBPedia properties
extracted from infoboxes, and infobox values during its bootstrapping process, without
requiring manual intervention.
      </p>
    </sec>
    <sec>
      <title>3 TRIPLEX</title>
      <p>OIE systems extract triples from an input sentence according to the format &lt;subject;
relation; object&gt;. In these triples, a relation phrase (i.e., a predicate or
property) expresses a semantic relation between the subject and the object. The subject
and the object are noun phrases and the relation phrase is a textual fragment that
indicates a semantic relation between two noun phrases. The semantic relation can be
verb-mediated or noun-mediated. For example, an extractor may find the triples &lt;Kevin
Systrom; profession; cofounder&gt; and &lt;Kevin Systrom; appears on;
NBC News&gt; in the sentence "Instagram cofounder Kevin Systrom appears on NBC
News". The first triple is noun-mediated and the second one is verb-mediated.</p>
      <p>The TRIPLEX approach focuses on noun-mediated triples from noun phrases,
adjectives, and appositions. First, it finds sentences that express noun-mediated triples.
These sentences are detected by using a dependency parser to find grammatical
relations between nouns, adjectives, appositions, and conjunctions in sentences. Second, it
automatically extracts templates from the sentences. Finally, these templates are used
to extract noun-mediated triples from previously unseen text.</p>
      <p>The TRIPLEX pipeline uses the Stanford NLP toolkit<sup>3</sup> to parse sentences, extract
dependencies, label tokens with named entity (NE) and with part-of-speech (POS)
information, and perform co-reference resolution.</p>
      <p>
        The co-reference resolution module replaces pronouns and other co-referential
mentions in all sentences with the corresponding entity spans prior to
subsequent processing. The dependency parser discovers the syntactic structure of input
sentences. A dependency parse of a sentence is a directed graph whose vertices are words
and whose edges are syntactic relations between the words. Each dependency
corresponds to a binary grammatical relation between a governor and a dependent [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For
example, the dependency relations root&lt;ROOT,went&gt;, nsubj&lt;went,Obama&gt;, and
prep-to&lt;went,Denver&gt; can be found in the sentence "Obama went to Denver". In
root&lt;ROOT,went&gt;, ROOT is the governor and went is the dependent. The
part-of-speech tagger assigns a morpho-syntactic class to each word, such as noun, verb,
or adjective. The Named Entity Recognition (NER) model<sup>4</sup> labels sequences of words
according to the pre-defined categories: Person, Organization, Location, and Date.
      </p>
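      <p>As an illustration only, the typed dependencies quoted above can be held in a small data structure. The following Python sketch is not part of the TRIPLEX pipeline; the names Dependency, PARSE, dependents_of, and governor_of are hypothetical, and the parse is written out by hand rather than produced by a parser.</p>
      <preformat>
```python
from collections import namedtuple

# One typed dependency: relation(governor, dependent), e.g. nsubj(went, Obama).
Dependency = namedtuple("Dependency", ["relation", "governor", "dependent"])

# The three dependencies quoted in the text for "Obama went to Denver",
# written by hand for illustration (no parser is called here).
PARSE = [
    Dependency("root", "ROOT", "went"),
    Dependency("nsubj", "went", "Obama"),
    Dependency("prep-to", "went", "Denver"),
]


def dependents_of(parse, governor):
    """Return (relation, dependent) pairs whose governor is the given word."""
    return [(d.relation, d.dependent) for d in parse if d.governor == governor]


def governor_of(parse, dependent):
    """Return the governor of a word, or None if the word has no governor."""
    for d in parse:
        if d.dependent == dependent:
            return d.governor
    return None
```
      </preformat>
      <p>Under this representation, the vertices of the dependency graph are the words and each Dependency record is one labeled, directed edge from a governor to a dependent.</p>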
      <p>The other components of the pipeline are a noun phrase chunker, which
complements the POS, NER, and dependency parsing modules from the Stanford NLP toolkit,
WordNet synsets, and Wikipedia synsets. The noun phrase chunker extracts noun phrases
from sentences. WordNet is a lexical database that categorizes English words into sets of
synonyms called synsets. WordNet synsets are used to recognize entities of a sentence
according to the pre-defined categories, complementing the Stanford NER system.
Several synsets are also built for each Wikipedia page. A Wikipedia page can be mentioned
in different ways (e.g., through redirects and alternative names) and also through the hypertext anchors
that point to it. For example, in the Wikipedia page for the University of
Illinois at Chicago, the word UIC is extensively used to refer to the university. Synsets
of Wikipedia pages are constructed automatically by using redirection page links,
backward links, and hypertext anchors. These links are retrieved using the JWPL library.<sup>5</sup>
TRIPLEX uses infobox properties and infobox values of Wikipedia during its
bootstrapping process. We use an English Wikipedia dump<sup>6</sup> to extract all Wikipedia pages and
query Freebase and DBPedia according to the Wikipedia page ID, to determine the type
of the page. Wikipedia pages are categorized under the following types: Person,
Organization, or Location. Additionally, we perform co-reference resolution on extracted
Wikipedia pages to identify words that refer to the same Wikipedia page subject. We
then use these words to enrich synsets of the respective Wikipedia page. We now
describe the TRIPLEX approach for extracting templates, starting with the generation of
the bootstrapping set of sentences.</p>
      <p><sup>3</sup> http://nlp.stanford.edu/software/corenlp.shtml</p>
      <p><sup>4</sup> http://nlp.stanford.edu/software/CRF-NER.shtml</p>
      <p><sup>5</sup> https://code.google.com/p/jwpl/</p>
      <p><sup>6</sup> http://dumps.wikimedia.org/backup-index.html</p>
      <sec id="sec-2-1">
        <title>3.1 Bootstrapping Set Creation</title>
        <p>
          Following ideas from OLLIE that leverage bootstrapping [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], our first goal is to
construct automatically a bootstrapping set that expresses multiple ways in which
information in noun phrases, adjectives, and appositions is encapsulated. The bootstrapping set
is created by processing extracted Wikipedia pages and their corresponding infoboxes.
        </p>
        <p>Extracted Wikipedia pages without infobox templates are ignored during sentence
extraction, while the other pages are converted into sets of sentences. We then
perform preprocessing on sentences from extracted Wikipedia pages and use custom
templates (i.e., regular expressions) to identify infobox values in the text. We also convert
dates to strings. For instance, the infobox value 1961|8|4 is translated to August
4, 1961. We begin template extraction by processing 3,061,956 sentences in extracted
Wikipedia pages that are matched with infobox values.</p>
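        <p>The date conversion step can be sketched as follows. This is a minimal illustration that assumes a pipe-separated year|month|day layout for infobox dates; the function name infobox_date_to_text and the MONTHS table are hypothetical and do not come from the TRIPLEX implementation.</p>
        <preformat>
```python
from datetime import date

# English month names, spelled out so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def infobox_date_to_text(value, separator="|"):
    """Convert a 'year|month|day' infobox value into 'Month day, year'.

    The separator and field order are assumptions made for this sketch.
    """
    year, month, day = (int(part) for part in value.split(separator))
    parsed = date(year, month, day)  # raises ValueError for impossible dates
    return "{} {}, {}".format(MONTHS[parsed.month - 1], parsed.day, parsed.year)
```
        </preformat>
        <p>Once a date is rendered in this textual form, it can be matched against sentences with ordinary string comparison.</p>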
        <p>
          The sentence extractor automatically constructs a bootstrapping set by matching
infobox values of the extracted Wikipedia pages with sentences in the corresponding
Wikipedia pages. The sentence extractor searches for sentences that are matched with
infobox values in a given Wikipedia page. If in a sentence there exists a dependency
path between the current infobox value and the synset of the page name, and if this
dependency path only contains nouns, adjectives, and appositions, then the sentence
is extracted. For instance, given the page for Barack Obama, the extractor matches the
infobox value August 4, 1961 with the sentence "Barack Hussein Obama II (born August
4, 1961)". This process is repeated for all infobox values of a Wikipedia page. In order
to match complete names with abbreviations such as UIC, the extractor uses a set of
heuristics that was originally proposed in WOE [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], named full match, synonym set
match, and partial match. Full match is performed when the page name is found within
a sentence of the page. Synonym set match occurs when one member of the synonym
set for the page name is discovered within a sentence. Partial match is performed when a
prefix or suffix of a member of the synonym set is used in a sentence. Finally, a template
is created by marking an infobox value and a synset member in the dependency path
of a selected sentence. We apply a constraint on the length of the dependency path
between a synset member and an infobox value to reduce bootstrapping errors. This
constraint sets the maximum length of the dependency path to 6, which was determined
experimentally by checking the quality of our bootstrapping set. To check its quality, we
randomly selected 100 sentences for manual examination and found that approximately
90% of the extracted sentences satisfied the dependency path length.
        </p>
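        <p>The three matching heuristics and the path length constraint can be sketched in Python as shown next. This is an illustrative reading of the description above, not the authors' code: matching is done on whitespace tokens, a partial match is taken to mean a word-level prefix or suffix, and the path length is counted in edges of the undirected dependency graph. All function names are hypothetical.</p>
        <preformat>
```python
from collections import deque

MAX_PATH_LENGTH = 6  # maximum dependency path length reported in the text


def contains_phrase(tokens, phrase):
    """True if the words of phrase occur contiguously in tokens."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def match_type(sentence, page_name, synonyms):
    """Return 'full', 'synonym', 'partial', or None for a sentence."""
    tokens = sentence.replace(",", " ").replace("(", " ").replace(")", " ").split()
    if contains_phrase(tokens, page_name):
        return "full"
    if any(contains_phrase(tokens, s) for s in synonyms):
        return "synonym"
    # Partial match: a word-level prefix or suffix of a synonym set member.
    for member in [page_name] + list(synonyms):
        words = member.split()
        for k in range(1, len(words)):
            prefix, suffix = " ".join(words[:k]), " ".join(words[k:])
            if contains_phrase(tokens, prefix) or contains_phrase(tokens, suffix):
                return "partial"
    return None


def path_length(edges, start, goal):
    """Number of edges on the shortest undirected path, or None if unreachable."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def within_limit(edges, start, goal, limit=MAX_PATH_LENGTH):
    """True if a path exists and it has at most limit edges."""
    length = path_length(edges, start, goal)
    return length is not None and limit >= length
```
        </preformat>
        <p>A breadth-first search is used here simply because it returns the shortest path in an unweighted graph; the text does not state how TRIPLEX computes the path.</p>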
        <p>After creating the bootstrapping set, the next step is to automatically create
templates from dependency paths that express noun-mediated triples. Templates describe
how noun-mediated triples can occur in textual sentences. Each template results from
a dependency path between a synset member (a subject) and an infobox value (an
object). We annotate these paths with POS tags, named entities, and WordNet synsets.
In the template, we add the name of the infobox to each infobox value. In addition, a
template includes a template type, based on the types of the Wikipedia page where the
sentence occurred. The types of dependencies between synset members and infobox
values are also attached to the template. If there is a copular verb or a verbal
modifier in the dependency path, we add it as a lexical constraint to the template.
For example, headquartered is a verbal modifier added as a lexical constraint to the
corresponding template for the sentence "Microsoft is an American corporation
headquartered in Redmond" (see Figure 1). Born is another lexical constraint for templates
related to nationality, as in the sentence "The Italian-born Antonio Verrio was frequently
commissioned". We merge templates if the only differences among them relate to
lexical constraints. We keep one template and a list of lexical constraints for the merged
templates. Finally, we process all templates and remove redundant ones.</p>
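        <p>The merging step can be illustrated with a short Python sketch. The Template fields below are a simplified, hypothetical rendering of the constraints listed above (infobox name, page type, dependency types, POS tags, and one lexical constraint); the field names and the example values are assumptions, not the internal format used by TRIPLEX.</p>
        <preformat>
```python
from collections import OrderedDict, namedtuple

# Simplified template: everything except `lexical` must agree for a merge.
Template = namedtuple(
    "Template", ["infobox_name", "page_type", "dep_types", "pos_tags", "lexical"]
)


def merge_templates(templates):
    """Merge templates that differ only in their lexical constraint.

    Returns one template per group, whose `lexical` field is the tuple of all
    distinct lexical constraints seen in that group, in first-seen order.
    """
    merged = OrderedDict()
    for t in templates:
        key = (t.infobox_name, t.page_type, t.dep_types, t.pos_tags)
        words = merged.setdefault(key, [])
        if t.lexical is not None and t.lexical not in words:
            words.append(t.lexical)
    return [Template(*key, lexical=tuple(words)) for key, words in merged.items()]
```
        </preformat>
        <p>Because exact duplicates map to the same key and the same lexical word, this grouping also removes redundant templates as a side effect.</p>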
        <p>Infobox values may occur before or after synset members of the page name in
sentences. If there exists a dependency path between these values, the related template is
extracted regardless of their position. For example, the infobox value occurs before
the synset member in the sentence "Instagram co-founder Kevin Systrom announced
a hiring spree". In this example, co-founder is the infobox value and Kevin Systrom is
the synset member of the Wikipedia page. The infobox value may also occur after the
synset member, as shown in the sentence "Microsoft is an American corporation
headquartered in Redmond". In this case, corporation is the synset member and Redmond is
the infobox value (see Figure 1). We also considered some additional heuristics in the
template generation process. When there is a conjunction dependency between nouns
in a sentence, if one of the nouns has a specific template, the template will be expanded
to include all the noun relations joined by conjunctions in the sentence.</p>
        <p>The noun phrase chunker is finally used to search dependency paths and merge
words that are part of the same noun phrase chunk. However, we do not apply the
noun phrase chunker if a synset member and the infobox value occur in the same chunk.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Template Matching</title>
        <p>
          This section describes how we use the dependency paths of a sentence together with the
extracted templates to detect noun-mediated triples. First, named entities and WordNet
synsets are used to recognize the candidate subjects of a sentence together with their
types. Then, dependency paths between candidate subjects and all potential objects are
identified and annotated by the NLP pipeline. Finally, candidate infobox names (which
are properties in DBPedia) are assigned to a candidate subject and a candidate
object, derived from matching templates with subject types, dependency types, WordNet
synsets, POS tags, and named entity annotations. If there are lexical constraints in a
template, the words in the dependency path between a subject and an object must be
matched with one of the phrases in the lexical constraint list. We also consider the
semantic similarity between the words and the members of the lexical constraint list, using
Jiang and Conrath’s approach to calculate the semantic similarity between words [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
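        <p>The lexical constraint check can be sketched as follows. The sketch accepts any word-to-word similarity function, so a WordNet-based measure such as Jiang and Conrath's could be plugged in; the toy similarity table, the threshold of 0.8, and all function names are assumptions made for illustration and are not taken from the paper.</p>
        <preformat>
```python
def satisfies_lexical_constraints(path_words, constraints, similarity=None,
                                  threshold=0.8):
    """Check the words on a dependency path against a constraint list.

    A template without lexical constraints accepts any path. Otherwise some
    word on the path must equal a constraint or, if a similarity function is
    supplied, be at least `threshold` similar to one (threshold is assumed).
    """
    if not constraints:
        return True
    for word in path_words:
        for constraint in constraints:
            if word == constraint:
                return True
            if similarity is not None and similarity(word, constraint) >= threshold:
                return True
    return False


# Toy similarity scores standing in for a WordNet-based measure.
TOY_SIMILARITY = {("based", "headquartered"): 0.9}


def toy_similarity(a, b):
    """Symmetric lookup in the toy table; unknown pairs score 0.0."""
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))
```
        </preformat>
        <p>Passing the similarity measure as a parameter keeps the exact-match behavior available when no lexical resource is loaded.</p>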
        <p>
          When there is a specific range (Person, Organization, Location, or Date) for an
infobox name (property) of a triple and when the object type of a triple is unknown,
a previously trained confidence function is used to decrease the confidence score of
the triple. A logistic regression classifier is used in this confidence function after it is
trained using 500 triples extracted from Wikipedia pages. Our confidence function is an
extension of the confidence function proposed for OLLIE [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and for ReVerb [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. A set
of features (i.e., frequency of the extraction template, existence of lexical words in
templates, range of properties, and semantic object type) is computed for each extracted
triple, and the classifier is used to predict the confidence score for the extracted triples.
        </p>
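        <p>To make the role of the confidence function concrete, the sketch below applies a logistic function to the four feature groups named above. The weights and bias are invented for illustration; in the paper they are learned by logistic regression from 500 triples, and the actual feature encoding is not specified.</p>
        <preformat>
```python
import math

# Hypothetical weights, for illustration only; the paper learns its weights
# with logistic regression and does not report them.
WEIGHTS = {
    "template_frequency": 1.5,   # relative frequency of the template, in [0, 1]
    "has_lexical_words": 0.8,    # 1.0 if the template has lexical constraints
    "property_has_range": -0.5,  # 1.0 if the property has a typed range
    "object_type_known": 1.4,    # 1.0 if NER or WordNet typed the object
}
BIAS = -1.0


def confidence(features):
    """Logistic function applied to a weighted sum of the feature values."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


typed = confidence({
    "template_frequency": 0.6, "has_lexical_words": 1.0,
    "property_has_range": 1.0, "object_type_known": 1.0,
})
untyped = confidence({
    "template_frequency": 0.6, "has_lexical_words": 1.0,
    "property_has_range": 1.0, "object_type_known": 0.0,
})
```
        </preformat>
        <p>With these assumed weights, the triple whose object type is unknown receives the lower score, which mirrors the penalty described above for properties with a specific range.</p>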
        <p>Finally, each candidate triple has an infobox name that is mapped to a DBPedia
property and the object type of a candidate triple should be matched with the range of a
property in DBPedia. When the range of a property is a literal, all possible values of the
property are retrieved from DBPedia and compared with the candidate object. If none of
these values matches the candidate object, the candidate triple is discarded.
      </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Evaluation</title>
      <p>
        We conducted a comprehensive set of experiments to compare the outputs of TRIPLEX,
OLLIE, and ReVerb based upon the approach suggested by Bronzi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These
authors introduced an approach to evaluate verb-mediated information extractors
automatically. We improve on their approach by expanding it to the evaluation of
noun-mediated triples. Additionally, we compare TRIPLEX, OLLIE, and ReVerb using a
manually constructed gold standard. Finally, we compare information extractors according
to the quality of their extracted triples.
      </p>
      <p>
        We first created a dataset by taking 1000 random sentences from Wikipedia that
have not been used during the bootstrapping process. Each sentence in the test dataset
has a corresponding Wikipedia page ID. All extracted facts gathered by information
extractors from these sentences needed to be verified. A fact is a triple &lt;subject;
predicate; object&gt; that indicates a relation between a subject and an object. A
fact is correct if its corresponding triple has been found in the Freebase or DBPedia
knowledge bases or if there is a significant association between the entities (subjects
and objects) and the predicate on the web [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In order to estimate the precision of an
information extractor [
        <xref ref-type="bibr" rid="ref4">4</xref>
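        ], we use the following formula:
        <disp-formula><label>(1)</label><tex-math>\mathrm{Precision} = \frac{|a| + |b|}{|S|}</tex-math></disp-formula>
      </p>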
      <p>
        In Equation 1, |b| is the number of extracted facts that are found in Freebase and DBPedia, |S|
is the total number of facts extracted by the system, and |a| is the number of
correct facts returned by the information extractor that have been validated by using
pointwise mutual information (PMI), as defined in Equation 2, from occurrences on the
web. Since values of properties in Freebase and DBPedia are not completely filled,
Bronzi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] suggest computing PMI to verify a fact. The likelihood of observing a
fact is computed based on its subject (subj), object (obj) and predicate (pred):
        <disp-formula><label>(2)</label><tex-math>\mathrm{PMI}(subj; pred; obj) = \frac{\mathrm{Count}(subj \wedge pred \wedge obj)}{\mathrm{Count}(subj \wedge obj)}</tex-math></disp-formula>
      </p>
      <p>
        When verifying the extracted facts, we use the corresponding Wikipedia ID of each
sentence to retrieve all possible properties and their values from Freebase or DBPedia.
These values will then be used to verify extracted facts from sentences. The semantic
similarity between the properties of those knowledge bases and the predicate of a fact
is calculated [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This similarity measure uses WordNet together with corpus statistics
to calculate semantic similarity between words. If the semantic similarity is above a
predetermined threshold and both entities are also matched together, the fact is deemed
correct [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The function Count(q) indicates the number of documents that contain the elements
of query q within a given proximity of each other. The proximity value is the maximum number of words
between the query elements, and it is at most 4 in our experiments. We
estimate Count(q) by using the Google search engine. The range of the PMI function is
between 0 and 1. The higher the PMI value, the more likely that the fact is correct. In
particular, a fact is deemed correct if its PMI value is above a given threshold. The value
for this threshold, determined experimentally, is 10<sup>-3</sup> in our experiments. We also use
the method in Equation 3 to estimate recall, as suggested by Bronzi et al.:
        <disp-formula><label>(3)</label><tex-math>\mathrm{Recall} = \frac{|a| + |b|}{|a| + |b| + |c| + |d|}</tex-math></disp-formula>
      </p>
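      <p>The three measures can be written as small Python functions. This is a sketch under stated assumptions: the document counts are passed in as plain numbers rather than obtained from a search engine, the precision is assumed to have the form (|a| + |b|) / |S| that the definitions of |a|, |b|, and |S| suggest, and the function names are hypothetical.</p>
      <preformat>
```python
PMI_THRESHOLD = 1e-3  # threshold value reported for the experiments


def pmi(count_subj_pred_obj, count_subj_obj):
    """Ratio of documents with subject, predicate, and object together to
    documents with subject and object together (0.0 if the pair never occurs)."""
    if count_subj_obj == 0:
        return 0.0
    return count_subj_pred_obj / count_subj_obj


def is_correct_by_pmi(count_subj_pred_obj, count_subj_obj,
                      threshold=PMI_THRESHOLD):
    """A fact is deemed correct when its PMI value is above the threshold."""
    return pmi(count_subj_pred_obj, count_subj_obj) > threshold


def precision(a, b, total_extracted):
    """(|a| + |b|) / |S|; this form of the precision is an assumption."""
    return (a + b) / total_extracted


def recall(a, b, c, d):
    """(|a| + |b|) / (|a| + |b| + |c| + |d|)."""
    return (a + b) / (a + b + c + d)
```
      </preformat>
      <p>For instance, a fact whose three elements co-occur in 5 documents, while its subject and object co-occur in 1000, has a PMI of 0.005 and is accepted under the threshold above.</p>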
      <p>The parameters |a| and |b| are computed as described for Equation 1. We now
describe how we estimate |c| and |d|. First, all correct facts within sentences should
be identified. Each fact contains two entities and a relation. All possible entities of a
sentence are detected by the Stanford named entity recognizer and WordNet synsets.
Furthermore, we use the Stanford CoreNLP toolkit to detect all verbs (predicates) in a
sentence. Finally, we expand the set of predicates from sentences by adding DBPedia
and Freebase properties.</p>
      <p>We use three sets S, P, and O to create all the possible facts, which are respectively
the set of recognized subjects, predicates, and objects in the sentences. All possible
facts are produced by the Cartesian product of these three sets, G = S × P × O.
Then, |c| in Equation 3 is estimated by subtracting |b| from the intersection of Freebase
and DBPedia with G. D denotes the set of all the facts in Freebase and in DBPedia.
Then, G \ D is computed by applying PMI to the facts that are not in D. Finally, |d| is
determined by subtracting |a| from G \ D.</p>
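      <p>The estimation of the two remaining recall terms can be sketched with Python sets. The sketch reads the description above as follows: the first term is the number of candidate facts found in the knowledge bases minus the facts already counted as extracted, and the second is the number of remaining candidates validated by PMI minus the validated extracted facts. The function names and the example data are hypothetical, and the PMI check is passed in as a callable.</p>
      <preformat>
```python
from itertools import product


def candidate_facts(subjects, predicates, objects):
    """All (subject, predicate, object) combinations: the set G."""
    return set(product(subjects, predicates, objects))


def estimate_c_and_d(candidates, kb_facts, pmi_validated, a, b):
    """Estimate the two missing-fact terms of the recall formula.

    candidates    -- the set G of candidate facts
    kb_facts      -- the set D of facts in the knowledge bases
    pmi_validated -- callable returning True if PMI accepts a fact
    a, b          -- counts of extracted facts validated by PMI and by D
    """
    in_kb = candidates.intersection(kb_facts)      # candidates found in D
    outside_kb = candidates.difference(kb_facts)   # candidates not in D
    web_correct = {fact for fact in outside_kb if pmi_validated(fact)}
    return len(in_kb) - b, len(web_correct) - a
```
      </preformat>
      <p>Because G grows with the product of the three set sizes, this enumeration is only practical at the sentence level, where each set is small.</p>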
      <p>
        We further select 50 sentences from the dataset of 1000 sentences, and a human
judge (H) extracts all of the correct facts. Then, we use the method suggested by Bronzi
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to compute the agreement between the automatic evaluation and manual
evaluation. The agreement between the automatic evaluation and human evaluation is found
to be 0.71. Now, we are able to determine the precision and recall for information
extractors, based upon the automatic and manual evaluations.
      </p>
      <p>We ran OLLIE, ReVerb, and TRIPLEX both individually and in combination and
then computed precision and recall. Table 1 shows results for the automatic and manual
evaluation of information extractors.</p>
      <p>ReVerb only generates verb-mediated triples and OLLIE extracts verb-mediated
triples and also noun-mediated triples, if they are expressed in verb-mediated styles.
TRIPLEX generates noun-mediated triples and it can complement the results of OLLIE
and ReVerb. OLLIE, ReVerb, and TRIPLEX all assign a confidence score to each
extracted triple. In these experiments, the extracted triples are only considered if their
confidence scores are above a threshold of 0.2. In Table 1, TRIPLEX shows an improvement
under manual evaluation in comparison with the automatic evaluation because extracted facts with very
low PMI are considered false in the automatic evaluation. However, these facts were
often evaluated as true positives by the human judge. We analyzed the errors made by
TRIPLEX in the gold standard dataset that was manually annotated. TRIPLEX’s errors
can be classified into two groups: errors in precision (false positives) and errors in recall
(false negatives). In the gold standard, 65% of the triples are verb-mediated
triples, which are not considered by TRIPLEX.</p>
      <p>Table 2 shows triples in the gold standard that are not extracted by TRIPLEX. Of
those, 10% obtained low confidence scores (false negatives) because the NER module
and WordNet could not find the semantic type for the objects. We penalize the
confidence score of a candidate triple if its predicate requires a particular type for
its object and no type is detected for the triple’s object. For example, the range of the
nationality property in DBPedia is constrained to Location, but neither the NER module nor
WordNet can recognize such a type in the phrases Swedish writer or Polish-American scientist.
Also, 12% of the errors are related to the dependency parser, which could not
detect the correct grammatical relation between the words in a sentence. Another 7% of
the errors occurred when the coreference resolution module did not properly resolve
coreferential expressions during template extraction. This problem is alleviated by assigning low
confidence scores to this group of templates. Finally, 6% of the errors were caused by
overgeneralized templates. During template generalization, POS tags were substituted by
universal POS tags. Since some templates only extract triples for proper nouns, nouns, or
personal pronouns, generalizing and merging these templates did not produce
correct triples.</p>
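      <p>The confidence penalty described above can be illustrated with a small sketch. Only the pairing of the nationality property with the Location range comes from the text; the penalty factor, the table PROPERTY_RANGE, and the function toy_type_detector, which stands in for the NER module and the WordNet lookup, are hypothetical choices made for illustration.</p>
      <preformat>
```python
from typing import Callable, Dict, Optional

# Hypothetical excerpt of property ranges; "nationality" -> "Location" is the
# example given in the text.
PROPERTY_RANGE: Dict[str, str] = {"nationality": "Location"}

PENALTY_FACTOR = 0.5  # assumed value, chosen only for illustration


def penalized_confidence(
    confidence: float,
    predicate: str,
    obj: str,
    detect_type: Callable[[str], Optional[str]],
) -> float:
    """Lower the confidence when the predicate expects a typed object
    but no type at all is detected for the object."""
    expected = PROPERTY_RANGE.get(predicate)
    if expected is not None and detect_type(obj) is None:
        return confidence * PENALTY_FACTOR
    return confidence


def toy_type_detector(phrase: str) -> Optional[str]:
    """Stand-in for the NER module and WordNet lookup (toy gazetteer)."""
    gazetteer = {"Sweden": "Location", "Poland": "Location"}
    return gazetteer.get(phrase)


# "Swedish" has no detectable type, so the score is halved.
penalized = penalized_confidence(0.3, "nationality", "Swedish", toy_type_detector)
# "Sweden" is typed, so the score is left unchanged.
unpenalized = penalized_confidence(0.3, "nationality", "Sweden", toy_type_detector)
```
      </preformat>
      <p>Under these assumed values, a candidate with an initial confidence of 0.3 drops to 0.15 when its object is the untyped adjective Swedish, which places it below the 0.2 threshold and turns a correct triple into a false negative.</p>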
      <p>We also found that 20% of the false positives are in fact correct triples. The subjects of
these triples are not popular on the web, so few documents mention them.
Applying the same PMI threshold as that used for prominent
subjects therefore proved ineffective. For example, the triples that TRIPLEX extracts from the
following sentence are judged as incorrect: Alexey Arkhipovich Leonov (born 30 May 1934 in Listvyanka,
Kemerovo Oblast, Soviet Union) is a retired Soviet/Russian cosmonaut. These triples
include information about birth date, birth place, origin, and profession, but they are not
available in the gold standard. Many other false positive errors were due to dependency
parser problems, named entity problems, chunker problems, and overgeneralized
templates.</p>
      <p>OIE systems such as ReVerb and OLLIE usually fail to extract triples from
compound nouns, adjectives, conjunctions, reduced clauses, parenthetical phrases, and
appositions. TRIPLEX covers only noun-mediated triples in sentences, and we can
examine TRIPLEX’s output with respect to the gold standard, as shown in Table 3. The
table shows that 12% of the triples are noun-mediated triples related to conjunctions, adjectives,
and noun phrases, meaning that TRIPLEX is also able to extract noun-mediated triples
from noun conjunctions. For example, TRIPLEX extracts triples about Rye Barcott’s
professions from the sentences Rye Barcott is author of It Happened on the Way to War.
He is a former U.S. Marine and cofounder of Carolina for Kibera. Moreover, TRIPLEX
is able to extract triples from appositions and parenthetical phrases, which account for 9% of
the extracted triples. For example, the triples extracted from the following
sentence indicate Michelle Obama’s two professions, birth date, and nationality:
Michelle LaVaughn Robinson Obama (born January 17, 1964), an American lawyer
and writer, is the wife of the current president of the United States. We see that 6%
of triples are related to titles or professions, such as Sir Herbert Lethington Maitland,
Film director Ingmar Bergman, and Microsoft co-founder Bill Gates. OLLIE is
similarly able to capture this kind of triple when it is expressed in a verb-mediated
style, whereas TRIPLEX does so without relying on a verb-mediated format. The final
fraction of 8% is for noun-mediated triples that rely on the lexicon of noun-mediated
templates. For example, the headquarters of Microsoft is extracted from the sentence,
Microsoft is an American multinational corporation headquartered in Redmond,
Washington. Finally, 65% of the extracted triples are verb-mediated triples.</p>
      <p>Both ReVerb and
OLLIE generate verb-mediated triples from sentences. The majority of the errors produced
by OLLIE and ReVerb are due to incorrectly identified subjects or objects. ReVerb
first locates verbs in a sentence and then looks for noun phrases to the left and right
of the verbs. ReVerb’s heuristics sometimes fail to find the correct subjects and objects
because of compound nouns, appositions, reduced clauses, or conjunctions. OLLIE relies
on triples extracted by ReVerb for its bootstrapping process and for learning patterns.
Therefore, OLLIE’s patterns are limited to verb-mediated triples. Although OLLIE
also produces noun-mediated triples when they can be expressed via verb-mediated
formats, its coverage of noun-mediated triples is limited.</p>
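      <p>To illustrate why a verb-centered heuristic can pick the wrong subject, the following deliberately simplified sketch locates the first verb in a POS-tagged sentence and takes the nearest run of nominal tokens on each side. It is not ReVerb's actual implementation; the coarse tags, the helper names, and both example sentences are assumptions made for illustration.</p>
      <preformat>
```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]  # (token, coarse POS tag)
NOMINAL = {"NOUN", "PROPN"}


def nearest_noun_phrase(tokens: List[Tagged], start: int, step: int) -> str:
    """Collect the closest contiguous run of nominal tokens, scanning from
    position start in direction step (+1 or -1)."""
    valid = range(len(tokens))
    i = start
    # Skip non-nominal tokens until the first nominal one.
    while i in valid and tokens[i][1] not in NOMINAL:
        i += step
    run = []
    while i in valid and tokens[i][1] in NOMINAL:
        run.append(tokens[i][0])
        i += step
    if step == -1:
        run.reverse()  # restore left-to-right order
    return " ".join(run)


def verb_centered_triple(tokens: List[Tagged]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, verb, object) around the first verb that has a
    nominal run on both sides, or None if there is no such verb."""
    for idx, (word, tag) in enumerate(tokens):
        if tag == "VERB":
            subject = nearest_noun_phrase(tokens, idx - 1, -1)
            obj = nearest_noun_phrase(tokens, idx + 1, 1)
            if subject and obj:
                return (subject, word, obj)
    return None


simple = [("Antonio", "PROPN"), ("Verrio", "PROPN"), ("introduced", "VERB"),
          ("mural", "NOUN"), ("painting", "NOUN")]
apposition = [("Bill", "PROPN"), ("Gates", "PROPN"), (",", "PUNCT"),
              ("the", "DET"), ("co-founder", "NOUN"), ("of", "ADP"),
              ("Microsoft", "PROPN"), (",", "PUNCT"), ("wrote", "VERB"),
              ("a", "DET"), ("book", "NOUN")]

print(verb_centered_triple(simple))
print(verb_centered_triple(apposition))
```
      </preformat>
      <p>For the simple sentence the sketch returns the expected subject and object, but for the sentence with an apposition it selects Microsoft as the subject of wrote, because the apposition sits between the true subject and the verb.</p>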
      <p>We also analyzed some sentences to determine why the different information extractors
were not able to produce all of the triples in the gold standard. The first reason is that
there may not be sufficient information in a sentence to extract triples. For example,
TRIPLEX can find the triple &lt;Antonio; nationality; Italian&gt; but it would
not find the triple &lt;Antonio; nationality; England&gt; in the sentence The
Italian-born Antonio Verrio was responsible for introducing Baroque mural painting
into England. Second, OLLIE and ReVerb cannot successfully extract verb-mediated
triples from sentences that contain compound nouns, appositions, parentheses,
conjunctions, or reduced clauses. When OLLIE and ReVerb cannot yield verb-mediated
triples, recall is affected because verb-mediated triples are outside the scope of
TRIPLEX. Improvements to OLLIE and ReVerb could therefore lead to substantially better
precision and recall. Finally, improvements to the different NLP components can also lead
to better precision and recall for information extractors that rely heavily on them.
</p>
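      <p>As a rough worked bound, assume that TRIPLEX produces no verb-mediated triples and that 65% of the gold standard triples are verb-mediated, as reported above. Writing G for the gold standard, G_verb for its verb-mediated part, and E for the set of triples extracted by TRIPLEX (symbols introduced here only for illustration), the recall of TRIPLEX used alone is bounded by the noun-mediated share of the gold standard:</p>
      <disp-formula>
        <tex-math><![CDATA[R_{\mathrm{TRIPLEX}} = \frac{|E \cap G|}{|G|} \le \frac{|G \setminus G_{\mathrm{verb}}|}{|G|} = 1 - 0.65 = 0.35]]></tex-math>
      </disp-formula>
      <p>Under these assumptions, any recall beyond 0.35 must come from the verb-mediated extractors, which is why their errors limit the recall of the combined system.</p>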
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>This paper presented TRIPLEX, an information extractor to generate triples from noun
phrases, adjectives, and appositions. First, a bootstrapping set is automatically
constructed from infoboxes in Wikipedia pages. Then, templates with semantic, syntactic,
and lexical constraints are constructed automatically to capture triples. Our experiments
found that TRIPLEX complements the output of verb-mediated information extractors
by capturing more noun-mediated triples. The extracted triples can, for instance, be used
to populate Wikipedia pages with missing infobox attribute values or to assist authors
in the task of annotating Wikipedia pages. We also extended an automated method for
evaluating triples so that it covers noun-mediated triples.</p>
      <p>
        In future work, we plan to improve results for triples involving numerical values
with different units (e.g., square meters, meters) when generating extraction templates. We
would also like to enrich the construction of the bootstrapping set by using a probabilistic
knowledge base (e.g., Probase [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), as it may broaden the coverage of the bootstrapping set
and support the construction of more templates.
      </p>
      <sec id="sec-4-1">
        <title>Acknowledgments</title>
        <p>We would like to thank Matteo Palmonari for useful discussions. Cruz and Mirrezaei
were partially supported by NSF Awards CCF-1331800, IIS-1213013, and IIS-1143926.
Cruz was also supported by a Great Cities Institute scholarship. Martins was supported
by the Portuguese FCT through the project grants EXCL/EEI-ESS/0257/2012
(DataStorm Research Line of Excellence) and PEst-OE/EEI/LA0021/2013 (INESC-ID’s
Associate Laboratory multi-annual funding).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          . pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Broadhead</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Open Information Extraction for the Web</article-title>
          .
          <source>In: International Joint Conferences on Artificial Intelligence (IJCAI)</source>
          . pp.
          <fpage>2670</fpage>
          -
          <lpage>2676</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bollacker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paritosh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sturge</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge</article-title>
          .
          <source>In: ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          . ACM (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bronzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mesquita</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbosa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merialdo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic Evaluation of Relation Extraction Systems on Large-scale</article-title>
          .
          <source>In: Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction</source>
          . pp.
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>De Marneffe</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          : Stanford Typed Dependencies Manual (
          <year>2008</year>
          ), http://nlp.stanford.edu/software/dependencies_manual.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Del Corro</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gemulla</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>ClausIE: Clause-based Open Information Extraction</article-title>
          .
          <source>In: International World Wide Web Conference (WWW)</source>
          . pp.
          <fpage>355</fpage>
          -
          <lpage>366</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fader</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Identifying Relations for Open Information Extraction</article-title>
          .
          <source>In: Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>1535</fpage>
          -
          <lpage>1545</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrath</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy</article-title>
          .
          <source>In: International Conference on Research in Computational Linguistics</source>
          . pp.
          <fpage>19</fpage>
          -
          <lpage>33</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mausam</surname>
          </string-name>
          ,
          <string-name>
            <surname>Schmitz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bart</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Open Language Learning for Information Extraction</article-title>
          .
          <source>In: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source>
          . pp.
          <fpage>523</fpage>
          -
          <lpage>534</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Introducing the Knowledge Graph: Things, Not Strings</article-title>
          . Official Google Blog, May (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Open Information Extraction Using Wikipedia</article-title>
          .
          <source>In: Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>118</fpage>
          -
          <lpage>127</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          :
          <article-title>Probase: A Probabilistic Taxonomy for Text Understanding</article-title>
          .
          <source>In: ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <fpage>481</fpage>
          -
          <lpage>492</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Xavier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lima</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Boosting Open Information Extraction with Noun-Based Relations</article-title>
          .
          <source>In: International Conference on Language Resources and Evaluation</source>
          . pp.
          <fpage>96</fpage>
          -
          <lpage>100</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Yahya</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whang</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>ReNoun: Fact Extraction for Nominal Attributes</article-title>
          .
          <source>In: International Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>325</fpage>
          -
          <lpage>335</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>