<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Open Information Extraction using Domain Knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cheikh Kacfah Emani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catarina Ferreira Da Silva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Fies</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parisa Ghodous</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CSTB</institution>
          ,
          <addr-line>290 route des Lucioles, BP 209, 06904 Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universite Lyon 1, LIRIS</institution>
          ,
          <addr-line>CNRS, UMR5205, F-69622</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Open Information Extraction (OIE) aims to identify all the possible assertions within a sentence. The most recent, and thus most efficient, OIE tools use the grammatical dependencies or the syntactic tree of the sentence to perform extraction. When they produce a wrong extraction, it is mainly due to parsing errors. In this paper, we propose to handle these parsing errors before performing OIE itself. To achieve this goal, we focus on multi-word expressions (MWE), which account for more than 45% of wrong extractions. We show how the MWE problem can be handled in a given domain and how the unbreakable property of MWE serves as a good filter for OIE.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, researchers have tackled the problem of Open Information
Extraction (OIE) in different ways: from machine learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to the exploitation of
sentence structure [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The latter family of approaches obtains the best
results. Unfortunately, the corresponding OIE tools (exploiting grammatical dependencies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
or the syntactic tree [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) sometimes output incorrect tuples. These wrong
extractions are mainly due to parsing errors, since these approaches rely
on the syntactic tree or the grammatical dependencies provided by a parser.
Consequently, a good way to improve Open Information Extraction is to handle
parsing errors before the extraction stage itself. To achieve this goal, we have
decided to handle multi-word expressions (MWE). A MWE is a phrase, made
up of a set of words, which has a precise meaning and is unbreakable.
"MWE errors" represent more than 45% of parsing errors. We propose an algorithm
to shorten multi-word expressions. We have evaluated our proposals in a given
domain, namely legal texts on building engineering and construction, and show how
we outperform existing tools. Indeed, in a given domain, multi-word expressions
are easy to identify: domain terms, recurrent domain-independent terms, named
entities, etc. The paper is organised as follows. First, a brief
state of the art on OIE is presented (Sect. 2). Next, we detail our contribution
(Sect. 3). Finally, our algorithms are tested on a set of sentences drawn from legal texts in
the field of engineering construction (Sect. 4.1) and we discuss our early results
(Sect. 4.2).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>As mentioned in the introduction, we want to perform "Open" Information
Extraction, but on sentences which describe a precise field. This constraint
gives us information about the terminology, as described in the next
sections. Nevertheless, our ideas can also be used in an "open" manner (MWE will then be
mainly named entities, formulae, etc.) and thus be compared to "traditional"
OIE systems.</p>
      <p>
        In recent years, many systems have been developed to perform OIE. This is
the case of ReVerb [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], OLLIE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ClausIE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and CSD-IE [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. ReVerb, by
means of efficient heuristics, focused on incoherent and uninformative triples.
Unfortunately, relations extracted by ReVerb were necessarily verb-based. This
is the main reason why OLLIE, developed by the same group of researchers, was
proposed. In addition to being able to identify facts that are not verb-driven, OLLIE aims
to provide the context or condition, if any, in which the extracted fact can be
considered true. These two tools are machine learning-based. The most
recent approaches do not need any additional resource: they only exploit the output
of a standard parser. ClausIE uses typed grammatical dependencies and CSD-IE
the syntactic tree of the input sentence. These two tools dissect each piece of the
result they get from the parsing tool. Consequently, if a dependency is wrong
or a sub-tree is incorrectly labelled in the syntactic tree, these OIE tools may
produce inaccurate extractions. This is why we propose to perform some
preprocessing operations before OIE itself. The details of these tasks are given in the next
section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Handle Multi-Word Expressions</title>
      <p>Researchers commonly agree that parsing errors are a major cause of incorrect
extractions in Open Information Extraction. In a sample set of sentences selected from
various regulatory texts in the field of building engineering (see Sect. 4.1 for
more details about the corpus), the percentage of errors due to MWE is 46.15%
using CSD-IE. Handling the problems caused by MWE is thus a relevant way to
improve the results of information extraction. Our solution to improve the quality of OIE tools when they
face the MWE problem is a three-step operation: (i) detect each MWE, (ii) compress
each MWE and (iii) expand each MWE at the end of the extraction step. This
process is illustrated by Fig. 1.</p>
      <sec id="sec-3-1">
        <title>Step 1 - Detection of Multi-Word Expressions</title>
        <p>For us, a MWE is any phrase whose meaning is modified (or even becomes
meaningless) by the addition or the deletion of any of its words. Consequently, a
domain term, an idiomatic expression, a phrasal verb, a named entity, a formula,
a quotation, etc. is a MWE. These examples suggest that MWE
are more easily and reliably identifiable in a given domain. There can also be
domain-independent terms that are not related to the field of study but are
frequently found in the corpus. This is the case of operators (for example: less than,
less than or equal to, as much as), idiomatic expressions (for example: "Lose your
head", "Jump in feet first"), units of measurement, etc. So, the set of MWE in
a precise domain can be made up of the terminology of the field and of frequent
terms. This last category of terms can be obtained by means of existing statistical
methods and the help of human experts. We thus have a list of possible MWE in our
corpus (see Sect. 4.1 for more details), and at this stage we identify the MWE
present in the original sentence.</p>
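        <p>As an illustration only, the lookup of known MWE in a sentence can be sketched as a longest-match search over the tokens. The function name <monospace>detect_mwes</monospace>, the whitespace tokenisation and the sample lexicon are assumptions made for this sketch; they are not part of the system evaluated in this paper.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence, Set, Tuple


def detect_mwes(tokens: Sequence[str],
                lexicon: Set[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that match a lexicon entry.

    Spans are half-open and non-overlapping; at each position the longest
    matching entry wins.
    """
    if not lexicon:
        return []
    max_len = max(len(entry) for entry in lexicon)
    lowered = [tok.lower() for tok in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        match_end = None
        # Longest candidate first, so "less than or equal to" beats "less than".
        for n in range(min(max_len, len(lowered) - i), 1, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end
    return spans


# Hypothetical lexicon: domain terms plus frequent domain-independent terms.
LEXICON = {
    ("means", "of", "access"),
    ("fire", "extinguisher"),
    ("less", "than"),
    ("less", "than", "or", "equal", "to"),
}

sentence = "A stair is a fixed means of access".split()
print(detect_mwes(sentence, LEXICON))
```
]]></preformat>
        <p>Only entries of at least two tokens are searched for, since a single word cannot be fragmented by the extractor.</p>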
      </sec>
      <sec id="sec-3-2">
        <title>Step 2 - Compression of a Multi-Word Expression</title>
        <p>
          The precision of OIE tools is affected by MWE because these tools
do not treat a MWE as atomic. Hence, to limit the potentially harmful
fragmentation of expressions, we propose to extract information from a new
version of each sentence in which every MWE has been replaced by a shortened
version. The question is then: how do we obtain this short version of a MWE?
In answering it, we must keep in mind that the shortened
sentences must remain semantically and syntactically correct in order to be
appropriately handled by OIE tools. We propose the following steps for shortening a
MWE (using its syntactic parse tree):
          <list list-type="order">
            <list-item><p>if the MWE is a clause (the list of labels for clauses is available in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) or a verb phrase, there is no shortening;</p></list-item>
            <list-item><p>else, if the MWE is a noun phrase, the first token labelled as a noun is considered
to be the shortened version of the MWE;</p></list-item>
            <list-item><p>else, we take the string provided by the smallest phrase<sup>1</sup> within the tree.</p></list-item>
          </list>
Let us note that some MWE are short enough to remain
the same after the shortening. Nevertheless, such a MWE (like any other MWE)
is considered to be atomic. This is important to keep in mind because, if an
OIE tool breaks a MWE, the resulting triple will be incorrect.
        </p>
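        <p>The three shortening rules can be sketched as follows. This is a minimal illustration under stated assumptions: parse trees are written by hand as nested <monospace>(label, children)</monospace> pairs with <monospace>(POS, word)</monospace> leaves, the clause labels are the Penn Treebank clause-level tags, and "smallest phrase" is read as the phrase node covering the fewest tokens (the first one in pre-order in case of a tie). The fallback used when a noun phrase contains no noun is also an assumption.</p>
        <preformat><![CDATA[
```python
# A tree is (label, children) for a phrase, or (POS, word) for a leaf.
CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}  # Penn Treebank clause level


def leaves(tree):
    """Return the (POS, word) leaves of a tree, left to right."""
    label, body = tree
    if isinstance(body, str):
        return [(label, body)]
    return [leaf for child in body for leaf in leaves(child)]


def phrases(tree):
    """Yield every phrase-level (non-leaf) node in pre-order."""
    _, body = tree
    if isinstance(body, str):
        return
    yield tree
    for child in body:
        yield from phrases(child)


def shorten(tree):
    """Apply the three shortening rules to the parse tree of one MWE."""
    label = tree[0]
    words = leaves(tree)
    original = " ".join(word for _, word in words)
    # Rule 1: clauses and verb phrases are left untouched.
    if label in CLAUSE_LABELS or label == "VP":
        return original
    # Rule 2: a noun phrase is replaced by its first noun token.
    if label == "NP":
        for pos, word in words:
            if pos.startswith("NN"):
                return word
        return original  # assumption: keep the MWE when no noun is found
    # Rule 3: otherwise keep the string of the smallest phrase in the tree.
    smallest = min(phrases(tree), key=lambda t: len(leaves(t)), default=tree)
    return " ".join(word for _, word in leaves(smallest))


means_of_access = ("NP", [
    ("NP", [("NNS", "means")]),
    ("PP", [("IN", "of"), ("NP", [("NN", "access")])]),
])
carry_out = ("VP", [("VB", "carry"), ("PRT", [("RP", "out")])])

print(shorten(means_of_access))
print(shorten(carry_out))
```
]]></preformat>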
        <p>After this stage, we perform OIE itself, which is the third step. This
extraction is carried out with existing OIE systems. Consequently, the following steps
come after OIE and take its results, i.e., triples, as input.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Step 4 - Filtering of Open Information Extraction Results</title>
        <p>Earlier in this work, we pointed out a set of factors which degrade the precision
of OIE tools, and we focused on the problems caused by multi-word
expressions. Now, we use the defining characteristic of a MWE to finalise our
OIE process: a MWE is unbreakable. Consequently, when a triple contains
only a fragment of a MWE, it is considered incorrect. This filtering is done
before the expansion stage, so the MWE are in their "shortened" form.</p>
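        <p>A minimal sketch of this filter is given below. It assumes that triples are tuples of strings and that a fragment can be detected at token level: a triple element is rejected when it shares at least one token with a MWE without containing the whole MWE as a contiguous sequence. This heuristic is an assumption of the sketch and may over-reject when a word of the MWE also occurs elsewhere in the sentence; the names <monospace>breaks_mwe</monospace> and <monospace>filter_triples</monospace> are hypothetical.</p>
        <preformat><![CDATA[
```python
def contains_seq(tokens, seq):
    """True if `seq` occurs as a contiguous sub-list of `tokens`."""
    n = len(seq)
    return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))


def breaks_mwe(triple, mwe):
    """True if some element of the triple holds only a fragment of `mwe`."""
    mwe_tokens = mwe.lower().split()
    for part in triple:
        part_tokens = part.lower().split()
        touches = any(tok in part_tokens for tok in mwe_tokens)
        if touches and not contains_seq(part_tokens, mwe_tokens):
            return True
    return False


def filter_triples(triples, mwes):
    """Keep only the triples that leave every MWE intact."""
    return [t for t in triples if not any(breaks_mwe(t, m) for m in mwes)]


# MWE in their (possibly unchanged) shortened form.
mwes = ["less than or equal to"]
triples = [
    ("the width", "is", "less than or equal to 2 m"),
    ("the width", "is less", "than 2 m"),
]
print(filter_triples(triples, mwes))
```
]]></preformat>
        <p>In this example the second triple is discarded because its predicate and its object each contain only part of the operator.</p>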
      </sec>
      <sec id="sec-3-4">
        <title>Step 5 - Expansion of a Multi-Word Expression</title>
        <p>Once OIE has been performed on the shortened version of the sentence, we
have to reconcile the extracted facts with the original (long) sentence. To do so, we go
through the list of extracted facts and replace the shortened version of each MWE by its
initial long form.</p>
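        <p>The expansion can be sketched as a whole-word substitution over each element of a triple. The mapping from short to long forms is assumed to have been recorded during the compression step; the function name <monospace>expand</monospace> and the single-pass regular expression are choices made for this illustration, not a description of the evaluated implementation.</p>
        <preformat><![CDATA[
```python
import re


def expand(triple, short_to_long):
    """Replace each shortened MWE in a triple by its original long form."""
    if not short_to_long:
        return tuple(triple)
    # Longest short forms first, so that overlapping keys are matched greedily.
    alternatives = sorted(short_to_long, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in alternatives) + r")\b"
    )
    # A single pass avoids re-expanding words that belong to a long form.
    return tuple(
        pattern.sub(lambda m: short_to_long[m.group(0)], part) for part in triple
    )


short_to_long = {"means": "means of access"}
print(expand(("A stair", "is", "a fixed means"), short_to_long))
```
]]></preformat>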
      </sec>
      <sec id="sec-3-5">
        <title>Evaluation</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Preliminary Evaluation and Discussion</title>
      <p>
        After gathering statistics on the factors which lead to incorrect information
extraction, we decided to tackle the multi-word expression problem. The
first step of the approach we propose is to identify MWE in the input sentence.
Performing such identification requires a list of possible MWE. That
is why we assume that the sentence describes the realities of a specific field
of interest: knowing that we are working in a specific domain implies
having a good idea of the terminology of this domain. For our evaluation, we
have taken the list of terms (labels of the concepts in the field) as the set of our
MWE. These terms have been obtained through a key term extraction process.
        <fn><label>1</label><p>An exhaustive list of labels for phrases is available in the Penn Treebank [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].</p></fn>
We have taken advantage of existing tools (Alchemy<sup>2</sup> in our case) to carry out
this extraction. For this preliminary evaluation, our corpus is made up of 50
random sentences from documents about fire safety [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], energy efficiency [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and
accessibility [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our list of MWE consists of the results provided by Alchemy, excluding
terms that contain a proper noun. Moreover, we have added units of measurement.
      </p>
      <p>
        To perform OIE itself after the preprocessing tasks, we have used ClausIE by
Del Corro and Gemulla [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In addition, we compare our results to CSD-IE by
Bast and Haussmann [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and to the "original" version of ClausIE. Results are
presented in Tab. 1.
      </p>
      <p>
        The performance of CSD-IE on this domain-specific corpus (58.26%) is lower than
on "open datasets" such as the Wikipedia dataset (70.0%) and the New York Times dataset
(71.5%) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The same remark applies to ClausIE. By handling the
MWE problem, however, we obtain 81.81% of correct extractions (18.19% of errors). We
still have a certain number of errors. Some of these errors are caused by our
handling of MWE, as discussed in the next section, and other errors come from
the OIE tools we use at the extraction step itself.
      </p>
      <sec id="sec-4-1">
        <title>Discussion</title>
        <p>We have seen that handling MWE, in a given domain, helps to improve OIE on
sentences of that domain. However, our method for handling MWE still has
to be improved: 30% of the errors remaining after the shortening of MWE
are due to that operation, for two reasons.
          <list list-type="bullet">
            <list-item><p>When we choose the first noun of a noun-phrase MWE to replace this MWE,
it is not always a suitable representative. Indeed, some nouns can sometimes
be tagged as verbs and thus as potential predicates (e.g. "fire" in the expression
fire extinguisher, "means" in the term means of access, etc.), which can cause
wrong extractions. Consequently, when a noun phrase contains more than one noun,
we need additional criteria to choose the representative.</p></list-item>
            <list-item><p>In some sentences, parsers correctly identify the prepositional modifiers of
all verbs, nouns, adverbs, etc. Consequently, the presence of MWE is a priori
not a problem for OIE systems. Unfortunately, the deletion of prepositions
(found for example in a noun-phrase MWE) during the shortening may lead
to parsing errors. Indeed, parsers will try to identify new relations which
may be wrong, leading to incorrect extractions as illustrated below:
              <list list-type="order">
                <list-item><p>Original sentence: "A stair is a fixed means of access."</p></list-item>
                <list-item><p>Shortened version: "A stair is a fixed means."</p></list-item>
                <list-item><p>OIE: CSD-IE → (A stair, means, is) and ClausIE → (A stair, a fixed means)</p></list-item>
              </list>
            </p></list-item>
          </list>
One possible alternative to shortening, which still sidesteps the
multi-word problem, is to replace a MWE by a synonym. Ideally, this
synonym should have fewer words (why not a single word?) than the original MWE.
Such synonyms could be found in Linked Open Data or in lexical databases like
WordNet, etc.
          <fn><label>2</label><p>http://www.alchemyapi.com/</p></fn>
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The goal of this work is to see how we can leverage domain knowledge (mainly
terminology) to improve Open Information Extraction. We have focused on
multi-word expressions, which cause more than 45% of errors in existing
OIE tools. We have thus decided to reduce their length by proposing a shortening
algorithm for multi-word terms. The first results of our approach are very promising.
To take as much advantage as possible of domain knowledge, we can
go further in the filtering of facts. For instance, simple domain and range constraints
could help in detecting wrong facts. Given the fact (building,
has width, door), we can state that it is incorrect by exploiting the knowledge that the
range of the "predicate" has width is xsd:float.</p>
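      <p>Such a range check could look like the sketch below. The table <monospace>RANGES</monospace>, the function name and the rule that an xsd:float value must start with a numeric token (so that "0.9 m" is accepted) are all assumptions made for illustration; in practice the constraints would come from a domain ontology.</p>
      <preformat><![CDATA[
```python
# Hypothetical range constraints, e.g. taken from a domain ontology.
RANGES = {"has width": "xsd:float"}


def satisfies_range(triple, ranges):
    """Return False when the object of a triple violates a known range."""
    _subject, predicate, obj = triple
    expected = ranges.get(predicate)
    if expected == "xsd:float":
        tokens = obj.split()
        if not tokens:
            return False
        try:
            float(tokens[0])  # leading numeric token, a unit may follow
        except ValueError:
            return False
    return True  # no constraint known, or constraint satisfied


print(satisfies_range(("building", "has width", "door"), RANGES))
print(satisfies_range(("door", "has width", "0.9 m"), RANGES))
```
]]></preformat>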
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Americans with Disabilities Act (ADA): 2010 ADA Standards for Accessible Design</article-title>
          (sep
          <year>2010</year>
          ), http://www.fire.tas.gov.au/userfiles/stuartp/file/ Publications/FireSafetyInBuildings.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bast</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haussmann</surname>
          </string-name>
          , E.:
          <article-title>Open information extraction via contextual sentence decomposition</article-title>
          .
          <source>In: Semantic Computing (ICSC)</source>
          ,
          <year>2013</year>
          IEEE Seventh International Conference on. pp.
          <volume>154</volume>
          –
          <fpage>159</fpage>
          . IEEE Computer Society (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bast</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haussmann</surname>
          </string-name>
          , E.:
          <article-title>More informative open information extraction via simple inference</article-title>
          .
          <source>In: Advances in Information Retrieval. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8416</volume>
          , pp.
          <volume>585</volume>
          –
          <fpage>590</fpage>
          . Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bies</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferguson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>MacIntyre</given-names>
            , R.,
            <surname>Tredinnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Marcinkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.A.</given-names>
            ,
            <surname>Schasberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Bracketing guidelines for treebank II Style Penn Treebank project</article-title>
          .
          <source>University of Pennsylvania</source>
          <volume>97</volume>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Building</given-names>
            <surname>Safety Unit Tasmania Fire Service</surname>
          </string-name>
          : Fire Safety in Buildings (aug
          <year>2002</year>
          ), http://www.fire.tas.gov.au/userfiles/stuartp/file/Publications/FireSafetyInBuildings.pdf, obligations of owners and occupiers
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>California</given-names>
            <surname>Energy Commission: 2008 Building Energy Efficiency Standards</surname>
          </string-name>
          (
          <year>2008</year>
          ), http://www.energy.ca.gov/2008publications/CEC-400-2008-001/ CEC-400-2008-001-CMF.
          <article-title>PDF, for residential and nonresidential buildings</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Del</given-names>
            <surname>Corro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Gemulla</surname>
          </string-name>
          , R.:
          <article-title>Clausie: clause-based open information extraction</article-title>
          .
          <source>In: Proceedings of the 22nd international conference on World Wide Web</source>
          . pp.
          <volume>355</volume>
          –
          <fpage>366</fpage>
          . WWW '13,
          <string-name>
            <given-names>International</given-names>
            <surname>World Wide Web Conferences Steering Committee</surname>
          </string-name>
          , Republic and Canton of Geneva, Switzerland (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fader</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Identifying relations for open information extraction</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <volume>1535</volume>
          –
          <fpage>1545</fpage>
          . EMNLP '
          <volume>11</volume>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mausam</surname>
            , Schmitz,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bart</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Open language learning for information extraction</article-title>
          .
          <source>In: EMNLP-CoNLL</source>
          . pp.
          <volume>523</volume>
          –
          <fpage>534</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>