<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extraction of Definitions for Bulgarian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hristo Tanev</string-name>
          <email>Hristo.Tanev@jrc.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>21020 Ispra</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Joint Research Center</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We participated in the Monolingual Bulgarian QA task at CLEF 2006 with a definition extraction system based on linguistic templates and keywords. Our system uses a partial syntactic parser for Bulgarian to detect noun phrases as candidates for definitions. Our system answered 28% of the definition questions correctly. This year we participated in the Monolingual Bulgarian QA task with a system which answers definition questions. Our work was inspired by the online Bulgarian QA system "Socrates" [1]. We think that automatic extraction of definitions is important for several reasons. First, although the online encyclopaedic resources in English keep growing in quality and range (for example, Wikipedia (http://www.wikipedia.org) provides over 1 million English articles), for many languages, Bulgarian among them, the quantity and quality of such resources are not sufficient. As a result, no encyclopaedic entries can be found for many topics on the Bulgarian Web. For example, question number 9 from the Bulgarian QA test set of CLEF 2006 is "Kakvo e OneNote?" ("What is OneNote?"). The Bulgarian version of Wikipedia provides no article for OneNote (though the English version does). If we search for OneNote with Google (http://www.google.bg) in the Bulgarian-language pages, we can hardly find good descriptions of OneNote. On the other hand, if we search for definitions on the Bulgarian Web using the automatic definition extraction service of "Socrates" (http://tanev.dir.bg/Socrat.htm), we find that OneNote is an application which has the functions of a notebook, and by following the returned link we can see a relevant description of OneNote. Second, encyclopaedic resources usually give high-quality, well-structured descriptions of a term, but more information can be captured by scanning free texts.</p>
      </abstract>
      <kwd-group>
        <kwd>Question answering</kwd>
        <kwd>Questions beyond factoids</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Such information can be more subjective, controversial, or incomplete with respect to the
encyclopaedias, but it can nevertheless be useful. For example, for question 126 "Kakvo e Evrovizia?"
("What is Eurovision?") the Bulgarian version of Wikipedia returns a short definition and a list of
winners; on the other hand, one of the definitions returned by the "Socrates" online definition
extraction is that "Evrovizia e nay-golemiat skandal na godinata" ("Eurovision is the biggest
scandal of the year"). Following the returned link, we can get interesting information concerning
a scandal around the Bulgarian participation in this song contest.</p>
      <p>Third, if a definition pattern like "TERM is DEFINITION" is present in a document, even if
the extracted definition is not informative enough, the pattern itself indicates that TERM plays an
important role in the article. In this way, identification of definition templates can be used to
better rank the results of a search engine.</p>
      <p>Fourth, and not least in importance, automatic definition extraction can help the people
who build dictionaries and encyclopaedic resources like Wikipedia by providing them with
relevant textual fragments.</p>
      <p>
        Our definition extraction system uses linguistic templates and clues similar to the ones
described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this paper we give an overview of the linguistic templates and rules used by our system,
as well as of our participation at CLEF 2006.
</p>
    </sec>
    <sec id="sec-2">
      <title>Definition Extraction Patterns</title>
      <p>
        Definition questions ask for a definition of a person or a term (e.g. "Who is Galileo?" -
answer: "Italian astronomer"). Techniques which rely on Named Entity recognition are not useful
for this type of question. On the other hand, templates provide a reliable instrument for
definition extraction. For example, the approach described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used only superficial patterns of the
type "a TERM is DEFINITION", "TERM, DEFINITION". However, such approaches are
error-prone, since similar patterns can be encountered in non-definition contexts. For example, "The
charge of a positron is about..." is not a definition of the positron, though the pattern "a positron
is" is present as a substring. We use linguistic constraints and rules to avoid or mitigate the effect
of such errors.
      </p>
      <p>
        For each definition question, our system first tries to match one of its templates against the
linguistically pre-processed text. We used the LINGUA language engine [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to perform text segmentation,
part-of-speech tagging and parsing.
      </p>
      <p>The phrases which match one of these patterns are considered candidate definitions. For every
candidate, a set of linguistic constraints is applied. For example, if the template "TERM is
DEFINITION" is found in the text, DEFINITION should be parsed as a noun
phrase and has to agree in gender and number with TERM.
</p>
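      <p>As an illustration of the matching and agreement step above, here is a minimal sketch (not the authors' actual implementation): it filters hypothetical candidate definitions for a "TERM is DEFINITION" match, keeping only those that agree with TERM in gender and number. The Phrase structure and its gender/number annotations are simplifying assumptions; a real system would take them from a parser such as LINGUA.</p>
      <preformat>
```python
# Hypothetical sketch: given candidates matched by a "TERM e DEFINITION"
# ("TERM is DEFINITION") template, keep only those whose head noun phrase
# agrees with TERM in gender and number.
from dataclasses import dataclass

@dataclass
class Phrase:
    text: str      # surface form of the noun phrase
    gender: str    # "m", "f", or "n"  (assumed annotation)
    number: str    # "sg" or "pl"      (assumed annotation)

def match_is_template(term, candidates):
    """Return candidate definitions that agree with the term."""
    accepted = []
    for cand in candidates:
        if cand.gender == term.gender and cand.number == term.number:
            accepted.append(cand.text)
    return accepted

term = Phrase("pozitron", "m", "sg")             # "positron"
candidates = [
    Phrase("elementarna chastitsa", "f", "sg"),  # fails gender agreement
    Phrase("dvoynik na elektrona", "m", "sg"),   # agrees: kept
]
print(match_is_template(term, candidates))       # ['dvoynik na elektrona']
```
      </preformat>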
    </sec>
    <sec id="sec-3">
      <title>Linguistic and Lexical Clues</title>
      <p>Our experiments demonstrated that for high-quality definition extraction it is not enough to
capture a fragment which matches certain patterns. It is also necessary to analyze the content of
the phrase and its context. Each candidate definition phrase is evaluated using a set of
evaluation rules which consider its syntactic context and lexical content.</p>
      <p>Here we give some examples of the evaluation rules:</p>
      <p>If a phrase matches a pattern like "TERM is DEFINITION" or "TERM, DEFINITION", we
give lower weight to matches where the pattern is preceded by a preposition: "Prep TERM is
DEFINITION" or "Prep TERM, DEFINITION". In most such constructions the definition does
not refer to TERM but to another phrase which contains it.</p>
      <p>If a phrase matches a pattern like "TERM is DEFINITION", we give lower weight to
matches where TERM is part of a bigger noun phrase, as in "svobodniat elektron e valna"
("the free electron is a wave"). In this case the definition refers to another term ("the free electron")
rather than to "electron" itself.</p>
      <p>If the pattern "TERM, DEFINITION" is found but is part of a comma-separated list, then
the candidate is most probably not a definition, and its weight is decreased.</p>
      <p>If a candidate definition for a person contains one of the keywords designating an occupation,
a social role, or other words used for famous people, such as "shampion" ("champion"), "golemiat" ("the
great"), etc., this definition is given higher weight.</p>
      <p>Longer definitions obtain higher weight.</p>
      <p>After the application of all the rules, each candidate definition phrase obtains a weight; the
phrases are sorted according to this weight and the best one is chosen.
</p>
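      <p>The rule-based scoring described in this section can be sketched as follows. The particular weight increments, the keyword lexicon, and the candidate representation are illustrative assumptions, not the values actually used by the system; the sketch only shows how several penalizing and rewarding rules combine into a single score per candidate.</p>
      <preformat>
```python
# Hypothetical sketch of the weighting scheme: each rule adjusts a
# candidate's score, then the highest-scoring candidate is chosen.
PERSON_KEYWORDS = {"shampion", "golemiat"}    # assumed keyword lexicon

def score_candidate(cand):
    weight = 1.0
    if cand.get("preceded_by_preposition"):   # "Prep TERM is DEFINITION"
        weight -= 0.5
    if cand.get("term_in_larger_np"):         # e.g. "the free electron is ..."
        weight -= 0.5
    if cand.get("inside_comma_list"):         # "TERM, DEFINITION" in a list
        weight -= 0.5
    words = cand["definition"].split()
    if any(w in PERSON_KEYWORDS for w in words):
        weight += 0.5                         # occupation / fame keyword
    weight += 0.1 * len(words)                # longer definitions score higher
    return weight

def best_definition(candidates):
    return max(candidates, key=score_candidate)["definition"]

candidates = [
    {"definition": "golemiat balgarski shampion"},
    {"definition": "chovek", "preceded_by_preposition": True},
]
print(best_definition(candidates))            # 'golemiat balgarski shampion'
```
      </preformat>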
    </sec>
    <sec id="sec-4">
      <title>Experiments and Future Directions</title>
      <p>We participated in the Bulgarian Monolingual QA task at CLEF 2006. We ran our system only
on the definition questions. The accuracy we achieved on this question subset was moderate:
about 28%.</p>
      <p>There is a lot of room for improvement in our definition extraction system. First of all, we
may enlarge the lexicon of "interesting" words used when evaluating definitions of people. We may
automatically learn syntactic and lexical clues for the context and structure of definitions. The
Bulgarian section of Wikipedia can be used as a training corpus. Finally, we may estimate the
informativeness of a definition by considering the Inverse Document Frequency of its words.</p>
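      <p>The IDF-based informativeness estimate suggested above could be sketched like this; the toy corpus and the averaging formula are illustrative assumptions. A definition built from rare words receives a higher score than one built from very common words, since common words carry little information about the term.</p>
      <preformat>
```python
import math

# Hypothetical sketch: score a definition by the average Inverse Document
# Frequency of its words over a small toy corpus (sets of words per document).
corpus = [
    {"eurovision", "is", "a", "song", "contest"},
    {"the", "contest", "is", "held", "every", "year"},
    {"a", "song", "is", "performed"},
]

def idf(word):
    # document frequency with add-one smoothing for unseen words
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + df))

def informativeness(definition):
    words = definition.lower().split()
    return sum(idf(w) for w in words) / len(words)

rare = informativeness("eurovision held")   # rare, informative words
common = informativeness("is a")            # very common words
print(round(rare - common, 3))              # positive: rare words score higher
```
      </preformat>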
      <p>Our definition extraction system may have a broad range of applications, especially in the
context of the Internet. It may be used to build profiles of people and organizations and to extract
relations between them, to automatically classify terms, to populate ontologies, etc.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Tanev</surname>, <given-names>H.</given-names></string-name>
          "<article-title>Socrates - a Question Answering Prototype for Bulgarian</article-title>".
          <source>In Proceedings of RANLP 2003</source>
          , Borovets, Bulgaria, September
          <year>2003</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Tanev</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Kouylekov</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Negri</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Coppola</surname>, <given-names>B.</given-names></string-name>, and
          <string-name><surname>Magnini</surname>, <given-names>B.</given-names></string-name>
          "<article-title>Multilingual Pattern Libraries for Question Answering: a Case Study for Definition Questions</article-title>".
          <source>In Proceedings of LREC 2004</source>
          , Lisbon, Portugal,
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Ravichandran</surname>, <given-names>D.</given-names></string-name> and
          <string-name><surname>Hovy</surname>, <given-names>E.</given-names></string-name>
          "<article-title>Learning Surface Text Patterns for a Question Answering System</article-title>".
          <source>In Proceedings of ACL 2002</source>
          , Philadelphia,
          <year>2002</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Tanev</surname>, <given-names>H.</given-names></string-name> and
          <string-name><surname>Mitkov</surname>, <given-names>R.</given-names></string-name>
          "<article-title>Shallow Language Processing Architecture for Bulgarian</article-title>".
          <source>In Proceedings of COLING 2002</source>
          , Taiwan,
          <year>2002</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>