BulQA: Bulgarian–Bulgarian Question Answering
                at CLEF 2005
                                Kiril Simov and Petya Osenova
           Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Bulgaria
                      kivs@bultreebank.org, petya@bultreebank.org


                                            Abstract
     This paper describes the architecture of a Bulgarian–Bulgarian question answering
     system — BulQA. The system relies on a partially parsed corpus for answer extraction.
     The questions are also analyzed partially. Then, on the basis of the analysis, queries
     to the corpus are created. After the retrieval of the documents that potentially contain
     the answer, each of them is further processed with one of several additional grammars.
     The choice of grammar depends on the question analysis and the type of the question. At
     present these grammars can be viewed as patterns for the question types, but our goal is
     to develop them further into a deeper parsing system for Bulgarian. The CLaRK System
     is used as an implementation platform — [5].

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 Database
Management Languages [Query Languages]

General Terms
Measurement, Performance, Experimentation

Keywords
Question answering, Answer support, Pattern grammars


1    Introduction
This paper describes the architecture and the linguistic processing of a question answering system
for Bulgarian — BulQA. The system has three main modules: a Question analysis module, an Interface
module, and an Answer extraction module. The Question analysis module deals with the syntactic and
semantic interpretation of the question. The result of this module is a task- and domain-independent
representation of the syntactic and semantic information in the question. The Interface
module bridges the interpretation received from the first module to the input required by the
third module. The Answer extraction module is responsible for the actual detection of the answer
in the corresponding corpus. This architecture has the advantage that it allows the same
modules to be reused in different tasks, such as Bulgarian as a source language in multilingual
question answering, or Bulgarian as a target language. In fact, only the Interface module has to
be re-implemented in order to tune the connection between the Bulgarian modules and the modules
for the other languages.
    In CLEF 2005 we have used the Question analysis module for two tasks: Bulgarian-English
QA and Bulgarian-Bulgarian QA. The former is very similar to our participation at CLEF
2004 ([6]) and for that reason remains out of this paper's scope.
    However, being participants in both tasks, we had to implement two versions of the Interface
module. For the Bulgarian-English QA task the Answer searching module is based on the DIOGENE
system ([4]) implemented at ITC-irst, Trento, Italy. For the Bulgarian-Bulgarian task we
implemented our own Answer searching module. This paper describes the latter in more detail.
    The paper also discusses the resources and processing necessary for answer support in different
contexts. In this way we outline the future development of the system.
    The structure of the paper is as follows: in section 2 we discuss language technology adaptation
for the analysis of Bulgarian questions; section 3 describes the interface module; in section 4 we
present the answer extraction approach on the basis of additional grammars. Section 5 comments
on the language resources and processing necessary for more complicated answer support; the
last section reports the results of the question answering track and concludes the paper.


2     Linguistic Processing of the Corpus and the Questions
2.1    Processing the Corpus
The processing of the corpus is done in two steps: off-line and at runtime. The goal is for as much
processing as possible to be done prior to the actual answer search. The off-line
processing steps are: tokenization, named-entity recognition, morphological analysis,
neural-network based morphosyntactic disambiguation, and chunking. These are the basic tools
which were widely used in our previous systems, and we consider their results reliable.
For an overview of the available language resources and tools for Bulgarian and how they were used
for the Bulgarian-English task at CLEF 2004, see [6]. The result of this preprocessing of the corpus is
stored as a set of XML documents with some indexing for search with the XPath language, as
implemented in the CLaRK system — [5]. Although the results of the preprocessing are still
not very deep, they allow us to save time during the answer search. In the future we intend to
extend the processing with additional information.
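    To make the division between the two steps concrete, the following sketch shows how the off-line
chain could be wired together. It is only an illustration: the Token fields mirror the XML attributes
described in Section 2.2, while the tool interfaces (tokenize, recognize_names, and so on) are
hypothetical placeholders for the corresponding CLaRK processing steps.

    # A sketch of the off-line preprocessing chain; every tool interface
    # below is a hypothetical placeholder for the corresponding CLaRK step.
    from dataclasses import dataclass

    @dataclass
    class Token:
        form: str
        ana: str = ""    # morpho-syntactic tag, set by the disambiguator
        bf: str = ""     # lemma list, set by the morphological analyzer
        sort: str = ""   # semantic sort for names (Person, Location, ...)

    def preprocess(text: str, tools) -> list:
        tokens = tools.tokenize(text)            # tokenization
        tokens = tools.recognize_names(tokens)   # named-entity recognition
        tokens = tools.analyze(tokens)           # morphological analysis
        tokens = tools.disambiguate(tokens)      # neural-network tagging
        return tools.chunk(tokens)               # chunking into NPA, PP, V, ...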
    The runtime processing of the corpus is based on additional partial parsing modules that are
tuned to the type of the question, the type of the answer and the content of the question. For
this purpose we constructed new modules for specific partial analyses; for example, we developed
new partial grammars for more complex NPs with a semantic categorization, such as time, location
and others. The reason these new processing modules have not been included in the off-line
processing is that they depend too much on the information from the questions, and are thus
likely to produce a wrong analysis where no appropriate information is available. The runtime
processing is applied only to the few documents that are retrieved from the corpus on the basis
of the keywords derived from the questions.

2.2    Processing the Questions
The processing of questions is similar to the off-line processing of the corpus. In fact, we have
enhanced last year's processing, mainly through the use of a more elaborate semantic lexicon and a
module for processing time expressions (i.e. dates, periods and event-marking adverbials) in order
to manage questions with temporal restrictions.
    Here is an example of the analysis of the question "Koj kosmicheski aparat trygva za Lunata na
25 yanuari 1994 g.?" (in English: Which space probe started for the Moon on 25 January 1994?);
the concrete attribute values are abbreviated below as "...":


    
    <NPA>
        <w ana="..." bf="...">Koj</w>
        <w ana="..." bf="...">kosmicheski</w>
        <w ana="..." bf="...">aparat</w>
    </NPA>
    <V><w ana="..." bf="...">trygva</w></V>
    <PP>
        <w ana="..." bf="...">za</w>
        <name ana="..." sort="LOCATION">Lunata</name>
    </PP>
    <PP rel="On_Date">
        <w ana="..." bf="...">na</w>
        <NPA sort="Date">
            <w ana="...">25</w>
            <w ana="..." bf="...">yanuari</w>
            <w ana="...">1994</w>
            <abbr ana="..." type="..." exp="...">g.</abbr>
        </NPA>
    </PP>
    <w ana="...">?</w>


    Here each common word is annotated within the following XML element: <w ana="MSD"
bf="LemmaList">wordform</w>, where the value of the attribute ana is the correct morpho-syntactic
tag for the wordform in the given context. The value of the attribute bf is a list of the lemmas
assigned to the wordform. Names are annotated within the following XML element: <name
ana="MSD" sort="Sort">Name</name>, where the value of the attribute ana is the same as
above. The value of the attribute sort determines whether this is the name of a person, a location,
an organization or some other entity. The abbreviations are annotated in a similar way; additionally,
they have type and exp attributes, which encode the type of the abbreviation (acronym
or contraction) and its extension.
    The next level of analysis is the result of the chunk grammars. In the example there are two
NPA elements (NPA stands for a noun phrase of head-adjunct type), a lexical V element (lexical
verb) and two PP elements. Also, one of the noun phrases is annotated as a date expression
by a sort attribute with the value Date. This information is percolated to the prepositional phrase,
which is annotated with the relation label On Date; this results from combining the meaning of
the preposition with the category of the noun phrase. The noun in the other prepositional
phrase is annotated as a LOCATION name.
    The result of this analysis then has to be translated into the input format of the answer
extraction module.
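    To illustrate how such an analysis is consumed downstream, the sketch below queries it with
XPath. We use the lxml library purely for exposition (the system itself evaluates equivalent
expressions inside CLaRK), and the element and attribute names follow the example above.

    # Querying the analyzed question with XPath (illustrative only; the
    # system runs equivalent expressions inside the CLaRK system).
    from lxml import etree

    doc = etree.parse("question.xml")   # the analyzed question as XML

    # All names annotated as locations, e.g. "Lunata":
    for name in doc.xpath('//name[@sort="LOCATION"]'):
        print(name.text, name.get("ana"))

    # The prepositional phrase carrying the temporal relation:
    date_pps = doc.xpath('//PP[@rel="On_Date"]')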


3     Interface Module
Here we describe the implemented interface module, which translates the result of the question
analysis module into the template required by the system that extracts the answers to the
questions. This module is an extension of the module we implemented for the Bulgarian-English
task. The main difference is that we do not transfer the question analyses into DIOGENE's
type of template with English translations of the keywords; instead, we define a set of processing
steps for the Answer searching module. The processing steps are of two kinds: corpus processing
and document processing. The first processing step retrieves documents from the corpus that
potentially contain the relevant answers. The second one additionally analyzes the retrieved
documents in order to extract the answer(s). The process includes the following steps:
    • Determining the head of the question.
      The question head is determined by searching for the chunk which contains the interrogative
      pronoun. There were also cases in which the question was expressed with the help of imperative
      verb forms: nazovete ('name', plural imperative), kazhete ('point out'/'say', plural imperative).
      After the chunk selection we classify the interrogative pronoun within a hierarchy of question
      heads. In this hierarchy some other elements of the chunks — mainly prepositions — play an
      important role as well.

    • Determining the head word of the question and its semantic type.
      The chunk determined in the previous step is also used for determining the head word of the
      question. There are five cases (see the sketch after this list). First, the chunk is an NP chunk in which the interrogative
      pronoun is a modifier. In this case the head noun is the head word of the question. For
      example, in the question: What nation is the main weapons supplier to Third World
      countries? the noun ‘nation’ is the head word of the question. In the second case the chunk
      is a PP chunk in which there is an NP chunk similar to the NP chunk from the previous
      case. Thus, again the head noun is a head word for the question. For example, in the
      question: In what music genre does Michael Jackson excel? the noun ‘genre’ is the head
      word of the question. Third, the interrogative pronoun is a complement of a copula verb
      and there is a subject NP. In this case the head word of the question is the head noun of the
      subject NP chunk of the copula. For example, in the question: What is a basic ingredient
      of Japanese cuisine? ‘ingredient’ is the head of the question. The fourth case covers the
      questions with imperative verbs. Then again the head of the question is the head noun
      of the complement NP chunk. For example, in the question: Give a symptom of the Ebola
      virus. the noun ‘symptom’ is the head of the question. The last case covers all the remaining
      questions. Then the head word of the question is the interrogative phrase (or word) itself.
      For example, in the question: When was the Convention on the Rights of the Child adopted?
      the head of the question is the interrogative word ‘when’. The semantic type of the head
      word is determined by the annotation of the words with semantic classes from the semantic
      dictionary. When more than one semantic class is available, we add all of them. The type of
      the interrogative pronoun is used later for disambiguation. If no semantic class is available
      in the dictionary, then the class ‘other’ is assigned.
    • Determining the type of the question.
      The type of the question is determined straightforwardly by the semantic type of the head
      word. For the recognition of questions with temporal restrictions we rely on the preprocessing
      of the questions and the assigned temporal relations. As a temporal restriction we consider
      such expressions that are not part of the head of the question.
    • Determining the keywords of the question and their part of speech.
      The keywords are determined by the non-functional words in the question. Sometimes it
      is possible to construct multi-token keywords, such as names (Michael Jackson), terms or
      collocations. For the Bulgarian-Bulgarian task this is important when there are special rules
      for query generation for document retrieval (see the next section). We also used gazetteers of
      abbreviated forms of the names of the most frequent organizations in English. This was very
      helpful in finding the correct answers to the Definition-Organization questions, because in many
      cases these abbreviations lack Cyrillic counterparts, and thus the search is very direct even in
      the Bulgarian corpus. Only the extensions systematically seem to have Cyrillic counterparts,
      and therefore they sometimes need more complex processing.
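    The sketch below condenses the head-word selection into code. The Chunk structure and the
copula/imperative arguments are our simplification for illustration; the system operates on the
chunked XML representation described in Section 2.

    # A simplified rendering of the five head-word cases; Chunk and the
    # extra arguments are stand-ins for the system's XML chunks.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Chunk:
        ctype: str                             # "NP", "PP", ...
        head: str                              # head word of the chunk
        embedded_np: Optional["Chunk"] = None  # inner NP of a PP chunk

    def head_word(wh_chunk, interrogative,
                  copula_subject=None, imperative_object=None):
        if wh_chunk.ctype == "NP":             # case 1: "What nation ..."
            return wh_chunk.head               #   -> 'nation'
        if wh_chunk.ctype == "PP" and wh_chunk.embedded_np:
            return wh_chunk.embedded_np.head   # case 2: "In what music genre ..." -> 'genre'
        if copula_subject is not None:         # case 3: "What is a basic ingredient ..."
            return copula_subject.head         #   -> 'ingredient'
        if imperative_object is not None:      # case 4: "Give a symptom ..."
            return imperative_object.head      #   -> 'symptom'
        return interrogative                   # case 5: "When was ..." -> 'when'

    def semantic_classes(head, lexicon):
        # All classes from the semantic dictionary; 'other' as fallback.
        return lexicon.get(head, ["other"])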


4     Answer Extraction and Validation
The answer extraction is a two-step process: first, the documents possibly containing the answer
are retrieved from the corpus; then the retrieved documents are additionally processed with special
partial grammars which depend on the type of the answer, the type of the question and the
keywords found in the document. We can view these grammars as patterns for the different types
of questions.
    As mentioned above, for document retrieval we use the CLaRK system. The corpus
is represented as a set of XML documents. The search is done via the XPath language enhanced with
an index mechanism over the (selected) content of each document. The initial step of the answer
extraction is done by translating the keywords from the analysis of the question into an XPath
expression. This expression selects the appropriate documents from the corpus. The expression
itself is a disjunction, where each disjunct describes some combination of keywords and their
variants. The variants are necessary because the keywords in the question bear different degrees of
informativeness with respect to the answer (see the discussion below on answer support). For
example, for named entities we constructed different (potential) representations: Michael Jackson
can be M. Jackson or only Jackson. Where possible, we convert the corresponding keyword to
a canonical form (for example, dates) and simply match the canonical forms from the corpus
and the question.
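    A rough sketch of this translation step follows. The DOC element name and the contains()
predicates are assumptions made for the illustration; the real expressions are generated for the
indexed XPath search in CLaRK.

    # Building the disjunctive retrieval query (the DOC element and the
    # contains() predicates are assumptions of this sketch).

    def name_variants(name):
        """'Michael Jackson' -> ['Michael Jackson', 'M. Jackson', 'Jackson']"""
        parts = name.split()
        if len(parts) < 2:
            return [name]
        return [name, parts[0][0] + ". " + parts[-1], parts[-1]]

    def retrieval_query(combinations):
        """Each combination is one admissible set of keywords/variants;
        the combinations are joined disjunctively, as described above."""
        disjuncts = []
        for combo in combinations:
            conjunct = " and ".join('contains(., "%s")' % kw for kw in combo)
            disjuncts.append("(" + conjunct + ")")
        return "//DOC[" + " or ".join(disjuncts) + "]"

    # retrieval_query([["Michael Jackson"], ["M. Jackson"], ["Jackson"]])
    # -> '//DOC[(contains(., "Michael Jackson")) or ... ]'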
    Definition questions provide one key word or expression. Thus, they are easily tractable at this
stage; for example, 'Who is Nelson Mandela?' has the key expression 'Nelson Mandela'. However,
factoid questions are more difficult to process even at this general stage. Obviously, the reason
is that the question keywords are not always the best answer pointers. This is why we decided to
develop our own search engine instead of using a standard one. This leaves room for future
developments, in which we will make maximal use of the implicit lexical information and incorporate
more reasoning along the lines of contemporary investigations of paraphrases, entailment and
different degrees of synonymy.
    When the documents are retrieved, they are additionally processed in the following way: first,
the keywords (the ones from the question and their variants or synonymous expressions) are selected.
Then special partial grammars (implemented as cascaded regular grammars in the CLaRK System)
are run within the contexts of the keywords. These grammars use the information about the type
of the answer and how it is connected to the keywords. The context of a single keyword (or phrase)
can be explored by several different grammars, yielding (potentially) several possible answers. If
more than one answer is found, we apply additional constraints to select one of them as the result.
If no answer is found, the NIL value is returned.
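    Schematically, the runtime loop can be rendered as follows. Each pattern grammar is
approximated here by a callable over the text around a keyword occurrence, whereas in the system
they are cascaded regular grammars over chunked XML; the frequency-based tie-break stands in
for the actual additional constraints.

    # A schematic sketch of the runtime extraction loop; the frequency
    # tie-break is a stand-in for the system's additional constraints.

    def extract(documents, keywords, grammars, window=200):
        candidates = []
        for doc in documents:
            for kw in keywords:
                pos = doc.find(kw)
                while pos != -1:
                    context = doc[max(0, pos - window):pos + window]
                    for grammar in grammars:
                        answer = grammar(context)   # candidate answer or None
                        if answer:
                            candidates.append(answer)
                    pos = doc.find(kw, pos + 1)
        if not candidates:
            return "NIL"                            # no answer found
        return max(set(candidates), key=candidates.count)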
    The implementation of this architecture is done in the CLaRK system. The pattern grammars
do not yet cover all the different kinds of questions. For some types of questions the resources
that we have for Bulgarian do not suffice for real question answering, and only some opportunistic
patterns can be implemented. As we would like to develop the system along the lines of
knowledge-rich question answering systems, we did not try to implement many such opportunistic
patterns; instead, more effort was invested in the classification of the contexts that support the
answers. The next section is an attempt to characterize the processing that we would like to
incorporate in future developments.


5     Discourse Requirements for Answer Support
As stated in the CLEF 2005 guidelines, each type of question has an abstract corresponding answer
type, but when the answer occurs in a real context, there is a scale of answer acceptability, and
the concrete answer must be mapped against this scale. A change of context can change the
answer's grade on this scale. In this section we give some examples of answers supported by
different contexts.
    We consider the text as consisting of two types of information: (1) ontological classes and
relations, and (2) world facts. The ontological part generally determines the topic and the domain
of the text. We call the corresponding "minimal" part of the ontology implied by the text the
ontology of the text. The world facts represent an instantiation of the ontology in the text. Both
types of information are referred to uniformly as the 'semantic content of the text'. Both components
of the semantic content are connected to the syntactic structure of the text. Any (partial)
explication of the semantic content of a text will be called a semantic annotation of the text
(defined in this way, the semantic annotation could also contain some pragmatic information and
actual world knowledge). The semantic content of a question includes some required but
underspecified element(s) which has (have) to be specialized by the answer in such a way that the
specialization of the semantic content of the question is true with respect to the actual world.
    We consider a textual element a to be a supported answer to a given question q in a text t
if and only if the semantic content of the question, with the addition of the semantic annotation
of the textual element a, is true in the world (the world as it is described by the corpus).
    Although the above definition is quite vague, it gives some idea of the support that an answer
receives from the text in which it is found. The semantic annotation of the answer comprises
all the concepts applicable to the textual element of the answer and also all the relations in which
the element participates as an argument (we consider the case when the answer denotes a relation
to be a concept). Of course, if we had a complete semantic annotation of the corpus and the
question, it would be relatively easy to find a correct answer to the question in the corpus, if one
exists. Unfortunately, such an explication of the semantic annotation of the text is not feasible
with current NLP technology. Thus we are forced to search for an answer using partial semantic
annotations. To give an idea of the complexity necessary in some cases: the context which has
to be explored can vary from a phrase (one NP) to a clause, a sentence, a paragraph, the whole
article or even whole issues. The required knowledge can include linguistic relations, discourse
relations, world knowledge, and inference over the semantic annotation.
    Here are some examples of dependencies in different contexts, together with a description of
the properties necessary to interpret the relations:

Relations within NP. The Bulgarian nominal phrase has a very rich structure. We will consider
the following models:

NP :- NP NP
    This model is important for two kinds of questions: definition questions about people and
questions about measurement. The first type of question is represented by the abstract question
"Koj e Ime-na-chovek?" (Who is Name-of-a-Person?): Koj e Nikolaj Hajtov? (Who is Nikolaj
Hajtov?). As discussed in [7], some of the possible patterns that can help to find the answer to
such a question are "NP Name" and "Name is NP", where Name is the name from the question
and NP constitutes the answer. The first pattern is of the type we consider here; the other one,
and some further patterns, are presented below. Although it is a very simple pattern, the quality
of the answer extraction depends on the quality of the grammar for the nominal phrase. The first
NP can be quite complicated and recursive. Here are some examples:
       [NP klasikyt] [NP Nikolaj Hajtov]
             (the classic Nikolaj Hajtov)
       [NP golemiya bylgarski pisatel] [NP Nikolaj Hajtov]
             (the great Bulgarian writer Nikolaj Hajtov)
       [NP zhiviyat klasik na bylgarskata literatura] [NP Nikolaj Hajtov]
             (the living classic of Bulgarian literature Nikolaj Hajtov)
       [CoordNP predsedatel na syyuza na pisatelite i zhiv klasik na bylgarskata literatura]
       [NP Nikolaj Hajtov]
             (chair of the union of the writers and living classic of
             Bulgarian literature Nikolaj Hajtov)
    As can be seen from the examples, the first NP can comprise a head noun and modifiers
of different kinds: adjectives, prepositional phrases. It can also involve coordination. Thus, in
order to process such answers, the system needs to recognize the first NP correctly. This step is hard
for a base NP chunker (being non-recursive), but when the chunker is combined with semantic
information and a named-entity module, the task is solvable. A characteristic of the first NP is
that the head noun denotes a human. If such nouns are mapped to ontological characteristics,
the work of the tool is facilitated.
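    As a rough approximation over raw text (the actual grammars operate on chunked XML and
check that the head noun denotes a human), the "NP Name" pattern can be sketched as follows;
the six-token limit is an arbitrary choice of the illustration.

    # A crude textual rendering of the "NP Name" definition pattern.
    import re

    def np_name(name):
        # Up to six lower-case tokens standing in for a recognized NP,
        # immediately preceding the queried name.
        return re.compile(r"((?:\w+[a-z]\s+){1,6})" + re.escape(name))

    m = np_name("Nikolaj Hajtov").search(
        "golemiya bylgarski pisatel Nikolaj Hajtov")
    if m:
        print(m.group(1).strip())   # -> 'golemiya bylgarski pisatel'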
    Another usage of this recursive NP model concerns measurement questions, such as: "Kolko
e prihodyt na "Grijnpijs" za 1999 g.?" (How much is the income of Greenpeace for 1999?). The
answers to such questions have the following format: "number", "noun for a number", "noun for
a measurement unit". For example, "[NP 300 miliona] [NP dolara]" (300 million dollars). The NPs
are relatively easy to recognize, but their composition remains unrecognized in many cases, and
systems then return partial answers like '300 million' or only '300'. Without the complete
measurement information, however, such an answer is not quite correct and is discarded.
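    A sketch of composing the complete measurement expression is given below; the scale-noun
and currency lists are illustrative stand-ins, not the system's lexicons.

    # Composing number + scale noun + unit so that '300 miliona dolara'
    # is not truncated; the word lists are illustrative only.
    import re

    SCALE = r"(?:hilyadi|miliona|miliarda)"   # thousands/million/billion
    UNIT = r"(?:dolara|leva|evro)"            # dollars/leva/euro

    MEASURE = re.compile(r"\d[\d\s.,]*(?:%s\s+)?%s" % (SCALE, UNIT))

    m = MEASURE.search("prihodyt e 300 miliona dolara")
    print(m.group())   # -> '300 miliona dolara'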
    Problems arise when there are longer names of organizations with embedded PPs, or with
adjacent PPs which are not part of them. Systems often return some NP, but they suggest either
the dependent NP as an answer instead of the head one, or an NP which is part of a PP that does
not modify the head NP. An example of the first case is the answer to the question: What is
FARC? The system answered 'Colombia' instead of 'Revolutionary Armed Forces of Colombia'
or at least 'Revolutionary Armed Forces'. An example of the second case is the answer to the
question: What is CFOR? It was 'Bosnia' instead of 'command forces' (in Bosnia).
    Another interesting case is when the first NP has the form AP NP, where AP is a relational
adjective connecting the noun with another noun: italianski (Italian) -> Italy, ruski
(Russian) -> Russia, etc. In this case the answer to questions like "Ot koya strana e FIAT?" (Which
country does FIAT come from?) or "Na koya strana e prezident Boris Yelcin?" (Of which country
is Boris Yelcin the president?) is encoded within the adjective. This means that we should have
interrelated lexicons in order to derive the necessary information even when it is only indirectly
present in the text. Note that this does not hold only within NPs. For example, the answer to
the question 'Who was Michael Jackson married to?' could be 'Michael Jackson's ex-wife Debby'.
Of course, here the relation is more complex, because there is not only a relation between 'marry'
and 'wife', but also a temporal mapping between 'was married' and 'ex-wife'.

NP :- (Parenthetical NP) | (NP Parenthetical)
    Such NPs are relevant for definition questions about the extensions of acronyms: Kakvo e
BMW? (What is BMW?). Very often the answer is presented in the form of an NP which is the
full name of the organization, with the corresponding acronym given as a parenthetical expression
in brackets, or the other way round. In this case two gazetteers, one of acronyms and one of the
corresponding organization names, would be of help. Additionally, we have to rely on opportunistic
methods as well, because it is not possible to have all new occurrences in pre-compiled repositories.
The case with the extension as a parenthetical is easier to handle than the opposite case; recall the
problems with defining the boundaries of a complex name.
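    The easier direction, 'Full Name (ACRONYM)', can be approximated with a simple pattern,
sketched below; the opposite order runs into the complex-name boundary problem mentioned above.

    # Matching 'Full Name (ACRONYM)'; a crude approximation of the
    # easier parenthetical case.
    import re

    PAREN_ACRONYM = re.compile(r"([A-Z][\w\s-]+?)\s*\(([A-Z]{2,})\)")

    text = "Bayerische Motoren Werke (BMW) predstavi nov model."
    m = PAREN_ACRONYM.search(text)
    if m:
        print(m.group(2), "->", m.group(1))
        # -> BMW -> Bayerische Motoren Werke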

NP :- NP RelClause
    Here the main relations are expressed via the relative pronoun that follows; this is a kind of
local coreference. Consider the example: 'Mr Murdoch, who is the owner of several newspapers'.
We can trace who Murdoch is through the relative clause. However, sometimes it might be tricky,
because in complex NPs we do not know whether the relative clause modifies the head NP or the
dependent one. For example, in the phrase 'the refugee camp in the city, which is the biggest in
the country', we cannot know whether the camp or the city is the biggest in the country.

Relations within a clause (sentence). In order to derive the relevant information, we very
often need relations among paraphrases of the same event. This idea is discussed in [1], [2] and
[3], among others. For that task, however, the corpus should be annotated with verb frames and
the grammatical roles of their arguments. Additionally, lists of possible adjuncts are also needed,
because they are mapped as answer types to questions about time, measure, location and manner.
Thus we have to go beyond argument structure annotation. The ideal lexical repository should
include relations between semantic units, such as: if something is a location, you can measure the
distance to it; if something is an artefact, you can measure its cost; etc. Also, classical entailments
such as 'if you write something, then you are its author' can be derived from a rich explanatory
dictionary that has been properly parsed.

Discourse relations. These are necessary when the required information cannot be accessed
locally. When a popular politician is discussed in a newspaper, it might be the case that he is
referred to only by his name, without the title: 'Yaser Arafat' instead of 'the Palestinian leader
Yaser Arafat'. In such cases we need to navigate through a wider context, and then marked
coreferential relations become a must: Yaser Arafat is mentioned in one sentence, then in the next
one he is referred to as 'the Palestinian leader' and finally as 'he'. Here we could rely on anaphora
resolution tools and on gathered encyclopedic knowledge.

World knowledge. We usually rely on world knowledge when the information in the question is
more specific than that in the candidate answers. For example, for the question 'Who is Diego
Armando Maradona?' we found answers only about 'Diego Maradona' or 'Maradona'. In this
case we can be sure that all these names refer to the same person. However, there are trickier
cases, such as the two Bushes, father and son. If the marker 'junior' or 'senior' is not present,
then we have to rely on other supporting markers such as temporal information or events that
are connected with the one or the other.


6    Results and Outlook
The results from our Bulgarian-Bulgarian QA run can be viewed as a preliminary test of our QA
system. We obtained the following statistics: 37 of the 200 returned answers were correct, 160
were wrong and 3 were inexact. The distribution of the correct answers among the question
categories is as follows: 21 definition questions (13 for organizations and 8 for persons) and 16
factoid questions (2 for locations, 2 for measure, 1 for organizations, 2 for other categories, 2 for
persons and 3 for time, plus, among the temporally restricted questions, 2 for locations and 2
for organizations). The main problems that we encountered during the campaign were as follows:
(1) the lack of a complete set of relevant QA processing tools for Bulgarian, and (2) for some of
the questions we were not able to run the procedure because of business travel during the testing
period. Thus, for about one third of the questions the answer NIL was given without a real
application of the system.
    Our plans for future work build on our experience from the CLEF 2005 participation. We
plan to implement more pattern grammars and to enrich the resources for Bulgarian in two respects:
(1) qualitative – better integration of the available resources and tools, and (2) quantitative –
creation of more support grammars for the off-line procedure.


References
[1] Ido Dagan and Oren Glickman. Probabilistic Textual Entailment: Generic Applied Modeling
    of Language Variability. Learning Methods for Text Understanding and Mining Workshop.
    Available at: http://www.cs.biu.ac.il/~glikmao/Publications
[2] Milen Kouylekov and Bernardo Magnini. Recognizing Textual Entailment with Tree Edit
    Distance Algorithms. PASCAL Challenges Workshop. Available at:
    http://www.kouylekov.net/Publications.html
[3] Dekang Lin and Patrick Pantel. Discovery of Inference Rules for Question Answering. In:
    Natural Language Engineering 7(4):343–360. (2001)
[4] Negri M., Tanev H., and Magnini B.: Bridging Languages for Question Answering: DIOGENE
    at CLEF-2003. Proceedings of CLEF-2003, Trondheim, Norway. (2003) 321–330
[5] Simov, K., Peev, Z., Kouylekov, M., Simov, A., Dimitrov, M., Kiryakov, A.: CLaRK — an
    XML-based System for Corpora Development. Proceedings of the Corpus Linguistics 2001
    Conference. (2001) 558–560
[6] Petya Osenova, Alexander Simov, Kiril Simov, Hristo Tanev, and Milen Kouylekov. Bulgarian-
    English Question Answering: Adaptation of Language Resources. In (Peters, Clough, Gonzalo,
    Jones, Kluck, and Magnini eds.) Fifth Workshop of the Cross–Language Evaluation Forum
    (CLEF 2004), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany.
    (2005)
[7] Hristo Tanev. Socrates: A Question Answering Prototype for Bulgarian. In Proceedings of
    RANLP 2003, Borovetc, Bulgaria. pp. 377-386. (2003)