EQUAL: Encyclopaedic QUestion Answering for Lists

Iustin Dornescu
RIILP, University of Wolverhampton
i.dornescu2@wlv.ac.uk

Abstract

This paper presents EQUAL, a semantic question answering (QA) system which relies on structural information from Wikipedia to extract answers to list questions in the GikiCLEF 2009 task. The system is described in the context of a more general architecture for semantic QA. Unlike the usual textual QA approach, EQUAL does not rely on identifying the answer within a text snippet by using keyword retrieval. Instead, it explores the Wikipedia page graph, extracting and aggregating information from multiple documents. Enforcing semantic constraints proved to yield precise results, and the system achieved the highest score amongst the participants in the task.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, List questions, Wikipedia, Semantic question answering

1 Motivation

This paper describes the EQUAL system and its participation in GikiCLEF 2009 (http://www.linguateca.pt/GikiCLEF/). The system was designed to combine the coverage of open-domain textual QA with the precision/accuracy and semantics of closed-domain QA, whilst also addressing some of the limitations of standard textual QA systems.

The feed-forward pipeline architecture often employed in QA exacerbates errors, mainly due to the use of approximate solutions to difficult NLP problems during intermediate steps. Systems usually rely on the hypothesis of information redundancy in large corpora, focusing on question rewriting patterns (e.g. interrogative vs. affirmative) and keyword expansion to identify the best paragraphs (snippets). This makes textual QA essentially a 'smart' paragraph retrieval system, enhanced with post-retrieval 'extraction' and weak semantic algorithms. Aggregating information from multiple documents usually means re-ranking candidates based on the extraction frequency of each one. Statistical prevalence is a surrogate for real semantic information which, in many QA systems, is discarded in favour of the distributional hypothesis: "similar words appear in similar contexts". The use of standard NLP tools also imposes a constraint regarding the content that is used in QA: plain text. The rich structural information common in HTML pages (e.g. links, tables) and their (implicit) semantics are largely ignored. While using plain text has certain advantages, current NLP tools are not powerful enough to compensate for the loss of information when the mark-up is removed.

Although textual QA works very well for certain classes of questions, the techniques employed limit the range of "feasible" questions to those which can be directly answered by a textual snippet. When the information necessary to compile the answer is spread across several documents, traditional systems lack the semantic representations necessary to combine the different pieces of information. This is not only due to the difficulty of general purpose inference (an AI-hard problem) but also to the complexity of different NLP tasks. As new resources have become available, the time is now right to add more semantic processing to the QA task.
Newer campaigns such as the GikiCLEF task or the INEX Entity Ranking task (http://www.inex.otago.ac.nz/tracks/entity-ranking/entity-ranking.asp) focus on list questions, which are known to be difficult for most textual QA systems. This paper presents an alternative semantic QA architecture and the prototype system EQUAL to address current limitations of textual QA. The rest of the paper is structured as follows: Section 2 outlines the general architecture of semantic QA. Section 3 describes the way natural language is converted into semantic representations and Section 4 shows how answers are retrieved for these question interpretations. Section 5 discusses the results and related work, and the paper finishes with conclusions and ideas for future work.

2 Semantic QA Architecture

To successfully address list questions, a novel architecture is proposed that does not have words at its core, but entities. The semantic QA architecture, of which EQUAL is a prototype, has two main processing phases: a) the analysis phase, which is responsible for understanding the question and identifying answers, and b) the feedback phase, responsible for interacting with the user. The purpose of the latter is not only user modelling and dialogue management, but also providing an efficient interface between the human and the machine, allowing clarification questions for disambiguation, named sets of results for subsequent querying, or presenting additional information. The feedback phase is beyond the scope of the current paper, given that GikiCLEF is not an interactive task; however, it is a major component of the semantic QA architecture and therefore needs mentioning. This paper focuses on the analysis phase.

One of the key ideas behind the semantic QA architecture is its ability to address ambiguity. Rather than employing standard NLP tools to get a most probable output, the system must be aware of different sources of ambiguity and be able to deal with them directly during processing. Humans use common sense and their knowledge of the world when understanding and interpreting questions. To compensate for the lack of such abilities, during the analysis phase the system must generate all possible interpretations and rank them based on the results found and on interaction with the user.

All questions are interpreted as natural language retrieval queries. Therefore, each question needs to be parsed and the resulting semantic structures need to be converted into formal queries which correspond to a retrieval algorithm. The two main issues addressed in this paper are the mapping of natural language to formal semantic constituents (the semantic interpretation of the question) and the particular data model chosen to represent information according to the semantic model (retrieving the answers). The third main issue (not addressed here) is the generation of metadata for feedback and interaction.

The data model adopted is closely related to Datalog and RDF (http://www.w3.org/RDF/). Each entity corresponds to a node in a graph. It has attributes, similar to owl:DatatypeProperties, and relations with other nodes, corresponding to owl:ObjectProperties. This abstract model facilitates the aggregation of information from various sources as needed at query time. For example, instead of having Paris as an entry in the LOCATION gazetteer, this model allows a QA system to "know" that Paris is a Roman establishment and a French city, and to "know" its population size, its mayor, what universities it has, and so on.
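To make this data model concrete, the following is a minimal Python sketch of such an entity graph. It is an illustration under assumptions, not the authors' implementation; all names and the population figure are invented for the example.

```python
# Minimal sketch of the entity-centric data model: nodes with attributes
# (cf. owl:DatatypeProperty) and typed relations to other nodes
# (cf. owl:ObjectProperty). Names and figures are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: dict = field(default_factory=dict)   # literal values
    relations: list = field(default_factory=list)    # (predicate, Entity)

    def relate(self, predicate: str, target: "Entity") -> None:
        self.relations.append((predicate, target))

# Instead of a flat LOCATION gazetteer entry, Paris is a node aggregating
# facts that may come from Wikipedia, dbpedia, geonames, etc.
paris = Entity("Paris", attributes={"population": 2_200_000})  # illustrative
paris.relate("type", Entity("Roman establishment"))
paris.relate("type", Entity("French city"))
paris.relate("capital_of", Entity("France"))

# Query-time aggregation: which types does Paris have?
print([t.name for p, t in paris.relations if p == "type"])
# ['Roman establishment', 'French city']
```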
A semantic QA system can identify the answers to questions which require the extraction and aggregation of information from more than one document, since it is focused on entities and not on words.

The entire Wikipedia graph can be represented using these three concepts: nodes, relations and properties. The distinction is made between articles (entities, redirects, disambiguation and lists), categories (classification) and templates (semi-structured format for representing object properties). This graph can be merged with Semantic Web datasets such as dbpedia (http://wiki.dbpedia.org/About), Yago [5], geonames (http://www.geonames.org/ontology/), etc., to give us enough information to address a wide variety of questions.

3 Semantic Question Interpretation

The task of interpreting questions is difficult due to the ambiguity which characterizes natural language. While for humans it is relatively easy to analyse questions and derive an annotation which makes sense, computers, lacking general purpose inference and common sense, cannot deal effectively with ambiguity. An important part of assigning a semantic interpretation relates to the way data is represented. For example, instead of finding people who are professional football players and who were born in Brazil, a system using Wikipedia does not need to compose the meaning of a span of text by examining individual words, but can instead exploit the Brazilian footballers category to extract the entities directly.

In a nutshell, the difficulty resides in the fact that very similar natural language expressions can have very different semantic interpretations. In order to tackle this problem, the proposed architecture tries to cover as many semantic interpretations as possible by trying to identify distinct semantic constraints. This means that it can exploit the different ways information is represented in Wikipedia, and therefore is more likely to find an answer.

EQUAL is an enhanced version of the WikipediaListQA@wlv system which took part in the GikiP 2008 pilot [4], and builds upon its predecessor by taking steps towards implementing the semantic QA architecture presented in Section 2. For EQUAL, interpreting natural language questions means assigning semantics to each unit in a way that would deduce the meaning of an expression, say "Brazilian football players", from the meaning of its individual words. This process is hindered by the lack of a formal semantics dictionary and by the fact that multi-word expressions can have semantics which differ from those of the individual words comprising them. The unit "American football players" would not only mean a different nationality but also a different sport. Wikipedia and its category folksonomy are used to address this problem: in determining the meaning of a natural language expression, we reuse the meaning of its sub-expressions as implied by the Wikipedia Category Graph.

In our model, a question interpretation is a set of semantic constraints which can be represented using entities, relations and properties. A question can have several interpretations, each corresponding to a different understanding of the question. There are several types of constraints used by the EQUAL system:

• EAT (expected answer type) – indicates the type of entities sought: only entities of this type are considered answer candidates;
• entity – denotes an entity corresponding to a Wikipedia article;
• relation – indicates the relation between entities, usually represented by the verb;
• property – restricts a set of entities to a subset which have a certain property (e.g. population size);
• geographic – restricts the entities to having a particular relation with a geographic entity (e.g. born in a country, contained in a region);
• temporal – constrains the temporal interval of valid entities;
• introduction – marks the phrase used to start the question and indicates redundancy.

List || the Italian places || where Ernest Hemingway | visited | during his life.
{intro List} the {eat Italian places} where {entity Ernest Hemingway} {rel visited} {temp during his life}.

Name || Romanian writers || who were living | in USA | in 2003
{intro Name} {eat Romanian writers} who were {rel living} in USA {temp in 2003}

Figure 1: Example of constraints in questions

Figure 1 demonstrates the shallow parsing (see Section 3.1) and the relevant constraints applied to two questions. Identifying these constraints is related to the more general problem of parsing. The semantics of the constraints themselves in the context of Wikipedia are defined by constraint verifiers, i.e. the actual implementations which verify whether a particular constraint holds. In this model, a constraint can have several verifiers, which can be specialised for certain subsets of the data or which can use external data sources. Issues related to parsing and verifiers are presented in the following sections.

3.1 Parsing

Parsing is usually employed to map the syntactic parse tree to semantic representations (e.g. subject – actor, verb – action). Standard parsers select the most probable parse(s) given their training data. While this is sufficient for some NLP applications, QA systems need the guarantee that the correct meaning of the question can be deduced from the syntactic trees identified by the parser. Some parsers (e.g. the Stanford Parser [2], http://nlp.stanford.edu/software/lex-parser.shtml) do generate the top n trees, which improves the chances that the correct parse is available. However, there are very few models trained for interrogative sentences, and hardly any which are trained on a corpus of questions similar to those in GikiCLEF.

In order to deal with ambiguity, the semantic QA architecture uses a parallel algorithm to generate all plausible parse trees and to create question interpretations by annotating the syntactic units with their corresponding semantic counterparts. Erroneous interpretations are unlikely to yield any results. A practical system with reasonable coverage would need to employ a ranking scheme to guide the exploration of the parse tree space. The system could then further process each interpretation in parallel, and, based on interaction with the user, decide at run-time how to prune the search space, rather than rely on corpus-determined thresholds.

In EQUAL, determining the syntactic units within a question is performed via a shallow parsing mechanism which uses certain classes of words (prepositions, relative pronouns, conjunctions, adverbs, verbs) as delimiters. EQUAL uses a simplified sequential model of the general architecture (see Section 2), allowing it to stop when it finds a semantic interpretation which yields results. The system is based on a simple rule-driven approach which was developed using the GikiCLEF 2009 training question set.

The first step in parsing is to split the question into two parts: the domain of valid candidates (a SELECT clause) and the filtering criterion (the constraints, a WHERE clause). This is achieved by removing the introductory phrase (List..., Name...) and identifying the first verb or pronoun, which is considered to be the delimiter. A minimal sketch of this step is given below.
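The following Python sketch illustrates this delimiter-based split. The introductory phrases and delimiter words listed here are invented for the example; they are not EQUAL's actual resources.

```python
import re

# Hypothetical delimiter-based split of a question into a SELECT part
# (the domain of valid candidates) and a WHERE part (the filters).
# The word lists below are illustrative, not EQUAL's actual resources.
INTRO = re.compile(r"^(list|name|which|what|give me)\s+", re.IGNORECASE)
DELIMITERS = {"who", "that", "which", "where", "when", "with",
              "born", "located", "won", "visited", "living"}

def split_question(question: str) -> tuple[str, str]:
    """Return (select_clause, where_clause) for a list question."""
    text = INTRO.sub("", question.strip().rstrip("?."))
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in DELIMITERS:          # first verb/pronoun delimiter
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    return text, ""   # no filter found: the whole span is the domain

print(split_question("Name Romanian writers who were living in USA in 2003"))
# ('Romanian writers', 'who were living in USA in 2003')
```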
In the second step, the two parts are processed differently. The domain clause is mapped to the most relevant category (the EAT constraint) using as many words as possible. If the match is not straightforward, then the longest match containing the first noun in plural form is identified, and the remaining words are mapped to an additional constraint. The filtering criterion is split into sub-filters using prepositions. The type of the sub-filters (i.e. the semantic constraint) and the way they are linked are dealt with by the semantic interpretation module (see below).

3.2 Semantic Interpretation and Constraints

This processing phase assigns semantic interpretations to syntactic units. This means dealing with ambiguity in the context of the semantic constraints that the system 'understands', and the way that the relevant information is represented in Wikipedia and any associated data sources. The main ambiguities that are dealt with during this phase are:

• referential ambiguity: distinct entities share the same name and need to be resolved correctly ("Which American writers were born in Paris?" – Paris can refer to several entities commonly known as Paris: people, cities, characters; depending on which one is used, the retrieved results can be different);
• structural ambiguity: the same constraint can be attached to more than one head ("Which countries have won a futsal European championship played in Spain?");
• type ambiguity: the constraint can be interpreted in more than one way ("What novels did Orwell publish before 1984?" – temporal constraint = year; entity constraint = novel).

The system tries to identify a semantic constraint and to find the correct dependent for each syntactic constituent. Where there is ambiguity, the distinct interpretations are addressed separately. The system examines each syntactic constituent to determine its possible semantic type. For example, phrases such as "greater than" or "more than" could trigger a numeric property filter, while the preposition "in" marks both geographical and temporal constraints. This compatibility is not stored beforehand, but is embedded in the actual constraint verifiers used.

4 Answer Extraction

The GikiCLEF question set is composed of entity list questions. While these questions are regarded as more difficult than factoid ones (most factoid questions can be rephrased as list questions which have exactly one answer), they have two important characteristics which make them accessible:

1. encyclopaedic domain — since all questions ask about entities represented in Wikipedia, certain assumptions regarding the "domain" can be made;
2. structural uniformity — the way the questions are formulated is syntactically uniform, most questions conforming to a limited set of patterns.

4.1 Data Representation

In the GikiCLEF competition, identifying answers requires (with very few exceptions) extracting information from more than one document, using basic notions of structure (tables, infoboxes, layout) and interlinking. As mentioned in Section 2, Wikipedia is viewed as a graph represented using the three concepts: nodes, relations and properties, and the distinction is made between articles (entities, redirects, disambiguation and lists), categories (classification) and templates (semi-structured format for representing object properties).
From the redirect and disambiguation pages, entity aliases are extracted to be used in the question analysis. The articles describing entities are indexed and the links between them are stored in a database, allowing us to query the set of pages that are linked to/from a given article. The categories and their links are stored in a separate table for navigating the category DAG. Apart from the straightforward query for the set of categories of a page X, EQUAL can also check if a given page X is directly or indirectly within a category Y, determining, level by level, the pages indirectly connected to category Y, with or without transitivity filtering. Transitivity filtering refers to categories which do not have a plural noun in their title, or which have a corresponding article with exactly the same name (e.g. London). This rule is designed to limit the recursive exploration of the Wikipedia graph. The text of the articles was indexed because some constraint verifiers need to extract information from the text at query time.

The category folksonomy is used to derive type information for entities. Being a folksonomy, the relations between categories are not specified, which results in ambiguities. There are both type and topic categories, but this distinction is not explicit in the mark-up. Although most of the time a direct link from an article to a category denotes an rdf:type relation, precautions should be taken when assuming transitivity. More advanced methods can be employed to disambiguate these links (see dbpedia and Yago), but EQUAL only uses a simple rule which limits transitivity: subcategories are processed in a DFS order only if no answers have been found, and categories with no plural noun in their name are skipped. The advantage of the category graph is that it exhibits various granularity levels. For example, it is possible to directly map the question span "Brazilian football players" to the category Brazilian footballers, instead of having to deduce the semantics of the individual words, such as checking which "football players" were "born in Brazil" (see Figure 2). These issues are most relevant to the EAT verifier.

4.2 Constraint Verifiers

The verifiers corresponding to the semantic constraints from Section 3 are largely those inherited from the WikipediaListQA@wlv system. The idea behind verifiers is that a generic constraint (e.g. geographic containment) can be realised in Wikipedia by different means (demonym modifiers, e.g. Nepalese; categories, e.g. Mountains of Nepal; tables, infoboxes, text). By using an adaptor design pattern, EQUAL can use several methods for the same semantic constraint. This makes it possible to exploit both information extraction algorithms and external datasets. For example, a geographical reasoning Web service can be used to evaluate relations such as containment, neighbouring, distance, and others, while the dbpedia dataset can also be used to extract properties of entities.

The constraint verifiers in EQUAL are rather rudimentary and only implement two methods: create, which creates a new object via a factory method using a span from the question, and apply, which tests whether an entity satisfies the particular semantic constraint. These verifiers are quite general and offer a good balance between accuracy and coverage, but more specialised implementations could perform better. A sketch of this interface is given below.
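As a rough illustration, the following Python sketch combines the create/apply interface with the plural-noun rule that limits category transitivity. All names, the plural heuristic, and the toy data are assumptions made for the example, not EQUAL's actual code.

```python
# Sketch of the two-method verifier interface (create/apply) and of the
# simple transitivity-limiting rule for category traversal. Illustrative
# names and data; not taken from the EQUAL code base.
from abc import ABC, abstractmethod

class ConstraintVerifier(ABC):
    @classmethod
    @abstractmethod
    def create(cls, span: str, graph: dict) -> "ConstraintVerifier":
        """Factory method: build a verifier from a question span."""

    @abstractmethod
    def apply(self, page: str) -> bool:
        """Test whether a candidate page satisfies the constraint."""

def has_plural_noun(title: str) -> bool:
    # Crude stand-in for the "plural noun in the title" test.
    return any(word.endswith("s") for word in title.split())

class CategoryVerifier(ConstraintVerifier):
    """Checks whether a page sits directly or indirectly under a category,
    skipping subcategories whose titles lack a plural noun."""

    def __init__(self, category: str, subcats: dict, members: dict):
        self.category = category
        self.subcats = subcats    # category -> child categories
        self.members = members    # category -> member pages

    @classmethod
    def create(cls, span: str, graph: dict) -> "CategoryVerifier":
        return cls(span, graph["subcats"], graph["members"])

    def apply(self, page: str) -> bool:
        frontier, seen = [self.category], set()
        while frontier:                      # depth-first descent
            cat = frontier.pop()
            if cat in seen:
                continue
            seen.add(cat)
            if page in self.members.get(cat, ()):
                return True
            frontier += [c for c in self.subcats.get(cat, ())
                         if has_plural_noun(c)]
        return False

graph = {"subcats": {"Brazilian footballers": ["Brazilian football forwards"]},
         "members": {"Brazilian football forwards": {"Ronaldo"}}}
verifier = CategoryVerifier.create("Brazilian footballers", graph)
print(verifier.apply("Ronaldo"))   # True
```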
The reason for not adding new verifiers is that a manual approach is not scalable, and improvements could be the result of over-fitting the actual questions, i.e. solving the test, not the task. Each verifier is discussed below.

Entity constraint. Filtering a set of entities based on a relation with a given entity X is enforced by checking which of the articles mentioned by X are part of the initial set, and which of the remaining articles mention X. EQUAL uses both the links table and the article text to identify mentions, to account for cases where the link was not marked by the Wikipedia editors.

Relation constraint. Identifying mentions between pages is a high-recall step. EQUAL must then confirm that the context of the mention is 'compatible' with the constraint identified in the question. Initial experiments revealed that in many cases testing the compatibility needs more than synonym expansion or WordNet similarity measures. Therefore this constraint was ignored for the GikiCLEF test set. This means that in questions such as GC09-01 and GC09-09, EQUAL answers a slightly different question than the one asked. The effects on performance vary, but usually amount to a decrease in precision.

Temporal constraint. In questions that have a time constraint, this is applied to further filter the relation constraint. The implementation uses a temporal annotator developed for the QALL-ME project (http://qallme.fbk.eu/) to transform temporal expressions into time intervals, and then tests if the context of the mention is compatible with the constraint from the question. This approach suffers from the same problems as above. Therefore, for GikiCLEF, EQUAL used a simpler test: if the constraint is one year, then it must be textually present somewhere in the article's content. All other forms of temporal constraints are ignored, even if in other interpretations the question span could be considered an entity constraint.

Property constraint. EQUAL uses both the infobox content and the article text to extract numeric values. The comparatives and other quantifiers from the question are respected. If the noun denoting the property is not found or a numeric value cannot be extracted, the constraint is not satisfied and the candidate article is discarded.

Figure 2: Which Brazilian football players play in clubs in the Iberian Peninsula?

Geographic constraint. This constraint makes use of a list of demonyms extracted from Wikipedia and a small set of patterns to determine whether a given entity is part of a category such as "ncp of/from/in Country" (e.g. "Mountains of Nepal") or "Demonym ncp" (e.g. "Romanian poets"), where ncp stands for a common noun in plural form. This strategy is also used when searching for the best category match for the EAT. If such a pattern does not apply, EQUAL interprets this as an entity constraint, as a fall-back.

Introduction constraint. A list of interrogative pronouns (e.g. which), imperative verbs (e.g. name) and declarative constructions (e.g. I'm interested...), manually compiled from the QALL-ME benchmark [1], is used to identify the spans at the beginning of the question. These spans can be considered to mark list questions, but in the context of GikiCLEF this is redundant information and these spans are ignored in all subsequent processing.

5 Discussion

Processing some of the questions requires query-time verifier evaluation for a large number of articles, which is rather inefficient.
This is especially true when the EAT category is much more general than the answers sought, because exploring its sub-articles yields a very large set of candidate articles. Questions which involve more than one set of entities can also be very slow to process. If the system did not return any answers after 30 minutes, the execution was stopped. This happened for roughly half of the 25 questions which EQUAL did not answer. To address this issue, two strategies are currently being investigated: either using an RDF store for faster execution of some patterns, or selecting a better entry point in Wikipedia using spreading activation.

Processing English seems to confer some advantages, due to the larger size of the English Wikipedia. EQUAL found 69 correct answers out of a total of 138 answers for 25 of the 50 questions for English, which gives a precision of 50% and a score of 34.5. By simply mapping these results to all the other languages using the interwiki links, the cumulative results total 385 correct answers out of 813 for the same questions in all the languages (precision 47%, score 181.93). While EQUAL ranked first in GikiCLEF, the scores are less impressive in terms of recall: the system identified one third of the complete correct answer pool, which contains all the correct answers from all the systems. An analysis is under way to investigate why so many answers were missed. This could prove crucial in deciding which enhancements should be addressed first.

Related Research – NLIs and ontologies

The methodology employed shares similarities with natural language interfaces (NLIs) to databases and ontologies. The PRECISE system [3] uses a graph-flow algorithm to map tokens from the query to the lexicon of relations, attributes and values of a database. Given the constraints imposed by the database schema, the validity of a question interpretation can easily be established. The authors show "how questions that can be mapped to non-recursive Datalog clauses are semantically tractable." EQUAL has a much broader coverage and its mechanisms are different; however, recursion needs to be investigated. Likewise, the formalisms behind ontologies (e.g. description logics, OWL) are not present in the large RDF repositories, since these use general purpose terminologies meant to facilitate data reuse rather than advanced inference.

Amongst the projects that employ information extraction techniques to extract semantic relations from Wikipedia, the most relevant to this work is WikiTaxonomy [6], which extracts explicit and implicit relations by splitting the category title. For example, from Category:Movies directed by Woody Allen, relations such as directedBy and type(movie) are extracted. A total of 3.4 million type and 3.2 million spatial relations (e.g. locatedIn), along with 43,000 memberOf relations and 44,000 other relations such as causedBy and writtenBy, were extracted. This is similar to the geographical verifier used in EQUAL and to the methods the system employs to identify the best category for the EAT mapping.

6 Conclusions and future work

This paper presented an entity-centric architecture for semantic QA and a prototype system, EQUAL. At the core of the architecture is a set of semantic constraints which are used to represent the possible interpretations of questions. When extracting the answers for each interpretation, the constraints are verified using the underlying data and the set of available verifiers, and the candidate entities which satisfy all the constraints are selected.
The system performed very well in the GikiCLEF 2009 task, ranking first among 8 systems: it achieved a precision of 47% and a relative recall of 32%, proving adequate for the task. Further improvements are required in order to increase recall and to reduce the runtime complexity of the method. Possible ways of improving EQUAL include adding verifiers focused on more reliable processing of text within articles, automatically learning constraint verifiers using seed data, and adding quantifier semantics. In addition, creating a question corpus to investigate training different parsing models, or using an RDF store as the underlying model, could prove beneficial. In terms of the general semantic QA architecture, it is necessary to implement the feedback module which would make EQUAL an interactive QA system, and also to devise ranking models for combining the results of several interpretations/verifiers, in order to present the user with the most relevant information.

Acknowledgements

The development of the EQUAL system was partly supported by the EU-funded project QALL-ME (FP6 IST-033860).

References

[1] Elena Cabrio, Milen Kouylekov, Bernardo Magnini, Matteo Negri, Laura Hasler, Constantin Orasan, David Tomas, Jose Luis Vicedo, Guenter Neumann, and Corinna Weber. The QALL-ME Benchmark: a Multilingual Resource of Annotated Spoken Requests for Question Answering. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.

[2] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In ACL, pages 423–430, 2003.

[3] Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. Towards a theory of natural language interfaces to databases. In IUI '03: Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 149–157, New York, NY, USA, 2003. ACM.

[4] Diana Santos, Nuno Cardoso, Paula Carvalho, Iustin Dornescu, Sven Hartrumpf, Johannes Leveling, and Yvonne Skalban. Getting geographical answers from Wikipedia: the GikiP pilot at CLEF. In Francesca Borri, Alessandro Nardi, and Carol Peters, editors, Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark, September 2008. CLEF 2008 Organizing Committee.

[5] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A Core of Semantic Knowledge. In 16th International World Wide Web Conference (WWW 2007), New York, NY, USA, 2007. ACM Press.

[6] Cäcilia Zirn, Vivi Nastase, and Michael Strube. Distinguishing between Instances and Classes in the Wikipedia Taxonomy. In ESWC, pages 376–387, 2008.