Conversational Question Answering Using a Shift of Context

Nadine Steinmetz, TU Ilmenau, Germany, nadine.steinmetz@tu-ilmenau.de
Bhavya Senthil-Kumar∗, TU Ilmenau, Germany, bhavya.tharayil@gmail.com
Kai-Uwe Sattler, TU Ilmenau, Germany, kus@tu-ilmenau.de

ABSTRACT
Recent developments in conversational AI and speech recognition have seen an explosion of conversational systems such as Google Assistant and Amazon Alexa, which can perform a wide range of tasks such as providing weather information or making appointments, and which can be accessed from smart phones or smart speakers. Chatbots are also widely used in industry for answering employee FAQs or for providing call center support. Question Answering over Linked Data (QALD) is a field that has been intensively researched in recent years, and QA systems have been successful in implementing a natural language interface to DBpedia. However, these systems expect users to phrase their questions completely in a single shot. Humans, on the other hand, tend to converse by asking a series of interconnected questions or follow-up questions in order to gather information about a topic. With this paper, we present a conversational speech interface for QA, where users can pose questions in both text and speech to query DBpedia entities and converse in the form of a natural dialog by asking follow-up questions. We also contribute a benchmark for contextual question answering over Linked Data consisting of 50 conversations with 115 questions.

∗ Work was done during master studies at the institute.
© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
With the increasing development of speech interfaces and chatbots, conversations with a machine become more and more common to us. In this way, we are able to ask everyday life questions or communicate with a customer service without talking to a real person. But most of these interfaces are either trained for a very specific domain or limited regarding the type of questions. Question Answering (QA) systems based on knowledge graphs aim to answer questions from the complete knowledge represented by the graph. Of course, these systems might be specific regarding domains, but as they are tuned to graph patterns, they also work on all-purpose knowledge bases, such as DBpedia or Wikidata. Over the last 10 years, the QALD challenge (Question Answering over Linked Data) provided several datasets for the purpose of evaluating QA systems. Participating systems showed the ability to understand natural language and transform it to a formal query language, specifically SPARQL, in order to provide the answer to the user. These systems require complete sentences or questions to be able to process them further. In contrast to that, conversations often start with a complete sentence or question, and further questions are built upon the context of the preceding questions and answers. Follow-up questions might only consist of a fragment of a sentence and must be completed in mind. This means the conversation partner requires context information. For humans, this also works when the context slightly changes to a different topic that still has minor touch points to the previous course of conversation. With this paper, we present an approach that is able to handle slight shifts of context. Instead of staying within a certain node distance within the knowledge graph, we analyze each follow-up question for context shifts.
   The remainder of the paper is organized as follows. In Sect. 2 we discuss related approaches from QA and conversational QA. The processing of questions and the resolving of context is described in Sect. 3. We have implemented this approach in a prototype which is evaluated using a newly created benchmark. The results of this evaluation, presented in Sect. 4, demonstrate that our approach is capable of sustaining several types of conversations. Finally, we conclude the results in Sect. 5 and point out future work.

2 RELATED WORK
Our proposed system is able to handle speech as well as textual input for incoming questions. But the focus of our proposed approach is the correct interpretation of conversations within a context. Therefore, we limit the related work discussion to research approaches on conversational QA.
   Dhingra et al. proposed KB-InfoBot, a fully neural end-to-end multi-turn dialogue agent for the movie-on-demand task [2]. The agent is trained entirely from user feedback. It consists of a belief tracker module which identifies user intents, extracts associated attributes and tracks the dialogue state in a Soft-KB Lookup component which acts as an interface with the movie database to query for relevant results. Subsequently, the state is summarized into a vector and a dialogue policy selects the next action based on the current dialogue state. The policy is entirely trained on dialogues. The authors proposed a probabilistic framework for computing the posterior distribution of the user target over a knowledge base, termed a Soft-KB lookup. This distribution is constructed from the agent's belief about the attributes of the entity being searched for. The dialogue policy network, which decides the next system action, receives as input this full distribution instead of a handful of retrieved results. They show in their experiments that this framework allows the agent to achieve a higher task success rate in fewer dialogue turns.
   Microsoft XiaoIce ("Little Ice" in Chinese) is a social chatbot primarily designed to be an AI companion with which users can form long-term emotional connections [9]. This ability to establish long-term emotional connections with its human users distinguishes XiaoIce from other social chatbots as well as from popular conversational AI personal assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana and Google Assistant. XiaoIce has attracted over 660 million active users since its launch in May 2014 and has developed more than 230 skills, ranging from question answering and recommending movies or restaurants to storytelling and comforting. The most sophisticated skill is Core Chat, where it engages in long and open-domain conversations with users. In addition to having a sufficiently high IQ required to acquire a range of skills to complete specific tasks, social chatbots also require a high EQ to meet users' emotional needs such as affection and social belonging. The core of XiaoIce's system design is this integration of IQ (content-centric) and EQ (user-centric) to generate contextual and interpersonal responses to form long-term connections with users.
   Just recently, Vakulenko et al. presented an approach to rewrite follow-up questions in conversations in order to be able to find the correct answer [8]. Their proposed model employs a unidirectional Transformer decoder. The model predicts output tokens according to an input sequence, references to a context and the context itself. The training is performed via a teacher forcing approach. After rewriting the input question, the question answering process consists of the retrieval of relevant text passages, the extraction of potential answers to the input question and the ranking of the answers. The authors evaluated their approach on two different datasets (QuAC1 and TREC CAsT2). In contrast to this approach, we are dealing with Semantic Question Answering, and our procedure to identify the context of a conversational question is based on the underlying knowledge graph, specifically DBpedia.
   Reddy et al. introduced CoQA, a dataset for Conversational Question Answering systems consisting of a total of 127,000 questions along with their answers from conversations about text passages covering seven unique domains [5]. The questions are sourced using Amazon Mechanical Turk (AMT) by pairing two annotators, a questioner and an answerer, on a passage through the ParlAI MTurk API [4]. Every turn in the conversation contains a question and an answer along with its rationale. It can be observed that around 30.5 percent of the questions of the CoQA dataset do not rely on coreference with the conversation history and can be answered on their own. Almost half of the questions (49.7 percent) contain explicit coreference markers such as he, she, it etc. They either refer to an entity or an event introduced previously in the conversation. The remaining 19.8 percent do not contain any explicit coreference markers but rather refer to an entity or event implicitly.
   All approaches and datasets described above are based on unstructured textual sources. In contrast, our approach is based on a knowledge graph and takes into account the semantic information of the question. Christmann et al. proposed an approach for a conversational question answering system based on knowledge graphs [1]. Their system, CONVEX, examines the context of an input question within the knowledge graph to be able to answer follow-up questions. Based on the edges and nodes around the focus of the first question, the conversation is disambiguated. The authors claim that topic changes are rare and therefore only investigate the direct context of the focus entity of the first question. The evaluation benchmark has been sourced via Amazon Mechanical Turk and consists of around 11,200 conversations based on the Wikidata knowledge graph. In contrast to CONVEX, our system is able to handle topic changes, for instance, when the focus entity changes but the predicate stays the same.

1 https://quac.ai/
2 http://www.treccast.ai/

3 METHOD & IMPLEMENTATION
We introduce our system architecture as well as the processing steps during a conversation in detail in the next sections.

3.1 System Architecture
Figure 1 shows the proposed system architecture. The speech interface is developed using Dialogflow, Google's development suite for creating dialogue systems on websites, mobile devices and IoT devices. Dialogflow is an AI-powered tool that uses machine learning models to detect the intents of a conversation. The underlying Google Speech To Text (STT) API is used to transcribe user audio input to text. The Text To Speech (TTS) API returns spoken text as an audio response back to the user. Once an intent is identified, Dialogflow sends a request to our webhook service with the transcribed text. The webhook service then handles the user's query and forwards the request to an external API, i.e. our HYDRA API3 or the DBpedia SPARQL endpoint4. The result of the query is forwarded to the Dialogflow agent which finally responds to the user with a spoken audio response. Our webhook service is the crux of the system that handles user queries. It has a context module that handles follow-up queries by resolving co-references to entities using a dialog memory.

3 HYDRA is a QA system that transforms natural language to SPARQL queries for DBpedia-based knowledge graphs, cf. [7]
4 http://dbpedia.org/sparql
Figure 1: System architecture


3.2 Processing Steps
The processing steps in answering a query are explained in detail in the following sections.

3.2.1 Intent Classification. In the first step, we identify the intent of the question. An intent represents the purpose of a user's input. We predefined several intents for each type of user request that we want to handle:
   • a greeting intent,
   • a factual intent, for first questions in conversations,
   • a follow-up intent, to continue a conversation,
   • a fallback intent, if something is wrong with the question or it cannot be answered.
For the training of the intent classification, we used a set of different sample questions. We identify follow-up queries based on certain contextual cues, such as the presence of explicit coreferences like "he", "she", "his", "her", "it's" etc. or the presence of phrases like "What about" or "how about". For each intent we can either respond with a static response, as in the case of a greeting, or define an action to be triggered. When a user writes or says something, Dialogflow categorizes and matches the intention to a predefined intent, which is known as intent classification. For example, a query such as "What is the capital of Cameroon?" would be matched to the factual query intent, and a follow-up query such as "What about China?" or "What is its population?" would trigger the follow-up query intent.

3.2.2 Webhook Service and Fulfillment. Once the intent of the user is identified and the necessary parameters are extracted, the Dialogflow agent sends a webhook request with the transcribed raw text, the intent and the parameters to the webhook service. Depending on the intent, the appropriate function is triggered to handle the query. Suppose the end user writes or speaks the query "What is the capital of Czech Republic?". Based on the identified intent, different actions are triggered:
   • A factual question intent triggers a direct API call to our QA API HYDRA.
   • In case of a follow-up intent, the question is further analyzed to retrieve the answer.
In the first case, the Dialogflow agent matches the end-user expression to the factual question intent. This triggers the direct call of the HYDRA REST interface with the query text. The HYDRA system generates the SPARQL query using a template-based approach [7]. The HYDRA service provides the answer and the respective SPARQL query. The answer is transferred to Dialogflow and provided to the user. The SPARQL query is parsed to extract the subject and predicates, and the information is stored in the output context of the payload. For every conversation turn, we store the subject and predicate values in the context memory in order to resolve them in subsequent follow-up queries.
   For the case of a follow-up intent, the question is analyzed according to the context of the preceding conversation. The actual analysis and the processing steps are described in the following sections.
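To make this routing concrete, the following minimal sketch (in Python, using Flask) illustrates a webhook handler of the kind described above. It is an illustrative sketch, not the actual implementation: the intent names, the HYDRA call and the follow-up handler are placeholders, and the request and response fields follow the general shape of a Dialogflow webhook payload.

    # Illustrative webhook sketch (assumptions: Dialogflow-style JSON payload,
    # placeholder intent names and helper stubs; not the authors' actual code).
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    DIALOG_MEMORY = {}  # session id -> {"subject": ..., "predicate": ..., "answer": ...}

    def call_hydra(question):
        # Placeholder: would call the HYDRA REST interface and parse the
        # returned SPARQL query for its subject and predicate.
        return "Prague", "dbr:Czech_Republic", "dbo:capital"

    def handle_follow_up(question, memory):
        # Placeholder: context detection and resolution (see Section 3.2.3).
        return "answer"

    @app.route("/webhook", methods=["POST"])
    def webhook():
        req = request.get_json(force=True)
        query = req["queryResult"]["queryText"]
        intent = req["queryResult"]["intent"]["displayName"]
        session = req.get("session", "default")

        if intent == "factual-question":
            answer, subject, predicate = call_hydra(query)
            # store the conversation turn in the dialog memory
            DIALOG_MEMORY[session] = {"subject": subject,
                                      "predicate": predicate,
                                      "answer": answer}
        elif intent == "follow-up-question":
            answer = handle_follow_up(query, DIALOG_MEMORY.get(session, {}))
        else:
            answer = "Sorry, I cannot answer that question."

        return jsonify({"fulfillmentText": answer})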
                                                                     create custom (so-called) entities to train the identification within
   3.2.3 Context Detection in Follow-Up Questions. For the han-
                                                                     natural language. Here, we create a custom entity using the DB-
dling of a follow-up question, several steps are required. We
                                                                     pedia properties and the phrases from the PATTY dataset. The
identify named entities and detect the type of context shift. For
                                                                     Dialogflow agent is trained on this custom entity and we can use
our approach, a shift of context can occur in two different ways:
                                                                     this custom entity to recognize the phrases in the user utterance
    • Either the focus (named entity (NE)) of the previous ques-     and map them to their corresponding DBpedia properties. For ex-
      tion changes,                                                  ample the phrases “Who was he married to”, “Who is his spouse”
    • or the follow-up question asks for a different relation-       etc. are both mapped to the DBpedia property dbo:spouse and
      ship/predicate.                                                stored as a parameter in the webhook request payload and can
Consider the factual question What is the official language in       then be accessed in the webhook service handler.
Germany?. Here, the following types of questions are conceivable:
                                                                        Resolving context. When resolving the context of a follow-up
    • Either the follow-up question asks for the official language   question, the essential task is to identify the shift of context. This
      of a different country,                                        means, we need to analyze if the focus entity or the property
    • or the conversation continues with a question regarding a      changed. Consider the example query as shown in Figure 2. Here,
      different fact about Germany.                                  the subject is “Germany” and “official language” refers to the
Any other question would be considered to be a new factual           property. Now consider the follow-up query “What about China?”
question and the start of a new conversation. Therefore, we          is being asked. After the NEL and PI processing steps, the named
identify the shift of context and continue to generate the new       entity for the phrase “China” is identified so we know that this
SPARQL based on the context that maintains compared to the           is the new focus the user is referring to, but no properties are
previous question.                                                   identified. Therefore, we need to resolve this from the predicate
   Thus, our approach for context detection in follow-up ques-       mentioned in the previous turn in the conversation. We fetch the
tions consists of three sub-steps:                                   property dbo:language from the dialog memory and reformulate
    • named entity linking (NEL),                                    5 https://www.dbpedia-spotlight.org/api
    • predicate identification (PI), and                             6 https://cloud.google.com/dialogflow/es/docs/entities-custom
                                       Figure 2: Sample conversation for a change of the property.
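The two annotation sub-steps can be illustrated with a short sketch. The Spotlight endpoint and response fields used below are the publicly documented ones, and the small phrase table is only a stand-in for the PATTY-derived custom entity, not the actual mapping used in the prototype.

    # Sketch of the NEL and PI sub-steps (the phrase table is a tiny stand-in
    # for the PATTY-based custom entity; not the actual implementation).
    import requests

    SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

    PROPERTY_PHRASES = {           # excerpt of phrase -> DBpedia property
        "married to": "dbo:spouse",
        "spouse": "dbo:spouse",
        "official language": "dbo:language",
        "national anthem": "dbo:anthem",
    }

    def link_entities(text, confidence=0.5):
        """Named entity linking (NEL) via DBpedia Spotlight."""
        resp = requests.get(SPOTLIGHT_URL,
                            params={"text": text, "confidence": confidence},
                            headers={"Accept": "application/json"},
                            timeout=10)
        resp.raise_for_status()
        return [r["@URI"] for r in resp.json().get("Resources", [])]

    def identify_predicate(text):
        """Predicate identification (PI) via simple phrase matching."""
        lowered = text.lower()
        for phrase, prop in PROPERTY_PHRASES.items():
            if phrase in lowered:
                return prop
        return None

    print(link_entities("What about China?"))        # e.g. ['http://dbpedia.org/resource/China']
    print(identify_predicate("Who is his spouse?"))  # dbo:spouse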


   Resolving Context. When resolving the context of a follow-up question, the essential task is to identify the shift of context. This means we need to analyze whether the focus entity or the property changed. Consider the example query as shown in Figure 2. Here, the subject is "Germany" and "official language" refers to the property. Now consider that the follow-up query "What about China?" is being asked. After the NEL and PI processing steps, the named entity for the phrase "China" is identified, so we know that this is the new focus the user is referring to, but no properties are identified. Therefore, we need to resolve this from the predicate mentioned in the previous turn of the conversation. We fetch the property dbo:language from the dialog memory and reformulate the same query by replacing the subject with the new subject dbr:China7. Now consider the case that the follow-up question would have been "What about its national anthem?". Here, the new property for the phrase "national anthem" is identified, but it is required to keep the focus entity of the previous question. We reformulate the query with the new property dbo:anthem and the subject dbr:Germany.

7 dbr refers to http://dbpedia.org/resource/

Figure 2: Sample conversation for a change of the property.
Figure 3: Sample conversation for a change of the focus entity.

   In this manner, at each turn of the conversation we need to identify the presence of new entities or properties and resolve the missing part from the dialog memory. However, the follow-up question could also refer to the answer of the previous question instead of the focus entity. In order to resolve the coreferences correctly, we rely on rules based on the question type of the first question and the type of the follow-up question as follows:
   • WHO: If the first question is of type WHO, we know that the expected answer type is a person. Therefore, if a WHO question is followed by a coreference "he"/"she", as in "When was she born?", we resolve the subject to the answer of the first question. If the same question is followed by a coreference "it", such as "What is its capital?", we resolve the coreference to the subject of the first question, e.g., "Germany".
   • WHAT / WHICH: If the first question is of type WHAT or WHICH, the expected answer can be a city, currency, number etc. For example, "What is the capital of Czech Republic?". If this is followed by a question such as "What about its currency?", the coreference "its" is resolved to the subject of the first question, which is "Czech Republic".
   • WHEN: For the question type WHEN, the expected answer is a date, for example "When was Princess Diana born?". A follow-up question with a coreference "she", as in "And when did she die?", is always resolved to the subject identified in the first question, "Princess Diana".
   • WHERE: Here the expected answer is a location, for example "Where was Bach born?", and similar to the case of WHEN questions, coreferences in subsequent follow-up questions are always resolved to the subject of the first question, in this example "Bach".
   • HOW: The expected answer type of the question type HOW is a number. Hence, the coreferences in the follow-up queries are resolved to the subject of the first question. For example, for the question "How many languages are spoken in Turkmenistan?" followed by the coreference "it" in the follow-up question "What is its official language?", the coreference will be resolved to "Turkmenistan".
These rules are based on our inferences from the questions in our evaluation benchmark, consisting of 50 sets of conversations, each of which refers to a question in the QALD-9 evaluation dataset (cf. Section 4.1).
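The following sketch summarizes this resolution logic under the rules stated above. The data structures, the simple pronoun test and the function signature are illustrative assumptions, not the exact implementation.

    # Sketch of the context resolution (CR) step: fill the missing part of the
    # follow-up question from the dialog memory (illustrative only).
    ANSWER_PRONOUNS = {"he", "she", "his", "her"}

    def resolve_follow_up(entities, predicate, memory, question_type, follow_up):
        """Return (subject, predicate) for the follow-up question.

        entities      -- entity URIs found by NEL (may be empty)
        predicate     -- property found by PI (may be None)
        memory        -- {"subject", "predicate", "answer"} of the previous turn
        question_type -- WHO / WHAT / WHICH / WHEN / WHERE / HOW of the first question
        """
        words = {w.strip("?,.'").lower() for w in follow_up.split()}

        if entities and predicate is None:
            # Shift of focus, e.g. "What about China?": keep the previous predicate.
            return entities[0], memory["predicate"]
        if predicate and not entities:
            # Shift of predicate, e.g. "What about its national anthem?".
            if question_type == "WHO" and words & ANSWER_PRONOUNS:
                # "Who is ...?" followed by "When was she born?" refers to the answer.
                return memory["answer"], predicate
            # All other cases (including "it"/"its") fall back to the first subject.
            return memory["subject"], predicate
        if entities and predicate:
            # Both a new entity and a new predicate: treat as a new factual question.
            return entities[0], predicate
        raise ValueError("follow-up question cannot be resolved from the context")

    memory = {"subject": "dbr:Germany", "predicate": "dbo:language", "answer": "German"}
    print(resolve_follow_up(["dbr:China"], None, memory, "WHAT", "What about China?"))
    # -> ('dbr:China', 'dbo:language')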
   Asking for Clarifications. There are some cases where the follow-up question may refer to the answer of the first question rather than to the subject of the first question. For example, consider the question "Who is the spouse of Barack Obama?" followed by the follow-up question "And at which school did she study?". In this case, the focus of the first question as well as the answer are of the same type. We need to resolve the ambiguity by asking the user for clarification. For our first prototype, we have implemented this for entities of type dbo:Person. That means, whenever the focus of the question as well as the answer to the question both refer to named entities of type dbo:Person, we resolve it by asking the user which entity they actually meant. We have implemented this using a suggestion list, where users are presented a list of options containing the two entities mentioned and can confirm by selecting the appropriate entity.8 Figure 4 shows the conversation which tries to answer the QALD question "What is the name of the school where Obama's wife studied?". The focus of the first query is "Barack Obama", and the system returns the spouse of Barack Obama, i.e., "Michelle Obama", as the answer. Both entities are saved in the dialog memory for future reference. When the user asks the follow-up query "Which school did she/he study at?", the system asks the user for clarification in order to resolve the ambiguity between the two entities9. The user is now presented with two options as a suggestion list and confirms by clicking on one of them. Finally, the response for the selected query is presented to the user.

8 Cf. the demo video for a clarification case: https://zenodo.org/record/4091782/files/Barack_Obama.mov
9 The system cannot resolve the gender-based co-reference of the relevant named entities yet. But DBpedia provides the property foaf:gender, which could be used to identify the gender of all entities involved.

Figure 4: Asking for Clarification
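A compact way to express this check, under the assumption of a type lookup against DBpedia (stubbed here with the running example), is the following sketch.

    # Sketch of the clarification rule: if both the focus of the first question
    # and its answer are of type dbo:Person, the coreference is ambiguous and
    # the user is asked to choose (rdf_type() is a stub, not a real lookup).
    def rdf_type(entity):
        # Placeholder for an rdf:type lookup against DBpedia.
        return {"dbr:Barack_Obama": "dbo:Person",
                "dbr:Michelle_Obama": "dbo:Person"}.get(entity, "dbo:Thing")

    def needs_clarification(memory):
        return (rdf_type(memory["subject"]) == "dbo:Person"
                and rdf_type(memory["answer"]) == "dbo:Person")

    memory = {"subject": "dbr:Barack_Obama", "answer": "dbr:Michelle_Obama"}
    if needs_clarification(memory):
        # In the prototype, both entities are offered as a suggestion list.
        print("Did you mean", memory["subject"], "or", memory["answer"], "?")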


   3.2.4 Query Execution and Response. Once all entities are resolved, the SPARQL query is formulated by replacing the subject and property in the general SPARQL template. Then the SPARQL query is sent to the SPARQL endpoint for execution. The results from the SPARQL endpoint are usually either a text value, such as a number, or a DBpedia resource URI. We then formulate the response as a complete sentence. For this, we use a template of the format "The property of subject is result". This text response is converted to speech using Google Text To Speech (TTS) and presented as spoken output to the user. If the result of the SPARQL query is a DBpedia resource, then we execute an additional SPARQL query to obtain the thumbnail associated with that resource, if any10.

10 We prepared video demos for some sample conversations: https://doi.org/10.5281/zenodo.4091782
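A minimal sketch of this step, assuming a single generic template with a subject slot and a property slot and the public DBpedia endpoint, could look as follows; the labels passed to the verbalization template are placeholders for the values kept in the dialog memory.

    # Sketch of query execution and response generation (assumption: one
    # generic SPARQL template; not the exact template used by the prototype).
    import requests

    SPARQL_ENDPOINT = "http://dbpedia.org/sparql"
    QUERY_TEMPLATE = "SELECT ?result WHERE {{ <{subject}> <{prop}> ?result }}"

    def answer(subject_uri, prop_uri, subject_label, prop_label):
        query = QUERY_TEMPLATE.format(subject=subject_uri, prop=prop_uri)
        resp = requests.get(SPARQL_ENDPOINT,
                            params={"query": query, "format": "json"},
                            timeout=15)
        resp.raise_for_status()
        bindings = resp.json()["results"]["bindings"]
        if not bindings:
            return "Sorry, I could not find an answer."
        result = bindings[0]["result"]["value"]
        # Verbalization template: "The property of subject is result."
        return "The {} of {} is {}.".format(prop_label, subject_label, result)

    print(answer("http://dbpedia.org/resource/Czech_Republic",
                 "http://dbpedia.org/ontology/currency",
                 "Czech Republic", "currency"))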
   Figure 5 shows an example conversation which answers the question "Who was the doctoral supervisor of Albert Einstein?" from the QALD-7 train dataset in a series of follow-up questions, and Figure 6 shows another sample conversation that answers the question "What is the currency of the Czech Republic?" in a series of follow-up questions.

Figure 5: Example query and response (QALD-7-train, question id:167)
Figure 6: Example query and response (QALD-7-train, question id:169)

4 EXPERIMENTS
In this section, we describe experiments to evaluate our system and discuss the results. Section 4.1 introduces our evaluation benchmark. Sections 4.2.1 and 4.2.2 show the evaluation results of our system against the benchmark. In Section 4.2.3, we take a detailed look at the separate processing steps and their failure rates. Section 4.2.4 describes the results based on different question types and discusses the reasons for failures in each step. We discuss the results of our experiments in Section 4.3.

4.1 Evaluation Benchmark
To the best of our knowledge, benchmarks for conversational QA anchored in DBpedia do not exist11. Therefore, we introduce a benchmark of 50 conversations inspired by the QALD datasets. Each conversation consists of a direct question followed by one or two follow-up questions. Overall, the dataset consists of a total of 115 individual questions12. Each conversation corresponds to a question from the QALD-9 dataset.

11 Christmann et al. [1] and Saha et al. [6] just recently published their own benchmarks for conversational QA on knowledge graphs, but they are both based on Wikidata.
12 The dataset is available at https://doi.org/10.5281/zenodo.4091791
                                                                                    analysis of the questions and the shift of context is performed
4.2 Evaluation Results
Our proposed system operates completely automatically for the given questions. Of course, in case of ambiguous follow-up questions a human interaction is required and requested, but the analysis of the questions and of the shift of context is performed completely autonomously. Hence, the evaluation is executed in this manner. Section 4.2.1 and Section 4.2.2 describe our system's results based on precision and recall, overall and in detail for follow-up questions. In Section 4.2.3, the separate processing steps of our system are evaluated; in addition, the reasons for a failure in each step are discussed. In Section 4.2.4, we evaluate the system based on the different types of follow-up questions and discuss the reasons for failures.

   4.2.1 Overall Precision and Recall. We evaluated the overall quality of our system using the measures precision and recall. Recall is defined as the ratio of the number of correct answers provided by the system to the number of gold standard answers with respect to a single question q. In other words, recall is the ratio of the intersection of relevant and retrieved answers to the relevant answers. Precision is defined as the ratio of the number of correct answers provided by the system to the number of all answers provided by the system. Precision can also be defined as the ratio of the intersection of relevant and retrieved answers to the retrieved answers.
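Written compactly (with A_q denoting the set of answers the system returns for a question q and G_q the gold standard answers for q; this notation is introduced here only for brevity), these definitions read:

    \mathrm{Recall}(q) = \frac{|A_q \cap G_q|}{|G_q|}, \qquad
    \mathrm{Precision}(q) = \frac{|A_q \cap G_q|}{|A_q|}

The micro-precision values reported in Table 1 are computed per question type as the fraction of questions answered correctly, e.g., 42/50 = 0.84 for the direct questions.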
   For the dataset mentioned above, 36 out of 50 conversations were answered completely by our system, i.e., it answered the direct question as well as all follow-up questions correctly. We did not encounter partially correct answers; either the system fails to answer a question or the answer returned was 100% accurate. Therefore, recall and precision are the same (= 0.72) for the complete dataset.

   4.2.2 Direct and Follow-up Queries. We also evaluated the direct and follow-up queries separately; the results are shown in Table 1. The evaluation benchmark consists of 50 direct questions, out of which 42 were answered correctly, and 49 out of the 65 follow-up queries were answered correctly. As a result, the micro-precision of direct queries is 0.84 and that of follow-up queries is 0.75. The overall precision, considering 91 out of the total 115 questions answered, is 0.79.

Table 1: Precision of direct and follow-up questions

   Type of Question       Number of Questions   Answered   Micro-precision
   Direct questions                        50         42              0.84
   Follow-up questions                     65         49              0.75
   Overall                                115         91              0.79

   4.2.3 Evaluation Based on Processing Steps. In this section, we present the evaluation results based on the individual processing steps involved and discuss the reasons for failure. Table 2 shows the percentage of failure for each of the processing steps. As discussed in the previous section, the system was able to answer about 80% of the questions overall; the failure rate is about 20%.

Table 2: Percentage of failure by processing step

   Processing Step              Percentage of Failure
   Speech Recognition                            1.74
   Response from HYDRA                           7.83
   Named Entity Recognition                      1.74
   Mapping of Predicates                         5.21
   Resolving Context                             4.35
   Total                                        20.86

   Response from HYDRA. The highest percentage of failure is due to an incorrect response to the direct query from HYDRA, as the failure of a direct query results in the failure of all of the subsequent follow-up queries. For some answers, the HYDRA API took too long to respond or was not able to provide an answer.
   Mapping of Predicates. We use PATTY predicates and their synonyms mapped to a custom Dialogflow entity for identifying predicates. In some cases, the system fails to map the correct predicate, which results in failure. Consider the example "When was Bach born?" from conversation id 5, followed by the query "And where?". The system fails to map "where" to "birthPlace", as this is open-ended and could also refer to a location as in "deathPlace". As a result, the follow-up query cannot be resolved.
   Resolving Context. The step involving the resolution of co-references or context in follow-up queries is a very crucial step in answering follow-up queries and has a failure percentage of 4.35. We rely on certain patterns or rules to resolve the context, and sometimes the system does not have enough knowledge to resolve them correctly. Consider the example question "What is the capital of French Polynesia?" with the follow-up question "And who is its mayor?". During context resolution, the system fails to map the coreference "its" to the correct reference (the response of the question), as it does not have the background knowledge that only entities of type city are associated with the predicate "mayor". Hence, it fails to correctly resolve the co-reference in the follow-up query.
   Named Entity Recognition. NE mentions in the follow-up queries are annotated using the DBpedia Spotlight API, which was successful in resolving most of the entities in our benchmark. However, it fails to resolve certain entities, resulting in a failure rate of 1.74. For example, in the follow-up question "And Harry Potter?" the entity is resolved to the television show with the same name and not the character Harry Potter, which results in an incorrect response.
   Speech Recognition. The Google Speech to Text (STT) API was able to transcribe most of the queries correctly and had a very low failure rate of 1.74. However, it fails to transcribe a few entities, such as "Kurosawa" and "MI6", correctly, which could also vary depending upon the accent or the way different users pronounce certain words. It performed well overall during the evaluation and was able to transcribe most of the input accurately.

   4.2.4 Evaluation Based on Question Types. In this section, we evaluate the system based on the different types of follow-up questions. The system was able to correctly answer 75% of the follow-up queries and has an overall failure rate of 24.61%. Table 3 shows the percentage of failure for each of the question types.

Table 3: Percentage of failure by follow-up question type

   Follow-up Question Type               Percentage of Failure
   WHERE                                                  3.07
   WHO/WHOM                                               7.69
   WHICH                                                  4.61
   HOW                                                       0
   WHEN                                                      0
   AND/WHAT + Entity                                       9.23
   AND/WHAT + coreference + Predicate                         0
   Total                                                 24.61

4.3 Discussion
The quality of conversational QA systems depends on the success of previous questions along the course of the conversation, and especially of the first question.
As shown in Table 2, a conversation fails when the response of the initial QA API is incorrect. Therefore, improving the underlying QA system in general remains an objective of our further research. Secondly, we identified the issue of predicate mapping when the question mentions a relationship that cannot be mapped to a predicate in the knowledge graph. The lack of alternative labels for parts of the ontology is common for most knowledge graphs: properties and ontology classes often have only one label, whereas the entities mostly have a main label and alternative labels. Therefore, further research regarding ontology enrichment is required to provide more information about the ontology properties and classes. Unfortunately, a direct comparison of results to other approaches is not feasible because of the lack of a common DBpedia-based benchmark and correspondingly published results.

5 SUMMARY & FUTURE WORK
In this paper, we present our approach to conversational QA using a shift of context. We developed a first prototype that identifies the type of context shift. Thus, a conversation consisting of several requests about specific topics / named entities can be conducted. In contrast to other recent approaches, our application is also able to handle slight changes of context, and the follow-up questions are not required to ask for facts in direct proximity of the primary sub-graph within the knowledge graph. Smart speech interfaces, such as Alexa or Siri, are able to hold a conversation only in a very limited way, for instance when asked for the weather of different cities in a sequence, whereas our approach is able to handle conversations about a wide range of domains and with slight changes of topic. In addition, we propose a novel dataset for evaluation purposes of conversational QA based on DBpedia. The dataset consists of 50 conversations and 115 questions and is publicly available. As discussed in the evaluation section, the improvement of the basic QA system is essential for the quality of a conversation. Here, we pursue the further development of our pattern-based approach to map similar natural language patterns to similar graph patterns. Future work will also include the further utilization of DBpedia context information to construct the application in a more intelligent manner: for the identification of the gender of named entities, or for the mapping of natural language phrases to DBpedia properties.

6 ACKNOWLEDGEMENTS
This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/26.

REFERENCES
[1] Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look Before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). ACM, New York, NY, USA, 729–738. https://doi.org/10.1145/3357384.3358016
[2] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. End-to-End Reinforcement Learning of Dialogue Agents for Information Access. In ACL.
[3] Max Planck Institute for Informatics. 2014. A large resource of relational patterns. https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/patty/
[4] Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A Dialog Research Software Platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Copenhagen, Denmark, 79–84. https://doi.org/10.18653/v1/D17-2014
[5] Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics 7 (March 2019), 249–266. https://www.aclweb.org/anthology/Q19-1016
[6] Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex Sequential Question Answering: Towards Learning to Converse Over Linked Question Answer Pairs with a Knowledge Graph. CoRR abs/1801.10314 (2018). http://arxiv.org/abs/1801.10314
[7] Nadine Steinmetz, Ann-Katrin Arning, and Kai-Uwe Sattler. 2019. From Natural Language Questions to SPARQL Queries: A Pattern-based Approach. In Datenbanksysteme für Business, Technologie und Web (BTW 2019), 18. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" (DBIS), 4.-8. März 2019, Rostock, Germany, Proceedings. 289–308. https://doi.org/10.18420/btw2019-18
[8] Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2020. Question Rewriting for Conversational Question Answering. arXiv:cs.IR/2004.14652
[9] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. arXiv:cs.HC/1812.08989