<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparison of Services for Intent and Entity Recognition for Conversational Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Iovine</string-name>
          <email>andrea.iovine@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Polignano</string-name>
          <email>marco.polignano@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fedelucio Narducci</string-name>
          <email>fedelucio.narducci@poliba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>pierpaolo.basile@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco de Gemmis</string-name>
          <email>marco.degemmis@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>giovanni.semeraro@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Bari</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bari Aldo Moro</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Conversational Recommender Systems (CoRSs) are becoming increasingly popular. However, designing and developing a CoRS is a challenging task since it requires multi-disciplinary skills. Even though several third-party services are available for supporting the creation of a CoRS, a comparative study of these platforms for the specific recommendation task is not available yet. In this work, we focus our attention on two crucial steps of the Conversational Recommendation (CoR) process, namely Intent and Entity Recognition. We compared four of the most popular services, both commercial and open source. Furthermore, we proposed two custom-made solutions for Entity Recognition, whose aim is to overcome the limitations of the other services. Results are very interesting and give a clear picture of the strengths and weaknesses of each solution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        CCS Concepts
• Computing methodologies → Discourse, dialogue and
pragmatics; Information extraction; • Information
systems → Recommender systems; • Human-centered
computing → Natural language interfaces;
INTRODUCTION
Conversational Recommender Systems (CoRSs) are
intelligent systems that provide personalized access to large sets
of items (e.g. e-commerce catalogues, daily news,
streaming platforms, etc.) by exposing a conversational interface.
Indeed, the distinguishing feature of a CoRS compared to a
standard recommender system is its ability to interact with
the user during the whole recommendation process.
Chat-based interfaces have often been proposed as a way to bring
interactivity into the recommendation process: the user and
the recommender system interact by exchanging messages
in natural language. However, developing a chat-based
interface for a CoRS is a very challenging task, since it requires
multi-disciplinary skills: Natural Language Processing (NLP),
Machine Learning (ML), and Human-Computer Interaction
(HCI). The scope of this work is to analyze the main NLP
steps performed by a CoRS. In this area, two main Natural
Language Understanding (NLU) tasks play a crucial role:
Entity Recognition (ER) and Intent Recognition (IR). ER consists
of identifying real-world entities that users can refer to during
the dialogue. A CoRS needs to acquire preferences from users
to build their profile, and being able to recognize items and
properties is essential in order to achieve high-quality
recommendations. The IR step is equally important: if the user’s
request is not identified correctly, the entire recommendation
process may fail, resulting in user frustration, and low user
retention rates. Many third-party NLU services are available,
which can perform both ER and IR tasks. However, an
extensive evaluation of these solutions in the area of CoRSs has
not been done yet. A relevant work in that direction has been
proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The authors evaluated different platforms on
a corpus of questions related to public transportation, and on
another one from two StackExchange platforms. We decided
to investigate ER and IR tasks in the music, book, and movie
recommendation scenarios. In our opinion, those scenarios are
really challenging for several reasons: entity names often are
complete sentences with their own meaning (e.g. the movie
title "Life is beautiful"); aliases are often associated with person
names (e.g. the American song-writer Barry Eugene Carter
also known as Barry White); intents can be expressed in many
different forms (e.g. for recommendation requests, users can
send messages like Suggest me a movie, What can I watch
tonight?, I’m looking for a sci-fi movie). In this study, we
evaluated four third-party services. For the ER task, we also
added two of our own solutions, for a total of six systems.
The experiment has been carried out on a dataset of real user
messages, collected from the interactions with a CoRS in the
book, music, and movie domains. Therefore, we evaluate each
platform’s ability to understand messages written by real users
in the recommendation context. The experiment was carried
out in order to answer the following Research Questions:
RQ1 - Is a purpose-built solution better than a
general-purpose one for a chat-based CoRS? Developing a
customized component requires more effort; however, this effort
may not result in better performance.
      </p>
      <p>RQ2 - Does fuzzy matching improve Entity Recognition
performance? Introducing fuzzy matching for the ER step
might be a solution to the problem of identifying misspelled
or incomplete entity mentions. However, it may also generate
some noise in the recognition process (e.g. treating unrelated
words as entities).</p>
      <p>
        RQ3 - Is a commercial Intent Recognizer better than an
open-source one for a chat-based CoRS? Taking inspiration
from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we hypothesize that commercial platforms can exploit large quantities of
data fed by their userbase. Conversely, open-source solutions
like Rasa can only rely on the data that the current user is
able to provide for training. This hypothetically means that
the former has better performance than the latter. At the end
of this work, we highlight the specific peculiarities of each
service, and we explain why one service might be preferred
over another via a detailed analysis of specific cases.
RELATED WORK
A CoRS is defined as a system that provides recommendations
to users via a multi-turn dialogue [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The main
characteristic of CoRSs is that the user profile is acquired in an iterative
fashion. The system can interact by asking the user to rate
some items, and the user can influence the outcome of the
recommendation by providing feedback on the suggested items.
On the other hand, traditional recommender systems require
that all user information is provided before generating a
recommendation [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. A review of the state of the art regarding
Conversational Recommender Systems is provided in
Jannach et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Kang et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] investigated how people
interact with a natural language-based CoRS through voice or
text. To do this, the authors developed a natural language
interface, and integrated it in the MovieLens system. The authors
classified three types of recommendation goals and several
types of follow-up queries from the collected data. Objective
recommendation goals express the desire to find movies based
on known attributes, such as the genre. Subjective goals
instead involve the evaluation of subjective qualities (e.g. "sad
movies", "interesting characters"). Finally, navigational goals
are expressed when the user wants to find a specific movie by
its title. A work similar to this is described in Cai and Chen [5].
An exploratory study was conducted by manually annotating
200 human-human conversations. The result is a taxonomy of
intents that are commonly associated with the task of receiving
recommendations. The authors state that users can express
preferences by either mentioning entities (e.g. movies that
they have previously watched), attributes (i.e. subjective or
objective qualities of the items), or purposes (e.g. "a movie
to watch with the family"). While these works are focused
on understanding how people express their needs to a CoRS,
the scope of our study is to evaluate the ability of the existing
NLU tools to correctly understand these expressions.
Developing a CoRS that interacts using a natural language
dialogue requires performing several Natural Language
Processing tasks. Most NLP strategies must be trained on large
quantities of text. In recent years, researchers have published
many datasets for training and evaluating CoRSs. Examples
are Dodge et al. [
        <xref ref-type="bibr" rid="ref7">9</xref>
        ], Asri et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Suglia et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], Li
et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and Radlinski et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. However, many of the
datasets in the literature were developed with the specific
intent of training End-to-End systems, i.e. systems that learn
the two tasks of maintaining a conversation and generating
recommendations together. In this paper, we only focus on
the first task so that the underlying recommendation can be
performed with any of the existing algorithms. As stated in
the Introduction, several NLU platforms are already available,
which contain many of the essential blocks needed to quickly
design and develop Conversational Agents (CAs). They
extract relevant information for a specific machine understanding
task from natural language content [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ]. These systems have
been demonstrated to be suitable for several NLP tasks [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]
such as question answering [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], document translation [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ],
and dialogue management [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Entity Recognition and
Intent Recognition are tasks that are frequently addressed by
those systems. Google’s Dialogflow (also known as API.ai)¹,
IBM’s Watson Assistant², and Microsoft’s LUIS³ are the most
famous commercial solutions. Other platforms are Amazon
Lex⁴, Facebook’s Wit.ai⁵, and Recast.ai⁶. On the other hand,
open-source services like Snips.ai⁷ and Rasa⁸ are also
available. These systems have been largely described and analyzed
in the literature, showing encouraging performance on
different datasets [
        <xref ref-type="bibr" rid="ref17 ref27 ref4 ref5">27, 7, 4, 6, 17</xref>
        ]. As for NLU systems for
conversational agents, Wisniewski et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] compare several
systems with the aim of identifying the best solution for
building a robust CA. The authors compared Snips.ai with Amazon
Lex, API.ai, LUIS, Apple’s SiriKit⁹, IBM Watson, and Wit.ai on
a dataset of 328 queries chosen by the Snips.ai team. The
results confirm the validity of the service, showing much higher
accuracy values than its competitors for the Intent Recognition
task. Braun et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] compared the performance obtained by
LUIS, Watson Assistant, API.ai, Wit.ai, and Amazon Lex on
two different datasets. The first one consists of 206 questions
about public transport, which were asked to a Telegram
chatbot, and then manually annotated by the authors. The second
one consists of 290 questions asked on two StackExchange
platforms. The results show that LUIS performs better than its
competitors on the two datasets in terms of F1-score.
¹ https://dialogflow.cloud.google.com/
² https://www.ibm.com/cloud/watson-assistant/
³ https://www.luis.ai/
⁴ https://aws.amazon.com/it/lex/
⁵ https://wit.ai/
⁶ https://cai.tools.sap/
⁷ https://snips.ai/
⁸ https://rasa.com/
⁹ https://developer.apple.com/documentation/sirikit
The state of the art shows that the research area of CoRSs is
indeed quite active, and that many works have been published
on the subject. However, very few of them focus on the
evaluation of Natural Language Understanding components for the
conversational recommendation scenario. We believe that this
is important, especially if the CoRS should support a
mixed-initiative interaction, in which both the system and the user
can take control of the conversation. Furthermore, we believe
that this research is interesting for practitioners that would like
to develop a conversational recommender system using tools
that are currently available on the market. Research shows
that they perform well in domains such as question
answering and customer service, but there is no evidence regarding
their performance in the conversational recommendation
scenario. We believe that there are some challenges that are
unique to this application. In fact, the intents and entities of
the domains explored in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are not particularly ambiguous
(e.g. Station destination, Station start, Departure time, etc.).
Conversely, the conversational recommendation scenario
offers both ambiguous entities, and user requests that are highly
heterogeneous, as anticipated in the Introduction.
      </p>
      <p>
        CORS ARCHITECTURE AND DIALOG MODEL
Chat-based CoRSs are a specialization of a Goal-Oriented
Conversational Agent. A general architecture for CAs can be
found in Williams et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. A possible adaptation of the
Goal-Oriented CA architecture for the chat-based CoRS task
can be seen in Fig. 1. For this paper, we exclude the
components related to speech, since the focus is on written text
messages. The user message is first analyzed by the Natural
Language Understanding (NLU) component. This
component is responsible for extracting all information from the
user message. Two tasks are commonly associated with
Natural Language Understanding: Intent Recognition and Entity
Recognition. Intent Recognition is performed to understand
the general action or request that is formulated in the message.
For example, asking "What movie should I watch tonight" is a
clear indication of the user’s intention of receiving
recommendations from the system. Entity Recognition is used to extract
named entities that are mentioned by the user. This is essential
for a CoRS, as it allows the user to discuss the items
that he or she is interested in. A CoRS that suggests movies
should be able to recognize entities such as movies, actors,
directors, genres, etc. The output of the NLU component is
a semantically tagged version of the user message, which
describes the intent of the user and the rating given to each individual
entity that may have been mentioned within. The next module
in the architecture is the Dialogue State Tracker (DST). It is
responsible for maintaining a consistent conversation between the
user and the system. It also remembers all the information that
was exchanged between the two parties, which is referred to as
the dialogue state. The DST makes it possible for the system
to remember what was said previously in the conversation, e.g.
when the user answers a question made by the system. The
state is updated at each conversation turn, based on the user’s
message and the previous state. Several DST strategies have
been proposed in the literature: rule-based, such as Branting
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; frame-based, such as Göker and Thompson [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ];
based on statistical models, such as Gašić and Young [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ];
and based on neural networks, such as Bordes et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For
the CoRS task, the dialogue state can contain the user’s profile.
Frame-based DST is often used for this purpose, as it models
the conversation as a set of slots (i.e. attributes) that need to be
filled by the user with the necessary information. An example
of this is the Adaptive Place Advisor [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which uses natural
language to model a constraint-based recommender system.
The Dialogue Policy (DP) is in charge of choosing the action
that will be performed by the system, based on the current
input and the state of the conversation. The action can be
selected either via rules (e.g. one action per intent), or
via a probabilistic model, such as [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The actions
that a CoRS can perform are mostly related to the profile
acquisition (e.g. ask the user to rate an item), or recommendation
(e.g. generate a list of recommendations). Therefore, the DP
can directly communicate with the underlying recommender
system. It is worth noting that the architecture described in
Fig. 1 is not bound to a specific recommendation algorithm.
Finally, the Natural Language Generator (NLG) is in charge
of generating the textual response that will be sent back to the
user. This response can be feedback on an action performed by the
system, or a question that the user should answer, or a generic
response to a colloquial message. The simplest method for
NLG is to use response templates that can be filled with
contextual data. End-to-End conversational agents such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
attempt to generate a complete sentence based on the user’s
input. In this paper, we focus our attention on the NLU
component, and in particular, on the Intent and Entity Recognition
tasks, which we introduced in Section 1.
      </p>
      <p>
        To model the interaction between the user and the CoRS, we
devised a general dialogue model, which involves three
distinct actions: providing a preference for one or more items,
requesting a recommendation, and requesting to view the user profile.
We chose these three actions as they represent the most basic
activities that are performed by recommender systems:
acquiring the user profile and generating recommendations. We
then added the ability to show the user profile, which is
essential for ensuring transparency. Another advantage of this
dialog model is that it is agnostic to the underlying
recommendation algorithm, as it can be easily adapted to work with a
collaborative or content-based filtering recommender system.
Finally, it is consistent with the findings of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and [5]. In
the movie domain, given the utterance I like Ghostbusters
because I love comedy movies, the CoRS should recognize that
the user is providing a positive preference for two different
kinds of items: a movie (Ghostbusters) and a genre (comedy).
By asking the system Can you suggest me a movie to watch?
instead, the CoRS should be able to understand that the user
is asking for a recommendation. Finally, the message What
are the preferences stored in my profile? means that the user
is asking to explore his/her profile.
      </p>
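      <p>The following is a minimal sketch of how the components described above could fit together for the three actions of the dialogue model. It assumes a rule-based policy (one action per intent) and template-based generation. All class, function, and template names are hypothetical and are not taken from any of the evaluated platforms; the NLU output is written by hand here rather than produced by a real recognizer.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field

# Intent names follow the three actions of the dialogue model.
INTENTS = {"preference", "request_recommendation", "show_profile"}


@dataclass
class NluResult:
    """Semantically tagged message: one intent plus a rating per entity."""
    intent: str
    ratings: dict = field(default_factory=dict)  # entity -> "positive" / "negative"


@dataclass
class DialogueState:
    """Frame-like dialogue state; in this sketch it only stores the user profile."""
    profile: dict = field(default_factory=dict)
    turns: int = 0


def track_state(state, nlu):
    """DST step: update the state from the previous state and the new message."""
    state.profile.update(nlu.ratings)
    state.turns += 1
    return state


def choose_action(nlu):
    """DP step: rule-based policy with one action per known intent."""
    return nlu.intent if nlu.intent in INTENTS else "fallback"


# Hypothetical response templates, filled with contextual data by the NLG step.
TEMPLATES = {
    "preference": "Thanks, your profile now holds {n} preference(s).",
    "request_recommendation": "Looking for items similar to: {liked}.",
    "show_profile": "You like: {liked}.",
    "fallback": "Sorry, I did not understand that.",
}


def generate(action, state):
    """NLG step: fill the template of the chosen action."""
    liked = sorted(e for e, r in state.profile.items() if r == "positive")
    return TEMPLATES[action].format(
        n=len(state.profile), liked=", ".join(liked) or "nothing yet"
    )


def respond(state, nlu):
    """One conversation turn: DST, then DP, then NLG."""
    track_state(state, nlu)
    action = choose_action(nlu)
    return action, generate(action, state)


# "I like Ghostbusters because I love comedy movies", tagged by hand.
state = DialogueState()
tagged = NluResult("preference", {"Ghostbusters": "positive", "comedy": "positive"})
print(respond(state, tagged))
print(respond(state, NluResult("show_profile")))
```
]]></preformat>
      <p>Because the policy only returns an action name, the recommendation step can be delegated to any underlying algorithm, which is the property the dialogue model is meant to preserve.</p>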
      <p>EVALUATED PLATFORMS
The objective of this study is to compare several NLU
platforms based on their ability to perform tasks commonly carried
out in a chat-based CoRS, i.e. intent and entity recognition.</p>
      <p>Due to the large number of NLU platforms available on the
market and limited resources, a comparison of all platforms
would hardly be feasible. Therefore, we decided to conduct
the study on a subset of the most popular platforms. We chose
to include three of the most popular commercial NLU
platforms, which are Google Dialogflow, Microsoft LUIS, and
IBM Watson Assistant. Then, we selected one of the most
popular open-source NLU platforms, i.e. Rasa. These four
platforms already provide all the tools required to build a
functioning CoRS, except for the recommendation algorithm. On
the surface, the commercial platforms share many similarities:
they all provide the same NLU functionalities (i.e. intent and
entity recognition), a Web-based interface for building the
dialogue model, multi-language support, and can be easily
integrated into a wide variety of messaging platforms. Also,
they all employ Machine Learning (ML) algorithms for the
NLU tasks, which need to be trained by supplying example
sentences. The ML techniques used are proprietary, and thus
no implementation details are provided, which makes it more
difficult to choose one platform over another based on their
characteristics. An advantage of Dialogflow and Watson
Assistant over LUIS is the fuzzy matching function that allows
the Entity Recognition to tolerate spelling errors. Rasa differs
from the previous platforms for several reasons: first and
foremost, it is open-source and self-hosted, rather than proprietary
and cloud-based. Rasa is essentially divided into two parts:
Rasa NLU, which handles the Natural Language
Understanding pipeline, and Rasa Core which implements a Dialogue
management component. This component uses ML to model
the conversation logic, unlike the previous platforms that use
hand-written rules. Due to the open-source nature of Rasa,
it is also more transparent about the ML techniques that it
employs. In fact, each component of the NLU pipeline can
be customized, and many state-of-the-art models are already
available, such as spaCy embeddings. However, Rasa does
not currently implement a fuzzy matching function for Entity
Recognition. While a developer could plug in an external fuzzy
string matching library for this purpose, we chose not to do so,
as we want to test Rasa’s performance out of the box. In
addition to these platforms, we also added to the comparison
two Entity Recognizers that were developed in-house and
are purpose-built for the CoRS task. Both components are
trained to recognize and link entities that are contained within
a knowledge base.</p>
      <p>This composition is tailored to the Research Questions
formulated in the Introduction. We included both commercial
and open-source platforms in the study to test the hypothesis
formulated in RQ3. The addition of custom-built components
for Entity Recognition is useful to answer RQ1. Finally, the
fact that only some platforms support Fuzzy Matching for
Entity Recognition allows us to answer RQ2.</p>
      <p>
        Custom ER solutions
Before starting the description of the custom-made ER models,
it is worth noting that the entity recognition task for a CoRS
is not trivial. Indeed, despite the fact that entity names are
already available in the knowledge base, the ER should be able
to tolerate misspellings, incomplete mentions, and synonyms.
In fact, when a user interacts with a conversational agent
through text, typing errors are likely to occur [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Another
problem is that the same entity may have several surface forms.
One example is Barry White, as introduced in Section
1. Furthermore, users may expect to mention entities in a
shortened form, for convenience. An effective ER should
handle these issues properly; failing to account for them
results in a more frustrating user experience.
      </p>
      <p>
        NER_FCR (Named Entity Recognition based on Fuzzy CRF
+ Regex) is composed of two parts: an Entity Recognizer
that extracts the parts of the user message containing entity
mentions, and an Entity Linker that maps the entity mentions
back to the knowledge base. The Entity Recognition is
implemented using parts from the Stanford CoreNLP¹⁰ library. In
particular, it uses a combination of two algorithms: one based
on regular expressions and one based on Conditional Random
Fields (CRF) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The CRF-based classifier can be trained by
supplying example sentences. It classifies each word as either
entity or non-entity, based on several features such as the shape
and Part-of-speech tag of the current, previous, and next word.
It is able to tag entities even if they were not spelled correctly
or completely. The regex-based classifier only needs a list of
entities, and ensures that correctly spelled entities are
recognized. The Entity Linking part uses fuzzy string matching¹¹
to find the entities that are most similar to the mention based
on the Levenshtein distance. Multiple entities are retrieved if
the mention is ambiguous.
¹⁰ https://stanfordnlp.github.io/CoreNLP/
¹¹ https://github.com/xdrop/fuzzywuzzy
      </p>
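      <p>The Entity Linking step can be illustrated with a small, self-contained sketch. It is not the NER_FCR implementation: the Levenshtein distance is written out by hand instead of calling the fuzzy matching library mentioned above, and the normalised similarity, the threshold of 0.8, the top_k parameter, and the tiny entity list are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def link(mention, entities, threshold=0.8, top_k=3):
    """Return up to top_k entity names whose similarity reaches the threshold."""
    scored = [(similarity(mention, name), name) for name in entities]
    kept = [(s, n) for s, n in scored if s >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best match first
    return [name for _, name in kept[:top_k]]


# Hypothetical knowledge base with a handful of entity names.
KB = ["Ghostbusters", "Johnny Depp", "Blade Runner", "Titanic"]

print(link("ghostbuster", KB))           # incomplete mention
print(link("Jonny Depp", KB))            # misspelled mention
print(link("titan", KB, threshold=0.7))  # a looser threshold accepts more
```
]]></preformat>
      <p>Returning a list rather than a single name mirrors the behaviour described above, where several entities are retrieved for an ambiguous mention. Lowering the threshold tolerates heavier misspellings but also lets less related strings through.</p>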
      <p>
        NER_KG (Named Entity Recognition based on Knowledge
Graphs) exploits a knowledge-based approach for finding
entities mentioned in the user sentence and linking them to the
correct concept in the knowledge base. Wikidata¹² is the
Knowledge Base chosen for this task. In particular, we can
customize NER_KG according to the entities involved in a
specific domain. For example, if the CoRS is a movie
recommender system, we can filter only Wikidata concepts related
to this specific domain: e.g. movies, actors, and directors.
NER_KG performs both Spotting and Linking steps. In the
spotting step, the algorithm analyzes the text in order to
discover candidate entities. In particular, the algorithm detects
sequences of words (surface form) matching a Wikidata alias,
and then all the concepts that can be associated to the alias are
retrieved. This spotting strategy enables NER_KG to deal with
noisy text. For the second step (also referred to as Linking), it
uses an approach based on graph embeddings to exploit the
relations between concepts in the knowledge graph. In
particular, NER_KG uses holographic embeddings (HolE) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] built
on Wikidata. The disambiguation step consists of selecting the
correct concept for each surface form. The idea is to choose
the concept that is most similar to the other concepts occurring
in the text, following the hypothesis of one topic per discourse.
The motivation behind this approach is that, in a sentence, the
user tends to refer to entities that are in some way related. Due
to the use of linked data, NER_KG does not require training
sentences to work.
      </p>
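      <p>The Spotting and Linking steps can be sketched as follows. This is only an illustration of the idea, under strong assumptions: the alias table, the concept identifiers, and the two-dimensional vectors are invented placeholders (not Wikidata IDs and not HolE embeddings), and cosine similarity stands in for whatever similarity measure the actual system uses.</p>
      <preformat><![CDATA[
```python
import math
import re

# Hypothetical alias table: surface form -> candidate concepts.
ALIASES = {
    "titanic": ["titanic_film", "titanic_ship"],
    "james cameron": ["james_cameron"],
}

# Hypothetical concept vectors standing in for graph embeddings.
VECTORS = {
    "titanic_film": (0.9, 0.1),
    "titanic_ship": (0.1, 0.9),
    "james_cameron": (1.0, 0.0),
}


def spot(text, max_len=3):
    """Spotting: find word sequences that match an alias, longest match first."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if surface in ALIASES:
                found.append((surface, ALIASES[surface]))
                i += n
                break
        else:
            i += 1  # no alias starts at this token
    return found


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_score(candidate, others):
    """Total similarity between a candidate and the concepts of other mentions."""
    return sum(cosine(VECTORS[candidate], VECTORS[o]) for o in others)


def disambiguate(spotted):
    """Linking: for each surface form, keep the candidate closest to its context."""
    chosen = {}
    for k, (surface, candidates) in enumerate(spotted):
        others = [c for j, (_, cs) in enumerate(spotted) if j != k for c in cs]
        # With no other mention in the text, max() keeps the first candidate.
        chosen[surface] = max(candidates, key=lambda c: context_score(c, others))
    return chosen


message = "I like Titanic, but I don't really like James Cameron"
print(disambiguate(spot(message)))
```
]]></preformat>
      <p>In this toy example the mention of the director pulls the ambiguous surface form towards the film rather than the ship, which is the intuition behind the one-topic-per-discourse hypothesis.</p>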
      <p>
        DATASET
The dataset used in this study contains real user messages,
which were captured by logging the interactions between users
and a chat-based CoRS in the movie, book, and music domains
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Each message is bundled with the ID of the user, the
intent, the entities mentioned within, and a timestamp. The
dataset is composed of 5,318 messages for the movie domain,
1,862 for the book domain, and 2,096 for the music domain.
The dataset was filtered by removing all messages that do
not fit with the dialogue model defined in Section 3. For
instance, we removed generic chit-chat messages, or messages
that users sent in response to a question made by the system.
The remaining messages have been manually checked by three
human annotators. In particular, the annotators validated the
intents identified by the CoRS, checked the list of entities
identified in the message, and fixed errors. For each entity
mentioned by the user, the annotators tagged it with the most
appropriate Wikidata entity.
      </p>
      <p>At the end of the cleaning and revising process, 2,410 queries
were obtained. In particular, there are 1,016 messages
for the movie domain, 747 for the book domain, and 647 for
the music domain. Table 1 presents an overview of the
composition of the final dataset. Table 2 contains some examples
of user messages in the movie domain. Each message is
annotated with one of the following intents: preference,
request_recommendation, and show_profile, which are the actions of the
dialogue model described in Section 3. Each entity was
manually classified according to its role in the domain. The roles
defined for each domain are: item_movie, item_song, item_book
for entities concerning movies/songs/books; item_people for
entities concerning real people like actors, singers or writers;
item_genre for entities describing the genre of an item.
Entities are all linked to their corresponding Wikidata ID. The
dataset is publicly available on GitHub¹³.
¹² https://www.wikidata.org/</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Intent</th><th>Messages</th></tr>
          </thead>
          <tbody>
            <tr><td>movie</td><td>preference</td><td>743</td></tr>
            <tr><td>movie</td><td>request_recommendation</td><td>206</td></tr>
            <tr><td>movie</td><td>show_profile</td><td>67</td></tr>
            <tr><td>music</td><td>preference</td><td>534</td></tr>
            <tr><td>music</td><td>request_recommendation</td><td>146</td></tr>
            <tr><td>music</td><td>show_profile</td><td>67</td></tr>
            <tr><td>book</td><td>preference</td><td>438</td></tr>
            <tr><td>book</td><td>request_recommendation</td><td>142</td></tr>
            <tr><td>book</td><td>show_profile</td><td>67</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>EXPERIMENTAL EVALUATION
The objective of this experiment is to evaluate the performance
of the NLU platforms presented in Section 4 for the Intent
and Entity recognition tasks in the Conversational
Recommendation (CoR) scenario. We used the datasets described in
Section 5 both to train and test the ML components of each
platform. The idea is that by performing the test on multiple
domains we will obtain stronger evidence about the strengths
and weaknesses of each solution. Moreover, we are interested
in observing each platform’s ability to understand messages
sent by real users in the CoR scenario. This scenario is
challenging due to the high variability of user messages, and to the
presence of misspelled or partial entity mentions. Although a
service may perform better in a domain and worse in others,
we expect that all platforms show similar performance in all
three domains.</p>
      <p>
        Protocol
The experimental protocol adopted in our work is inspired
by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], with some changes. First, we adopted 5-fold
cross-validation instead of a random train-test split. Second,
we analyzed IR and ER performance separately.
All NLU platforms described in Section 4 followed the same
experimental protocol, which means that each platform was
trained and tested using the same data, with two exceptions:
first, only the ER test was performed for NER_FCR
and NER_KG; second, NER_KG does not require training
sentences, so cross-validation was unnecessary for it. To perform the
5-fold CV, we divided each dataset into five parts. We then
repeated training and testing five times, each time using one fold as
the test set and the other four folds as the training set. In each
step, we trained the Intent and Entity Recognition components
of each platform with the training set. Where possible, we
used the batch import functionalities made available by each
platform, in order to easily re-train the agents for each
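iteration.
      </p>
      <p>13 https://github.com/aiovine/converse-dataset</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Message</th><th>Intent</th><th>Entities</th></tr>
          </thead>
          <tbody>
            <tr><td>I like the avengers</td><td>preference</td><td>"The Avengers (Q182218)" - positive</td></tr>
            <tr><td>I like ghostbuster</td><td>preference</td><td>"Ghostbusters (Q108745)" - positive</td></tr>
            <tr><td>I don’t like blade runner</td><td>preference</td><td>"Blade Runner (Q184843)" - negative</td></tr>
            <tr><td>I like Titanic, but I don’t really like James Cameron</td><td>preference</td><td>"Titanic (Q44578)" - positive; "James Cameron (Q42574)" - negative</td></tr>
            <tr><td>i like Jonny Depp</td><td>preference</td><td>"Johnny Depp (Q37175)" - positive</td></tr>
            <tr><td>See my profile</td><td>show_profile</td><td></td></tr>
            <tr><td>Suggest some film</td><td>request_recommendation</td><td></td></tr>
            <tr><td>Can you suggest me an action movie?</td><td>request_recommendation</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>During testing, the user messages in the test set were
analyzed by the platform. For LUIS, we used the integrated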
testing function. For the other platforms, we created a script
that calls the available APIs programmatically. The intent and
entities that were extracted by the system were compared with
the actual tags in the dataset. For the IR task, an error was
counted if the recognized intent was different from the true
intent, or if no intent was found by the system. For the ER
task, a message was considered correctly tagged if all entity
mentions contained within were found, and if each mention
was linked to the correct entity in the knowledge base.
Therefore, we counted an error when an entity mention was not
captured by the system, when one or more words were
incorrectly tagged as entity mentions, or when the entity mention
was successfully found, but it was linked to the wrong entity.
Also, if an entity was recognized twice in the same sentence,
one error was recorded. It is worth noting that each platform
received the entire set of possible entities in the training phase,
because measuring the ability to recognize previously
unseen entities is outside the scope of this experiment. Since
Rasa offers multiple choices in terms of NLU pipelines, we
also conducted a preliminary test in order to choose, for each
domain, the pipeline that obtained the highest F1-score for
both Intent and Entity Recognition. As a result, we used the
supervised_embeddings pipeline for the movie domain, and
the pretrained_embeddings_spacy pipeline for the book and
movie domains.</p>
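      <p>As a rough illustration of the protocol above, the following Python sketch splits a dataset into five folds and applies the strict message-level rule used for the ER task. The function names, the (mention, entity ID) pair layout and the fixed random seed are assumptions made for this example; they do not reproduce the scripts used in the experiment or the API of any platform.</p>

```python
import random


def five_fold_splits(items, seed=0):
    """Yield (train, test) pairs; each of the five folds is the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every fifth item starting at position i.
    folds = [shuffled[i::5] for i in range(5)]
    for i in range(5):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def message_correct(predicted, gold):
    """Strict rule: predicted (mention, entity_id) pairs must equal the gold pairs.

    A missed mention, a spurious mention, a wrong link or an entity
    returned twice all make the message count as an error.
    """
    return sorted(predicted) == sorted(gold)


if __name__ == "__main__":
    # Hypothetical toy dataset: ten messages, one tagged entity each.
    dataset = [("message %d" % i, [("mention %d" % i, "Q%d" % i)]) for i in range(10)]
    for train, test in five_fold_splits(dataset):
        print(len(train), "training messages,", len(test), "test messages")
```

      <p>Comparing sorted lists rather than sets keeps duplicates visible, so an entity recognized twice in the same message is counted as an error, as in the protocol.</p>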
      <p>In order to measure the performance, we used the precision,
recall and F1-score metrics. These metrics were calculated
for the Intent and Entity Recognition tasks separately. Aside
from the per-domain measures, we also calculated the overall
measures, aggregated from the three domains. We decided to
exclude the precision of the intent-not-found class from the
calculation, because we do not find it relevant to our study.
Therefore, whenever a platform did not return any intent for a
message, only a false negative was recorded.
</p>
      <p>Results
The results of the experiment can be found in Figure 2 and
Tables 6 - 11. Each table reports the results of the Intent
and Entity Recognition tests for a single NLU platform. We
reported the values of the three metrics for each individual
intent/entity type. The Total row is calculated over all intent
or entity types in a domain. The Overall row is calculated
over all domains. Figure 2 shows the F1-score recorded
by each platform in the Intent and Entity Recognition tests,
in each domain, plus an overall figure that
summarizes all three domains. Observing the results, all platforms
perform well in terms of Intent Recognition, with
F-scores no lower than 0.97. A notable example is Rasa, which
obtained a perfect IR score in the music domain. Overall,
Watson Assistant is the platform that obtains the best
performance, with an F1-score of 0.9988. However, by analyzing
the trend in each domain, we see that the situation is not as
clear: Watson, LUIS, and Rasa trade places as the best system
in each domain. Instead, Dialogflow seems to perform the
worst, with an overall F1-score of 0.9908.</p>
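      <p>The per-intent precision, recall and F1-score used in this evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, None stands for the case in which a platform returns no intent, and the sample labels are invented. As described in the protocol, a missing intent only adds a false negative for the true intent.</p>

```python
from collections import Counter


def intent_scores(gold, predicted):
    """Per-intent (precision, recall, F1); None means no intent was returned."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if p == g:
            tp[g] += 1
        else:
            fn[g] += 1          # the true intent was missed
            if p is not None:   # "intent not found" adds no false positive
                fp[p] += 1
    scores = {}
    for intent in set(gold):
        prec_den = tp[intent] + fp[intent]
        rec_den = tp[intent] + fn[intent]
        precision = tp[intent] / prec_den if prec_den else 0.0
        recall = tp[intent] / rec_den if rec_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[intent] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    # Invented example: one missing intent and one misclassification.
    gold = ["preference", "preference", "show_profile", "request_recommendation"]
    predicted = ["preference", None, "show_profile", "preference"]
    for intent, values in sorted(intent_scores(gold, predicted).items()):
        print(intent, values)
```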
      <p>For the Entity Recognition task, the best two systems are
NER_FCR and Dialogflow, with an overall F-score of 0.9920
and 0.9911, respectively. In second place we find Watson
Assistant, with an F-score of 0.9715. In the last three places,
we find NER_KG, LUIS, and Rasa, which obtained similar
results throughout the domains, with the lowest score (0.8399) achieved
by Rasa in the book domain. Watson Assistant
seems to struggle most with recognizing genres, with a recall
of 0.6111 and an F-score of 0.7586 in the book domain, as
we can see in Table 6. LUIS obtained mostly high precision
and low recall for each entity type, as seen in Table 8. This is
expected due to the lack of fuzzy matching, which means that
only exact entity matches will be recognized. A similar trend
can be seen for Rasa in Table 9, which makes sense because it
also lacks fuzzy matching.</p>
      <p>In particular, Rasa has some difficulties in recognizing movie
names, music artists, and genres (especially in the book
domain), with a recall of 0.3889 and an F-score of 0.56. Unlike
the previous platforms, Dialogflow is able to handle all entity
types without major problems. In fact, it was able to recognize
all song entities correctly. This is also true for NER_FCR,
which performed comparably to Dialogflow for every entity
type. NER_KG was able to recognize genres better than the
other platforms (with F-scores ranging from 0.8955 to 0.9811);
however, it fared worse than the others in recognizing movie,
song, and book names. It especially struggled with
recognizing songs, obtaining a recall of 0.6786 and an F-score of
0.7917.</p>
      <p>A one-way ANOVA statistical test (significance level = 0.05)
was performed. The F-score was the dependent variable, with
the platform being the independent variable. The null
hypothesis for this test is: The average F-score for Intent/Entity
recognition is the same for all platforms. Results of the ANOVA
test for both intents and entities are described in Table 3. The
test rejected the null hypothesis in both cases. Therefore, we
proceeded to perform post-hoc tests, to find which platforms
perform better (or worse) than others.</p>
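      <p>For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as in the sketch below. The function name and the sample values are assumptions for illustration only: each inner list stands for the F-scores observed for one platform, and the numbers are invented, not taken from the experiment. The p-value is then obtained from the F distribution with (k - 1, n - k) degrees of freedom, for example through a statistics library.</p>

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA over a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Invented scores for three hypothetical platforms.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```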
      <p>To do this, we used the t-test for dependent samples
(significance level = 0.05). The results of the t-test for the Intent
Recognition tasks are described in Table 4. For each pair
of platforms, we reported the value of the t statistic and the
p-value. If the p-value is less than the significance level, a
symbol is added: (+) means that the platform on the row has a
significantly higher F-score than the platform on the column,
(-) means the opposite. The t-test did not find a clear winner
for the IR task. However, it confirmed that Dialogflow was
significantly worse than the other three platforms, which is in
accordance with what was said at the beginning of this Section.
More interesting results come from the ER task represented
in Table 5. In fact, almost all pairs of platforms obtained
significantly different results. Therefore, we can confirm that
Dialogflow and NER_FCR both obtained the best performance.
Watson Assistant placed second, followed by NER_KG, then
LUIS, then Rasa.</p>
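      <p>The statistic behind the dependent-samples t-test can be sketched as follows. The function name and the paired values are hypothetical; in this setting each pair would hold the scores of two platforms on the same unit of evaluation. A positive value means that the first platform has the higher mean score, which corresponds to the (+) symbol when the difference is significant; the p-value comes from Student's t distribution with n - 1 degrees of freedom.</p>

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the dependent-samples t-test for paired scores a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


if __name__ == "__main__":
    # Invented paired observations for two hypothetical platforms.
    print(paired_t_statistic([3, 5, 7, 9], [1, 2, 3, 4]))
```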
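      <table-wrap>
        <table>
          <thead>
            <tr><th>Domain</th><th>Intent Recognition test: F</th><th>Intent Recognition test: p</th><th>Entity Recognition test: F</th><th>Entity Recognition test: p</th></tr>
          </thead>
          <tbody>
            <tr><td>Total</td><td>18.61</td><td>&lt;0.001</td><td>129.8</td><td>&lt;0.001</td></tr>
            <tr><td>Movie</td><td>6.95</td><td>&lt;0.001</td><td>47.27</td><td>&lt;0.001</td></tr>
            <tr><td>Book</td><td>3.83</td><td>0.009</td><td>57.73</td><td>&lt;0.001</td></tr>
            <tr><td>Music</td><td>9.36</td><td>&lt;0.001</td><td>34.39</td><td>&lt;0.001</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Discussion
With the results described in Section 6.2, we can answer the
Research Questions introduced in the Introduction. Regarding
RQ1, we can see that one of our own components (NER_FCR)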
managed to perform very well in the ER task, obtaining a
better F-score than most general-purpose platforms. However,
the statistical test did not confirm that it performed better than
Dialogflow. NER_KG, on the other hand, was outclassed by
both Dialogflow and Watson Assistant. While this is true, it
is worth noting that NER_KG was able to obtain this result
by only relying on Linked data, instead of training sentences
(as described in Section 4.1). This could give NER_KG an edge
when developers have no training data at hand. The
linked data-based approach also allowed it to obtain good
performance in recognizing genres, where the other platforms
struggled.</p>
      <p>The answer to RQ2 is positive: all the systems that implement
fuzzy matching for the ER component (Dialogflow, Watson,
NER_FCR, NER_KG) performed better than those that did
not implement it (LUIS, Rasa). This confirms the intuition
that we presented in Section 4.1. Indeed, making sure that
the system is able to tolerate spelling errors is beneficial to
the overall performance of the CoRS and improves the user
experience. However, we cannot conclude which fuzzy
matching implementation is definitely superior: both Dialogflow
and NER_FCR obtained the highest score as stated earlier.
Watson Assistant also implements it; however, it performed
significantly worse than the first two. Analyzing the results
in detail, we can see that Watson’s implementation of fuzzy
matching seems limited. In particular, it is less tolerant of
incomplete mentions. For example, given the test sentence I
like Shakespeare, it failed to retrieve the entity William
Shakespeare, whereas Dialogflow and NER_FCR retrieved it.</p>
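      <p>To make the notion of fuzzy matching concrete, the sketch below links a misspelled or partial mention to the closest entity label using the similarity ratio from Python's standard difflib module. This is a generic illustration, not the implementation used by Dialogflow, Watson Assistant or NER_FCR, whose internals are not described here; the function name, the label list and the 0.6 cutoff are assumptions.</p>

```python
import difflib


def link_mention(mention, labels, cutoff=0.6):
    """Return the label most similar to the mention, or None if none is close enough."""
    lowered = {label.lower(): label for label in labels}
    hits = difflib.get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


if __name__ == "__main__":
    labels = ["Johnny Depp", "William Shakespeare", "James Cameron"]
    # An exact-match lookup would fail on both of these mentions.
    print(link_mention("Jonny Depp", labels))
    print(link_mention("Shakespeare", labels))
```

      <p>Lowering the cutoff makes the lookup more tolerant of incomplete mentions, at the cost of more spurious links.</p>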
      <p>Another peculiarity is that Watson seems to apply fuzzy
matching only to non-existent words. Given the two sentences I
like Johnny Detp and I like Johnny Deep, Watson is able to
recognize Johnny Depp correctly in the first sentence, but
not in the second, where the misspelling (Deep) is itself an existing word. NER_KG’s performance in recognizing
genres can be explained by the fact that it is able to exploit all
aliases that are provided by Wikidata, while the other systems
only worked with a single label for each entity. Genres are a
particularly difficult type of entity, because they have much
more variability in their surface forms, compared to titles or
person names. For example, just the presence or absence of
"film" or "movie" in the genre name can create issues in most
systems (e.g. I like action films versus I like action). Both
LUIS and Rasa lack fuzzy matching, which means that entity
mentions are correctly recognized only if they match
their label exactly. However, Rasa also missed some entities that
matched their label exactly. This seems to happen especially when there
are multiple entities per message.</p>
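      <p>The benefit of exploiting aliases for genres can be illustrated with a small lookup table. The sketch below is an assumption-laden example, not the NER_KG implementation: the identifier "genre:action" is a placeholder rather than a real Wikidata ID, the alias list is invented, and matching is done by naive substring search.</p>

```python
def build_alias_index(entities):
    """Map every lower-cased label or alias to its entity identifier."""
    index = {}
    for entity_id, names in entities.items():
        for name in names:
            index[name.lower()] = entity_id
    return index


def find_genre(message, index):
    """Return the entity whose longest alias occurs in the message, if any."""
    text = message.lower()
    for alias in sorted(index, key=len, reverse=True):
        if alias in text:
            return index[alias]
    return None


if __name__ == "__main__":
    # Placeholder identifier and invented aliases, for illustration only.
    single_label = build_alias_index({"genre:action": ["action film"]})
    with_aliases = build_alias_index({"genre:action": ["action film", "action movie", "action"]})
    print(find_genre("I like action", single_label))
    print(find_genre("I like action", with_aliases))
```

      <p>With a single label, only the sentence that contains "action film" is matched; with aliases, both surface forms resolve to the same entity.</p>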
      <p>The answer to RQ3 is negative: Rasa was able to perform
just as well as all the commercial platforms in terms of Intent
Recognition. In contrast, Dialogflow performed
significantly worse than Rasa and all the others. In particular, it
appears that its performance in Intent and Entity
Recognition was linked in some way. By analyzing the results in
further detail, we can see that most of the cases in which
Dialogflow fails to recognize the preference intent happen when
it is not able to find any entity in the sentence. In these cases,
it returns the default fallback intent. This happens mostly
with sentences of the form I like [entity], which suggests that
overfitting may have occurred.</p>
      <p>Performing the study on three different domains allows us
to explore the differences between them. These differences
are especially noticeable in the Entity Recognition test. For
example, we can see that the gap in ER
performance between systems with and without fuzzy matching
is more prominent in the book domain. A possible
explanation for this is that users refer to book-related entities
using many different surface forms, e.g. by shortening book
titles or author names. Another peculiarity is that users often
refer to series of books rather than single works (e.g. "Harry
Potter"), which may be a potential source of confusion, as the
ER component needs to disambiguate both types of references.
This is an important aspect to consider when developing a
natural language-based interface for a CoRS. Even though
the same dialog model was originally applied for all three
domains, the way the users express their preferences is still
significantly different between each domain. These differences
produce a noticeable effect in the overall performance of the
system, which should be considered when developing a CoRS
for a new domain. Understanding the users’ language is key
to obtaining good recognition accuracy and, therefore, a good user
experience.</p>
      <p>
        CONCLUSION
In this paper, we presented a comparison of several Natural
Language Understanding systems that may be employed to
create Conversational Recommender Systems. We compared
their performance in the Intent and Entity Recognition tasks in
three different domains: movies, books, and music. We
found that the platforms performed differently. For the
Intent Recognition task, Watson Assistant, LUIS, and Rasa
were able to achieve comparable scores. For the Entity
Recognition task, Dialogflow was the best commercial platform, and
performed comparably with a custom-developed Entity
Recognizer (NER_FCR). It is interesting to note that some of the
findings from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] were only partly confirmed by our study. In
fact, Rasa performed as well as the commercial platforms for
Intent Recognition, but was instead worse in terms of Entity
Recognition. Another difference is the fact that in our study
Dialogflow was one of the best services for Entity
Recognition, while in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] it was one of the worst. This further confirms
what emerged in that study, i.e. that the performance of NLU
services is strictly dependent on the domain and, we may
add, on the specific task.
      </p>
      <p>As regards the limitations of our study, the first one comes
from the fact that we could not include all possible platforms
in the evaluation. We chose the services that we deemed the
most popular. Another limitation is the size of the dataset: the
number of messages used in the experiment is not very large,
which may limit the significance of the results. As
future work, we propose to collect more data by expanding
the dataset to other domains. In particular, we will make sure
to increase the variability and complexity of the messages
for each intent, for example, by collecting more preference
messages with multiple entity mentions.</p>
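      <table-wrap>
        <table>
          <thead>
            <tr><th>Domain</th><th>Intent</th><th>Precision</th></tr>
          </thead>
          <tbody>
            <tr><td>Movie</td><td>preference</td><td>0.9987</td></tr>
            <tr><td>Movie</td><td>request_recommendation</td><td>1.0000</td></tr>
            <tr><td>Movie</td><td>show_profile</td><td>1.0000</td></tr>
            <tr><td>Movie</td><td>Total</td><td>0.9990</td></tr>
            <tr><td>Music</td><td>preference</td><td>1.0000</td></tr>
            <tr><td>Music</td><td>request_recommendation</td><td>1.0000</td></tr>
            <tr><td>Music</td><td>show_profile</td><td>1.0000</td></tr>
            <tr><td>Music</td><td>Total</td><td>1.0000</td></tr>
            <tr><td>Books</td><td>preference</td><td>0.9977</td></tr>
            <tr><td>Books</td><td>request_recommendation</td><td>1.0000</td></tr>
            <tr><td>Books</td><td>show_profile</td><td>1.0000</td></tr>
            <tr><td>Books</td><td>Total</td><td>0.9985</td></tr>
            <tr><td>Overall</td><td></td><td>0.9992</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>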
language understanding services for conversational
question answering systems. In Proceedings of the 18th
Annual SIGdial Meeting on Discourse and Dialogue.
174–185.
[5] Wanling Cai and Li Chen. 2019. Towards a Taxonomy
of User Feedback Intents for Conversational Recommendations. In RecSys.
[6] Massimo Canonico and Luigi De Russis. 2018. A
comparison and critique of natural language
understanding tools. Cloud Computing 2018 (2018),
120.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Layla</given-names>
            <surname>El Asri</surname>
          </string-name>
          , Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and
          <string-name>
            <given-names>Kaheer</given-names>
            <surname>Suleman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems</article-title>
          .
          <source>arXiv:1704.00057 [cs] (March</source>
          <year>2017</year>
          ). http://arxiv.org/abs/1704.00057 arXiv:
          <fpage>1704</fpage>
          .
          <fpage>00057</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Lan</given-names>
            <surname>Boureau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Learning End-to-End Goal-Oriented Dialog</article-title>
          . arXiv:
          <volume>1605</volume>
          .07683 [cs] (May
          <year>2016</year>
          ). http://arxiv.org/abs/1605.07683 arXiv:
          <fpage>1605</fpage>
          .
          <fpage>07683</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Karl</given-names>
            <surname>Branting</surname>
          </string-name>
          , James Lester, and
          <string-name>
            <given-names>Bradford</given-names>
            <surname>Mott</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Dialogue Management for Conversational Case-Based Reasoning</article-title>
          .
          <source>In Advances in Case-Based Reasoning</source>
          , David Hutchison,
          <string-name>
            <given-names>Takeo</given-names>
            <surname>Kanade</surname>
          </string-name>
          , Josef Kittler,
          <string-name>
            <surname>Jon M. Kleinberg</surname>
            , Friedemann Mattern, John
            <given-names>C.</given-names>
            Mitchell, Moni Naor, Oscar Nierstrasz, C.
          </string-name>
          <string-name>
            <surname>Pandu Rangan</surname>
          </string-name>
          , Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum,
          <source>Peter Funk, and Pedro A. González Calero (Eds.)</source>
          . Vol.
          <volume>3155</volume>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          . DOI:http://dx.doi.org/10.1007/978-3-
          <fpage>540</fpage>
          -28631-
          <issue>8</issue>
          _
          <fpage>7</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Braun</surname>
          </string-name>
          , Adrian Hernandez Mendez, Florian Matthes, and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Langen</surname>
          </string-name>
          .
          <year>2017</year>
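          . Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 174–185.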
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Alice</given-names>
            <surname>Coucke</surname>
          </string-name>
          , Adrien Ball, Clément Delpuech, Clément Doumouro, Sylvain Raybaud, Thibault Gisselbrecht, and
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Dureau</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Benchmarking natural language understanding systems: Google, facebook, microsoft, amazon and snips</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Kathleen</given-names>
            <surname>Dahlgren</surname>
          </string-name>
          and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Stabler</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Natural language understanding system</article-title>
          .
          <source>(Aug. 11</source>
          <year>1998</year>
          ).
          <source>US Patent 5</source>
          ,
          <issue>794</issue>
          ,
          <fpage>050</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Dodge</surname>
          </string-name>
          , Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra,
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Arthur</given-names>
            <surname>Szlam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>arXiv:1511.06931 [cs] (Nov</source>
          .
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          http://arxiv.org/abs/1511.06931 arXiv:
          <fpage>1511</fpage>
          .
          <fpage>06931</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jenny Rose</given-names>
            <surname>Finkel</surname>
          </string-name>
          , Trond Grenager, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling</article-title>
          .
          <source>In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)</source>
          .
          <article-title>Association for Computational Linguistics</article-title>
          , Ann Arbor, Michigan,
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          . DOI: http://dx.doi.org/10.3115/1219840.1219885
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Milica</given-names>
            <surname>Gašić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Effective handling of dialogue state in the hidden information state POMDP-based dialogue manager</article-title>
          .
          <source>ACM Transactions on Speech and Language Processing</source>
          <volume>7</volume>
          ,
          <issue>3</issue>
          (May
          <year>2011</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          . DOI: http://dx.doi.org/10.1145/1966407.1966409
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Goker</surname>
          </string-name>
          and
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Thompson</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>The adaptive place advisor: A conversational recommendation system</article-title>
          .
          <source>In Proceedings of the 8th German Workshop on Case Based Reasoning. Citeseer</source>
          ,
          <volume>187</volume>
          -
          <fpage>198</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Iovine</surname>
          </string-name>
          , Fedelucio Narducci, and Marco de Gemmis.
          <year>2019</year>
          .
          <article-title>A Dataset of Real Dialogues for Conversational Recommender Systems</article-title>
          . In CLiC-it
          <year>2019</year>
          . 6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          , Ahtsham Manzoor, Wanling Cai, and
          <string-name>
            <given-names>Li</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A Survey on Conversational Recommender Systems</article-title>
          . arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>00646</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Kang</surname>
          </string-name>
          , Kyle Condiff,
          <string-name>
            <given-names>Shuo</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , Loren Terveen, and
          <string-name>
            <given-names>F. Maxwell</given-names>
            <surname>Harper</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Understanding How People Use Natural Language to Ask for Recommendations</article-title>
          .
          <source>In Proceedings of the Eleventh ACM Conference on Recommender Systems - RecSys '17</source>
          . ACM Press, Como, Italy,
          <fpage>229</fpage>
          -
          <lpage>237</lpage>
          . DOI: http://dx.doi.org/10.1145/3109859.3109873
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Raymond</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Samira</given-names>
            <surname>Kahou</surname>
          </string-name>
          , Hannes Schulz, Vincent Michalski, Laurent Charlin, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Pal</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards Deep Conversational Recommendations</article-title>
          . (
          <year>2018</year>
          ),
          <fpage>17</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Xingkun</given-names>
            <surname>Liu</surname>
          </string-name>
          , Arash Eshghi, Pawel Swietojanski, and
          <string-name>
            <given-names>Verena</given-names>
            <surname>Rieser</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Benchmarking Natural Language Understanding Services for building Conversational Agents</article-title>
          . arXiv preprint arXiv:
          <year>1903</year>
          .
          <volume>05566</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Tariq</given-names>
            <surname>Mahmood</surname>
          </string-name>
          and
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Improving recommender systems with adaptive conversational strategies</article-title>
          .
          <source>In Proceedings of the 20th ACM conference on Hypertext and hypermedia - HT '09</source>
          . ACM Press, Torino, Italy, 73. DOI: http://dx.doi.org/10.1145/1557914.1557930
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Malykh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladislav</given-names>
            <surname>Lyalin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Named Entity Recognition in Noisy Domains</article-title>
          .
          <source>In 2018 International Conference on Artificial Intelligence Applications and Innovations (IC-AIAI)</source>
          .
          <volume>60</volume>
          -
          <fpage>65</fpage>
          . DOI: http://dx.doi.org/10.1109/IC-AIAI.
          <year>2018</year>
          .8674438 ISSN: null.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Nickel</surname>
          </string-name>
          , Lorenzo Rosasco, Tomaso A Poggio, and others.
          <year>2016</year>
          .
          <article-title>Holographic Embeddings of Knowledge Graphs</article-title>
          .
          <source>In The Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)</source>
          .
          <fpage>1955</fpage>
          -
          <lpage>1961</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Filip</given-names>
            <surname>Radlinski</surname>
          </string-name>
          , Krisztian Balog, Bill Byrne, and
          <string-name>
            <given-names>Karthik</given-names>
            <surname>Krishnamoorthi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Coached conversational preference elicitation: A case study in understanding movie preferences</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Deepak</given-names>
            <surname>Ravichandran</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Learning surface text patterns for a question answering system</article-title>
          .
          <source>In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics</source>
          . Association for Computational Linguistics,
          <fpage>41</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S</given-names>
            <surname>Sreelekha</surname>
          </string-name>
          , Pushpak Bhattacharyya, Shishir K Jha, and
          <string-name>
            <given-names>D</given-names>
            <surname>Malathi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A survey report on evolution of machine translation</article-title>
          .
          <source>Int. J. Control Theory Appl</source>
          <volume>9</volume>
          ,
          <issue>33</issue>
          (
          <year>2016</year>
          ),
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Suglia</surname>
          </string-name>
          , Claudio Greco, Pierpaolo Basile, Giovanni Semeraro, and
          <string-name>
            <given-names>Annalina</given-names>
            <surname>Caputo</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>An Automatic Procedure for Generating Datasets for Conversational Recommender Systems</article-title>
          . (
          <year>2017</year>
          ),
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanpreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel R</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          . arXiv preprint arXiv:1804.07461 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Raux</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Henderson</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Dialog State Tracking Challenge Series: A Review</article-title>
          .
          <source>Dialogue &amp; Discourse</source>
          <volume>7</volume>
          ,
          <issue>3</issue>
          (April
          <year>2016</year>
          ),
          <fpage>4</fpage>
          -
          <lpage>33</lpage>
          . http://dad.uni-bielefeld.de/index.php/dad/article/view/3685
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Caroline</given-names>
            <surname>Wisniewski</surname>
          </string-name>
          , Clément Delpuech, David Leroy,
          <string-name>
            <given-names>François</given-names>
            <surname>Pivan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Dureau</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Benchmarking Natural Language Understanding Systems</article-title>
          . (
          <year>2017</year>
          ). https://snips.ai/content/sdk-benchmark-visualisation/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>