=Paper= {{Paper |id=Vol-1927/paper6 |storemode=property |title=BotDCAT-AP: An Extension of the DCAT Application Profile for Describing Datasets for Chatbot Systems |pdfUrl=https://ceur-ws.org/Vol-1927/paper6.pdf |volume=Vol-1927 |authors=Paolo Cappello, Marco Comerio, Irene Celino |dblpUrl=https://dblp.org/rec/conf/semweb/CappelloCC17 }} ==BotDCAT-AP: An Extension of the DCAT Application Profile for Describing Datasets for Chatbot Systems== https://ceur-ws.org/Vol-1927/paper6.pdf
    BotDCAT-AP: An Extension of the DCAT Application
     Profile for Describing Datasets for Chatbot Systems

                    Paolo Cappello, Marco Comerio and Irene Celino
                             Cefriel – Politecnico di Milano
                               via Fucini, 2 – Milano, Italy
              E-mail: {paolo.cappello,marco.comerio,irene.celino}@cefriel.com



       Abstract. Although it is still an emerging technology, the increasing usage of
       chatbots (also known as bots) has opened a promising touchpoint for citizen and
       customer engagement. A chatbot consists of a computer program aimed at sim-
       ulating a conversation between humans and machines through the formulation of
       appropriate answers making use of external knowledge. Therefore, managing ex-
       ternal knowledge is a crucial task for the design and development of chatbots. To
       facilitate the reuse of existing data sources in chatbot applications, in this paper
       we propose BotDCAT-AP, an extension of the Data Catalogue (DCAT) Application
       Profile for describing datasets for chatbots. BotDCAT-AP enables the description
       of the intents (i.e., the actions users want to accomplish by interacting with
       a chatbot) and entities (i.e., individual information units associated with an
       intent) supported by a dataset, together with the method to access it. A practical
       use of BotDCAT-AP is shown to demonstrate the value of its adoption.


1      Introduction

The W3C’s Data Catalogue vocabulary (DCAT) [1] is an RDF vocabulary designed to
facilitate interoperability between data catalogs published on the Web. By using DCAT
to describe datasets in data catalogs, publishers increase discoverability and enable ap-
plications to easily consume metadata from multiple catalogs. The DCAT Application
Profile for data portals in Europe (DCAT-AP) [2] is a specification based on DCAT for
describing public sector datasets in Europe. Its basic use case is to enable searches for
a dataset across data portals and improve the sharing of public sector data.
   Several extensions of DCAT-AP are emerging, each focusing on specific types of
datasets and use cases. Examples are GeoDCAT-AP [3], which makes geospatial information
easier to search across borders and sectors, and StatDCAT-AP [4], which provides a
commonly agreed dissemination vocabulary for statistical open data.
   In this paper, we propose BotDCAT-AP, an extension of DCAT-AP to describe
datasets for chatbot systems. Since such systems make use of external knowledge to
simulate a conversation between humans and machines, BotDCAT-AP aims at simplifying
the creation of the software components managing the external knowledge.


2       Motivation

   Chatbots (like AzureBot1 or Herzi2) get user requests as natural language questions
through different input channels (e.g., Instant Messaging (IM) applications, social net-
works). They process requests with Natural Language Understanding (NLU) engines:
user questions are translated into machine understandable actions, because NLU en-
gines are capable of interpreting users’ input (utterances) by extracting the intent of
every single request and the possible entities contained in it. The intents represent what
the users wish to accomplish using the chatbot. The entities are domain specific infor-
mation items extracted from the user’s utterance that help in understanding the intent.
The utterance in natural language is first analyzed for the intent and entities by the NLU
engine and then mapped to a specific action that should be performed (e.g., access a
specific dataset through an API) as well as the specific dialog to be returned by the
chatbot. As described in [5], NLU engines (e.g., Microsoft LUIS3, Google API.ai4 and
Facebook Wit.ai5) are often complex, using various Natural Language Processing (NLP)
models and Machine Learning techniques to provide acceptable levels of accuracy.
NLU engines are trained on a set of sample utterances, so that at run-time the system
can associate new, unseen utterances with the correct intents and extract the
relevant entities.
   Let us consider a chatbot providing weather forecasts. This chatbot is able to interpret
utterances like “tell me the weather in Milan”, “what are the weather forecasts for
tomorrow?” and “will it rain this weekend?”. All of them are associated with the intent
“get weather”. Furthermore, the NLU engine extracts the entities “Milan”, “tomorrow” and
“weekend”, which help in further understanding the intent and characterize the action to
perform (querying the weather forecast data source to get information about a specific
location and time frame).
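   The mapping from an utterance to an intent and its entities can be illustrated with a deliberately naive sketch. Real NLU engines rely on trained models rather than keyword lookup; the keyword tables and the function name interpret are assumptions made only for this illustration.

```python
import re
from typing import Dict, List, Optional

# Hypothetical keyword tables; a real NLU engine learns such associations
# from a training set of sample utterances.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "get weather": ["weather", "rain", "forecast"],
}
ENTITY_KEYWORDS: Dict[str, List[str]] = {
    "location": ["milan"],
    "time": ["tomorrow", "weekend"],
}


def interpret(utterance: str) -> Dict[str, object]:
    """Return the first matching intent and every matching entity."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    intent: Optional[str] = next(
        (name for name, keys in INTENT_KEYWORDS.items() if words & set(keys)),
        None,
    )
    entities = {}
    for kind, keys in ENTITY_KEYWORDS.items():
        found = [key for key in keys if key in words]
        if found:  # keep only the entity kinds that actually occur
            entities[kind] = found
    return {"intent": intent, "entities": entities}


result = interpret("tell me the weather in Milan")
```

   Under these assumptions, the three sample utterances all resolve to the same intent and differ only in the extracted entities, which is the information the wrapper needs in order to query the data source.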
   Fig. 1 shows the general structure of a chatbot: even when relying on existing frame-
works providing channels and NLU engines, custom development is required to create
the wrapper that connects the chatbot components to the knowledge sources, i.e. the
datasets (API, data dump, linked data, etc.). This wrapper is used at design-time to train
the NLU engine to correctly identify intents and entities, and at run-time to retrieve the
necessary information from knowledge sources to answer user questions.
   To ease the development of such wrapper components, we introduce an enriched
semantic description of knowledge sources based on our BotDCAT-AP vocabulary:
this description includes information about the intents and entities supported by
the available datasets and their access methods. Such a description
can be used to standardize the wrapper development (adding value for the chatbot
developer) and to enable the reuse of datasets by multiple chatbot systems (adding value
for the dataset owner). With respect to the state of the art, the proposed vocabulary does
not aim at overcoming open challenges for Semantic Question Answering (SQA) systems
[6], which mainly concern the internals of NLU engines; rather, it aims at improving
the sharing of datasets useful for those systems.

1   https://microsoft.github.io/AzureBot/
2   https://devpost.com/software/herzi
3   https://docs.microsoft.com/en-us/azure/cognitive-services/luis/home
4   https://docs.api.ai/docs
5   https://wit.ai/docs




      Fig. 1. How BotDCAT-AP simplifies the creation and execution of chatbot systems.


3      BotDCAT-AP

BotDCAT-AP is an RDF vocabulary, denoted by the prefix bot in the following and
openly accessible at http://swa.cefriel.it/ontologies/botdcat-ap, released with
a CC-BY-4.0 license. BotDCAT-AP is also listed on Linked Open Vocabularies at
http://lov.okfn.org/dataset/lov/vocabs/bot.
   The vocabulary was developed starting from the Data Catalogue vocabulary
(DCAT) and its Application Profile (DCAT-AP), elaborated by a Working Group under
the ISA Programme of the European Commission. BotDCAT-AP is meant to be an
extension of DCAT-AP and follows all its conformance statements. A sound and solid
basis for describing datasets is needed to deliver an easily adaptable
solution that builds on a well-designed standard.
   A simplified UML Class diagram of BotDCAT-AP is depicted in Fig. 2, where additions
to the main classes and properties of DCAT-AP are highlighted. The bot:Intent
class is designed to represent any possible intent supported by a dataset. The relation
bot:hasEntitiesList connects an intent to a list of supported entities enclosed in an
instance of the class bot:EntitiesCatalog. Entities can be represented in different ways,
since BotDCAT-AP allows both standard and ad-hoc entities to be specified. The first
case is covered by the relation bot:hasEntity, which is used to relate an intent to entities
already specified in external ontologies. A practical example is a date defined in
the OWL-Time ontology, or a point-of-interest (POI) in the LinkedGeoData ontology
[7]. The generic owl:Class is used so that any concept defined
in external ontologies can be referred to.
   The bot:hasEntityConcept and bot:hasEntityDataset relations cover the other two
cases where entities are context-related and an external ontology covering such entities
is missing. The first relation targets the skos:Concept class and is used when the set
of possible entities is limited and there are hierarchies among them; in this case, entities
can be directly added to the BotDCAT-AP description as a SKOS taxonomy. Otherwise,
the Entity Catalog can be linked through bot:hasEntityDataset to a dataset enumerating
all the possible entities. A dataset is represented as an instance of the class
dcat:Dataset and can optionally have multiple distributions, each an instance of
dcat:Distribution and accessible through the reference given by the relation dcat:accessURL.
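   The choice among the three entity relations can be summarised as a small decision helper. This is only a sketch of the reasoning given in the text: the function name and its two boolean parameters are hypothetical, and the vocabulary itself does not prescribe such a procedure.

```python
def choose_entity_relation(defined_in_external_ontology: bool,
                           limited_and_hierarchical: bool) -> str:
    """Pick the BotDCAT-AP relation for a set of entities.

    Follows the three cases described in the text; the parameter
    names are illustrative assumptions, not part of the vocabulary.
    """
    if defined_in_external_ontology:
        # Entities already specified elsewhere, e.g. a date in OWL-Time
        return "bot:hasEntity"
    if limited_and_hierarchical:
        # Small, structured set: list the entities inline as a SKOS taxonomy
        return "bot:hasEntityConcept"
    # Otherwise point to a dataset enumerating all possible entities
    return "bot:hasEntityDataset"


relation_for_dates = choose_entity_relation(True, False)
relation_for_small_set = choose_entity_relation(False, True)
relation_for_open_set = choose_entity_relation(False, False)
```

   The order of the checks reflects the text's preference for reusing external ontologies when they exist, and for inline SKOS concepts only when the entity set is limited.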




                    Fig. 2. BotDCAT-AP simplified UML Class diagram

   As of today, DCAT-AP supports only the description of data catalogs and datasets
published on the web; BotDCAT-AP overcomes this limitation by making it possible
to also define different access methods for a particular dataset [8]. This is done through
the relations bot:hasMethodURL, bot:hasAssetURL and bot:hasDocumentation,
which correspond respectively to access points offered by a simple REST API, a
SPARQL endpoint or any other documented method (e.g., a SOAP-based web service
documented by a WSDL file). This extension delivers information that
simplifies and speeds up the creation of the application logic needed by the chatbot
system to operate at run-time.
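   A wrapper could use these three relations to decide how to reach a dataset. The sketch below assumes that the description has already been flattened into a dictionary keyed by relation name; the dictionary layout, the priority order, the function name and the example.org URLs are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Relation -> kind of access point, as listed in the text.
# The order of preference is an assumption, not part of the vocabulary.
ACCESS_RELATIONS = (
    ("bot:hasMethodURL", "rest_api"),
    ("bot:hasAssetURL", "sparql_endpoint"),
    ("bot:hasDocumentation", "documented_method"),
)


def select_access(description: Dict[str, str]) -> Tuple[str, str]:
    """Return (kind, reference) for the first access relation present."""
    for relation, kind in ACCESS_RELATIONS:
        reference = description.get(relation)
        if reference:
            return kind, reference
    raise ValueError("the description declares no access method")


# Hypothetical flattened descriptions with placeholder URLs
rest_dataset = {"bot:hasMethodURL": "http://example.org/api/events"}
soap_dataset = {"bot:hasDocumentation": "http://example.org/service.wsdl"}

rest_choice = select_access(rest_dataset)
soap_choice = select_access(soap_dataset)
```

   With such a lookup in place, the application logic that actually issues the REST call or SPARQL query can be selected from the description instead of being hard-coded per dataset.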


4      Use Case

The main purpose of BotDCAT-AP is to facilitate the implementation of chatbots by
providing a formal description of all the external datasets containing useful information.
In the following, we explain how we adopted the proposed vocabulary to describe the
data sources exploited by a bot application to provide information to end users.
Additional information on BotDCAT-AP and the full versions of the dataset descriptions in
RDF can be found at http://swa.cefriel.it/bot/profiles2017_botdcat-ap.html.


   Talkin’Piazza6 [9] is a web-based application, developed in the Piazza project7, that
aims to engage the urban community on the go to participate in city life. Among its
functionalities, Talkin’Piazza offers a bot that can be queried to get information about
city events, points of interest and public transport; to reply to citizens’ questions, the
Talkin’Piazza bot accesses public, external and heterogeneous data sources, which are
described with BotDCAT-AP to ease the wrapper development.
   The “Milano Events” dataset contains a list of events that take place in the city of
Milan. By accessing this dataset, the Talkin’Piazza chatbot is capable of replying to the
user’s intent by proposing events filtered by category, location and price. The chatbot
system can access the dataset by means of a Web API, whose reference URL is contained
in the BotDCAT-AP description at http://swa.cefriel.it/examples/botdcat-ap/Events.ttl.
Since an event is usually associated with a category stating its thematic
area (e.g., sport, art, entertainment, education) and with a type of admission (e.g., free
entrance, paid entrance), such units of information can be expressed by users in their
utterances. The EntityCatalogs EventsAdmissions and EventsCategories contain the entities
associated with the possible types of admission and thematic areas of the events.
EventsAdmissions contains only the entities FreeEntrance and PaidEntrance, which are
therefore simply defined as instances of skos:Concept. The same approach would not be
practical for EventsCategories, since the set of categories is large and dynamic. In this case, a
reference to an external dataset CategoriesDataset containing the list of all the possible
categories is used. In this way, the CategoriesDataset can be easily changed and updated
without modifying the BotDCAT-AP description associated with the “Milano
Events” dataset.
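   The two catalogs can be pictured as a handful of RDF-style statements. The sketch below is not the content of the published Events.ttl, which may be structured differently: the local names after ":" are placeholders, and it is assumed that bot:hasEntityConcept, like bot:hasEntityDataset, starts from the catalog.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Illustrative statements only; "a" stands for rdf:type as in Turtle.
EVENTS_TRIPLES: List[Triple] = [
    (":EventsAdmissions", "a", "bot:EntitiesCatalog"),
    (":EventsAdmissions", "bot:hasEntityConcept", ":FreeEntrance"),
    (":EventsAdmissions", "bot:hasEntityConcept", ":PaidEntrance"),
    (":FreeEntrance", "a", "skos:Concept"),
    (":PaidEntrance", "a", "skos:Concept"),
    (":EventsCategories", "a", "bot:EntitiesCatalog"),
    (":EventsCategories", "bot:hasEntityDataset", ":CategoriesDataset"),
    (":CategoriesDataset", "a", "dcat:Dataset"),
]


def to_statements(triples: List[Triple]) -> List[str]:
    """Render each triple as one Turtle-like statement."""
    return [f"{s} {p} {o} ." for s, p, o in triples]


def inline_entities(triples: List[Triple], catalog: str) -> List[str]:
    """Entities listed directly in the description for a given catalog."""
    return [o for s, p, o in triples
            if s == catalog and p == "bot:hasEntityConcept"]


admissions = inline_entities(EVENTS_TRIPLES, ":EventsAdmissions")
categories = inline_entities(EVENTS_TRIPLES, ":EventsCategories")
statements = to_statements(EVENTS_TRIPLES)
```

   In this sketch the admission types are listed inline, while the categories catalog lists nothing inline and only points to the external dataset, so the categories can change without touching the description.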
   Talkin’Piazza is able to provide the user with information about POIs all over the
city by accessing relevant data from OpenStreetMap8. The chatbot is trained to respond
to utterances such as “where can I find an ATM?”, “I’d like to know the location of the
restaurants near me”, “can you show me the nearest library?” and to assign them to the
intent GetPOIs. The OpenStreetMap profile based on BotDCAT-AP is at
http://swa.cefriel.it/examples/botdcat-ap/Overpass.ttl. Since POI categories
(e.g., ATM, restaurant, kiosk, railway station, library) are well-known concepts,
the entities included in the EntityCatalog POIsCategories are taken from the
LinkedGeoData ontology [7]. In general, this solution is useful when entities express concepts
already defined in external ontologies and vocabularies.


5       Conclusions

Chatbots represent one of the major rising trends, and their usage and distribution are
predicted to grow over the next years. Gartner places chatbot systems among the top
strategic technology trends for 2017 [10], trends that will evolve and expand the use of
artificial intelligence and machine learning in apps and services over the next 20 years.

6   The beta version still in development (Italian only, to try out the bot click on “Chiedi”) of the
    bot application is deployed at https://ns3056488.ip-213-32-26.eu/talkinpiazza2/
7   http://www.piazza.eu
8   http://wiki.openstreetmap.org


   With this growing demand and market potential for the development of chatbots, the
need arises to simplify and standardize how those systems access and reuse data
contained in knowledge sources. In this paper, we introduced the BotDCAT-AP vocabulary:
when employed to describe datasets, it can bring benefits both to dataset owners,
who enable their data to be further reused, and to chatbot developers, who are
supported in the software development.
   BotDCAT-AP can have a large impact by bringing value to the chatbot market: it
enables and fosters reusability of datasets across chatbot systems, it is designed as an
extension to DCAT-AP, and it is openly available online, published and documented
according to Semantic Web best practices and released with an open license. In the
future, we will improve the evaluation of our proposal and we will investigate the
community interest in establishing an official working group and proceeding with the
BotDCAT-AP standardization process.

Acknowledgement
    This work is partially supported by the Piazza activity (id 16391), co-funded by EIT Digital.


References
 1. Maali, F., Erickson, J., & Archer, P. (2014). Data catalog vocabulary (DCAT). W3C
    Recommendation. Available at: http://www.w3.org/TR/vocab-dcat/, last accessed 2017/05/10.
 2. ISA working group (2015). DCAT application profile for data portals in Europe. Available
    at: https://joinup.ec.europa.eu/system/files/project/dcat-ap_final_v1.00_0.html, last
    accessed 2017/05/10.
 3. ISA working group (2016). GeoDCAT-AP: A geospatial extension for the DCAT application
    profile for data portals in Europe. Available at: https://joinup.ec.europa.eu/node/154143/,
    last accessed 2017/05/10.
 4. ISA working group (2016). StatDCAT-AP – DCAT Application Profile for description of
    statistical datasets. Available at: https://joinup.ec.europa.eu/node/157143, last accessed
    2017/05/10.
 5. Kar, R., & Haldar, R. (2016). Applying Chatbots to the Internet of Things: Opportunities
    and Architectural Elements. Inter. Journal of Advanced Computer Science and Applications
    7(11), 147–154.
 6. Höffner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J., & Ngonga Ngomo, A. C. (2016).
    Survey on challenges of Question Answering in the Semantic Web. Semantic Web
    (Preprint), 1-26.
 7. Stadler, C., Lehmann, J., Höffner, K., & Auer, S. (2012). LinkedGeoData: A core for a web
    of spatial open data. Semantic Web 3(4), 333-354.
 8. Vu, Q. H., Pham, T. V., Truong, H. L., Dustdar, S., & Asal, R. (2012). Demods: A
    description model for data-as-a-service. In Proc. of the IEEE 26th International Conference
    on Advanced Information Networking and Applications (AINA 2012), pp. 605-612.
 9. Celino, I., Calegari, G. R., & Fiano, A. (2016, September). Towards Talkin'Piazza: Engaging
    citizens through playful interaction with urban objects. In Proc. of the IEEE International
    Conference on Smart Cities (ISC2 2016), pp. 1-5.
10. Gartner (2016). Top 10 Strategic Technology Trends for 2017. Gartner Report, 2016.
    Available at: https://www.gartner.com/doc/3471559/top--strategic-technology-trends