<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Chatbots as a Novel Access Method for Government Open Data?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Porreca</string-name>
          <email>porreca.1673726@studenti.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Leotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Mecella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiziana Catarci</string-name>
          <email>catarcig@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza Università di Roma, Dipartimento di Ingegneria Informatica, Automatica e Gestionale Antonio Ruberti</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this discussion paper, we propose to employ chatbots as a user-friendly interface for open data published by organizations, specifically focusing on public administrations. Open data are especially useful in e-Government initiatives, but their exploitation by end users is currently hampered by the lack of user-friendly access methods. On the other hand, the current user experience of social networks has accustomed people to chatting. Building on cognitive technologies, we prototyped a chatbot on top of the OpenCantieri dataset published by the Italian Ministero delle Infrastrutture e Trasporti, and we argue that such a model can be extended into a generally available access method to open data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Open data generally refers to the idea that some data should be freely
available to everyone to use and republish as they wish, without restrictions from
copyright, patents or other mechanisms of control. In particular, according to
the Open Definition, "a piece of data is open if anyone can freely access, use,
modify, and share for any purpose (subject, at most, to requirements that
preserve provenance and openness)"<sup>1</sup>. Open data should guarantee three
characteristics: (i) accessibility: all users can access the data, free of charge
or at very low cost; (ii) machine readability: the data can be natively
"understood" and processed by machines; and (iii) rights: the data are
released under licenses that place only light constraints on their use, transformation
and distribution.</p>
      <p>Closely related to open data, open government is the governing doctrine that
supports the right of citizens to access the documents and proceedings of the
government for effective public oversight. Making government information
available to the public as machine-readable open data enables interested citizens
to get more directly involved in the legislative process and can facilitate government
transparency, accountability and public participation. Opening up official
information can also support technological innovation and economic growth by enabling
third parties to develop new kinds of digital applications and services.</p>
      <p>Several national governments have created sites to distribute a portion of
the data they collect. At the European level, the European Commission has created two portals
for the European Union: the EU Open Data Portal<sup>2</sup>, giving access to open data
from the EU institutions, agencies and other bodies, and the Public Data
portal<sup>3</sup>, providing datasets from local, regional and national public bodies across
Europe. In October 2015, the Open Government Partnership (OGP) launched
the International Open Data Charter, a set of principles and best practices for
the release of governmental open data, formally adopted by seventeen countries
(including Italy<sup>4</sup>) during the OGP Global Summit in Mexico.</p>
      <p>Despite all these initiatives, citizen access to such open data is not
always as widespread as expected. Technical issues often inhibit easy access
unless specific user-friendly applications are built on top of the data for
access and navigation. In this discussion paper, we present the disruptive
idea of adopting chatbots as a user-friendly method for accessing and querying open
data. Nowadays people are used to chatting with friends over popular applications
(e.g., WhatsApp or Facebook Messenger), and the typical interaction is indeed
based on an "ask, get a response" paradigm. Citizens accessing open data would
appreciate the same paradigm when querying the data, for which a chatbot can be
a much more natural means of interaction than traditional web applications.</p>
      <p>Developing chatbots over open data poses many challenges, such as
interpreting the natural language that users adopt when querying the dataset and
translating it into effective queries over the dataset. In this paper, we present a
prototype of such a system built using the cognitive platform by IBM, namely
Bluemix and related APIs, in order to evaluate the technical feasibility of the
proposed idea.</p>
      <p>The remainder of this paper is organized as follows: Section 2 provides some
background information and relevant work; Section 3 describes the architecture
used to build a chatbot over the Open Cantieri dataset, published by the Italian
Ministero delle Infrastrutture e Trasporti at http://opencantieri.mit.gov.it,
by using a cognitive platform; and Section 4 describes the realization aspects.
Finally, Section 5 concludes the paper and outlines future work, including a
user evaluation to assess the usability of the approach and to verify the claimed
simplicity of use.</p>
      <sec id="sec-1-1">
        <title>2. cf. http://data.europa.eu/euodp/en/data/; 3. cf. http://publicdata.eu/; 4. cf. https://www.opengovpartnership.org/country/italy</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Background and relevant work</title>
      <p>
        Chatbots are computer programs able to hold a conversation with a user,
either in textual or vocal form. Given the growing complexity of information
systems, chatbots are specifically designed to support user interaction and
to make it as natural as possible. They not only represent a faster and more
natural way to access information, but are also expected to be a key factor
in the process of humanizing machines in the near future [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ].
      </p>
      <p>
        In order to design a chatbot correctly and efficiently, many techniques have
to be combined in non-trivial ways, including pattern matching, parsing, artificial
intelligence, machine learning, and ontologies. Numerous approaches and
methodologies have been proposed for this; in this work, we followed the approach that
divides the chatbot architecture into three parts: responder, classifier and
graphmaster [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The responder is the interface through which users access the system. It
is responsible for taking the input and validating the output. The classifier is
located between the responder and the graphmaster. It normalizes
the input passed to the graphmaster and processes the output coming from
the latter (e.g., interacting with a database). Finally, the graphmaster is the
agent responsible for producing the correct output for a given input. It
represents the pattern matching element of the chain.
      </p>
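      <p>The division of labour among the three parts can be illustrated with a minimal sketch. It is not the implementation described in [1] or used in our prototype: the pattern table, the function names and the exact-match lookup are all assumptions made for illustration, and real graphmasters use far richer matching.</p>

```python
import re
from typing import Optional

# Hypothetical pattern table standing in for the graphmaster's knowledge base.
PATTERNS = {
    "HELLO": "Hi! Ask me about public works.",
    "WHAT IS OPEN DATA": "Data that anyone can freely access, use, modify and share.",
}


def classifier_normalize(text: str) -> str:
    """Classifier: strip punctuation, collapse whitespace, upper-case the input."""
    cleaned = re.sub(r"[^\w\s]", "", text)
    return " ".join(cleaned.split()).upper()


def graphmaster_match(normalized: str) -> Optional[str]:
    """Graphmaster: exact pattern lookup (an assumption; real systems match wildcards)."""
    return PATTERNS.get(normalized)


def responder(user_input: str) -> str:
    """Responder: take the input, delegate to the chain, and validate the output."""
    answer = graphmaster_match(classifier_normalize(user_input))
    return answer if answer is not None else "Sorry, I did not understand."
```

      <p>For instance, calling responder on "  hello!! " normalizes the input to "HELLO" and returns the greeting, while an unknown sentence falls back to the default reply.</p>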
      <p>
        Tim Berners-Lee suggested a 5-star deployment scheme for open data<sup>5</sup>:
1 star when an organization makes data available on the Web (in whatever format)
under an open license, 2 stars when it makes data available as structured data
(e.g., an Excel file instead of an image scan of a table), 3 stars when data are available
in a non-proprietary open format (e.g., CSV instead of Excel), 4 stars when
URIs are used to denote things, so that people can point at them, and 5 stars when
data are linked to other data to provide context [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The technologies supporting
this vision of linked open data are those commonly referred to as the Semantic
Web, including RDF/RDFS, OWL (ontologies) and SPARQL (for querying). In
Italy, AgID (Agenzia per l'Italia Digitale) publishes guidelines every year for
Public Administrations on how to publish their data as open, including a model
for metadata consisting of 4 levels<sup>6</sup>. In this work, we have built our prototype
on the basis of a dataset which can be ranked at most at level 3 of the above
classification.
      </p>
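      <p>Because each level of the 5-star scheme presupposes the previous ones, the rating can be read as a simple cumulative check. The sketch below only illustrates that reading; the function name and its boolean parameters are assumptions, not part of the scheme's definition.</p>

```python
def star_rating(open_license_on_web: bool, structured: bool,
                open_format: bool, uses_uris: bool, linked: bool) -> int:
    """Count stars cumulatively: stop at the first level that is not satisfied."""
    stars = 0
    for satisfied in (open_license_on_web, structured, open_format, uses_uris, linked):
        if not satisfied:
            break
        stars += 1
    return stars
```

      <p>Under this reading, a collection of CSV files published under an open license, with no URIs and no links to other data, scores 3.</p>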
      <p>
        In recent years, there have been some attempts to use chatbots to query and retrieve
data. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a chatbot was constructed on top of some open data.
Here the first step is to extract plain text from documents stored as PDF files
by employing optical character recognition (OCR) software. Next, a
set of possible questions about the extracted contents is constructed using an
"Overgenerating Transformations and Ranking" algorithm,
implemented using the question generation framework presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Finally, the matching patterns, essential to the chatbot's answering capability, are defined
through the Artificial Intelligence Markup Language (AIML).
      </p>
      <sec id="sec-2-1">
        <title>5. cf. http://5stardata.info/en/; 6. cf. http://www.agid.gov.it/agenda-digitale/open-data</title>
        <p>
          The authors of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] present a system, called OntBot, which employs a mapping
technique to transform an ontology into a relational database and then uses that
knowledge to construct answers. The main drawback of traditional chatbots,
implemented for example through AIML, is that the knowledge base has to be
constructed ad hoc by handwriting thousands of possible responses. Like our
solution, OntBot does not require the knowledge base behind the system to be
handwritten: rather than looking for a matching answer inside the database, it retrieves
information from the database and uses it to build the response.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Case study and proposed architecture</title>
      <p>Open Cantieri, offered by the Italian Ministero delle Infrastrutture e dei
Trasporti (MIT), is an open, complete and up-to-date repository about the
status and history of public infrastructure works. All the available data
are generated and published by public sources. Open Cantieri offers a unified
platform, with specific views, in which all these different datasets are collected
together. The platform is a collection of open data in a very raw form: datasets
can be downloaded as single CSV files, sometimes grouped in archives. Very
often, unfortunately, different files do not employ the same keys to represent
concepts (e.g., cities are represented by their code in some files and by their
name in others), so a manual mapping between the different representations
was needed. More generally, the files do not follow any standard for field names
and reported values.</p>
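      <p>The kind of manual mapping mentioned above can be sketched as follows. The two CSV excerpts, their column names and the code-to-name table are hypothetical stand-ins, not taken from the Open Cantieri files; the sketch only shows how a curated mapping lets records keyed by city code be joined with records keyed by city name.</p>

```python
import csv
import io

# Hypothetical excerpts: one file keys cities by code, the other by name.
WORKS_CSV = "city_code,work\n058091,Metro line C\n015146,Ring road\n"
COSTS_CSV = "city_name,cost_eur\nRoma,1000\nMilano,2500\n"

# Manually curated mapping between the two representations (illustrative values).
CODE_TO_NAME = {"058091": "Roma", "015146": "Milano"}


def join_on_city(works_csv: str, costs_csv: str) -> list:
    """Join the two files on the city, translating codes to names first."""
    costs = {row["city_name"]: int(row["cost_eur"])
             for row in csv.DictReader(io.StringIO(costs_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(works_csv)):
        name = CODE_TO_NAME[row["city_code"]]  # raises KeyError on an unmapped code
        joined.append({"city": name, "work": row["work"], "cost_eur": costs[name]})
    return joined
```

      <p>Failing loudly on an unmapped code is a deliberate choice in this sketch: a silent default would hide exactly the inconsistencies that the manual mapping is meant to surface.</p>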
      <p>Figure 1 shows the architecture of the proposed solution. The user interface to
the system is implemented as a Facebook Messenger application. The back-end
core of the system is deployed on the IBM Bluemix cloud computing platform.
In particular, a JavaServer Page (JSP) handles the requests and constructs
the corresponding responses by orchestrating two Bluemix service instances: (i)
an instance of Watson Conversation, specifically created and trained, which is
responsible for natural language processing, and (ii) an instance of Compose for MySQL, which
handles the connection and the queries to the back-end database. The interaction proceeds as follows.
The user interacts through the chat interface, e.g., asking a question such as "How
much money has been invested in public infrastructures in the south of Italy
in 2015?" (step 1). The sentence is forwarded to the JSP, which handles it in
order to construct the appropriate response (step 2). The Watson Conversation
instance receives a request built from the user's input and
generates the corresponding response (step 3). According to this response,
the JSP defines an SQL query to be issued to the database through
Compose (step 4). Once all the elements needed to construct the output have been collected,
the JSP generates the response to be shown in the Facebook
Messenger chat.</p>
      <p>[Figure 1: Architecture of the proposed solution, connecting the User Interface, the Server-Side Application, Watson Conversation and the Database. Labelled steps: 1. the user asks "How much money has been invested in public infrastructures in the south of Italy in 2015?"; 2. the input is passed to the application; 3. a request is sent to the Conversation service instance; 4. data are retrieved from the database for constructing the response.]</p>
      <p>In order to generate the response, Watson Conversation performs the
following operations (see also the following subsections): extraction of intents and
entities; identification of the node within the Dialog Tree whose
conditions are satisfied by this information; and, finally, return of that node's response.
In our specific case, the intent is triggered by "How much money", which is
associated with the intention of knowing the investment amount for the
highway management, i.e., the intent "#Investment". The entities are "South
of Italy", which is a specific value of the entity "@Geographical Region", and
"2015", which is a value of the entity "@Year". The Conversation response contains
all the information the application needs to construct the output. First
of all, a flag called "DB Search" is read from the Conversation response
to determine whether a database search is required. If it is, the application
constructs the SQL query from the information obtained from Watson
Conversation and sends it to Compose for MySQL, which retrieves the
desired data from the database. The SQL query is built from
the "Text" included in the Conversation response, which is one of the many JSON
variables returned with the response itself. Once all the elements needed to
construct the output have been collected, the application generates the user's
output and sends it back through the interface.</p>
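      <p>The flow just described (read the flag, query the database, fill the answer) can be sketched as below. Several things here are assumptions made for illustration: the response dictionary is a simplified shape and not the exact Watson Conversation JSON, the table and column names are hypothetical, an in-memory SQLite database stands in for Compose for MySQL, and the query is built from the recognized entities as a parameterized statement rather than from the "Text" variable used in our prototype.</p>

```python
import sqlite3

# Hypothetical, simplified shape of a Conversation response (not the exact Watson JSON).
RESPONSE = {
    "intents": [{"intent": "Investment"}],
    "entities": [
        {"entity": "Geographical Region", "value": "South of Italy"},
        {"entity": "Year", "value": "2015"},
    ],
    "context": {"DB Search": True},
    "output": {"text": "Investments in the {region} in {year}: {total} EUR"},
}


def answer(response: dict, db: sqlite3.Connection) -> str:
    """Fill the response template, querying the database only if the flag is set."""
    template = response["output"]["text"]
    if not response["context"].get("DB Search"):
        return template  # no database search needed
    entities = {e["entity"]: e["value"] for e in response["entities"]}
    region, year = entities["Geographical Region"], int(entities["Year"])
    # Parameterized query: user-derived values are never spliced into the SQL text.
    row = db.execute(
        "SELECT COALESCE(SUM(amount_eur), 0) FROM investment "
        "WHERE region = ? AND year = ?",
        (region, year),
    ).fetchone()
    return template.format(region=region, year=year, total=row[0])


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with made-up figures."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE investment (region TEXT, year INTEGER, amount_eur INTEGER)")
    db.executemany("INSERT INTO investment VALUES (?, ?, ?)", [
        ("South of Italy", 2015, 300),
        ("South of Italy", 2015, 200),
        ("North of Italy", 2015, 900),
    ])
    return db
```

      <p>With the made-up figures above, answer(RESPONSE, demo_db()) sums the two matching rows and fills the template with a total of 500.</p>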
      <p>In the following, we describe the individual components.</p>
      <sec id="sec-3-1">
        <title>Watson Developer Cloud and Conversation</title>
        <p>IBM Watson Developer Cloud (WDC) offers a set of services for developing
Cognitive Applications, that is, programs able to take advantage of the
most modern technologies in artificial intelligence, machine learning, and natural
language processing. Each WDC service provides a REST API for interacting
with it, and most of these services also include Software Development Kits
(SDKs) for various programming languages. In our work, we used the Java SDK.</p>
        <p>Within IBM WDC, the Watson Conversation service makes it possible to create an
application that understands natural language input and uses machine learning to
respond to users in a way that simulates a conversation between humans. An
instance of this service can contain several workspaces.
A workspace is a container for all the artifacts that define the conversation flow,
and it is responsible for the natural language processing operations. A workspace
includes the following elements:</p>
        <p>Intents. An intent represents the intention and the purpose behind user input.
It can be associated with the "goal" the user wants to achieve with a
request, and thus it is important to define one intent for each type of user
request the application has to support. Each intent is prefixed with the
character "#" and, during its creation, the developer is encouraged to provide
"positive examples", in order to allow the system to construct the
corresponding model. A positive example is a sentence that clarifies the way in
which the intent could be presented to the system. Given at least five
positive examples, the instance of the Conversation service is able to
perform a deep learning process, which trains the service itself to
recognize that specific intent. The most important point is to define each
intent so that it is clearly distinct from the others. The "borders" between intents should be clear in
order to allow the system to correctly recognize them inside user requests.
If an intent needs to have several "interpretations", depending
on the particular user request, it is possible to make use of another element of
Watson Conversation: the Entity.</p>
        <p>Entities. An entity is an element, corresponding to a term or an object, that
can be used to better specify the intention behind a user request.
Entities are frequently used in combination with intents to increase their range
of possible interpretations and meanings. Each entity is prefixed with the
character "@" and is associated with a set of values. Each value of a specific
entity represents an object or a term that belongs to the category
defined by the entity itself. For example, an entity called "day of the week"
could include values such as "Monday", "Thursday", etc. For each
value, the developer can add synonyms, to make
sure that the system will recognize a specific value of the entity even if the
user expresses it with a different word.</p>
        <p>Dialog. The dialog represents the flow of the conversation, divided into several
branches, and defines how the application responds when it recognizes the
defined intents and entities. The dialog is composed of several nodes,
structured as a tree-like graph. At a very basic level, each node is defined by two
main elements: the condition and the response. When the condition,
composed of elements such as intents and entities, is satisfied, the node is considered
"activated" and its response is returned as output. The response
can be a sentence, another node, or an output defined by the developer.
In order to maintain the state of the conversation across interactions
with the user, the service instance keeps a JSON variable called "context".
This element holds several variables, which can be customized by the
developer. Among them is the "Dialog Stack", which contains the
stack of all the nodes visited during the conversation; its first element, the
"Contextual Node", is the ID of the node that should be returned when the
user starts another interaction, within the same session, with the Watson
Conversation service instance.</p>
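        <p>How entities, synonyms, node conditions and the context fit together can be sketched with a small rule-based stand-in. It is not the Watson Conversation API: intent recognition, which the service learns from positive examples, is assumed to have already happened, and the node identifiers, the "#OpeningHours" intent and the "dialog_stack" key are hypothetical names chosen for the example.</p>

```python
from dataclasses import dataclass

# Entity values with their synonyms (illustrative data).
ENTITIES = {
    "day of the week": {
        "Monday": ["monday", "mon"],
        "Thursday": ["thursday", "thu"],
    }
}


@dataclass
class Node:
    node_id: str
    intent: str    # condition: the intent that must be recognized
    entity: str    # condition: an entity that must be present ("" if none)
    response: str  # may contain a {value} placeholder for the entity value


NODES = [
    Node("hours_for_day", "#OpeningHours", "day of the week", "Hours for {value}: 9-17."),
    Node("hours_ask_day", "#OpeningHours", "", "Which day are you interested in?"),
]


def extract_entities(text: str) -> dict:
    """Return {entity: value} for every value whose synonym appears as a word."""
    words = set(text.lower().replace("?", " ").split())
    found = {}
    for entity, values in ENTITIES.items():
        for value, synonyms in values.items():
            if words.intersection(synonyms):
                found[entity] = value
    return found


def run_dialog(nodes: list, intent: str, entities: dict, context: dict) -> str:
    """Activate the first node whose condition holds and record it in the context."""
    for node in nodes:
        if node.intent == intent and (not node.entity or node.entity in entities):
            context.setdefault("dialog_stack", []).append(node.node_id)
            return node.response.format(value=entities.get(node.entity, ""))
    return "Could you rephrase that?"
```

        <p>Listing the more specific node first matters in this sketch: the first node whose condition is satisfied wins, so the generic fallback for the same intent only fires when no day has been recognized.</p>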
      </sec>
      <sec id="sec-3-2">
        <title>Compose for MySQL</title>
        <p>IBM Compose for MySQL is a platform that simplifies the maintenance and
the management of a MySQL database; it automatically executes some common
operations such as backups, scaling and health checks. Even though the database can
be accessed as a normal MySQL database, the main benefit offered by Compose
is that no management aspects (such as security issues or scaling) have to be
handled manually.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Realization aspects</title>
      <p>As seen in the previous sections, at a certain point of the interaction with the
user, the system (through the Compose component) needs to retrieve data
from a database in order to build answers. As stated above, the Open Cantieri
dataset does not follow any standard; thus, a database schema able to rationalize
the information contained in the different CSVs has been defined. The resulting
schema no longer matches the original files in terms of tables
and columns, making it necessary to perform an ETL
(Extract, Transform and Load) operation.</p>
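      <p>A minimal sketch of such an ETL step is given below. The raw files, their field names, the number formats and the target table are all hypothetical, and an in-memory SQLite database again stands in for the MySQL instance; the point is only to show per-file field mappings feeding one unified schema.</p>

```python
import csv
import io
import sqlite3

# Hypothetical raw files: same concept, different field names and number formats.
FILE_A = "Comune;Importo\nRoma;1.000,50\n"
FILE_B = "city;amount_eur\nMilano;2500.00\n"

# Per-file mapping from raw field names to the unified schema (assumed names).
FIELD_MAPS = {
    "a": {"Comune": "city", "Importo": "amount_eur"},
    "b": {"city": "city", "amount_eur": "amount_eur"},
}


def parse_amount(raw: str) -> float:
    """Accept both '1.000,50' (Italian style) and '2500.00'."""
    if "," in raw:
        raw = raw.replace(".", "").replace(",", ".")
    return float(raw)


def etl(db: sqlite3.Connection, raw_csv: str, field_map: dict) -> None:
    """Extract rows from one raw file, rename the fields, and load them."""
    reader = csv.DictReader(io.StringIO(raw_csv), delimiter=";")        # extract
    for row in reader:
        clean = {field_map[key]: value for key, value in row.items()}   # transform
        db.execute(
            "INSERT INTO work (city, amount_eur) VALUES (?, ?)",        # load
            (clean["city"], parse_amount(clean["amount_eur"])),
        )


def build_db() -> sqlite3.Connection:
    """Create the unified table and load both hypothetical files into it."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE work (city TEXT, amount_eur REAL)")
    etl(db, FILE_A, FIELD_MAPS["a"])
    etl(db, FILE_B, FIELD_MAPS["b"])
    return db
```

      <p>Keeping one explicit field map per source file makes each manual decision visible and easy to review when a new file with yet another naming convention is added.</p>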
      <p>Once the database was set up, we had to configure the Watson Conversation
service to be able (i) to understand user requests, (ii) to find out whether a database
search is needed to fulfill the request, and (iii) to present a response template
to the user. The response template is filled in by the JSP using the data retrieved
from the database through the Compose component.</p>
      <p>In order to define and create a Watson Conversation instance, we need to
define the intents and entities useful for our purposes. An intent, in our case,
represents a topic the user is interested in, e.g., the highway management or the
airport system. In order to define intents correctly, we collected all the
possible ways in which a user could refer to them, and then passed these as
positive examples in the intent creation process. An entity, on the other hand,
corresponds to the values that may concern a specific intent, e.g., the
concessionaire companies for the highway management or the airports of the airport system.
All these elements were then used to construct the dialog of the Watson
Conversation instance. Here, we had to figure out all the possible questions the
user might ask and the ways in which they might ask them. The user may, for example,
specify a topic, and then ask for more specific data about it through further
questions. For instance, the user may specify the highway management as the topic and then
ask for the names of all the concessionaire companies. The user may, at any time,
specify a new topic or ask further questions, as in a human-like conversation.
The system can also recognize when the user enters an invalid input, and it
helps the user to correct it.</p>
    </sec>
    <sec id="sec-5">
      <title>Concluding remarks</title>
      <p>We have presented our preliminary idea of combining a chatbot with open data.
It involves several novel tools and services that are
increasingly used by researchers and practitioners involved in the
development of smart services. Our intent was to take advantage of these new
technologies in order to build something new, able to improve the accessibility
of open government data.</p>
      <p>Open data should be accessible to the public; with our prototype, we would
like to showcase a new means of consulting them that allows the user
to easily retrieve and analyze them.</p>
      <p>Future work will include an extensive validation of the approach on a
sample of users. Additionally, structured data represent only one facet of government
complexity. Next steps will include the automatic analysis of procedures in order
to provide users with a means to explore bureaucracy in a simpler manner. The
combination of structured and unstructured data may provide public
administrations with a useful tool to turn open data into something directly usable by
citizens.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.A.</given-names>
            <surname>Abdul-Kader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Woods</surname>
          </string-name>
          .
          <article-title>Survey on Chatbot Design Techniques in Speech Conversation Systems</article-title>
          .
          <source>International Journal of Advanced Computer Science and Applications</source>
          ,
          <volume>6</volume>
          (
          <issue>7</issue>
          ),
          <year>2015</year>
          , https://thesai.org/Downloads/Volume6No7/Paper_12-Survey_on_Chatbot_Design_Techniques_in_Speech_Conversation_Systems.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Y.P.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>An Innovative Distributed Speech Recognition Platform for Portable, Personalized and Humanized Wireless Devices</article-title>
          .
          <source>Computational Linguistics and Chinese Language Processing</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>77</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Pichponreay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.H.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Smart Answering Chatbot based on OCR and Overgenerating Transformations and Ranking</article-title>
          .
          <source>Proc. ICUFN</source>
          <year>2016</year>
          , IEEE, DOI: 10.1109/ICUFN.2016.7536948
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Question Generation via Overgenerating Transformations and Ranking</article-title>
          . Language Technologies Institute, Carnegie Mellon University,
          <source>Technical Report CMU-LTI-09-013</source>
          ,
          <year>2009</year>
          , http://www.cs.cmu.edu/~ark/mheilman/questions/papers/heilman-smith-qg-tech-report.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H.</given-names>
            <surname>Al-Zubaide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Issa</surname>
          </string-name>
          .
          <article-title>OntBot: Ontology Based ChatBot</article-title>
          .
          <source>Proc. ISIICT</source>
          <year>2011</year>
          , IEEE, DOI: 10.1109/ISIICT.2011.6149594
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>