A Novel Integrated Industrial Approach with Cobots in the Age of Industry 4.0 through Conversational Interaction and Computer Vision

Andrea Pazienza, Nicola Macchiarulo, Felice Vitulano, Antonio Fiorentini and Marco Cammisa
Innovation Lab, Exprivia S.p.A.
{andrea.pazienza, nicola.macchiarulo, felice.vitulano, antonio.fiorentini, marco.cammisa}@exprivia.com

Leonardo Rigutini, Ernesto Di Iorio, Achille Globo and Antonio Trevisi
QuestIT S.r.l.
{rigutini, diiorio, globo, trevisi}@quest-it.com

Abstract

English. From robots that replace workers to robots that serve as helpful colleagues, the field of robotic automation is experiencing a new trend that represents a huge challenge for component manufacturers. This contribution starts from an innovative vision of an ever closer collaboration between the Cobot, able to do a specific physical job with precision, the AI world, able to analyze information and support the decision-making process, and the human, able to have a strategic vision of the future.

1 Introduction

In the last century, the manufacturing world adopted solutions for the advanced automation of production systems. Today, thanks to the evolution and maturity of new technologies such as Artificial Intelligence (AI), Machine Learning (ML), new-generation networks, and the growing adoption of the Internet of Things (IoT), a new paradigm is emerging that aims at integrating Cyber-Physical Systems (CPS) with business processes, thus opening the doors to the fourth industrial revolution (Industry 4.0) and to an era driven by information, further handled with cognitive computing techniques (Wenger, 2014).

Robots and humans have been co-workers for years, but rarely have they truly worked together. This may be about to change with the rise of Collaborative Robotics (Colgate et al., 1996). Collaborative Robots (better known as Cobots) are specifically designed for direct interaction with a human within a defined collaborative work-space, i.e., a safeguarded space where the robot and a human can perform tasks simultaneously during an automatic operation. Human-robot collaboration thus fosters various levels of automation and human intervention: tasks can be partially automated when a fully automated solution is not economical or too complex. Manufacturers may therefore benefit from the rise of AI-driven automation, while the progress of Adaptable End Effector devices, mounted at the end of a Cobot's arm, can help to perform specific intelligent tasks (Dubey and Crowder, 2002).

The way in which Cobots and humans interact, exchanging and conveying information, is fundamental. A key role in this landscape is played by Conversational Interfaces (Zue and Glass, 2000), which take advantage of recent achievements in the field of Natural Language Processing (NLP) to understand a user's needs and generate the right answer or action. In this scenario, Computer Vision also plays an important role in the process of creating collaborative environments between humans and robots. Systems of this type have already been introduced into industry to facilitate tasks such as product quality control or component assembly inspection. Giving vision to a robot makes it able to understand the industrial environment that surrounds it and improves the execution of tasks in support of other people. Improving robot software with AI will be key to making robots more collaborative.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The work starts from an innovative vision that beholds, in the future, an ever closer collaboration between the Cobot, able to do a specific physical job with precision and without alienation, the AI world, able to analyze, process, and learn from information and to support the decision-making process, and the employee, able to have a strategic vision of the future. To validate its effectiveness, a collaborative environment between employee, Cobot and AI systems has been crafted that makes it possible for the three subjects to communicate in a simple way, without requiring the employee to have specific skills to interact with the Cobot and with Enterprise Resource Planning (ERP) systems.

Our contribution is placed in this scenario, where the convergence of multiple technologies allows us to define a new approach to the management of a core business process (e.g. shipments) that ensures increasing flexibility thanks to a simplification of human interaction with Cyber-Physical Systems and a better coordination between the physical world (the packaging line) and that of IT processes (the ERP model). In the belief that the complexity of new industrial production systems requires interdisciplinary skills, our intent is to bring together knowledge from related disciplines such as computational linguistics, cognitive science, machine learning, computer vision, human-machine interaction, and collaborative robotics automation, towards an integrated novel approach specifically designed for the smart management of a manufacturing process line, fostering and strengthening the synergy and the interaction between robot and human.

Our research is broadly situated in Human-Robot Collaboration (HRC), a promising robotics discipline focusing on enabling robots and humans to operate jointly to complete collaborative tasks. Recent works tried to figure out in which ways Cobots may help humans in collaborative industrial tasks (El Zaatari et al., 2019) or in participatory design in fablabs (Ionescua and Schlunda, 2019). An initial study centered cobots in advanced manufacturing systems (Djuric et al., 2016). Little or no work (Ivorra et al., 2018) has been done to endow Cobots with cognitive intelligence such as conversational interaction and computer vision.

This paper is organized as follows. Section 2 introduces the functionalities and the architecture of our approach, focusing on its four main technological aspects: cobots, adaptable end effectors, conversational interfaces, and computer vision. Section 3 describes the application scenarios specifically designed for our approach, such as Smart Manufacturing. Finally, Section 4 discusses the proposed framework and concludes the paper, outlining future work.
2 Architecture Proposal

In this Section we introduce our main proposal, taking into account the requirements coming from the different technologies involved. The leading idea is to develop and validate a general framework for an Intelligent Cyber-Physical System made up of four crucial components: (i) a Cobot, equipped with (ii) an adaptable end effector, which may change according to the specific scenario, and two major components coming from the AI world, i.e. (iii) a Computer Vision module that allows the cobot to detect objects, and (iv) one or more Conversational Interfaces that facilitate the human-machine interaction and keep the human in the loop. Figure 1 depicts the prototypical architecture of our framework proposal.

[Figure 1: Framework Architecture including Conversational Interfaces, Computer Vision and Cobot]

In order to integrate the different technologies, from the high-level voice command to the low-level executable command, we developed a web application, powered by the Spring Boot framework2, able to receive commands from user interfaces and transform them into machine commands. The framework supports giving vocal commands to the Cobot: the mechanical arm is controlled through a series of connected conversational devices such as chat-bots, powered by Cisco WebEx Teams3 and QuestIT Algho4, and a virtual assistant such as Amazon Alexa5. In particular, Cisco WebEx Teams is an all-in-one solution for messaging, file sharing, white-boarding, video meetings, and calling, while Amazon Alexa is a voice interaction device capable of a large set of human-interaction functions.

The Cobot, through the use of a camera, is able to acquire images and process them with Computer Vision algorithms, recognizing exactly the object to be selected without knowing its position in advance. The vocal commands sent via Alexa are managed by lambda functions using the AWS Lambda service6, a serverless event-driven computing platform. It permits executing code in response to particular events, automatically managing the resources required by the code. Indeed, Lambda's goal is to simplify the construction of on-demand applications that respond to events and new information.

All commands are therefore sent via HTTP calls to the Spring Boot web application, which also receives calls from the one or more chat-bots with which the user can interact. Once a command has been received, the web application executes a C# application, based on the Fanuc SDK, that sends the Cobot a request to execute a particular script written in the Teach Pendant (TP) language. A minimal sketch of the Alexa-to-web-application hop is given below.

2 https://spring.io/projects/spring-boot
3 https://www.webex.com/team-collaboration.html
4 https://www.alghoncloud.com/it/
5 https://developer.amazon.com/it/alexa
6 https://aws.amazon.com/it/lambda/
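As an illustration of the Alexa-to-web-application hop described above, the following minimal Python sketch shows an AWS Lambda handler that forwards a recognized intent to the command web application over HTTP. The endpoint URL, intent name, and slot layout are illustrative assumptions, not the actual interfaces of the prototype.

    # Hypothetical sketch: an AWS Lambda handler (Python runtime) that receives
    # an Alexa intent and forwards it as an HTTP call to the command web
    # application. URL and slot names are placeholders, not the real system.
    import json
    import urllib.request

    WEB_APP_URL = "http://example-factory-gateway/api/commands"  # placeholder

    def lambda_handler(event, context):
        # Alexa delivers the recognized intent and its slots in the request.
        intent = event["request"]["intent"]
        command = {
            "name": intent["name"],  # e.g. a hypothetical "PickObjectIntent"
            "slots": {k: v.get("value")
                      for k, v in intent.get("slots", {}).items()},
        }
        # Forward the high-level command to the web application, which turns
        # it into a low-level robot instruction (a Teach Pendant script request).
        req = urllib.request.Request(
            WEB_APP_URL,
            data=json.dumps(command).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            ok = resp.status == 200
        # Return a minimal Alexa-compliant speech response.
        text = "Command sent to the cobot." if ok else "The cobot is unreachable."
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": True,
            },
        }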
2.1 Cobot with Fanuc

The right choice of a cobot comes with the fulfillment of various safety requirements, such as collision stop protection, a function to restart it easily and quickly after a stop, and anti-trap features for additional protection. For our purposes, we used a Cobot from Fanuc, in particular the CR-4iA model7. It is endowed with six axes in its arm, and its maximum payload is 4 kg. It handles lightweight tasks that are tedious and highly manual. Since it can take over these dull jobs, the operator's hands are free to focus on more intelligent work or more pressing matters. This cobot can also work side-by-side with humans on tasks that are more complex and require more interactive approaches.

2.2 Adaptable End Effector with Schunk

This category includes grippers, which hold and manipulate objects, and end-of-arm tools (EOATs), which are complex systems of grippers designed to handle large or delicate components. Handling tasks mainly include pick and place, sorting, packaging, and palletizing. As gripping tool we used the Schunk Co-act EGP-C gripper8, an electric 2-finger parallel gripper certified for collaborative operation, actuated via 24 V and digital I/O. It is used for gripping and moving small and medium-sized workpieces with flexible force in collaborative operation in the areas of assembly, electronics, and machine tool loading. We chose this model for its certified and pre-assembled gripping unit with functional safety, and for its "plug & work" mode with Fanuc cobots.

2.3 Conversational Interfaces with Algho

The achievements in the field of Artificial Intelligence (AI) in recent years have led to the birth of a new paradigm of human-machine interaction: conversational agents. This new way of interacting with a computer is based on the use of natural language and is getting closer to the way humans communicate with each other. Conversational agents take advantage of recent achievements in the field of Natural Language Processing (NLP) to understand user requests and behave accordingly, providing appropriate answers or performing the required actions. The design of an innovative Cobot cannot fail to consider such a straightforward form of human-machine interaction.

The conversational functionalities for the Cobot described in this paper have been provided using Algho4, a proprietary conversational-agent building tool developed by QuestIT9 and based on NLP and AI techniques. Algho is a suite designed to facilitate the creation of personal conversational agents and their subsequent deployment on several proprietary channels. The user of Algho can create his own chat-bot simply by entering a personal knowledge base and, after a few minutes, the system is able to handle conversations about it. The natural language understanding functionalities of Algho are based on a proprietary NLP Platform developed by QuestIT consisting of more than 25 layers of morphological, syntactic and semantic analysis based on Machine Learning (ML) and Artificial Intelligence techniques: tokenization, lemmatization, Part-Of-Speech (POS) tagging, Collocation Detection, Word Sense Disambiguation, Dependency Tree Parsing, Sentiment and Emotional Analysis, Intent Recognition, and many others. The NLP Platform exploits the most recent techniques in the field of NLP and Machine Learning to enrich the input raw text with a set of high-level cognitive information (Melacci et al., 2018; Bongini et al., 2018). The Word Sense Disambiguation (WSD) layer is one of the main levels of the NLP Platform; it follows a Deep Neural Network approach based on RNNs and word embeddings, and it provides state-of-the-art disambiguation accuracy (Melacci et al., 2018; Bongini et al., 2018).

The enriched text is subsequently exploited by the conversational engine to understand the user request, to identify the "intent", and to behave according to the knowledge base provided by the creator of the conversational agent. The intent of a request is defined as the hidden desire that underlies the user's request.

During the construction of the conversational agent, the Algho suite allows the user to define specific objects called "Conversational Forms", which can be used to collect structured information from the user. A conversational form consists of a typical data-collection form linked to a set of intents defined in the knowledge base. During the conversation, when an input user request triggers an intent having a linked conversational form, the system: (i) tries to fill the form fields by extracting the information from the NLP analysis of the request (Auto-Form-Filling procedure); and (ii) sequentially proposes to the user the fields that have not been filled by the automatic procedure. In this way, the returned NLP information is used to automatically fill the fields of the structured form without requesting further data from the user. Furthermore, Algho allows specifying a URL to which the collected information can be sent via a web-service call; in this case, the system uses the field values as parameters of the call. A sketch of this behaviour is given below.

7 https://www.fanuc.eu/it/it/robot/robot-filter-page/robot-collaborativi/collaborative-cr4ia
8 https://schunk.com/it_it/co-act/pinza-co-act-egp-c/
9 https://www.quest-it.com
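The Auto-Form-Filling behaviour of Section 2.3 can be summarized with the following Python sketch. Algho is proprietary, so this is not its API: the field list, callback, and helper names are hypothetical, and only the described logic (pre-fill from NLP entities, ask for the missing fields sequentially, post the values to the configured URL) is taken from the text.

    # Illustrative sketch, not Algho's actual API: auto-fill a conversational
    # form from NLP entities, ask for the rest, and post it to a web service.
    import json
    import urllib.request

    FORM_FIELDS = ["taste", "aroma", "sugar", "length"]  # assumed coffee form

    def fill_form(nlp_entities, ask_user):
        """nlp_entities: {field: value} extracted by the NLP layers.
        ask_user: callback that poses a question in the chat and returns
        the user's answer."""
        form = {}
        for field in FORM_FIELDS:
            if field in nlp_entities:            # (i) auto-filled from NLP
                form[field] = nlp_entities[field]
            else:                                # (ii) asked sequentially
                form[field] = ask_user(f"Which {field} would you like?")
        return form

    def submit_form(form, service_url):
        # The collected field values become the parameters of the call.
        req = urllib.request.Request(
            service_url,
            data=json.dumps(form).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200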
2.4 Computer Vision

The computer vision functionalities for the described work have been implemented with two open-source libraries, OpenCV and TensorFlow. OpenCV (Laganière, 2014) provides state-of-the-art algorithms in this field and, starting from version 4.0, has introduced more advanced features for deep learning. TensorFlow (Abadi et al., 2016) is a library to develop and train machine learning models, used in particular to create deep neural networks. Our approach follows a general pipeline composed of three main steps, the last of which is sketched after this list:

• Dataset creation: several images of the objects of interest are collected and their positions are annotated manually by specifying their coordinates;

• Training the model: a model is trained to recognize the objects of interest and their coordinates within the image. For this purpose, we decided to fine-tune a Faster R-CNN model (Ren et al., 2015) with Inception V2 (Szegedy et al., 2016) pre-trained on the COCO dataset (Lin et al., 2014);

• Using the model: the detection of the object requested through the conversational interface is performed in real time by analysing a video stream received from a video camera.
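As a sketch of the third step, the following Python snippet runs a TensorFlow object-detection export on a live camera stream with OpenCV's dnn module. The model file names, the input size, and the confidence threshold are assumptions; any frozen Faster R-CNN graph with its matching OpenCV .pbtxt config would fit.

    # Minimal sketch: real-time detection on a video stream via OpenCV dnn.
    import cv2

    net = cv2.dnn.readNetFromTensorflow(
        "frozen_inference_graph.pb",       # assumed fine-tuned model export
        "faster_rcnn_inception_v2.pbtxt",  # matching OpenCV graph config
    )

    cap = cv2.VideoCapture(0)              # the cobot's camera stream
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        net.setInput(cv2.dnn.blobFromImage(frame, size=(600, 600), swapRB=True))
        # Each detection row is [batch, class_id, confidence, x1, y1, x2, y2];
        # TensorFlow exports typically return coordinates normalized to [0, 1].
        for det in net.forward()[0, 0]:
            if det[2] > 0.7:               # confidence threshold (assumed)
                x1, y1 = int(det[3] * w), int(det[4] * h)
                x2, y2 = int(det[5] * w), int(det[6] * h)
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) == 27:           # ESC to quit
            break
    cap.release()

In the prototype, the detected bounding box is what lets the cobot grasp the requested object without knowing its position in advance.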
3 Exprivia's Use Case Scenarios

Exprivia prototyped this general framework in two different use case scenarios, with the main target of enabling communication between all the machines and ICT systems located in a factory in a capillary way, ranging from supply chain systems to administrative ones. The ultimate goal is to manage the entire production life-cycle with a cost-saving optimization of each resource; this turns into an advantage, not only economical but also competitive, allowing the company to play a leading role in the challenges of the future.

Food Supply Chain. An interesting application of our framework has been made within the food supply chain, in particular the pasta production chain, presented at the DevNet Create 2019 conference in Mountain View (California) in April. The purpose of the project was to automate a series of activities typical of daily operations, specifically medium-to-high difficulty activities that cause most problems in the production life-cycle. The pasta creation process is very complex and requires a concatenation of different work steps. Many of these are performed manually (e.g. quality control), and typically the machines are not able to communicate with each other: this means that operators and management cannot obtain information on the operating status. Thanks to our framework, which includes a chat-bot to communicate with the machinery and computer vision algorithms able to automate pasta quality control, the communication with management systems enables a two-way exchange of information that automates activities, improving overall operating efficiency.

Coffee Pod Selection with Nuccio. The second solution uses a Fanuc Cobot to select a coffee pod. This prototype was presented at Mobile World Congress 2019, in Barcelona, in February. The Cobot "Nuccio" is controlled through the Algho conversational interface. In particular, the idea was to create a conversational agent focused on a specific knowledge base about coffee. The resulting bot was able to handle conversations about coffee and about many aspects related to this topic. Afterwards, a specific conversational form was developed to collect a set of information useful for preparing a coffee (taste, aroma, sugar, short or long) and required by the actuator system. Finally, the form was connected to the web-service of the actuator system and linked to the set of intents for which activation was desired. Thus, when a user request expresses the intent to have a coffee, the linked conversational form collects all the information required by the actuator system to prepare the coffee and notifies it via a web-service call. Moreover, Nuccio, through the use of a camera, was able to acquire images and process them with Computer Vision algorithms, recognizing exactly the pod to be selected without knowing its position in advance. Through the Algho conversational interface, the user is helped and guided in the choice of the most suitable coffee pod according to his/her tastes. A sketch of a possible actuator-side web service is given below.
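For illustration, the actuator-side web service that receives the completed coffee form could look like the following sketch. Flask is used here purely as an example: the paper does not specify the actuator stack, and the route and helper names are hypothetical.

    # Hypothetical sketch of the receiving side of the web-service call made
    # by the conversational form. Not the prototype's actual implementation.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_cobot_program(order):
        # Placeholder for the call that asks the web application to execute
        # the corresponding Teach Pendant script on the Fanuc cobot.
        print(f"Preparing coffee: {order}")

    @app.route("/coffee", methods=["POST"])
    def coffee():
        order = request.get_json()  # e.g. {"taste": ..., "sugar": ...}
        run_cobot_program(order)
        return jsonify({"status": "accepted"})

    if __name__ == "__main__":
        app.run(port=8080)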
4 Conclusion

In line with our main objectives, we contributed to the development and validation of a framework in an operational environment of intelligent robotic systems and HRC. In particular, we dealt with conversational interaction technologies useful to provide: (i) high-performance linguistic analysis services based on NLP technologies; (ii) models for the management of symbiotic human-robot interaction; (iii) services and tools for the adaptation of linguistic interfaces to user characteristics. Cobots are close to operating in environments where the presence of humans plays a key role. A fundamental characteristic is therefore the Cobot's ability to react to textual and vocal commands and to properly understand the user's requests. The Cobot's perception is leveraged through its ability to detect objects and understand what is around it; computer vision processing becomes crucial to the extent of giving Cobots a cognitive profile. We therefore envision our framework to be fully operable in complex manufacturing systems, in which the collaboration between robot and human is facilitated by advanced AI and cognitive techniques.

We showed how, already today, it is possible to "humanize" highly automated processes through a Cobot, collecting and integrating the operational information into the corporate knowledge base. In fact, we believe that in the long term there will be a convergence between automation, AI and IoT, allowing the market to create a full "Digital Twin" of an organization, leading to a strong automation of organizational choices driven by data collected in the field. The digitized organization can then be equipped with its own "Company Brain", an AI able to make autonomous complex decisions aimed at maximizing a business goal and that, working in a cooperative manner with the company management, will be able to respond much more precisely and quickly to changes in an increasingly unstable and fluid market.
References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Marco Bongini, Leonardo Rigutini, and Edmondo Trentin. 2018. Recursive neural networks for density estimation over generalized random graphs. IEEE Transactions on Neural Networks and Learning Systems, (99):1–18.

James Edward Colgate, Michael A. Peshkin, and Witaya Wannasuphoprasit. 1996. Cobots: Robots for collaboration with human operators. In Proceedings of the International Mechanical Engineering Congress and Exhibition, Atlanta, GA, volume 58, pages 433–439. Citeseer.

Ana M. Djuric, R. J. Urbanic, and J. L. Rickli. 2016. A framework for collaborative robot (cobot) integration in advanced manufacturing systems. SAE International Journal of Materials and Manufacturing, 9(2):457–464.

Venketesh N. Dubey and Richard M. Crowder. 2002. A finger mechanism for adaptive end effectors. In ASME 2002 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages 995–1001. American Society of Mechanical Engineers.

Shirine El Zaatari, Mohamed Marei, Weidong Li, and Zahid Usman. 2019. Cobot programming for collaborative industrial tasks: An overview. Robotics and Autonomous Systems, 116:162–180.

Tudor B. Ionescua and Sebastian Schlunda. 2019. A participatory programming model for democratizing cobot technology in public and industrial fablabs. Procedia CIRP, 81:93–98.

Eugenio Ivorra, Mario Ortega, Mariano Alcañiz, and Nicolás Garcia-Aracil. 2018. Multimodal computer vision framework for human assistive robotics. In 2018 Workshop on Metrology for Industry 4.0 and IoT, pages 1–5. IEEE.

Robert Laganière. 2014. OpenCV Computer Vision Application Programming Cookbook, Second Edition. Packt Publishing Ltd.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Stefano Melacci, Achille Globo, and Leonardo Rigutini. 2018. Enhancing modern supervised word sense disambiguation models by semantic lexical resources. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Etienne Wenger. 2014. Artificial Intelligence and Tutoring Systems: Computational and Cognitive Approaches to the Communication of Knowledge. Morgan Kaufmann.

Victor W. Zue and James R. Glass. 2000. Conversational interfaces: Advances and challenges. Proceedings of the IEEE, 88(8):1166–1180.