A Novel Integrated Industrial Approach with Cobots in the Age of Industry 4.0 through Conversational Interaction and Computer Vision

Andrea Pazienza, Nicola Macchiarulo, Felice Vitulano, Antonio Fiorentini and Marco Cammisa
Innovation Lab, Exprivia S.p.A.
{andrea.pazienza, nicola.macchiarulo, felice.vitulano, antonio.fiorentini, marco.cammisa}@exprivia.com

Leonardo Rigutini, Ernesto Di Iorio, Achille Globo and Antonio Trevisi
QuestIT S.r.l.
{rigutini, diiorio, globo, trevisi}@quest-it.com

Abstract

English. From robots that replace workers to robots that serve as helpful colleagues, the field of robotic automation is experiencing a new trend that represents a huge challenge for component manufacturers. This contribution starts from an innovative vision of an ever closer collaboration between the Cobot, able to do a specific physical job with precision, the AI world, able to analyze information and support the decision-making process, and the human, able to have a strategic vision of the future.

1 Introduction

In the last century, the manufacturing world adopted solutions for the advanced automation of production systems. Today, thanks to the evolution and maturity of new technologies such as Artificial Intelligence (AI), Machine Learning (ML), new-generation networks, and the growing adoption of the Internet of Things (IoT), a new paradigm is emerging that aims at integrating Cyber-Physical Systems (CPS) with business processes, thus opening the doors to the fourth industrial revolution (Industry 4.0) and to an era driven by information, further handled with cognitive computing techniques (Wenger, 2014).

Robots and humans have been co-workers for years, but rarely have they truly worked together. This may be about to change with the rise of Collaborative Robotics (Colgate et al., 1996). Collaborative Robots (better known as Cobots) are specifically designed for direct interaction with a human within a defined collaborative work-space, i.e., a safeguarded space where the robot and a human can perform tasks simultaneously during an automatic operation. Human-robot collaboration thus fosters various levels of automation and human intervention: tasks can be partially automated when a fully automated solution is not economical or too complex. Manufacturers may therefore benefit from the rise of AI-driven automation, while the progress of Adaptable End Effector devices, mounted at the end of a Cobot's arm, can help to perform specific intelligent tasks (Dubey and Crowder, 2002).

The way in which Cobots and humans interact, exchanging and conveying information, is fundamental. A key role in this landscape is played by Conversational Interfaces (Zue and Glass, 2000), which take advantage of recent achievements in the field of Natural Language Processing (NLP) to understand a user's needs and generate the right answer or action. In this scenario, Computer Vision also plays an important role in the process of creating collaborative environments between humans and robots. Systems of this type have already been introduced into industry to facilitate tasks such as product quality control or component assembly inspection. Giving vision to a robot makes it able to understand the industrial environment that surrounds it and improves the execution of tasks in support of other people. Improving robot software with AI will be key to making robots more collaborative.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The work starts from an innovative vision that beholds, in the future, an ever closer collaboration between the Cobot, able to do a specific physical job with precision and without alienation, the AI world, able to analyze, process, and learn from information and to support the decision-making process, and the employee, able to have a strategic vision of the future. To validate its effectiveness, a collaborative environment between employee, Cobot and AI systems has been crafted that makes it possible for the three subjects to communicate in a simple way, without requiring the employee to have specific skills to interact with the Cobot and with Enterprise Resource Planning (ERP) systems.

Our contribution is placed in this scenario, where the convergence of multiple technologies allows us to define a new approach to the management of a core business process (e.g. shipments) that ensures increasing flexibility thanks to a simplification of human interaction with Cyber-Physical Systems and a better coordination between the physical world (the packaging line) and that of IT processes (the ERP model). In the belief that the complexity of new industrial production systems requires interdisciplinary skills, our intent is to bring together knowledge from related disciplines such as computational linguistics, cognitive science, machine learning, computer vision, human-machine interaction, and collaborative robotics automation, towards an integrated novel approach specifically designed for the smart management of a manufacturing process line, fostering and strengthening the synergy and the interaction between robot and human.

Our research is broadly situated in Human-Robot Collaboration (HRC), a promising robotics discipline focusing on enabling robots and humans to operate jointly to complete collaborative tasks. Recent works tried to figure out in which ways Cobots may help humans in collaborative industrial tasks (El Zaatari et al., 2019) or in participatory design in fablabs (Ionescua and Schlunda, 2019). An initial study centered cobots in advanced manufacturing systems (Djuric et al., 2016). Little or no work (Ivorra et al., 2018) has been done to endow Cobots with cognitive intelligence such as conversational interaction and computer vision.

This paper is organized as follows. Section 2 introduces the functionalities and the architecture of our approach, focusing on its four main technological aspects: cobots, adaptable end effectors, conversational interfaces, and computer vision. Section 3 describes the application scenarios specifically designed for our approach, such as Smart Manufacturing. Finally, Section 4 discusses the proposed framework and concludes the paper, outlining future work.
2 Architecture Proposal

In this Section we introduce our main proposal, taking into account the requirements coming from the different technologies involved. The leading idea is to develop and validate a general framework for an Intelligent Cyber-Physical System made up of four crucial components: (i) a Cobot, equipped with (ii) an adaptable end effector, which may change according to the specific scenario, and two major components coming from the AI world, i.e. (iii) a Computer Vision module that allows the cobot to detect objects, and (iv) one or more Conversational Interfaces that facilitate the human-machine interaction and keep the human in the loop. Figure 1 depicts the prototypical architecture of our framework proposal.

[Figure 1: Framework Architecture including Conversational Interfaces, Computer Vision and Cobot]

In order to integrate the different technologies, from the high-level voice command to the low-level executable command, we developed a web application, powered by the Spring Boot framework2, able to receive commands from user interfaces and transform them into machine commands. The framework supports giving vocal commands to the Cobot: the mechanical arm is controlled through a series of connected conversational devices such as chat-bots, powered by Cisco WebEx Teams3 and QuestIT Algho4, and a virtual assistant such as Amazon Alexa5. In particular, Cisco WebEx Teams is an all-in-one solution for messaging, file sharing, white-boarding, video meetings, and calling, while Amazon Alexa is a voice interaction device capable of a large set of human-interaction functions.

The Cobot, through the use of a camera, is able to acquire images and process them with Computer Vision algorithms, recognizing exactly the object to be selected without knowing its position in advance. The vocal commands sent via Alexa are managed by lambda functions using the AWS Lambda service6, a serverless event-driven computing platform. It permits executing code in response to particular events, automatically managing the resources required by the code. Indeed, Lambda's goal is to simplify the construction of on-demand applications that respond to events and new information.

All commands are therefore sent via HTTP calls to the Spring Boot web application, which also receives calls from the one or more chat-bots with which the user can interact. Once a command has been received, the web application executes a C# application, based on the Fanuc SDK, that sends the Cobot a request to execute a particular script written in the Teach Pendant (TP) language. A minimal sketch of the Alexa-to-web-application hop is given below.

2 https://spring.io/projects/spring-boot
3 https://www.webex.com/team-collaboration.html
4 https://www.alghoncloud.com/it/
5 https://developer.amazon.com/it/alexa
6 https://aws.amazon.com/it/lambda/
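As an illustration of the Alexa-to-web-application hop described above, the following minimal Python sketch shows an AWS Lambda handler that forwards a recognized intent to the command web application over HTTP. The endpoint URL, intent name, and slot layout are illustrative assumptions, not the actual interfaces of the prototype.

    # Hypothetical sketch: an AWS Lambda handler (Python runtime) that receives
    # an Alexa intent and forwards it as an HTTP call to the command web
    # application. URL and slot names are placeholders, not the real system.
    import json
    import urllib.request

    WEB_APP_URL = "http://example-factory-gateway/api/commands"  # placeholder

    def lambda_handler(event, context):
        # Alexa delivers the recognized intent and its slots in the request.
        intent = event["request"]["intent"]
        command = {
            "name": intent["name"],  # e.g. a hypothetical "PickObjectIntent"
            "slots": {k: v.get("value")
                      for k, v in intent.get("slots", {}).items()},
        }
        # Forward the high-level command to the web application, which turns
        # it into a low-level robot instruction (a Teach Pendant script request).
        req = urllib.request.Request(
            WEB_APP_URL,
            data=json.dumps(command).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            ok = resp.status == 200
        # Return a minimal Alexa-compliant speech response.
        text = "Command sent to the cobot." if ok else "The cobot is unreachable."
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": text},
                "shouldEndSession": True,
            },
        }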
2.1 Cobot with Fanuc

The right choice of a cobot comes with the fulfillment of various safety requirements, such as collision stop protection, a function to restart it easily and quickly after a stop, and anti-trap features for additional protection. For our purposes, we used a Cobot from Fanuc, in particular the CR-4iA model7. It is endowed with six axes in its arm, and its maximum payload is 4 kg. It handles lightweight tasks that are tedious and highly manual. Since it can take over these dull jobs, the operator's hands are free to focus on more intelligent work or more pressing matters. This cobot can also work side-by-side with humans on tasks that are more complex and require more interactive approaches.

2.2 Adaptable End Effector with Schunk

This category includes grippers, which hold and manipulate objects, and end-of-arm tools (EOATs), which are complex systems of grippers designed to handle large or delicate components. Handling tasks mainly include pick and place, sorting, packaging, and palletizing. As gripping tool we used the Schunk Co-act EGP-C gripper8, an electric 2-finger parallel gripper certified for collaborative operation, actuated via 24 V and digital I/O. It is used for gripping and moving small and medium-sized workpieces with flexible force in collaborative operation in the areas of assembly, electronics, and machine tool loading. We chose this model for its certified and pre-assembled gripping unit with functional safety, and for its "plug & work" mode with Fanuc cobots.

2.3 Conversational Interfaces with Algho

The achievements in the field of Artificial Intelligence (AI) in recent years have led to the birth of a new paradigm of human-machine interaction: conversational agents. This new way of interacting with a computer is based on the use of natural language and is getting closer to the way humans communicate with each other. Conversational agents take advantage of recent achievements in the field of Natural Language Processing (NLP) to understand user requests and behave accordingly, providing appropriate answers or performing the required actions. The design of an innovative Cobot cannot fail to consider such a straightforward form of human-machine interaction.

The conversational functionalities for the Cobot described in this paper have been provided using Algho4, a proprietary conversational-agent building tool developed by QuestIT9 and based on NLP and AI techniques. Algho is a suite designed to facilitate the creation of personal conversational agents and their subsequent deployment on several proprietary channels. The user of Algho can create his own chat-bot simply by entering a personal knowledge base and, after a few minutes, the system is able to handle conversations about it. The natural language understanding functionalities of Algho are based on a proprietary NLP Platform developed by QuestIT consisting of more than 25 layers of morphological, syntactic and semantic analysis based on Machine Learning (ML) and Artificial Intelligence techniques: tokenization, lemmatization, Part-Of-Speech (POS) tagging, Collocation Detection, Word Sense Disambiguation, Dependency Tree Parsing, Sentiment and Emotional Analysis, Intent Recognition, and many others. The NLP Platform exploits the most recent techniques in the field of NLP and Machine Learning to enrich the input raw text with a set of high-level cognitive information (Melacci et al., 2018; Bongini et al., 2018). The Word Sense Disambiguation (WSD) layer is one of the main levels of the NLP Platform; it follows a Deep Neural Network approach based on RNNs and word embeddings, and it provides state-of-the-art disambiguation accuracy (Melacci et al., 2018; Bongini et al., 2018).

The enriched text is subsequently exploited by the conversational engine to understand the user request, to identify the "intent", and to behave according to the knowledge base provided by the creator of the conversational agent. The intent of a request is defined as the hidden desire that underlies the user's request.

During the construction of the conversational agent, the Algho suite allows the user to define specific objects called "Conversational Forms", which can be used to collect structured information from the user. A conversational form consists of a typical data-collection form linked to a set of intents defined in the knowledge base. During the conversation, when an input user request triggers an intent having a linked conversational form, the system: (i) tries to fill the form fields by extracting the information from the NLP analysis of the request (Auto-Form-Filling procedure); and (ii) sequentially proposes to the user the fields that have not been filled by the automatic procedure. In this way, the returned NLP information is used to automatically fill the fields of the structured form without requesting further data from the user. Furthermore, Algho allows specifying a URL to which the collected information can be sent via a web-service call; in this case, the system uses the field values as parameters of the call. A sketch of this behaviour is given below.

7 https://www.fanuc.eu/it/it/robot/robot-filter-page/robot-collaborativi/collaborative-cr4ia
8 https://schunk.com/it_it/co-act/pinza-co-act-egp-c/
9 https://www.quest-it.com
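The Auto-Form-Filling behaviour of Section 2.3 can be summarized with the following Python sketch. Algho is proprietary, so this is not its API: the field list, callback, and helper names are hypothetical, and only the described logic (pre-fill from NLP entities, ask for the missing fields sequentially, post the values to the configured URL) is taken from the text.

    # Illustrative sketch, not Algho's actual API: auto-fill a conversational
    # form from NLP entities, ask for the rest, and post it to a web service.
    import json
    import urllib.request

    FORM_FIELDS = ["taste", "aroma", "sugar", "length"]  # assumed coffee form

    def fill_form(nlp_entities, ask_user):
        """nlp_entities: {field: value} extracted by the NLP layers.
        ask_user: callback that poses a question in the chat and returns
        the user's answer."""
        form = {}
        for field in FORM_FIELDS:
            if field in nlp_entities:            # (i) auto-filled from NLP
                form[field] = nlp_entities[field]
            else:                                # (ii) asked sequentially
                form[field] = ask_user(f"Which {field} would you like?")
        return form

    def submit_form(form, service_url):
        # The collected field values become the parameters of the call.
        req = urllib.request.Request(
            service_url,
            data=json.dumps(form).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200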
2.4 Computer Vision

The computer vision functionalities for the described work have been implemented with two open-source libraries, OpenCV and TensorFlow. OpenCV (Laganière, 2014) provides state-of-the-art algorithms in this field and, starting from version 4.0, has introduced more advanced features for deep learning. TensorFlow (Abadi et al., 2016) is a library to develop and train machine learning models, used in particular to create deep neural networks. Our approach follows a general pipeline composed of three main steps, the last of which is sketched after this list:

• Dataset creation: several images of the objects of interest are collected and their positions are annotated manually by specifying their coordinates;

• Training the model: a model is trained to recognize the objects of interest and their coordinates within the image. For this purpose, we decided to fine-tune a Faster R-CNN model (Ren et al., 2015) with Inception V2 (Szegedy et al., 2016) pre-trained on the COCO dataset (Lin et al., 2014);

• Using the model: the detection of the object requested through the conversational interface is performed in real time by analysing a video stream received from a video camera.
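As a sketch of the third step, the following Python snippet runs a TensorFlow object-detection export on a live camera stream with OpenCV's dnn module. The model file names, the input size, and the confidence threshold are assumptions; any frozen Faster R-CNN graph with its matching OpenCV .pbtxt config would fit.

    # Minimal sketch: real-time detection on a video stream via OpenCV dnn.
    import cv2

    net = cv2.dnn.readNetFromTensorflow(
        "frozen_inference_graph.pb",       # assumed fine-tuned model export
        "faster_rcnn_inception_v2.pbtxt",  # matching OpenCV graph config
    )

    cap = cv2.VideoCapture(0)              # the cobot's camera stream
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        net.setInput(cv2.dnn.blobFromImage(frame, size=(600, 600), swapRB=True))
        # Each detection row is [batch, class_id, confidence, x1, y1, x2, y2];
        # TensorFlow exports typically return coordinates normalized to [0, 1].
        for det in net.forward()[0, 0]:
            if det[2] > 0.7:               # confidence threshold (assumed)
                x1, y1 = int(det[3] * w), int(det[4] * h)
                x2, y2 = int(det[5] * w), int(det[6] * h)
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) == 27:           # ESC to quit
            break
    cap.release()

In the prototype, the detected bounding box is what lets the cobot grasp the requested object without knowing its position in advance.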
3 Exprivia's Use Case Scenarios

Exprivia prototyped this general framework in two different use case scenarios, with the main target of enabling communication between all the machines and ICT systems located in a factory in a capillary way, ranging from supply chain systems to administrative ones. The ultimate goal is to manage the entire production life-cycle with a cost-saving optimization of each resource; this turns into an advantage, not only economical but also competitive, allowing the company to play a leading role in the challenges of the future.

Food Supply Chain. An interesting application of our framework has been made within the food supply chain, in particular the pasta production chain, presented at the DevNet Create 2019 conference in Mountain View (California) in April. The purpose of the project was to automate a series of activities typical of daily operations, specifically medium-to-high difficulty activities that cause most problems in the production life-cycle. The pasta creation process is very complex and requires a concatenation of different work steps. Many of these are performed manually (e.g. quality control), and typically the machines are not able to communicate with each other: this means that operators and management cannot obtain information on the operating status. Thanks to our framework, which includes a chat-bot to communicate with the machinery and computer vision algorithms able to automate pasta quality control, the communication with management systems enables a two-way exchange of information that automates activities, improving overall operating efficiency.

Coffee Pod Selection with Nuccio. The second solution uses a Fanuc Cobot to select a coffee pod. This prototype was presented at Mobile World Congress 2019, in Barcelona, in February. The Cobot "Nuccio" is controlled through the Algho conversational interface. In particular, the idea was to create a conversational agent focused on a specific knowledge base about coffee. The resulting bot was able to handle conversations about coffee and about many aspects related to this topic. Afterwards, a specific conversational form was developed to collect a set of information useful for preparing a coffee (taste, aroma, sugar, short or long) and required by the actuator system. Finally, the form was connected to the web-service of the actuator system and linked to the set of intents for which activation was desired. Thus, when a user request expresses the intent to have a coffee, the linked conversational form collects all the information required by the actuator system to prepare the coffee and notifies it via a web-service call. Moreover, Nuccio, through the use of a camera, was able to acquire images and process them with Computer Vision algorithms, recognizing exactly the pod to be selected without knowing its position in advance. Through the Algho conversational interface, the user is helped and guided in the choice of the most suitable coffee pod according to his/her tastes. A sketch of a possible actuator-side web service is given below.
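For illustration, the actuator-side web service that receives the completed coffee form could look like the following sketch. Flask is used here purely as an example: the paper does not specify the actuator stack, and the route and helper names are hypothetical.

    # Hypothetical sketch of the receiving side of the web-service call made
    # by the conversational form. Not the prototype's actual implementation.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_cobot_program(order):
        # Placeholder for the call that asks the web application to execute
        # the corresponding Teach Pendant script on the Fanuc cobot.
        print(f"Preparing coffee: {order}")

    @app.route("/coffee", methods=["POST"])
    def coffee():
        order = request.get_json()  # e.g. {"taste": ..., "sugar": ...}
        run_cobot_program(order)
        return jsonify({"status": "accepted"})

    if __name__ == "__main__":
        app.run(port=8080)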
4 Conclusion

In line with our main objectives, we contributed to the development and validation of a framework in an operational environment of intelligent robotic systems and HRC. In particular, we dealt with conversational interaction technologies useful to provide: (i) high-performance linguistic analysis services based on NLP technologies; (ii) models for the management of symbiotic human-robot interaction; (iii) services and tools for the adaptation of linguistic interfaces to user characteristics. Cobots are close to operating in environments where the presence of humans plays a key role. A fundamental characteristic is therefore the Cobot's ability to react to textual and vocal commands and to properly understand the user's requests. The Cobot's perception is leveraged through its ability to detect objects and understand what is around it; computer vision processing becomes crucial to the extent of giving Cobots a cognitive profile. We therefore envision our framework to be fully operable in complex manufacturing systems, in which the collaboration between robot and human is facilitated by advanced AI and cognitive techniques.

We showed how, already today, it is possible to "humanize" highly automated processes through a Cobot, collecting and integrating the operational information into the corporate knowledge base. In fact, we believe that in the long term there will be a convergence between automation, AI and IoT, allowing the market to create a full "Digital Twin" of an organization, leading to a strong automation of organizational choices driven by data collected in the field. The digitized organization can then be equipped with its own "Company Brain", an AI able to make autonomous complex decisions aimed at maximizing a business goal and that, working in a cooperative manner with the company management, will be able to respond much more precisely and quickly to changes in an increasingly unstable and fluid market.
References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Marco Bongini, Leonardo Rigutini, and Edmondo Trentin. 2018. Recursive neural networks for density estimation over generalized random graphs. IEEE Transactions on Neural Networks and Learning Systems, (99):1–18.

James Edward Colgate, Michael A. Peshkin, and Witaya Wannasuphoprasit. 1996. Cobots: Robots for collaboration with human operators. In Proceedings of the International Mechanical Engineering Congress and Exhibition, Atlanta, GA, volume 58, pages 433–439. Citeseer.

Ana M. Djuric, R. J. Urbanic, and J. L. Rickli. 2016. A framework for collaborative robot (cobot) integration in advanced manufacturing systems. SAE International Journal of Materials and Manufacturing, 9(2):457–464.

Venketesh N. Dubey and Richard M. Crowder. 2002. A finger mechanism for adaptive end effectors. In ASME 2002 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages 995–1001. American Society of Mechanical Engineers.

Shirine El Zaatari, Mohamed Marei, Weidong Li, and Zahid Usman. 2019. Cobot programming for collaborative industrial tasks: An overview. Robotics and Autonomous Systems, 116:162–180.

Tudor B. Ionescua and Sebastian Schlunda. 2019. A participatory programming model for democratizing cobot technology in public and industrial fablabs. Procedia CIRP, 81:93–98.

Eugenio Ivorra, Mario Ortega, Mariano Alcañiz, and Nicolás Garcia-Aracil. 2018. Multimodal computer vision framework for human assistive robotics. In 2018 Workshop on Metrology for Industry 4.0 and IoT, pages 1–5. IEEE.

Robert Laganière. 2014. OpenCV Computer Vision Application Programming Cookbook, Second Edition. Packt Publishing Ltd.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Stefano Melacci, Achille Globo, and Leonardo Rigutini. 2018. Enhancing modern supervised word sense disambiguation models by semantic lexical resources. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Etienne Wenger. 2014. Artificial Intelligence and Tutoring Systems: Computational and Cognitive Approaches to the Communication of Knowledge. Morgan Kaufmann.

Victor W. Zue and James R. Glass. 2000. Conversational interfaces: Advances and challenges. Proceedings of the IEEE, 88(8):1166–1180.