A Research Infrastructure for E-Health Big Data Analytics

Alessio Botta, Elio Masciari
University of Napoli Federico II, Napoli, Italy
alessio.botta@unina.it, elio.masciari@unina.it

Abstract
In this paper, we present a research infrastructure built within the Department of Excellence project to support a wide spectrum of big data analyses related to e-health. In more detail, we built both a public cloud infrastructure and a private cloud one in order to offer researchers a high-performance environment.

Keywords
Big Data Platform, Cloud Services for Big Data, Research Infrastructures.

1. Introduction

The Department of Electrical Engineering and Information Technologies (DIETI) of the University of Naples Federico II is the largest department in Southern Italy working on Information and Communication Technology (ICT). The scientific/disciplinary sectors (Settori Scientifico/Disciplinari in Italian, or simply SSD) participating in the activities of DIETI achieved peaks of absolute excellence in the last round of the Evaluation of the Quality of Research (VQR) 2011-2014; in fact, very few SSD fall below the national average evaluation. With particular reference to Area 09, ten SSD received evaluations greater than or equal to the national average. DIETI has therefore unanimously decided to develop, over the next five years, its research lines in the field of Information and Communication Technologies, especially with regard to the application of modern information technologies in the thematic areas of so-called eHealth. From these premises, it was natural to propose an innovative project called ICT for Health (ICTH in the following) within the Departments of Excellence call of the Italian Ministry of Research and University.
The call financed 180 Departments of Excellence in Italy, and DIETI was among the 14 that received the maximum evaluation for the project, one of only two such cases in Southern Italy. ICTH has been financed with over 9 million euros over a time span of five years. Using the financial tools available for the project, DIETI is recruiting a full professor chosen by a panel of international experts from a short list of candidates of excellence. This figure is responsible for coordinating and managing the entire project, ensuring the success and achievement of all objectives. In addition, the recruitment of an associate professor and four tenure-track researchers (RTD-B) is currently ongoing, in order to strengthen the research areas described below.

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

The ICT for Health project involves the strengthening of existing laboratories in DIETI and the creation of a new experimental laboratory working on topics not currently covered within DIETI, in order to create a network of laboratories supporting the proposed research. Finally, a new doctoral school has been created, specific to the themes of the project, so that the development of the research activities is accompanied by the training of professionals ready to be employed in a regional and national socio-economic environment that is particularly active in this field, whose annual European budget now exceeds 20 billion euros. In the following, we describe the requirements we would like to fulfill with our Big Data laboratory.

1.1.
Research Agenda

In the coming decades, eHealth technologies and services will bring deep changes to the organization of the health care system, with the aim of improving the quality and efficiency of care while reducing costs. In the short term, the new technologies will have to integrate with the current structures that provide healthcare, but in the long term they will produce significant changes both in the internal organization and in the architecture of the buildings that will house the hospitals and healthcare facilities of the future. These facilities will need to be able to provide personalized therapies to patients, which may also be decentralized or delivered at home and monitored remotely. The ICT for Health departmental project, based on the pre-existing infrastructure of the DIETI laboratories and on the new laboratory to be built, is therefore developed along the following lines of scientific and technological research.

1.1.1. Sensing for Health

Smart transducers for the Internet of Everything environment will be developed, such as the electronic syringe, smart sensors to be integrated into robotic microsurgery devices, advanced biomedical instrumentation for automated and robotic surgery systems, and innovative systems for remote monitoring of patients' health status, both in hospitals or healthcare facilities (e.g., nursing homes for the elderly) and in private homes. In particular, we will study solutions based on new sensors and devices, wearable or implantable, which will make it possible to monitor the state of health even outside traditional care centers. Particular attention will be paid to Exergaming technologies based on Serious Games, in which rehabilitation or treatment takes place through Augmented Reality applications, and to Brain Computer Interfaces with low training and response times and a low number of electrodes, for patients with high rates of disability.
The topics of sensor development and their interaction in augmented vision systems will be experimentally developed within the new laboratory foreseen in the program.

1.1.2. Data for Health

eHealth services and technologies generate huge amounts of data and information whose treatment requires non-traditional methodologies and technologies: Cloud Computing, Internet of Things and Big Data Analytics are the new paradigms underlying the new generation of systems for information management in eHealth. The data sources to be considered, in addition to having high volume, are also heterogeneous given their different types and origins [1]. The speed with which information is produced and stored, together with the aforementioned volume and variety, requires systems and tools to collect, manage, and analyze the data and information produced by healthcare systems, directing research towards Big Data Analytics (BDA) methodologies and techniques [2]. BDA in eHealth enables the transformation of a classical hypothesis-driven information analysis into an innovative data-driven one, able to identify non-trivial connections between heterogeneous data and information. This requires investigating: (a) new cloud-based architectures that allow for the timely processing of information, from the Hadoop to the Spark ecosystem; (b) new information management systems that integrate relational (SQL), non-relational (NoSQL) and new relational (NewSQL) architectures; (c) the use of descriptive, diagnostic, predictive and prescriptive analysis techniques; (d) the use of Data Mining tools and techniques on massive data sets, including Deep Learning. In this context, digital infrastructures for data circulation and device interconnection, following the Internet of Things paradigm, are equally important.

1.1.3. Logistics for Health

In-hospital logistics services account for 15-20% of operating costs.
These services include moving patients, linen, meals, medications, equipment and samples between clinics, wards, operating rooms, laboratories and warehouses. Digitization, automation, and robotic technologies can optimize these processes through solutions that enable automated patient and material transport and automated hospital warehouse management. The main difference from factory logistics systems is the need to operate in man-made environments. We aim at creating robotic systems capable of interacting with humans (patients, medical/nursing staff, visiting relatives) in an intuitive and safe manner. In the "hospital 4.0" model, the automation system for logistics represents a further integrated node within the ICT network for the management of services, whose architecture is typically distributed and whose management and analysis methodologies are typical of the data analysis presented in the previous point. By extending such integration to a "smart grid" for the management of the utilities, it is possible to increase the energy efficiency of the logistics system. Such an integration also enables a notable reduction of the peaks of power drawn from the grid ("peak shaving"), with a consequent reduction of the expense for electricity consumption. We want to evaluate the integration of methodologies typical of energy optimization with the management of logistics automation, in order to ensure the safe performance of all critical operations while controlling and, if possible, optimizing energy consumption.

1.1.4. Robotics for Health

To ensure continuous and personalized care for patients in the wards or at their homes, solutions involving the use of nurse-robots will be investigated. These intelligent machines will help patients perform simple daily actions, facilitate remote monitoring and communication with medical staff or relatives, administer simple therapies, or be used for entertainment (reading, storytelling, playing games) [3].
In addition to nursing robots, devices and control strategies for rehabilitation will be designed, such as virtual agents implemented in augmented reality that can interact with the patient through advanced automatic control techniques and provide real-time data to medical staff through telemedicine strategies. Robotics is already a widespread reality in several medical-surgical specialties. The use of tele-operated or computer-guided machines offers numerous advantages such as precision, repeatability, and tremor filtering [4]. Robots such as the da Vinci system for minimally invasive robotic surgery make it possible to improve the post-operative course of patients and reduce its duration. The aim is to improve the capabilities of currently used robots through the use of new sensors, advanced image processing and sensory fusion techniques, computerized procedures for surgery planning based on pre-operative images or guidance through intra-operative image processing, virtual and augmented reality, and new human-robot interfaces [5, 6]. These interfaces, connected to analog or software simulators, will be used for the training of surgeons, thanks to the already structured collaboration of DIETI within the ICAROS center. New sensorized surgical instruments will be designed and controlled, taking inspiration from human manipulation capabilities. Anthropomorphic gripping instruments will be developed for both surgery and rehabilitation.

2. The Research Infrastructures

The laboratory is intended as a hub for collecting the computational and storage needs of researchers at Unina. In more detail, the goal is to provide complete support that leverages the skills of the laboratory team, creating room for exchanging ideas and providing multidisciplinary insight on how to build solid benchmarks for assessing research results in several fields.
In the following, we describe the two big data [7, 8, 9, 10, 11, 12, 13] infrastructures that will be available for researchers.

2.1. Public Cloud resources

The eHBDA laboratory in the Cloud is based on virtual infrastructures, i.e., Cloud services, consisting of the following main components: data collection, storage, processing and consumption. Below is a description of the elements composing our infrastructure.

Data collection module. It features 2 virtual machine instances, each with 8 virtual cores on 3.0 GHz Xeon Platinum processors (with the possibility of a further performance increase through Intel Turbo Boost technology), 16 GB of RAM and 20 GB of SSD persistent storage.

Data storage module. It offers a managed Relational Database Service based on MySQL-compatible technologies with a minimum space of 2 TB. The service includes the ability to automatically schedule and execute backups and makes available the tools needed for a possible restore; it is also possible to extend the single database instance in order to support Big Data analysis scenarios. The database service is implemented by virtual machines having the following characteristics: 8 virtual cores with Intel Xeon Ivy Bridge processors, the possibility of a memory bandwidth larger than 60000 MB/s, and 122 GB of RAM. The module also provides a NoSQL database service with a storage space of 100 GB, an item size of 20 KB, data access times below ten milliseconds, integrated backup and the ability to scale dynamically according to the workload without any service interruption. The Object Storage service guarantees high durability of objects; it is provided in serverless mode and can scale on demand. The service can also be used as a storage service for the implementation of a data lake. It includes a storage space of at least 200 GB, with the possibility of making at least 2000 Req/month and 2000 Put/month.
Furthermore, there is an in-memory cache service based on Redis, with available space of at least 60 GB. The service is implemented by virtual machines having the following features: 2 virtual cores and 15 GB of RAM. Finally, a managed service for real-time data ingestion is available, capable of receiving an input data flow of up to 1 MB/s; it integrates with Spark so that streaming data can be processed in real time by Spark Streaming.

Data processing module. This module provides a managed data warehouse service, with the possibility of scaling out the computing cluster in order to manage the increase in the volume of data to be processed. The solution includes backup functionalities able to manage the saving of the database even when the managed volumes grow suddenly. Overall, the computing cluster provides 8 TB of storage space, 8 vcores and 120 GB of RAM. The Business Intelligence service is implemented in IaaS mode by 4 virtual machine instances, each having the following characteristics: 4 virtual cores with Xeon Broadwell processors, 30 GB of RAM, and 100 GB of SSD persistent storage. A managed service for the implementation of a cluster based on Hadoop, Hive and Spark technologies (including the Spark SQL, Spark MLlib, Spark Streaming and GraphX extensions) is integrated with storage solutions for the implementation of the data lake, both internal and external, allowing the costs and resources dedicated to computing to be decoupled from those dedicated to data storage. The cluster solution is easily integrated with the tools used for the creation of NoSQL databases and data warehouses. The Hadoop cluster has 8 nodes with the following features: 8 virtual cores with Intel Xeon Ivy Bridge processors, 60 GB of RAM, and 100 GB of SSD storage directly attached to the instance, plus 100 GB of persistent storage.
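As a quick reading aid for the sizing above, the per-node figures of the Hadoop cluster can be aggregated into cluster-wide totals. The following is a minimal Python sketch, not part of the described platform: the `NodeSpec` class and the `cluster_totals` function are hypothetical names introduced here for illustration, and the numbers are simply the ones quoted in the text (8 nodes, each with 8 virtual cores, 60 GB of RAM, 100 GB of directly attached SSD and 100 GB of persistent storage).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources (hypothetical helper type, for illustration only)."""
    vcores: int
    ram_gb: int
    local_ssd_gb: int
    persistent_gb: int


def cluster_totals(node: NodeSpec, count: int) -> dict:
    """Multiply each per-node resource by the number of identical nodes."""
    return {
        "vcores": node.vcores * count,
        "ram_gb": node.ram_gb * count,
        "local_ssd_gb": node.local_ssd_gb * count,
        "persistent_gb": node.persistent_gb * count,
    }


# Figures quoted in the text for one Hadoop cluster node.
HADOOP_NODE = NodeSpec(vcores=8, ram_gb=60, local_ssd_gb=100, persistent_gb=100)
HADOOP_TOTALS = cluster_totals(HADOOP_NODE, count=8)

if __name__ == "__main__":
    for resource, total in HADOOP_TOTALS.items():
        print(f"{resource}: {total}")
```

Under these figures the cluster offers 64 virtual cores and 480 GB of RAM in total, which gives a rough upper bound when planning how many Spark executors can run concurrently.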
Finally, a code service allows writing programs in the following languages: Node.js (JavaScript), Python, Java (Java 8 compatible), C# (.NET Core) and Go. It enforces limits (throttling) on the functions to prevent programming errors that can generate an unexpected increase in costs, and it supports the versioning of functions so that their definition can be changed quickly without impact on the running processes.

Data consumption module. The Business Intelligence service in fully managed mode allows the creation of dashboards and integration with the data sources included in the configuration (data lake, data warehouse, relational database). The service allows the drafting of fully customizable graphic dashboards, the definition of the analysis paths and the caching of the data in memory to allow their rapid exploration. The service provides at least 10 GB for the stored data model. The Business Intelligence service is based on technologies such as Kibana and QlikView and is supported by at least 2 virtual machine instances, each with the following configuration: 4 virtual cores on 3.0 GHz Xeon Platinum processors (with the possibility of a further performance increase through Intel Turbo Boost technology), 8 GB of RAM and 10 GB of SSD persistent storage.

2.2. Private Cloud resources

In order to give users a broader choice of solutions to fulfill any research need, we also built our own infrastructure for big data analysis. In the following, we describe the main features of our infrastructure.
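The description that follows quotes a few aggregate figures that can be rechecked with simple arithmetic: the aggregate node bandwidth (number of interfaces per node x speed per interface x number of nodes), the full-duplex capacity of the production switch, and the yearly downtime implied by the stated availability. The Python sketch below is only a reading aid under stated assumptions: the function names are hypothetical, the split of each node's 20 Gbps into 2 interfaces of 10 Gbps is an assumption (the text gives only the per-node total), and a 365-day year is assumed.

```python
def aggregate_bandwidth_gbps(interfaces_per_node: int,
                             gbps_per_interface: float,
                             nodes: int) -> float:
    """Interfaces per node x speed per interface x number of nodes."""
    return interfaces_per_node * gbps_per_interface * nodes


def switch_capacity_tbps(port_groups, full_duplex: bool = True) -> float:
    """Sum port speeds (Gbps) over (count, speed) groups; double if full-duplex."""
    one_way_gbps = sum(count * speed for count, speed in port_groups)
    factor = 2 if full_duplex else 1
    return one_way_gbps * factor / 1000


def yearly_downtime_seconds(availability_percent: float,
                            days_per_year: int = 365) -> float:
    """Expected unavailable time per year for a given availability."""
    seconds_per_year = days_per_year * 24 * 3600
    return seconds_per_year * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Assumption: 2 x 10 Gbps interfaces per node, 10 compute nodes.
    print(aggregate_bandwidth_gbps(2, 10, 10), "Gbps aggregate")
    # Production switch: 48 x 25GbE ports + 6 x 100GbE ports.
    print(switch_capacity_tbps([(48, 25), (6, 100)]), "Tbps full-duplex")
    # Storage availability of 99.9999%.
    print(round(yearly_downtime_seconds(99.9999), 1), "s of downtime per year")
```

With these inputs the node aggregate matches the 200 Gbps stated for the cluster, the port speeds add up to the 3.6 Tbps full-duplex fabric of the production switch, and an availability of 99.9999% corresponds to roughly half a minute of downtime per year.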
We have 10 compute nodes configured as a cluster with the following features: power supplies for at least 7500 W, with at least two hot-swappable power supplies per node; 100 cores in total at 1.80 GHz, with at least 11 MB of cache per CPU; 1280 GB of total RAM, in banks of at least 32 GB DDR4; 4.8 TB of total hot-plug SSD storage, with drives of at least 480 GB; 10 TB of hot-plug SAS/SATA storage, with disks of at least 1 TB; a RAID controller and an onboard SATA controller; a PCIe x16 expansion slot in each node; network connectivity of at least 20 Gbps for each node and an aggregate (number of interfaces per node x speed per interface x number of nodes) of at least 200 Gbps; a remote management card on each node with a dedicated interface of at least 1 Gbps; and a rack mount kit. Storage is provided by 750 TB of SSD disks, with 500,000 IOPS, a latency of 10 ms, a 12-core 2.40 GHz CPU with 30 MB of cache, and connectivity of 4 x 10 GbE + 4 x 1 GbE. The storage can scale up to 1 PB with an availability of 99.9999%, through a dedicated switch with 16 10GbE ports and 2 25GbE/100GbE ports.

In order to obtain a high performance level, we built a dedicated network infrastructure as follows. A switch for the production network, including 3 m cables for connection to the nodes, has the following characteristics: 48 25GbE SFP28 ports + 6 100GbE ports; a 3.6 Tbps (full-duplex) non-blocking, store-and-forward switching fabric; L2 and L3 Ethernet switching with support for QoS; features for IPv4 and IPv6, including support for OSPF and BGP routing; and support for OpenFlow v1.3. A switch for the management network has 48 1GbE Base-T ports + 4 10GbE ports and 3 m RJ45/RJ45 cables for connection to the nodes. Finally, we have 3 racks of 750x1200 mm able to contain all the equipment described above (computing nodes, storage nodes and switches).

3.
Conclusion

In this paper, we described the e-health laboratory we implemented at DIETI in order to support a variety of big data analyses for a broad set of application scenarios [14]. The goal of this paper is to provide a quick view of an actual infrastructure that could be used as a reference architecture for top-class applications.

Acknowledgments

Supported by the Ministry of University and Research for Department of Excellence Project.

References

[1] G. Aceto, A. Botta, A. Pescapé, C. Westphal, Efficient storage and processing of high-volume network monitoring data, IEEE Transactions on Network and Service Management 19 (2013) 1–14.
[2] K. Wang, Y. Shao, L. Shu, C. Zhu, Y. Zhang, Mobile big data fault-tolerant processing for ehealth networks, IEEE Network 30 (2016) 36–42.
[3] S. Chiaverini, B. Siciliano, L. Villani, A survey of robot interaction control schemes with experimental comparison, IEEE/ASME Transactions on Mechatronics 4 (1999) 273–285. doi:10.1109/3516.789685.
[4] A. Mashayekhi, S. Behbahani, F. Ficuciello, B. Siciliano, Influence of human operator on stability of haptic rendering: a closed-form equation, International Journal of Intelligent Robotics and Applications 4 (2020). doi:10.1007/s41315-020-00131-6.
[5] A. Botta, L. Gallo, G. Ventre, Cloud, fog, and dew robotics: Architectures for next generation applications, in: 2019 7th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering (MobileCloud), IEEE, 2019, pp. 16–23.
[6] G. Stanco, A. Botta, G. Ventre, Dewros: a platform for informed dew robotics in ros, in: 2020 8th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering (MobileCloud), IEEE, 2020, pp. 9–16.
[7] V. Persico, A. Pescapé, A. Picariello, G. Sperlí, Benchmarking big data architectures for social networks data processing using public cloud platforms, Future Generation Computer Systems 89 (2018) 98–109. URL: http://www.sciencedirect.com/science/article/pii/S0167739X17328303.
doi:10.1016/j.future.2018.05.068.
[8] D. Agrawal et al., Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States (2012).
[9] V. R. Borkar, M. J. Carey, C. Li, Inside "Big Data Management": Ogres, Onions, or Parfaits?, in: International Conference on Extending Database Technology, 2012, pp. 3–14.
[10] The Economist, Data, data everywhere, The Economist (2010).
[11] M. Heron, V. L. Hanson, I. Ricketts, Open source and accessibility: advantages and limitations, Journal of Interaction Science 1 (2013) 1–10. URL: http://dx.doi.org/10.1186/2194-0827-1-2. doi:10.1186/2194-0827-1-2.
[12] Nature, Big data, Nature (2008).
[13] S. Mgudlwa, T. Iyamu, Integration of social media with healthcare big data for improved service delivery, SA Journal of Information Management 20 (2018). doi:10.4102/sajim.v20i1.894.
[14] G. Manco, E. Masciari, A. Tagarelli, A framework for adaptive mail classification, in: 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002), 4-6 November 2002, Washington, DC, USA, IEEE Computer Society, 2002, p. 387. URL: https://doi.org/10.1109/TAI.2002.1180829. doi:10.1109/TAI.2002.1180829.