=Paper= {{Paper |id=Vol-1810/EuroPro_paper_03 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-1810/EuroPro_paper_03.pdf |volume=Vol-1810 |dblpUrl=https://dblp.org/rec/conf/edbt/MonteKMRM17 }} ==None== https://ceur-ws.org/Vol-1810/EuroPro_paper_03.pdf
PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization

Bonaventura Del Monte¹, Jeyhun Karimov¹, Alireza Rezaei Mahdiraji¹, Tilmann Rabl¹,², Volker Markl¹,²
¹ German Research Center for Artificial Intelligence (DFKI), ² TU Berlin
¹ firstname.lastname@dfki.de, ² firstname.lastname@tu-berlin.de


ABSTRACT
Big data analytics is a critical and unavoidable process in any business and industrial environment. Nowadays, companies that exploit big data's inner value earn more revenue than those that do not. Once companies have determined their big data strategy, they face another serious problem: designing and building in-house a scalable system that runs their business intelligence is difficult. The PROTEUS project aims to design, develop, and provide an open, ready-to-use big data software architecture that is able to handle extremely large historical data and data streams and that supports online machine learning, predictive analytics, and real-time interactive visualization. The overall evaluation of PROTEUS is carried out using a real industrial scenario.

1. PROJECT DESCRIPTION
PROTEUS¹ is an EU Horizon 2020² funded research project whose goal is to investigate and develop ready-to-use, scalable online machine learning algorithms and real-time interactive visual analytics, taking care of scalability, usability, and effectiveness. In particular, PROTEUS aims to solve the following big data challenges by surpassing the current state-of-the-art technologies with original contributions:
     1. Handling extremely large historical data and data streams
     2. Analytics on massive, high-rate, and complex data streams
     3. Real-time interactive visual analytics of massive datasets, continuous unbounded streams, and learned models
PROTEUS's solutions for the challenges above are: 1) a real-time hybrid processing system built on top of Apache Flink³ (formerly Stratosphere⁴ [1]) with support for optimized relational algebra and linear algebra operations through the LARA declarative language [2, 3], 2) a new library for scalable online machine learning and data mining called SOLMA, and 3) investigation and development of incremental visual methods that allow end-users to efficiently explore both batch and streaming data for making well-informed decisions in real time. These three subsystems will be integrated in a single platform running in a containerized environment. Once the platform is deployed in a cluster, its life-cycle is as follows: 1) the end-user writes data analytics tasks in LARA, mixing extract-transform-load and SOLMA algorithm pipelines, and executes them on top of the PROTEUS hybrid processing system, 2) the system continuously trains the deployed machine learning models in an online fashion, 3) the visual stack queries those models and displays the requested real-time predictions and statistics to the end-user.
PROTEUS faces an additional challenge: the correct integration of machine learning solutions in big data processing systems, taking into account the principal anti-patterns and risk factors that affect this kind of interaction [4].
In addition, PROTEUS ensures the achievement of its goals through rigorous experimental testing and industrially validated processes. The project is guided by the specific requirements of the hot strip mill steel-making process, provided by an industrial partner of the PROTEUS consortium. The hot strip mill produces coils, whose quality is affected by several parameters (e.g. temperature, vibration intensity, tension in the rollers). Since coils are used in further production stages, they must present no defect. Predicting anomalies through the analysis of massive real-time data generated during the hot strip mill process is the main target in this validation scenario.
Regardless of the above validation scenario, the PROTEUS platform is also applicable to general data stream analysis in other domains.
Acknowledgements. This work was supported by the EU Horizon 2020 project PROTEUS (687691).

¹ https://www.proteus-bigdata.com/
² https://ec.europa.eu/programmes/horizon2020/
³ https://flink.apache.org/
⁴ http://stratosphere.eu/

2. REFERENCES
[1] A. Alexandrov, R. Bergmann, et al. The Stratosphere platform for big data analytics. The VLDB Journal, 23(6):939–964, Dec. 2014. ISSN 1066-8888.
[2] A. Alexandrov, A. Kunft, et al. Implicit parallelism through deep language embedding. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp. 47–61. ACM, New York, NY, USA, 2015. ISBN 978-1-4503-2758-9.
[3] A. Kunft, A. Alexandrov, et al. Bridging the gap: Towards optimization across linear and relational algebra. In Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR '16, pp. 1:1–1:4. ACM, New York, NY, USA, 2016. ISBN 978-1-4503-4311-4.
[4] D. Sculley, G. Holt, et al. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). 2014.

© 2017, Copyright is with the authors. Published in Proc. 20th International Conference on Extending Database Technology (EDBT), March 21-24, 2017 - Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0