Data Science


 SOFTWARE AND HARDWARE INFRASTRUCTURE
      FOR DATA STREAM PROCESSING

         V.I. Protsenko1,2, P.G. Seraphimovich1,2, S.B. Popov1,2, N.L. Kazanskiy1,2
1
    Image Processing Systems Institute - Branch of the Federal Scientific Research Centre "Crys-
           tallography and Photonics" of Russian Academy of Sciences, Samara, Russia
                      2
                        Samara National Research University, Samara, Russia


         Abstract. In this paper state-of-the-art hardware and software technologies for
         stream data processing are reviewed. IBM InfoSphere Streams and Apache
         Spark are among of the most popular software products that alleviates burden of
         distributed program development for data analysis tasks. Capabilities of these
         systems are considered in application to the time series analysis. IBM In-
         foSphere Streams turns to be more suitable for online processing, whereas
         Apache Spark time series library focuses on a bulk processing of big collections
         of the time series.

         Keywords: data stream processing, high performance computing, time series
         analysis.


         Citation: Protsenko VI, Seraphimovich PG, Popov SB, Kazanskiy NL. Firm-
         ware and hardware infrastructure for data stream processing. CEUR Workshop
         Proceedings, 2016; 1638: 782-787. DOI: 10.18287/1613-0073-2016-1638-782-
         787


1        Introduction

In the recent years business and scientific organizations faced the problem of devel-
opment of analytic pipelines that could process large amounts of data in real-time and
be able to seamlessly incorporate new data sources and new queries. Previous ap-
proach, that suggest to use conventional data bases, was not suited for real-time data
analysis because of the need to store data before the processing. Priority in ACID
principle in databases also constrains its ability to scale well on clusters of tenths to
thousands of nodes. As a consequence, it is hard or impossible to process massive
amounts of data in a fixed time. New approaches were proposed: Hadoop framework
for bulk processing and dataflow graph based stream processing systems. Still a lot of
research results in the data base field is now used in data stream processing systems:
random sampling [1], aggregations [2, 3], join techniques [4, 5], query plan optimiz-
ers [6] and schedulers [7, 8, 9].


Information Technology and Nanotechnology (ITNT-2016)                                       782
Data Science                                      Protsenko VI, Seraphimovich PG. et al…


A trend of using commodity hardware started with projects Beowulf, Berkley NOW
and HPVM is still present today. 10 Gigabit Ethernet is already wide spread and 100
Gigabit Ethernet is emerging [10], top x86_x64 processors for PC include up to 8
cores on one chip. Also according to performance benchmarks [11] commodity ver-
sions of GPU have similar computational power as HPC variants, but lack durability
and support for high performance computing transport mechanisms. It turns out that
clusters could achieve a remarkable performance gain by using GPU, moving respon-
sibility of fault tolerance to the software framework like Apache Spark and IBM In-
foSphere Streams. This is true also to recently introduced coprocessor Intel Phi [12]
and FPGA units [13].


2      Data stream processing systems

2.1    Stream model
Data stream can be viewed as an sequence of elements 𝑥1, 𝑥2 , … , 𝑥𝑛 that has following
properties:

─ elements arrive continuously,
─ number of the elements in the stream could be infinite.

Depending on the application processing can be done per element of in terms of win-
dows. Windows are the rules according to which elements in the sequence are divided
in subsequences that should be analyzed as a whole. Windows are commonly divided
into two types: sliding window and tumbling window.
Tumbling windows are divided into 4 types according to the trigger policy that
launches processing of the current window:

─ count-based policy,
─ delta-based policy depending on changing attribute,
─ time based policy, according to the local or global time stamp,
─ punctuation based policy.

Sliding windows have more types than tumbling windows. They have three pro-
cessing trigger types and three evicting policy types, that together form nine variants.
Evicting policy could be one of the following:

─ count based,
─ delta-based,
─ time based.

Window semantics fits well time series analysis tasks and naturally matches wide
range of applications. Syntax and semantics of stream query language is still a topic
of the research [14-18].


Information Technology and Nanotechnology (ITNT-2016)                               783
Data Science                                              Protsenko VI, Seraphimovich PG. et al…


2.2       IBM InfoSphere Streams
One of the most popular commercial system for data stream analysis is IBM In-
foSphere Streams. Research on this system could be traced to 2008 when the first
engine was presented [19]. Since then the project matured into the robust system that
can be used in production for real-time text analysis, data extraction and financial
insights. System also offers good capabilities in the time series analysis.
Time series could be represented in the system in two ways: like a sequence stream
of elements or like a vector element of data stream. Moreover, each vector can be
interpreted as a time slice of several time series, and in this case data stream multi-
plexes several time series.
The system bundle contains com.ibm.streams.timeseries toolkit with number of im-
plemented algorithms for time series analysis. Among its capabilities are:

─ anomaly detection,
─ cross-correlation of two streams,
─ times series normalization,
─ DFT and DWT transforms,
─ stream statistics evaluation,
─ application of DSP to input sequence.

For short and long term predictions package has implementation of ARIMA model,
Holt-Winters model, Kalman filter and multivariate autoregression model. Fig 1
shows an example of online predictions according to the ARIMA model.
               400


                              Input time series
                              Predicted values
               300
        x(t)

               200
               100


                       0          20          40           60          80         100

                                                      t
      Fig. 1. Online ARIMA predictions one time unit ahead after the initial period of training


Information Technology and Nanotechnology (ITNT-2016)                                             784
Data Science                                            Protsenko VI, Seraphimovich PG. et al…


2.3    Apache Spark
Apache Spark is a new programming framework for distributed data processing. It has
a good integration with Hadoop ecosystem. Basic part of the project offers an ability
for bulk and stream processing in the terms of RDD [20] and DStream [21] abstrac-
tions. There are also additional projects: MLlib - a scalable machine learning library,
graph processing library GraphX, Spark SQL module for structured data processing,
SparkR module that enables integration with language R.
Time series analysis in Apache Spark is enabled by SparkTS library. Current imple-
mentation of the library includes following models: ARGARCH, ARIMA, EGARCH,
EWMA, GARCH and AR. Also a number of statistical tests: augmented Dickey-
Fuller, Breusch-Godfrey, Breusch-Pagan, Durbin-Watson, KPSS and Ljung-Box.
Apache Spark makes available bulk and stream distributed processing over thousands
of independent time series. However, it has no online processing algorithms as it is
the case with IBM InfoSphere Streams.
An experiment conducted on 2 nodes with 32 cores (four E5-2450 processors) con-
nected by 10Gb Ethernet shows that processing big amount of time series, 40000 in
our case, can be reduced from 27 minutes to 1.2 minutes. Results are shown in the fig
2 and table 1.
                       1600
                       500
           Time, sec

                       200
                       70


                                1 3 5 7 9     13   17       21   25     29

                                              Parallelism


 Fig. 2. Performance of Apache Spark on testing 40000 time series to be stationary with KPSS
                                             test

Table 1. Performance of Apache Spark on testing 40000 time series to be stationary with KPSS
                                           test

Parallelism                   Time, seconds
1                             1600
9                             207
17                            118
31                            78


Information Technology and Nanotechnology (ITNT-2016)                                     785
Data Science                                        Protsenko VI, Seraphimovich PG. et al…


3      Conclusion

Data analysis is an interactive task. Because of increasing speed of accumulating data
it start to be more a more important to use a lot of computational power to match suf-
ficient response time of analytic queries. Moreover, a subset of queries could be an-
swered in online fashion that can also decrease wait time. Apache Spark and IBM
InfoSphere Stream reviewed systems achieves these by managing large number of
nodes and are able to run data analysis tasks in distributed fashion. After the fixed
period of ARIMA learning time IBM InfoSphere Streams was able to make short-
term predictions in online fashion. Bulk processing time for test that time series is
stationary in Apache Spark was decreased from 27 minutes to 1.2 minutes. This way
data processing systems help to reveal the potential of commodity hardware and bring
the ability of time series analysis on large amounts of data. Short-term and long-term
predictions play important role in finance [22-23] and automatic control [24-26].
Elaborate design of IBM InfoSphere Streams system also allows to transform multi-
media data [27,28] which result can be further piped into time series analysis opera-
tors.


References
 1. Olken F. Random sampling from databases. University of California at Berkeley, 1993.
 2. Hellerstein JM, Haas PJ, Wang HJ. Online aggregation. ACM SIGMOD Record. ACM,
    1997; 26(2): 171-182.
 3. Fang M et al. Computing Iceberg Queries Efficiently. Internaational Conference on Very
    Large Databases (VLDB'98), New York, August 1998. Stanford InfoLab, 1999.
 4. Haas PJ, Hellerstein JM. Ripple joins for online aggregation. ACM SIGMOD Record.
    ACM, 1999; 28(2): 287-298.
 5. Luo G et al. A scalable hash ripple join algorithm. Proceedings of the 2002 ACM
    SIGMOD international conference on Management of data. ACM, 2002: 252-262.
 6. Graefe G, McKenna WJ. The Volcano optimizer generator: Extensibility and efficient
    search. Data Engineering, 1993. Proceedings Ninth International Conference on IEEE,
    1993: 209-218.
 7. Armbrust M et al. Spark sql: Relational data processing in spark. Proceedings of the 2015
    ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1383-
    1394.
 8. Abadi DJ et al. Aurora: a new model and architecture for data stream management. The
    VLDB Journal – the International Journal on Very Large Data Bases, 2003; 12(2): 120-
    139.
 9. Isard M et al. Dryad: distributed data-parallel programs from sequential building blocks.
    ACM SIGOPS Operating Systems Review. ACM, 2007; 41(3): 59-72.
10. Renaudier J et al. Transmission of 100Gb’s Coherent PDM-QPSK over 16x100km of
    standard fiber with allerbium amplifiers. Optics Express, 2009; 17(7): 5112-5119.
11. Bergdorf M et al. Desmond. GPU Performance as of October, 2015.
12. Reinders J. An overview of programming for Intel Xeon processors and Intel Xeon Phi co-
    processors. Intel Corporation, Santa Clara, 2012.


Information Technology and Nanotechnology (ITNT-2016)                                    786
Data Science                                         Protsenko VI, Seraphimovich PG. et al…


13. Hauck S, DeHon A. Reconfigurable computing: the theory and practice of FPGA-based
    computation. Morgan Kaufmann, 2010.
14. Maier D et al. Semantics of data streams and operators. Database Theory-ICDT 2005.
    Springer Berlin Heidelberg, 2005: 37-52.
15. Krämer J, Seeger B. Semantics and implementation of continuous sliding window queries
    over data streams. ACM Transactions on Database Systems (TODS), 2009; 34(1): 4 p.
16. Li J et al. Semantics and evaluation techniques for window aggregates in data streams.
    Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
    ACM, 2005: 311-322.
17. Arasu A, Babu S, Widom J. The CQL continuous query language: semantic foundations
    and query execution. The VLDB Journal – the International Journal on Very Large Data
    Bases, 2006; 15(2): 121-142.
18. Beck H et al. LARS: A Logic-Based Framework for Analyzing Reasoning over Streams.
    AAAI, 2015: 1431-1438.
19. Gedik B et al. SPADE: the system s declarative stream processing engine. Proceedings of
    the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008:
    1123-1134.
20. Zaharia M et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory
    cluster computing. Proceedings of the 9th USENIX conference on Networked Systems
    Design and Implementation. USENIX Association, 2012; 2 p.
21. Zaharia M et al. Discretized streams: an efficient and fault-tolerant model for stream pro-
    cessing on large clusters. Presented as part of the, 2012.
22. Bollerslev T, Chou RY, Kroner KF. ARCH modeling in finance: A review of the theory
    and empirical evidence. Journal of econometrics, 1992; 52(1-2): 5-59.
23. Franses PH, Van Dijk D. Non-linear time series models in empirical finance. Cambridge
    University Press, 2000.
24. Karpenko S et al. Visual navigation of the UAVs on the basis of 3D natural landmarks.
    Eighth International Conference on Machine Vision. International Society for Optics and
    Photonics, 2015: 98751I-98751I-10.
25. Shvets EA, Nikolaev DP. Complex approach to long-term multi-agent mapping in low dy-
    namic environments. Eighth International Conference on Machine Vision. International
    Society for Optics and Photonics, 2015: 98752A-98752A-10.
26. Miller A, Miller B. Stochastic control of light UAV at landing with the aid of bearing-only
    observations. Eighth International Conference on Machine Vision. International Society
    for Optics and Photonics, 2015: 987529-987529-10.
27. Protsenko VI, Kazanskiy NL, Serafimovich PG. Real-time analysis of parameters of mul-
    tiple object detection systems. Computer Optics, 2015; 39(4): 582-591. DOI:
    10.18287/0134-2452-2015-39-4-582-591.
28. Kazanskiy NL, Protsenko VI, Serafimovich PG. Comparison of system performance for
    streaming data analysis in image processing tasks by sliding window. Computer Optics,
    2014; 38(4): 804-810.


Information Technology and Nanotechnology (ITNT-2016)                                      787