V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 V. Jirkovský, M. Obitko

Semantic Heterogeneity Reduction for Big Data in Industrial Automation

Václav Jirkovský(a,b) and Marek Obitko(b)

a Czech Technical University in Prague, Prague, Czech Republic
b Rockwell Automation Research and Development Center, Pekařská 695/10a, Prague, Czech Republic
{vjirkovsky,mobitko}@ra.rockwell.com

Abstract. The large amounts of diverse data collected in the industrial automation domain, such as sensor measurements together with information in MES/ERP¹ systems, need special handling that was not possible in the past. Big Data technologies contribute a lot to the possibility of analyzing such amounts of data. However, we need to handle not only data volume, which is usually the major focus of Big Data research; we also need to focus on the variety of data. In this paper, we focus primarily on the variety of industrial automation data and present and discuss a possible approach to handling their semantic heterogeneity. We show a heterogeneity reduction process that exploits Semantic Web technologies. The steps include the construction of an upper ontology describing all data sources, the transformation of data according to this ontology, and finally the analysis with the help of the Big Data paradigm. The proposed approach is demonstrated on data measured by sensors in a passive house.

Keywords: Big Data, Semantic heterogeneity, Data integration, Industrial automation

1 Introduction

Large amounts of diverse data can be collected in virtually any domain today, including the industrial automation domain. A typical example is an assembly line — at the lowest level, there are sensors reading values important for low-level control, such as for moving machines, while at the upper levels, there are quality checks and monitoring, together with the flow of customer orders, etc. The data that can be captured are of huge quantity and variety. In addition, for a meaningful use of the results of their analysis, it is often necessary to react quickly. These characteristics correspond to the three Vs of Big Data: volume, velocity, and variety. However, let us emphasize from the beginning that the area of automation is much wider than just assembly lines — a control system can be used for any process where human intervention needs to be reduced for various reasons.

¹ MES - Manufacturing Execution System; ERP - Enterprise Resource Planning

Similarly to other domains that need to use the Big Data paradigms, data in industrial automation are already captured, processed, and analyzed; however, the properties of these data used to allow only smaller-scale processing. For example, sensor data are used for low-level control of a single machine. In addition, the data can be kept in historian software for further offline analysis — and even this analysis of single time series stored from measured data streams becomes hard for classical historian software packages with larger amounts of data. The Big Data technologies were developed primarily by web companies to handle large amounts of user clickstreams and other information in order to offer suitable products and ads in real time. Using these technologies, it becomes possible and meaningful to analyze much larger amounts of data than with classical data analysis packages.
However, we must note that the focus of the existing technologies has been mainly on the volume of data. This is understandable — handling the volume even for simple analyses was what was needed most. As we discuss later in the paper, the variety of data is equally important and not yet satisfactorily solved. In fact, one of the main outcomes of the survey [1] among leading Fortune 500 companies is that "it's about variety, not volume" and that data integration will continue to be one of the greatest challenges faced.

Handling semantic heterogeneity for Big Data applications in industrial automation is the main topic we discuss in this paper. The rest of this paper is organized as follows: We describe Big Data features and existing technologies related to the types of data that are to be processed in industrial automation. Then we discuss the data heterogeneity problems we are facing and propose how to handle them, at least in our domain. We illustrate the approach on an example of data measured in a passive house, where automation helps to maintain conditions suitable for humans. After that we provide a conclusion with an outlook for our future work.

2 Big Data

The term Big Data is used for data growing so much that it becomes difficult to manage them using existing database management concepts and tools [17]. The difficulties may be related to data capture, storage, search, sharing, analytics, etc. Such data noticeably exhibit the following properties: the amount of data is large and quickly growing (volume), the data are generated and need to be processed at higher speed, usually in "real" time (velocity), and the data include less structured forms, including texts, images, and videos (variety).

The main reason for the necessity of handling such data is better data-driven decision making. The ability to handle these data presents an opportunity to create business advantage; however, it also requires new infrastructures and new ways of thinking, different from the classical business intelligence approaches. The infrastructures, tools, and frameworks developed to manage such data were originally developed by web companies for storing and analyzing data, such as for searching and for analyzing clickstreams and user behavior. The required outcome is typically offering relevant advertisements, products, and services to individual users.

Let us mention an example from the automation domain [2]: a CPG (Consumer Packaged Goods) company generates 5000 data samples every 33 milliseconds, leading to about 4 trillion samples per year. Such data need to be stored and processed in real time.
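As a back-of-the-envelope check of this figure (our own arithmetic, not taken from [2]), the quoted rate indeed lands in the trillions of samples per year:

```latex
\frac{5000~\text{samples}}{0.033~\text{s}} \approx 1.5\times10^{5}~\text{samples/s},
\qquad
1.5\times10^{5}~\text{samples/s}\,\times\,3.15\times10^{7}~\text{s/yr}
\approx 4.8\times10^{12}~\text{samples/yr}.
```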
3 Related Work

There are many existing solutions providing more or less capability to deal with the data heterogeneity problem within the Big Data domain. Some of the interesting and promising solutions are listed below.

Autonomy IDOL 10², developed by HP, where IDOL stands for Intelligent Data Operating Layer, is aimed at providing a single processing layer for conceptual, contextual, and real-time understanding of data. IDOL allows organizations to form a conceptual understanding of information based on patented probabilistic algorithms that automatically recognize concepts and ideas expressed in all forms of information (documents, video, chat, phone calls, and emails).

² http://www.autonomy.com/technology/idol-the-os/

The next available complex solution is IBM InfoSphere, especially Master Data Management [4] and Information Server [3]. These systems are cornerstones of IIG — Information Integration and Governance, a unified set of capabilities that brings together data from diverse sources. IIG delivers critical capabilities to Watson Foundations, the IBM big data and analytics platform.

Optique [6] is an ontology-based data access solution for Big Data. The system aims at providing an end-to-end solution where end users formulate queries based on an ontology. The main components are the ontology and the mappings that provide the relationships between the ontology and the underlying data. User queries over the heterogeneous data are transformed with the help of these mappings and the ontology.

4 Data Heterogeneity

Information integration from heterogeneous data sources is a significant problem in every complex IT system, and every successful system has to be able to deal with this integration problem. In this section, a categorization of information integration scenarios and of heterogeneity types is presented.

General information integration deals with a set of local information sources, potentially storing their data in different formats (RDF, XML, ...), which have to be integrated to provide users with unified data access.

Schema integration [5] — the oldest scenario of information integration; it represents the situation when two (or more) local data sources need to be integrated. The essential step of the schema integration process is to identify correspondences between semantically identical entities of the schemas.

Catalogue integration [11] — in Business-to-Business (B2B) applications, trade partners store information about their products in electronic catalogues (product directories of an electronic sales portal) — the central warehouse of the marketplace. Finding correspondences among entries of the catalogues is referred to as the catalogue matching problem. Having identified the correspondences between the entries, users of a marketplace can benefit from unified access to the products on sale.

Data integration [10], also known as enterprise information integration — the integration of information from multiple local sources without loading their data into a central warehouse. It ensures the interoperation of multiple local sources while retaining access to up-to-date data.

There are many different classifications of heterogeneity, e.g., in [9]. We consider the following types of heterogeneity, according to [8]:

– Syntactic heterogeneity occurs when two data sources are not expressed in the same language; e.g., it arises when two ontologies are modelled using different knowledge representation formalisms, such as OWL and F-logic.

– Terminological heterogeneity stands for variations in names when referring to the same entities in different data sources.

– Conceptual heterogeneity, also known as semantic heterogeneity or logical mismatch, denotes differences in modeling the same domain of interest. It is caused by the use of different axioms for defining concepts or by the use of totally different concepts. We can distinguish between a conceptualization mismatch (i.e., differences between the modeled concepts) and an explication mismatch (i.e., the concepts are expressed in a different way):
• Difference in coverage occurs when two data sources describe different regions of the world at the same level of detail and from a unique perspective.

• Difference in granularity occurs when two data sources describe the same region of the world from the same perspective but at different levels of detail (e.g., geographic maps with different scales).

• Difference in perspective (difference in scope) occurs when two data sources describe the same region of the world, at the same level of detail, but from a different perspective (e.g., a political map vs. a geological map).

– Semiotic heterogeneity, also known as pragmatic heterogeneity, stands for different interpretations of entities by people. This kind of heterogeneity is difficult for a computer to detect and even more difficult to solve.

Furthermore, it is common to face several types of heterogeneity together. One possible way to reduce heterogeneity between data sources, partially or completely, is similarity matching.

4.1 Heterogeneous Data in Industrial Automation

Data integration in the industrial automation domain can be expressed as an Enterprise Information Integration (EII) problem. The goal of EII is the ability to provide uniform access to multiple data and information sources across an entire organization.

There are many different sources of information in an industrial company — data from the ERP system, the MES, the production line, and from outside the company. The ERP system knows what customers want, the MES knows how to build it, and data from production line sensors show how the production system works. However, the enterprise data source systems are often created by different vendors, and these systems do not speak the same language. Furthermore, it is no exception in the industrial domain that the same kind of device (from different vendors) mounted on a production line has a different interface, i.e., a different output. This situation is caused by the fact that the vendors follow, for example, different standards.

5 Heterogeneity Reduction for Big Data

This section describes the main characteristics of the data interoperability problem with a focus on Big Data and proposes a solution for dealing with it. The general process of the proposed heterogeneity reduction scheme is depicted in Fig. 1.

The first type of heterogeneity we have to deal with is structural heterogeneity (or syntactic heterogeneity). Various data sources are essential to ensure quick and valuable decisions. In industrial automation, it is also common to predict the next control steps from various sources as mentioned earlier — from low-level as well as high-level systems. It is therefore necessary to take many different types of data sources into consideration for our application — text files, XML files, databases, etc.

The second type of heterogeneity is semantic heterogeneity. A possible solution to this problem is the creation of a shared ontology which enables the transformation of the data sources into the same "language".

We present a possible solution for dealing with the previously mentioned types of heterogeneity, as well as with the general heterogeneity reduction problem, in the following subsection. The suggested approach has to take into account all the Big Data characteristics — not only variety, but also volume and velocity. It is confronted with processing huge amounts of data (e.g., data stored in databases or data streams from production lines) and with the required speed of data processing, data loading, and data storing.
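To make the structural part concrete, the following minimal sketch normalizes the same kind of sensor reading from two differently structured sources into one common record type. The TXT and XML layouts used here are hypothetical illustrations (the actual source format used in our use case is described in Section 6):

```python
# Minimal sketch of syntactic heterogeneity reduction: two source
# formats, one common record type. Layouts are hypothetical.
from dataclasses import dataclass
from datetime import datetime
import xml.etree.ElementTree as ET

@dataclass
class Reading:
    """Common record type shared by all normalized sources."""
    sensor_id: str
    quantity: str
    value: float
    timestamp: datetime

def from_txt(line: str) -> Reading:
    # Hypothetical semicolon-separated layout:
    # "CO2_SENSOR_1430;CO2;286;2012-04-01T00:00:41"
    sensor_id, quantity, value, ts = line.strip().split(";")
    return Reading(sensor_id, quantity, float(value), datetime.fromisoformat(ts))

def from_xml(fragment: str) -> Reading:
    # Hypothetical XML layout carrying the same reading:
    # <reading sensor="CO2_SENSOR_1430" type="CO2" time="2012-04-01T00:00:41">286</reading>
    el = ET.fromstring(fragment)
    return Reading(el.get("sensor"), el.get("type"),
                   float(el.text), datetime.fromisoformat(el.get("time")))
```

Once all sources are reduced to such a common structure, the remaining differences are purely semantic and can be addressed by the shared ontology.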
5.1 Ontology Construction

The goal is to deal with data heterogeneity, and the first step in our approach is the creation of a shared ontology which ensures knowledge sharing. The proper creation of the shared ontology is essential for the subsequent processing.

[Fig. 1. Heterogeneity reduction process: unstructured, semi-structured, and structured data sources (TXT, HTML, XML, databases) pass through Big Data information retrieval pre-processing and ontology construction with data mapping and user verification, yielding a shared ontology that drives the transformation of the data sources into a shared storage (RDF triplestore).]

First, we have to deal with structural diversity. The structural heterogeneity problem falls into the pre-processing step. Data source processing strategies differ depending on the source category. Semi-structured documents may be transformed into a predefined relational structure. HTML documents can be indexed and reduced to free text. For processing textual data sources, the system must utilize language-specific natural language processing systems (e.g., GATE [7]).

The next step is the construction of a shared ontology from the pre-processed data. A crucial step is understanding the content and identifying the corresponding entities across all data sources. Ontology matching systems can be exploited for this task. We adopted our previously developed system MAPSOM, presented in [12]. It is a semi-automatic matching system based on similarity measure aggregation with the help of Kohonen's self-organizing maps and hierarchical clustering. MAPSOM focuses on user involvement in ontology matching, which is suitable for domains such as industrial automation — i.e., for applications where the highest precision is preferred over processing time.

The other possible supporting tool for shared ontology construction is formal concept analysis (FCA). According to [15], FCA is a possible means of supporting and simplifying the ontology design process. It is a theory of data analysis which identifies conceptual structures among data sets. This approach helps to design better ontologies that are more suitable for ontology sharing than pure taxonomies.

After the shared ontology construction, a transformation of the data is needed for subsequent utilization. In the following paragraphs, two possible ways are described.

Data source transformation "on the fly" (on demand — i.e., when a user submits a query) is common in many solutions. This approach has one significant drawback: the additional computational and time requirements on every access. If a user submits a query, an initialization and transformation of the data sources is needed. This solution is suitable only when storage capabilities are limited or when data access is not frequent.

The second way is the creation of a "snapshot" (a shared storage). This is more suitable in many applications — it can speed up the subsequent analysis. On the other hand, it is important to take the up-to-dateness of the data sources into account — i.e., how often the source data need to be re-transformed into the new shared storage.
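A minimal sketch of the snapshot variant is given below, assuming records already normalized during pre-processing (objects with sensor_id, value, and timestamp attributes) and a hypothetical shared-ontology namespace; rdflib and the property names are used here only for illustration:

```python
# Sketch of the "snapshot" approach: transform all records once,
# driven by the shared ontology, and persist them as RDF.
# The namespace IRI and property names are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/shared-ontology#")  # hypothetical IRI

def build_snapshot(readings, path="snapshot.ttl"):
    g = Graph()
    for i, r in enumerate(readings):
        obs = EX[f"obs{i}"]
        g.add((obs, RDF.type, EX.Observation))
        g.add((obs, EX.isProducedBy, EX[r.sensor_id]))
        g.add((obs, EX.hasValue, Literal(r.value)))
        g.add((obs, EX.hasTimestamp, Literal(r.timestamp.isoformat())))
    # Written once; can then be bulk-loaded into a triplestore.
    g.serialize(destination=path, format="turtle")
```

The resulting snapshot trades storage space and possible staleness for fast repeated access — exactly the trade-off discussed above.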
6 Testing Scenario

To demonstrate the previously described approach and to show the wide usability of the methods, we have selected the following use case. It is based on our former project dealing with measuring the physical behavior of passive houses and fine-tuning a dynamic simulation model approximating their operation. This use case and the simulation model to be fine-tuned were presented in [14].

First of all, the starting point of the process is the pre-processing step. The data sources for this demonstrative example are the SSN Ontology³ and measured sensor data stored in text files, i.e., temperature, carbon dioxide concentration, relative humidity, and air pressure. All sensor data are contained in text files divided day by day. Every text file consists of a sensor header and a list of sensor records. The SSN Ontology is the other source, used for information completion. This source ontology is a domain-independent and end-to-end model for sensing applications, merging sensor-focused, observation-focused, and system-focused views. It is the backbone for the new shared ontology.

³ http://www.w3.org/2005/Incubator/ssn

In this case, the sensor data text file is parsed into a more usable structure (i.e., an object representing the properties and records of a certain sensor), and the SSN Ontology can stay in its original form. The pre-processing step is depicted in Fig. 2.

[Fig. 2. Pre-processing step of heterogeneity reduction: an excerpt of a daily sensor text file (record date 1.4.2012) with header fields Name: CO2_SENSOR_1430, Locality: pcpu1, Device Address: 0027, Type: CO2 – A, MJ: CO2 [ppm], Description: SENSOR_00158D000007CCA2, Static Address: 00158D000007CCA2, followed by time/value pairs such as Time = 00:00:41, Value = 286; next to it, the SSN Ontology as the source of general sensor knowledge.]

The manual (semi-manual) ontology construction is possible in this simple example. It can be supported by similarity mapping of the data source entities. The mapping step lies in finding similar concepts and properties with the help of similarity measures. It is not sufficient to use only one specific similarity measure, because a single similarity measure is able to reflect only some specific heterogeneity aspect. Therefore, the use of similarity aggregation is suitable to capture most of the dissimilarity problems, according to [12]. We used string similarity measures (prefix, suffix, and n-gram similarity) and language-based techniques, i.e., the extrinsic WordNet similarity measure (Resnik); a minimal sketch of such measures and their aggregation follows the list below. The following corresponding entities are the results of this step:

– Device — Device
– Sensor — CO2_SENSOR_1430
– ObservationValue — Value
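The sketch below illustrates the kind of string measures named above and a simple weighted aggregation. It is not the MAPSOM implementation (which aggregates measures with the help of self-organizing maps and involves the user), and the weights are arbitrary assumptions:

```python
# Illustrative string similarity measures and a naive weighted
# aggregation; not the MAPSOM system itself.
def prefix_sim(a: str, b: str) -> float:
    n = min(len(a), len(b))
    p = 0
    while p < n and a[p] == b[p]:
        p += 1
    return p / max(len(a), len(b), 1)

def suffix_sim(a: str, b: str) -> float:
    return prefix_sim(a[::-1], b[::-1])  # suffix match = prefix match on reversed strings

def ngram_sim(a: str, b: str, n: int = 3) -> float:
    grams = lambda s: {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / max(len(ga | gb), 1)  # Jaccard over n-grams

def aggregate(a: str, b: str, weights=(0.3, 0.3, 0.4)) -> float:
    sims = (prefix_sim(a, b), suffix_sim(a, b), ngram_sim(a, b))
    return sum(w * s for w, s in zip(weights, sims))

print(aggregate("ObservationValue", "Value"))  # nonzero thanks to suffix and n-gram overlap
```

A language-based measure such as Resnik's WordNet similarity would enter the aggregation in the same way, as an additional weighted component.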
The shared ontology construction is an important step and has to be carried out by a skilled domain expert. The created ontology (Fig. 3) is the basis for the ontology-driven transformation of the data sources. The shared ontology includes concepts and relationships for sensor properties, measurements, and taxonomies.

Finally, the last step is the storage of the transformed data from the data sources. Naturally, it is possible to use the original data sources with on-line ontology-driven data transformation, but this approach is inappropriate for big amounts of sensor data that require fast data access. Therefore, the storage of the transformed data is required for faster data access. The Hadoop-based RDF triplestore H2RDF [16] was selected due to the data characteristics as well as the ontology-driven transformation.

Using the described approach, we can take advantage of the integrated data sources for knowledge discovery with the help of MapReduce programming as known from the Big Data technologies, directly using RDF processing. The implementation of this approach, which is in progress and extends data processing without prior integration [13], will show the advantages of the direct usage of integrated data sources.

[Fig. 3. Generated shared ontology: concepts such as Device, Sensor, SensorDevice, Observation, and SensorOutput with attributes Name, Locality, Address, Description, and Static Address, connected by properties hasName, hasLocality, hasDescription, hasStaticAddress, isProducedBy, and hasValue (Value).]

6.1 Summary

The cornerstones of the proposed heterogeneity reduction approach are the transformation of the data sources into a simple common format, the translation into the "same language", and the shared ontology construction. All of the mentioned steps are complex tasks, and we have demonstrated one of the possible approaches to dealing with these challenges.

Our approach offers the possibility to integrate data sources for batch processing (utilization of the joint RDF storage), for real-time processing of sensor measurements (the shared ontology serves as the data source interface), and for a combination of both.

7 Conclusions

We have shown how the heterogeneity of data sources can be reduced by means of a shared ontology. This solution is proposed to enable dealing with the integration of diverse data collected in the industrial automation domain, such as sensor measurements or data from MES/ERP systems. The proposed solution was demonstrated on the passive house testing use case.

The capability of handling various data sources stored in databases or files, as well as various streams of data, is a significant advantage, and not only for the industrial automation domain. This capability can precisely capture various relationships among data sources and can therefore improve the process of decision making. As we have demonstrated in the state-of-the-art review, the variety aspect of Big Data is a very important one, but also very hard to solve.

Our future work includes further development of the proposed process for heterogeneity reduction in Big Data with a focus on the industrial automation domain. This work involves the construction of a complex software solution which will be able to deal with huge amounts of data from ERP/MES systems and streams of data generated by sensors on a production line, and also to integrate data from external data sources, such as weather conditions.

To conclude, the possibility of handling large volumes of data at high speed, and especially the integration of various data sources, is essential for gaining competitiveness in every enterprise. The heterogeneity of data has always been an important issue to handle, but it is much more visible and important today, when the trend of increasing amounts of processed data continues.

8 Acknowledgements

This research has been supported by the Rockwell Automation Laboratory for Distributed Intelligent Control (RA-DIC) and by the Grant Agency of the Czech Technical University in Prague, grant No. SGS12/188/OHK3/3T/13.

References
1. Big Data Executive Survey: Consolidated Summary Report. NewVantage Partners (2012)
2. The Rise of Industrial Big Data. Whitepaper, GE Intelligent Platforms (2012)
3. Integrating and Governing Big Data. Whitepaper, IBM Corporation (2014)
4. The MDM Advantage: Creating Insight from Big Data. Whitepaper, IBM Corporation (2014)
5. Batini, C., Lenzerini, M., Navathe, S.B.: A comparative analysis of methodologies for database schema integration. ACM Computing Surveys 18(4), 323–364 (1986)
6. Calvanese, D., Giese, M., Haase, P., Horrocks, I., et al.: Optique: OBDA solution for Big Data. In: The Semantic Web: ESWC 2013 Satellite Events, pp. 293–295. Springer (2013)
7. Cunningham, H.: GATE, a general architecture for text engineering. Computers and the Humanities 36(2), 223–254 (2002)
8. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer (2007)
9. Goh, C.H.: Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems. Ph.D. thesis, Massachusetts Institute of Technology (1996)
10. Halevy, A.Y., Ashish, N., Bitton, D., Carey, M., Draper, D., Pollock, J., Rosenthal, A., Sikka, V.: Enterprise information integration: successes, challenges and controversies. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 778–787. ACM (2005)
11. Ichise, R., Hamasaki, M., Takeda, H.: Discovering relationships among catalogs. In: Discovery Science, pp. 371–379. Springer (2004)
12. Jirkovský, V., Ichise, R.: MAPSOM: User involvement in ontology matching. In: Proceedings of the 3rd JIST Conference. Springer (2013)
13. Jirkovský, V., Obitko, M., Novák, P., Kadera, P.: Big Data analysis for sensor time-series in automation. In: Proceedings of the 19th ETFA Conference (2014)
14. Novák, P., Šindelář, R.: Design and verification of simulation models of passive houses. In: Proceedings of the 17th ETFA Conference (2012)
15. Obitko, M., Snášel, V., Šmíd, J.: Ontology design with formal concept analysis. In: CLA, vol. 110 (2004)
16. Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: H2RDF: Adaptive query processing on RDF data in the cloud. In: Proceedings of the 21st International Conference Companion on World Wide Web, pp. 397–400. ACM (2012)
17. Singh, S., Singh, N.: Big Data analytics. In: 2012 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1–4 (2012)