<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Corresponding author: Erkan Karabulut.
Email: e.karabulut@uva.nl (E. Karabulut); v.o.degeler@uva.nl (V. Degeler); p.t.groth@uva.nl (P. Groth)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Semantic Association Rule Learning from Time Series Data and Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Erkan Karabulut</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoria Degeler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Groth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Science Park 904, Amsterdam, 1098 XH, North Holland</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Digital Twins (DT) are a promising concept in cyber-physical systems research due to their advanced features, including monitoring and automated reasoning. Semantic technologies such as Knowledge Graphs (KG) have recently been adopted in DTs, especially for information modelling. Building on this move, this paper proposes a pipeline for semantic association rule learning in DTs using KGs and time series data. In addition to this initial pipeline, we propose a new semantic association rule quality criterion. The approach is evaluated on an industrial water network scenario. The initial evaluation shows that the proposed approach is able to learn a high number of association rules with semantic information, which are more generalizable. The paper aims to lay a foundation for further work on semantic association rule learning, especially in the context of industrial applications.</p>
      </abstract>
      <kwd-group>
        <kwd>rule learning</kwd>
        <kwd>knowledge graph</kwd>
        <kwd>time series data</kwd>
        <kwd>digital twin</kwd>
        <kwd>internet of things</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We hypothesize that for numerical data produced in industrial IoT networks, incorporating
the semantics of the system components into rule learning can be beneficial: it can help to discover
previously unknown relations, e.g. a higher number of rules, and to generalize the learned association
rules. This hypothesis is tested in a specific type of IoT scenario, a Digital
Twin (DT). Many different definitions of DTs have been proposed over the past two decades [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
main goal of a DT is to create a precise representation (‘twin’) of a physical system, often referred to as the
Physical Twin (PT), in a digital environment and to maintain bi-directional communication
between them. Recently, semantic technologies such as ontologies and Knowledge Graphs (KGs)
have started to be used in DTs for system/data modeling, establishing semantic interoperability,
extracting semantic relations and/or facilitating reasoning processes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To the best of our knowledge, at the time of writing, there is no approach for learning rules
that contain semantic information as well as time series data in a DT. This study aims to fill this
gap by proposing a first semantic rule learning approach utilizing KGs, based on the well-known
FP-Growth [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] algorithm. Concretely, the contributions of this paper are as follows:
        <list list-type="bullet">
          <list-item><p>A full pipeline of operations that consists of: i) KG construction in DTs, ii)
semantic association rule learning based on the KG and time series data, and iii) making
inferences based on the learned rules (Section 2).</p></list-item>
          <list-item><p>A first approach (Naive SemRL) that extends the FP-Growth algorithm to learn rules
containing semantic information from KGs and time series data (Section 3).</p></list-item>
          <list-item><p>A semantic rule quality measure for evaluating the rules generated by semantic rule
learning algorithms (Section 4).</p></list-item>
        </list>
      </p>
      <p>The proposed approach is evaluated in an industrial use case, water networks, which refer to
water distribution systems that bring water to consumers, e.g. apartments and industrial sites. This
is ongoing research; its many open issues and research questions are discussed in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Semantic Rule Learning and Inference Pipeline</title>
      <p>This section first motivates the use of semantic association rule learning techniques in DTs,
and then describes a pipeline of operations.</p>
      <p>In the best case, a DT has full knowledge of its PT. In this situation, we say that
the PT is 100% “twinned”, or that the “twinning ratio” is 100%. A low twinning ratio is likely to
affect the performance of any reasoning or learning algorithm running in the DT, e.g. through too many
missing values. A major cause of a low twinning ratio is discrepancies
between a PT and its DT. A discrepancy refers to a state or attribute of a PT component that has
an incorrect or inaccurate representation in its DT. For instance, in a water network scenario, an
undetected leakage in a pipe is considered a discrepancy. Instead of using separate solutions for
each such issue, this study proposes a discrepancy detection method that generalizes
over all such discrepancies. The proposed approach consists of a pipeline (Figure 1) of three
operations: i) KG construction, ii) semantic rule learning, iii) making inferences.</p>
      <p>
        Knowledge Graph Construction in DTs. KGs are already being used for information
modeling in DTs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We hypothesize that DTs with high dependency among their sub-components,
e.g. the DT of a water network, can benefit from KGs not only in information modeling but also in
rule learning and making inferences. KGs for DTs can be constructed using a top-down approach
from DT metadata and a domain ontology that describes how the components are related to each
other [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Types of entities in DTs are never obscure: when a representation of a
physical object, e.g. a water pipe, is created in the digital environment, the type of the object is
also explicitly or implicitly assigned, e.g. by putting its metadata in a “water_pipe” table. An
ontology or a data schema can then be used to label each entity and the relations. Figure 2 shows KG
construction from partial water network metadata given in EPANET input<sup>1</sup> format.
      </p>
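      <p>As a rough illustration of the top-down construction described above, the following sketch turns a small EPANET-style metadata fragment into (subject, predicate, object) triples. The section names [JUNCTIONS] and [PIPES] and their leading columns follow the EPANET input format; the predicate names (type, connected_to, elevation, length, diameter), the sample values and the helper functions are assumptions made for this example and are not the schema or code used in this study.</p>
      <preformat>
```python
# Hypothetical metadata fragment in EPANET input style (values are made up).
INP_TEXT = """
[JUNCTIONS]
;ID   Elev  Demand
J1    12    0.5
J2    10    0.3

[PIPES]
;ID      Node1  Node2  Length  Diameter
Pipe_A   J1     J2     80      200
"""


def parse_sections(text):
    """Split the text into {section name: list of whitespace-split rows}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.split(";")[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if line.startswith("["):
            current = line.strip("[]")
            sections[current] = []
        elif current is not None:
            sections[current].append(line.split())
    return sections


def build_triples(sections):
    """Label every entity with its type and emit its relations and attributes."""
    triples = []
    for node_id, elevation, *_ in sections.get("JUNCTIONS", []):
        triples.append((node_id, "type", "Junction"))
        triples.append((node_id, "elevation", elevation))
    for pipe_id, node1, node2, length, diameter, *_ in sections.get("PIPES", []):
        triples.append((pipe_id, "type", "Pipe"))
        triples.append((pipe_id, "connected_to", node1))
        triples.append((pipe_id, "connected_to", node2))
        triples.append((pipe_id, "length", length))
        triples.append((pipe_id, "diameter", diameter))
    return triples


triples = build_triples(parse_sections(INP_TEXT))
for triple in triples:
    print(triple)
```
</preformat>
      <p>The entity type comes directly from the section in which a row appears, which mirrors the observation that entity types in a DT are assigned when the digital representation is created.</p>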
      <p>Semantic Rule Learning refers to learning association rules with semantic information,
so that the learned rules are applicable not only to specific entities but to a set of
entities with certain characteristics. The proposed semantic rule learning algorithm, Naive
SemRL, described in Section 3, utilizes the KG constructed in the previous step and historical time
series data. A simple example of a rule that does not contain semantic information is ‘if sensor1
measures V1, then sensor2 measures V2’. The goal of the proposed approach is to generate rules
in the form of ‘if a sensor with type T placed in a pipe P1 with attribute1 &gt; A1 measures V1, then
the sensor with type T2 that is placed in a Junction J1 connected to P1 measures V2’.</p>
      <sec id="sec-2-1">
        <p><fn><label>1</label><p>https://www.epa.gov/water-research/epanet</p></fn></p>
        <p>Algorithm 1 Naive SemRL</p>
        <preformat>1: procedure NaiveSemRL(knowledge_graph, disc_hist_time_series, k_neighbors)
2:     enriched_transactions = []
3:     for transaction in disc_hist_time_series do
4:         topology = knowledge_graph.topology(transaction.sensor_list(), k_neighbors)
5:         attributes = knowledge_graph.attributes(transaction.sensor_list(), k_neighbors)
6:         enriched_transactions.append(transaction + topology + attributes)
7:     end for
8:     return FP-Growth(enriched_transactions)
9: end procedure</preformat>
        <p>Making Inferences Based on Semantic Rules. In this phase of the pipeline, real-time
time series data is analyzed against the previously learned semantic association rules. An inference
engine gathers the rules that are not met, e.g. for a certain period of time, and makes inferences
based on these unmet rules. The methodology for this step is left for future
work; the focus of this study is on the first and second phases of the pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. A Naive Semantic Rule Learning Approach - Naive SemRL</title>
      <p>The main intuition behind the proposed approach, Algorithm 1, is that instead of learning
association rules for individual sensors, it generalizes sensor data using its metadata from the
KG. The algorithm does so by extending the transactions in a transaction database with semantic
information extracted from the KG. As an example, rather than seeing sensor data as ‘sensor
X measured value Y in a time frame T’, it generalizes to ‘a sensor with these attributes and
neighboring components measured value Y in a time frame T’.</p>
      <p>Naive SemRL requires as input a KG (knowledge_graph), discretized historical time series data
(disc_hist_time_series, from now on ‘TS’), and the number of neighbors (k_neighbors) to be analysed
for semantic relations. TS is a set of transactions where each transaction contains
discretized sensor data (items) for a certain time frame. The algorithm goes through each transaction
in TS (lines 3-7), and first finds the topology of the items based on the k_neighbors variable
(line 4). As an example, in Figure 2, the value of the topology variable for J1:Junction would be
[Pipe_C_ConnectedTo_J1, Pipe_B_ConnectedTo_J1, Pipe_A_ConnectedTo_J1], assuming the value
of k_neighbors is equal to 1. A list of attributes for each of the components and links is
extracted in line 5, and a new transaction is created in line 6. Finally, the FP-Growth algorithm is run on
the new set of transactions to discover association rules with semantic information (line 8).</p>
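      <p>The enrichment loop (lines 2-7 of Algorithm 1) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the KG is reduced to a list of triples, the attribute values are invented, a transaction is modelled as a mapping from a component to its discretized reading, and the helper names (neighborhood, topology, attributes, enrich) are hypothetical. The final FP-Growth call (line 8) is omitted; the enriched transactions would be passed to an FP-Growth implementation at that point.</p>
      <preformat>
```python
from collections import deque

# Toy KG as (subject, predicate, object) triples, named after the Figure 2 example.
TRIPLES = [
    ("Pipe_A", "ConnectedTo", "J1"),
    ("Pipe_B", "ConnectedTo", "J1"),
    ("Pipe_C", "ConnectedTo", "J1"),
    ("Pipe_C", "ConnectedTo", "J2"),
]
# Illustrative attribute values (assumed, not taken from any dataset).
ATTRIBUTES = {"Pipe_A": {"length": 80}, "J1": {"elevation": 12}}


def neighborhood(nodes, k_neighbors):
    """Breadth-first search: all nodes within k_neighbors hops of the start nodes."""
    seen = set(nodes)
    queue = deque((node, 0) for node in nodes)
    while queue:
        node, depth = queue.popleft()
        if depth == k_neighbors:
            continue
        for subject, _, obj in TRIPLES:
            if node in (subject, obj):
                other = obj if node == subject else subject
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
    return seen


def topology(nodes, k_neighbors):
    """Relation items for all triples inside the k-hop neighborhood (line 4)."""
    near = neighborhood(nodes, k_neighbors)
    return [f"{s}_{p}_{o}" for s, p, o in TRIPLES if s in near and o in near]


def attributes(nodes, k_neighbors):
    """Attribute items for every component inside the k-hop neighborhood (line 5)."""
    near = neighborhood(nodes, k_neighbors)
    return [f"{node}_{key}={value}"
            for node in sorted(near)
            for key, value in ATTRIBUTES.get(node, {}).items()]


def enrich(transactions, k_neighbors):
    """Append topology and attribute items to every transaction (lines 2-7)."""
    enriched = []
    for transaction in transactions:
        sensors = list(transaction)  # components that carry a reading
        readings = [f"{node}_measures_{value}" for node, value in transaction.items()]
        enriched.append(readings
                        + topology(sensors, k_neighbors)
                        + attributes(sensors, k_neighbors))
    return enriched


enriched_transactions = enrich([{"J1": 43}], k_neighbors=1)
print(enriched_transactions[0])
```
</preformat>
      <p>With k_neighbors set to 1, the topology items for J1 are the three pipe connections listed in the example above; raising it to 2 would also pull in the connection between Pipe_C and J2.</p>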
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary Evaluation and Industrial Use Case</title>
      <p>This section presents a semantic rule quality criterion that measures the generalizability of semantic
association rules, and an industrial use case to which the proposed approach is applied.</p>
      <sec id="sec-4-1">
        <title>4.1. A Semantic Rule Quality Criterion</title>
        <p>
          Many quality criteria for association rules have been proposed, with the most fundamental ones being
support and confidence [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, we were unable to identify an association rule quality
criterion that is specific to evaluating the semantic aspect of the learned rules. Therefore, we
propose the following “semantic expressivity” association rule quality measure:
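Definition 4.1 (Semantic Expressivity). Let <inline-formula><tex-math>C = \{c_1, c_2, \ldots, c_n\}</tex-math></inline-formula> be a set of classes in an ontology
(or a data schema). <inline-formula><tex-math>\forall c \in C\ (\mathit{has\_attributes}(c, \{a_1, a_2, \ldots, a_k\}))</tex-math></inline-formula>, where has_attributes(x, y) means that
class x has the set of attributes y, and attributes(x) denotes the number of attributes in class x. Let
<inline-formula><tex-math>I = \{i_1, i_2, \ldots, i_m\}</tex-math></inline-formula> be a set of items. <inline-formula><tex-math>\forall i \in I\ (i = (a \mathbin{\#} z))</tex-math></inline-formula>, where a is an attribute of a class c, #
is any comparison operation, and z is any value. <inline-formula><tex-math>X \rightarrow Y</tex-math></inline-formula> is an implication (association rule) where
<inline-formula><tex-math>X, Y \subseteq I</tex-math></inline-formula>. Finally, let instances(X) be the different class instances in X and let attr_count(c, X)
be the number of items in X that refer to attributes of an instance of a class c.
        </p>
        <p>
          <disp-formula><tex-math>\mathit{attr\_ratio}(X) = \prod_{c \,\in\, \mathit{instances}(X)} \frac{\mathit{attr\_count}(c, X)}{\mathit{attributes}(c)}</tex-math></disp-formula>
        </p>
        <p>
          <disp-formula><tex-math>\mathit{Semantic\_Expressivity}(X \rightarrow Y) = \frac{(1 - \mathit{attr\_ratio}(X)) \times (1 - \mathit{attr\_ratio}(Y))}{(\mathit{instances}(X) + \mathit{instances}(Y)) / 2}</tex-math></disp-formula>
        </p>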
        <p>Intuition. Learned rules may contain different levels of semantic information. Including
too much semantics in a rule makes it less generalizable, hence less ‘semantically expressive’.
For instance, “Junctions with 3 pipes have 1500-2000Pa water pressure” is more general than
“Junctions with 3 pipes where each of the pipes is 50-100m long, and has 2-3m diameter have
1500-2000Pa water pressure”. The main goal of the proposed quality criterion is to quantify
how semantically expressive a rule is by giving each rule a score between 0 and 1. Assume X in
X → Y is ‘{pipe_diameter &gt; 2}’, with pipe being a class in a water network ontology that
can have 3 attributes: diameter, length and elevation. In this case attr_ratio(X) = 1/3, since X
refers to only one of the three attributes. A low attribute ratio leads to high semantic expressivity, as the
formula includes (1 − attr_ratio(X)) × (1 − attr_ratio(Y)). The average number of instances is
included in the divisor in order to incorporate topology into the formula. As an
example, an association rule about a node with 3 neighbors will have a higher divisor, and hence a lower score, than a rule about a
node with 2 neighbors, which is more general.</p>
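        <p>The measure can be checked numerically with a few lines of Python. The sketch below follows our reading of Definition 4.1, in which attr_ratio is a product over the class instances of an itemset and the divisor is the average number of instances on both sides of the rule. Representing an itemset as a mapping from an instance name to the pair (attr_count, attributes) is an assumption made for this example, as are the instance names and every attribute total other than the three pipe attributes mentioned above.</p>
        <preformat>
```python
from math import prod


def attr_ratio(itemset):
    """Product, over the class instances in the itemset, of
    (items about that instance's attributes) / (attributes its class defines).

    `itemset` maps an instance name to the pair (attr_count, attributes).
    """
    return prod(count / total for count, total in itemset.values())


def semantic_expressivity(antecedent, consequent):
    """(1 - attr_ratio(X)) * (1 - attr_ratio(Y)) / average number of instances."""
    average_instances = (len(antecedent) + len(consequent)) / 2
    return ((1 - attr_ratio(antecedent)) * (1 - attr_ratio(consequent))
            / average_instances)


# X: a single item on one of the three pipe attributes.
x = {"P1": (1, 3)}
# Y: one junction with no attribute items (its attribute total of 2 is assumed).
y = {"J1": (0, 2)}
print(attr_ratio(x))
print(semantic_expressivity(x, y))

# A topology-only rule: 3 instances in the antecedent, 4 in the consequent,
# and no attribute items on either side (attribute totals are assumed).
antecedent = {"WPS": (0, 1), "J1": (0, 2), "P1": (0, 3)}
consequent = {"WCS": (0, 1), "J2": (0, 2), "P2": (0, 3), "P3": (0, 3)}
print(semantic_expressivity(antecedent, consequent))
```
</preformat>
        <p>Because the ratio is a product, a single instance without attribute items drives attr_ratio to 0, so a topology-only rule is penalized solely through the average number of instances in the divisor.</p>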
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Industrial Use Case</title>
        <p>
          The proposed algorithm is demonstrated on the LeakDB dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], an artificially created, realistic
leakage dataset for water distribution networks. It contains metadata of 31 junctions, 1
reservoir and 34 pipes, and 1,716,960 sensor measurements. The water network related classes in the EPYNET
Python package<sup>2</sup> are used as a data schema when creating the KG. MLxtend’s [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] FP-Growth
implementation is used in the implementation of the Naive SemRL algorithm. For simplicity, the
proposed approach is tested using a straightforward discretization method: lowering the precision of the sensor
measurements and calculating daily averages.
        </p>
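        <p>A minimal sketch of such a discretization is given below, using only the Python standard library. The order of the two steps (average per day first, then reduce precision by rounding), the decimals parameter, the function name and the sample readings are assumptions made for illustration; they are not taken from the experiments.</p>
        <preformat>
```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def discretize(readings, decimals=0):
    """Average the readings of each calendar day, then round to `decimals` places."""
    per_day = defaultdict(list)
    for timestamp, value in readings:
        per_day[timestamp.date()].append(value)
    return {day: round(mean(values), decimals)
            for day, values in sorted(per_day.items())}


# Made-up readings of one sensor over two days.
readings = [
    (datetime(2023, 1, 1, 0, 0), 42.8),
    (datetime(2023, 1, 1, 12, 0), 43.4),
    (datetime(2023, 1, 2, 0, 0), 38.1),
]
daily = discretize(readings)
for day, value in daily.items():
    print(day, value)
```
</preformat>
        <p>Coarser rounding maps more raw readings onto the same item, which raises item support at the cost of resolution.</p>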
        <p>A sample rule learned from the described dataset: ‘{(WaterPressureSensor: WPS, placed_in,
Junction: J1), (Junction: J1, connected_to, Pipe: P1), (WaterPressureSensor: WPS, measures, 43)} →
{(WaterConsumptionSensor: WCS, placed_in, Junction: J2), (Junction: J2, connected_to, Pipe: P2),
(Junction: J2, connected_to, Pipe: P3), (WaterConsumptionSensor: WCS, measures, 38)}’.</p>
        <sec id="sec-4-2-1">
          <p><fn><label>2</label><p>https://github.com/Vitens/epynet</p></fn></p>
          <p>Interpretation of the rule: ‘When there is a water pressure sensor WPS placed inside a junction
J1, and J1 is connected to a Pipe P1, and WPS measures 43, then a water consumption sensor WCS
placed in a Junction J2 that is connected to Pipes P2 and P3 must measure 38’. The semantic
expressivity of the rule is 0.28: it does not contain any attributes, so both attribute ratios are 0, and it is based on 3 instances on
the antecedent side and 4 instances on the consequent side, which gives 1/((3 + 4)/2) = 1/3.5, i.e. 0.28 when truncated to two decimals. In this experiment, Naive SemRL
is run with topology information only, without node attributes, as including attributes increases the runtime of the algorithm
exponentially. An intuition/search-based approach that can tell when
and where to include semantics, in order to avoid the exponential increase in runtime, is currently being investigated. Table 1
shows how incorporating semantics in the rule learning process increases the number of rules,
together with the minimum and maximum semantic expressivity values. Besides finding new rules about the
same nodes extended with semantics, Naive SemRL can, when run with low support thresholds,
find new rules that FP-Growth cannot find. Having more rules is not necessarily good, as it
may increase the time required for post-processing and making inferences. To overcome
this hurdle, incorporating semantics within evolutionary or other approaches that can directly
learn rules satisfying certain rule quality criteria is being investigated as part of future work.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This study proposed a semantic association rule learning pipeline for Digital Twins. The pipeline
consists of knowledge graph construction, semantic association rule learning from knowledge
graphs and time series data, and making inferences. An initial naive approach for semantic rule
learning, Naive SemRL, and a first semantic rule quality criterion were proposed. The new approach
was evaluated in a water network use case, and the results show that incorporating knowledge
graphs allows us to learn rules with semantic information, which are more generalizable.</p>
      <p>There are many open issues and research questions yet to be answered. KG construction
for DTs from system metadata and a domain ontology will be automated. FP-Growth will
be replaced by novel rule learning methods with different perspectives, e.g. statistical vs.
optimization-based numerical association rule mining (NARM) methods. These methods will be compared based on the applicability
of the proposed approach and quality criterion. Lastly, an inference mechanism that can detect
discrepancies and find their root causes from semantic rules will be developed.</p>
      <p>This work has received support from The Dutch Research Council (NWO), in the scope of the
Digital Twin for Evolutionary Changes in water networks (DiTEC) project, file number 19454.
We would like to thank Vitens N.V. for providing us with a historical water network sensor dataset
to test our approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sunhare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Chowdhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Chattopadhyay</surname>
          </string-name>
          ,
          <article-title>Internet of things and data mining: An application oriented survey</article-title>
          ,
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>3569</fpage>
          -
          <lpage>3590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Degeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazovik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          ,
          <article-title>Itemset-based mining of constraints for enacting smart environments</article-title>
          ,
          <source>in: 2014 IEEE International Conference on Pervasive Computing and Communication Workshops (Percom Workshops)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/PerComW.2014.6815162</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fister</surname>
            <suffix>Jr.</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Draheim</surname>
          </string-name>
          ,
          <article-title>Numerical association rule mining: A systematic literature review</article-title>
          ,
          <source>arXiv preprint arXiv:2307.00662</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>D'Amico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Erkoyuncu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Addepalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Penver</surname>
          </string-name>
          ,
          <article-title>Cognitive digital twin: An approach to improve the maintenance management</article-title>
          ,
          <source>CIRP Journal of Manufacturing Science and Technology</source>
          <volume>38</volume>
          (
          <year>2022</year>
          )
          <fpage>613</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Karabulut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Pileggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Degeler</surname>
          </string-name>
          ,
          <article-title>Ontologies in digital twins: A systematic literature review</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">2308.15168</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Mining frequent patterns without candidate generation</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          <volume>29</volume>
          (
          <year>2000</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Akroyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mosbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kraft</surname>
          </string-name>
          ,
          <article-title>Universal digital twin - a dynamic knowledge graph</article-title>
          ,
          <source>Data-Centric Engineering</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <elocation-id>e14</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tamašauskaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Defining a knowledge graph development process through a systematic review</article-title>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          <volume>32</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Vrachimis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kyriakou</surname>
          </string-name>
          , et al.,
          <article-title>Leakdb: a benchmark dataset for leakage diagnosis in water distribution networks:(146)</article-title>
          , in: WDSA/CCWI Joint Conference Proceedings, volume
          <volume>1</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raschka</surname>
          </string-name>
          ,
          <article-title>MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack</article-title>
          ,
          <source>The Journal of Open Source Software</source>
          <volume>3</volume>
          (
          <year>2018</year>
          ). URL: https://joss.theoj.org/papers/10.21105/joss.00638. doi:
          <pub-id pub-id-type="doi">10.21105/joss.00638</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>