<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Active Simulation Data Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mirko Bunse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amal Saadallah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katharina Morik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund, AI Group</institution>
          ,
          <addr-line>44221 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>104</fpage>
      <lpage>107</lpage>
      <abstract>
        <p>Simulations have recently been considered as data generators for machine learning. However, the high computational cost associated with them requires a smart sampling of what to simulate. We distinguish between two scenarios of simulation data mining, which can be optimized with active learning and active class selection, respectively.</p>
        <p>This work has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Data Analysis”, projects C3 and B3. http://sfb876.tu-dortmund.de</p>
      </abstract>
      <kwd-group>
        <kwd>Simulation</kwd>
        <kwd>Active learning</kwd>
        <kwd>Active class selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Simulations are powerful tools for investigating the behavior of complex systems
in science and engineering. Recently, the use of simulated data in machine
learning has attracted increasing attention, an integration that is
sometimes termed simulation data mining [
        <xref ref-type="bibr" rid="ref11 ref12 ref2 ref4">11,2,4,12</xref>
        ]. Its applications range from
integrated circuit design [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], milling processes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], mechanized tunneling [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
robotized surgery [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and cancer treatment [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to astro-particle physics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The goal of simulation data mining is to reason about a real system under
study by learning from data which is generated by a simulation of that system.
The benefit of this paradigm is that less or even no data is required from the
actual system. Acquiring “real” data is often costly or even infeasible,
e.g. if the actual system is still in the design phase and not yet deployed.
In contrast, simulations have the potential to provide large volumes of data, only at the
expense of their computation. However, the need for accurate simulations often
leads to complex simulation models (e.g. 3D numerical Finite-Element
simulations), which result in high costs associated with data generation. The time and
computational resources required by simulations motivate the active sampling of
data, more precisely active learning (AL) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and active class selection (ACS)
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Both of these frameworks seek to select the minimal amount of training data
while maximizing the performance of a prediction model trained with that data.
In this short paper, we argue that there are two different strategies for the
simulation of training data, which correspond distinctly to AL and ACS. Specifically, a
simulation may either generate labels from a set of input features [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref2 ref8 ref9">11,12,13,2,9,8</xref>
        ]
or it may generate feature vectors from input labels [
        <xref ref-type="bibr" rid="ref1 ref7">7,1</xref>
        ]. The need for cost
efficiency thus makes simulation data mining an immediate application scenario for
methods from AL and from ACS.
© 2019 for this paper by its authors. Use permitted under CC BY 4.0.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Active Sampling from Simulated Data</title>
      <p>Every simulation is based on some kind of generative model. Such a simulation
model may comprise analytical, geometric, agent-based, and probabilistic
modeling approaches which represent the dynamics of the studied system. Namely,
such a model represents how the state s ∈ S of the system evolves over time:
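<disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{Sim}_{\rho}(s_t, \Delta t) = s_{t + \Delta t}, \qquad 0 \le t \le T,</tex-math></disp-formula>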
where ρ ∈ P is a vector of simulation parameters, which can be directly related
to the parameters of the real system or process. In this view, the simulation
is a fixed black box which encodes domain knowledge down to minor details. In
the following, we distinguish between two scenarios in which machine learning
models are trained on simulated data.
      </p>
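      <p>The state evolution in Eq. (1) can be illustrated with a minimal sketch. The scalar state, the toy decay dynamics, and the names rollout and make_decay_sim are assumptions made for illustration only; they do not stem from any of the cited simulations.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, List

State = float  # assumption: a scalar state keeps the sketch small


def rollout(step: Callable[[State, float], State],
            s0: State, dt: float, T: float) -> List[State]:
    """Apply s_{t+dt} = step(s_t, dt) repeatedly, from t = 0 up to t = T."""
    n_steps = round(T / dt)
    states = [s0]
    for _ in range(n_steps):
        states.append(step(states[-1], dt))
    return states


def make_decay_sim(rho: float) -> Callable[[State, float], State]:
    """Hypothetical toy simulation Sim_rho: explicit Euler step of ds/dt = -rho * s."""
    def step(s: State, dt: float) -> State:
        return s + dt * (-rho * s)
    return step


# The simulation parameter rho is fixed before the run, as in Eq. (1).
states = rollout(make_decay_sim(rho=0.5), s0=1.0, dt=0.5, T=2.0)
```
]]></preformat>
      <p>In this sketch, the parameter ρ is bound once when the step function is created, which mirrors the view of the simulation as a fixed black box.</p>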
      <sec id="sec-2-1">
        <title>Forward Learning Scenario</title>
        <p>
          In the first learning scenario, the simulation model has the same direction of
inference as the machine learning model f : X → Y that is to be trained. This
means that the initial state s<sub>0</sub> ∈ S of the simulation is a function of the feature
vector x ∈ X. The simulation then comprises multiple steps s<sub>1</sub> ∈ S, s<sub>2</sub> ∈ S, …
until a label y ∈ Y is obtained in the final state s<sub>T</sub> ∈ S. Thus, the simulation
and the machine learning model both infer y from x, as illustrated in Fig. 1.
This learning scenario is probably the most common to date, being approached
for example in [
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref2 ref8 ref9">11,12,13,2,9,8</xref>
          ].
        </p>
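        <p>[Fig. 1: Forward learning scenario. The feature vector x defines the initial state s<sub>0</sub>; the simulation Sim<sub>ρ</sub> advances the state through s<sub>1</sub>, … up to s<sub>T</sub>, from which the label y is obtained. The model f maps x directly to y.]</p>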
        <p>
          Since the mappings from x to s<sub>0</sub> and from s<sub>T</sub> to y are given by the problem
statement, we could use the simulation to predict y directly—without learning
another model f from simulated data. However, simulations often encompass
even those details of the analyzed system that are only minor for the prediction
task at hand. The computational resources required to compute data from such
a precise model limit the resource efficiency of the simulation with respect to
the prediction task. It is therefore often not feasible to run a simulation for
prediction, particularly for resource-aware or real-time applications. Machine
learning can then be used to build surrogate models which solve the prediction
task efficiently [
          <xref ref-type="bibr" rid="ref8 ref9">9,8</xref>
          ]. The simulation can take the role of an oracle o<sub>AL</sub> : X → Y,
so that an AL technique can optimize the data being simulated.
        </p>
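        <p>The role of the simulation as an AL oracle can be sketched as follows. This is a toy illustration under stated assumptions: simulate_label is a hypothetical stand-in for an expensive forward simulation, the surrogate is a one-dimensional threshold model, and the scoring rule is a simple uncertainty heuristic rather than a criterion taken from the cited work.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, List


def simulate_label(x: float) -> int:
    """Hypothetical stand-in for an expensive forward simulation o_AL: X -> Y."""
    return int(x > 0.62)  # toy rule: the label flips at a threshold unknown to the learner


def boundary_estimate(labeled: Dict[float, int]) -> float:
    """Toy surrogate f: decision boundary halfway between the two classes."""
    lo = max(x for x, y in labeled.items() if y == 0)
    hi = min(x for x, y in labeled.items() if y == 1)
    return (lo + hi) / 2


def active_learning(pool: List[float], oracle: Callable[[float], int],
                    budget: int) -> Dict[float, int]:
    # Initial data set: the two extreme inputs, assumed to belong to different classes.
    labeled = {x: oracle(x) for x in (pool[0], pool[-1])}
    candidates = [x for x in pool if x not in labeled]
    for _ in range(budget):  # stopping criterion: a fixed simulation budget
        if not candidates:
            break
        boundary = boundary_estimate(labeled)
        # s_AL(x): inputs close to the current boundary are the most uncertain ones.
        best = max(candidates, key=lambda x: -abs(x - boundary))
        labeled[best] = oracle(best)  # run the simulation only for the selected input
        candidates.remove(best)
    return labeled


pool = [i / 20 for i in range(21)]
labeled = active_learning(pool, simulate_label, budget=5)
estimate = boundary_estimate(labeled)
```
]]></preformat>
        <p>In this toy setting, seven simulation runs (two initial, five selected) are spent on a pool of 21 candidate inputs; the remaining candidates are never simulated.</p>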
      </sec>
      <sec id="sec-2-2">
        <title>Backward Learning Scenario</title>
        <p>
          In the second scenario, the goal is to learn a prediction model for the “opposite
direction” of the simulation. In other words, the prediction task is to find the
causes of observed effects. This task is modeled by the label y defining the
input of the simulation and a corresponding feature vector x being produced,
as outlined in Fig. 2. Since the machine learning model now solves a different
task than the simulation, it is able to achieve analysis goals which cannot be
achieved with the simulation alone. This second scenario is applied, for example,
in robotized surgery, where the force which caused a deformation is predicted [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
or in astro-particle physics, where particle properties are predicted from indirect
observations [
          <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
          ].
        </p>
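        <p>[Fig. 2: Backward learning scenario. The label y defines the initial state s<sub>0</sub>; the simulation Sim<sub>ρ</sub> advances the state through s<sub>1</sub>, … up to s<sub>T</sub>, from which the feature vector x is produced. The model f maps x back to y.]</p>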
        <p>Unlike in the forward scenario, a “backward” simulation cannot predict
y from x. It therefore cannot be used as an AL oracle. However, we can use the
simulation as the data generator o<sub>ACS</sub> : Y → X that is assumed by ACS. One
reason for distinguishing the two scenarios is thus the applicability of active
sampling techniques: AL is only applicable in the forward scenario, ACS only in
the backward case.</p>
        <p>The goal of AL and ACS is to reduce the cost of training data generation.
Starting from an initial data set, the simulation candidates are scored according
to a selection criterion s, and the best candidates are simulated until a
stopping criterion is met after some iterations. In this framework, AL scores
feature vectors and ACS, in contrast, scores labels.</p>
        <p><disp-formula><tex-math>s_{\mathrm{AL}} : X \to \mathbb{R}, \qquad s_{\mathrm{ACS}} : Y \to \mathbb{R}</tex-math></disp-formula></p>
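        <p>For the backward case, the selection loop with label scoring can be sketched as follows. All specifics are assumptions made for illustration: the three toy classes, the Gaussian generator simulate_features standing in for a backward simulation, the nearest-centroid model, and the error-based score, which is evaluated on the training data for brevity. None of these is prescribed by the ACS literature cited here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from statistics import mean
from typing import Dict, List

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
CLASS_MEANS = {"a": 0.0, "b": 1.0, "c": 4.0}  # hypothetical toy classes


def simulate_features(y: str) -> float:
    """Hypothetical backward simulation o_ACS: Y -> X (toy: noisy class mean)."""
    return rng.gauss(CLASS_MEANS[y], 1.0)


def predict(x: float, data: Dict[str, List[float]]) -> str:
    """Nearest-centroid model f: X -> Y, fitted on the simulated data."""
    return min(data, key=lambda y: abs(x - mean(data[y])))


def class_scores(data: Dict[str, List[float]]) -> Dict[str, float]:
    """s_ACS(y): error rate of f on the examples of class y (training error, for brevity)."""
    return {y: sum(predict(x, data) != y for x in xs) / len(xs)
            for y, xs in data.items()}


def active_class_selection(rounds: int, batch: int,
                           initial: int = 3) -> Dict[str, List[float]]:
    # Initial data set: a few simulated examples per class.
    data = {y: [simulate_features(y) for _ in range(initial)] for y in CLASS_MEANS}
    for _ in range(rounds):  # stopping criterion: a fixed number of rounds
        scores = class_scores(data)
        target = max(scores, key=scores.get)  # the class that is currently hardest
        data[target].extend(simulate_features(target) for _ in range(batch))
    return data


data = active_class_selection(rounds=5, batch=4)
```
]]></preformat>
        <p>Note that the loop never queries a feature vector: it only decides for which label the simulation is run next, which is what distinguishes ACS from AL.</p>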
        <p>Having a simulation, we can generalize this concept to a scoring of all
simulation inputs, also comprising the auxiliary simulation parameters ρ ∈ P . Namely,
AL can score each pair (x, ρ) and ACS can score each pair (y, ρ) to have a higher chance
of identifying the relevant input sub-spaces and to improve efficiency further.
</p>
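        <p>Scoring pairs of input and simulation parameter, rather than inputs alone, can be sketched as follows. The distance-based novelty score is a hypothetical placeholder for an actual AL criterion, and the function names are chosen for illustration; the same structure would apply to pairs (y, ρ) in the backward scenario.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import product
from math import dist
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # one simulation input (x, rho)


def novelty(candidate: Pair, simulated: Sequence[Pair]) -> float:
    """Hypothetical score s(x, rho): distance to the closest already simulated input."""
    return min(dist(candidate, z) for z in simulated)


def select_inputs(xs: List[float], rhos: List[float],
                  simulated: List[Pair], k: int) -> List[Pair]:
    """Greedily pick k pairs (x, rho) from the joint candidate grid."""
    candidates = [c for c in product(xs, rhos) if c not in simulated]
    chosen: List[Pair] = []
    for _ in range(k):
        if not candidates:
            break
        # Score against everything simulated so far plus the pairs chosen in this call.
        best = max(candidates, key=lambda c: novelty(c, simulated + chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen


chosen = select_inputs(xs=[0.0, 0.5, 1.0], rhos=[0.0, 1.0],
                       simulated=[(0.0, 0.0)], k=2)
```
]]></preformat>
        <p>Because the candidates are the Cartesian product of both grids, regions of the parameter space that a score over x alone would never distinguish become selectable.</p>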
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We distinguish between two scenarios in which machine learning models are
trained from simulated data. Our distinction corresponds to the applicability
of AL and ACS, a property not previously detailed in simulation data science.
Moreover, we conjecture that active sampling techniques can be improved by
accounting for the parameters of the simulation.</p>
      <p>In upcoming work, we will further elaborate the paradigm of learning from
simulations. In this regard, we deem data quality a particular issue because
simulated data does not always represent the real system exactly. This problem
may be tackled with transfer learning or domain adaptation techniques, which
make the differences between multiple data sources—the simulation and the real
system—explicit. Therefore, we consider simulation data science a promising use
case also for combinations of active sampling and transfer learning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bockermann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brügge</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buss</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Egorov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rhode</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruhe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Online analysis of high-volume data streams in astroparticle physics</article-title>
          .
          <source>In: Proc. of the ECML-PKDD, Part III. LNCS</source>
          , vol.
          <volume>9286</volume>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>115</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brady</surname>
            ,
            <given-names>T.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yellig</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Simulation data mining: a new form of computer simulation output</article-title>
          .
          <source>In: Proc. of the 37th Winter Simulation Conf</source>
          . pp.
          <fpage>285</fpage>
          -
          <lpage>289</lpage>
          . IEEE (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bunse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piatkowski</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruhe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rhode</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Unification of deconvolution algorithms for Cherenkov astronomy</article-title>
          .
          <source>In: Proc. of the 5th Int. Conf. on Data Science and Advanced Analytics (DSAA)</source>
          . pp.
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frochte</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiesner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Simulation data mining for supporting bridge design</article-title>
          .
          <source>In: Proc. of the 9th Australasian Data Mining Conf. (AusDM)</source>
          .
          <source>CRPIT</source>
          , vol.
          <volume>121</volume>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          . Australian Computer Society (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Deist</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krane</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorenson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craft</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Simulation assisted machine learning</article-title>
          (
          <year>2018</year>
          ), http://arxiv.org/abs/1802.05688, under review
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lomasky</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aernecke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedl</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Active class selection</article-title>
          .
          <source>In: Proc. of the ECML. LNCS</source>
          , vol.
          <volume>4701</volume>
          , pp.
          <fpage>640</fpage>
          -
          <lpage>647</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mendizabal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fountoukidou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sznitman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cotin</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A combined simulation and machine learning approach for image-based force classification during robotized intravitreal injections</article-title>
          .
          <source>In: Proc. of the 21st Int. Conf. on Medical Image Computing and Computer Assisted Intervention (MICCAI)</source>
          .
          <source>LNCS</source>
          , vol.
          <volume>11073</volume>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>20</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Saadallah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexey</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>B.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitag</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meschke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Active learning for accurate settlement prediction using numerical simulations in mechanized tunneling</article-title>
          .
          <source>Procedia CIRP</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Saadallah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkeldey</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiederkehr</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Stability prediction in milling processes using a simulation-based machine learning approach</article-title>
          .
          <source>In: 51st CIRP Conf. on Manufacturing Systems. Elsevier</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Active Learning</article-title>
          .
          <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          , Morgan &amp; Claypool Publishers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Shao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A machine learning based global simulation data mining approach for efficient design changes</article-title>
          .
          <source>Advances in Engineering Software</source>
          <volume>124</volume>
          ,
          <fpage>22</fpage>
          -
          <lpage>41</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Trittenbach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gauch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Böhm</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulz</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Towards simulation-data science - a case study on material failures</article-title>
          .
          <source>In: Proc. of the 5th Int. Conf. on Data Science and Advanced Analytics (DSAA)</source>
          . pp.
          <fpage>450</fpage>
          -
          <lpage>459</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marek-Sadowska</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Machine learning in simulation-based analysis</article-title>
          .
          <source>In: Proc. of the Int. Symp. on Physical Design (ISPD)</source>
          . pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>