<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Poster &amp; Tools Track, Birmingham, UK</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On the Role of Data Engineering Decisions in AI-Based Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aurek Chattopadhyay</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Van Doren</string-name>
          <email>matthew.vandoren@cincinnati-oh.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reese Johnson</string-name>
          <email>reese.johnson@cincinnati-oh.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nan Niu</string-name>
          <email>nan.niu@uc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Horkof, A. Perini, A. Susi, M. Daneva, A. Herrmann, K. Schneider</institution>
          ,
          <addr-line>P. Mennig, F. Dalpiaz, D. Dell'Anna, S. Kopczyńska, L</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In: J. Fischbach, N. Condori-Fernández</institution>
          ,
          <addr-line>J. Doerr, M. Ruiz, J.-P. Steghöfer, L. Pasquale, A. Zisman, R. Guizzardi, J</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Metropolitan Sewer District of Greater Cincinnati</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Cincinnati</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>Artificial Intelligence (AI) and Deep Learning (DL) solutions are becoming increasingly popular in our daily life. However, one of the main concerns that stakeholders face is the lack of accountability for the inevitable errors made by AI. In this paper, we build on the emerging research of developing DL solutions to predict combined sewer overflows, and present our vision of linking data engineering decisions to key requirements such as predictive accuracy and explainability.</p>
      </abstract>
      <kwd-group>
        <kwd>data and requirements engineering</kwd>
        <kwd>deep learning</kwd>
        <kwd>explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Artificial Intelligence (AI) is increasingly incorporated into the decisions that
we make in our daily lives. Recent advancements have made AI an appealing tool for improving
the efficiency of manual work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The spread of AI also reaches into the public
sector. Although AI mistakes are inevitable, the lack of explainability raises significant concerns
among citizens and public organizations about the accountability, fairness, responsibility, and
transparency of AI-based decision making.
      </p>
      <p>
        In our recent work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we reported a case study of using Deep Learning (DL) solutions to
predict combined sewer overflow events for a municipal wastewater treatment organization. In
particular, we built a Long Short-Term Memory (LSTM) model that takes different time-series
data as input, such as flow and velocity collected from the sensor networks, as well as the rainfall
measures obtained from a weather service.
      </p>
      <p>A crucial issue that emerged from our case study is the extent to which data engineering
decisions influence a DL solution’s performance. Specifically, the data for our LSTM model are
recorded at different intervals, e.g., the sensor network data are collected every five minutes
whereas the rainfall measures are stored every minute. For a DL solution like LSTM to
work, the heterogeneous data must be synchronized. However, different data engineering
decisions can achieve synchronization. One can take the mean of a five-minute
span of the rainfall measures to align with the sensor network data, yet others may regard the
median, mode, minimum, or maximum rainfall value as a plausible choice.
Clearly, different decisions can be made, but whether and how they affect the LSTM’s
behavior is less understood.</p>
      <p>In this paper, we concentrate on the influence of data engineering decisions on a DL solution’s
performance by focusing on the LSTM’s predictive accuracy and explainability. We anticipate that this
detailed investigation will pave the way for treating data as an integral component of the
requirements engineering of AI-based applications. In what follows, we introduce the background of
combined sewer overflows and our ongoing research on developing DL solutions in Section 2.
We then present our study design and discuss the study results in Section 3. Finally, we draw
some concluding remarks in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>Combined sewer overflows are significant problems affecting human and environmental health.
For instance, nearly 860 cities and towns across the U.S. have combined sewer systems, which
manage stormwater as well as wastewater, creating what the U.S. Environmental Protection
Agency (EPA) considers to be the largest unaddressed risk to human health from the water
infrastructure.</p>
      <p>
        We recently began to collaborate with a regional wastewater treatment organization, the
Metropolitan Sewer District of Greater Cincinnati (MSDGC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. MSDGC services an operating area of
about 300 square miles, over 850,000 customers, and over 3,000 miles of combined sewers. Our
stakeholder organization has set up a large-scale sensor network to collect data and remotely
operate their system. Their current practice is to consult the weather forecast and then alert citizens
if a combined sewer overflow event may occur within the next day. Since they are a public
service, they need to be able to justify the reasoning behind their decisions, especially when those
decisions affect the safety of customers. This need for transparency is why their current system
of mostly relying on weather forecasts is preferred.
      </p>
      <p>
        Our collaboration with MSDGC experimented with a DL solution for predicting
combined sewer overflows based on the LSTM implementation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The data used for our LSTM algorithm
were taken from various sensors at an MSDGC outflow site, a manhole approximately 450 ft
upstream of the outflow site, and a rainfall sensor for the area. The site is considered to be
“overflowing” whenever the level of water exceeds the site’s capacity. Notably, the data were
collected at different rates. The slowest sampling rate is one sample every five minutes
while the fastest is one sample every minute. As illustrations with fictitious data, Table 1 shows a sample of
the rainfall data, and Table 2 shows samples from the sensors in a manhole located a few minutes
upstream in the pipe.
      </p>
      <p>
        Related work on DL-based smart sewer systems includes the use of a recurrent neural network
to forecast the stormwater runoff from the precipitation and the previous runoff discharge
at a combined sewer overflow site near the District of Columbia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The experiment on the
34,721-timestep data showed that runoff prediction accuracy was high when the hidden layer
reached the maximum size that the hardware constraints allowed. Our ongoing work compared three
variants of the recurrent neural network architecture, namely LSTM, GRU, and IndRNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
and further showed that LSTM exhibited high robustness and stationarity [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this paper,
our interest focuses specifically on the decision of how to synchronize data sampled at different rates
(cf. Tables 1 and 2), and on that decision’s impact on the key requirements of predictive accuracy
and explainability.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Study Design and Results</title>
      <p>Using the data shown in Table 1 as an example, aggregating the rainfall measures from a
five-minute span into a single value represents an important yet subtle data engineering decision.
Suppose that a synchronization decision must be made at 14:35 (Oct 12, 2018). A baseline choice
may be to take the instantaneous value of 0.0014. However, there exist several competing
decisions: from 14:31 to 14:35, the mode would take 0.0006, the median would use 0.0009, the mean would
equal 0.00096, the minimum would point to 0.0006, and the maximum would match 0.0014.</p>
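      <p>The candidate values above can be checked with a few lines of Python. This is only a minimal sketch: the five one-minute readings below are assumed values chosen to be consistent with the statistics quoted in the text. Only the 14:35 reading (0.0014) is fixed by the text; the fourth value and the order of the first four readings are assumptions.</p>
      <preformat>
```python
from statistics import mean, median, mode

# Assumed one-minute rainfall readings for 14:31 to 14:35 (hypothetical data).
# The last entry is the instantaneous 14:35 reading named in the text.
window = [0.0006, 0.0013, 0.0006, 0.0009, 0.0014]

# One candidate value per data engineering decision.
candidates = {
    "baseline": window[-1],   # instantaneous value at 14:35
    "mode": mode(window),
    "median": median(window),
    "mean": mean(window),
    "minimum": min(window),
    "maximum": max(window),
}

for name, value in candidates.items():
    print(f"{name:>8}: {value:.5f}")
```
      </preformat>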
      <p>
        Our study thus tests the impacts of these different data engineering decisions on a DL solution’s
performance. Specifically, we reuse the LSTM implementation of Maltbie et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], whose
related artifacts (including source code, hyperparameter search scripts, etc.) are available at
https://doi.org/10.5281/zenodo.4818970. This LSTM model is trained with 12 hours of
continuous data and predicts whether a combined sewer overflow event will occur in the next hour
(i.e., between the 12th and 13th hour). In addition, we use the LIME tool [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to generate
explanations of why the LSTM makes certain predictions, in the same way as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. LIME creates
a local approximation of the deep learning model’s output space by sampling various inputs
from our dataset, and then uses this approximation of the output space to determine which
features in the input space are the most significant in determining the model’s prediction.
      </p>
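      <p>The idea of a local approximation can be illustrated with a much simplified sketch. The code below is not the LIME library and not the configuration used in our study: it perturbs one input with Gaussian noise (an assumption; the sampling scheme, the kernel width, and the names local_surrogate, toy_model, and the three feature labels are all hypothetical), weights each perturbed sample by its proximity to the original input, and fits a weighted linear model whose coefficients rank the features.</p>
      <preformat>
```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def local_surrogate(predict, instance, n_samples=500, scale=0.1, width=0.25, seed=0):
    """Fit a proximity-weighted linear model to `predict` around `instance`.

    Returns one coefficient per feature (the intercept is dropped).
    """
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in instance]   # perturbed input
        dist2 = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        rows.append([1.0] + z)                               # 1.0 is the intercept column
        targets.append(predict(z))
        weights.append(math.exp(-dist2 / width ** 2))        # closer samples count more
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(d + 1)] for i in range(d + 1)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, targets, weights))
            for i in range(d + 1)]
    return solve_linear_system(xtwx, xtwy)[1:]


def toy_model(x):
    """Stand-in black box (hypothetical); the third feature is ignored."""
    return 0.5 + 3.0 * x[0] - 1.0 * x[1]


feature_names = ["rainfall", "flow", "velocity"]   # hypothetical labels
instance = [0.5, 0.2, 0.8]
coefs = local_surrogate(toy_model, instance)
# Rank features by the magnitude of their local coefficient.
ranking = sorted(zip(feature_names, coefs), key=lambda pair: -abs(pair[1]))
```
      </preformat>
      <p>Because the stand-in model is exactly linear, the surrogate recovers its coefficients; for a real LSTM the coefficients would only describe the model’s behavior near the chosen input.</p>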
      <p>In our study, six different datasets are prepared, each corresponding to a data engineering
decision: baseline, mean, median, mode, maximum, and minimum. To generate a data point for
a particular timestamp (at a five-minute interval), the five one-minute readings from the originally collected
rainfall data that fall between two consecutive five-minute timestamps are aggregated according to the
corresponding statistical measure. To assess predictive accuracy, we compute the confidence score of the LSTM’s
elevated-flow prediction for a true positive case, in which the combined sewer overflow
event actually occurs. Table 3 lists the results, from which we can conclude that, everything
else being equal, the different data engineering decisions do lead to different DL performances.
Furthermore, using the mean and the median gives rise to the best (0.64) and worst (0.50)
confidence scores, respectively.</p>
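      <p>The preparation of the six datasets can be sketched as follows. This is an illustration under stated assumptions rather than the code in the released artifacts: the names DECISIONS and synchronize are hypothetical, the rainfall readings are made up, and the window for a timestamp is assumed to cover the five one-minute readings ending at that timestamp, as in the 14:31 to 14:35 example.</p>
      <preformat>
```python
from datetime import datetime, timedelta
from statistics import mean, median, mode

# One aggregation function per data engineering decision.
DECISIONS = {
    "baseline": lambda window: window[-1],   # instantaneous reading at the timestamp
    "mean": mean,
    "median": median,
    "mode": mode,
    "maximum": max,
    "minimum": min,
}


def synchronize(rainfall, sensor_times, decision, span_minutes=5):
    """Return one rainfall value per sensor timestamp.

    rainfall: dict mapping a datetime to a one-minute rainfall reading.
    sensor_times: timestamps of the five-minute sensor samples.
    The window for timestamp t holds the span_minutes readings ending at t.
    """
    aggregate = DECISIONS[decision]
    series = []
    for t in sensor_times:
        window = [rainfall[t - timedelta(minutes=k)]
                  for k in range(span_minutes - 1, -1, -1)]
        series.append(aggregate(window))
    return series


# Ten minutes of hypothetical one-minute rainfall readings, 14:31 to 14:40.
start = datetime(2018, 10, 12, 14, 31)
readings = [0.0006, 0.0013, 0.0006, 0.0009, 0.0014,
            0.0010, 0.0010, 0.0012, 0.0008, 0.0011]
rainfall = {start + timedelta(minutes=i): value for i, value in enumerate(readings)}

# Timestamps of the five-minute sensor samples to align with.
sensor_times = [datetime(2018, 10, 12, 14, 35), datetime(2018, 10, 12, 14, 40)]

# Six rainfall series, one per decision, each aligned with the sensor data.
datasets = {name: synchronize(rainfall, sensor_times, name) for name in DECISIONS}
```
      </preformat>
      <p>Each entry of datasets would then be combined with the same sensor readings, so that the six LSTM inputs differ only in the rainfall column.</p>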
      <p>Although the confidence scores reported in Table 3 appear quantitatively similar, we also run LIME
to generate qualitative insights into why the LSTM has made its prediction on our chosen case.
Figures 1 and 2 show the LIME explanation results for the true positive case under mean and
median, respectively. It can be seen from these figures that the factors contributing to the LSTM
predictions are noticeably different. When the mean is used, for instance, a top-ranked factor
turns out to contribute to a negative prediction (i.e., contributing to normal flow). The LIME
results of Figure 2, on the other hand, show that the top factors indeed contribute to a positive,
elevated flow prediction; however, the confidence score of such a decision is no stronger than a coin
flip (0.50 shown in Table 3).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks</title>
      <p>Our study results indicate that data engineering decisions impact DL’s results quantitatively
and qualitatively. One may argue that which decision to take is a matter of domain expertise.
Yet our experience shows that such expertise is subjective, e.g., maximum may be a logical
choice for a wastewater treatment engineer who is willing to tolerate the false-positive overflow
predictions, whereas mean or median might be reasonable choices for a statistical analyst.</p>
      <p>A main contribution of our paper is to recognize data engineering decisions as
first-class citizens in the software engineering process for AI-based systems, as shown in Figure 3.</p>
      <p>
        Not only do the data engineering decisions influence a DL solution’s performance, but they
also shape the requirements engineering activities in important ways, much like aspects cut
across the conventional object-oriented modularity [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. For instance, a wastewater treatment
engineer’s willingness to tolerate false positives would be better understood and
more quantifiable if the results of using maximum could be compared with those of other decisions like
minimum.
      </p>
      <p>
        In a similar fashion, the DL performance differences resulting from using the mean and the
median may provoke a more rigorous discussion among the statistical analyst and related
stakeholders to better operationalize data synchronization [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], or even create new ways to
handle asynchronous data (e.g., increasing the sampling frequency from the sensor networks
dynamically informed by the rainfall measures and/or LSTM’s predictions). Due to the dynamic
nature of data engineering decisions, aligning requirements closely with testing may be
valuable [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Finally, it is interesting to realize that the data engineering decisions themselves
may help define the emerging requirements of flexibility and customizability that allow the end
users to bind such decisions on-the-fly; however, balancing the tradeoffs between being highly
customizable and safely operable is of significant importance to AI-based systems.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank Nicholas Maltbie for his seminal and replicable work on which this paper builds.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <article-title>Requirements engineering in the days of artificial intelligence</article-title>
          ,
          <source>IEEE Software</source>
          <volume>37</volume>
          (
          <year>2020</year>
          )
          <fpage>7</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Maltbie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Doren</surname>
          </string-name>
          , R. Johnson,
          <article-title>XAI tools in the public sector: A case study on predicting combined sewer overflows</article-title>
          , in: ESEC/FSE,
          <year>2021</year>
          , pp.
          <fpage>1032</fpage>
          -
          <lpage>1044</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gudaparthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , H. Challa,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <article-title>Deep learning for smart sewer systems: Assessing nonfunctional requirements</article-title>
          ,
          <source>in: ICSE-SEIS</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Urban stormwater runoff prediction using recurrent neural networks</article-title>
          ,
          <source>in: ISNN</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>619</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Challa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          , R. Johnson,
          <article-title>Faulty requirements made valuable: On the role of data quality in deep learning</article-title>
          ,
          <source>in: AIRE</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>“Why Should I Trust You?” Explaining the predictions of any classifier</article-title>
          ,
          <source>in: KDD</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Easterbrook</surname>
          </string-name>
          ,
          <article-title>Analysis of early aspects in requirements goal models: A concept-driven approach</article-title>
          ,
          <source>Transactions on Aspect-Oriented Software Development</source>
          <volume>3</volume>
          (
          <year>2007</year>
          )
          <fpage>40</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          , B. G.-B. Yijun Yu,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mylopoulos</surname>
          </string-name>
          ,
          <article-title>Aspects across software life cycle: A goal-driven approach</article-title>
          ,
          <source>Transactions on Aspect-Oriented Software Development</source>
          <volume>6</volume>
          (
          <year>2009</year>
          )
          <fpage>83</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Easterbrook</surname>
          </string-name>
          ,
          <article-title>So, you think you know others' goals? A repertory grid study</article-title>
          ,
          <source>IEEE Software</source>
          <volume>24</volume>
          (
          <year>2007</year>
          )
          <fpage>53</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Lopez</surname>
          </string-name>
          , J.-R. C. Cheng,
          <article-title>Using soft systems methodology to improve requirements practices: An exploratory case study</article-title>
          ,
          <source>IET Software</source>
          <volume>5</volume>
          (
          <year>2011</year>
          )
          <fpage>487</fpage>
          -
          <lpage>495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bhowmik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Chekuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <article-title>The role of environment assertions in requirements-based testing</article-title>
          , in: RE,
          <year>2019</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rathod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bhowmik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Environment-driven abstraction identification for requirements-based testing</article-title>
          , in: RE,
          <year>2021</year>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>