<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Capabilities for Better ML Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chenyang Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachel Brower-Sinning</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grace A. Lewis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Kästner</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tongshuang Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon Software Engineering Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, Carnegie Mellon University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In spite of machine learning's rapid growth, its engineering support is scattered in many forms, and tends to favor certain engineering stages, stakeholders, and evaluation preferences. We envision a capability-based framework, which uses fine-grained specifications for ML model behaviors to unite existing efforts towards better ML engineering. We use concrete scenarios (model design, debugging, and maintenance) to articulate capabilities' broad applications across various dimensions, and their impact on building safer, more generalizable, and more trustworthy models that reflect human needs. Through preliminary experiments, we show the potential of capabilities for reflecting model generalizability, which can provide guidance for the ML engineering process. We discuss challenges and opportunities for the integration of capabilities into ML engineering.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning engineering</kwd>
        <kwd>capability</kwd>
        <kwd>specification</kwd>
        <kwd>testing</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Despite the rapid evolution of machine learning models, most effort has been on prototyping models: developing models under idealized settings (e.g., with static datasets, following the i.i.d. assumption, assuming equal importance of all mistakes). These models tend to suffer in the wild where the ideal assumptions do not hold, leading to safety issues, fairness issues, and project failures [<xref ref-type="bibr" rid="ref2">1</xref>]. For example, a pedestrian detection model trained on images taken on sunny days would not correctly respond to natural weather changes [<xref ref-type="bibr" rid="ref3">2</xref>] and may have never seen a wheelchair user in training or test data. Oversimplification has real consequences. If we had only tested the aforementioned pedestrian detector on similar, sunny test examples, and used our overly optimistic evaluation to support deployment decisions, then an automated vehicle with the detector would be likely to cause accidents.</p>
      <p>To actually integrate models into production, substantial additional engineering effort is required by interdisciplinary teams [<xref ref-type="bibr" rid="ref4">3</xref>]: Not only do we need to make careful decisions at the model level (e.g., develop evaluation metrics that reflect human expectations on models [<xref ref-type="bibr" rid="ref5">4</xref>]), but we also need to connect the model with the broader system design (e.g., the model functionalities should be well-specified in a requirements engineering process [<xref ref-type="bibr" rid="ref6">5</xref>], similar to how we design user interfaces).</p>
      <p>The importance of these efforts, commonly referred to as ML engineering [<xref ref-type="bibr" rid="ref7">6</xref>], has been well-recognized, but the actual implementation tends to be scattered. For example, academic research on ML engineering tends to focus on the narrow space of model testing and debugging for data scientists [e.g., 7, 8], whereas industrial efforts are mostly limited to supporting pipeline automation and model deployment (“MLOps”) [<xref ref-type="bibr" rid="ref10">9</xref>]. More importantly, because these efforts are isolated, it is unclear how insights from one stage can be transferred to benefit the entire ML engineering process (e.g., how error analysis results help update model design decisions). In other words, there is still a lack of synergy among existing efforts for better ML engineering practices.</p>
      <p>In this work, we envision a unified framework for ML engineering. In particular, we center our framework around capabilities [<xref ref-type="bibr" rid="ref5">4</xref>]. A capability is a form of fine-grained specification for ML model behavior. It helps define concrete model behaviors in various scenarios which are finer-grained and more holistic than standard evaluation metrics. In our pedestrian detector example, different capabilities can be used to express safety requirements from different aspects, e.g., recognizing pedestrians in wheelchairs, being robust to extreme weather, or being fair to people from different age groups [<xref ref-type="bibr" rid="ref3">2</xref>].</p>
      <p>Similar to other ML engineering efforts, the term capability emerged specifically from (and is mostly used in) model testing and debugging [<xref ref-type="bibr" rid="ref5 ref9">4, 8</xref>]. However, its natural link with expected model behaviors makes it ideal for ML model specification which, akin to software specification, (1) builds the root for the entire ML engineering cycle, going from model design all the way to deployment and maintenance, and (2) serves as the boundary object [<xref ref-type="bibr" rid="ref11">10</xref>] for different stakeholders to negotiate their (sometimes conflicting) expectations of models. Moreover, capabilities have the potential to reflect multiple essential factors in ML engineering, e.g., distribution shift [<xref ref-type="bibr" rid="ref12">11</xref>], robustness [12], fairness [13] (see Tab. 1). However, capabilities have yet to fulfill their potential due to several challenges, e.g., it is not clear how to (1) best identify capabilities, (2) instantiate abstract capabilities, and (3) operationalize capabilities to maximize their utility.</p>
      <p>We take the first step towards presenting the vision of a capability-based framework that both unites existing efforts and sheds light on future opportunities. Specifically, we illustrate the broad applicability of the framework from both the technical perspective and the practical perspective, by (1) summarizing how existing ML engineering concepts can be expressed with capabilities, and (2) describing four usage scenarios with unique characteristics (model debugging, collaboration, external quality assurance, and model maintenance). We also conduct an exploratory study to demonstrate the feasibility of our vision. We conclude the paper by discussing challenges and opportunities for capabilities’ integration into ML engineering that emerge from our preliminary results.</p>
      <p>The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023). * Corresponding author: chenyangy@cmu.edu (C. Yang). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <sec id="sec-1-1">
        <title>2. Capabilities</title>
        <p>Capability definition: ML “specification.” A capability can roughly be defined as a fine-grained specification of behaviors expected of an ML model. The key idea is to go beyond just considering the overall accuracy of a model and instead analyze to what degree the model exhibits specific kinds of expected behaviors. The term capability was popularized by work on testing specific behaviors of ML models [<xref ref-type="bibr" rid="ref5">4</xref>], but similar concepts can be found in other work on model testing (e.g., stress tests [14]) and in various work exploring nuances of model misbehavior and shortcut learning (e.g., underspecifications [15]). Previous work [e.g., 4, 8] has shown that capabilities can expose many systematic problems in state-of-the-art models, are useful for interactive testing and debugging, and can guide data augmentation to train better models.</p>
        <p>Capabilities share similarities with traditional software specifications (and functional requirements) in that both prescribe how software should behave in specific scenarios. These prescriptions are general concepts or descriptions but can be concretized into a list of input-output examples (i.e., test cases) for assessing models in the engineering process. We refer to the process of deriving test data from capabilities as instantiation. Capabilities can be instantiated in many different ways, including slicing existing data [<xref ref-type="bibr" rid="ref8">7</xref>], transformation of existing data [16], generating data from templates [<xref ref-type="bibr" rid="ref5">4</xref>], and targeted curation of new data (possibly with crowdsourcing) [17] (see examples in Tab. 1). Different instantiation strategies have different costs and benefits, and it is often necessary to make trade-offs between them.</p>
        <p>However, capabilities also differ from traditional specifications in fundamental ways: Traditional software is built using a deductive reasoning process. Its specifications are usually hard rules the software must satisfy; a single input-output pair that violates the specification will be considered a bug. In contrast, machine learning uses inductive reasoning, where models are derived from observations and are expected to make occasional mistakes [18]. As such, instead of declaring a model as buggy for a single mistake related to a capability, we measure to what degree the model has certain capabilities with a failure rate. In this sense, a capability can be viewed as a soft lower bound specification, and we use failure rate to look for issues where a model systematically underperforms with regard to a capability.</p>
        <p>Capabilities as a unifying framework. There are many existing efforts to support ML engineering, but they are often scattered and unconnected. Evaluating models on specific qualities like robustness, fairness, and generalizability is extensively discussed [e.g., 19, 13, 20], but these evaluations often focus exclusively on a narrow set of capabilities (e.g., robust to word replacement [21], data shift [<xref ref-type="bibr" rid="ref12">11</xref>], and spurious correlations [22]). Different strategies for model evaluation and data augmentation, from slicing [<xref ref-type="bibr" rid="ref8">7</xref>], counterfactuals [<xref ref-type="bibr" rid="ref13 ref14">17, 23, 24</xref>], templates [<xref ref-type="bibr" rid="ref5">4</xref>], to perturbations [16] are widely explored, but there are very few efforts on combining them or evaluating their relative costs and effectiveness, and often such efforts are limited to individual qualities (e.g., robustness [12]). Recent work has shown interest in model debugging [<xref ref-type="bibr" rid="ref15 ref9">8, 25</xref>] and error analysis [<xref ref-type="bibr" rid="ref8">7</xref>], but these works often use different terminologies despite the similar underlying ideas.</p>
        <p>We argue that a capability is a generic abstraction that can unify existing efforts. For example, different model evaluation strategies can be seen as ways to instantiate capabilities; different model qualities can be viewed as (a series of) capabilities that might matter in specific scenarios; a model’s reliance on spurious correlations can be interpreted as a lack of specific capabilities (e.g., ignoring backgrounds for object detection [<xref ref-type="bibr" rid="ref16">26</xref>]). Furthermore, as we will argue, capabilities can go beyond existing literature to benefit engineering stages (e.g., requirements engineering) and stakeholders (e.g., external evaluators or software engineers) that are currently under-explored.</p>
        <sec id="sec-1-1-1">
          <title>3. Capabilities for Better ML Engineering</title>
          <p>ML engineering effort happens at different development stages, with different stakeholders in the loop, and targets different model qualities. We argue that capabilities can help unify ML engineering efforts and lead to more systematic practice because they can play important roles in all these diverse dimensions.</p>
          <p>Below, we describe four concrete ML engineering scenarios (summarized in Tab. 2), which cover different dimensions and highlight challenges and opportunities.</p>
          <table-wrap>
            <label>Table 2</label>
            <caption>
              <p>Example usage scenarios for capabilities. These scenarios cover different ML engineering stages and stakeholders, showing capabilities are beneficial across dimensions.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Scenario</th><th>Stages</th><th>Stakeholders</th></tr>
              </thead>
              <tbody>
                <tr><td>Model Debugging</td><td>Development</td><td>Data Scientists</td></tr>
                <tr><td>Collaboration</td><td>Requirements, Evaluation</td><td>Software Engineers, Data Scientists</td></tr>
                <tr><td>External QA</td><td>Evaluation</td><td>External Evaluators, Regulators</td></tr>
                <tr><td>Model Maintenance</td><td>Deployment</td><td>Data Scientists, End Users</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>3.1. Illustrative Scenarios</p>
          <p>Scenario 1: Model Debugging. Alice is a data scientist responsible for a chatbot used in her company. She is now debugging the conversational model that performs poorly on some inputs. She tries to understand what is going wrong with these model mistakes. For each mistake, she speculates about the potential issue behind it (e.g., the input sentence contains numerical reasoning that the current model does not handle well) and updates the model accordingly. However, she finds the entire process ad hoc, and it does not always produce a better model.</p>
          <p>Capabilities can systematize this process and help Alice generalize from individual mistakes to systematic problems. Instead of chasing mistakes, Alice now identifies common capabilities from model mistakes. Then she assesses the importance of different capabilities, instantiates the prioritized ones, and uses the instantiated tests for both training and evaluation. Alice now evaluates the new model not only on some general test data, but also on the test suites of different capabilities. She finds that the new model handles numerical reasoning better but is slightly worse on a different test suite that requires complex co-reference resolution. She decides that this is acceptable and releases the model.</p>
          <p>Scenario 2: Collaboration. Bob is a software engineer working in a government department, dealing with classified information. The department has a contract with an external data science team on a vision model for satellite images, which is expected to be robust to various attacks and stable across various environments. Due to strict data security policies, the external data science team relies on public datasets instead of actual production data. Bob struggles to communicate requirements and report useful feedback when the delivered model does not work in production.</p>
          <p>Capabilities can serve as a communication interface between different stakeholders. Bob would be able to clearly describe the failures in ways the data science team can understand, if he abstracts concrete private data and identifies sharable capabilities from them. Or even better, he can instantiate capabilities with public data points, such that the data science team can develop the next version of the model with a clear goal of improvement in mind in terms of capability failure rates.</p>
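          <p>The idea of measuring a capability by its failure rate, rather than treating any single mistake as a bug, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the names <monospace>CapabilityTest</monospace>, <monospace>failure_rate</monospace>, and <monospace>meets_capability</monospace>, the toy keyword classifier, and the thresholds are all hypothetical and chosen only for the example.</p>
          <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CapabilityTest:
    """One instantiated test case: an input and the label the model should give."""
    text: str
    expected: str


def failure_rate(model: Callable[[str], str], tests: Sequence[CapabilityTest]) -> float:
    """Fraction of test cases the model gets wrong (0.0 means no failures)."""
    if not tests:
        raise ValueError("a capability needs at least one instantiated test case")
    failures = sum(1 for t in tests if model(t.text) != t.expected)
    return failures / len(tests)


def meets_capability(
    model: Callable[[str], str],
    tests: Sequence[CapabilityTest],
    max_failure_rate: float,
) -> bool:
    """Soft specification: occasional mistakes are tolerated up to a threshold."""
    return failure_rate(model, tests) <= max_failure_rate


def toy_model(text: str) -> str:
    # Hypothetical stand-in for a real sentiment classifier:
    # it only reacts to the literal token "not".
    return "negative" if "not" in text.split() else "positive"


# Hypothetical test suite instantiating a "negation" capability.
negation_suite = [
    CapabilityTest("the food was not good", "negative"),
    CapabilityTest("i would not recommend it", "negative"),
    CapabilityTest("never going back", "negative"),
    CapabilityTest("not bad at all", "positive"),
]

rate = failure_rate(toy_model, negation_suite)  # 2 of 4 cases fail
print(f"negation failure rate: {rate:.2f}")
print("meets spec at 25%:", meets_capability(toy_model, negation_suite, 0.25))
```
]]></preformat>
          <p>Kept as a regression check, such a suite would let a team compare failure rates across model versions instead of debating individual mistakes; the acceptable threshold is a project decision that the sketch deliberately leaves as a parameter.</p>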
          <p>Scenario 3: External Quality Assurance. Carolyn
works for a quality assurance team that previously
focused on testing traditional software components.
Carolyn is now responsible for independently evaluating
models delivered by external contractors — this time a
model for fraud detection. Trained in traditional software
testing, Carolyn finds it challenging to move forward
without concrete specifications at hand, and is unsure
what to do beyond standard accuracy evaluations.</p>
          <p>Capabilities provide a more holistic view of how models perform in different scenarios. Carolyn reuses known capabilities for fraud detection, which her team developed for assessments on previous models, and evaluates the model on instantiated test suites from these capabilities, diving into specific capabilities of the model rather than providing just a single broad accuracy measure. She also looks at production data and past mistakes, and uses them to identify new capabilities. Her final report communicates how the model performs on different capability test suites and highlights the model’s major weaknesses.</p>
          <p>Scenario 4: Model Maintenance. Dan is a data scientist for a social media platform. They are responsible for a model that detects toxicity from user posts. The model performs well on previously curated data, but its performance degrades over time because of evolving trends in user posts. Dan tries to update the model periodically to cope with data shift. However, they find that the model is still frequently suboptimal under unknown future shifts even when trained with more recent data.</p>
          <p>Capabilities can be used to track how data evolves through time and characterize data shift. Dan now maintains a list of high-quality capability test suites as regression tests. They regularly review new data to identify whether the model needs additional capabilities, or whether the reliance on existing capabilities changes over time. This way, Dan gets to track the capability shift trajectory, anticipate (to some extent) what future shift might look like, and instantiate suitable capability tests beforehand. With capabilities, Dan now builds and selects models that are more robust to data shift.</p>
          <p>Discussion. We described four different scenarios of using capabilities for better ML engineering, illustrating their broad applicability. As a recap,</p>
          <list list-type="bullet">
            <list-item><p>Capabilities can be used at different stages of ML engineering. On the one hand, they provide specifications for ML models, which is fundamental to (collaborative) model design, development, and testing. On the other hand, they also provide valuable abstractions for concrete data points, serve as a form of data specification, and allow for characterizing (possibly changing) deployment environments. Notably, this potential for data documentation/specification further enlarges capabilities’ impact on various stages that concern data, e.g., data collection, dataset evaluation, etc.</p></list-item>
            <list-item><p>Different stakeholders can utilize capabilities. Though data scientists, external evaluators, etc. in our scenarios have different priorities in mind, they are able to converge on the capability framing, whether to use capabilities to exploit their hypotheses on model mistakes, to communicate the characteristics of a non-shareable deployment environment, or to utilize prior training practices. Notably, as in the communication case, such convergence enables knowledge sharing or even negotiation between stakeholders, as everyone can speak the same “language.”</p></list-item>
            <list-item><p>Capabilities can relate to different qualities of ML models, ranging from accuracy (e.g., in debugging), robustness (e.g., in collaboration), fairness, to generalizability (e.g., in maintenance). This enables multifaceted evaluation without more consistent metric designs, which is valuable especially when multiple model qualities have to be balanced.</p></list-item>
          </list>
          <p>Despite the promising future, these scenarios share common challenges, from identifying, assessing, communicating, to instantiating capabilities. Yet different scenarios focus on different aspects and might have different requirements for the same challenge. For example, all scenarios require identifying capabilities, but the ways they are identified or expressed vary; a shared language would be required for collaboration, but if different stakeholders describe the same capabilities in different ways, or have different instantiation ideas, then additional inconsistency arises and has to be mitigated. We will discuss these practical barriers in the next section.</p>
          <p>3.2. Exploratory Experiment</p>
          <p>To explore the practicality of our envisioned capability framework, we conducted an experiment to explore whether capabilities are reflective of model generalizability. We focus on generalizability first because it is a primary design goal for any ML model, and a model quality essential for various use scenarios (e.g., the aforementioned model maintenance and collaboration).</p>
          <p>Experiment setup. We define “reflective” as the statistical correlation between model performance on certain capability tests and model performance on out-of-distribution data points.<sup>1</sup> Specifically, in the experiment, we repeatedly fine-tuned BERT with different random seeds on the Amazon-wilds dataset [20], and obtained 100 sentiment analysis models with similar source domain accuracy (Amazon</p>
          <table-wrap>
            <label>Table 3</label>
            <caption>
              <p>Capabilities and their instantiation keywords for sentiment analysis, selected based on existing work [<xref ref-type="bibr" rid="ref17">27</xref>]. We slice the validation data on keywords to instantiate these capabilities, and the % column represents the ratio of validation data that is included in the slice.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Capability</th><th>%</th><th>Keywords</th></tr>
              </thead>
              <tbody>
                <tr><td>negation</td><td>51.6</td><td>not, n’t</td></tr>
                <tr><td>negation (v2)</td><td>18.7</td><td>no, never, neither, nobody, none, nor, nothing</td></tr>
                <tr><td>shifter</td><td>4.5</td><td>refuse, reject, deny, doubt, abandon, miss, question, abort, stop</td></tr>
                <tr><td>modality</td><td>3.6</td><td>would have, could have, should have</td></tr>
                <tr><td>comparative</td><td>16.6</td><td>better, worse, than</td></tr>
                <tr><td>mixed</td><td>36.4</td><td>but, however, though, although, despite, even if, rather than, except that</td></tr>
                <tr><td>reducer</td><td>14.1</td><td>kind of, all that, less, a little, somewhat, still</td></tr>
                <tr><td>amplifier</td><td>48.8</td><td>really, very, super, so, incredibly, extremely, at all, whatsoever, much</td></tr>
              </tbody>
            </table>
          </table-wrap>
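          <p>To make the setup more concrete, the following Python sketch shows one way keyword slicing and the correlation check could be wired together. It is an illustration under assumptions rather than the authors' experiment code: whole-word, case-insensitive matching and the Pearson coefficient are choices made here (the text only says “statistical correlation”), the helper names are hypothetical, only two keyword rows from Table 3 are copied, and every number in the demo is made-up toy data, not a result.</p>
          <preformat><![CDATA[
```python
import math
import re
from typing import Dict, List, Sequence

# Keyword lists copied from two rows of Table 3; other rows could be added the same way.
CAPABILITY_KEYWORDS: Dict[str, List[str]] = {
    "negation (v2)": ["no", "never", "neither", "nobody", "none", "nor", "nothing"],
    "comparative": ["better", "worse", "than"],
}


def slice_by_keywords(texts: Sequence[str], keywords: Sequence[str]) -> List[int]:
    """Indices of texts containing any keyword as a whole word or phrase (assumed matching rule)."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b", re.IGNORECASE
    )
    return [i for i, text in enumerate(texts) if pattern.search(text)]


def slice_accuracy(predictions: Sequence[str], labels: Sequence[str], indices: Sequence[int]) -> float:
    """Accuracy of one model restricted to a capability slice."""
    if not indices:
        raise ValueError("empty slice")
    return sum(1 for i in indices if predictions[i] == labels[i]) / len(indices)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy validation data (made up for illustration).
validation_texts = [
    "Nothing about this blender works",
    "Better than my old one",
    "Great value, arrived early",
    "I have never been happier",
]
labels = ["neg", "pos", "pos", "pos"]
predictions = ["neg", "pos", "pos", "neg"]  # output of one hypothetical model

negation_slice = slice_by_keywords(validation_texts, CAPABILITY_KEYWORDS["negation (v2)"])
negation_acc = slice_accuracy(predictions, labels, negation_slice)

# One (capability accuracy, out-of-distribution accuracy) pair per fine-tuned model; toy values.
capability_acc_per_model = [0.70, 0.75, 0.80, 0.85]
ood_acc_per_model = [0.60, 0.64, 0.68, 0.72]
r = pearson(capability_acc_per_model, ood_acc_per_model)
print(negation_slice, negation_acc, round(r, 3))
```
]]></preformat>
          <p>In a real run there would be one accuracy pair per fine-tuned model, and a rank correlation could be swapped in for Pearson if a monotone rather than linear relationship is of interest.</p>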
        </sec>
        <sec id="sec-1-1-2">
          <title>Footnote 1</title>
          <p>Experiment details can be found in an online appendix (https://github.com/malusamayo/Capabilities-Experiment-Details) and are not essential for the main vision outlined in this paper.</p>
          <p>
            the source domain, using a proxy A-distance [
            <xref ref-type="bibr" rid="ref18">28</xref>
            ]. As in
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Challenges and Opportunities</title>
      <p>To more systematically use capabilities, further research is needed. We argue that ML engineering can generally benefit from software engineering disciplines, with principles from requirements engineering and software testing in particular. In the following, we identify promising research directions based on gaps in the literature and our own observations in our experiment.</p>
      <sec id="sec-2-1">
        <title>Identifying capabilities</title>
        <p>It is challenging to identify capabilities for concrete scenarios. Capabilities often differ across different modes (vision vs. language), different tasks (sentiment analysis vs. natural language inference), and different domains (product reviews vs. book reviews). While we may develop a catalog of common capabilities for general-purpose tasks, such as sentiment analysis [<xref ref-type="bibr" rid="ref17">27</xref>], we will likely need to identify specific capabilities for each domain-specific problem. Existing strategies include using domain knowledge [16], performing error analysis [<xref ref-type="bibr" rid="ref15 ref8">14, 7, 25</xref>], and mining knowledge from existing corpora [<xref ref-type="bibr" rid="ref19">29</xref>]. Most strategies require extensive efforts of domain experts or crowdsource workers, making them hard to scale. They are also often conducted in an unsystematic way and do not draw on classic requirement elicitation and participatory design approaches. Future work could explore:</p>
        <list list-type="simple">
          <list-item><p>RQ1 How could we support more effective discovery and reuse of domain knowledge? When and how can we automate discovery?</p></list-item>
          <list-item><p>RQ2 What kinds of mechanisms could support more efficient human-AI interaction in error analysis?</p></list-item>
          <list-item><p>RQ3 How could we design a better process to help both experts and non-experts identify capabilities?</p></list-item>
        </list>
      </sec>
      <sec>
        <title>Assessing capabilities</title>
        <p>Capabilities often exhibit a hierarchical structure. For example, understanding negation is a very general capability, whereas understanding double negation or handling modifiers such as “hardly” and “never” are more specific (sub-)capabilities. How fine-grained a capability should be will likely depend on the specific scenarios. Coarser capabilities are more reusable, whereas finer-grained ones capture concrete concepts that might be especially useful for the domain (but may not transfer; e.g., a concrete adjective like “cold” is positive when describing refrigerators but not so much for thermoses). Their predictiveness also differs across scenarios, as we observed in our experiments. When identifying capabilities, we need to determine the proper granularity, and evaluate their importance within the context:</p>
        <list list-type="simple">
          <list-item><p>RQ4 What is a good granularity for a capability?</p></list-item>
          <list-item><p>RQ5 How do we evaluate/rank capabilities by context?</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-1-1">
        <title>Communicating capabilities</title>
        <p>Identified capabilities need to be efficiently communicated between different stakeholders, who might have different requirements and potential conflicts, or may describe the same capabilities in drastically different ways depending on their expertise (e.g., an expert may say “invariant to environmental conditions” when a lay user says “performs the same in sunny, raining, stormy weathers”). Common communication vocabularies and conflict resolution mechanisms, possibly informed by existing requirements engineering literature, would greatly facilitate the process.</p>
        <list list-type="simple">
          <list-item><p>RQ6 How can we develop a shared language or interface to facilitate capability communication?</p></list-item>
          <list-item><p>RQ7 How can capabilities support conflict resolution between different stakeholders?</p></list-item>
        </list>
      </sec>
      <sec>
        <title>Instantiating capabilities</title>
        <p>Abstract capabilities need to be instantiated as concrete test cases, to be further used as regression tests, examples for communication, or augmentation data for training. Existing work has explored different strategies for instantiating capabilities (cf. Sec. 2), but it remains unclear how different strategies perform in different scenarios and whether they could be combined in a meaningful way. These strategies are similar to software testing (e.g., unit tests and metamorphic testing [<xref ref-type="bibr" rid="ref20">30</xref>]) and can be informed by existing software engineering literature (e.g., test case generation, fuzzing, prioritization, and requirements validation).</p>
        <list list-type="simple">
          <list-item><p>RQ8 How should we select instantiation strategies in different scenarios? How to measure and trade off costs and benefits?</p></list-item>
          <list-item><p>RQ9 How do different instantiation strategies complement each other?</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>5. Conclusion</title>
        <p>A capability is a generic abstraction that unifies existing efforts on model testing, debugging, and evaluation. It can also benefit the entire ML engineering lifecycle from data collection to model deployment, addressing the needs of different stakeholders and model qualities. Our exploratory experiments showed that capabilities could provide strong signals for model generalizability, and they also highlighted challenges in integrating capabilities into the ML engineering process. We hope future research will better support identifying, assessing, communicating, and instantiating capabilities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>Kästner and Yang’s work is supported in part by NSF awards 1813598, 2131477, and 2206859 and support from the SEI. Lewis’ and Brower-Sinning’s work was funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center (DM22-1187).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Taschdjian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Robustness gym: Unifying the NLP evaluation landscape</article-title>
          , in:
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>42</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Panetta</surname>
          </string-name>
          ,
          <article-title>Gartner identifies the top strategic technology trends for 2021</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerónimo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Sappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <article-title>Survey of pedestrian detection for advanced driver assistance systems</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>32</volume>
          (
          <year>2010</year>
          )
          <fpage>1239</fpage>
          -
          <lpage>1258</lpage>
          .
          [13]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Predictive biases in natural language processing models: A conceptual framework and overview</article-title>
          , in:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>5248</fpage>
          -
          <lpage>5264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kästner</surname>
          </string-name>
          ,
          <article-title>Collaboration challenges in building ml-enabled systems</article-title>
          : [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravichander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rose</surname>
          </string-name>
          , G. NeuCommunication, documentation, engineering, and
          <article-title>big, Stress test evaluation for natural language process</article-title>
          , in: 2022 IEEE/
          <article-title>ACM 44th International inference</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Software Engineering (ICSE)</source>
          ,
          <year>2022</year>
          , Conference on Computational Linguistics, Associapp.
          <fpage>413</fpage>
          -
          <lpage>425</lpage>
          . tion for Computational Linguistics, Santa Fe, New
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          , Be- Mexico, USA,
          <year>2018</year>
          , pp.
          <fpage>2340</fpage>
          -
          <lpage>2353</lpage>
          . yond accuracy: Behavioral testing of NLP models [15]
          <string-name>
            <surname>A. D'Amour</surname>
          </string-name>
          , et al.,
          <article-title>Underspecification presents chalwith CheckList, in: Proceedings of the 58th Annual lenges for credibility in modern machine learning</article-title>
          ,
          <source>Meeting of the Association for Computational Lin- 2020</source>
          . guistics, Association for Computational Linguistics, [16]
          <string-name>
            <surname>K. D. Dhole</surname>
          </string-name>
          , et al.,
          <source>Nl-augmenter: A framework Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4902</fpage>
          -
          <lpage>4912</lpage>
          .
          <article-title>for task-sensitive natural language augmentation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <surname>A. van Lamsweerde</surname>
          </string-name>
          , Requirements Engineering:
          <year>2021</year>
          . From System Goals to UML Models to Software [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lipton</surname>
          </string-name>
          , Learning the diferSpecifications, 1st ed., Wiley Publishing,
          <year>2009</year>
          .
          <article-title>ence that makes a diference with counterfactually-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Burkov</surname>
          </string-name>
          ,
          <article-title>Machine learning engineering</article-title>
          , volume
          <volume>1</volume>
          ,
          <string-name>
            <surname>augmented</surname>
            <given-names>data</given-names>
          </string-name>
          ,
          <source>in: International Conference on True Positive Incorporated</source>
          ,
          <year>2020</year>
          . Learning Representations,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <article-title>Errudite: Scalable, reproducible, and testable error analysis</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>747</fpage>
          -
          <lpage>763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <article-title>Adaptive testing and debugging of NLP models</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>3253</fpage>
          -
          <lpage>3267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mäkinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Skogström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Laaksonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikkonen</surname>
          </string-name>
          ,
          <article-title>Who needs mlops: What data scientists seek to accomplish and how can mlops help?</article-title>
          ,
          <source>2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN)</source>
          (
          <year>2021</year>
          )
          <fpage>109</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Star</surname>
          </string-name>
          ,
          <article-title>The Structure of Ill-Structured Solutions: Boundary Objects and Heterogeneous Distributed Problem Solving</article-title>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>1989</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kaestner</surname>
          </string-name>
          ,
          <article-title>Machine learning is requirements engineering - on the role of bugs, verification, and validation in machine learning</article-title>
          ,
          <source>Blog</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <article-title>On adversarial examples for character-level neural machine translation</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Computational Linguistics</source>
          , Association for Computational Linguistics, Santa Fe, New Mexico, USA,
          <year>2018</year>
          , pp.
          <fpage>653</fpage>
          -
          <lpage>663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          , et al.,
          <article-title>Wilds: A benchmark of in-the-wild distribution shifts</article-title>
          , in: M. Meila, T. Zhang (Eds.),
          <source>Proceedings of the 38th International Conference on Machine Learning</source>
          , volume
          <volume>139</volume>
          of Proceedings of Machine Learning Research, PMLR,
          <year>2021</year>
          , pp.
          <fpage>5637</fpage>
          -
          <lpage>5664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Improving machine translation systems via isotopic replacement</article-title>
          ,
          <source>in: Proceedings of the 44th International Conference on Software Engineering</source>
          , ICSE '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>1181</fpage>
          -
          <lpage>1192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rabanser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Günnemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <article-title>Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift</article-title>
          , Curran Associates Inc., Red Hook, NY, USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>McCoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <article-title>Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>3428</fpage>
          -
          <lpage>3448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          , et al.,
          <article-title>Evaluating models' local decision boundaries via contrast sets</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          , Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>1307</fpage>
          -
          <lpage>1323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <article-title>Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Online,
          <year>2021</year>
          , pp.
          <fpage>6707</fpage>
          -
          <lpage>6723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Cabrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Druck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. I.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perer</surname>
          </string-name>
          ,
          <article-title>Discovering and validating ai errors with crowdsourced failure reports</article-title>
          ,
          <source>Proc. ACM Hum.-Comput. Interact.</source>
          <volume>5</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Beery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <article-title>Recognition in terra incognita</article-title>
          ,
          <source>in: Computer Vision - ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2018</year>
          , pp.
          <fpage>472</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Velldal</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis is not solved! assessing and probing sentiment classification</article-title>
          ,
          <source>in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , Association for Computational Linguistics
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Blitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification</article-title>
          ,
          <source>in: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics</source>
          , Association for Computational Linguistics, Prague, Czech Republic,
          <year>2007</year>
          , pp.
          <fpage>440</fpage>
          -
          <lpage>447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H.</given-names>
            <surname>Barzamini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shahzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alhoori</surname>
          </string-name>
          ,
          <article-title>Improving generalizability of ml-enabled software through domain specification</article-title>
          ,
          <source>in: Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI</source>
          , CAIN '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-C.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Towey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Tse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Metamorphic testing: A review of challenges and opportunities</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>51</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>