
Capabilities for Better ML Engineering

Chenyang Yang

Rachel Brower-Sinning

Grace A. Lewis

Christian Kästner

Tongshuang Wu

Carnegie Mellon Software Engineering Institute; School of Computer Science, Carnegie Mellon University

In spite of machine learning's rapid growth, its engineering support is scattered in many forms, and tends to favor certain engineering stages, stakeholders, and evaluation preferences. We envision a capability-based framework, which uses fine-grained specifications for ML model behaviors to unite existing efforts towards better ML engineering. We use concrete scenarios (model design, debugging, and maintenance) to articulate capabilities' broad applications across different dimensions, and their impact on building safer, more generalizable, and more trustworthy models that reflect human needs. Through preliminary experiments, we show the potential of capabilities for reflecting model generalizability, which can provide guidance for the ML engineering process. We discuss challenges and opportunities for the integration of capabilities into ML engineering.

Keywords: machine learning engineering, capability, specification, testing, evaluation

1. Introduction

Despite the rapid evolution of machine learning models, most effort has been on prototyping models: developing models under idealized settings (e.g., with static datasets, following the i.i.d. assumption, assuming equal importance of all mistakes). These models tend to suffer in the wild where the ideal assumptions do not hold, leading to safety issues, fairness issues, and project failures [1]. For example, a pedestrian detection model trained on images taken on sunny days would not correctly respond to natural weather changes [2] and may have never seen a wheelchair user in training or test data. Oversimplification has real consequences. If we had only tested the aforementioned pedestrian detector on similar, sunny test examples, and used our overly optimistic evaluation to support deployment decisions, then an automated vehicle with the detector would be likely to cause accidents.

To actually integrate models into production, substantial additional engineering effort is required by interdisciplinary teams [3]: Not only do we need to make careful decisions at the model level (e.g., develop evaluation metrics that reflect human expectations on models [4]), but we also need to connect the model with the broader system design (e.g., the model functionalities should be well-specified in a requirements engineering process [5], similar to how we design user interfaces).

The importance of these efforts, commonly referred to as ML engineering [6], has been well-recognized, but the actual implementation tends to be scattered. For example, academic research on ML engineering tends to focus on the narrow space of model testing and debugging for data scientists [e.g., 7, 8], whereas industrial efforts are mostly limited to supporting pipeline automation and model deployment (“MLOps”) [9]. More importantly, because these efforts are isolated, it is unclear how insights from one stage can be transferred to benefit the entire ML engineering process (e.g., how error analysis results help update model design decisions). In other words, there is still a lack of synergy among existing efforts for better ML engineering practices.

In this work, we envision a unified framework for ML engineering. In particular, we center our framework around capabilities [4]. A capability is a form of fine-grained specification for ML model behavior. It helps define concrete model behaviors in various scenarios which are finer-grained and more holistic than standard evaluation metrics. In our pedestrian detector example, different capabilities can be used to express safety requirements from different aspects, e.g., recognizing pedestrians in wheelchairs, being robust to extreme weather, or being fair to people from different age groups [2].

Similar to other ML engineering efforts, the term capability emerged specifically from (and is mostly used in) model testing and debugging [4, 8]. However, its natural link with expected model behaviors makes it ideal for ML model specification which, akin to software specification, (1) builds the root for the entire ML engineering cycle, going from model design all the way to deployment and maintenance, and (2) serves as the boundary object [10] for different stakeholders to negotiate their (sometimes conflicting) expectations of models. Moreover, capabilities have the potential to reflect multiple essential factors in ML engineering, e.g., distribution shift [11], robustness [12], fairness [13] (see Tab. 1). However, capabilities have yet to fulfill their potential due to several challenges, e.g., it is not clear how to (1) best identify capabilities, (2) instantiate abstract capabilities, and (3) operationalize capabilities to maximize their utility.

We take the first step towards presenting the vision of a capability-based framework that both unites existing efforts and sheds light on future opportunities. Specifically, we illustrate the broad applicability of the framework from both the technical perspective and the practical perspective, by (1) summarizing how existing ML engineering concepts can be expressed with capabilities, and (2) describing four usage scenarios with unique characteristics (model debugging, collaboration, external quality assurance, and model maintenance). We also conduct an exploratory study to demonstrate the feasibility of our vision. We conclude the paper by discussing challenges and opportunities for capabilities’ integration into ML engineering that emerge from our preliminary results.

The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023)
* Corresponding author.
chenyangy@cmu.edu (C. Yang)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Capabilities

Capability definition: ML “specification.” A capability can roughly be defined as a fine-grained specification of behaviors expected of an ML model. The key idea is to go beyond just considering the overall accuracy of a model but analyzing to what degree the model exhibits specific kinds of expected behaviors. The term capability was popularized by work on testing specific behaviors of ML models [4], but similar concepts can be found in other work on model testing (e.g., stress tests [14]) and in various work exploring nuances of model misbehavior and shortcut learning (e.g., underspecifications [15]). Previous work [e.g., 4, 8] has shown that capabilities can expose many systematic problems in state-of-the-art models, are useful for interactive testing and debugging, and can guide data augmentation to train better models.

Capabilities share similarities with traditional software specifications (and functional requirements) in that both prescribe how software should behave in specific scenarios. These prescriptions are general concepts or descriptions but can be concretized into a list of input-output examples (i.e., test cases) for assessing models in the engineering process. We refer to the process of deriving test data from capabilities as instantiation. Capabilities can be instantiated in many different ways, including slicing existing data [7], transformation of existing data [16], generating data from templates [4], and targeted curation of new data (possibly with crowdsourcing) [17] (see examples in Tab. 1). Different instantiation strategies have different costs and benefits, and it is often necessary to make trade-offs between them.

However, capabilities also differ from traditional specifications in fundamental ways: Traditional software is built using a deductive reasoning process. Their specifications are usually hard rules the software must satisfy: a single input-output pair that violates the specification will be considered a bug. In contrast, machine learning uses inductive reasoning, where models are derived from observations and are expected to make occasional mistakes [18]. As such, instead of declaring a model as buggy for a single mistake related to a capability, we measure to what degree the model has certain capabilities with a failure rate. In this sense, capability can be viewed as a soft lower bound specification, and we use failure rates to look for issues where a model systematically underperforms with regard to a capability.

Capabilities as a unifying framework. There are many existing efforts to support ML engineering, but they are often scattered and unconnected. Evaluating models on specific qualities like robustness, fairness, and generalizability is extensively discussed [e.g., 19, 13, 20], but they often focus exclusively on a narrow set of capabilities (e.g., robust to word replacement [21], data shift [11], and spurious correlations [22]). Different strategies for model evaluation and data augmentation, from slicing [7], counterfactuals [17, 23, 24], templates [4], to perturbations [16] are widely explored, but there are very little efforts on combining them, evaluating their relative costs and effectiveness, and often such efforts are limited to individual qualities (e.g., robustness [12]). Recent work has shown interest in model debugging [8, 25] and error analysis [7], but they often use different terminologies despite the similar underlying ideas.

We argue that a capability is a generic abstraction that can unify existing efforts. For example, different model evaluation strategies can be seen as ways to instantiate capabilities; different model qualities can be viewed as (a series of) capabilities that might matter in specific scenarios; a model’s reliance on spurious correlations can be interpreted as a lack of specific capabilities (e.g., ignoring backgrounds for object detection [26]). Furthermore, as we will argue, capabilities can go beyond existing literature to benefit engineering stages (e.g., requirements engineering) and stakeholders (e.g., external evaluators or software engineers) that are currently under-explored.

3. Capabilities for Better ML Engineering

ML engineering effort happens at different development stages, with different stakeholders in the loop, and targets different model qualities. We argue that capabilities can help unify ML engineering efforts and lead to more systematic practice because they can play important roles in all these diverse dimensions.

Below, we describe four concrete ML engineering scenarios (summarized in Tab. 2), which cover different dimensions and highlight challenges and opportunities.

Table 2
Example usage scenarios for capabilities. These scenarios cover different ML engineering stages and stakeholders, showing capabilities are beneficial across dimensions.

Scenario            Stages                     Stakeholders
Model Debugging     Development                Data Scientists
Collaboration       Requirements, Evaluation   Software Engineers, Data Scientists
External QA         Evaluation                 External Evaluators, Regulators
Model Maintenance   Deployment                 Data Scientists, End Users

3.1. Illustrative Scenarios

Scenario 1: Model Debugging. Alice is a data scientist responsible for a chatbot used in her company. She is now debugging the conversational model that performs poorly on some inputs. She tries to understand what is going wrong with these model mistakes. For each mistake, she speculates the potential issue behind it (e.g., input sentence contains numerical reasoning that the current model does not handle well) and updates the model accordingly. However, she finds the entire process ad-hoc and does not always produce a better model.

Capabilities can systematize this process and help Alice generalize from individual mistakes to systematic problems. Instead of chasing mistakes, Alice now identifies common capabilities from model mistakes. Then she assesses the importance of different capabilities, instantiates the prioritized ones, and uses the instantiated tests for both training and evaluation. Alice now evaluates the new model not only on some general test data, but also on the test suites of different capabilities. She finds that the new model handles numerical reasoning better but is slightly worse on a different test suite that requires complex co-reference resolution. She decides that this is acceptable and releases the model.

Scenario 2: Collaboration. Bob is a software engineer working in a government department, dealing with classified information. The department has a contract with an external data science team on a vision model for satellite images, which is expected to be robust to various attacks and stable across various environments. Due to strict data security policies, the external data science team relies on public datasets instead of actual production data. Bob struggles to communicate requirements and report useful feedback when the delivered model does not work in production.

Capabilities can serve as a communication interface between different stakeholders. Bob would be able to clearly describe the failures in ways the data science team can understand, if he abstracts concrete private data, and identifies sharable capabilities from them. Or even better, he can instantiate capabilities with public data points, such that the data science team can develop the next version of the model with a clear goal of improvement in mind in terms of capability failure rates.
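To make the notion of a per-capability failure rate concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a model is any callable from input text to a predicted label, and that a capability has already been instantiated as a list of input-output examples. The names `CapabilityTestSuite`, `failure_rate`, and `compare_models`, as well as the toy models, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # maps an input text to a predicted label
Example = Tuple[str, str]      # (input text, expected label)


@dataclass
class CapabilityTestSuite:
    """Instantiated test cases for one capability (hypothetical structure)."""
    name: str
    examples: List[Example]


def failure_rate(model: Model, suite: CapabilityTestSuite) -> float:
    """Fraction of the suite's examples on which the model's prediction is wrong."""
    if not suite.examples:
        raise ValueError(f"capability suite {suite.name!r} has no examples")
    failures = sum(1 for text, expected in suite.examples if model(text) != expected)
    return failures / len(suite.examples)


def compare_models(old: Model, new: Model,
                   suites: List[CapabilityTestSuite]) -> Dict[str, Tuple[float, float]]:
    """Per-capability (old, new) failure rates, e.g. to spot improvements or regressions."""
    return {s.name: (failure_rate(old, s), failure_rate(new, s)) for s in suites}


if __name__ == "__main__":
    # Toy stand-ins for real models; both are assumptions for illustration only.
    def old_model(text: str) -> str:
        return "negative" if "not" in text.split() else "positive"

    def new_model(text: str) -> str:
        negators = {"not", "never", "nothing"}
        return "negative" if negators & set(text.split()) else "positive"

    negation = CapabilityTestSuite("negation", [
        ("this is not good", "negative"),
        ("never buying this again", "negative"),
    ])
    report = compare_models(old_model, new_model, [negation])
    for name, (before, after) in report.items():
        print(f"{name}: failure rate {before:.2f} -> {after:.2f}")
```

Because a capability is treated as a soft specification, the sketch reports a rate per suite instead of failing on the first wrong prediction; a team could then agree on acceptable thresholds per capability.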

Scenario 3: External Quality Assurance. Carolyn works for a quality assurance team that previously focused on testing traditional software components. Carolyn is now responsible for independently evaluating models delivered by external contractors — this time a model for fraud detection. Trained in traditional software testing, Carolyn finds it challenging to move forward without concrete specifications at hand, and is unsure what to do beyond standard accuracy evaluations.

Capabilities provide a more holistic view of how models perform in different scenarios. Carolyn reuses known capabilities for fraud detection, which her team developed for assessments on previous models, and evaluates the model on instantiated test suites from these capabilities, diving into specific capabilities of the model rather than providing just a single broad accuracy measure. She also looks at production data and past mistakes, and uses them to identify new capabilities. Her final report communicates how the model performs on different capability test suites and highlights the model’s major weaknesses.

Scenario 4: Model Maintenance. Dan is a data scientist for a social media platform. They are responsible for a model that detects toxicity from user posts. The model performs well on previously curated data, but its performance degrades over time because of evolving trends in user posts. Dan tries to update the model periodically to cope with data shift. However, they find that the model is still frequently suboptimal to unknown future shifts even when trained with more recent data.

Capabilities can be used to track how data evolves through time and characterize data shift. Dan now maintains a list of high-quality capability test suites as regression tests. They regularly review new data to identify whether the model needs additional capabilities, or whether the reliance on existing capabilities changes over time. This way, Dan gets to track the capability shift trajectory, anticipate (to some extent) what future shift might look like, and can instantiate suitable capabilities tests beforehand. With capabilities, Dan now builds and selects models that are more robust to data shift.

Discussion. We described four different scenarios of using capabilities for better ML engineering, illustrating their broad applicability. As a recap,

• Capabilities can be used at different stages of ML engineering. On the one hand, they provide specifications for ML models, which is fundamental to (collaborative) model design, development, and testing. On the other hand, they also provide valuable abstractions for concrete data points, serve as a form of data specification, and allow for characterizing (possibly changing) deployment environments. Notably, this potential for data documentation/specification further enlarges capabilities’ impact on various stages that concern data, e.g., data collection, dataset evaluation, etc.

• Different stakeholders can utilize capabilities. Though data scientists, external evaluators, etc. in our scenarios have different priorities in mind, they are able to converge on the capability framing, whether to use capabilities to exploit their hypotheses on model mistakes, to communicate the characteristics of a non-shareable deployment environment, or to utilize prior training practices. Notably, as in the communication case, such convergence enables knowledge sharing or even negotiation between stakeholders, as everyone can speak the same “language.”

• Capabilities can relate to different qualities of ML models, ranging from accuracy (e.g., in debugging), robustness (e.g., in collaboration), fairness, to generalizability (e.g., in maintenance). This enables multifaceted evaluation without more consistent metric designs, which is valuable especially when multiple model qualities have to be balanced.

Despite the promising future, these scenarios share common challenges, from identifying, assessing, communicating, to instantiating capabilities. Yet different scenarios focus on different aspects and might have different requirements for the same challenge. For example, all scenarios require identifying capabilities, but the ways they are identified or expressed vary; a shared language would be required for collaboration, but if different stakeholders describe the same capabilities in different ways, or have different instantiation ideas, then additional inconsistency arises and has to be mitigated. We will discuss these practical barriers in the next section.

3.2. Exploratory Experiment

To explore the practicality of our envisioned capability framework, we conducted an experiment to explore whether capabilities are reflective of model generalizability. We focus on generalizability first because it is a primary design goal for any ML model, and a model quality essential for various use scenarios (e.g., the aforementioned model maintenance and collaboration).

Experiment setup. We define “reflective” as the statistical correlation between model performance on certain capability tests, and their performance on out-of-distribution data points.1

Specifically, in the experiment, we repeatedly fine-tuned BERT with different random seeds on the Amazon-wilds dataset [20], and obtained 100 sentiment analysis models with similar source domain accuracy (Amazon

the source domain, using a proxy A-distance [28]. As in

1 Experiment details can be found in an online appendix (https://github.com/malusamayo/Capabilities-Experiment-Details) and are not essential for the main vision outlined in this paper.

Table 3
Capabilities and their instantiation keywords for sentiment analysis, selected based on existing work [27]. We slice the validation data on keywords to instantiate these capabilities, and the % column represents the ratio of validation data that is included in the slice.

Capability      %      Keywords
negation        51.6   not, n’t
negation (v2)   18.7   no, never, neither, nobody, none, nor, nothing
shifter          4.5   refuse, reject, deny, doubt, abandon, miss, question, abort, stop
modality         3.6   would have, could have, should have
comparative     16.6   better, worse, than
mixed           36.4   but, however, though, although, despite, even if, rather than, except that
reducer         14.1   kind of, all that, less, a little, somewhat, still
amplifier       48.8   really, very, super, so, incredibly, extremely, at all, whatsoever, much
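The keyword-based slicing behind Table 3 and the correlation used to define “reflective” can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code from the online appendix: the regex patterns cover only a small subset of the listed keywords, plain Pearson correlation stands in for whatever statistic the authors used, and the names `CAPABILITY_PATTERNS`, `slice_indices`, `slice_accuracy`, and `pearson` are hypothetical.

```python
import math
import re
from typing import Dict, List, Sequence

# Hypothetical regex patterns per capability, covering a subset of Table 3's keywords.
CAPABILITY_PATTERNS: Dict[str, List[str]] = {
    "negation": [r"\bnot\b", r"n't\b"],
    "comparative": [r"\bbetter\b", r"\bworse\b", r"\bthan\b"],
    "modality": [r"\bwould have\b", r"\bcould have\b", r"\bshould have\b"],
}


def slice_indices(texts: Sequence[str], patterns: Sequence[str]) -> List[int]:
    """Indices of the texts that match at least one keyword pattern."""
    regex = re.compile("|".join(patterns), flags=re.IGNORECASE)
    return [i for i, text in enumerate(texts) if regex.search(text)]


def slice_accuracy(predictions: Sequence[str], labels: Sequence[str],
                   indices: Sequence[int]) -> float:
    """Accuracy restricted to one capability slice."""
    if not indices:
        raise ValueError("empty slice: no example matches this capability")
    correct = sum(1 for i in indices if predictions[i] == labels[i])
    return correct / len(indices)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between per-model slice accuracy and OOD accuracy."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    texts = ["I did not like it", "Better than expected", "great phone", "it doesn't work"]
    for name, patterns in CAPABILITY_PATTERNS.items():
        idx = slice_indices(texts, patterns)
        print(f"{name}: {100 * len(idx) / len(texts):.1f}% of examples")
```

In a setup like the one described, one would compute `slice_accuracy` for each fine-tuned model on each capability slice and then pass the per-model slice accuracies together with the per-model out-of-distribution accuracies to `pearson`.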

4. Challenges and Opportunities

To more systematically use capabilities, further research is needed. We argue that ML engineering can generally benefit from software engineering disciplines, with principles from requirements engineering and software testing in particular. In the following, we identify promising research directions based on gaps in the literature and our own observations in our experiment.

Identifying capabilities. It is challenging to identify capabilities for concrete scenarios. Capabilities often differ across different modes (vision vs. language), different tasks (sentiment analysis vs. natural language inference), and different domains (product reviews vs. book reviews). While we may develop a catalog of common capabilities for general-purpose tasks, such as sentiment analysis [27], we will likely need to identify specific capabilities for each domain-specific problem. Existing strategies include using domain knowledge [16], performing error analysis [14, 7, 25], and mining knowledge from existing corpora [29]. Most strategies require extensive efforts of domain experts or crowdsource workers, making them hard to scale. They are also often conducted in an unsystematic way and do not draw on classic requirement elicitation and participatory design approaches. Future work could explore:

RQ1 How could we support more effective discovery and reuse of domain knowledge? When and how can we automate discovery?
RQ2 What kinds of mechanisms could support more efficient human-AI interaction in error analysis?
RQ3 How could we design a better process to help both experts and non-experts identify capabilities?

Assessing capabilities. Capabilities often exhibit a hierarchical structure. For example, understanding negation is a very general capability, whereas understanding double negation or handling modifiers such as “hardly” and “never” are more specific (sub-)capabilities. How fine-grained a capability should be will likely depend on the specific scenarios. More coarse capabilities are more reusable, whereas finer-grained ones capture concrete concepts that might be especially useful for the domain (but may not transfer; e.g., a concrete adjective like “cold” is positive when describing refrigerators but not so much for thermos). Their predictiveness also differs across scenarios, as we observed in our experiments. When identifying capabilities, we need to determine the proper granularity, and evaluate their importance within the context:

RQ4 What is a good granularity for a capability?
RQ5 How do we evaluate/rank capabilities by context?

Communicating capabilities. Identified capabilities need to be efficiently communicated between different stakeholders, who might have different requirements and potential conflicts, or may describe the same capabilities in drastically different ways depending on their expertise (e.g., an expert may say “invariant to environmental conditions” when a lay user says “performs the same in sunny, raining, stormy weathers.”) Common communication vocabularies and conflict resolution mechanisms, possibly informed by existing requirements engineering literature, would greatly facilitate the process.

RQ6 How can we develop a shared language or interface to facilitate capability communication?
RQ7 How can capabilities support conflict resolution between different stakeholders?

Instantiating capabilities. Abstract capabilities need to be instantiated as concrete test cases, to be further used as regression tests, examples for communication, or augmentation data for training. Existing work has explored different strategies for instantiating capabilities (cf. Sec. 2), but it remains unclear how different strategies perform in different scenarios and whether they could be combined in a meaningful way. These strategies are similar to software testing (e.g., unit tests and metamorphic testing [30]) and can be informed by existing software engineering literature (e.g., test case generation, fuzzing, prioritization, and requirements validation).

RQ8 How should we select instantiation strategies in different scenarios? How to measure and trade off costs and benefits?
RQ9 How do different instantiation strategies complement each other?

5. Conclusion

A capability is a generic abstraction that unifies existing efforts on model testing, debugging, and evaluation. It can also benefit the entire ML engineering lifecycle from data collection to model deployment, addressing the needs of different stakeholders and model qualities. Our exploratory experiments showed that capabilities could provide strong signals for model generalizability, as well as highlighted challenges in integrating them into the ML engineering process. We hope future research will better support identifying, assessing, communicating, and instantiating capabilities.

Acknowledgments

Kästner and Yang’s work is supported in part by NSF awards 1813598, 2131477, and 2206859 and support from the SEI. Lewis’ and Brower-Sinning’s work was funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center (DM22-1187).

References

[1] Panetta, Gartner identifies the top strategic technology trends for 2021, 2020.

[2] Gerónimo, A. M. López, A. D. Sappa, T. Graf, Survey of pedestrian detection for advanced driver assistance systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 1239-1258.

[3] Nahar, Zhou, Lewis, Kästner, Collaboration challenges in building ml-enabled systems: Communication, documentation, engineering, and process, in: 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), 2022, pp. 413-425.

[4] M. T. Ribeiro, Wu, Guestrin, Singh, Beyond accuracy: Behavioral testing of NLP models with CheckList, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4902-4912.

[5] A. van Lamsweerde, Requirements Engineering: From System Goals to UML Models to Software Specifications, 1st ed., Wiley Publishing, 2009.

[6] Burkov, Machine learning engineering, volume 1, True Positive Incorporated, 2020.

[7] Wu, M. T. Ribeiro, Heer, Weld, Errudite: Scalable, reproducible, and testable error analysis, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 747-763.

[8] M. T. Ribeiro, Lundberg, Adaptive testing and debugging of NLP models, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3253-3267.

[9] Mäkinen, Skogström, E. Laaksonen, T. Mikkonen, Who needs mlops: What data scientists seek to accomplish and how can mlops help?, 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN) (2021) 109-112.

[10] S. L. Star, The Structure of Ill-Structured Solutions: Boundary Objects and Heterogeneous Distributed Problem Solving, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989, pp. 37-54.

[11] Rabanser, Günnemann, Z. C. Lipton, Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, Curran Associates Inc., Red Hook, NY, USA, 2019.

[12] Goel, N. F. Rajani, Vig, Taschdjian, M. Bansal, C. Ré, Robustness gym: Unifying the NLP evaluation landscape, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 42-55.

[13] D. S. Shah, H. A. Schwartz, Hovy, Predictive biases in natural language processing models: A conceptual framework and overview, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5248-5264.

[14] Naik, Ravichander, Sadeh, Rose, G. Neubig, Stress test evaluation for natural language inference, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 2340-2353.

[15] A. D'Amour, et al., Underspecification presents challenges for credibility in modern machine learning, 2020.

[16] K. D. Dhole, et al., Nl-augmenter: A framework for task-sensitive natural language augmentation, 2021.

[17] Kaushik, Hovy, Lipton, Learning the difference that makes a difference with counterfactually-augmented data, in: International Conference on Learning Representations, 2020.

[18] Kaestner, Machine learning is requirements engineering - on the role of bugs, verification, and validation in machine learning, Blog, 2020.

[19] Ebrahimi, Lowd, Dou, On adversarial examples for character-level neural machine translation, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 653-663.

[20] P. W. Koh, et al., Wilds: A benchmark of in-the-wild distribution shifts, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 5637-5664.

[21] Sun, J. M. Zhang, Xiong, Harman, M. Papadakis, L. Zhang, Improving machine translation systems via isotopic replacement, in: Proceedings of the 44th International Conference on Software Engineering, ICSE '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1181-1192.

[22] McCoy, Pavlick, T. Linzen, Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3428-3448.

[23] Gardner, et al., Evaluating models' local decision boundaries via contrast sets, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1307-1323.

[24] Wu, M. T. Ribeiro, Heer, Weld, Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 6707-6723.

[25] A. A. Cabrera, A. J. Druck, J. I. Hong, Perer, Discovering and validating ai errors with crowdsourced failure reports, Proc. ACM Hum.-Comput. Interact. 5 (2021).

[26] Beery, G. Van Horn, Perona, Recognition in terra incognita, in: Computer Vision - ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI, Springer-Verlag, Berlin, Heidelberg, 2018, pp. 472-489.

[27] Barnes, Øvrelid, E. Velldal, Sentiment analysis is not solved! assessing and probing sentiment classification, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Florence, Italy, 2019, pp. 12-23.

[28] Blitzer, Dredze, Pereira, Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, in: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 440-447.

[29] Barzamini, Rahimi, Shahzad, Alhoori, Improving generalizability of ml-enabled software through domain specification, in: Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, CAIN '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 181-192.

[30] T. Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, Towey, T. H. Tse, Z. Q. Zhou, Metamorphic testing: A review of challenges and opportunities, ACM Comput. Surv. 51 (2018).